ICRA 2026

AIM-SLAM: Dense Monocular SLAM Via Adaptive and Informative Multi-View Keyframe Prioritization with Foundation Model

Jinwoo Jeon, Dong-Uk Seo, Eungchang Mason Lee, Hyun Myung

AI summary

AIM-SLAM improves dense monocular SLAM accuracy and consistency by adaptively selecting informative, overlapping keyframes for foundation model inference instead of using fixed or consecutive windows.

Monocular SLAM, Foundation Models, Keyframe Prioritization, Dense Reconstruction, Sim(3) Optimization, VGGT

Problem

Existing foundation model-based SLAM systems rely on fixed-length or temporally consecutive keyframe windows, which often include redundant frames with limited geometric information gain, leading to structural inconsistencies and scale drift.

Approach

The framework uses a SIGMA module to adaptively prioritize keyframes based on 3D voxel overlap and information gain, then jointly optimizes their poses in Sim(3) space for consistent dense reconstruction.
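To make the idea of overlap- and gain-based prioritization concrete, here is a minimal Python sketch. It is not the paper's SIGMA implementation: the function names, the thresholds (`min_overlap`, `min_gain`, `max_views`), and the definition of information gain as "number of voxels not yet covered" are all assumptions chosen for illustration. It only shows how a candidate set can be retrieved by voxel overlap and how its size can vary with the marginal gain of each added view.

```python
import numpy as np

def voxel_keys(points, voxel_size):
    """Quantize an (N, 3) point array into a set of integer voxel indices."""
    idx = np.floor(np.asarray(points, dtype=float) / voxel_size).astype(np.int64)
    return set(map(tuple, idx))

def overlap_ratio(query_voxels, candidate_voxels):
    """Fraction of the query's voxels that the candidate also occupies."""
    if not query_voxels:
        return 0.0
    return len(query_voxels & candidate_voxels) / len(query_voxels)

def prioritize_keyframes(query_points, keyframe_points, voxel_size=0.1,
                         min_overlap=0.2, min_gain=5, max_views=8):
    """Greedy selection (hypothetical scheme, not the paper's algorithm).

    1. Keep keyframes whose voxel overlap with the query is at least min_overlap.
    2. Repeatedly add the keyframe contributing the most unseen voxels.
    3. Stop when the best remaining gain drops below min_gain, so the
       number of selected views adapts to the scene.
    """
    query = voxel_keys(query_points, voxel_size)
    candidates = {}
    for kf_id, pts in keyframe_points.items():
        vox = voxel_keys(pts, voxel_size)
        if overlap_ratio(query, vox) >= min_overlap:
            candidates[kf_id] = vox

    covered = set(query)
    selected = []
    while candidates and len(selected) < max_views:
        best_id = max(candidates, key=lambda k: len(candidates[k] - covered))
        gain = len(candidates[best_id] - covered)  # assumed gain measure
        if gain < min_gain:
            break
        selected.append(best_id)
        covered |= candidates.pop(best_id)
    return selected
```

In this sketch a keyframe that overlaps the query but adds no unseen voxels is skipped, which mirrors the motivation stated above: redundant frames contribute little geometric information.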

Key results

Introduces SIGMA module for adaptive multi-view keyframe prioritization
Formulates joint multi-view Sim(3) optimization for uncalibrated inputs
Achieves state-of-the-art pose estimation accuracy on real-world datasets
Enables accurate, globally consistent dense 3D reconstruction with ROS integration
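For readers unfamiliar with Sim(3), the following sketch shows what a similarity alignment estimates: a scale, a rotation, and a translation relating two point sets. It uses the well-known closed-form Umeyama solution for a single pair of corresponding point sets; this is a swapped-in, simpler technique and not the joint multi-view optimization the paper formulates. The function name `umeyama_sim3` is hypothetical.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form similarity alignment: find s, R, t with dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d

    # Cross-covariance between the centered destination and source points.
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)   # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The scale term is what distinguishes Sim(3) from a rigid SE(3) alignment; it matters for monocular input, where absolute scale is not observable.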

Why it matters

It provides a scalable, calibration-free SLAM framework that maximizes the utility of geometric foundation models for robotics and autonomous navigation applications.

Abstract

Geometric foundation models have recently emerged as a promising approach to dense reconstruction in monocular visual simultaneous localization and mapping (SLAM). Although these models allow SLAM to leverage a variable number of input views, previous methods remain confined to two-view pairs or fixed-length inputs and give insufficient consideration to geometric context when selecting views. To tackle this problem, we propose AIM-SLAM, a dense monocular SLAM framework that exploits adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from the visual geometry grounded transformer (VGGT). Specifically, we introduce the selective information- and geometric-aware multi-view adaptation (SIGMA) module, which employs voxel overlap and information gain to retrieve a candidate set of keyframes and adaptively determine its size. Furthermore, we formulate a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy. The effectiveness of AIM-SLAM is demonstrated on real-world datasets, where it achieves state-of-the-art pose estimation performance and accurate dense reconstruction results. Our system supports ROS integration, and the code is available at https://aimslam.github.io/.

Index terms

SLAM, Mapping, Deep Learning for Visual Perception