ICRA 2026

HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction

Wei Zhang, Qing Cheng, David Skuddis, Niclas Zeller, Daniel Cremers, Norbert Haala

AI summary

HI-SLAM2 simultaneously achieves high geometric accuracy and high-fidelity rendering in monocular SLAM by combining learned depth/normal priors with 3D Gaussian Splatting and a novel grid-based scale alignment strategy.

Monocular SLAM · 3D Gaussian Splatting · Dense Reconstruction · Depth Estimation · Visual Perception · Real-time Mapping

Problem

Monocular 3D reconstruction suffers from scale ambiguity and noisy depth estimates, while existing SLAM methods typically force a tradeoff between rendering quality and geometric accuracy or require expensive depth sensors.

Approach

The system uses a hybrid architecture that corrects monocular depth scale distortions via a grid-based alignment strategy and leverages 3D Gaussian Splatting as an explicit, incrementally growing map for fast online tracking and joint pose-geometry optimization.
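To make the grid-based alignment idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' formulation: the function name `grid_scale_align`, the per-cell median ratio, and the global fallback are all hypothetical choices, and the paper's actual alignment may be optimized jointly with other quantities or interpolated between cells. The sketch assumes a monocular prior depth map plus a sparser reference depth (for example from SLAM tracking) with `None` where no estimate exists.

```python
from statistics import median


def grid_scale_align(prior, reference, cell=2):
    """Rescale a prior depth map cell by cell to match a reference depth.

    prior:     2D list of positive prior depths (arbitrary scale).
    reference: 2D list of the same shape; None marks missing estimates.
    cell:      side length of each square grid cell, in pixels (assumed).
    """
    h, w = len(prior), len(prior[0])

    def ratios(pixels):
        # Ratio reference/prior for every pixel that has a reference depth.
        return [reference[y][x] / prior[y][x]
                for y, x in pixels
                if reference[y][x] is not None and prior[y][x] > 0]

    every_pixel = [(y, x) for y in range(h) for x in range(w)]
    all_ratios = ratios(every_pixel)
    # A single global scale is the fallback for cells with no reference depth.
    global_scale = median(all_ratios) if all_ratios else 1.0

    aligned = [row[:] for row in prior]
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            pixels = [(y, x)
                      for y in range(y0, min(y0 + cell, h))
                      for x in range(x0, min(x0 + cell, w))]
            cell_ratios = ratios(pixels)
            scale = median(cell_ratios) if cell_ratios else global_scale
            for y, x in pixels:
                aligned[y][x] = prior[y][x] * scale
    return aligned
```

Compared with one global scale factor, a per-cell scale can absorb distortions that vary across the image, which is the motivation the summary gives for the grid-based strategy.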

Key results

Grid-based scale alignment corrects monocular depth distortions
3D Gaussian Splatting enables efficient online mapping and high-quality rendering
Surpasses RGB-D methods in both geometry accuracy and visual fidelity
Hierarchical optimization reduces trajectory error by 29.3%
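
For readers unfamiliar with how a figure like the 29.3% trajectory error reduction is typically derived, the sketch below computes a root-mean-square absolute trajectory error (ATE) and the relative reduction between two error values. It is a simplified illustration: the function names are hypothetical, the trajectories are assumed to be already aligned and time-synchronized, and this summary does not state which error metric or baseline the paper used.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position errors between two equally long, pre-aligned trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [sum((e - g) ** 2 for e, g in zip(est, gt))
               for est, gt in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(before, after):
    """Percentage by which an error drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (before - after) / before
```

As a purely hypothetical example, an error that drops from 1.000 to 0.707 in the same unit corresponds to a reduction of about 29.3%.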

Why it matters

It shows that fast, high-fidelity 3D scene reconstruction is achievable with only a lightweight, low-cost RGB camera, without relying on expensive depth sensors or LiDAR.

Abstract

We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. Existing neural SLAM or 3DGS-based SLAM methods often trade off rendering quality against geometry accuracy; our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance the ability for geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3-D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3-D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on the Replica, ScanNet, Waymo Open, ETH3D SLAM, and ScanNet++ datasets, we demonstrate significant improvements over existing neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality.
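
The map update described in the abstract, deforming Gaussians based on anchored keyframe updates, can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: each Gaussian is anchored to exactly one keyframe, poses are rigid camera-to-world transforms given as a rotation matrix and translation vector, and only the Gaussian means are moved. The function names are hypothetical, and a full system would likely also update orientations, covariances, and possibly scale.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in R]


def apply_pose(pose, point):
    """Map a point through a rigid pose (R, t): R @ point + t."""
    R, t = pose
    return [a + b for a, b in zip(mat_vec(R, point), t)]


def invert_pose(pose):
    """Inverse of a rigid pose: (R^T, -R^T t)."""
    R, t = pose
    Rt = [list(col) for col in zip(*R)]
    return Rt, [-x for x in mat_vec(Rt, t)]


def deform_means(means, anchor_ids, old_poses, new_poses):
    """Move each Gaussian mean along with its anchor keyframe.

    The mean is expressed in the anchor keyframe's old camera frame and
    then mapped back to the world with the keyframe's updated pose.
    """
    updated = []
    for mean, k in zip(means, anchor_ids):
        local = apply_pose(invert_pose(old_poses[k]), mean)
        updated.append(apply_pose(new_poses[k], local))
    return updated
```

Because each mean is updated with one closed-form transform instead of being re-optimized, an update of this kind can run immediately after pose graph optimization, which is consistent with the "instant map updates" the abstract describes.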

Index terms

SLAM, Mapping, Dense Reconstruction, Visual Learning