ICRA 2026

CDV-SLAM: Compact Deep Visual SLAM with Unified Semantic and Geometric Perception

Ya Fan, Rongling Lang

AI summary

Unifying semantic and geometric perception via a compact foundation model significantly boosts monocular SLAM accuracy and robustness while maintaining real-time efficiency.
Monocular SLAM, Semantic-Geometric Fusion, Visual Foundation Models, Scale Correction, Optical Flow, Compact Deep Learning

Problem

Monocular visual SLAM degrades under fast motion, dynamic objects, and scale ambiguity, and existing deep learning methods struggle to address all three in a single compact model without heavy computational cost.

Approach

CDV-SLAM reuses frozen semantic features from a vision foundation model to predict optical flow, segment dynamic objects, and estimate depth, fusing them with lightweight geometric features in a flow network and applying local scale correction during bundle adjustment.
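
As a rough illustration of how a monocular depth prediction can anchor scale inside a local window, the sketch below estimates one scale factor as the median ratio between predicted depth and the up-to-scale depth recovered by bundle adjustment, then rescales the window's camera translations. This is not the paper's formulation: the function names `estimate_scale` and `rescale_translations`, the median-ratio rule, and the validity flags are all assumptions made for illustration.

```python
from statistics import median


def estimate_scale(slam_depths, predicted_depths, valid):
    """Median ratio of predicted depth to SLAM (up-to-scale) depth.

    Hypothetical helper: points flagged invalid (for example, points on
    dynamic objects) or with non-positive depth are ignored.
    """
    ratios = [
        d_pred / d_slam
        for d_slam, d_pred, ok in zip(slam_depths, predicted_depths, valid)
        if ok and d_slam > 0 and d_pred > 0
    ]
    if not ratios:
        return 1.0  # no usable points: leave the scale unchanged
    return median(ratios)


def rescale_translations(translations, scale):
    """Apply one scale factor to every camera translation in a local window."""
    return [tuple(scale * c for c in t) for t in translations]
```

The median is used here only because it tolerates a few bad depth predictions; a real system would more plausibly fold such a term into the bundle adjustment cost instead of applying it as a separate step.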

Key results

  • 42% reduction in average Absolute Trajectory Error (ATE) on KITTI relative to the state of the art
  • Flow-only visual odometry surpasses geometric-only methods on TartanAir and EuRoC, at a 6% speed reduction
  • Compact semantic-geometric fusion enables efficient dynamic object filtering and depth prediction
  • Local scale correction in bundle adjustment effectively suppresses scale drift
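
For readers unfamiliar with the metric, the following minimal sketch shows how an ATE value and a relative reduction such as the one quoted above could be computed. It assumes the estimated and ground-truth trajectories are already time-associated and aligned; monocular evaluations normally run a similarity alignment first, and that step is omitted here. The function names `ate_rmse` and `relative_reduction` are illustrative and are not taken from the paper's code.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position error between two equally long lists of (x, y, z)."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    squared = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of new_error relative to baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

With made-up numbers, a baseline ATE of 1.00 m and a new ATE of 0.58 m would correspond to a reduction of 0.42, that is, 42%. These values are for illustration only and do not come from the paper.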

Why it matters

Enables robust, real-time monocular localization for robotics and autonomous systems by efficiently combining semantic understanding with geometric tracking.

Abstract

Robust monocular visual Simultaneous Localization and Mapping (SLAM) serves as a cornerstone for various applications. However, its performance frequently suffers degradation in challenging scenarios including fast motion, dynamic objects, and scale ambiguity. This paper proposes CDV-SLAM, a compact deep visual SLAM framework that unifies geometric and semantic perception through a shared visual foundation model. A tight semantic-geometric fusion network is devised to predict optical flow in fast motion. Semantic features are efficiently reused to obtain segmentation and monocular depth for dynamic object exclusion and scale acquisition. To further address scale drift, we introduce local scale correction in bundle adjustment. Experimental results demonstrate a 42% decrease in average Absolute Trajectory Error (ATE) on the KITTI dataset over the state-of-the-art. Furthermore, our flow-only visual odometry surpasses geometric-only methods on the TartanAir and EuRoC datasets, with a marginal speed reduction of 6%. Our code is publicly available at https://github.com/FrankYard/CDV-SLAM.
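
The dynamic object exclusion mentioned in the abstract can be pictured with a small sketch: given flow correspondences and a per-pixel segmentation mask, drop every correspondence whose source pixel lies on a dynamic object before it reaches pose estimation. The function name `filter_static_matches`, the data layout, and the hard rejection rule are assumptions for illustration, not the authors' implementation.

```python
def filter_static_matches(matches, dynamic_mask):
    """Keep flow correspondences whose source pixel is not on a dynamic object.

    matches: list of ((u, v), (u2, v2)) integer pixel pairs.
    dynamic_mask: 2D list of booleans indexed [row][col]; True means dynamic.
    Correspondences whose source pixel falls outside the mask are dropped.
    """
    height = len(dynamic_mask)
    width = len(dynamic_mask[0]) if height else 0
    kept = []
    for (u, v), target in matches:
        inside = 0 <= v < height and 0 <= u < width
        if inside and not dynamic_mask[v][u]:
            kept.append(((u, v), target))
    return kept
```

A hard mask is the simplest choice; a system could instead down-weight such correspondences in the optimization, which this sketch does not attempt.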

Index terms

Deep Learning Methods, SLAM, Localization
