MASTD3R-SLAM: Monocular Adaptive Semantic Tracking and Dynamic Reconstruction SLAM
Fengwei Yang, Qingran Lin, Chaolun Zhu
AI summary
Problem
Traditional and neural SLAM systems struggle with dynamic scenes, suffering from tracking drift and mapping artifacts when processing arbitrary video inputs without predefined camera parameters or static scene assumptions.
Approach
The method fuses semantic segmentation with depth anomaly detection to generate adaptive dynamic masks, applies coarse-to-fine point cloud alignment for pose correction, and uses 3D Gaussian splatting with dynamic suppression to refine rendering and eliminate ghosting artifacts.
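The mask-fusion step can be illustrated with a minimal sketch. It is not the authors' implementation: the union rule, the fixed relative threshold `rel_thresh`, and the function name `fuse_dynamic_mask` are all assumptions made for illustration, and the summary does not say how the adaptive threshold is actually chosen.

```python
import numpy as np

def fuse_dynamic_mask(semantic_mask, depth_pred, depth_ref, rel_thresh=0.2):
    """Return a boolean (H, W) mask of pixels treated as dynamic.

    semantic_mask: bool (H, W), True where a segmenter labels a movable class.
    depth_pred, depth_ref: float (H, W) depths for the same view, e.g. the
        current prediction and the depth expected from the existing map.
    rel_thresh: hypothetical relative-residual threshold (an assumption).
    """
    # Only compare pixels where both depths are defined.
    valid = (depth_ref > 0) & (depth_pred > 0)

    # Relative depth residual; invalid pixels keep a residual of zero.
    residual = np.zeros_like(depth_ref, dtype=float)
    residual[valid] = np.abs(depth_pred[valid] - depth_ref[valid]) / depth_ref[valid]

    # A pixel is a depth anomaly if its residual exceeds the threshold.
    depth_anomaly = valid & (residual > rel_thresh)

    # Fuse the two cues: dynamic if either the semantic or the depth cue fires.
    return semantic_mask | depth_anomaly
```

Taking the union means the depth cue can catch moving objects outside the segmenter's class list, while the semantic cue can catch movable objects that happen to be momentarily still. Other fusion rules (intersection, weighted voting) are equally plausible readings of the summary.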
Key results
- Improved tracking ATE accuracy by over 20% compared to the MASt3R-SLAM baseline
- Eliminated dynamic object ghosting and rendering artifacts in reconstructed maps
- Achieved stable real-time processing at 13 FPS on arbitrary video inputs
- Outperformed state-of-the-art baselines in trajectory accuracy and rendering fidelity across indoor and outdoor datasets
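For context on the ATE figure above, absolute trajectory error is commonly reported as the RMSE of camera positions after aligning the estimated trajectory to ground truth. The sketch below assumes that convention, with a similarity (scale, rotation, translation) alignment computed by the Umeyama method, as is usual for monocular systems. The text here does not state which alignment the authors use, and the function names are illustrative.

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Least-squares similarity transform (s, R, t) mapping src onto dst.

    src, dst: float arrays of shape (N, 3) with corresponding positions.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src if with_scale else 1.0
    t = mu_dst - s * R @ mu_src
    return s, R, t

def ate_rmse(est, gt, with_scale=True):
    """RMSE of position error after aligning est (N, 3) to gt (N, 3)."""
    s, R, t = umeyama_alignment(est, gt, with_scale)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Under this convention a lower value is better, so an accuracy improvement of over 20% would correspond to a relative drop in ATE RMSE against the baseline.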
Why it matters
Enables reliable visual navigation and dense mapping for robots and AR/VR systems operating in unpredictable, real-world environments with moving objects.
Abstract
Dynamic scenes have long been a core obstacle to the application and generalization of SLAM systems. Traditional visual SLAM systems often rely on depth sensors and prior camera parameters, which makes it difficult to handle dynamic content in arbitrary input images while simultaneously constructing dense maps. Recently, view-oriented point cloud prediction foundation models have attracted significant attention. Their ability to perform 3D reconstruction without camera priors has led to SLAM systems such as SLAM3R and MASt3R-SLAM. However, these systems struggle in dynamic scenes and cannot directly apply traditional corrections such as semantic masking or optical flow segmentation. To address this issue, we propose MASTD3R-SLAM, a SLAM method designed specifically for dynamic scenes that supports arbitrary video inputs. The method combines fused mask-based processing with coarse-to-fine pointmap alignment and optimization to achieve point cloud–to–pose re-mapping correction, and further performs Gaussian rendering to remove rendering artifacts and suppress interference from dynamic content during mapping. Compared to the original baseline, our approach improves tracking ATE accuracy by more than 20% and successfully restores the correct 3D map.