ICRA 2026

MASTD3R-SLAM: Monocular Adaptive Semantic Tracking and Dynamic Reconstruction SLAM

Fengwei Yang, Qingran Lin, Chaolun Zhu

AI summary

MASTD3R-SLAM enables robust, high-fidelity 3D reconstruction and tracking in dynamic scenes using arbitrary monocular video without camera priors.

Dynamic SLAM, Monocular Reconstruction, 3D Gaussian Splatting, Semantic Masking, Pose Correction, Neural Rendering

Problem

Traditional and neural SLAM systems struggle with dynamic scenes, suffering from tracking drift and mapping artifacts when processing arbitrary video inputs without predefined camera parameters or static scene assumptions.

Approach

The method fuses semantic segmentation with depth anomaly detection to generate adaptive dynamic masks, applies coarse-to-fine point cloud alignment for pose correction, and uses 3D Gaussian splatting with dynamic suppression to refine rendering and eliminate ghosting artifacts.
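A minimal sketch of how a semantic mask and a depth anomaly test might be fused into one dynamic mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `fuse_dynamic_mask`, the relative-error test, the union rule, and the threshold `rel_thresh=0.15` are all hypothetical, since this summary does not say how the two cues are combined.

```python
import numpy as np

def fuse_dynamic_mask(semantic_mask, depth_pred, depth_ref, rel_thresh=0.15):
    """Return a boolean (H, W) mask that is True where a pixel is treated as dynamic.

    semantic_mask: (H, W) bool, True for pixels of movable classes (assumed input).
    depth_pred, depth_ref: (H, W) float depth maps in the same frame and scale.
    rel_thresh: relative depth disagreement that flags a pixel (assumed value).
    """
    depth_pred = np.asarray(depth_pred, dtype=float)
    depth_ref = np.asarray(depth_ref, dtype=float)

    # Only compare pixels where both depth maps hold a usable value.
    valid = (depth_pred > 0) & (depth_ref > 0)

    # Relative disagreement between the two depth maps.
    rel_err = np.zeros_like(depth_pred)
    rel_err[valid] = np.abs(depth_pred[valid] - depth_ref[valid]) / depth_ref[valid]
    depth_anomaly = valid & (rel_err > rel_thresh)

    # Assumed fusion rule: a pixel is dynamic if either cue says so.
    return np.asarray(semantic_mask, dtype=bool) | depth_anomaly
```

A union keeps objects that the segmenter misses but that move in depth, at the cost of more false positives; an intersection would be the more conservative choice.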

Key results

Improved tracking accuracy, measured by absolute trajectory error (ATE), by more than 20% over the MASt3R-SLAM baseline
Eliminated dynamic object ghosting and rendering artifacts in reconstructed maps
Achieved stable real-time processing at 13 FPS on arbitrary video inputs
Outperformed state-of-the-art baselines in trajectory accuracy and rendering fidelity across indoor and outdoor datasets
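For readers unfamiliar with the metric, the sketch below shows one common way to compute ATE as an RMSE after a similarity (Sim(3)) alignment in the style of Umeyama, which is a frequent choice for monocular trajectories with unknown scale. The evaluation protocol actually used in the paper is not stated on this page, and the names `align_sim3`, `ate_rmse`, and `relative_improvement` are hypothetical.

```python
import numpy as np

def align_sim3(est, gt):
    """Least-squares scale, rotation, translation mapping est onto gt, both (N, 3)."""
    n = est.shape[0]
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e_c, g_c = est - mu_e, gt - mu_g

    # Cross-covariance between ground truth and estimate.
    cov = g_c.T @ e_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    var_e = (e_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * (R @ mu_e)
    return s, R, t

def ate_rmse(est, gt):
    """RMSE of position error after Sim(3) alignment of est onto gt."""
    s, R, t = align_sim3(est, gt)
    aligned = s * (est @ R.T) + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def relative_improvement(ate_baseline, ate_ours):
    """Fractional reduction in ATE relative to a baseline."""
    return (ate_baseline - ate_ours) / ate_baseline
```

With made-up inputs, `relative_improvement(0.10, 0.08)` is about 0.2, which is what a 20% improvement would look like under this reading of the metric.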

Why it matters

Enables reliable visual navigation and dense mapping for robots and AR/VR systems operating in unpredictable, real-world environments with moving objects.

Abstract

The challenge of dynamic scenes has long been one of the core issues in the application and generalization of SLAM systems. Traditional visual SLAM systems often rely on depth sensors and prior camera parameters, making it difficult to compensate for dynamic content in arbitrary input images while simultaneously constructing dense maps. Recently, view-oriented point cloud prediction foundation models have attracted significant attention. Their impressive capability of performing 3D reconstruction without requiring camera priors has led to the emergence of SLAM systems such as SLAM3R and MASt3R-SLAM. However, these systems face challenges when applied to dynamic scenes and cannot directly use traditional correction methods such as semantic masking or optical flow segmentation. To address this issue, we propose MASTD3R-SLAM, a SLAM method specifically designed for dynamic scenes that supports arbitrary video inputs. The method combines fused mask-based processing with coarse-to-fine pointmap alignment and optimization to achieve point cloud-to-pose re-mapping correction, and further performs Gaussian rendering to remove rendering artifacts and suppress dynamic mapping interference. Compared to the original baseline, our approach improves tracking ATE accuracy by more than 20% and successfully restores the correct 3D map.
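As a rough illustration of coarse-to-fine alignment restricted to static points, the sketch below fits a rigid transform on a strided subsample first and then refits on all static points that agree with the coarse estimate. It assumes per-point correspondences between the two pointmaps and uses a plain Kabsch solver; the abstract does not describe the actual optimization, so `kabsch`, `coarse_to_fine_align`, `stride`, and `inlier_thresh` are hypothetical.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst, both (N, 3)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the SVD solution.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

def coarse_to_fine_align(src, dst, static, stride=8, inlier_thresh=0.05):
    """Align src to dst using only points flagged static.

    src, dst: (N, 3) corresponding points (assumed pixel-aligned pointmaps, flattened).
    static: (N,) bool, True where the point is NOT covered by the dynamic mask.
    """
    idx = np.flatnonzero(static)

    # Coarse stage: cheap fit on every stride-th static point.
    coarse = idx[::stride]
    R, t = kabsch(src[coarse], dst[coarse])

    # Fine stage: refit on all static points consistent with the coarse pose.
    resid = np.linalg.norm(src[idx] @ R.T + t - dst[idx], axis=1)
    inliers = idx[resid < inlier_thresh]
    if inliers.size >= 3:
        R, t = kabsch(src[inliers], dst[inliers])
    return R, t
```

The residual test in the fine stage gives a second chance to reject moving points that the mask missed, though a real system would likely use a robust cost rather than a hard threshold.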

Index terms

SLAM, Mapping, Visual Tracking