MTE-SLAM: Multi-Tier Feature Fusion for Efficient Neural Semantic SLAM
Danqi Lu, Changxin Huang, Zhuangzhuang Chen, Zhiliang Lin, Dachong Li, Yanbin Chang, Jianqiang Li
AI summary
Problem
Existing semantic neural SLAM methods rely on coarse feature fusion, leading to blurred object boundaries, loss of fine details, and high computational overhead that hinders real-time performance.
Approach
The framework introduces a Multi-Tier Feature Fusion module to progressively integrate global scene context and local spatial continuity, combined with a Feature Redundancy Suppressor that dynamically prunes uninformative features to accelerate training and inference.
Key results
- Centimeter-level 3D reconstruction with sharp semantic boundaries
- State-of-the-art tracking and semantic segmentation accuracy
- Up to 4x faster inference and training than competing semantic SLAM systems
- Robust performance validated on Replica and ScanNet datasets
Why it matters
Provides a computationally efficient and highly accurate semantic mapping solution critical for real-time robotics, autonomous navigation, and augmented reality applications.
Abstract
Neural implicit representations have demonstrated excellent performance in Simultaneous Localization and Mapping (SLAM) by virtue of their ability to jointly model geometry, color and camera poses. Recent studies have attempted to integrate scene semantic information into implicit representation frameworks, significantly improving environmental understanding. Nevertheless, most existing methods rely on direct semantic coloring or coarse fusion with other modalities, which leaves semantic cues underutilized. This in turn causes problems such as blurred small objects, loss of fine structures and unclear regional boundaries. Additionally, redundant features introduced in the process reduce system efficiency. To address these challenges, we propose MTE-SLAM, an accurate and efficient end-to-end neural RGB-D semantic SLAM framework that combines Multi-Tier Feature Fusion (MTFF) with a Feature Redundancy Suppressor (FRS). MTFF progressively fuses semantic features at global and local scales. The global context enhancement module captures scene-level semantic correlations, while the local continuity enhancement module refines neighborhood consistency, generating detailed and coherent semantic maps. FRS adaptively filters redundant features based on their importance and temporal variation, reducing parameters and computation while preserving representational power to accelerate training and inference. Comprehensive evaluations on Replica and ScanNet demonstrate that MTE-SLAM achieves centimeter-level reconstruction, state-of-the-art tracking and semantic accuracy, and runs up to four times faster than existing semantic SLAM systems.
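To make the idea of filtering features by importance and temporal variation concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not define how FRS measures either quantity, so everything here is an assumption: the function names `redundancy_scores` and `keep_mask`, the use of mean absolute magnitude as an importance proxy, the mean absolute change between two snapshots as a variation proxy, the mixing weight `alpha`, and the top-`keep_ratio` selection rule.

```python
from typing import List, Sequence


def redundancy_scores(
    current: Sequence[Sequence[float]],
    previous: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> List[float]:
    """Score each feature vector; a low score marks a pruning candidate.

    importance: mean absolute value of the current vector (assumed proxy)
    variation:  mean absolute change since the previous snapshot (assumed proxy)
    """
    scores = []
    for cur, prev in zip(current, previous):
        importance = sum(abs(c) for c in cur) / len(cur)
        variation = sum(abs(c - p) for c, p in zip(cur, prev)) / len(cur)
        # Hypothetical weighting of the two criteria.
        scores.append(alpha * importance + (1.0 - alpha) * variation)
    return scores


def keep_mask(scores: Sequence[float], keep_ratio: float = 0.5) -> List[bool]:
    """Keep the top `keep_ratio` fraction of features by score (at least one)."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(scores))]


# Toy example: four 2-D feature vectors at two consecutive snapshots.
previous = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [2.0, -2.0]]
current = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.9], [2.0, -2.0]]
scores = redundancy_scores(current, previous)
mask = keep_mask(scores, keep_ratio=0.5)
print(mask)
```

In this toy setup the all-zero feature and the static, lower-magnitude feature are dropped, while the feature that changed between snapshots and the high-magnitude feature are kept. A real system would presumably apply such a mask to learned feature-grid entries during optimization; how MTE-SLAM does so is not specified in the abstract.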