ICRA 2026

MTE-SLAM: Multi-Tier Feature Fusion for Efficient Neural Semantic SLAM

Danqi Lu, Changxin Huang, Zhuangzhuang Chen, Zhiliang Lin, Dachong Li, Yanbin Chang, Jianqiang Li

AI summary

MTE-SLAM achieves state-of-the-art tracking and semantic accuracy with centimeter-level reconstruction while running up to four times faster than existing semantic SLAM systems.

Neural SLAM, Semantic Mapping, Multi-Tier Feature Fusion, Feature Redundancy Suppression, Real-time Perception, RGB-D SLAM

Problem

Existing semantic neural SLAM methods rely on coarse feature fusion, leading to blurred object boundaries, loss of fine details, and high computational overhead that hinders real-time performance.

Approach

The framework introduces a Multi-Tier Feature Fusion module to progressively integrate global scene context and local spatial continuity, combined with a Feature Redundancy Suppressor that dynamically prunes uninformative features to accelerate training and inference.
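
To make the two-tier idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not specify how either tier is computed, so the softmax attention used for the global tier, the neighbor averaging used for the local tier, and all function names (`global_context`, `local_continuity`, `multi_tier_fusion`) are assumptions chosen only for illustration.

```python
from math import exp


def global_context(features):
    """Global tier (assumed form): every feature vector attends to all others.

    Dot-product scores are softmax-normalized, and the resulting scene-level
    context vector is added back to the input as a residual.
    """
    fused = []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) for key in features]
        top = max(scores)  # subtract the max for numerical stability
        weights = [exp(s - top) for s in scores]
        total = sum(weights)
        context = [
            sum(w * key[d] for w, key in zip(weights, features)) / total
            for d in range(len(query))
        ]
        fused.append([x + c for x, c in zip(query, context)])
    return fused


def local_continuity(features, radius=1):
    """Local tier (assumed form): average each vector with its neighbors.

    Neighbors are taken in sample order, e.g. consecutive points along a ray.
    """
    n = len(features)
    smoothed = []
    for i in range(n):
        window = features[max(0, i - radius):min(n, i + radius + 1)]
        smoothed.append([sum(col) / len(window) for col in zip(*window)])
    return smoothed


def multi_tier_fusion(features, radius=1):
    """Apply the global tier first, then the local tier."""
    return local_continuity(global_context(features), radius)
```

Running the global tier before the local one mirrors the "progressively integrate" wording above: scene-level context is injected first, and the neighborhood pass then smooths the result. The actual ordering and operators in MTE-SLAM may differ.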

Key results

Centimeter-level 3D reconstruction with sharp semantic boundaries
State-of-the-art tracking and semantic segmentation accuracy
Up to 4x faster inference and training than competing semantic SLAM systems
Robust performance validated on Replica and ScanNet datasets

Why it matters

MTE-SLAM offers a computationally efficient and accurate semantic mapping solution for applications that depend on real-time scene understanding, such as robotics, autonomous navigation, and augmented reality.

Abstract

Neural implicit representations have demonstrated excellent performance in Simultaneous Localization and Mapping (SLAM) by virtue of their ability to jointly model geometry, color and camera poses. Recent studies have attempted to integrate scene semantic information into implicit representation frameworks, significantly improving environmental understanding. Nevertheless, most existing methods rely on direct semantic coloring or coarse fusion with other modalities, leaving semantic cues underutilized. This in turn causes problems such as blurred small objects, loss of fine structures and unclear region boundaries. Additionally, redundant features introduced in the process reduce system efficiency. To address these challenges, we propose MTE-SLAM, an accurate and efficient end-to-end neural RGB-D semantic SLAM framework that combines Multi-Tier Feature Fusion (MTFF) with a Feature Redundancy Suppressor (FRS). MTFF progressively fuses semantic features at global and local scales. The global context enhancement module captures scene-level semantic correlations, while the local continuity enhancement module refines neighborhood consistency, generating detailed and coherent semantic maps. FRS adaptively filters redundant features based on their importance and temporal variation, reducing parameters and computation while preserving representational power to accelerate training and inference. Comprehensive evaluations on Replica and ScanNet demonstrate that MTE-SLAM achieves centimeter-level reconstruction, state-of-the-art tracking and semantic accuracy, and runs up to four times faster than existing semantic SLAM systems.
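
The filtering step attributed to FRS can be illustrated with a small, hedged sketch. The abstract does not define how importance or temporal variation is measured, nor how the two are combined, so everything below is an assumption made for illustration: importance is taken as mean absolute activation, temporal variation as mean absolute frame-to-frame change, channels that are both weak and static are treated as redundant, and the names `channel_scores`, `select_channels`, `prune`, `keep_ratio` and `alpha` are hypothetical.

```python
def channel_scores(history):
    """Score each feature channel over a short history of frames.

    history: list of frames, each a list with one value per channel.
    Returns (importance, variation), both lists with one entry per channel.
    """
    n_frames = len(history)
    n_channels = len(history[0])
    # Assumed importance measure: mean absolute activation.
    importance = [
        sum(abs(frame[c]) for frame in history) / n_frames
        for c in range(n_channels)
    ]
    # Assumed temporal-variation measure: mean absolute frame-to-frame change.
    variation = [
        sum(abs(history[t][c] - history[t - 1][c]) for t in range(1, n_frames))
        / max(1, n_frames - 1)
        for c in range(n_channels)
    ]
    return importance, variation


def select_channels(history, keep_ratio=0.5, alpha=0.5):
    """Return the sorted indices of the channels to keep.

    alpha weights importance against temporal variation (hypothetical knob).
    """
    importance, variation = channel_scores(history)
    score = [alpha * i + (1 - alpha) * v for i, v in zip(importance, variation)]
    keep = max(1, round(keep_ratio * len(score)))
    ranked = sorted(range(len(score)), key=lambda c: score[c], reverse=True)
    return sorted(ranked[:keep])


def prune(frame, kept):
    """Drop every channel of a frame that is not in the kept index list."""
    return [frame[c] for c in kept]
```

For example, with the three-frame history `[[1.0, 0.0, 0.1, 2.0], [1.0, 0.0, 0.1, 0.0], [1.0, 0.0, 0.1, 2.0]]` and the default settings, this sketch keeps channels 0 and 3: channel 0 is consistently strong and channel 3 changes the most, while channels 1 and 2 are both weak and static. Shrinking the feature vector this way is what would reduce downstream parameters and computation, although the criterion MTE-SLAM actually uses may be learned rather than hand-set.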

Index terms

SLAM