ICRA 2026

SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction

Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li, Guanjing Liu, Zimu Tan, Long Chen, Hangjun Ye, Xiaoshuai Hao

AI summary

SEF-MAP achieves state-of-the-art HD map prediction by explicitly decomposing multimodal features into specialized subspaces and adaptively weighting them based on uncertainty, significantly boosting robustness under degraded sensor conditions.

HD map prediction, multimodal fusion, subspace decomposition, uncertainty-aware gating, robust perception, autonomous driving

Problem

Current multimodal fusion methods for HD map prediction suffer from misalignment between camera and LiDAR features, and they lose accuracy when one sensor is impaired by noise, occlusion, or poor lighting.

Approach

The framework decomposes bird's-eye-view features into four semantic subspaces, assigns each to a specialized expert, and uses an uncertainty-aware gating mechanism to dynamically weight expert outputs based on predictive variance.
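The gating step can be illustrated for a single BEV cell. This is a minimal sketch, not the authors' implementation: the summary says only that experts are weighted by predictive variance, so the scoring rule here (gate logit minus log variance, followed by a softmax) is an assumption, and the names `uncertainty_gate` and `fuse_cell` are hypothetical.

```python
import math


def uncertainty_gate(logits, variances, eps=1e-6):
    """Return one weight per expert for a single BEV cell.

    Assumed scoring rule (not from the paper): gate logit minus the log of
    the expert's predictive variance, normalized with a softmax. With all
    logits equal, this reduces to inverse-variance weighting.
    """
    scores = [g - math.log(v + eps) for g, v in zip(logits, variances)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_cell(expert_outputs, weights):
    """Weighted sum of per-expert feature vectors for one BEV cell."""
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]


# Four experts (LiDAR-private, Image-private, Shared, Interaction);
# the last one reports a much higher variance and is down-weighted.
weights = uncertainty_gate([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 10.0])
fused = fuse_cell([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], weights)
```

In this toy example the high-variance expert contributes almost nothing to the fused feature, which is the qualitative behavior the approach describes.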

Key results

Explicitly disentangles multimodal BEV features into four semantic subspaces to mitigate cross-modal misalignment.
Introduces distribution-aware masking with specialization losses to enforce expert roles and improve unimodal robustness.
Designs an uncertainty-aware gating mechanism with balance regularizers for adaptive expert selection and collapse prevention.
Achieves state-of-the-art performance with +4.2% mAP on nuScenes and +4.8% mAP on Argoverse2.
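One way to picture the balance regularizer mentioned above is as a penalty on uneven average expert usage. The exact form is not given in this summary, so the sketch below assumes a KL divergence between the mean gate weights and a uniform distribution; `usage_balance_loss` is a hypothetical name.

```python
import math


def usage_balance_loss(cell_weights, eps=1e-12):
    """KL(mean usage || uniform) over experts.

    cell_weights: list of per-cell gate weight vectors, each summing to 1.
    The KL form is an assumption for illustration; it is zero when all
    experts are used equally and grows as usage collapses onto one expert.
    """
    num_cells = len(cell_weights)
    num_experts = len(cell_weights[0])
    usage = [
        sum(w[k] for w in cell_weights) / num_cells
        for k in range(num_experts)
    ]
    return sum(u * math.log((u + eps) * num_experts) for u in usage)


balanced = usage_balance_loss([[0.25, 0.25, 0.25, 0.25]] * 8)
collapsed = usage_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 8)
```

Under this assumption the collapsed case evaluates to log(4), the maximum for four experts, while the balanced case is approximately zero.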

Why it matters

Provides autonomous driving systems with a robust, interpretable fusion framework that maintains high map prediction accuracy even when camera or LiDAR inputs are partially degraded or missing.

Abstract

High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Expert Fusion framework for robust multi-modal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
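The modality-drop simulation in the abstract can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: it tracks only an exponential moving average of the per-channel feature mean and substitutes that mean for a dropped modality, whereas the actual surrogate statistics are not specified here. `EmaSurrogate` and `maybe_drop` are hypothetical names.

```python
import random


class EmaSurrogate:
    """Tracks an EMA of the per-channel mean of one modality's BEV features."""

    def __init__(self, num_channels, momentum=0.99):
        self.momentum = momentum
        self.mean = [0.0] * num_channels
        self.initialized = False

    def update(self, features):
        """features: list of per-cell channel vectors for one modality."""
        n = len(features)
        batch_mean = [
            sum(cell[c] for cell in features) / n
            for c in range(len(self.mean))
        ]
        if not self.initialized:
            # Seed the EMA with the first batch to avoid a bias toward zero.
            self.mean = batch_mean
            self.initialized = True
        else:
            m = self.momentum
            self.mean = [
                m * old + (1 - m) * new
                for old, new in zip(self.mean, batch_mean)
            ]

    def surrogate(self, num_cells):
        """Stand-in features: the EMA mean repeated for every cell."""
        return [list(self.mean) for _ in range(num_cells)]


def maybe_drop(features, ema, drop_prob, rng):
    """Update the EMA, then replace the modality with its surrogate
    with probability drop_prob. Returns (features, was_dropped)."""
    ema.update(features)
    if rng.random() < drop_prob:
        return ema.surrogate(len(features)), True
    return features, False


rng = random.Random(0)
lidar_ema = EmaSurrogate(num_channels=2, momentum=0.5)
lidar_ema.update([[1.0, 2.0], [3.0, 4.0]])
masked, dropped = maybe_drop([[0.0, 0.0]], lidar_ema, drop_prob=1.0, rng=rng)
```

Replacing a dropped modality with running statistics rather than zeros keeps the masked input close to the feature distribution the experts see during normal training, which is the motivation the abstract gives for distribution-aware masking.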

Index terms

Deep Learning for Visual Perception; Sensor Fusion; Mapping