ICRA 2024

DefFusion: Deformable Multimodal Representation Fusion for 3D Semantic Segmentation

Rongtao Xu, Changwei Wang, Duzhen Zhang, Man Zhang, Shibiao Xu, Weiliang Meng, Xiaopeng Zhang

Abstract

The complementarity between camera and LiDAR data makes fusion methods a promising approach to improve 3D semantic segmentation performance. Recent transformer-based methods have also demonstrated superiority in segmentation. However, multimodal solutions incorporating transformers are underexplored and face two key inherent difficulties: over-attention and noise from different modal data. To overcome these challenges, we propose a Deformable Multimodal Representation Fusion (DefFusion) framework consisting mainly of a Deformable Representation Fusion Transformer and Dynamic Representation Augmentation Modules. The Deformable Representation Fusion Transformer introduces the deformable mechanism in multimodal fusion, avoiding over-attention and improving efficiency by adaptively modeling a 2D key/value set for a given 3D query, thus enabling multimodal fusion with higher flexibility. To enhance the 2D representation and 3D representation, the Dynamic Representation Enhancement Module is proposed to dynamically remove noise in the input representation via Dynamic Grouped Representation Generation and Dynamic Mask Generation. Extensive experiments validate that our model achieves the best 3D semantic segmentation performance on SemanticKITTI and NuScenes benchmarks.
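
The idea of "adaptively modeling a 2D key/value set for a given 3D query" can be illustrated with a minimal NumPy sketch of generic deformable cross-attention. This is not the authors' implementation: the function names, the weight matrices `W_off`, `W_attn`, and `W_val`, the single-head, single-scale setup, and the assumption that each 3D point already has a projected pixel location `ref_xy` are all hypothetical choices made for illustration. Each 3D query predicts `K` sampling offsets around its projected location, gathers image features there by bilinear interpolation, and combines them with softmax weights, so it attends to only `K` image locations rather than the whole feature map.

```python
import numpy as np

def bilinear_sample(feat, xy):
    """Sample feat (H, W, C) at continuous pixel coords xy (..., 2), ordered (x, y)."""
    H, W, _ = feat.shape
    # Clamp coordinates to the image so out-of-range samples reuse border pixels.
    x = np.clip(xy[..., 0], 0, W - 1)
    y = np.clip(xy[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def deformable_cross_attention(q3d, ref_xy, feat2d, W_off, W_attn, W_val):
    """
    q3d:    (N, C)    3D point (query) features
    ref_xy: (N, 2)    assumed projection of each point into the image, as (x, y)
    feat2d: (H, W, C) 2D image feature map
    W_off:  (C, K*2)  hypothetical offset-prediction weights
    W_attn: (C, K)    hypothetical attention-logit weights
    W_val:  (C, C)    hypothetical value projection
    """
    N, _ = q3d.shape
    K = W_attn.shape[1]
    # Each query predicts K 2D offsets relative to its reference pixel.
    offsets = (q3d @ W_off).reshape(N, K, 2)
    sample_xy = ref_xy[:, None, :] + offsets          # (N, K, 2)
    # The sampled features form the per-query 2D key/value set.
    values = bilinear_sample(feat2d, sample_xy) @ W_val  # (N, K, C)
    # Attention weights come from the query alone, normalized over the K samples.
    attn = softmax(q3d @ W_attn, axis=-1)             # (N, K)
    fused = (attn[..., None] * values).sum(axis=1)    # (N, C)
    return fused, attn, sample_xy

# Small demo with random inputs (all sizes are arbitrary).
rng = np.random.default_rng(0)
N, C, K, H, W = 5, 8, 4, 6, 10
q3d = rng.normal(size=(N, C))
ref_xy = rng.uniform(low=[0, 0], high=[W - 1, H - 1], size=(N, 2))
feat2d = rng.normal(size=(H, W, C))
fused, attn, sample_xy = deformable_cross_attention(
    q3d, ref_xy, feat2d,
    rng.normal(size=(C, K * 2)), rng.normal(size=(C, K)), np.eye(C),
)
print(fused.shape, attn.shape, sample_xy.shape)
```

In this sketch the cost per query scales with `K` rather than with the number of image pixels, which is the general reason deformable sampling is cheaper than dense cross-attention; how DefFusion itself parameterizes offsets and weights is not specified in the abstract.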

Index terms

Semantic Scene Understanding; Autonomous Agents; Sensor Fusion