AttBEV: Enhancing Multi-Modal 3D Object Detection with CBAM Attention in BEVFusion for Autonomous Driving
Na Zhang, Edmundo Guerra Paradas, Antoni Grau Saldes
AI summary
Problem
BEVFusion's simple feature concatenation causes inefficient cross-modal interaction and information loss during BEV projection, limiting detection accuracy for dynamic objects and edge cases.
Approach
The framework replaces BEVFusion's linear fusion with a lightweight CBAM attention module that dynamically recalibrates LiDAR and camera features across channel and spatial dimensions, while swapping the standard voxel encoder for a distance-adaptive variant to improve efficiency.
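To make the channel and spatial recalibration concrete, here is a minimal NumPy sketch of a CBAM-style fusion step. It is an illustration under stated assumptions, not the authors' implementation: the function names, tensor sizes, reduction ratio, random weights, and the choice to concatenate LiDAR and camera BEV features before applying attention are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). A shared two-layer MLP scores avg- and max-pooled channel descriptors."""
    avg = x.mean(axis=(1, 2))                      # (C,)
    mx = x.max(axis=(1, 2))                        # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # ReLU bottleneck
    return sigmoid(mlp(avg) + mlp(mx))             # (C,) weights in (0, 1)

def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, k, k) applied to the channel-wise avg and max maps."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)                                  # (H, W) weights in (0, 1)

def cbam_fuse(lidar_bev, camera_bev, params):
    """Assumed arrangement: concatenate modalities, then channel and spatial attention."""
    x = np.concatenate([lidar_bev, camera_bev], axis=0)
    x = x * channel_attention(x, params["w1"], params["w2"])[:, None, None]
    x = x * spatial_attention(x, params["kernel"])[None, :, :]
    return x

# Toy BEV maps with hypothetical sizes: 8 LiDAR channels, 4 camera channels, 16x16 grid.
rng = np.random.default_rng(0)
lidar = rng.standard_normal((8, 16, 16))
camera = rng.standard_normal((4, 16, 16))
c, r = 12, 4  # total channels and an assumed reduction ratio
params = {
    "w1": rng.standard_normal((c // r, c)) * 0.1,
    "w2": rng.standard_normal((c, c // r)) * 0.1,
    "kernel": rng.standard_normal((2, 7, 7)) * 0.1,
}
fused = cbam_fuse(lidar, camera, params)
print(fused.shape)
```

Because both attention maps lie in (0, 1), each fused value is a down-weighted copy of the corresponding input value. In a trained model the weights would be learned, so that informative channels and BEV locations are suppressed least.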
Key results
- Achieves 0.6795 NDS and 0.6426 mAP on nuScenes, surpassing BEVFusion by 2.63 and 1.79 percentage points, respectively
- Reduces localization and scale errors while significantly improving orientation and velocity estimation
- Introduces a lightweight CBAM-Fuser that dynamically balances cross-modal features without heavy computational overhead
- Replaces standard voxel encoding with DynamicSimpleVFE to enable real-time processing on embedded hardware
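As a rough illustration of the voxel-encoding idea in the last point, the sketch below averages point features per voxel without capping the number of points in a voxel. To our understanding this is what MMDetection3D's DynamicSimpleVFE does, but the function name, arguments, and toy data here are hypothetical, and the code is not taken from the paper or the library.

```python
import numpy as np

def dynamic_mean_vfe(points, voxel_size, range_min):
    """points: (N, D) with x, y, z in the first three columns.

    Returns integer voxel coordinates (M, 3) and the mean point feature per voxel (M, D).
    Every point is kept, so the number of points per voxel varies freely.
    """
    coords = np.floor((points[:, :3] - range_min) / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                  # flatten for consistency across NumPy versions
    sums = np.zeros((len(uniq), points.shape[1]))
    np.add.at(sums, inverse, points)               # scatter-add point features into their voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]

# Three toy points (x, y, z, intensity); the first two fall into the same 1 m voxel.
pts = np.array([
    [0.1, 0.1, 0.1, 1.0],
    [0.2, 0.3, 0.1, 3.0],
    [1.5, 0.1, 0.1, 5.0],
])
voxels, feats = dynamic_mean_vfe(pts, np.array([1.0, 1.0, 1.0]), np.zeros(3))
print(voxels)
print(feats)
```

Avoiding a fixed per-voxel point budget means no points are dropped or zero-padded, which is one plausible source of the efficiency gain described above.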
Why it matters
Enhances perception safety and deployment feasibility for autonomous vehicles by delivering a more accurate, computationally efficient multi-sensor fusion pipeline.
Abstract
Multimodal fusion is of significant research value for environmental perception in autonomous driving. BEVFusion has become one of the mainstream frameworks for LiDAR-camera fusion by unifying multimodal features in the bird’s-eye view (BEV) space. However, its performance is limited by inefficient cross-modal interaction and information loss during BEV projection, especially for dynamic objects and edge cases. To address these limitations, we propose AttBEV, an advanced fusion architecture that introduces a Convolutional Block Attention Module (CBAM) at the feature fusion layer. CBAM is a lightweight attention mechanism that improves the model’s ability to capture key information by dynamically recalibrating features along the channel and spatial dimensions. Extensive experiments on the nuScenes dataset demonstrate that AttBEV outperforms BEVFusion on most evaluation metrics: NDS reaches 0.6795, 2.63 percentage points higher than BEVFusion’s 0.6532, and mAP reaches 0.6426, 1.79 percentage points higher than BEVFusion’s 0.6247. Overall, AttBEV outperforms existing methods in both model accuracy and generalization ability and significantly improves the performance of 3D object detection in autonomous driving scenarios.
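The reported gains are absolute differences between scores on a 0 to 1 scale, which is easy to confuse with relative improvement. The short check below, using only the four scores quoted in the abstract, computes both; the helper name `gains` is ours.

```python
# Scores as quoted in the abstract.
bevfusion = {"NDS": 0.6532, "mAP": 0.6247}
attbev = {"NDS": 0.6795, "mAP": 0.6426}

def gains(baseline, new):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_pp = (new - baseline) * 100
    rel = (new - baseline) / baseline * 100
    return abs_pp, rel

for metric in ("NDS", "mAP"):
    abs_pp, rel = gains(bevfusion[metric], attbev[metric])
    print(f"{metric}: +{abs_pp:.2f} points absolute, +{rel:.2f}% relative")
```

The absolute gains match the quoted 2.63 and 1.79, while the relative gains over the BEVFusion baseline are somewhat larger, roughly 4.0% for NDS and 2.9% for mAP.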