ICRA 2026

AttBEV: Enhancing Multi-Modal 3D Object Detection with CBAM Attention in BEVFusion for Autonomous Driving

Na Zhang, Edmundo Guerra Paradas, Antoni Grau Saldes

AI summary

AttBEV boosts multi-modal 3D object detection accuracy and real-time performance by replacing BEVFusion's static fusion with dynamic CBAM attention and adaptive voxel encoding.

Multi-modal fusion, 3D object detection, CBAM attention, BEV representation, autonomous driving, LiDAR-camera fusion

Problem

BEVFusion's simple feature concatenation causes inefficient cross-modal interaction and information loss during BEV projection, limiting detection accuracy for dynamic objects and edge cases.

Approach

The framework replaces BEVFusion's linear fusion with a lightweight CBAM attention module that dynamically recalibrates LiDAR and camera features across channel and spatial dimensions, while swapping the standard voxel encoder for a distance-adaptive variant to improve efficiency.
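The recalibration described above can be sketched in a few lines of NumPy. This is a minimal illustration of the generic CBAM recipe (channel attention followed by spatial attention) applied to concatenated LiDAR and camera BEV maps, not the paper's CBAM-Fuser: the function names, tensor shapes, reduction ratio, 7x7 kernel, and random weights are all assumptions, and the exact placement of the module inside BEVFusion is not given in this summary.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Per-channel weights in (0, 1) for feat of shape (C, H, W).

    A shared two-layer MLP (w1: (C//r, C), w2: (C, C//r)) is applied to the
    average-pooled and max-pooled channel descriptors, then summed.
    """
    avg = feat.mean(axis=(1, 2))
    mx = feat.max(axis=(1, 2))

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # ReLU hidden layer

    return sigmoid(mlp(avg) + mlp(mx))


def spatial_attention(feat, kernel):
    """Per-location weights in (0, 1) for feat of shape (C, H, W).

    kernel has shape (2, k, k) with k odd; it is slid over the stacked
    channel-average and channel-max maps with zero padding.
    """
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = feat.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)


def cbam_fuse(lidar_bev, camera_bev, w1, w2, kernel):
    """Hypothetical fuser: concatenate both BEV maps, then reweight them."""
    fused = np.concatenate([lidar_bev, camera_bev], axis=0)
    fused = fused * channel_attention(fused, w1, w2)[:, None, None]
    fused = fused * spatial_attention(fused, kernel)[None, :, :]
    return fused


# Toy inputs: 8 LiDAR channels and 4 camera channels on a 16x16 BEV grid.
rng = np.random.default_rng(0)
lidar_bev = rng.standard_normal((8, 16, 16))
camera_bev = rng.standard_normal((4, 16, 16))
w1 = rng.standard_normal((3, 12)) * 0.1   # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((12, 3)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
fused = cbam_fuse(lidar_bev, camera_bev, w1, w2, kernel)
print(fused.shape)
```

Because both attention maps pass through a sigmoid, every fused value is a down-weighted copy of the concatenated input. In a trained model the weights would be learned, so the network can decide per channel and per BEV cell how much to trust each modality.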

Key results

Achieves 0.6795 NDS and 0.6426 mAP on nuScenes, surpassing BEVFusion by 2.63 and 1.79 percentage points, respectively
Reduces localization and scale errors while significantly improving orientation and velocity estimation
Introduces a lightweight CBAM-Fuser that dynamically balances cross-modal features without heavy computational overhead
Replaces standard voxel encoding with DynamicSimpleVFE to enable real-time processing on embedded hardware
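The voxel-encoding change in the last item can be illustrated with a short sketch. It assumes that DynamicSimpleVFE behaves like the MMDetection3D encoder of the same name, which averages however many points fall into each voxel instead of truncating to a fixed per-voxel cap; the summary does not confirm this. The function name `dynamic_voxel_mean` and its parameters are hypothetical.

```python
import numpy as np


def dynamic_voxel_mean(points, voxel_size, range_min):
    """Average all points that fall into the same voxel.

    points:     (N, D) array with x, y, z in the first three columns.
    voxel_size: scalar or length-3 array of voxel edge lengths.
    range_min:  length-3 array, lower corner of the point-cloud range.
    Returns (voxel_coords, voxel_feats) with one row per occupied voxel.
    """
    coords = np.floor((points[:, :3] - range_min) / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten for consistency across NumPy versions
    sums = np.zeros((len(uniq), points.shape[1]))
    np.add.at(sums, inverse, points)  # scatter-add each point into its voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]


# Three points with one extra feature column; the first two share a voxel.
pts = np.array([
    [0.1, 0.1, 0.1, 1.0],
    [0.2, 0.2, 0.2, 3.0],
    [1.5, 0.1, 0.1, 5.0],
])
voxel_coords, voxel_feats = dynamic_voxel_mean(pts, 1.0, np.zeros(3))
print(voxel_coords)
print(voxel_feats)
```

No points are dropped and no padding is allocated, so the work scales with the number of points rather than with a preset voxel capacity.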

Why it matters

Enhances perception safety and deployment feasibility for autonomous vehicles by delivering a more accurate, computationally efficient multi-sensor fusion pipeline.

Abstract

Multimodal fusion has important research value in environmental perception for autonomous driving. BEVFusion has become one of the mainstream frameworks for LiDAR-camera fusion by unifying multimodal features in the bird’s-eye view (BEV) space. However, its performance is limited by inefficient cross-modal interaction and information loss during BEV projection, especially for dynamic objects and edge cases. To address these limitations, we propose AttBEV, a fusion architecture that introduces a Convolutional Block Attention Module (CBAM) at the feature fusion layer. CBAM is a lightweight attention mechanism that improves the model’s ability to capture key information by dynamically recalibrating features along the channel and spatial dimensions. Extensive experiments on the nuScenes dataset show that AttBEV outperforms BEVFusion on most evaluation metrics: NDS reaches 0.6795, which is 2.63 percentage points above BEVFusion’s 0.6532, and mAP reaches 0.6426, which is 1.79 percentage points above BEVFusion’s 0.6247. Overall, AttBEV outperforms existing methods in both accuracy and generalization ability and significantly improves 3D object detection in autonomous driving scenarios.
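The reported margins are absolute differences between scores on a 0 to 1 scale, not relative improvements. A quick check using only the four scores quoted in the abstract makes the distinction concrete; the helper name `gains` is illustrative.

```python
def gains(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    diff = improved - baseline
    return diff * 100, diff / baseline * 100


nds_abs, nds_rel = gains(0.6532, 0.6795)
map_abs, map_rel = gains(0.6247, 0.6426)

print(f"NDS: +{nds_abs:.2f} points, +{nds_rel:.2f}% relative")
print(f"mAP: +{map_abs:.2f} points, +{map_rel:.2f}% relative")
```

The absolute gains match the figures quoted above (2.63 and 1.79), while the relative gains over BEVFusion are larger, roughly 4.03% for NDS and 2.87% for mAP.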

Index terms

Computer Vision for Transportation, Sensor Fusion, Intelligent Transportation Systems