LSSAttn: Towards Dense and Accurate View Transformation for Multi-Modal 3D Object Detection
Qi Jiang, Hao Sun
Abstract
Fusing camera and LiDAR information in a unified BEV representation is an elegant paradigm for 3D detection tasks. Current multi-modal fusion methods in BEV can be categorized as LSS-based or Transformer-based according to their view transformation. The former relies on inaccurate depth prediction and massive numbers of pseudo points for the perspective-to-BEV transformation, while the latter fetches only sparse image features into the BEV representation. To overcome these shortcomings, we propose an optimized view transformation that can be easily integrated into LSS-based methods. The proposed module capitalizes on the LSS mechanism to establish dense associations between perspective pixels and BEV grids, and uses the attention mechanism to compute a similarity score for each associated pair during feature aggregation. Starting from the BEVFusion baseline, we further introduce (1) cross-attention within the associated subsets to transfer image features into the BEV, and (2) a multi-scale feature fusion mechanism for LSS-based view transformation. Extensive experiments on nuScenes validate the effectiveness and efficiency of the proposed module, which improves mAP by 1.3% over the baseline model.
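The core idea of attention-weighted aggregation over LSS-style associations can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `attention_splat`, the use of one learned query per BEV cell, and the scaled dot-product score are all assumptions made for illustration. Each input row stands for one associated (pixel, BEV cell) pair, as would be produced by lifting a pixel along its depth bins and projecting it onto the BEV grid.

```python
import numpy as np

def attention_splat(pair_feats, cell_ids, bev_queries):
    """Aggregate image features into BEV cells with per-cell softmax attention.

    pair_feats:  (N, C) image feature of each associated (pixel, cell) pair,
                 used here as both key and value (an assumption).
    cell_ids:    (N,) index of the BEV cell for each pair.
    bev_queries: (M, C) one hypothetical query vector per BEV cell.
    Returns (M, C) BEV features; cells without any associated pair stay zero.
    """
    n, c = pair_feats.shape
    m = bev_queries.shape[0]

    # Similarity score for every associated pair (scaled dot product).
    scores = np.einsum("nc,nc->n", pair_feats, bev_queries[cell_ids]) / np.sqrt(c)

    # Numerically stable softmax restricted to the pairs that share a cell.
    cell_max = np.full(m, -np.inf)
    np.maximum.at(cell_max, cell_ids, scores)
    exp = np.exp(scores - cell_max[cell_ids])
    denom = np.zeros(m)
    np.add.at(denom, cell_ids, exp)
    weights = exp / denom[cell_ids]

    # Weighted sum of features inside each cell.
    bev = np.zeros((m, c))
    np.add.at(bev, cell_ids, weights[:, None] * pair_feats)
    return bev
```

In this sketch, plain LSS sum-pooling would correspond to replacing `weights` with ones (or with predicted depth probabilities); the attention weights instead let each BEV cell decide how much each associated pixel contributes.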