AYDIV: Adaptable Yielding 3D Object Detection Via Integrated Contextual Vision Transformer
Tanmoy Dam, Sanjay Bhargav Dharavath, Sameer Alam, Nimrod Lilith, Supriyo Chakraborty, Mir Feroskhan
Abstract
Combining LiDAR and camera data has shown potential for enhancing short-distance object detection in autonomous driving systems. Fusion remains difficult at longer distances, however, because LiDAR data is sparse while camera images are dense and high-resolution. Discrepancies between the two data representations further complicate fusion methods. We introduce AYDIV, a novel framework that integrates a tri-phase alignment process specifically designed to enhance long-distance detection even amidst data discrepancies. AYDIV consists of the Global Contextual Fusion Alignment Transformer (GCFAT), which improves the extraction of camera features and provides a deeper understanding of large-scale patterns; the Sparse Fused Feature Attention (SFFA), which fine-tunes the fusion of LiDAR and camera details; and the Volumetric Grid Attention (VGA), which performs comprehensive spatial data fusion. AYDIV improves mAPH (L2 difficulty) by 1.24% on the Waymo Open Dataset (WOD) and AP by 7.40% on the Argoverse2 dataset, demonstrating its efficacy in comparison with existing fusion-based methods. Our code is publicly available at https://github.com/sanjay-810/AYDIV2