SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection
Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Felix Fent, Gerhard Rigoll
AI summary
Problem
Dense Bird's-Eye-View methods waste computation on empty grid cells when processing sparse radar data, while query-based detectors suffer from false positives and poor localization due to implicit depth modeling. Bridging the view disparity between dense camera features and sparse radar points remains a key challenge for efficient, robust 3D perception.
Approach
The method processes radar points and camera features directly without dense grids, using Sparse Frustum Fusion to align cross-modal features in perspective space. It refines object queries with distance-weighted Range-Adaptive Radar Aggregation and uses Local Self-Attention so that each query aggregates information only from spatially relevant neighbors.
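To make the idea of distance-weighted radar aggregation concrete, here is a minimal NumPy sketch. It is not the paper's formulation: the function name `range_adaptive_aggregate`, the Gaussian kernel, and the parameters `base_sigma` and `range_scale` are all illustrative assumptions. The sketch only shows how radar features near a query can be weighted by distance, with a neighborhood that widens for far-away queries.

```python
import numpy as np


def range_adaptive_aggregate(query_xyz, radar_xyz, radar_feat,
                             base_sigma=1.0, range_scale=0.05):
    """Aggregate radar features around each query with distance-based weights.

    Hypothetical sketch: the Gaussian kernel and the linear growth of its
    bandwidth with query range are assumptions, not the paper's method.

    query_xyz:  (Q, 3) query centers in the ego frame
    radar_xyz:  (P, 3) radar point positions in the ego frame
    radar_feat: (P, C) per-point radar features
    returns:    (Q, C) aggregated radar feature per query
    """
    # Pairwise Euclidean distance between every query and every radar point.
    diff = query_xyz[:, None, :] - radar_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                      # (Q, P)

    # Assumed range adaptivity: the kernel bandwidth grows with the
    # distance of the query from the sensor origin.
    query_range = np.linalg.norm(query_xyz, axis=-1, keepdims=True)  # (Q, 1)
    sigma = base_sigma + range_scale * query_range

    # Numerically stable softmax over radar points for each query.
    logits = -(dist ** 2) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    return weights @ radar_feat
```

Because the weights for each query sum to one, the output stays on the same scale as the input radar features; a learned model would typically replace the fixed kernel with learned weights.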
Key results
- State-of-the-art 3D detection on nuScenes (67.1 NDS, 60.0 mAP)
- Best tracking performance on nuScenes (63.1 AMOTA)
- Real-time inference speed on consumer-grade GPUs
- Strong long-range and adverse-weather generalization on TruckScenes
Why it matters
Enables affordable, real-time, and robust 3D perception for autonomous driving by eliminating the computational bottlenecks of dense grid-based fusion while maintaining high accuracy and localization precision.
Abstract
In this work, we present SpaRC, a novel sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird’s Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance of 67.1 NDS and 63.1 AMOTA. The code is available at https://phi-wol.github.io/sparc/.
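The local self-attention (LSA) contribution named in the abstract can be illustrated with a small NumPy sketch in which each object query attends only to queries within a fixed radius. This is an assumption-laden illustration rather than the released implementation: the function name `local_self_attention`, the hard `radius` threshold, and the omission of learned query/key/value projections and multiple heads are simplifications chosen here.

```python
import numpy as np


def local_self_attention(query_feat, query_xyz, radius=5.0):
    """Self-attention among object queries, masked to spatial neighbors.

    Hypothetical sketch: a hard distance threshold and unprojected features
    are assumptions; a real layer would use learned projections.

    query_feat: (Q, C) query embeddings
    query_xyz:  (Q, 3) query centers
    returns:    (Q, C) updated query embeddings
    """
    channels = query_feat.shape[-1]

    # Scaled dot-product similarity between all pairs of queries.
    scores = query_feat @ query_feat.T / np.sqrt(channels)    # (Q, Q)

    # Keep only pairs whose centers lie within `radius` of each other.
    # The diagonal is always kept because each query is at distance 0
    # from itself, so every row has at least one finite score.
    dist = np.linalg.norm(
        query_xyz[:, None, :] - query_xyz[None, :, :], axis=-1)
    scores = np.where(dist <= radius, scores, -np.inf)

    # Numerically stable softmax; masked entries get exactly zero weight.
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)

    return attn @ query_feat
```

With this mask, an isolated query with no neighbors inside the radius attends only to itself and passes through unchanged, while clustered queries exchange information.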