ICRA 2026

SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection

Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Felix Fent, Gerhard Rigoll

AI summary

SpaRC achieves state-of-the-art 3D object detection and tracking with real-time speed by replacing inefficient dense grids with a sparse, object-centric radar-camera fusion transformer.

radar-camera fusion, 3D object detection, sparse transformer, autonomous driving, real-time perception, multi-modal fusion

Problem

Dense Bird's-Eye-View methods waste computation on empty grid cells when processing sparse radar data, while query-based detectors suffer from false positives and poor localization due to implicit depth modeling. Bridging the view disparity between dense camera features and sparse radar points remains a key challenge for efficient, robust 3D perception.
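To make the "empty grid cells" point concrete, the following minimal sketch counts how many cells of a dense BEV grid actually contain a radar return. The function name `occupied_fraction`, the grid half-width `extent`, and the cell size `cell` are hypothetical values chosen for illustration, not settings taken from the paper.

```python
def occupied_fraction(points, extent=50.0, cell=0.5):
    """Fraction of BEV grid cells that contain at least one radar point.

    points : iterable of (x, y) positions in metres, ego-centred.
    extent : half-width of the square grid in metres (illustrative assumption).
    cell   : cell edge length in metres (illustrative assumption).
    """
    cells_per_side = int(round(2 * extent / cell))
    occupied = set()
    for x, y in points:
        # Ignore returns that fall outside the grid.
        if -extent <= x < extent and -extent <= y < extent:
            col = int((x + extent) // cell)
            row = int((y + extent) // cell)
            occupied.add((row, col))
    return len(occupied) / (cells_per_side * cells_per_side)


# Four returns, two of which share a cell and one of which is out of range.
radar_xy = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0), (100.0, 0.0)]
print(occupied_fraction(radar_xy))
```

With a few hundred radar returns per sweep and tens of thousands of cells, most of a dense grid stays empty, which is the overhead a sparse, point-based representation avoids.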

Approach

The method processes radar points and camera features directly, without dense grids. Sparse Frustum Fusion (SFF) aligns cross-modal features in perspective space. Object queries are then refined with distance-weighted Range-Adaptive Radar Aggregation (RAR), and Local Self-Attention (LSA) restricts each query to attending only to spatially relevant neighbors.
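The two query-level ideas, distance-weighted radar aggregation and locally restricted attention, can be sketched roughly as follows. This is not the authors' implementation: the Gaussian weighting, the linear growth of the kernel width with range, and the names `aggregate_radar`, `local_attention_mask`, `base_sigma`, `range_gain`, and `radius` are all assumptions made for illustration.

```python
import math


def aggregate_radar(query_xy, radar_pts, radar_feats,
                    base_sigma=1.0, range_gain=0.05):
    """Distance-weighted average of radar features around one object query.

    Assumption: Gaussian weights whose width grows linearly with the
    query's range from the ego vehicle, so distant queries pool radar
    points from a wider neighbourhood.
    """
    qx, qy = query_xy
    sigma = base_sigma + range_gain * math.hypot(qx, qy)
    weights = [
        math.exp(-((px - qx) ** 2 + (py - qy) ** 2) / (2.0 * sigma ** 2))
        for px, py in radar_pts
    ]
    total = sum(weights)
    if not radar_feats:
        return []
    dim = len(radar_feats[0])
    if total == 0.0:
        # No radar evidence near this query: return a zero feature.
        return [0.0] * dim
    return [
        sum(w * f[d] for w, f in zip(weights, radar_feats)) / total
        for d in range(dim)
    ]


def local_attention_mask(queries_xy, radius=5.0):
    """mask[i][j] is True if query i may attend to query j.

    Assumption: locality is a fixed Euclidean radius in the ground plane.
    """
    return [
        [math.hypot(ax - bx, ay - by) <= radius for bx, by in queries_xy]
        for ax, ay in queries_xy
    ]


# A query at 10 m with one nearby and one distant radar point.
feat = aggregate_radar((10.0, 0.0), [(10.0, 0.0), (40.0, 0.0)], [[1.0], [0.0]])
mask = local_attention_mask([(0.0, 0.0), (3.0, 4.0), (30.0, 0.0)])
print(feat, mask)
```

In a transformer decoder, a mask like this would typically be applied to the query self-attention logits so that distant queries do not exchange information. The sketch only shows how such a mask could be constructed.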

Key results

State-of-the-art 3D detection on nuScenes (67.1 NDS, 60.0 mAP)
Best tracking performance on nuScenes (63.1 AMOTA)
Real-time inference speed on consumer-grade GPUs
Strong long-range and adverse-weather generalization on TruckScenes
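For readers unfamiliar with the metric, NDS (nuScenes detection score) combines mAP with five mean true-positive error terms (translation, scale, orientation, velocity, attribute). The sketch below uses the standard benchmark definition. The helper `implied_mean_tp_score` is a hypothetical name that simply inverts the formula to show what the reported NDS and mAP imply on average for the true-positive terms; the individual error values are not given in this summary.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors.

    Each error is clipped to 1 and converted to a score (1 - error);
    mAP carries half of the total weight.
    """
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


def implied_mean_tp_score(nds_value, m_ap):
    """Average TP score (1 - error) implied by a given NDS and mAP."""
    return (10.0 * nds_value - 5.0 * m_ap) / 5.0


# Reported values: 67.1 NDS and 60.0 mAP, expressed as fractions.
print(implied_mean_tp_score(0.671, 0.600))
```

Because mAP is weighted equally with the sum of the five true-positive scores, an NDS well above the mAP indicates that localization, size, orientation, velocity, and attribute errors are low on average.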

Why it matters

Enables affordable, real-time, and robust 3D perception for autonomous driving by eliminating the computational bottlenecks of dense grid-based fusion while maintaining high accuracy and localization precision.

Abstract

In this work, we present SpaRC, a novel sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird’s Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance of 67.1 NDS and 63.1 AMOTA. The code is available at https://phi-wol.github.io/sparc/.

Index terms

Sensor Fusion, Deep Learning for Visual Perception