Class-Aware Queries for Robust Multi-View 3D Object Detection
Chaeyeon Sung, Sungmin Woo, Sangyoun Lee
AI summary
Problem
Query-based 3D detectors typically use shared learnable queries to jointly predict object categories and locations, creating representational conflicts that limit optimization. Furthermore, classification accuracy acts as a critical performance bottleneck that prior methods overlook by treating queries as class-agnostic.
Approach
A multi-view classifier first predicts which object classes are present in a scene, and these predictions are converted into embeddings that guide query initialization. This class-aware guidance is injected into the transformer decoder before localization begins, allowing semantic and geometric objectives to remain complementary without entanglement.
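The query-initialization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class list, embedding size, presence threshold, and round-robin assignment of queries to classes are all assumptions made for illustration, and random values stand in for learnable embeddings.

```python
import random

CLASSES = ["car", "truck", "pedestrian", "bicycle"]  # hypothetical subset of class names
EMBED_DIM = 8  # illustrative embedding size, not taken from the paper


def make_table(n_rows, dim, rng):
    """Stand-in for a learnable embedding table (random values here)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_rows)]


def class_aware_queries(scene_probs, base_queries, class_table, threshold=0.5):
    """Add a class embedding to each base query.

    Classes whose scene-level probability exceeds `threshold` are treated as
    present; queries are assigned to present classes round-robin (an assumed
    allocation rule). Returns (queries, assigned_class_ids).
    """
    present = [i for i, p in enumerate(scene_probs) if p > threshold]
    if not present:
        # No class predicted as present: fall back to class-agnostic queries.
        return [q[:] for q in base_queries], [None] * len(base_queries)
    queries, assigned = [], []
    for k, q in enumerate(base_queries):
        c = present[k % len(present)]
        queries.append([a + b for a, b in zip(q, class_table[c])])
        assigned.append(c)
    return queries, assigned


rng = random.Random(0)
base = make_table(6, EMBED_DIM, rng)             # six base queries
table = make_table(len(CLASSES), EMBED_DIM, rng)  # one embedding per class
scene_probs = [0.9, 0.1, 0.7, 0.2]                # hypothetical classifier output
queries, assigned = class_aware_queries(scene_probs, base, table)
print(assigned)  # [0, 2, 0, 2, 0, 2]
```

In this sketch only "car" and "pedestrian" pass the threshold, so every query starts from a prior tied to one of those two classes rather than from a uniform initialization.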
Key results
- Improves mAP by 2.7 points and NDS by 1.5 points over a strong DETR-based baseline on the nuScenes benchmark
- Identifies classification accuracy as a primary bottleneck in DETR-style detectors via oracle analysis
- Introduces a two-stage training schedule transitioning from ground-truth to predicted class labels
- Proposes noised box-level guidance to enhance robustness to classification noise and stabilize geometric learning
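The last two items can be pictured with a small sketch. The summary gives no details of either mechanism, so everything here is an assumption: the switch epoch, the `(x, y, z, w, l, h)` box layout, the uniform noise model, and the function names are hypothetical and chosen only to make the idea concrete.

```python
import random


def guidance_labels(epoch, gt_labels, pred_labels, switch_epoch=12):
    """Pick the class labels used for guidance at a given epoch.

    Stage 1 (epoch < switch_epoch) uses ground-truth labels; stage 2 uses the
    classifier's predictions. `switch_epoch` is a hypothetical value.
    """
    return gt_labels if epoch < switch_epoch else pred_labels


def noise_box(box, rng, center_scale=0.4, size_scale=0.1):
    """Jitter an (x, y, z, w, l, h) box: shift the center, rescale the size.

    Uniform noise is an assumed choice; the scales are illustrative.
    """
    x, y, z, w, l, h = box
    dx, dy, dz = (rng.uniform(-center_scale, center_scale) for _ in range(3))
    sw, sl, sh = (1.0 + rng.uniform(-size_scale, size_scale) for _ in range(3))
    return (x + dx, y + dy, z + dz, w * sw, l * sl, h * sh)


rng = random.Random(0)
gt, pred = ["car", "pedestrian"], ["car", "truck"]
print(guidance_labels(3, gt, pred))   # ['car', 'pedestrian']
print(guidance_labels(20, gt, pred))  # ['car', 'truck']
noisy = noise_box((10.0, 2.0, 0.5, 1.9, 4.5, 1.6), rng)
```

Training first on ground-truth labels and only later on predictions lets the localization branch learn from clean semantics before it has to tolerate classifier errors; the box jitter plays a similar role on the geometric side.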
Why it matters
Provides a scalable, vision-only solution for more accurate and robust camera-based 3D perception in autonomous driving by explicitly decoupling semantic and geometric guidance.
Abstract
Query-based multi-view 3D object detectors typically rely on a fixed set of learnable queries that jointly predict object categories and locations. However, encoding both semantic and geometric information within a shared query embedding leads to representational conflicts, limiting optimization. While prior works decouple prediction heads to partially address this issue, such decoupling often treats classification and localization as independent tasks, leaving the queries themselves class-agnostic and unaware of the scene’s semantic context. In this paper, we present the first 3D object detection framework that constructs class-aware queries using scene-level object class predictions. Specifically, a multi-view image classifier first estimates which object classes are present in the scene, and these predictions are used to generate semantically guided queries for 3D localization within the transformer decoder. This allows our model to initialize each query with class-specific priors, in contrast to conventional uniform query initialization. As a result, queries attend more effectively to relevant regions and objects throughout decoding. Experiments on the nuScenes benchmark show that our method improves mAP by 2.7 points and NDS by 1.5 points over a strong DETR-based baseline. An oracle study further reveals that classification accuracy is a key bottleneck in existing DETR-style detectors, highlighting the benefit of early semantic guidance. The code is publicly available at https://github.com/ssungchae/CaQ3D.