ICRA 2026

Class-Aware Queries for Robust Multi-View 3D Object Detection

Chaeyeon Sung, Sungmin Woo, Sangyoun Lee

AI summary

Injecting scene-level class predictions into detection queries before decoding reduces semantic-geometric conflicts and improves 3D object detection accuracy on nuScenes by 2.7 mAP and 1.5 NDS over the baseline.

Multi-view 3D detection, Class-aware queries, Query-based detection, Semantic-geometric decoupling, Autonomous driving, Transformer detectors

Problem

Query-based 3D detectors typically use shared learnable queries to jointly predict object categories and locations, creating representational conflicts that limit optimization. Furthermore, classification accuracy acts as a critical performance bottleneck that prior methods overlook by treating queries as class-agnostic.

Approach

A multi-view classifier first predicts which object classes are present in a scene, and these predictions are converted into embeddings that guide query initialization. This class-aware guidance is injected into the transformer decoder before localization begins, allowing semantic and geometric objectives to remain complementary without entanglement.
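The query-initialization step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `class_aware_queries`, the per-query class assignment `query_class`, and the additive, presence-weighted form of the prior are all assumptions made for illustration. The summary only states that scene-level class predictions are turned into embeddings that guide the queries.

```python
import numpy as np

def class_aware_queries(presence_logits, class_embed, base_queries, query_class):
    """Add a class-specific prior to each learnable query (illustrative only).

    presence_logits: (C,)   scene-level multi-label logits from a classifier.
    class_embed:     (C, D) one embedding per class (hypothetical table).
    base_queries:    (N, D) conventional learnable queries.
    query_class:     (N,)   class index assigned to each query (assumed).
    """
    # Sigmoid rather than softmax: several classes can be present at once.
    presence = 1.0 / (1.0 + np.exp(-presence_logits))            # (C,)
    # Scale each query's class embedding by how likely that class is present.
    prior = presence[query_class][:, None] * class_embed[query_class]  # (N, D)
    # The prior is added before the queries enter the transformer decoder.
    return base_queries + prior

rng = np.random.default_rng(0)
num_classes, dim, num_queries = 10, 8, 4   # nuScenes detection uses 10 classes
logits = rng.normal(size=num_classes)
table = rng.normal(size=(num_classes, dim))
queries = np.zeros((num_queries, dim))
assignment = np.array([0, 0, 3, 7])        # hypothetical query-to-class mapping
guided = class_aware_queries(logits, table, queries, assignment)
```

With zero base queries, each row of `guided` is just the presence-weighted embedding of its assigned class, so queries tied to classes the classifier considers absent start close to the uniform initialization.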

Key results

Improves mAP by 2.7 points and NDS by 1.5 points over a strong DETR-based baseline on nuScenes
Identifies classification accuracy as a primary bottleneck in DETR-style detectors via oracle analysis
Introduces a two-stage training schedule transitioning from ground-truth to predicted class labels
Proposes noised box-level guidance to enhance robustness to classification noise and stabilize geometric learning
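The last two items can be sketched roughly as follows. The summary does not specify the switch point of the schedule or the noise model, so `switch_epoch`, the uniform jitter, and the `(x, y, z, w, l, h)` box layout below are assumptions chosen only to make the idea concrete, and the helper names are hypothetical.

```python
import numpy as np

def select_class_labels(epoch, switch_epoch, gt_labels, pred_labels):
    """Stage 1 guides queries with ground-truth classes, stage 2 with predictions."""
    return gt_labels if epoch < switch_epoch else pred_labels

def noised_boxes(gt_boxes, rng, center_scale=0.5, size_scale=0.1):
    """Jitter (x, y, z, w, l, h) boxes so guidance is not perfectly clean.

    center_scale: maximum center shift in meters (assumed value).
    size_scale:   maximum relative change of each box dimension (assumed value).
    """
    boxes = np.array(gt_boxes, dtype=float)       # copy, leaves the input intact
    n = boxes.shape[0]
    # Shift centers by a bounded uniform offset.
    boxes[:, 0:3] += rng.uniform(-center_scale, center_scale, size=(n, 3))
    # Rescale dimensions multiplicatively; they stay positive if size_scale < 1.
    boxes[:, 3:6] *= 1.0 + rng.uniform(-size_scale, size_scale, size=(n, 3))
    return boxes

rng = np.random.default_rng(0)
gt = np.array([[10.0, 2.0, 0.5, 1.9, 4.5, 1.6],
               [-5.0, 8.0, 0.3, 0.7, 0.8, 1.7]])
noisy = noised_boxes(gt, rng)
stage1 = select_class_labels(epoch=2, switch_epoch=6, gt_labels="gt", pred_labels="pred")
stage2 = select_class_labels(epoch=6, switch_epoch=6, gt_labels="gt", pred_labels="pred")
```

Starting from ground-truth labels lets the decoder learn to use class guidance before it has to cope with classifier errors; the box jitter plays a similar role on the geometric side.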

Why it matters

Provides a scalable, vision-only solution for more accurate and robust camera-based 3D perception in autonomous driving by explicitly decoupling semantic and geometric guidance.

Abstract

Query-based multi-view 3D object detectors typically rely on a fixed set of learnable queries that jointly predict object categories and locations. However, encoding both semantic and geometric information within a shared query embedding leads to representational conflicts, limiting optimization. While prior works decouple prediction heads to partially address this issue, such decoupling often treats classification and localization as independent tasks, leaving the queries themselves class-agnostic and unaware of the scene’s semantic context. In this paper, we present the first 3D object detection framework that constructs class-aware queries using scene-level object class predictions. Specifically, a multi-view image classifier first estimates which object classes are present in the scene, and these predictions are used to generate semantically guided queries for 3D localization within the transformer decoder. This allows our model to initialize each query with class-specific priors, in contrast to conventional uniform query initialization. As a result, queries attend more effectively to relevant regions and objects throughout decoding. Experiments on the nuScenes benchmark show that our method improves mAP by 2.7 points and NDS by 1.5 points over a strong DETR-based baseline. An oracle study further reveals that classification accuracy is a key bottleneck in existing DETR-style detectors, highlighting the benefit of early semantic guidance. The code is publicly available at https://github.com/ssungchae/CaQ3D.

Index terms

Object Detection, Segmentation and Categorization; Deep Learning for Visual Perception; Autonomous Agents