ICRA 2026

You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-Level 9D Multi-Object Pose Estimation

Hakjin Lee, Junghoon Seo, Jaehoon Sim

AI summary

YOPO achieves state-of-the-art monocular category-level 9D pose estimation using only raw RGB images and a single-stage transformer detector, eliminating the need for CAD models, segmentation masks, or pseudo-depth.

Monocular pose estimation, category-level 9D pose, transformer detector, RGB-only, end-to-end learning, robotic perception

Problem

Monocular RGB category-level 9D pose estimation typically relies on complex, multi-stage pipelines and auxiliary data like 3D CAD models, instance segmentation masks, or pseudo-depth maps, which hinders end-to-end training and generalization.

Approach

YOPO extends a query-based detection transformer with a lightweight pose head and bounding-box-conditioned regression modules to jointly predict 2D detection and 9D pose in a single forward pass, trained end-to-end solely on RGB images and 9D pose labels.
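
The summary does not spell out how the bounding-box-conditioned regression works. As a minimal sketch of one common formulation (an assumption, not necessarily YOPO's design), the network predicts where the 3D object center projects relative to the 2D box, plus a depth, and the translation is recovered by pinhole back-projection. The function name, the offset parameterization, and the intrinsics below are all hypothetical.

```python
def backproject_center(box, offset, depth, fx, fy, cx, cy):
    """Recover a 3D translation from a 2D box, a predicted offset, and a depth.

    box    : (x_min, y_min, x_max, y_max) in pixels
    offset : (dx, dy), projected 3D center relative to the box center,
             in units of box width and height (hypothetical parameterization)
    depth  : predicted distance along the camera z-axis, in meters
    fx, fy, cx, cy : pinhole camera intrinsics, in pixels
    """
    x_min, y_min, x_max, y_max = box
    width, height = x_max - x_min, y_max - y_min

    # Pixel location of the projected 3D center, conditioned on the box.
    u = (x_min + x_max) / 2.0 + offset[0] * width
    v = (y_min + y_max) / 2.0 + offset[1] * height

    # Standard pinhole back-projection to camera coordinates.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Example: a box centered on the principal point, zero offset, 2 m away.
translation = backproject_center(
    box=(300, 200, 340, 280), offset=(0.0, 0.0), depth=2.0,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
```

Conditioning on the box means the regression targets are small, normalized quantities rather than absolute pixel coordinates, which is the usual motivation for this kind of parameterization.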

Key results

Sets a new state of the art on the REAL275, CAMERA25, and HouseCat6D benchmarks.
Achieves 79.6% IoU50 and 54.1% under the 10°/10cm metric on REAL275.
Surpasses all prior RGB-only methods while closing much of the gap to RGB-D systems.
Eliminates dependency on CAD models, segmentation masks, and pseudo-depth maps.
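
To make the 10°/10cm criterion concrete, the sketch below checks a single prediction against its ground truth: the rotation error is the geodesic angle between the two rotation matrices, and the translation error is the Euclidean distance in centimeters. It is an illustration only. It assumes translations in meters, ignores the special handling that benchmark protocols typically apply to symmetric objects, and does not show how per-object checks are aggregated into the reported percentages. All function names are hypothetical.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimeters, assuming inputs in meters."""
    return 100.0 * math.dist(t_pred, t_gt)


def within_threshold(R_pred, t_pred, R_gt, t_gt, max_deg=10.0, max_cm=10.0):
    """True if both the rotation and translation errors are within bounds."""
    return (rotation_error_deg(R_pred, R_gt) <= max_deg
            and translation_error_cm(t_pred, t_gt) <= max_cm)


def rot_z(deg):
    """Rotation about the z-axis, used here only to build test inputs."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


# A prediction that is 5 degrees and 5 cm off passes the 10°/10cm check.
ok = within_threshold(rot_z(5.0), (0.05, 0.0, 1.0), rot_z(0.0), (0.0, 0.0, 1.0))
```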

Why it matters

Provides a simpler, more scalable, and cost-effective solution for robotic manipulation and automation by enabling high-accuracy 3D pose estimation from standard RGB cameras without auxiliary geometric priors.

Abstract

Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can be achieved with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% IoU50 and 54.1% under the 10°10cm metric, surpassing all prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project page.
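
The abstract names a 6D-aware Hungarian matching cost but gives no formula. The sketch below illustrates the general idea under stated assumptions: a detection-transformer matching cost (class probability plus box L1 distance) is extended with rotation and translation terms, so that predictions with similar 2D boxes can still be told apart by pose. The cost terms, the weights, and the dictionary layout are hypothetical. For brevity, a brute-force search over assignments stands in for the Hungarian algorithm; it returns the same minimum-cost matching but is only practical for a handful of objects.

```python
import itertools
import math


def rotation_angle(R_a, R_b):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def pair_cost(pred, gt, w_cls=1.0, w_box=5.0, w_rot=1.0, w_trans=1.0):
    """Cost of assigning one prediction to one ground-truth object.

    All weights are placeholder values chosen for illustration.
    """
    cls_cost = -pred["probs"][gt["label"]]
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    rot_cost = rotation_angle(pred["R"], gt["R"])
    trans_cost = math.dist(pred["t"], gt["t"])
    return (w_cls * cls_cost + w_box * box_cost
            + w_rot * rot_cost + w_trans * trans_cost)


def match(preds, gts):
    """Return a tuple m where m[j] is the prediction index matched to gts[j].

    Brute force over all assignments; a real implementation would use the
    Hungarian algorithm on the same cost matrix.
    """
    best, best_cost = None, math.inf
    for perm in itertools.permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
BOX = (0.5, 0.5, 0.2, 0.2)  # identical 2D boxes: 2D cues alone are ambiguous

preds = [
    {"probs": [0.5, 0.5], "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 2.0)},
    {"probs": [0.5, 0.5], "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 1.0)},
]
gts = [
    {"label": 0, "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 1.0)},
    {"label": 1, "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 2.0)},
]
assignment = match(preds, gts)
```

In this example the class and box terms are identical for every pairing, so the assignment is decided entirely by the translation term.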

Index terms

Object Detection, Segmentation and Categorization; Deep Learning for Visual Perception; Visual Learning