ICRA 2026

InstantPose: Zero-Shot Instance-Level 6D Pose Estimation from a Single View

Francesco Di Felice, Alberto Remus, Stefano Gasperini, Benjamin Busam, Lionel Ott, Stefan Thalhammer, Federico Tombari, Carlo Alberto Avizzano

AI summary

InstantPose accurately estimates the 6D pose of unseen objects using only a single unposed RGB reference image, bypassing the need for 3D CAD models or multiple views.

Zero-shot pose estimation, 6D pose estimation, Large Reconstruction Models, RGB-D perception, Robotic grasping, Single-view reconstruction

Problem

Current instance-level pose estimation methods depend on costly 3D CAD models or multiple posed reference images, which restricts their use in real-world robotic applications involving novel objects.

Approach

The method feeds a single RGB reference into a Large Reconstruction Model to generate a coarse 3D mesh, then aligns it to a query RGB-D view using semantic feature matching and refines the pose through an online optimization process that corrects for geometric inaccuracies.
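The alignment step can be illustrated with a minimal sketch. It assumes that feature matching has already produced paired 3D points: `src` on the reconstructed mesh and `dst` back-projected from the query RGB-D view. It then fits a similarity transform (scale, rotation, translation) in closed form with the Umeyama method. The function name, the inclusion of a scale factor, and the choice of solver are assumptions made for illustration; this summary does not state which solver InstantPose uses, and the online refinement stage is not shown.

```python
import numpy as np


def umeyama_alignment(src, dst, with_scale=True):
    """Least-squares similarity transform mapping src onto dst.

    src, dst: (N, 3) arrays of corresponding 3D points (hypothetical inputs,
    e.g. mesh points and their matched back-projected depth points).
    Returns (scale, R, t) such that dst ~= scale * R @ src_i + t.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Centre both point sets on their centroids.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centred sets, then its SVD.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)

    # Guard against a reflection so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    # Optimal isotropic scale; fixed to 1 for a purely rigid fit.
    if with_scale:
        var_src = (src_c ** 2).sum() / n
        scale = float(np.trace(np.diag(S) @ D) / var_src)
    else:
        scale = 1.0

    t = mu_dst - scale * R @ mu_src
    return scale, R, t


# Synthetic check: transform random points with a known pose and recover it.
rng = np.random.default_rng(0)
mesh_pts = rng.normal(size=(100, 3))
angle = 0.5
R_gt = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                 [np.sin(angle), np.cos(angle), 0.0],
                 [0.0, 0.0, 1.0]])
t_gt = np.array([0.2, -0.1, 0.8])
scene_pts = 0.3 * mesh_pts @ R_gt.T + t_gt
scale_est, R_est, t_est = umeyama_alignment(mesh_pts, scene_pts)
```

A scale term is included here because a mesh reconstructed from a single RGB image generally has no metric scale, whereas the depth channel of the query does. With noisy or partly wrong matches, a closed-form fit like this would typically be wrapped in a robust scheme such as RANSAC before any refinement.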

Key results

Achieves strong accuracy on the YCB-V dataset relative to methods designed to rely on a geometrically perfect 3D object model
Enables successful robotic grasping from single-view pose estimates
Eliminates dependency on posed reference images or manual 3D scans
Provides a training-free pipeline for zero-shot instance-level pose estimation

Why it matters

It offers a practical, real-time solution for robotic manipulation in unstructured environments where obtaining 3D object models is infeasible.

Abstract

Object pose estimation using visual data is crucial for robotic interaction with the environment. Many existing instance-level methods are restricted by their requirements for 3D CAD models or multiple object views, which limits their flexibility and generalizability. Overcoming this limitation is critical to enhance the adaptability of pose estimation systems. In this work, a novel pipeline that leverages recent advances in reconstruction techniques is presented to address these challenges. To this end, Large Reconstruction Models (LRM) represent an advanced neural architecture capable of generating 3D object models from a limited set of views. Nevertheless, the resulting 3D models often lack relevant geometric and texture details due to insufficient input information. This research presents InstantPose, an innovative zero-shot instance-level pose estimation method that, building upon LRM, can determine the pose of unseen objects using as little as a single unposed RGB reference and RGB-D query images. Extensive experiments demonstrate that InstantPose achieves remarkable performance in object pose estimation on the YCB-V dataset, compared to methods conceived to rely on a geometrically perfect object’s model. Furthermore, the 6D pose provided through the presented approach facilitates successful object grasping, highlighting its practical utility in robotic manipulation tasks.

Index terms

RGB-D Perception; Deep Learning for Visual Perception; Deep Learning Methods