ICRA 2026

TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction

Yiyao Huang, Zhedong Zheng, Ziwei Yu, Yaxiong Wang, Tze Ho Elden Tse, Angela Yao

AI summary

Text-driven 3D priors combined with vision-guided refinement enable accurate, template-free reconstruction of hand-object interactions, even under heavy occlusion.

Template-free reconstruction, Text-driven 3D priors, Hand-object interaction, 2D-3D attention, Occlusion robustness, Monocular 3D reconstruction

Problem

Pre-defined 3D object templates are hard to obtain and restrict adaptability in unconstrained scenarios, while existing template-free methods struggle with self-occlusion and fail to complete object geometry.

Approach

The framework generates coarse 3D shape priors from text descriptions of the held object, then refines them using a 2D-3D collaborative attention module to align with visual cues and jointly optimize hand and object poses.
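The summary does not spell out how the 2D-3D collaborative attention module is built. As a rough illustration of the general idea only, the sketch below lets features attached to the points of a coarse 3D prior attend over 2D image features with standard scaled dot-product cross-attention, then applies a residual update. The function names (`cross_attention`, `refine_prior`), the feature shapes, and the residual design are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)  # (Nq, Nk)
    return weights @ values, weights

def refine_prior(point_feats, pixel_feats):
    """Hypothetical refinement step (an assumption, not the paper's module).

    point_feats: (N, D) features of the coarse 3D prior points.
    pixel_feats: (M, D) features of 2D image locations.
    """
    # Each 3D point gathers visual evidence from the image features.
    attended, weights = cross_attention(point_feats, pixel_feats, pixel_feats)
    # Residual update keeps the prior's information and adds visual cues.
    return point_feats + attended, weights

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 8))    # 5 prior points, 8-dim features
pixels = rng.normal(size=(12, 8))   # 12 image locations, 8-dim features
refined, attn = refine_prior(points, pixels)
print(refined.shape, attn.shape)
```

A real module would add learned projections for queries, keys, and values, and would likely attend in both directions; those parts are omitted here to keep the mechanism visible.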

Key results

Template-free framework replacing hand-crafted templates with text-driven 3D priors
2D-3D collaborative attention module for precise shape refinement and registration
Object Chamfer distances of 1.979 on Dex-YCB and 5.468 on Obman, surpassing existing template-free methods
Robust reconstruction under heavy hand occlusion and compatibility with diverse prior sources
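The Chamfer distance reported above measures how closely a reconstructed point set matches a reference one. Conventions differ across papers (squared or unsquared distances, sum or mean over the two directions, unit scaling), and this page does not say which variant produced the reported numbers. The sketch below assumes one common form: the mean squared nearest-neighbor distance in each direction, summed. The function name `chamfer_distance` is chosen for this example.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed variant: mean squared nearest-neighbor distance from a to b
    plus the same quantity from b to a.
    """
    diff = a[:, None, :] - b[None, :, :]   # (N, M, 3) pairwise differences
    d2 = np.sum(diff ** 2, axis=-1)        # (N, M) squared distances
    a_to_b = d2.min(axis=1).mean()         # each point in a to its nearest in b
    b_to_a = d2.min(axis=0).mean()         # each point in b to its nearest in a
    return a_to_b + b_to_a

predicted = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.1, 0.0]])
print(chamfer_distance(predicted, reference))
```

The pairwise matrix uses O(N·M) memory, which is fine for small point sets; larger clouds usually call for a spatial index such as a k-d tree.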

Why it matters

Provides a scalable, template-free pipeline for real-world robotics and AR/VR applications where 3D object models are unavailable or highly variable.

Abstract

Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual effort to capture or source, and they inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, which harnesses intuitive text-driven priors to steer object shape refinement and pose estimation. The framework has two stages: text-instructed prior generation and vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors from the text description, without tedious 3D crafting. Because a geometric gap remains between the synthesized prototype and the real object the hand interacts with, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., object Chamfer distances of 1.979 and 5.468 on the widely used Dex-YCB and Obman datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework is robust to occlusion while remaining compatible with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios. Our code will be available at https://github.com/huangyiyNUS/TIGeR.

Index terms

Contact Modeling; Deep Learning in Grasping and Manipulation; Grasping