ICRA 2026

Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation

Liang Heng, Jiadong XU, Yiwen Wang, Xiaoqi Li, Muhe Cai, Yan Shen, Juan Zhu, Guanghui Ren, Hao Dong

AI summary

Imagine2Act boosts high-precision robotic manipulation by generating imagined 3D goal states and explicitly aligning predicted actions with object transformations.

Robotic manipulation; Relational object rearrangement; Imitation learning; Imagined goals; Object-action consistency; 3D diffusion policy

Problem

Relational object rearrangement tasks demand precise semantic and geometric reasoning, but existing methods rely on limited demonstrations, suffer from generative noise, or fail to explicitly couple object transformations with action prediction.

Approach

The framework generates imagined goal point clouds from language instructions and initial observations, then uses an object-action consistency learning strategy with soft pose supervision to align predicted end-effector motions with the generated object transformations.
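The coupling between object transformation and end-effector motion can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the initial and imagined goal point clouds have known point-to-point correspondences, that the object is rigidly grasped (so the gripper undergoes the same world-frame transform as the object), and that "soft" supervision means a down-weighted auxiliary pose penalty. The function names, the `weight` parameter, and the use of the Kabsch algorithm are all hypothetical choices made for illustration.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Kabsch algorithm: least-squares rigid transform T with dst ~= R @ src + t.

    Assumes row i of `src` corresponds to row i of `dst` (an assumption of
    this sketch; real imagined point clouds would need a matching step).
    Returns a 4x4 homogeneous matrix.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def consistent_ee_target(T_obj, T_ee_grasp):
    """End-effector pose implied by the object transform under a rigid grasp."""
    return T_obj @ T_ee_grasp


def soft_pose_loss(T_pred, T_target, weight=0.1):
    """Down-weighted pose penalty: translation distance plus rotation angle.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    trans_err = np.linalg.norm(T_pred[:3, 3] - T_target[:3, 3])
    R_rel = T_pred[:3, :3].T @ T_target[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return weight * (trans_err + np.arccos(cos_angle))


# Synthetic example: an object rotated 30 degrees about z and shifted.
rng = np.random.default_rng(0)
initial_points = rng.normal(size=(50, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.2, -0.1, 0.05])
imagined_goal_points = initial_points @ R_true.T + t_true

T_obj = estimate_rigid_transform(initial_points, imagined_goal_points)

T_ee_grasp = np.eye(4)
T_ee_grasp[:3, 3] = [0.0, 0.0, 0.1]              # gripper pose at grasp time
T_ee_target = consistent_ee_target(T_obj, T_ee_grasp)

# A prediction that ignores the object motion is penalized; a consistent one is not.
loss_inconsistent = soft_pose_loss(T_ee_grasp, T_ee_target)
loss_consistent = soft_pose_loss(T_ee_target, T_ee_target)
```

In a learned policy the penalty would be added to the main imitation loss rather than replacing it, so that noise in the generated goal biases the predicted action without dictating it. How the paper actually weights or formulates this term is not stated in this summary.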

Key results

0.79 mean success rate across 7 RLBench relational rearrangement tasks
25% average success rate increase in real-world high-precision manipulation
Zero-shot imagined goal point cloud generation for robust semantic-geometric conditioning
Object-action consistency learning strategy preventing error accumulation via soft pose supervision

Why it matters

Enables reliable execution of complex relational manipulation tasks, advancing the practical deployment of domestic and service robots.

Abstract

Relational object rearrangement (ROR) tasks (e.g., insert flower to vase) require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations that struggle to capture complex geometric constraints, or generate goal-state observations to capture semantic and geometric knowledge but fail to explicitly couple object transformation with action prediction, resulting in errors due to generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric constraints of objects into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct the corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object-action consistency strategy with soft pose supervision explicitly aligns predicted end-effector motion with the generated object transformation. This design enables Imagine2Act to reason about semantic and geometric relationships between objects and predict accurate actions across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies. Code is fully open-sourced at https://github.com/LiangHeng121/Imagine2Act.

Index terms

Deep Learning in Grasping and Manipulation; Perception for Grasping and Manipulation