ICRA 2026

AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis

Junjie Ye, Rong Xue, Basile Van Hoorick, Pavel Tokmakov, Muhammad Zubair Irshad, Yue Wang, Vitor Guizilini

AI summary

Conditioning video diffusion models on robot motion traces enables scalable, photorealistic demonstration synthesis that significantly boosts downstream imitation learning performance.

Robot data synthesis, Video diffusion, Embodiment-aware generation, Imitation learning, Generative world models, Sim-to-real gap

Problem

Collecting large-scale, diverse robot demonstrations is costly, while existing generative methods either fail to create new behaviors or produce kinematically inconsistent, hallucinated robot motions. Simulators offer an alternative but suffer from sim-to-real gaps and require labor-intensive environment modeling.

Approach

AnchorDream decouples trajectory expansion from scene generation by rendering only robot arm motions and using them to condition a pretrained video diffusion model. This anchors the synthesis process on the robot's kinematics, allowing the model to generate consistent environments and objects without explicit scene reconstruction.
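
The decoupling described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: `expand_trajectory`, `render_robot_only`, `synthesize_dataset`, and `dummy_video_model` are hypothetical names, the "robot" is reduced to a single end-effector pixel, and the video model is a placeholder where a pretrained, conditioned video diffusion model would sit. The point is only the data flow: new trajectories are produced first, rendered as robot-only frames, and those frames are the conditioning input for scene generation.

```python
import numpy as np


def expand_trajectory(demo, rng, max_offset=0.05):
    """Hypothetical trajectory expansion: shift a demo by a smooth random offset.

    demo: (T, 2) array of end-effector positions in normalized [0, 1] coordinates.
    """
    offset = rng.uniform(-max_offset, max_offset, size=demo.shape[1])
    # Ramp the offset in over time so the new trajectory starts at the same pose.
    ramp = np.linspace(0.0, 1.0, len(demo))[:, None]
    return np.clip(demo + ramp * offset, 0.0, 1.0)


def render_robot_only(traj, size=32):
    """Stand-in renderer: one frame per step, a single robot pixel on an empty background."""
    frames = np.zeros((len(traj), size, size), dtype=np.float32)
    cols = np.minimum((traj[:, 0] * size).astype(int), size - 1)
    rows = np.minimum((traj[:, 1] * size).astype(int), size - 1)
    frames[np.arange(len(traj)), rows, cols] = 1.0
    return frames


def dummy_video_model(condition):
    """Placeholder for a conditioned video model: fills in a flat background
    while leaving the rendered robot pixels untouched."""
    return np.maximum(condition, 0.1)


def synthesize_dataset(demos, video_model, n_variants, seed=0):
    """Pair each expanded trajectory with a video generated from its robot-only rendering."""
    rng = np.random.default_rng(seed)
    dataset = []
    for demo in demos:
        for _ in range(n_variants):
            traj = expand_trajectory(demo, rng)
            condition = render_robot_only(traj)
            video = video_model(condition)
            dataset.append({"actions": traj, "video": video})
    return dataset


# One straight-line demo of 10 steps, expanded into 12 synthetic demonstrations.
demo = np.stack([np.linspace(0.2, 0.8, 10), np.full(10, 0.5)], axis=1)
data = synthesize_dataset([demo], dummy_video_model, n_variants=12)
print(len(data), data[0]["video"].shape)  # 12 (10, 32, 32)
```

Because the action labels come from the expanded trajectory rather than from the generated video, each synthetic sample keeps actions that match the robot motion shown in its frames, which is the sense in which the rendering "anchors" the synthesis in this sketch.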

Key results

Expands small demonstration sets by over an order of magnitude
Achieves 36.4% relative improvement in simulator policy benchmarks
Nearly doubles policy performance in real-world robot studies
Eliminates the need for explicit environment modeling or simulators
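
The 36.4% figure above is a relative gain, which is easy to misread as an absolute one. The short sketch below shows the arithmetic. The success rates in it are hypothetical values chosen only so the calculation is concrete; they are not taken from the paper, and `relative_improvement` is an illustrative helper name.

```python
def relative_improvement(baseline, improved):
    """Gain of `improved` over `baseline`, expressed as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical success rates, for illustration only (not results from the paper).
gain = relative_improvement(0.22, 0.30)
print(f"{gain:.1%}")  # 36.4%

# "Nearly doubling" performance corresponds to a relative gain close to 100%.
print(f"{relative_improvement(0.30, 0.60):.1%}")  # 100.0%
```

In this made-up case the absolute gain is only 8 percentage points, so the same relative figure can correspond to quite different absolute changes depending on the baseline.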

Why it matters

It offers a practical, scalable pathway for imitation learning by grounding generative priors in robot motion, reducing reliance on costly data collection and simulators.

Abstract

The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real-world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot’s kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.

Index terms

Deep Learning in Grasping and Manipulation