DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos
Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
AI summary
Problem
Current trajectory-conditioned video generation methods for robotics rely on 2D guidance or single-modality inputs, which restricts fine-grained controllability, fails to capture 3D spatial constraints, and produces inconsistent robot-object interactions.
Approach
The framework fuses depth-aware 3D trajectories, DINOv2 object features, and coordinate-augmented text prompts into a video diffusion model to co-generate spatially aligned RGB and depth videos for downstream policy learning.
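As a rough illustration of two of these inputs, the sketch below turns a 3D trajectory into depth-colored 2D points and appends the endpoint coordinates to a text prompt. This summary does not specify the paper's encoding. The near-red/far-blue colormap, the `(u, v, depth)` point layout, the prompt format, and the function names `depth_to_color`, `encode_trajectory`, and `augment_prompt` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed layout: (u, v, depth) = pixel column, pixel row, depth in meters.
Point3D = Tuple[int, int, float]
RGB = Tuple[int, int, int]


def depth_to_color(depth: float, d_min: float, d_max: float) -> RGB:
    """Map a depth value to a color (hypothetical colormap: near = red, far = blue)."""
    if d_max <= d_min:
        raise ValueError("d_max must be greater than d_min")
    # Normalize to [0, 1] and clamp values outside the expected depth range.
    t = min(max((depth - d_min) / (d_max - d_min), 0.0), 1.0)
    return (round(255 * (1.0 - t)), 0, round(255 * t))


def encode_trajectory(
    points: List[Point3D], d_min: float, d_max: float
) -> List[Tuple[int, int, RGB]]:
    """Return 2D points whose color carries the depth of each waypoint."""
    return [(u, v, depth_to_color(d, d_min, d_max)) for u, v, d in points]


def augment_prompt(prompt: str, points: List[Point3D]) -> str:
    """Append start and end coordinates to the prompt (format is an assumption)."""
    (u0, v0, d0), (u1, v1, d1) = points[0], points[-1]
    return f"{prompt} Start: ({u0}, {v0}, {d0:.2f}). End: ({u1}, {v1}, {d1:.2f})."


trajectory = [(10, 20, 0.5), (30, 40, 1.0), (50, 60, 1.5)]
encoded = encode_trajectory(trajectory, d_min=0.5, d_max=1.5)
text = augment_prompt("Pick up the cup.", trajectory)
```

In a sketch like this, the colored points could be rasterized onto a conditioning image while the augmented prompt goes to the text encoder, so the same 3D information reaches the model through two channels.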
Key results
- Fuses depth-aware 3D trajectories, DINOv2 features, and coordinate-augmented text into a diffusion transformer
- Co-generates spatially aligned RGB and depth videos via cross-modality attention
- Achieves higher visual fidelity, lower trajectory deviation, and higher task success rates than existing baselines on Bridge V2, Berkeley Autolab, and simulation benchmarks
- Enables consistent joint angle regression via a multimodal RGB-depth policy model
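The cross-modality attention mentioned in the list above can be pictured as tokens from one stream (say RGB) querying tokens from the other (depth). The following is a minimal single-head sketch in plain Python. It omits learned projections, multiple heads, and everything else the actual architecture may use. The function names and toy token values are assumptions, not details from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token, one column per feature


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query token mixes the value tokens."""
    d = len(keys[0])
    out: Matrix = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value tokens, computed per feature column.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return out


# Toy tokens (hypothetical values): two RGB tokens and two depth tokens.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
depth_tokens = [[0.5, 0.5], [1.0, -1.0]]

# Each stream attends to the other, so information flows in both directions.
rgb_from_depth = cross_attention(rgb_tokens, depth_tokens, depth_tokens)
depth_from_rgb = cross_attention(depth_tokens, rgb_tokens, rgb_tokens)
```

Running attention in both directions is one simple way to let each modality read from the other. Whether DRAW2ACT uses this exact symmetric form is not stated in this summary.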
Why it matters
It provides a controllable, high-fidelity synthetic data pipeline that accelerates the training of robust, real-world robotic manipulation policies.
Abstract
Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot’s joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.