ICRA 2026

DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok

AI summary

DRAW2ACT uses depth-encoded 3D trajectories and object semantics to generate consistent RGB and depth videos, yielding higher robotic manipulation success rates than prior trajectory-conditioned models.

Trajectory-conditioned video generation; Depth-aware diffusion models; Robotic manipulation; Multimodal policy learning; Synthetic data generation; Embodied AI

Problem

Current trajectory-conditioned video generation methods for robotics rely on 2D guidance or single-modality inputs, which restricts fine-grained controllability, fails to capture 3D spatial constraints, and produces inconsistent robot-object interactions.

Approach

The framework fuses depth-aware 3D trajectories, DINOv2 object features, and coordinate-augmented text prompts into a video diffusion model to co-generate spatially aligned RGB and depth videos for downstream policy learning.
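To make the idea of a depth-encoded trajectory concrete, here is a minimal sketch that rasterizes image-space waypoints with depth into a single-channel conditioning map. It is an illustration only: the function name `render_depth_encoded_trajectory`, the brightness mapping, and the `z_near`/`z_far` range are assumptions, not the encoding used by the authors.

```python
import numpy as np

def render_depth_encoded_trajectory(points_uvz, height, width, z_near, z_far, radius=1):
    """Rasterize (u, v, z) waypoints into a single-channel map.

    Hypothetical encoding: pixel brightness carries normalized depth.
    Near points map to 1.0, far points to 0.1, and 0.0 means "no trajectory".
    """
    canvas = np.zeros((height, width), dtype=np.float32)
    for u, v, z in points_uvz:
        # Normalize depth to [0, 1] within the assumed working range.
        t = (np.clip(z, z_near, z_far) - z_near) / (z_far - z_near)
        value = 1.0 - 0.9 * t  # keep far points above zero so they stay visible
        row, col = int(round(v)), int(round(u))
        # Clamp the square brush to the image bounds.
        r0, r1 = max(row - radius, 0), min(row + radius + 1, height)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, width)
        if r0 < r1 and c0 < c1:
            canvas[r0:r1, c0:c1] = np.maximum(canvas[r0:r1, c0:c1], value)
    return canvas
```

A map like this could be stacked with the first frame as an extra conditioning channel. Under this assumed encoding, a path that moves away from the camera fades along its length, which is the kind of 3D cue a flat 2D stroke cannot carry.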

Key results

Fuses depth-aware 3D trajectories, DINOv2 features, and coordinate-augmented text into a diffusion transformer
Co-generates spatially aligned RGB and depth videos via cross-modality attention
Achieves superior visual fidelity, lower trajectory deviation, and higher task success rates than existing baselines on Bridge V2, Berkeley Autolab, and simulation benchmarks
Enables consistent joint angle regression via a multimodal RGB-depth policy model
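
The cross-modality attention mentioned in the results above can be pictured as one token stream (for example, RGB) querying the other (depth). The sketch below is a generic single-head cross-attention with a residual connection in NumPy. The function names, weight shapes, and single-head layout are assumptions for illustration and do not reproduce the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: query tokens attend to context tokens.

    query_tokens:   (n_q, d), e.g. RGB latent tokens
    context_tokens: (n_c, d), e.g. depth latent tokens
    w_q, w_k, w_v:  (d, d) projection matrices (hypothetical parameters)
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_q, n_c) scaled dot products
    weights = softmax(scores, axis=-1)        # each query row sums to 1
    # Residual connection keeps the original query-stream features.
    return query_tokens + weights @ v, weights
```

Running the same function with the roles swapped (depth tokens as queries, RGB tokens as context) gives the symmetric direction, which is one plausible way for two streams to stay spatially aligned.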

Why it matters

It provides a controllable, high-fidelity synthetic data pipeline that accelerates the training of robust, real-world robotic manipulation policies.

Abstract

Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single-modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot’s joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.

Index terms

Deep Learning in Grasping and Manipulation; Imitation Learning; Deep Learning Methods