ICRA 2026

Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

AI summary

Extracting 3D object trajectories from off-the-shelf video generation models enables zero-shot robotic manipulation across diverse objects and tasks without task-specific data.

3D object flow, video generation, zero-shot manipulation, robotic control, trajectory tracking, open-world robotics

Problem

Generative video models predict plausible physical interactions but cannot directly control robots due to an embodiment gap and differing action spaces. Translating high-level video reasoning into low-level robot commands remains a significant challenge.

Approach

Dream2Flow extracts 3D object flows from text-conditioned videos and formulates manipulation as a trajectory tracking problem, using trajectory optimization or reinforcement learning to generate executable robot actions.
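
To make the "manipulation as trajectory tracking" idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy model in which the grasped object moves rigidly with the gripper and each action is a pure 3D translation; the names `tracking_cost` and `plan_translations` are hypothetical. Under that assumption, the translation that minimizes the squared distance to the next set of target points is the difference of the two centroids.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def centroid(points: List[Point]) -> Point:
    """Mean of a non-empty list of 3D points."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def tracking_cost(points: List[Point], targets: List[Point]) -> float:
    """Mean squared distance between corresponding points."""
    total = 0.0
    for p, q in zip(points, targets):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(points)


def plan_translations(
    initial_points: List[Point], target_flow: List[List[Point]]
) -> List[Point]:
    """Greedy per-step translations that follow a target 3D object flow.

    Assumption (toy model): the object translates exactly with the
    gripper, so the least-squares optimal step is the centroid offset.
    """
    actions: List[Point] = []
    current = list(initial_points)
    for targets in target_flow:
        c_cur = centroid(current)
        c_tgt = centroid(targets)
        delta = (c_tgt[0] - c_cur[0], c_tgt[1] - c_cur[1], c_tgt[2] - c_cur[2])
        # Apply the action to the simulated object points.
        current = [(p[0] + delta[0], p[1] + delta[1], p[2] + delta[2]) for p in current]
        actions.append(delta)
    return actions


if __name__ == "__main__":
    start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    flow = [
        [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
        [(0.2, 0.0, 0.1), (1.2, 0.0, 0.1)],
    ]
    for step, action in enumerate(plan_translations(start, flow)):
        print(step, action)
```

A real system would replace the pure-translation model with robot dynamics and contact, and would solve for actions with trajectory optimization or a learned policy, as the approach describes.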

Key results

Enables zero-shot manipulation across rigid, articulated, deformable, and granular objects
Outperforms alternative intermediate representations like AVDC and RIGVID in real-world tasks
Demonstrates robust generalization across varying object instances, backgrounds, and viewing angles
Bridges video generation and robotic control without task-specific demonstrations or training

Why it matters

Provides a scalable, embodiment-agnostic interface that allows robots to leverage the rich physical priors of foundation video models for diverse manipulation tasks.

Abstract

Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories, including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos, visualizations, and appendix are available at https://dream2flow.github.io/.
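
The abstract does not spell out how 3D object motions are reconstructed from generated videos. One common way to lift tracked 2D points into 3D, shown here purely as an illustrative assumption, is pinhole back-projection using a per-pixel depth estimate and camera intrinsics. The function names and the intrinsics values below are hypothetical.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def backproject(
    u: float, v: float, depth: float, fx: float, fy: float, cx: float, cy: float
) -> Point3D:
    """Pinhole back-projection of pixel (u, v) at the given depth.

    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    The result is expressed in the camera frame.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_tracks(
    tracks_2d: List[List[Pixel]],
    depths: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[List[Point3D]]:
    """Convert per-frame 2D point tracks plus depths into a 3D object flow.

    tracks_2d[t][i] is the pixel location of point i in frame t, and
    depths[t][i] is its estimated depth (both assumed to be given).
    """
    flow: List[List[Point3D]] = []
    for pixels, frame_depths in zip(tracks_2d, depths):
        frame = [
            backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(pixels, frame_depths)
        ]
        flow.append(frame)
    return flow


if __name__ == "__main__":
    tracks = [[(320.0, 240.0)], [(420.0, 240.0)]]
    depth_values = [[2.0], [2.0]]
    print(lift_tracks(tracks, depth_values, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

The output of such a step, one list of 3D points per video frame, is the kind of target that an object trajectory tracking formulation could consume.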

Index terms

Machine Learning for Robot Control; Big Data in Robotics and Automation; Perception for Grasping and Manipulation