CoVAR: Co-Generation of Video and Action for Robotic Manipulation Via Multi-Modal Diffusion
Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada
AI summary
Problem
Existing robotic policy learning methods either rely on two-stage pipelines that limit cross-modal information sharing or train joint diffusion models from scratch, which struggles with limited data. Video diffusion models also lack direct action annotations.
Approach
CoVAR extends a pretrained video diffusion model with a parallel action diffusion transformer and introduces a Bridge Attention mechanism to enable effective cross-modal interaction, supplemented by an action refinement module for low-resolution data.
Key results
- Higher-quality video generation preserving pretrained knowledge
- More accurate action predictions aligned with generated videos
- Successful fine-grained real-world manipulation execution
- Outperforms baselines across simulated and real-world benchmarks
Why it matters
Provides a scalable, data-efficient framework for leveraging large-scale pretrained video models to learn accurate robotic manipulation policies, benefiting researchers and practitioners in embodied AI and robotics.
Abstract
We present a method to generate video–action pairs that follow text instructions, starting from an initial image observation and the robot’s joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.
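The abstract does not spell out how Bridge Attention is built, so the following is only an illustrative sketch of one generic way two parallel token streams can exchange information: bidirectional cross-attention with residual updates, written in NumPy. All names (`cross_attend`, `bridge_exchange`, `params`), the token counts, and the dimensions are hypothetical and are not taken from the paper; the actual mechanism may differ substantially.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` tokens read from `context` tokens."""
    q = queries @ w_q                      # (n_q, d_head)
    k = context @ w_k                      # (n_ctx, d_head)
    v = context @ w_v                      # (n_ctx, d_out)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v    # (n_q, d_out)


def bridge_exchange(video_tokens, action_tokens, params):
    """Hypothetical two-way exchange between a video stream and an action stream.

    Each stream keeps its own width and receives a residual update computed
    from the other stream, so neither branch's tokens are overwritten.
    """
    video_update = cross_attend(video_tokens, action_tokens, *params["video_from_action"])
    action_update = cross_attend(action_tokens, video_tokens, *params["action_from_video"])
    return video_tokens + video_update, action_tokens + action_update


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_video, d_action, d_head = 16, 8, 4
params = {
    # (w_q, w_k, w_v) for video tokens attending to action tokens
    "video_from_action": (
        rng.normal(size=(d_video, d_head)),
        rng.normal(size=(d_action, d_head)),
        rng.normal(size=(d_action, d_video)),
    ),
    # (w_q, w_k, w_v) for action tokens attending to video tokens
    "action_from_video": (
        rng.normal(size=(d_action, d_head)),
        rng.normal(size=(d_video, d_head)),
        rng.normal(size=(d_video, d_action)),
    ),
}
video_tokens = rng.normal(size=(6, d_video))    # e.g. 6 video latent tokens
action_tokens = rng.normal(size=(3, d_action))  # e.g. 3 action tokens
new_video, new_action = bridge_exchange(video_tokens, action_tokens, params)
```

In this sketch the residual form means that zeroing the value projections leaves both streams unchanged, which is one simple way a pretrained branch could be kept intact at initialization; whether CoVAR does this is not stated in the abstract.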