ICRA 2026

CoVAR: Co-Generation of Video and Action for Robotic Manipulation Via Multi-Modal Diffusion

Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada


AI summary


CoVAR enables high-quality, aligned video-action generation for robotics by extending a pretrained video diffusion model with a parallel action branch and a novel Bridge Attention mechanism, outperforming existing baselines on both simulated and real-world manipulation tasks.

Multi-modal diffusion, Robotic manipulation, Video-action generation, Bridge Attention, Policy learning

Problem

Existing robotic policy learning methods either rely on two-stage pipelines, which limit cross-modal information sharing, or train joint diffusion models from scratch, which struggles with limited data. Video diffusion models also typically lack direct action annotations.

Approach

CoVAR extends a pretrained video diffusion model with a parallel action diffusion transformer and introduces a Bridge Attention mechanism to enable effective cross-modal interaction, supplemented by an action refinement module for low-resolution data.
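To make the idea of cross-modal interaction between a video branch and an action branch more concrete, here is a minimal sketch of a generic bidirectional cross-attention exchange between two token streams. It is not the paper's Bridge Attention, whose design is not described in this summary. The function names (`cross_attend`, `bridge_step`, `init_params`), the token counts, the shared token width, and the single-head residual form are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v):
    # Single-head cross-attention: `queries` tokens read from `context` tokens.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def bridge_step(video_tokens, action_tokens, params):
    # Hypothetical exchange (an assumption, not the paper's mechanism):
    # each branch adds a residual update read from the other branch.
    video_out = video_tokens + cross_attend(
        video_tokens, action_tokens, *params["video_from_action"])
    action_out = action_tokens + cross_attend(
        action_tokens, video_tokens, *params["action_from_video"])
    return video_out, action_out

def init_params(dim, rng):
    # Random projection matrices (W_q, W_k, W_v) for each direction.
    def make():
        return tuple(rng.standard_normal((dim, dim)) / np.sqrt(dim)
                     for _ in range(3))
    return {"video_from_action": make(), "action_from_video": make()}

rng = np.random.default_rng(0)
dim = 16                                        # assumed shared token width
video_tokens = rng.standard_normal((64, dim))   # assumed 64 video latent tokens
action_tokens = rng.standard_normal((8, dim))   # assumed 8 action tokens
params = init_params(dim, rng)

video_out, action_out = bridge_step(video_tokens, action_tokens, params)
print(video_out.shape, action_out.shape)
```

In this sketch both streams keep their own shapes, so a pretrained video backbone could in principle stay unchanged while the action branch reads from it. Whether CoVAR freezes the video model or places such exchanges at every layer is not stated here.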

Key results

Higher-quality video generation preserving pretrained knowledge
More accurate action predictions aligned with generated videos
Successful fine-grained real-world manipulation execution
Outperforms baselines across simulated and real-world benchmarks

Why it matters

Provides a scalable, data-efficient framework for leveraging large-scale pretrained video models to learn accurate robotic manipulation policies, benefiting researchers and practitioners in embodied AI and robotics.

Abstract

We present a method to generate video–action pairs that follow text instructions, starting from an initial image observation and the robot’s joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model to the joint distribution, an approach that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos and more accurate actions and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.

Index terms

Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Imitation Learning