ICRA 2026

Physically-Grounded Data Generation Via Video Diffusion Models

Sriram Yenamandra, Dorsa Sadigh

AI summary

Fine-tuning generalist robot policies on an autonomously generated, physically diverse simulation dataset significantly improves their ability to manipulate unseen objects with varying shapes, sizes, and textures.
Video diffusion models, Robot manipulation, Data generation, Simulation, Generalist policies, PHYSVIVID

Problem

Existing robot manipulation datasets lack diversity in object properties and physical interactions, limiting policy generalization, while manually collecting diverse real-world data is costly and simulators struggle to generate varied behaviors without human demonstrations.

Approach

The pipeline uses a fine-tuned video diffusion model to generate diverse robot manipulation videos, which a goal-conditioned planner translates into executable actions in simulation, followed by automated success filtering to create a training dataset.
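
The generate, plan, execute, and filter loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: every name (`generate_dataset`, `generate_video`, `plan_actions`, `simulate`, `is_success`, and the `toy_*` helpers) is hypothetical, and the toy 1-D "simulator" stands in for a real physics engine, video diffusion model, and learned planner.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# All names below are hypothetical placeholders, not the paper's API.


@dataclass
class Trajectory:
    initial_state: float
    actions: List[float]
    final_state: float


def generate_dataset(
    sample_initial_state: Callable[[], float],
    generate_video: Callable[[float], List[float]],
    plan_actions: Callable[[float, List[float]], List[float]],
    simulate: Callable[[float, List[float]], float],
    is_success: Callable[[float], bool],
    num_attempts: int,
) -> List[Trajectory]:
    """Run the loop num_attempts times and keep only successful rollouts."""
    dataset = []
    for _ in range(num_attempts):
        state = sample_initial_state()         # random initial condition
        frames = generate_video(state)         # video model proposes a behavior
        actions = plan_actions(state, frames)  # planner tracks the video
        final = simulate(state, actions)       # execute the actions in simulation
        if is_success(final):                  # automated success filter
            dataset.append(Trajectory(state, actions, final))
    return dataset


# Toy stand-ins: the "state" is a 1-D object position and the goal is 1.0.
def toy_sample_initial_state() -> float:
    return random.uniform(-1.0, 0.5)


def toy_generate_video(state: float, steps: int = 5) -> List[float]:
    # "Frames" are target positions interpolated from the start to the goal.
    return [state + (1.0 - state) * (i + 1) / steps for i in range(steps)]


def toy_plan_actions(state: float, frames: List[float]) -> List[float]:
    # Each action is the displacement needed to reach the next frame.
    actions, current = [], state
    for target in frames:
        actions.append(target - current)
        current = target
    return actions


def toy_simulate(state: float, actions: List[float]) -> float:
    # Imperfect execution: each action is scaled by a random slip factor.
    for action in actions:
        state += action * random.uniform(0.7, 1.0)
    return state


def toy_is_success(final_state: float) -> bool:
    return abs(final_state - 1.0) < 0.2


random.seed(0)
data = generate_dataset(
    toy_sample_initial_state, toy_generate_video, toy_plan_actions,
    toy_simulate, toy_is_success, num_attempts=100,
)
print(f"kept {len(data)} of 100 attempted trajectories")
```

Because execution is imperfect, some planned rollouts miss the goal; the filter discards them so that only trajectories verified in simulation reach the dataset.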

Key results

  • Autonomous pipeline combining video diffusion models and goal-conditioned planning for trajectory generation
  • PHYSVIVID dataset containing 5,000+ trajectories across 400+ diverse objects
  • 15% average performance improvement in generalizing to unseen objects with varying physical properties
  • Simulation-based success detection and filtering ensuring high-quality, physically grounded training data
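
As a rough illustration of the success detection in the last item: a simulator exposes ground-truth object and gripper poses, so success can be checked directly from state. The source does not state the actual criteria; the lift-height and grasp-distance thresholds below, and the names `Rollout`, `is_lift_success`, and `filter_rollouts`, are assumptions made for this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]  # (x, y, z) position in metres


@dataclass
class Rollout:
    object_start: Vec3  # object position before execution
    object_end: Vec3    # object position after execution
    gripper_end: Vec3   # gripper position after execution


def is_lift_success(
    rollout: Rollout, min_lift: float = 0.05, max_grasp_dist: float = 0.03
) -> bool:
    """Hypothetical 'pick' criterion: the object rose and is still at the gripper."""
    lifted = rollout.object_end[2] - rollout.object_start[2] >= min_lift
    held = math.dist(rollout.object_end, rollout.gripper_end) <= max_grasp_dist
    return lifted and held


def filter_rollouts(rollouts: List[Rollout]) -> List[Rollout]:
    """Keep only rollouts that pass the success check."""
    return [r for r in rollouts if is_lift_success(r)]


rollouts = [
    # Lifted 10 cm and still held by the gripper: kept.
    Rollout((0.0, 0.0, 0.02), (0.0, 0.0, 0.12), (0.0, 0.0, 0.13)),
    # Lifted but far from the gripper (e.g. knocked away): dropped.
    Rollout((0.0, 0.0, 0.02), (0.3, 0.0, 0.12), (0.0, 0.0, 0.13)),
    # Never left the table: dropped.
    Rollout((0.0, 0.0, 0.02), (0.0, 0.0, 0.02), (0.0, 0.0, 0.13)),
]
kept = filter_rollouts(rollouts)
```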

Why it matters

Enables scalable, automated creation of diverse robot training data, reducing reliance on manual collection and improving real-world generalization for generalist manipulation policies.

Abstract

Existing datasets for training generalist manipulation policies often lack diversity in object variety and initial states, limiting the range of physically grounded interactions present in them. Consequently, these policies struggle with unseen object shapes, sizes, or unfamiliar object poses. Manually collecting real-world trajectories with diverse physical interactions is tedious, time-consuming, and expensive, underscoring the need to generate these autonomously. Simulators offer a scalable pathway to autonomously generate trajectories by enabling extensive variation not only in tasks (e.g., objects, object properties, and initial conditions), but also in the robot behaviors required to solve these tasks. We develop a data generation pipeline that autonomously produces physically grounded trajectories in simulation using video diffusion models. Our approach first simulates random initial conditions across various tasks using a diverse asset library. A video diffusion model generates videos of a robot performing these tasks in physically diverse scenarios, which are then fed to a learned goal-conditioned planner to extract actions that closely follow the generated videos. Unlike prior trajectory generation methods, our pipeline generalizes to new objects across multiple tasks without relying on human demonstrations. Using our approach, we generate a simulation dataset PHYSVIVID, containing 5k+ demonstrations involving 400+ objects. We demonstrate the effectiveness of PHYSVIVID by fine-tuning robot policies on it, and demonstrating generalization of policies to unseen objects with varying shapes, textures, and sizes, as well as to unseen object categories. See videos on our website: https://sites.google.com/view/physvivid/.

Index terms

Data Sets for Robot Learning; Imitation Learning; Simulation and Animation

Related papers