ICRA 2026

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Maximus Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Moyun Du, Edward Duan, Wei-Chiu Ma, Kushal Kedia

AI summary

X-DIFFUSION lets robots learn from human videos by selectively training diffusion policies on noised human actions, improving average success rates by 16% over naive co-training and manual data filtering.

Cross-embodiment learning, Diffusion policies, Human demonstrations, Ambient diffusion, Imitation learning, Robot manipulation

Problem

Humans and robots differ significantly in embodiment, so many human actions are physically infeasible for a robot to execute directly. Naively co-training diffusion policies on mixed human and robot data degrades performance because the policy learns motions that are dynamically infeasible for the robot.

Approach

The method treats human actions as noisy counterparts of robot actions. A classifier identifies the minimum diffusion timestep at which noised human actions become indistinguishable from robot actions, and human actions are integrated into training only from that timestep onward.
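The timestep search described above can be illustrated with a minimal sketch. This is not the authors' implementation. The linear noise schedule, the function names, the 0.6 decision threshold, the per-action search, and the `prob_human` callable (standing in for a trained human-vs-robot classifier) are all assumptions made for illustration.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(action, alpha_bar, rng):
    """Forward diffusion sample: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for a in action]

def min_indistinguishable_timestep(action, prob_human, alpha_bars, rng,
                                   threshold=0.6, num_samples=32):
    """Smallest t at which the mean classifier output P(human) is <= threshold.

    prob_human(noised_action, t) is a hypothetical classifier interface.
    Returns len(alpha_bars) if the action stays distinguishable at every timestep.
    """
    for t, a_bar in enumerate(alpha_bars):
        mean_p = sum(prob_human(add_noise(action, a_bar, rng), t)
                     for _ in range(num_samples)) / num_samples
        if mean_p <= threshold:
            return t
    return len(alpha_bars)

# Toy usage: a stand-in classifier that stops recognizing human actions at t = 40.
alpha_bars = make_alpha_bars(100)
toy_classifier = lambda noised_action, t: 0.9 if t < 40 else 0.5
t_min = min_indistinguishable_timestep([0.1, 0.2], toy_classifier,
                                       alpha_bars, random.Random(0))
print(t_min)  # prints 40
```

In this sketch an action that the classifier can always identify as human receives a threshold equal to the number of timesteps, so it would never enter training.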

Key results

16% average success rate improvement over naive co-training and manual filtering
Successful learning across five real-world manipulation tasks with varying execution mismatches
Prevention of kinematically and dynamically infeasible robot actions
Outperforms Point-Policy, Motion Tracks, and DemoDiffusion baselines on all tested tasks

Why it matters

Enables scalable robot learning from abundant human video data without sacrificing physical feasibility, benefiting researchers and practitioners in robot imitation learning.

Abstract

Human videos are a scalable source of training data for robot learning. However, humans and robots significantly differ in embodiment, making many human actions infeasible for direct execution on a robot. Still, these demonstrations convey rich object-interaction cues and task intent. Our goal is to learn from this coarse guidance without transferring embodiment-specific, infeasible execution strategies. Recent advances in generative modeling tackle a related problem of learning from low-quality data. In particular, Ambient Diffusion is a recent method for diffusion modeling that incorporates low-quality data only at high-noise timesteps of the forward diffusion process. Our key insight is to view human actions as noisy counterparts of robot actions. As noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. Based on these observations, we present X-DIFFUSION, a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility. Across five real-world manipulation tasks, we show that X-DIFFUSION improves average success rates by 16% over naive co-training and manual data filtering.

∗Equal contribution. †Equal advising.
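The idea of using low-quality data only at high-noise timesteps can be sketched as a masked denoising loss. This is an illustrative sketch under stated assumptions, not the paper's training code. The noise-prediction mean-squared-error objective, the function name `masked_denoising_loss`, and the per-sample `min_timesteps` input are hypothetical.

```python
def masked_denoising_loss(pred_noise, true_noise, timesteps, is_human, min_timesteps):
    """Mean squared noise-prediction error over the samples allowed at their timestep.

    Robot samples always contribute. A human sample contributes only when its
    sampled timestep t is at or above its minimum timestep (assumed per sample).
    """
    total, count = 0.0, 0
    for pred, target, t, human, t_min in zip(
            pred_noise, true_noise, timesteps, is_human, min_timesteps):
        if human and t < t_min:
            continue  # low-noise human sample: embodiment mismatch still visible
        total += sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
        count += 1
    return total / count if count else 0.0

# Toy batch: one robot sample and two human samples with minimum timestep 10.
loss = masked_denoising_loss(
    pred_noise=[[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]],
    true_noise=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    timesteps=[5, 5, 50],
    is_human=[False, True, True],
    min_timesteps=[0, 10, 10],
)
print(loss)  # prints 1.5
```

Here the human sample drawn at timestep 5 is skipped, while the human sample at timestep 50 and the robot sample both contribute to the average.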

Index terms

Imitation Learning, Learning from Demonstration, Transfer Learning