X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
Maximus Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Moyun Du, Edward Duan, Wei-Chiu Ma, Kushal Kedia
AI summary
Problem
Humans and robots differ significantly in embodiment, making many human actions physically infeasible for direct robot execution. Naively co-training diffusion policies on mixed human and robot data degrades performance because the policy learns motions that are dynamically infeasible for the robot.
Approach
The method treats human actions as noisy counterparts of robot actions. A classifier identifies the minimum diffusion timestep at which noised human actions become indistinguishable from robot actions, and human actions are used in training only beyond that timestep.
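The timestep selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the per-timestep classifier probabilities, and the choice to include the boundary timestep itself are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def min_indistinguishable_timestep(
    p_human_by_t: Sequence[float], threshold: float = 0.5
) -> Optional[int]:
    """Smallest timestep t such that, for every t' >= t, the classifier's
    probability that the noised action is human stays at or below `threshold`.

    `p_human_by_t[t]` is a hypothetical classifier output for one human action
    noised to timestep t. Returns None if even the noisiest timestep is still
    recognizably human.
    """
    t_min = None
    # Scan from the noisiest timestep down and stop at the first timestep
    # where the classifier can still tell the action is human.
    for t in range(len(p_human_by_t) - 1, -1, -1):
        if p_human_by_t[t] > threshold:
            break
        t_min = t
    return t_min


def use_sample(t: int, is_human: bool, t_min: Optional[int]) -> bool:
    """Decide whether a (sample, timestep) pair contributes to the training loss.

    Robot actions are used at every timestep. Human actions are used only at
    timesteps t >= t_min (treating the boundary as inclusive is an assumption).
    """
    if not is_human:
        return True
    return t_min is not None and t >= t_min


# Hypothetical classifier outputs for one human action at timesteps 0..4.
p_human = [0.99, 0.90, 0.70, 0.50, 0.45]
t_min = min_indistinguishable_timestep(p_human)
print(t_min)
```

Scanning from the noisiest timestep downward makes the result robust to a classifier whose confidence is not perfectly monotone in the noise level: a human action is admitted only in the contiguous high-noise range where it is never flagged.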
Key results
- 16% average success rate improvement over naive co-training and manual filtering
- Successful learning across five real-world manipulation tasks with varying execution mismatches
- Prevention of kinematically and dynamically infeasible robot actions
- Outperforms Point-Policy, Motion Tracks, and DemoDiffusion baselines on all tested tasks
Why it matters
Enables scalable robot learning from abundant human video data without sacrificing physical feasibility, benefiting researchers and practitioners in robot imitation learning.
Abstract
Human videos are a scalable source of training data for robot learning. However, humans and robots significantly differ in embodiment, making many human actions infeasible for direct execution on a robot. Still, these demonstrations convey rich object-interaction cues and task intent. Our goal is to learn from this coarse guidance without transferring embodiment-specific, infeasible execution strategies. Recent advances in generative modeling tackle a related problem of learning from low-quality data. In particular, Ambient Diffusion is a recent method for diffusion modeling that incorporates low-quality data only at high-noise timesteps of the forward diffusion process. Our key insight is to view human actions as noisy counterparts of robot actions. As noise increases along the forward diffusion process, embodiment-specific differences fade away while task-relevant guidance is preserved. Based on these observations, we present X-DIFFUSION, a cross-embodiment learning framework based on Ambient Diffusion that selectively trains diffusion policies on noised human actions. This enables effective use of easy-to-collect human videos without sacrificing robot feasibility. Across five real-world manipulation tasks, we show that X-DIFFUSION improves average success rates by 16% over naive co-training and manual data filtering.
∗Equal contribution. †Equal advising.
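The intuition that embodiment-specific differences fade as noise increases can be illustrated with the standard Gaussian forward process, in which an action x_0 noised to timestep t is distributed as N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I). Under that process, the noised versions of a human action and a robot action share the same covariance, and their means differ by sqrt(alpha_bar_t) times the original gap. The sketch below computes that shrinking gap. The linear beta schedule, its endpoints, the number of steps, and the example action vectors are assumptions chosen for illustration, not values from the paper.

```python
import math
from typing import List, Sequence


def alpha_bars(
    num_steps: int = 100, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def expected_gap(
    human_action: Sequence[float], robot_action: Sequence[float], alpha_bar_t: float
) -> float:
    """Distance between the means of the noised human and robot actions."""
    return math.sqrt(alpha_bar_t) * math.dist(human_action, robot_action)


# Hypothetical 2-D actions, e.g. end-effector displacements.
human = [0.3, 0.4]
robot = [0.0, 0.0]

schedule = alpha_bars()
gaps = [expected_gap(human, robot, ab) for ab in schedule]

# The gap between the two noised means shrinks monotonically with the timestep,
# while the shared noise variance (1 - alpha_bar_t) grows.
print(gaps[0] > gaps[len(gaps) // 2] > gaps[-1])
```

Because the added noise grows while the mean gap shrinks, a classifier has progressively less signal to separate the two sources at later timesteps, which is consistent with using human actions only in the high-noise regime.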