ICRA 2026

Trajectory Conditioned Cross-Embodiment Skill Transfer

Yuhang Tang, Yixuan Lou, Pengfei Han, Haoming Song, Xinyi Ye, Dong Wang, Bin Zhao

AI summary

TrajSkill enables zero-shot human-to-robot skill transfer by using sparse optical flow trajectories to condition video generation, eliminating the need for paired datasets or reinforcement learning.

cross-embodiment transfer, sparse optical flow, video diffusion, zero-shot imitation, robot policy learning, human demonstration

Problem

Directly transferring manipulation skills from human demonstration videos to robots is hindered by the significant morphological and kinematic embodiment gap, forcing existing methods to rely on costly paired datasets or reinforcement learning.

Approach

The framework extracts sparse optical flow trajectories from human videos as embodiment-agnostic motion cues, conditions a diffusion transformer to generate robot-consistent manipulation videos, and translates these videos into executable robot actions.
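As a rough illustration of the first step, the sketch below turns a set of tracked 2D points into a small number of displacement trajectories. It is not the paper's implementation: the input format (one list of `(x, y)` positions per tracked point), the helper names, and the "keep the `k` most mobile tracks" selection rule are all assumptions made for this example, and a real pipeline would obtain the tracks from a point tracker or optical flow estimator.

```python
from math import hypot

Point = tuple[float, float]
Track = list[Point]  # one (x, y) position per frame for a single tracked point


def path_length(track: Track) -> float:
    """Total distance travelled by one tracked point across all frames."""
    return sum(hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(track, track[1:]))


def select_sparse_tracks(tracks: list[Track], k: int) -> list[Track]:
    """Keep the k tracks that move the most (hypothetical selection rule)."""
    return sorted(tracks, key=path_length, reverse=True)[:k]


def to_flow(track: Track) -> list[Point]:
    """Convert positions into per-frame displacement vectors (dx, dy)."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(track, track[1:])]


# Toy input: three tracked points over three frames (made-up coordinates).
tracks = [
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # static background point
    [(1.0, 1.0), (4.0, 5.0), (7.0, 9.0)],  # point on the moving hand
    [(2.0, 2.0), (2.0, 3.0), (2.0, 4.0)],  # slowly moving point
]
sparse = select_sparse_tracks(tracks, k=2)
flows = [to_flow(t) for t in sparse]
```

The resulting displacement vectors describe how points move without describing what the moving body looks like, which is the sense in which such trajectories can act as embodiment-agnostic motion cues.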

Key results

Reduces FVD by 39.6% and KVD by 36.6% on MetaWorld compared to state-of-the-art video generators
Achieves up to 44.7% overall success rate on MetaWorld 50 tasks, outperforming prior video-to-action baselines
Successfully transfers human hand demonstrations to real-robot kitchen manipulation tasks without paired data or RL
Introduces sparse optical flow trajectories as a morphology-invariant representation bridging human and robot embodiments
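
To read the first result, it may help to spell out what a percentage reduction implies. The relation below assumes the reported figures are relative reductions measured against the strongest baseline; the summary does not give absolute scores, so the symbols $\mathrm{FVD}_{\text{base}}$ and $\mathrm{KVD}_{\text{base}}$ are placeholders rather than reported values.

```latex
\[
\text{reduction} = \frac{M_{\text{base}} - M_{\text{ours}}}{M_{\text{base}}}
\quad\Longrightarrow\quad
\mathrm{FVD}_{\text{ours}} = (1 - 0.396)\,\mathrm{FVD}_{\text{base}} = 0.604\,\mathrm{FVD}_{\text{base}},
\qquad
\mathrm{KVD}_{\text{ours}} = (1 - 0.366)\,\mathrm{KVD}_{\text{base}} = 0.634\,\mathrm{KVD}_{\text{base}}
\]
```

Both metrics are lower-is-better distances between generated and reference video distributions, so a smaller value indicates generated videos that are statistically closer to real ones.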

Why it matters

Enables scalable, zero-shot robot skill acquisition from everyday human videos, advancing practical deployment of embodied AI in unstructured environments.

Abstract

Learning manipulation skills from human demonstration videos presents a promising yet challenging problem, primarily due to the significant embodiment gap between the human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limit scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer, enabling robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. Extensive experiments are conducted, and the results on simulation data (MetaWorld) show that TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state-of-the-art, and improves cross-embodiment success rate by up to 16.7%. Real-robot experiments in kitchen manipulation tasks further validate the effectiveness of our approach, demonstrating practical human-to-robot skill transfer across embodiments.

Index terms

Deep Learning Methods, Task Planning, Computer Vision for Automation