ICRA 2026

Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt

Xiang Zhu, Yichen Liu, Hezhong Li, Jianyu Chen

AI summary

Robots can execute novel tasks and generalize to new skills directly from human demonstration videos as prompts, without requiring new teleoperation data or model fine-tuning.

Robot learning, Human demonstration, Video prompting, Dexterous manipulation, Diffusion policy, Cross-embodiment transfer

Problem

Adapting robots to new tasks typically requires costly, time-consuming teleoperation data collection and model fine-tuning, while existing video-prompting methods lack robust cross-embodiment generalization.

Approach

A two-stage framework that first fine-tunes a video diffusion model using cross-prediction to learn shared human-robot representations, then trains a diffusion policy with a prototypical contrastive loss to fuse human video prompts with robot observations for direct skill execution.
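To make the idea of a prototypical contrastive loss more concrete, here is a minimal sketch in plain Python. It is not the paper's ProtoDiffusion Contrastive Policy objective, whose exact form is not given in this summary. It shows a generic prototype-based contrastive loss under the following assumptions: each demonstration (human or robot) has already been encoded to a fixed-length embedding, each embedding carries a task label, a prototype is the normalized mean of the embeddings that share a label, and the temperature value is arbitrary. All function names and the example task labels are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def class_prototypes(embeddings, labels):
    """Assumed prototype definition: normalized mean embedding per task label."""
    groups = defaultdict(list)
    for emb, lab in zip(embeddings, labels):
        groups[lab].append(l2_normalize(emb))
    protos = {}
    for lab, vecs in groups.items():
        dim = len(vecs[0])
        mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        protos[lab] = l2_normalize(mean)
    return protos


def prototypical_contrastive_loss(embeddings, labels, temperature=0.1):
    """Cross-entropy over cosine similarities between each embedding and all prototypes.

    The loss is low when every embedding is closest to its own task prototype
    and far from the prototypes of other tasks.
    """
    protos = class_prototypes(embeddings, labels)
    keys = sorted(protos)
    total = 0.0
    for emb, lab in zip(embeddings, labels):
        z = l2_normalize(emb)
        logits = [
            sum(a * b for a, b in zip(z, protos[k])) / temperature for k in keys
        ]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[keys.index(lab)]
    return total / len(embeddings)


# Toy 2-D embeddings: two near the x-axis, two near the y-axis.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = prototypical_contrastive_loss(
    embeddings, ["pour", "pour", "pick", "pick"]
)
loss_mixed = prototypical_contrastive_loss(
    embeddings, ["pour", "pick", "pour", "pick"]
)
print(loss_clustered, loss_mixed)
```

In this toy case the labeling that matches the geometric clusters gives a much lower loss than the labeling that mixes them, which is the kind of task discrimination such an objective is meant to encourage. In a real training setup the embeddings would come from the learned video representation and the loss would be combined with the policy's action objective; how the paper weights or combines these terms is not stated here.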

Key results

Cross-embodiment representation learning via cross-prediction video generation
Joint policy training using abundant human videos and sparse dexterous hand data
ProtoDiffusion Contrastive Policy objective for sharper task discrimination
Zero-shot generalization to novel objects, scenes, and skills on real-world dexterous manipulation tasks

Why it matters

Provides a scalable, low-cost pathway for robot skill acquisition by leveraging readily available human videos, significantly reducing dependence on expensive teleoperation systems.

Abstract

Recent robot learning methods commonly rely on imitation learning from massive robotic datasets collected via teleoperation. When facing a new task, such methods generally require collecting a new set of teleoperation data and fine-tuning the policy. Moreover, the teleoperation data collection pipeline is tedious and expensive. In contrast, humans can efficiently learn new tasks just by watching others perform them. In this paper, we introduce a novel two-stage framework that utilizes human demonstrations to learn a generalizable robot policy. Such a policy can directly take a human demonstration video as a prompt and perform new tasks without any new teleoperation data or model fine-tuning. In the first stage, we train a video generation model that captures a joint representation of both human and robot demonstration video data using cross-prediction. In the second stage, we fuse the learned representation with a shared action space between human and robot using a novel prototypical contrastive loss. Empirical evaluations on real-world dexterous manipulation tasks demonstrate the effectiveness and generalization capabilities of our proposed method.

Index terms

Learning from Demonstration, Dexterous Manipulation