ICRA 2026

MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos

Rutav Shah, Shuijing Liu, Qi Wang, Zhenyu Jiang, Sateesh Kumar, Mingyo Seo, Roberto Martín-Martín, Yuke Zhu

AI summary

MIMICDROID trains only on human play videos so that humanoid robots can solve novel manipulation tasks from a few video examples via in-context learning, achieving a nearly twofold higher real-world success rate than prior methods.

In-context learning, humanoid robotics, few-shot manipulation, human play videos, embodiment gap, robot learning

Problem

Effective in-context learning for robot manipulation currently depends on expensive, labor-intensive teleoperated demonstrations, which severely limits scalability and adaptability to diverse household environments.

Approach

The method trains a policy exclusively on continuous, unlabeled human play videos by pairing similar manipulation behaviors as context-target examples, while bridging the human-robot gap through wrist pose retargeting and random visual masking.
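The pairing step can be illustrated with a minimal sketch. It assumes each play-video segment has already been reduced to a fixed-length feature vector (for example, a flattened wrist trajectory) and that "similar" means nearest in Euclidean distance. Both the feature representation and the metric are assumptions made for illustration, and `pair_similar_segments` is a hypothetical helper, not a function from the MimicDroid codebase.

```python
import numpy as np


def pair_similar_segments(features, k=1):
    """Return, for each segment, the indices of its k most similar other segments.

    features: array of shape (num_segments, feature_dim); an assumed
    per-segment descriptor, not the paper's actual representation.
    """
    feats = np.asarray(features, dtype=float)
    # Pairwise squared Euclidean distances between all segments.
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # A segment must not be used as its own context.
    np.fill_diagonal(dist, np.inf)
    # Row i lists the context segments chosen for target segment i.
    return np.argsort(dist, axis=1)[:, :k]


# Toy descriptors: segments 0 and 1 are alike, as are segments 2 and 3.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
pairs = pair_similar_segments(features, k=1)
```

Each row of `pairs` gives the context segment(s) for one target segment; a policy would then be trained to predict the target's actions conditioned on those contexts.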

Key results

Outperforms state-of-the-art baselines in simulation and real-world tests
Achieves a nearly twofold higher real-world success rate than prior methods
Scales with training data, improving performance by 20% as the dataset grows from 64k to 320k frames
Introduces an open-source simulation benchmark with three generalization levels

Why it matters

It offers a scalable, cost-effective framework for deploying adaptable humanoid robots in unstructured domestic settings without relying on costly teleoperation data.

Abstract

We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos (continuous, unlabeled videos of people interacting freely with their environment) as a scalable and diverse training data source. We introduce MIMICDROID, which enables humanoids to perform ICL using human play videos as the only training data. MIMICDROID extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquires ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MIMICDROID first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MIMICDROID outperformed state-of-the-art methods and achieved a nearly twofold higher success rate in the real world. Additional materials can be found at: ut-austin-rpl.github.io/MimicDroid
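Random patch masking, as mentioned in the abstract, might look roughly like the sketch below. The patch size, the masking ratio, and the choice to zero out masked patches are all assumptions for illustration; the abstract does not give these details, and `random_patch_mask` is a hypothetical helper rather than the authors' implementation.

```python
import numpy as np


def random_patch_mask(image, patch_size=16, mask_ratio=0.3, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    image: array of shape (H, W, C). patch_size and mask_ratio are
    illustrative defaults, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    grid_h, grid_w = h // patch_size, w // patch_size
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))
    # Sample distinct patch indices so the masked area is exact.
    chosen = rng.choice(n_patches, size=n_masked, replace=False)
    out = image.copy()  # leave the input frame untouched
    for idx in chosen:
        row, col = divmod(int(idx), grid_w)
        out[row * patch_size:(row + 1) * patch_size,
            col * patch_size:(col + 1) * patch_size] = 0
    return out


# A 64x64 frame has 16 patches of size 16; a ratio of 0.25 masks 4 of them.
image = np.ones((64, 64, 3), dtype=np.float32)
masked = random_patch_mask(image, patch_size=16, mask_ratio=0.25,
                           rng=np.random.default_rng(0))
```

Applying a fresh mask to every training frame hides a different part of the scene each time, which is one plausible way to discourage a policy from relying on the appearance of human hands and arms.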

Index terms

Learning from Demonstration, Imitation Learning, Deep Learning Methods