ICRA 2026

Masquerade: Learning from In-The-Wild Human Videos Using Data-Editing

Marion Lepert, Jiaying Fang, Jeannette Bohg

AI summary

Explicitly bridging the visual embodiment gap via data editing makes large-scale in-the-wild human video usable for training robot manipulation policies that generalize to unseen scenes.

Robot learning, Embodiment gap, Data editing, In-the-wild videos, Diffusion policy, Zero-shot generalization

Problem

Robot manipulation research is bottlenecked by severe data scarcity and struggles to transfer visual representations from human videos due to a large embodiment gap between human hands and robot grippers.

Approach

The method edits in-the-wild egocentric human videos by inpainting human arms and overlaying a rendered robot that tracks recovered hand poses, then co-trains a vision encoder and diffusion policy head on these edited clips alongside a small set of real robot demonstrations.
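The editing step can be illustrated with a minimal NumPy sketch of how one "robotized" frame might be composited. Everything here is an assumption for illustration: the function name `robotize_frame`, the input layout, and especially the inpainting, which is replaced by a crude mean-color fill rather than the inpainting model the paper actually uses. The arm mask and the rendered robot image are assumed to come from upstream hand-pose estimation and rendering stages that are not shown.

```python
import numpy as np

def robotize_frame(frame, arm_mask, robot_rgb, robot_alpha):
    """Composite one edited frame (illustrative sketch, not the paper's code).

    frame:       (H, W, 3) float image in [0, 1]
    arm_mask:    (H, W) bool, True where the human arm is visible
    robot_rgb:   (H, W, 3) float render of the robot in the camera view
    robot_alpha: (H, W) float in [0, 1], render coverage of the robot
    """
    # Stand-in for inpainting: fill arm pixels with the mean color of the
    # remaining pixels. A real pipeline would use a learned inpainting model.
    background = frame.copy()
    if arm_mask.any() and (~arm_mask).any():
        background[arm_mask] = frame[~arm_mask].mean(axis=0)

    # Alpha-composite the rendered robot over the inpainted background.
    alpha = robot_alpha[..., None]
    return alpha * robot_rgb + (1.0 - alpha) * background

# Tiny synthetic example: a 4x4 black frame with one bright "arm" pixel.
frame = np.zeros((4, 4, 3))
frame[0, 0] = 1.0
arm_mask = np.zeros((4, 4), dtype=bool)
arm_mask[0, 0] = True
robot_rgb = np.full((4, 4, 3), 0.5)
robot_alpha = np.zeros((4, 4))
robot_alpha[1, 1] = 1.0

edited = robotize_frame(frame, arm_mask, robot_rgb, robot_alpha)
```

In this toy example the arm pixel is replaced by the background color and the robot render appears only where its alpha is nonzero.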

Key results

Outperforms baselines by 5-6× on long-horizon bimanual tasks in unseen scenes
Robot overlays and co-training are both indispensable for robust out-of-distribution (OOD) generalization
Policy performance scales logarithmically with the volume of edited human video data
Achieves robust zero-shot deployment in novel environments using only 50 real robot demos per task
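
To make the logarithmic-scaling claim concrete, the sketch below fits a curve of the form performance ≈ a + b · ln(N), where N is the amount of edited video. The function name `fit_log_scaling` is hypothetical, and the numbers are synthetic placeholders generated from a known curve; they are not results from the paper.

```python
import numpy as np

def fit_log_scaling(num_frames, performance):
    """Least-squares fit of performance ~ a + b * ln(num_frames)."""
    slope, intercept = np.polyfit(np.log(num_frames), performance, deg=1)
    return intercept, slope

# Synthetic placeholder data drawn from a known curve (NOT paper results).
frames = np.array([1e4, 5e4, 1e5, 5e5])
scores = 0.10 + 0.05 * np.log(frames)

a, b = fit_log_scaling(frames, scores)

# Under a logarithmic law, each doubling of data adds a constant amount.
gain_per_doubling = b * np.log(2.0)
```

The practical reading of such a fit is that every doubling of edited video buys the same absolute improvement, so gains continue with more data but at a diminishing rate per frame.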

Why it matters

It shows that visually aligning human videos with the robot embodiment can unlock large, uncurated human video datasets to improve robot policy generalization, offering a scalable way to address robotics data scarcity.

Abstract

Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into “robotized” demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. We pre-train a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips. We continue that auxiliary loss while fine-tuning a diffusion-policy head on only 50 robot demonstrations per task. This yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6×. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.
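
The co-training objective described in the abstract can be sketched as a sum of two terms: a denoising loss for the diffusion-policy head on robot demonstrations and the auxiliary keypoint-prediction loss on edited human video. The sketch below is an assumption-laden illustration: the function names, the use of mean squared error for both terms, and the weighting factor `aux_weight` are all hypothetical, since the abstract does not specify the loss forms or how the terms are balanced.

```python
import numpy as np

def keypoint_loss(pred_kp, target_kp):
    """Auxiliary loss: MSE between predicted and target future 2-D keypoints."""
    return float(np.mean((pred_kp - target_kp) ** 2))

def denoising_loss(pred_noise, true_noise):
    """Policy loss: MSE between predicted and injected noise, a common
    training objective for diffusion-based policies."""
    return float(np.mean((pred_noise - true_noise) ** 2))

def cotraining_loss(pred_noise, true_noise, pred_kp, target_kp, aux_weight=1.0):
    """Combined objective; aux_weight is a hypothetical balancing factor."""
    policy_term = denoising_loss(pred_noise, true_noise)
    aux_term = keypoint_loss(pred_kp, target_kp)
    return policy_term + aux_weight * aux_term

# Toy batches: noise tensors from robot demos, keypoints from edited video.
pred_noise = np.ones((2, 4))
true_noise = np.zeros((2, 4))
pred_kp = np.full((3, 2), 2.0)
target_kp = np.zeros((3, 2))

total = cotraining_loss(pred_noise, true_noise, pred_kp, target_kp, aux_weight=0.5)
```

Keeping the keypoint term active during fine-tuning is what lets the encoder keep learning from the large edited-video set while the policy head sees only the small robot dataset.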

Index terms

Imitation Learning; Big Data in Robotics and Automation; Representation Learning