ICRA 2026

EasyMimic: A Low-Cost Framework for Robot Imitation Learning from Human Videos

Tao Zhang, Song Xia, Ye Wang, Qin Jin

AI summary

EasyMimic enables low-cost robots to master manipulation tasks from standard human videos by bridging the human-robot embodiment gap through physical alignment and co-training, achieving high success rates with minimal real robot data.

Robot imitation learning, human-to-robot transfer, low-cost robotics, vision-language-action models, physical alignment, co-training

Problem

Robot imitation learning is bottlenecked by the high cost and complexity of collecting large-scale real-world robot demonstration data. Directly using human videos fails due to significant visual appearance and action space gaps between human hands and robot grippers.

Approach

The framework extracts 3D hand poses from consumer RGB videos, maps them to robot actions via kinematic retargeting, and standardizes visual appearance through lightweight color randomization. It then co-trains a Vision-Language-Action model on this processed human data alongside a small set of real robot teleoperation trajectories.
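
The mapping from hand poses to gripper actions described above can be illustrated with a minimal Python sketch. It is not the paper's action alignment module: it assumes a simple pinch heuristic in which the gripper target is the midpoint between the thumb and index fingertips and the gripper opening is their distance, normalized by a maximum aperture. The names `GripperAction`, `retarget_hand_to_gripper`, and `max_aperture_m`, and the 10 cm default, are hypothetical. A real pipeline would also need to transform points from the camera frame to the robot base frame, which is omitted here.

```python
import math
from dataclasses import dataclass

Point = tuple[float, float, float]


@dataclass
class GripperAction:
    position: Point  # target point for the gripper, same frame as the input (metres)
    opening: float   # normalized opening: 0.0 = closed, 1.0 = fully open


def retarget_hand_to_gripper(
    thumb_tip: Point,
    index_tip: Point,
    max_aperture_m: float = 0.10,  # assumed maximum jaw opening, not from the paper
) -> GripperAction:
    """Map two 3D fingertip positions to a parallel-gripper command (pinch heuristic)."""
    # Gripper target: midpoint between the two fingertips.
    position = tuple((t + i) / 2.0 for t, i in zip(thumb_tip, index_tip))
    # Gripper opening: fingertip distance, scaled and clamped to [0, 1].
    aperture = math.dist(thumb_tip, index_tip)
    opening = min(max(aperture / max_aperture_m, 0.0), 1.0)
    return GripperAction(position=position, opening=opening)


if __name__ == "__main__":
    # One frame of a hypothetical hand trajectory: fingertips 5 cm apart.
    action = retarget_hand_to_gripper((0.0, 0.0, 0.5), (0.05, 0.0, 0.5))
    print(action)
```

Applying the function to every frame of an extracted hand trajectory would yield a sequence of gripper commands in the same format as teleoperated robot data, which is what allows the two sources to be mixed during training.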

Key results

Achieves 0.88 average success rate across four manipulation tasks using only 20 robot trajectories per task
Reduces human-to-robot data collection time by over 6x compared to traditional teleoperation
Outperforms robot-only and pretrain-finetune baselines by up to 0.62 in average score
Enables robust language-conditioned task execution with minimal hardware and computational overhead

Why it matters

It offers a scalable, affordable pathway for non-experts to train household robots, accelerating the real-world deployment of intelligent manipulation systems.

Abstract

Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes. Project website: https://zt375356.github.io/EasyMimic-Project/.
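
To make the co-training step concrete, the following Python sketch shows one common way to build training batches that mix processed human samples with a small pool of robot samples. It is an illustration under stated assumptions, not the authors' training code: the function name `cotrain_batches`, the 50/50 `robot_fraction`, and the batch size are hypothetical, and the abstract does not state the mixing ratio actually used.

```python
import random
from typing import Iterator, Sequence


def cotrain_batches(
    human_data: Sequence,
    robot_data: Sequence,
    batch_size: int = 8,
    robot_fraction: float = 0.5,  # assumed ratio, not taken from the paper
    steps: int = 3,
    seed: int = 0,
) -> Iterator[list]:
    """Yield mixed batches drawn from both human and robot datasets."""
    rng = random.Random(seed)
    n_robot = round(batch_size * robot_fraction)
    n_human = batch_size - n_robot
    for _ in range(steps):
        # Sample with replacement so the small robot set can be reused every step.
        batch = rng.choices(robot_data, k=n_robot) + rng.choices(human_data, k=n_human)
        rng.shuffle(batch)  # avoid a fixed source ordering inside the batch
        yield batch


if __name__ == "__main__":
    # Placeholder samples tagged by source; real samples would be (image, text, action).
    human = [("human", i) for i in range(100)]
    robot = [("robot", i) for i in range(20)]
    for batch in cotrain_batches(human, robot):
        print([source for source, _ in batch])
```

Sampling the robot data with replacement keeps it represented in every batch even though it is much smaller than the human data, which is the basic idea behind training on both sources jointly.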

Index terms

Learning from Demonstration, Imitation Learning, Deep Learning in Grasping and Manipulation