ICRA 2026

ActiveUMI: Robotic Manipulation with Active Perception from Robot‑Free Human Demonstrations

Qiyuan Zeng, Chengmeng Li, Jude St. John, Zhongyi Zhou, Junjie Wen, Yi Xu, Guorui Feng, Yichen Zhu

AI summary

Integrating active, egocentric head tracking into a portable VR teleoperation system significantly boosts robot policy success rates and generalization on complex bimanual tasks.

Active perception, VR teleoperation, robot data collection, bimanual manipulation, visuomotor policies, embodiment alignment

Problem

Scaling robot data collection while preserving embodiment fidelity is difficult: teleoperation is costly, human videos suffer from cross-embodiment gaps, and simulation suffers from sim-to-real gaps. Existing portable interfaces also lack active perception, which limits them on occluded and long-horizon tasks.

Approach

ActiveUMI uses a low-cost, portable VR teleoperation kit with sensorized controllers that mirror robot grippers, explicitly recording the operator's head movements to teach the robot active viewpoint control alongside manipulation.
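To make the recorded quantities concrete, the following is a minimal sketch of what one time step of such a demonstration might contain and how controller poses could be mapped to gripper targets. It is not the authors' implementation: the `Frame` layout, the field names, the quaternion convention, and the single fixed `controller_to_tcp` offset are all assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # unit quaternion, (w, x, y, z) order (assumed)


@dataclass(frozen=True)
class Pose:
    position: Vec3
    orientation: Quat


@dataclass(frozen=True)
class Frame:
    """One hypothetical time step of a demonstration (layout is an assumption)."""
    timestamp_s: float
    head: Pose              # head-mounted display pose, i.e. the operator's viewpoint
    left_controller: Pose
    right_controller: Pose
    left_gripper_width_m: float
    right_gripper_width_m: float


def quat_mul(a: Quat, b: Quat) -> Quat:
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate vector v by unit quaternion q, computed as q * (0, v) * conj(q)."""
    w, x, y, z = q
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *v)), (w, -x, -y, -z))
    return (rx, ry, rz)


def compose(parent: Pose, child: Pose) -> Pose:
    """Express a child pose, given in the parent's frame, in the parent's reference frame."""
    px, py, pz = parent.position
    cx, cy, cz = rotate(parent.orientation, child.position)
    return Pose(
        (px + cx, py + cy, pz + cz),
        quat_mul(parent.orientation, child.orientation),
    )


def to_robot_targets(frame: Frame, controller_to_tcp: Pose) -> Dict[str, object]:
    """Map one frame to hypothetical robot targets.

    controller_to_tcp is an assumed, pre-calibrated rigid offset from the
    controller's tracked origin to the gripper's tool center point.
    """
    return {
        "head_camera": frame.head,  # viewpoint target, passed through unchanged
        "left_tcp": compose(frame.left_controller, controller_to_tcp),
        "right_tcp": compose(frame.right_controller, controller_to_tcp),
        "gripper_widths_m": (frame.left_gripper_width_m, frame.right_gripper_width_m),
    }
```

In this sketch the head pose is stored next to the two hand poses in every frame, so a policy trained on such frames would see viewpoint and manipulation targets together. A real system would also need per-device calibration and camera images, which are omitted here.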

Key results

70% average success rate on in-distribution tasks
44% success rate improvement over wrist-camera baselines
56% success rate retained on novel objects and in new environments
Effective handling of long-horizon, occluded, and deformable tasks

Why it matters

Provides a scalable, low-cost pathway for training generalizable real-world robot policies by bridging the gap between in-the-wild human demonstrations and robot embodiment.

Abstract

We present ActiveUMI, a framework for a data collection system that transfers in-the-wild human demonstrations to robots capable of complex bimanual manipulation. ActiveUMI couples a portable VR teleoperation kit with sensorized controllers that mirror the robot’s end-effectors, bridging human-robot kinematics via precise pose alignment. To ensure mobility and data quality, we introduce several key techniques, including immersive 3D model rendering, a self-contained wearable computer, and efficient calibration methods. ActiveUMI’s defining feature is its capture of active, egocentric perception. By recording an operator’s deliberate head movements via a head-mounted display, our system learns the crucial link between visual attention and manipulation. We evaluate ActiveUMI on six challenging bimanual tasks. Policies trained exclusively on ActiveUMI data achieve an average success rate of 70% on in-distribution tasks and demonstrate strong generalization, retaining a 56% success rate when tested on novel objects and in new environments. Our results demonstrate that portable data collection systems, when coupled with learned active perception, provide an effective and scalable pathway toward creating generalizable and highly capable real-world robot policies.

1 Shanghai University, 2 Stanford University, 3 University of Toronto, 4 East China Normal University, 5 Midea Group. * Equal contribution. † Corresponding authors. This work was done while Qiyuan Zeng, Chengmeng Li, Junjie Wen, Zhongyi Zhou, and Yichen Zhu were at Midea Group.

Index terms

Imitation Learning
