ICRA 2026

EMMA: Scaling Mobile Manipulation via Egocentric Human Data

Lawrence Y. Zhu, Pranav Kuppili, Ryan Punamiya, Patcharapong Aphiwetsa, Dhruv Patel, Simar Kareer, Sehoon Ha, Danfei Xu


AI summary


EMMA enables scalable mobile manipulation policy learning by co-training on low-cost egocentric human mobile manipulation data and static robot data, bypassing costly mobile teleoperation.

Mobile Manipulation, Imitation Learning, Egocentric Data, Cross-Embodiment Learning, Policy Co-training, Robot Learning

Problem

Scaling mobile manipulation imitation learning is bottlenecked by the high cost and scarcity of teleoperated mobile robot data, which limits data diversity and hinders deployment in unpredictable real-world settings.

Approach

EMMA bridges the human-robot embodiment gap through optimization-based navigation retargeting and coordinate-space alignment, then co-trains a unified Transformer policy on heterogeneous egocentric human mobile data and static robot manipulation data.
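The summary names the retargeting and alignment steps without spelling them out. The sketch below is a minimal illustration under stated assumptions, not EMMA's actual formulation: it assumes the human's head trajectory is available as planar (x, y, yaw) poses sampled at a fixed rate, aligns the trajectory to its first pose, and retargets it to a robot base path with a simple quadratic objective (tracking error plus a smoothness penalty) that has a closed-form solution. The function names `to_start_frame` and `retarget_base_path` and the parameter `smooth_weight` are hypothetical.

```python
import numpy as np


def to_start_frame(xy: np.ndarray, yaw: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Express a planar trajectory relative to its first pose.

    Hypothetical alignment step: positions are shifted so the first pose is the
    origin, then rotated so the first heading points along +x.
    """
    c, s = np.cos(yaw[0]), np.sin(yaw[0])
    R = np.array([[c, -s], [s, c]])  # start-frame axes expressed in the world frame
    local_xy = (xy - xy[0]) @ R  # row-vector form of R.T @ (p - p0)
    dyaw = yaw - yaw[0]
    local_yaw = np.arctan2(np.sin(dyaw), np.cos(dyaw))  # wrap to (-pi, pi]
    return local_xy, local_yaw


def retarget_base_path(head_xy: np.ndarray, smooth_weight: float = 10.0) -> np.ndarray:
    """Solve min_p sum_t ||p_t - h_t||^2 + w * sum_t ||p_{t+1} - p_t||^2.

    Setting the gradient to zero gives (I + w * D^T D) p = h, where D is the
    first-difference operator; each coordinate column is solved independently.
    """
    T = head_xy.shape[0]
    D = np.diff(np.eye(T), axis=0)  # shape (T-1, T): row i is e_{i+1} - e_i
    A = np.eye(T) + smooth_weight * D.T @ D
    return np.linalg.solve(A, head_xy)
```

A larger `smooth_weight` trades tracking accuracy for a smoother base path; a real retargeter would likely also enforce the robot's velocity and kinematic limits as constraints, which this unconstrained least-squares stand-in omits.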

Key results

Matches or exceeds the full-task success of baselines trained on teleoperated mobile robot data (Mobile ALOHA)
Shows positive performance scaling with increased human data
Generalizes to novel spatial configurations and unseen scenes
Introduces unsupervised phase identification for navigation-manipulation switching
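The summary does not say how phases are identified. Purely as an illustration of unsupervised segmentation, and not as EMMA's method, one could assume a per-frame base (or head) speed signal and split frames into a fast "navigation" cluster and a slow "manipulation" cluster with 1-D 2-means, then read off the switch points. The speed feature, the clustering choice, and the names `label_phases` and `switch_points` are all hypothetical.

```python
import numpy as np


def label_phases(base_speed: np.ndarray, iters: int = 50) -> np.ndarray:
    """Label each frame 1 (navigation) or 0 (manipulation) via 1-D 2-means on speed.

    Hypothetical stand-in: both the speed feature and k-means are assumptions.
    """
    speed = np.asarray(base_speed, dtype=float)
    centers = np.array([speed.min(), speed.max()])  # [slow cluster, fast cluster]

    def assign(c: np.ndarray) -> np.ndarray:
        # A frame joins the fast (navigation) cluster only if strictly closer to it.
        return (np.abs(speed - c[1]) < np.abs(speed - c[0])).astype(int)

    for _ in range(iters):
        labels = assign(centers)
        # Recompute each center; keep the old one if its cluster went empty.
        new_centers = np.array([
            speed[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in (0, 1)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(centers)


def switch_points(labels: np.ndarray) -> np.ndarray:
    """Frame indices where the phase changes (navigation <-> manipulation)."""
    return np.flatnonzero(np.diff(labels)) + 1
```

In practice a noisy speed signal would produce spurious short segments, so some temporal smoothing or a minimum segment length would probably be needed before using the switch points.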

Why it matters

Provides a scalable, low-cost paradigm for training real-world mobile manipulation robots by leveraging abundant egocentric human data instead of expensive teleoperation.

Abstract

Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework that trains mobile manipulation policies from human mobile manipulation data and static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train on human full-body motion data and static robot data. In experiments across four real-world tasks, EMMA performs comparably to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving higher or equivalent full-task success. We find that EMMA generalizes to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robot learning in real-world environments. Details of this project can be found at: https://ego-moma.github.io
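The abstract states that human and static robot data are co-trained but gives no mechanics. A minimal sketch of one common way to mix heterogeneous sources, under assumptions not taken from the paper: both datasets are already mapped into a shared observation and arm-action space (the coordinate-space alignment named in the Approach summary), only human samples carry navigation targets, and robot rows are masked out of the navigation loss. The dictionary keys, `human_frac`, and the functions `sample_cotraining_batch` and `cotraining_loss` are hypothetical; the policy itself (a Transformer in EMMA) is abstracted away as its predictions.

```python
import numpy as np


def sample_cotraining_batch(human, robot, batch_size, human_frac, rng):
    """Draw a mixed batch from two datasets (dicts of arrays sharing a first axis).

    Hypothetical keys: "obs", "arm_action", and (human only) "nav_action".
    """
    n_h = int(round(batch_size * human_frac))
    n_r = batch_size - n_h
    ih = rng.integers(0, len(human["obs"]), size=n_h)
    ir = rng.integers(0, len(robot["obs"]), size=n_r)
    nav_dim = human["nav_action"].shape[1]
    obs = np.concatenate([human["obs"][ih], robot["obs"][ir]])
    arm = np.concatenate([human["arm_action"][ih], robot["arm_action"][ir]])
    # Static robot data never moves its base, so it has no navigation target:
    # fill those rows with zeros and mask them out of the navigation loss.
    nav = np.concatenate([human["nav_action"][ih], np.zeros((n_r, nav_dim))])
    nav_mask = np.concatenate([np.ones(n_h), np.zeros(n_r)])
    return obs, arm, nav, nav_mask


def cotraining_loss(pred_arm, pred_nav, arm, nav, nav_mask):
    """Squared-error arm loss on every sample; navigation loss only where nav_mask == 1."""
    arm_loss = np.mean(np.sum((pred_arm - arm) ** 2, axis=1))
    nav_err = np.sum((pred_nav - nav) ** 2, axis=1)
    nav_loss = np.sum(nav_mask * nav_err) / max(nav_mask.sum(), 1.0)
    return arm_loss + nav_loss
```

Masking is one design choice; an alternative would be to supervise robot rows with an explicit zero base velocity, which would teach the policy to stay still rather than leave that output unconstrained on robot data.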

Index terms

Mobile Manipulation, Imitation Learning, Learning from Demonstration