HoMeR: Learning In-The-Wild Mobile Manipulation Via Hybrid Imitation and Whole-Body Control
Priya Sundaresan, Rhea Malhotra, Phillip Miao, Jingyun Yang, Jimmy Wu, Hengyuan Hu, Rika Antonova, Francis Engelmann, Dorsa Sadigh, Jeannette Bohg
AI summary
Problem
Mobile manipulation in unstructured environments requires coordinating the base and arm while switching between long-range reaching and precise manipulation, but existing methods struggle with control complexity and require extensive data.
Approach
HoMeR uses a kinematics-based whole-body controller to coordinate base-arm motion, paired with a hybrid policy that learns to switch between absolute keyposes for reaching and relative deltas for fine manipulation, learned from minimal human teleoperation data.
Key results
- Achieves an overall success rate of 79.17% across 3 simulated and 3 real household tasks with only 20 demonstrations per task
- Outperforms the next-best baseline by 29.17% in success rate on average
- Enables seamless, learned switching between long-range reaching and fine-grained manipulation
- Integrates vision-language model keypoints to generalize to novel objects and cluttered scenes
Why it matters
Offers a scalable, sample-efficient framework for deploying capable mobile manipulation robots in diverse, everyday indoor environments.
Abstract
We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next-best baseline by 29.17% on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable mobile manipulation in everyday indoor spaces. Code, videos, and supplementary material are available at: https://homer-manip.github.io/.
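To make the hybrid action idea concrete, the following is a minimal sketch of how an absolute or relative prediction might be turned into an end-effector target. It is an illustration under stated assumptions, not the authors' implementation: the `HybridAction` class, the `resolve_target` function, the string mode labels, the position-only (no orientation) representation, and the choice to express relative deltas in the world frame are all hypothetical simplifications.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class HybridAction:
    """One policy output (hypothetical structure, not taken from the paper)."""
    mode: str             # "absolute" = long-range keypose, "relative" = fine-grained delta
    position: np.ndarray  # world-frame target (absolute) or world-frame offset (relative)


def resolve_target(current_pos: np.ndarray, action: HybridAction) -> np.ndarray:
    """Convert a hybrid action into a world-frame end-effector position target."""
    if action.mode == "absolute":
        # Absolute mode: the prediction is the target itself.
        return np.asarray(action.position, dtype=float).copy()
    if action.mode == "relative":
        # Relative mode: the prediction is an offset from the current position.
        return np.asarray(current_pos, dtype=float) + np.asarray(action.position, dtype=float)
    raise ValueError(f"unknown action mode: {action.mode!r}")


# Toy rollout: one long-range reach followed by two small corrective deltas.
ee_pos = np.array([0.0, 0.0, 0.5])
actions = [
    HybridAction("absolute", np.array([2.0, 1.0, 0.8])),
    HybridAction("relative", np.array([0.02, 0.0, 0.0])),
    HybridAction("relative", np.array([0.0, 0.0, -0.05])),
]
for action in actions:
    target = resolve_target(ee_pos, action)
    # A whole-body controller would track `target` with base and arm together;
    # this sketch simply assumes the target is reached exactly.
    ee_pos = target
    print(action.mode, ee_pos)
```

In this sketch the policy only decides which mode to use and what target or offset to emit. Everything about how the base and arm jointly reach the target is left to the controller, mirroring the division of labor described in the abstract.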