ICRA 2026

HoMeR: Learning In-The-Wild Mobile Manipulation Via Hybrid Imitation and Whole-Body Control

Priya Sundaresan, Rhea Malhotra, Phillip Miao, Jingyun Yang, Jimmy Wu, Hengyuan Hu, Rika Antonova, Francis Engelmann, Dorsa Sadigh, Jeannette Bohg

AI summary

Combining whole-body control with hybrid imitation learning lets a mobile manipulator perform household tasks such as opening cabinets, sweeping trash, and rearranging pillows, reaching 79.17% overall success with just 20 demonstrations per task.

Mobile manipulation, Imitation learning, Whole-body control, Hybrid action spaces, Sample efficiency, Vision-language models

Problem

Mobile manipulation in unstructured environments requires coordinating the base and arm while switching between long-range reaching and precise manipulation, but existing methods struggle with control complexity and require extensive data.

Approach

HOMER pairs a fast, kinematics-based whole-body controller, which maps desired end-effector poses to coordinated base and arm motion, with a hybrid policy that learns to switch between absolute keyposes for long-range reaching and relative pose deltas for fine manipulation. The policy is learned from a small set of human teleoperation demonstrations (20 per task).
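
To make the hybrid action space concrete, here is a minimal sketch of how an absolute keypose and a relative delta could each be resolved into the end-effector target that a whole-body controller would then track. It is an illustration under stated assumptions, not the authors' implementation: the names HybridAction, resolve_target, and rollout are hypothetical, targets are reduced to 3D positions (orientation and gripper state are omitted), and the mode labels are supplied by hand rather than predicted by a learned policy.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class HybridAction:
    """Hypothetical action record; not taken from the HoMeR codebase."""
    mode: str  # "keypose" = absolute target, "delta" = relative offset
    xyz: Vec3  # metres, in a fixed world frame (assumption)


def resolve_target(current: Vec3, action: HybridAction) -> Vec3:
    """Turn a hybrid action into an absolute end-effector position target."""
    if action.mode == "keypose":
        # Absolute mode: the prediction is the target, wherever the arm is now.
        return action.xyz
    if action.mode == "delta":
        # Relative mode: the prediction is a small offset from the current pose.
        return tuple(c + d for c, d in zip(current, action.xyz))
    raise ValueError(f"unknown action mode: {action.mode!r}")


def rollout(start: Vec3, actions: Iterable[HybridAction]) -> List[Vec3]:
    """Resolve a sequence of actions, assuming each target is reached exactly."""
    targets: List[Vec3] = []
    current = start
    for action in actions:
        current = resolve_target(current, action)
        targets.append(current)
    return targets


if __name__ == "__main__":
    plan = [
        HybridAction("keypose", (1.0, 2.0, 0.5)),   # long-range reach
        HybridAction("delta", (0.0, 0.0, -0.25)),   # fine adjustment downward
    ]
    for target in rollout((0.0, 0.0, 0.0), plan):
        print(target)
```

In this sketch both modes end in the same kind of output, an absolute end-effector target, which is what allows a single downstream controller to handle base and arm coordination regardless of which mode produced the target.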

Key results

Achieves an overall success rate of 79.17% across 3 simulated and 3 real household tasks with only 20 demonstrations per task
Outperforms the next best baseline by 29.17% on average
Learns when to switch between long-range reaching and fine-grained manipulation
Integrates vision-language model keypoints to generalize to novel objects and cluttered scenes

Why it matters

Offers a scalable, sample-efficient framework for deploying capable mobile manipulation robots in diverse, everyday indoor environments.

Abstract

We introduce HOMER, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HOMER learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HOMER on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HOMER to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HOMER achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17% on average. HOMER is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HOMER moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable mobile manipulation in everyday indoor spaces. Code, videos, and supplementary material are available at: https://homer-manip.github.io/.

Index terms

Imitation Learning, Mobile Manipulation, Learning from Demonstration