ICRA 2026

The One RING: A Robotic Indoor Navigation Generalist

Ainaz Eftekhar, Rose Hendrix, Luca Weihs, Jiafei Duan, Ege Caglar, Jordi Salvador, Alvaro Herrasti, Winson Han, Eli VanderBilt, Aniruddha Kembhavi, Ali Farhadi, Ranjay Krishna, Kiana Ehsani, Kuo-Hao Zeng

AI summary

A single policy trained entirely in simulation generalizes zero-shot across vastly different robot bodies and cameras to navigate indoor environments effectively in the real world.
indoor navigation, cross-embodiment generalization, simulation-to-real, reinforcement learning, embodied AI, robotic policy

Problem

Most indoor navigation policies are tied to specific robot designs and fail to generalize to new or slightly modified embodiments, requiring costly retraining for each new platform.

Approach

The authors train a single transformer-based policy entirely in simulation: first on expert trajectories collected from one million randomized robot embodiments, then fine-tuned with on-policy reinforcement learning.

Key results

  • 72.1% average success rate across 5 unseen simulated embodiments
  • 78.9% zero-shot success rate on 4 real-world robot platforms
  • Embodiment-adaptive navigation strategies that adjust to physical constraints
  • Up to 10% performance boost with minimal embodiment-specific fine-tuning

Why it matters

It eliminates the need to train separate navigation policies for every new robot, accelerating the deployment of general-purpose indoor robots across diverse hardware platforms.

Abstract

Modern robots vary significantly in shape, size, and sensor configurations used to perceive and interact with their environments. However, most navigation policies are embodiment-specific: a policy trained on one robot typically fails to generalize to another, even with minor changes in body size or camera viewpoint. As custom hardware becomes increasingly common, there is a growing need for a single policy that generalizes across embodiments, eliminating the need to (re-)train for each specific robot. In this paper, we introduce RING (Robotic Indoor Navigation Generalist), an embodiment-agnostic policy that turns any mobile robot into an effective indoor semantic navigator. Trained entirely in simulation, RING leverages large-scale randomization over robot embodiments to enable robust generalization to many real-world platforms. To support this, we augment the AI2-THOR simulator to instantiate robots with controllable configurations, varying in body size, rotation pivot point, and camera parameters. On the visual object-goal navigation task, RING achieves strong cross-embodiment (XE) generalization: 72.1% average success rate across 5 simulated embodiments (a 16.7% absolute improvement on the CHORES-S benchmark) and 78.9% across 4 real-world platforms, including Stretch RE-1, LoCoBot, and Unitree Go1, matching or even surpassing embodiment-specific policies. We further deploy RING on the RB-Y1 wheeled humanoid in a real-world kitchen environment, showcasing its out-of-the-box potential for mobile manipulation platforms.

Introduction

Robot embodiments are diverse and are constantly evolving to better suit new environments and tasks. This range in body configurations (differences in size, shape, wheeled or legged locomotion, and sensor configurations) not only shapes how robots perceive the world but also how they act in it.
A robot with a wide field of view (FoV) or multiple cameras can scan its surroundings quickly, while one with a narrower view might need to explore a room more actively. A small robot can squeeze through tight spaces, a low-profile one can duck under furniture, while a larger robot needs to follow more conservative routes. The influence of embodiment on behavior means a policy trained on one design, or even a few, often fails to generalize out of domain.

There has been progress towards scalable cross-embodiment training [1], [2], [3], [4], [5]. While these methods demonstrate some transfer to unseen embodiments, they still suffer from performance degradation with relatively small changes in embodiment (e.g., camera pose modification on the same robot) [6], [7]. Potentially, this is due to these methods relying on the small amount of real-world data available in public datasets, around 20 embodiments in total [1]. Similarly, general-purpose navigation policies [8], [9], [10] are trained on datasets with relatively few embodiments (e.g., 8 robots in [9]), limiting their generalization. A more comprehensive solution is needed, one that can robustly handle the full spectrum of possible embodiments without retraining or additional adaptation.

We introduce RING, a Robotic Indoor Navigation Generalist. RING is trained exclusively in simulation, without any use of real-world robot embodiments. In other words, all robot platforms we evaluate on (i.e., Stretch RE-1, LoCoBot, Unitree's Go1, RB-Y1) are unseen by RING during training. We leverage simulation to randomly sample 1 million agent body configurations, varying the robot's camera parameters, collider sizes, and center of rotation. Concretely, each embodiment consists of a collider box of varying dimensions and cameras with randomized parameters, placed randomly within the collider box. Fig. 1-A presents a t-SNE [11] visualization of body parameters for 30k such agents.
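
The embodiment randomization described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' AI2-THOR implementation: the names `Camera`, `Embodiment`, `sample_embodiment`, and `sample_population`, the choice of two cameras per agent, and every numeric range are hypothetical. The text only states that collider dimensions, camera parameters, and the center of rotation are randomized, with cameras placed inside the collider box.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    position: tuple   # (x, y, z) offset from the collider-box center, in meters
    pitch_deg: float  # camera tilt; range below is an assumption
    fov_deg: float    # vertical field of view; range below is an assumption


@dataclass(frozen=True)
class Embodiment:
    box_size: tuple         # (width, height, depth) of the collider box, in meters
    rotation_center: tuple  # (x, z) pivot inside the box footprint
    cameras: tuple          # tuple of Camera instances


def sample_embodiment(rng, num_cameras=2):
    """Draw one random body configuration (all ranges are illustrative)."""
    width = rng.uniform(0.2, 0.5)
    height = rng.uniform(0.2, 1.5)
    depth = rng.uniform(0.2, 0.5)

    # The rotation pivot is kept inside the footprint of the collider box.
    pivot = (rng.uniform(-width / 2, width / 2),
             rng.uniform(-depth / 2, depth / 2))

    # Each camera is placed at a random point inside the collider box.
    cameras = tuple(
        Camera(
            position=(rng.uniform(-width / 2, width / 2),
                      rng.uniform(-height / 2, height / 2),
                      rng.uniform(-depth / 2, depth / 2)),
            pitch_deg=rng.uniform(-20.0, 40.0),
            fov_deg=rng.uniform(40.0, 120.0),
        )
        for _ in range(num_cameras)
    )
    return Embodiment((width, height, depth), pivot, cameras)


def sample_population(n, seed=0):
    """Sample n embodiments reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [sample_embodiment(rng) for _ in range(n)]
```

Under these assumptions, `sample_population(30_000)` would yield a population of the size visualized in Fig. 1-A, and scaling to one million agents only changes the count. Seeding the generator keeps the sampled population reproducible across runs.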
Our approach builds on the success of prior works that achieve strong real-world performance through large-scale simulation-only training [12], [13], [14]. Simulation enables training across a vast distribution of environments (150k ProcTHOR houses [15]) and objects (40k+ 3D objects in Objaverse [16]) in the AI2-THOR simulator. Extensive domain randomization on visual observations and the use of pre-trained visual encoders then allow simulation-trained policies to bridge the sim-to-real gap. We follow the training procedure in FLaRe [14], first training our policy on expert trajectories collected from 1M randomized embodiments and subsequently fine-tuning it with on-policy reinforcement learning (RL) in the simulator.

Our results demonstrate generalization to truly unseen embodiments. RING transfers to diverse real-world embodiments without any adaptation, despite being trained entirely in simulation without access to the real robot configurations. We evaluate in a zero-shot setting across Stretch RE-1, LoCoBot, Unitree's Go1, and the RB-Y1 wheeled humanoid. RING achieves a 72.1% average success rate in simulation (a 16.7% absolute improvement on the CHORES-S benchmark) and 78.9% on real robot platforms, matching or even sur-

2026 IEEE International Conference on Robotics and Automation (ICRA 2026), June 1-5, 2026, Vienna, Austria. 979-8-3315-8160-2/26/$31.00 ©2026 IEEE

Index terms

Vision-Based Navigation, Imitation Learning, Reinforcement Learning

Related papers