ICRA 2026

OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine

AI summary

Training a single Vision-Language-Action model on diverse, multi-modal goal specifications yields a more generalizable and adaptable robot navigation policy than single-modality specialists.

Vision-Language-Action, Omni-modal navigation, Robot foundation models, Multi-modal conditioning, Generalization, Fine-tuning

Problem

Most robotic navigation policies are trained on a single goal modality, limiting their adaptability to real-world scenarios where users naturally combine language, poses, and visual references.

Approach

The authors train a Vision-Language-Action model on nearly 9,500 hours of data from 10 robot platforms using a randomized modality dropout strategy, enabling flexible conditioning on 2D poses, egocentric images, natural language, or their combinations.
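As a rough illustration of what a randomized modality dropout step could look like, the sketch below samples, for each training example, a random non-empty subset of the goal modalities that example provides. It is not the authors' implementation: the function name `sample_goal_mask`, the `keep_prob` parameter, and the modality labels are hypothetical and chosen only for illustration.

```python
import random

# The three goal types named in the text; the string labels are assumptions.
MODALITIES = ("pose", "image", "language")


def sample_goal_mask(available, rng, keep_prob=0.5):
    """Return a {modality: bool} mask over a random non-empty subset of `available`.

    `available` is the set of goal modalities that a given training sample
    actually provides. `keep_prob` is a hypothetical hyperparameter.
    """
    present = [m for m in MODALITIES if m in available]
    if not present:
        raise ValueError("sample provides no goal modality")
    # Keep each present modality independently with probability keep_prob.
    kept = [m for m in present if rng.random() < keep_prob]
    if not kept:
        # Never drop every goal: fall back to one modality chosen at random.
        kept = [rng.choice(present)]
    return {m: (m in kept) for m in MODALITIES}


rng = random.Random(0)
# A sample with a goal pose and a language instruction but no goal image.
mask = sample_goal_mask({"pose", "language"}, rng)
print(mask)
```

Under this scheme a dataset that carries only one kind of goal label still contributes training samples, while samples with several labels are seen under varying combinations.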

Key results

Surpasses single-modality specialist baselines across all goal types
Achieves strong generalization to unseen environments and novel language instructions
Maintains robust performance under modality dropout and scarcity
Efficiently fine-tunes to new modalities and embodiments with limited data

Why it matters

It establishes a scalable foundation for building flexible, broadly generalizable navigation policies that adapt to diverse real-world user inputs and hardware constraints.

Abstract

Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models.

Index terms

Deep Learning Methods, Vision-Based Navigation