ICRA 2026

OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation

Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu

AI summary

OmniVLA enables robots to perform complex manipulation tasks beyond RGB vision by unifying infrared, mmWave, and acoustic sensor data into a single image-native format, achieving an 84% average task success rate on real-world tasks.

Vision-Language-Action, Multi-Sensor Fusion, Robotic Manipulation, Sensor-Masked Images, Beyond-RGB Perception, Foundation Models

Problem

Most Vision-Language-Action (VLA) models rely solely on RGB cameras. This limits their perception and prevents them from handling tasks that depend on physical cues RGB cannot capture, such as temperature, objects hidden behind occluders, or sound.

Approach

The authors introduce sensor-masked images, a unified representation that overlays spatially grounded, heatmap-like masks derived from infrared, mmWave, and acoustic data onto RGB images. This lets existing VLA backbones process multi-sensor inputs efficiently with lightweight per-sensor projectors.
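The overlay idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sensor_masked_image`, the blue-to-red colormap, and the `alpha` and `threshold` parameters are hypothetical choices made for illustration, and the sensor map is assumed to be already registered to the camera frame.

```python
import numpy as np

def sensor_masked_image(rgb, sensor_map, alpha=0.5, threshold=0.2):
    """Blend a sensor heatmap onto an RGB image (illustrative sketch only).

    rgb:        uint8 array of shape (H, W, 3)
    sensor_map: float array of shape (H, W), assumed to be spatially
                aligned with the RGB frame (e.g. per-pixel temperature)
    alpha:      blend strength of the mask (hypothetical parameter)
    threshold:  normalized level below which the mask is not drawn
    """
    image = rgb.astype(np.float32) / 255.0
    sensor_map = np.asarray(sensor_map, dtype=np.float32)

    # Normalize sensor readings to [0, 1]; a constant map yields no mask.
    lo, hi = float(sensor_map.min()), float(sensor_map.max())
    if hi > lo:
        norm = (sensor_map - lo) / (hi - lo)
    else:
        norm = np.zeros_like(sensor_map)

    # Simple blue-to-red colormap: weak signal is blue, strong signal is red.
    heat = np.stack([norm, np.zeros_like(norm), 1.0 - norm], axis=-1)

    # Draw the mask only where the signal exceeds the threshold, so the
    # rest of the image keeps its original RGB values.
    weight = alpha * (norm >= threshold).astype(np.float32)[..., None]
    blended = (1.0 - weight) * image + weight * heat
    return (blended * 255.0).round().astype(np.uint8)
```

Because the result is still an ordinary `(H, W, 3)` image, it can be passed through the same image preprocessing as an RGB frame, which is the property the summary attributes to the image-native format.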

Key results

84% average task success rate on real-world manipulation tasks
Outperforms RGB-only and raw-sensor-input baselines by 59% and 28%, respectively
High data efficiency, matching baseline performance with only 50% of training data
Strong generalization across three unseen tasks

Why it matters

It provides a scalable, data-efficient framework for equipping robots with beyond-RGB perception, enabling them to execute complex real-world tasks that require understanding temperature, occlusions, and ambient sound.

Abstract

Vision-language-action (VLA) models have shown strong generalization in robotic manipulation through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities to enable beyond-RGB robotic perception and manipulation. The core of our approach is the sensor-masked image, a unified representation that overlays physically meaningful, spatially grounded masks onto the RGB images. These masks are derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this, we design a multimodal vision-language-action model architecture and train OmniVLA by extending an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks that require sensor-modality perception to guide the manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming both RGB-only and raw-sensor-input baseline models by 59% and 28%, respectively, while also showing higher learning efficiency and stronger generalization capability.
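One way to picture the "lightweight per-sensor projectors" is as a small mapping per modality that brings image features into a shared token space before they reach the backbone. The sketch below assumes a single linear layer per sensor and pre-computed patch features from a shared image encoder; the abstract does not specify either detail, and `SensorProjector`, `build_token_sequence`, and all dimensions are hypothetical names and values.

```python
import numpy as np

class SensorProjector:
    """Hypothetical per-sensor projector: a single linear layer."""

    def __init__(self, in_dim, out_dim, rng):
        # Small random initialization; a real model would learn these weights.
        self.weight = rng.normal(0.0, 0.02, size=(in_dim, out_dim)).astype(np.float32)
        self.bias = np.zeros(out_dim, dtype=np.float32)

    def __call__(self, feats):
        # feats: (num_patches, in_dim) -> (num_patches, out_dim)
        return feats @ self.weight + self.bias

def build_token_sequence(features_by_sensor, projectors):
    """Project each sensor's patch features with its own projector and
    concatenate the results along the token axis."""
    tokens = [projectors[name](feats) for name, feats in features_by_sensor.items()]
    return np.concatenate(tokens, axis=0)

# Usage with made-up sizes: 16 patches of 64-dim features per sensor,
# projected into a 128-dim token space.
rng = np.random.default_rng(0)
sensors = ("rgb", "infrared", "mmwave", "acoustic")
projectors = {name: SensorProjector(64, 128, rng) for name in sensors}
features = {name: rng.normal(size=(16, 64)).astype(np.float32) for name in sensors}
tokens = build_token_sequence(features, projectors)
```

In this arrangement only the projectors are sensor-specific, so adding a modality means adding one small module rather than a new encoder, which is consistent with the data-efficiency claim but is an interpretation rather than a documented design.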

Index terms

AI-Enabled Robotics, AI-Based Methods, Deep Learning in Grasping and Manipulation