ICRA 2026

MLA: A Multisensory Language–Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang

AI summary

MLA outperforms the previous state-of-the-art 2D and 3D vision-language-action models by 12% and 24%, respectively, on complex, contact-rich real-world tasks by unifying multisensory inputs without modality-specific encoders and by predicting future multisensory states.

vision-language-action, multisensory alignment, robotic manipulation, future state prediction, encoder-free learning, embodied AI

Problem

Existing vision-language-action models struggle with complex, contact-rich robotic tasks because they rely primarily on 2D visual inputs and inefficient modality-specific encoders, limiting their ability to model physical dynamics and spatial dependencies.

Approach

MLA repurposes the LLM itself to directly align 2D images, 3D point clouds, and tactile signals through positional mapping, then predicts their future states via a lightweight decoder to improve physical reasoning and control.
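The summary does not say how the positional mapping is computed. As a purely illustrative sketch, and not the authors' implementation, one way to give a 3D point the same position as a 2D image patch is to project it through a pinhole camera model and look up the patch it lands in. The function name, the intrinsics (`fx`, `fy`, `cx`, `cy`), the image size, and the patch size below are all hypothetical.

```python
from typing import List, Optional, Tuple


def point_to_patch_index(
    point_cam: Tuple[float, float, float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    image_size: Tuple[int, int],
    patch_size: int,
) -> Optional[int]:
    """Return the row-major index of the image patch that a 3D point
    (in camera coordinates) projects into, or None if it is not visible.

    Hypothetical helper: assumes a pinhole camera and a square patch grid.
    """
    x, y, z = point_cam
    if z <= 0:
        return None  # point is behind the camera

    # Standard pinhole projection to pixel coordinates.
    u = fx * x / z + cx
    v = fy * y / z + cy

    width, height = image_size
    if not (0 <= u < width and 0 <= v < height):
        return None  # projects outside the image

    patches_per_row = width // patch_size
    patches_per_col = height // patch_size
    col = int(u) // patch_size
    row = int(v) // patch_size
    if col >= patches_per_row or row >= patches_per_col:
        return None  # lands in a partial patch at the image border

    return row * patches_per_row + col


def assign_position_ids(
    points_cam: List[Tuple[float, float, float]],
    **camera,
) -> List[Optional[int]]:
    """Map every point to the position index of its corresponding image patch."""
    return [point_to_patch_index(p, **camera) for p in points_cam]


# Assumed example camera: 224x224 image split into 16x16 patches (14 per row).
camera = dict(fx=100.0, fy=100.0, cx=112.0, cy=112.0,
              image_size=(224, 224), patch_size=16)
ids = assign_position_ids(
    [(0.0, 0.0, 1.0), (-1.0, -1.0, 1.0), (0.0, 0.0, -1.0)], **camera
)
```

Under these assumptions, a point token and the image patch it projects into would share one position index, which is one plausible reading of aligning modalities through positional correspondence. A tactile token would need its own rule, for example the patch containing the projected contact location.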

Key results

Encoder-free multimodal alignment mechanism integrating RGB, point cloud, and tactile data
Future multisensory generation post-training strategy for dynamic reasoning
12% and 24% improvements over the previous state-of-the-art 2D and 3D VLA methods, respectively, in real-world contact-rich tasks
Strong generalization to unseen objects, backgrounds, and dual-arm configurations

Why it matters

It enables more robust and efficient robotic manipulation in complex physical environments, offering a scalable path forward for embodied AI and autonomous robotics research.

Abstract

Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language–action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA’s understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: https://robotic-mla.github.io/

Index terms

Deep Learning in Grasping and Manipulation; Imitation Learning; Representation Learning