MLA: A Multisensory Language–Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang
AI summary
Problem
Existing vision-language-action models struggle with complex, contact-rich robotic tasks because they rely primarily on 2D visual inputs and inefficient modality-specific encoders, limiting their ability to model physical dynamics and spatial dependencies.
Approach
MLA repurposes the LLM itself to directly align 2D images, 3D point clouds, and tactile signals through positional mapping, then predicts their future states via a lightweight decoder to improve physical reasoning and control.
Key results
- Encoder-free multimodal alignment mechanism integrating RGB, point cloud, and tactile data
- Future multisensory generation post-training strategy for dynamic reasoning
- Improvements of 12% and 24% over the previous state-of-the-art 2D and 3D VLA methods, respectively, in real-world contact-rich tasks
- Strong generalization to unseen objects, backgrounds, and dual-arm configurations
Why it matters
It enables more robust and efficient robotic manipulation in complex physical environments, offering a scalable path forward for embodied AI and autonomous robotics research.
Abstract
Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language–action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA’s understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: https://robotic-mla.github.io/
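The abstract does not say how the positional correspondence between 3D points and 2D image tokens is computed. The sketch below shows one plausible reading, not the authors' implementation: it assumes a pinhole camera, points already expressed in the camera frame, and a ViT-style grid of square image patches. Each 3D point is projected onto the image and assigned the index of the patch it lands in, so that a point token and an image token covering the same location can share a position. The function name, parameters, and the `-1` convention are all hypothetical.

```python
from typing import List, Sequence, Tuple


def patch_index_for_points(
    points_cam: Sequence[Tuple[float, float, float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    image_hw: Tuple[int, int],
    patch: int,
) -> List[int]:
    """Return, for each 3D point, the row-major index of the image patch it projects into.

    Assumptions (not from the paper): pinhole intrinsics (fx, fy, cx, cy),
    points given in the camera frame with +z pointing forward, and a grid of
    non-overlapping patch x patch pixel patches. Points behind the camera or
    outside the image get -1.
    """
    height, width = image_hw
    cols = width // patch
    rows = height // patch
    indices: List[int] = []
    for x, y, z in points_cam:
        if z <= 0.0:
            indices.append(-1)  # behind the camera: no matching patch
            continue
        # Pinhole projection to pixel coordinates.
        u = fx * x / z + cx
        v = fy * y / z + cy
        if not (0.0 <= u < width and 0.0 <= v < height):
            indices.append(-1)  # projects outside the image
            continue
        # Clamp in case the image size is not a multiple of the patch size.
        col = min(int(u // patch), cols - 1)
        row = min(int(v // patch), rows - 1)
        indices.append(row * cols + col)
    return indices


if __name__ == "__main__":
    pts = [(0.0, 0.0, 1.0), (-0.3, -0.3, 1.0), (0.3, -0.3, 1.0), (0.0, 0.0, -1.0)]
    print(patch_index_for_points(pts, 100.0, 100.0, 32.0, 32.0, (64, 64), 16))
```

In a setup like this, the returned index could be used to give a point-cloud token the same positional embedding as its matching image patch. How tactile tokens are placed is not described in the abstract, so the sketch leaves them out.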