ICRA 2026

Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation

Xiangyi Wei, Haotian Zhang, Xinyi Cao, Siyu Xie, Weifeng Ge, Yang Li, Changbo Wang

AI summary

Integrating contact audio into VLA models allows robots to perceive dynamic interaction processes and contact events that are invisible to vision-only systems.

VLA models, contact audio, robotic manipulation, multimodal perception, TCR metric

Problem

Vision-only VLA models cannot adequately capture rich dynamic information like contact events, while tactile sensors are often expensive and have low sampling frequencies.

Approach

A multimodal policy that combines DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the backbone, fine-tuned with LoRA. The paper also introduces a new Task Completion Rate (TCR) metric.
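To make the LoRA step concrete, here is a minimal sketch of the standard low-rank update, in which a frozen weight W is adapted as W + (alpha / r) * B A and only the small matrices A and B are trained. The dimensions, rank, scaling value, and all function and variable names below are illustrative assumptions; the summary does not state which layers are adapted or which hyperparameters the authors use.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w_frozen, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A for a LoRA-adapted linear layer."""
    rank = len(lora_a)                  # A has shape (r, d_in)
    delta = matmul(lora_b, lora_a)      # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]


rng = random.Random(0)
D_OUT, D_IN, RANK, ALPHA = 8, 8, 2, 16  # toy sizes (assumed), not the paper's

# Frozen pre-trained weight: never updated during fine-tuning.
w_frozen = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]
# Trainable low-rank factors: A starts small and random, B starts at zero,
# so the adapted layer initially behaves exactly like the pre-trained one.
lora_a = [[rng.gauss(0, 0.01) for _ in range(D_IN)] for _ in range(RANK)]
lora_b = [[0.0] * RANK for _ in range(D_OUT)]

w_start = lora_effective_weight(w_frozen, lora_a, lora_b, ALPHA)

trainable_params = RANK * (D_IN + D_OUT)
full_params = D_IN * D_OUT
print(trainable_params, "trainable vs", full_params, "in the full matrix")
```

The parameter saving grows with layer size: the number of trainable values scales with r * (d_in + d_out) rather than d_in * d_out, which is why low-rank adaptation is a common choice for fine-tuning large pre-trained encoders and language backbones.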

Key results

Superior performance over vision-only methods in LIBERO and RLBench benchmarks
At least three-fold improvement in real-world task success rates under seen and unseen conditions
Enhanced capability to handle contact-intensive manipulation tasks
Development of the TCR metric to systematically evaluate dynamic operational processes

Why it matters

Offers a low-cost, high-frequency alternative to tactile sensing that improves precision and robustness in contact-rich robotic interactions.

Abstract

Vision-Language-Action (VLA) models have recently achieved significant advances in robotic manipulation. However, vision-only VLA models have fundamental limitations, particularly in perceiving the dynamic processes of interaction and manipulation. This paper proposes Audio-VLA, a multimodal manipulation policy that leverages contact audio to perceive contact events and dynamic process feedback. Audio-VLA demonstrates that acoustic feedback can complement visual perception in contact-rich manipulation tasks. Additionally, this paper introduces the Task Completion Rate (TCR) metric to systematically evaluate dynamic operational processes. Audio-VLA employs pre-trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine-tuning to these pre-trained modules to achieve robust cross-modal understanding of both visual and acoustic inputs. A multimodal projection layer aligns features from different modalities into the same feature space. Moreover, the RLBench and LIBERO simulation environments are enhanced with collision-based audio generation to provide realistic sound feedback during object interactions. Since current robotic manipulation evaluations focus on final outcomes rather than providing systematic assessment of dynamic operational processes, the proposed TCR metric measures how well robots perceive dynamic processes during manipulation, creating a more comprehensive evaluation metric. Extensive experiments on LIBERO, RLBench, and two real-world tasks demonstrate Audio-VLA’s superior performance over vision-only comparative methods, while the TCR metric effectively quantifies dynamic process perception capabilities. The source code and pre-trained models are publicly available at https://wxone.github.io/AudioVLA.
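As a rough illustration of what aligning modalities into one feature space can look like, the sketch below projects vision and audio tokens of different widths to a shared model width with one linear map per modality, then concatenates them with text embeddings into a single sequence. This is an assumed design for illustration only: the abstract does not specify the projection architecture, the token counts, the feature dimensions, or the ordering of modalities, and every name and size here is hypothetical.

```python
import random


def init_linear(d_in, d_out, rng):
    """Create a (d_out x d_in) weight matrix and a zero bias vector."""
    weight = [[rng.gauss(0, d_in ** -0.5) for _ in range(d_in)]
              for _ in range(d_out)]
    return weight, [0.0] * d_out


def linear(tokens, weight, bias):
    """Apply y = W x + b independently to every token in the list."""
    return [[sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(weight, bias)]
            for tok in tokens]


def random_tokens(count, dim, rng):
    """Stand-in for encoder outputs: `count` feature vectors of width `dim`."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
D_VISION, D_AUDIO, D_MODEL = 12, 8, 16   # toy widths (assumed), not the paper's

vision_tokens = random_tokens(5, D_VISION, rng)  # e.g. visual encoder patches
audio_tokens = random_tokens(3, D_AUDIO, rng)    # e.g. audio encoder frames
text_tokens = random_tokens(4, D_MODEL, rng)     # already in the model's space

# One projection per modality maps its features to the shared width.
w_vision, b_vision = init_linear(D_VISION, D_MODEL, rng)
w_audio, b_audio = init_linear(D_AUDIO, D_MODEL, rng)

# Concatenate along the sequence axis so the backbone sees one token stream.
sequence = (linear(vision_tokens, w_vision, b_vision)
            + linear(audio_tokens, w_audio, b_audio)
            + text_tokens)

print(len(sequence), "tokens of width", len(sequence[0]))
```

Once every token has the same width, the language model backbone can attend across visual, acoustic, and text tokens without any modality-specific handling.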

Index terms

Deep Learning in Grasping and Manipulation; Contact Modeling; Imitation Learning