Enhancing VLA Precision in Robotic Manipulation Via FiLM-Based Force/Torque-Vision Integration
Gunhee Nam, Ayoung Hong
AI summary
Problem
Vision-Language-Action models struggle with high-precision, contact-rich manipulation because visual perception alone cannot sense physical resistance and is degraded by self-occlusion, leading to insertion failures and hardware strain.
Approach
The authors introduce a ForceEncoder that converts 6-axis force/torque signals into modulation parameters, which dynamically adjust visual feature representations within a pre-trained VLA model using Feature-wise Linear Modulation.
Key results
- Successful integration of 6-axis F/T data into the π0.5 VLA architecture via FiLM
- Improved contact stability and insertion precision over vision-only baselines
- Real-time adaptive retraction and pause behaviors upon detecting physical resistance
- Experimental validation on a UR5e manipulator using 270 expert demonstration episodes
Why it matters
This approach provides a computationally efficient pathway to safer, more precise robotic assembly and manipulation without requiring heavy model retraining or architectural overhauls.
Abstract
We propose a multimodal integration framework to enhance the precision of Vision-Language-Action (VLA) models in contact-rich robotic tasks. Although visual perception is essential for task grounding, it often lacks the force awareness required for high-precision alignment and insertion. To address this limitation, we leverage Feature-wise Linear Modulation (FiLM) to condition intermediate visual representations on 6-axis Force/Torque (F/T) data. This lightweight fusion strategy allows the model to modulate its action predictions based on real-time physical resistance without incurring significant computational overhead. Experimental results on a UR5e manipulator demonstrate that the proposed F/T-Vision integration enhances contact stability and precision in demanding manipulation tasks compared with vision-only baselines.
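The FiLM conditioning described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The layer sizes, the two-layer encoder, the `1 + delta` parameterization of the scale, and all function names (`make_linear`, `force_encoder`, `film`) are assumptions made for illustration only. The sketch shows the general FiLM pattern: a small encoder maps a 6-axis F/T reading to a per-channel scale `gamma` and shift `beta`, which are then applied to every visual token as `gamma * feature + beta`.

```python
import random


def make_linear(in_dim, out_dim, rng, scale=0.1):
    """Create a hypothetical dense layer: small random weights, zero bias."""
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer to a single input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def force_encoder(ft, hidden_layer, out_layer, channels):
    """Map a 6-axis F/T reading to FiLM parameters (gamma, beta).

    Assumption: gamma is parameterized as 1 + delta, so a zero F/T reading
    (with zero biases) leaves the visual features unchanged.
    """
    hidden = [max(0.0, h) for h in linear(ft, hidden_layer)]  # ReLU
    out = linear(hidden, out_layer)
    gamma = [1.0 + d for d in out[:channels]]
    beta = out[channels:]
    return gamma, beta


def film(features, gamma, beta):
    """Apply the same per-channel scale and shift to every visual token."""
    return [[g * f + b for f, g, b in zip(token, gamma, beta)]
            for token in features]


rng = random.Random(0)
channels = 4  # illustrative; real visual feature widths are far larger
hidden_layer = make_linear(6, 8, rng)
out_layer = make_linear(8, 2 * channels, rng)

# Two made-up visual tokens and one made-up F/T reading (Fx, Fy, Fz, Tx, Ty, Tz).
visual_tokens = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
ft_reading = [0.0, 0.0, -3.0, 0.0, 0.1, 0.0]

gamma, beta = force_encoder(ft_reading, hidden_layer, out_layer, channels)
modulated = film(visual_tokens, gamma, beta)
```

Because the modulation adds only one scale and one shift per feature channel, its cost is small relative to the visual backbone, which is consistent with the lightweight fusion the abstract describes. The identity-at-zero parameterization is one common way to avoid disturbing a pre-trained model at the start of fine-tuning; whether the paper uses it is not stated here.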