IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance
Jongwoo Park, Kanchana Ranasinghe, Jinhyeok Jang, Cristina Mata, Yoo Sung Jang, Michael S. Ryoo
AI summary
Problem
Flattening 2D image patches into 1D token sequences in Vision-Language-Action models erodes critical spatial cues and object boundaries, hindering precise robotic manipulation. Existing solutions typically require extensive retraining or specialized external modules to recover this lost structure.
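The loss of spatial structure can be illustrated with a small sketch. It assumes row-major flattening of a 16x16 patch grid; the grid size and the helper names `flat_index` and `sequence_gap` are chosen for illustration and do not come from the paper. Under that assumption, vertically adjacent patches end up a full row apart in the token sequence, while two patches on opposite edges of the image become sequence neighbors.

```python
def flat_index(row, col, width):
    """Row-major position of patch (row, col) in the flattened token sequence."""
    return row * width + col


def sequence_gap(patch_a, patch_b, width):
    """Distance in the 1D sequence between two patches given as (row, col)."""
    return abs(flat_index(*patch_a, width) - flat_index(*patch_b, width))


GRID_WIDTH = 16  # assumed 16x16 patch grid, for illustration only

horizontal_gap = sequence_gap((3, 4), (3, 5), GRID_WIDTH)  # side-by-side patches
vertical_gap = sequence_gap((3, 4), (4, 4), GRID_WIDTH)    # vertically stacked patches
wrap_gap = sequence_gap((3, 15), (4, 0), GRID_WIDTH)       # opposite image edges

print(horizontal_gap, vertical_gap, wrap_gap)  # prints: 1 16 1
```

Sequence distance alone therefore cannot tell a true 2D neighbor from a patch on the far side of the image, so the model has to recover that structure from other cues.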
Approach
IVRA extracts patch-wise affinity maps from a frozen vision encoder at inference time and injects them into selected language model layers to re-weight visual tokens. This lightweight intervention restores instance-level spatial coherence without modifying any model parameters or requiring additional training data.
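The general idea of affinity-guided re-weighting can be sketched as follows. This is not the authors' implementation: the summary does not give the exact formula, so the sketch assumes cosine similarity as the affinity measure, a hard threshold to keep only strongly related patches, and a convex blend controlled by `alpha`. The function names, `alpha`, `threshold`, and the toy data are all hypothetical.

```python
import math


def cosine_affinity(features):
    """Pairwise cosine similarity between patch feature vectors."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    return [
        [sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
         for fj, nj in zip(features, norms)]
        for fi, ni in zip(features, norms)
    ]


def affinity_mix(tokens, affinity, alpha=0.5, threshold=0.5):
    """Blend each visual token with the affinity-weighted mean of similar tokens.

    alpha and threshold are illustrative knobs, not values from the paper.
    """
    mixed = []
    for i, token in enumerate(tokens):
        # Keep only strongly related patches (assumed thresholding step).
        weights = [a if a >= threshold else 0.0 for a in affinity[i]]
        total = sum(weights)
        if total == 0.0:
            # No related patch found: leave the token unchanged.
            mixed.append(list(token))
            continue
        pooled = [
            sum(w * t[d] for w, t in zip(weights, tokens)) / total
            for d in range(len(token))
        ]
        mixed.append([(1 - alpha) * x + alpha * p for x, p in zip(token, pooled)])
    return mixed


# Toy data: patches 0-1 lie on one object, patches 2-3 on another.
encoder_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
visual_tokens = [[2.0], [0.0], [10.0], [8.0]]  # stand-in for LLM hidden states

affinity = cosine_affinity(encoder_features)
mixed_tokens = affinity_mix(visual_tokens, affinity)
print(mixed_tokens)
```

In this toy setup, tokens on the same object are pulled toward each other while tokens on the other object fall below the threshold and contribute nothing. No parameters are learned or updated, which matches the training-free framing above.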
Key results
- +4.2% average success gain on 2D VIMA benchmarks over LLaRA in a low-data regime
- Consistent performance lifts across 3D LIBERO suites for OpenVLA and FLOWER, even near accuracy saturation
- Up to +30% zero-shot success improvement on challenging real-world manipulation tasks
- Broad generalization across 2D/3D environments and multiple VLA architectures without retraining
Why it matters
IVRA provides a plug-and-play, parameter-free upgrade for existing VLA robot policies, enabling more precise spatial reasoning and more reliable real-world deployment with minimal computational overhead.
Abstract
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% → 97.1%). Code and visualizations are available at: jongwoopark7978.github.io/IVRA