ICRA 2026

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Jongwoo Park, Kanchana Ranasinghe, Jinhyeok Jang, Cristina Mata, Yoo Sung Jang, Michael S. Ryoo

AI summary

Injecting built-in vision encoder affinity hints at inference time restores lost 2D spatial structure in VLA models, consistently boosting robot manipulation success without retraining.

Vision-Language-Action, spatial understanding, training-free, affinity hints, robot manipulation, inference-time intervention

Problem

Flattening 2D image patches into 1D token sequences in Vision-Language-Action models erodes critical spatial cues and object boundaries, hindering precise robotic manipulation. Existing solutions typically require extensive retraining or specialized external modules to recover this lost structure.
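To make the flattening problem concrete, the short sketch below shows how a row-major flatten changes neighbor distances. The 4x4 grid size and the helper name flatten_index are illustrative assumptions, not details from the paper; real vision encoders use larger patch grids.

```python
import numpy as np

def flatten_index(row, col, width):
    """Row-major position of patch (row, col) in the 1D token sequence."""
    return row * width + col

# Hypothetical 4x4 patch grid, chosen only for illustration.
H, W = 4, 4
grid = np.arange(H * W).reshape(H, W)  # grid[r, c] == flatten_index(r, c, W)
tokens = grid.reshape(-1)              # the 1D sequence the language model sees

# Horizontally adjacent patches stay adjacent in the sequence ...
horizontal_gap = flatten_index(1, 2, W) - flatten_index(1, 1, W)
# ... but vertically adjacent patches end up a full row width apart.
vertical_gap = flatten_index(2, 1, W) - flatten_index(1, 1, W)
```

Two patches that touch vertically in the image are W positions apart in the sequence, so an object spanning several rows is scattered across the token stream. This is one simple way to see why 2D structure is weakened after flattening.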

Approach

IVRA extracts patch-wise affinity maps from the model's own frozen vision encoder at inference time and injects them into a selected language-model layer, one in which instance-level features reside, to re-weight visual tokens. This lightweight intervention restores instance-level spatial coherence without modifying any model parameters or requiring additional training data.
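The idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine-similarity affinity, the softmax temperature, the blending weight alpha, and the function names patch_affinity and inject_affinity are all hypothetical choices made here, since this summary does not specify the exact affinity or injection rule.

```python
import numpy as np

def patch_affinity(patch_feats, temperature=0.1):
    """Row-normalized affinity between patches (assumed: softmax over cosine similarity).

    patch_feats: (N, D) array of patch features from a frozen vision encoder.
    Returns an (N, N) matrix whose rows sum to 1.
    """
    norms = np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8
    normed = patch_feats / norms
    logits = (normed @ normed.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def inject_affinity(hidden, visual_slice, affinity, alpha=0.5):
    """Blend each visual token with an affinity-weighted average of all visual tokens.

    hidden: (T, C) hidden states at one language-model layer.
    visual_slice: positions of the N visual tokens within the sequence.
    alpha: hypothetical mixing strength; alpha=0 leaves hidden unchanged.
    Non-visual tokens are never modified, and no parameters are updated.
    """
    out = hidden.copy()
    vis = hidden[visual_slice]
    out[visual_slice] = (1.0 - alpha) * vis + alpha * (affinity @ vis)
    return out

# Toy usage with random data: 16 patches, 2 text tokens before and after.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))     # stand-in for vision-encoder patch features
hidden = rng.normal(size=(20, 32))   # stand-in for one layer's hidden states
visual_slice = slice(2, 18)

A = patch_affinity(feats)
new_hidden = inject_affinity(hidden, visual_slice, A)
```

Because the affinity comes from the encoder's own patch features and the mixing happens only in the forward pass, a scheme like this needs no extra parameters or training data, which matches the training-free framing of the approach.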

Key results

+4.2% average success gain on 2D VIMA benchmarks over LLaRA in a low-data regime
Consistent performance lifts across 3D LIBERO suites for OpenVLA and FLOWER, even near accuracy saturation
Up to +30% zero-shot success improvement on challenging real-world manipulation tasks
Broad generalization across 2D/3D environments and multiple VLA architectures without retraining

Why it matters

Provides a plug-and-play, parameter-free upgrade for existing VLA robot policies, enabling more precise spatial reasoning and reliable real-world deployment with minimal computational overhead.

Abstract

Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model’s built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% → 97.1%). Code and visualizations are available at: jongwoopark7978.github.io/IVRA

Index terms

Deep Learning for Visual Perception, Visual Learning, Recognition