ICRA 2026

RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models

Jiyeon Koo, Taewan Cho, Hyunjoon Kang, Eunseom Pyo, Taegyun Oh, Taeryang Kim, Andrew Jaeyong Choi

AI summary

Repurposing discarded Vision Transformer Register Tokens as spatial context injectors significantly boosts the performance of lightweight Vision-Language-Action models without adding parameters.
Vision-Language-Action, Spatial Reasoning, Register Tokens, Model Efficiency, Robotics, Lightweight Models

Problem

Lightweight Vision-Language-Action models sacrifice 3D spatial reasoning and global scene understanding to achieve computational efficiency, limiting their real-world robotic deployment.

Approach

RetoVLA recycles the learnable Register Tokens of a Vision Transformer. These tokens absorb global context during encoding and are normally discarded afterward; RetoVLA instead routes them directly into the action-planning module through a dedicated spatial injection path with a learnable gating mechanism.
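The routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token layout (registers at the end of the encoder output), the projection-free cross-attention, the scalar sigmoid gate, and the names `split_vit_outputs` and `inject_registers` are all assumptions made for illustration.

```python
import math


def split_vit_outputs(tokens, num_registers):
    """Split encoder outputs into (patch_tokens, register_tokens).

    Assumed layout: the registers are the last `num_registers` tokens.
    Real backbones may place them elsewhere in the sequence.
    """
    if not 0 < num_registers < len(tokens):
        raise ValueError("num_registers must be between 1 and len(tokens) - 1")
    return tokens[:-num_registers], tokens[-num_registers:]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inject_registers(action_tokens, register_tokens, gate_logit):
    """Add gated register context to each action token.

    Each action token attends over the register tokens (scaled dot
    product, no projection matrices), and the attended context is added
    back as a residual scaled by sigmoid(gate_logit). The scalar gate is
    a hypothetical stand-in for the learnable gating mechanism.
    """
    dim = len(register_tokens[0])
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    injected = []
    for query in action_tokens:
        scores = [
            sum(q * r for q, r in zip(query, reg)) / math.sqrt(dim)
            for reg in register_tokens
        ]
        weights = softmax(scores)
        context = [
            sum(w * reg[i] for w, reg in zip(weights, register_tokens))
            for i in range(dim)
        ]
        injected.append([q + gate * c for q, c in zip(query, context)])
    return injected


# Toy example: two patch tokens followed by two register tokens.
encoder_out = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
patches, registers = split_vit_outputs(encoder_out, num_registers=2)
actions = inject_registers([[1.0, 1.0]], registers, gate_logit=0.0)
```

With a strongly negative `gate_logit` the gate closes and the action tokens pass through almost unchanged, so in this sketch the injection path can fall back to the baseline behaviour when the register context is not useful.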

Key results

  • +17.1 percentage points (%p) in average real-world success rate over the baseline
  • Recovers 3D spatial reasoning without adding parameters
  • Causal validation of token utility via attention and ablation analysis
  • Consistent performance gains across simulation, benchmark, and hardware
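The "%p" unit in the first result denotes percentage points, which is the absolute difference between two success rates rather than a relative change. The short sketch below shows the distinction. The rates in it are made up for illustration and are not values from the paper, and the function names are hypothetical.

```python
def percentage_point_gain(baseline_rate, new_rate):
    """Absolute difference between two rates that are given in percent."""
    return new_rate - baseline_rate


def relative_gain_percent(baseline_rate, new_rate):
    """Change of the new rate relative to the baseline, in percent."""
    return (new_rate - baseline_rate) * 100.0 / baseline_rate


# Hypothetical success rates in percent, chosen only for illustration.
baseline, improved = 50.0, 60.0
pp_gain = percentage_point_gain(baseline, improved)    # 10.0 percentage points
rel_gain = relative_gain_percent(baseline, improved)   # 20.0 percent relative
```

The same absolute gain corresponds to a larger relative gain when the baseline rate is low, which is why the two figures should not be read interchangeably.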

Why it matters

Enables efficient, spatially-aware robotic agents for real-time deployment by extracting hidden spatial value from standard vision backbones.

Abstract

Vision-Language-Action (VLA) models have demonstrated robust performance across diverse robotic tasks. However, their high memory and computational demands often limit real-time deployment. While existing model compression techniques reduce the parameter footprint, they often degrade 3D spatial reasoning and scene layout understanding. This work introduces RetoVLA, an architecture designed to maintain spatial awareness in lightweight models by repurposing Register Tokens, learnable parameters originally introduced to mitigate attention artifacts in Vision Transformers. While these tokens are generally discarded once used, we repurpose them for their dense representation of global spatial context. RetoVLA integrates these recycled tokens directly into the action-planning module through a dedicated spatial context injection path. Our proposed design enables the recovery of global context without increasing the total parameter count. Real-world experiments using a 7-DOF manipulator show a 17.1%p improvement in average success rates over the baseline. Our results demonstrate that leveraging internal register tokens provides a highly effective mechanism for developing efficient, spatially-aware robotic agents. A video demonstration is available at: https://youtu.be/2CseBR-snZg

Index terms

Deep Learning in Grasping and Manipulation, Semantic Scene Understanding, Deep Learning for Visual Perception