ICRA 2026

Adaptive Capacity Allocation for Vision Language Action Fine-Tuning

Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, Byonghyo Shim

AI summary

LoRA-SP dynamically allocates fine-tuning capacity per input and layer, matching or exceeding full fine-tuning performance with far fewer trainable parameters and improving multi-task robot success by up to 31.6% over standard LoRA.

Vision-language-action models; Parameter-efficient fine-tuning; LoRA; Rank adaptation; Multi-task robotics; Adaptive capacity

Problem

Fixed-rank LoRA transfers poorly across diverse robotics tasks and unseen embodiments because the adaptation rank these settings require is higher and more task-dependent than in language-model fine-tuning. This mismatch causes cross-task interference and forces costly manual rank tuning.

Approach

LoRA-SP replaces static LoRA ranks with a router that scores a shared vector bank. For each input and layer, it keeps only the basis vectors needed to reach a cumulative energy threshold, and a spectral concentration loss trains the router to rely on fewer vectors.
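The energy-threshold selection described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `select_active`, the normalization by total energy, and the choice to rank vectors by squared score are all assumptions made for the example.

```python
def select_active(scores, eta):
    """Pick the smallest set of bank vectors whose cumulative squared
    score reaches a fraction `eta` of the total energy.

    `scores` are nonnegative router scores for one input at one layer
    (hypothetical values); `eta` is the energy target in (0, 1].
    Returns the selected indices, highest energy first.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must be in (0, 1]")
    energy = [s * s for s in scores]
    total = sum(energy)
    if total == 0.0:
        return []  # no energy anywhere: nothing to activate
    # Rank vectors by energy (assumption: selection is greedy by score).
    order = sorted(range(len(scores)), key=lambda i: energy[i], reverse=True)
    active, cumulative = [], 0.0
    for i in order:
        active.append(i)
        cumulative += energy[i]
        if cumulative / total >= eta:
            break
    return active


# Illustrative scores only; real scores come from the learned router.
example_scores = [3.0, 0.5, 2.0, 0.1]
example_active = select_active(example_scores, eta=0.9)
```

With these illustrative scores, two of the four vectors already carry more than 90% of the squared-score energy, so the other two would be pruned for this input. A lower `eta` shrinks the active set further, which is how the effective rank can vary per input and per layer.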

Key results

Matches or exceeds full fine-tuning accuracy with significantly fewer trainable parameters
Improves multi-task success rates by up to 31.6% over standard LoRA
Demonstrates robustness to rank choice across unseen robotic embodiments
Reduces cross-task interference by dynamically pruning low-energy adapter vectors

Why it matters

Provides a practical, rank-agnostic fine-tuning framework that enables efficient and robust deployment of vision-language-action models on diverse, unseen robotic hardware.

Abstract

Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., r ∈ {4, 8}), while spectral analyses indicate VLAs may require much larger ranks (e.g., r ≈ 128) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select–Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores E(k) ≥ η, providing a direct link to approximation error via our spectral analysis. During training, η concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones (π0 and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
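One plausible way to read the energy criterion E(k) ≥ η is written out below. The abstract does not give the exact definition, so the normalization by total energy, the descending sort of the scores s_(1) ≥ s_(2) ≥ … ≥ s_(r), and the orthonormality of the bank vectors are all assumptions made for this sketch. Under those assumptions, the standard truncated-SVD identity ties the energy target to the relative approximation error.

```latex
% Assumed notation: s_{(i)} are the router scores sorted in descending order,
% r is the size of the shared vector bank, and \Delta W_k keeps the top-k terms.
\[
  E(k) \;=\; \frac{\sum_{i=1}^{k} s_{(i)}^{2}}{\sum_{i=1}^{r} s_{(i)}^{2}},
  \qquad
  k^{\star} \;=\; \min\{\, k : E(k) \ge \eta \,\}.
\]
% If the bank vectors are orthonormal (an assumption), then
% \Delta W = \sum_{i=1}^{r} s_{(i)}\, u_i v_i^{\top} and truncation gives
\[
  \frac{\lVert \Delta W - \Delta W_{k} \rVert_F^{2}}{\lVert \Delta W \rVert_F^{2}}
  \;=\; 1 - E(k)
  \;\le\; 1 - \eta
  \quad \text{for } k \ge k^{\star}.
\]
```

Read this way, choosing η = 0.9 would bound the relative squared Frobenius error of the pruned update by 0.1, under the stated assumptions.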

Index terms

Deep Learning Methods; Transfer Learning; Machine Learning for Robot Control