Adaptive Capacity Allocation for Vision Language Action Fine-Tuning
Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, Byonghyo Shim
AI summary
Problem
Fixed-rank LoRA fails to generalize across diverse robotics tasks and unseen embodiments because robotics adaptation requires a higher and more task-dependent rank than language-model fine-tuning. This mismatch causes cross-task interference and forces costly manual rank tuning.
Approach
LoRA-SP replaces static LoRA ranks with a router that scores a shared vector bank, dynamically selecting only the most relevant basis vectors per input and layer based on a cumulative energy threshold and a spectral concentration loss.
Key results
- Matches or exceeds full fine-tuning accuracy with significantly fewer trainable parameters
- Improves multi-task success rates by up to 31.6% over standard LoRA
- Demonstrates robustness to rank choice across unseen robotic embodiments
- Reduces cross-task interference by dynamically pruning low-energy adapter vectors
Why it matters
Provides a practical, rank-agnostic fine-tuning framework that enables efficient and robust deployment of vision-language-action models on diverse, unseen robotic hardware.
Abstract
Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., r ∈ {4, 8}), while spectral analyses indicate VLAs may require much larger ranks (e.g., r ≈ 128) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select–Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores E(k) ≥ η, providing a direct link to approximation error via our spectral analysis. During training, η concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones (π0 and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
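The energy-target selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_active_set` is hypothetical, and the sketch assumes that E(k) is the cumulative squared score of the k largest router scores, normalized by the total squared score, so that η lies in (0, 1].

```python
def select_active_set(scores, eta):
    """Pick the smallest set of bank vectors whose energy reaches eta.

    Assumption (not confirmed by the abstract): E(k) is the sum of the k
    largest squared scores divided by the sum of all squared scores.
    Returns indices into `scores`, ordered from highest to lowest score.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must be in (0, 1]")
    if any(s < 0 for s in scores):
        raise ValueError("router scores must be nonnegative")

    total = sum(s * s for s in scores)
    if total == 0.0:
        return []  # no energy anywhere, so nothing is selected

    # Visit vectors from the largest score to the smallest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    active, energy = [], 0.0
    for i in order:
        active.append(i)
        energy += scores[i] ** 2
        if energy / total >= eta:
            break  # energy target reached; remaining vectors are pruned
    return active


# Hypothetical router scores for one input at one layer.
scores = [3.0, 0.5, 4.0, 0.1]
active = select_active_set(scores, eta=0.9)
```

With these scores, the two largest entries carry most of the squared energy, so a target of η = 0.9 keeps only those two vectors and prunes the rest. Raising η toward 1 keeps more vectors, which is how the effective rank can vary per input and per layer under this reading.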