ICRA 2026

The Better You Learn, the Smarter You Prune: Towards Efficient Vision-Language-Action Models Via Differentiable Token Pruning

Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang

AI summary

LightVLA simultaneously boosts task success rates and cuts computational costs by learning to dynamically prune redundant visual tokens in vision-language-action models.

Vision-language-action models, token pruning, differentiable pruning, robotic efficiency, VLA acceleration, edge robotics

Problem

Vision-language-action (VLA) models process hundreds of visual tokens, and the resulting computational overhead hinders real-time deployment on resource-constrained robots. Existing pruning methods often degrade performance or fail to transfer effectively from vision-language models.

Approach

LightVLA introduces a parameter-free, performance-driven framework that generates dynamic queries via cross-attention with task instructions and uses Gumbel-softmax to enable differentiable, adaptive selection of the most informative visual tokens during fine-tuning.
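The selection step described above can be illustrated with a minimal sketch of the forward pass only. It assumes that each query scores every visual token by a dot product, picks one token through a Gumbel-softmax sample, and that the kept set is the union of the picks. The function names (`gumbel_softmax`, `select_tokens`), the dot-product scoring, and the toy vectors are illustrative assumptions, not the authors' implementation. Gradients and the straight-through trick used in training are omitted.

```python
import math
import random


def gumbel_noise(rng):
    # Sample Gumbel(0, 1) by inverse transform; clamp u to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Return (soft, hard): relaxed probabilities and their one-hot argmax."""
    noisy = [(logit + gumbel_noise(rng)) / tau for logit in logits]
    soft = softmax(noisy)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


def select_tokens(tokens, queries, tau=1.0, seed=0):
    """Assumed scheme: every query picks one token; keep the union of picks."""
    rng = random.Random(seed)
    kept = set()
    for q in queries:
        # Hypothetical importance score: dot product of query and token.
        scores = [sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        _, hard = gumbel_softmax(scores, tau, rng)
        kept.add(hard.index(1.0))
    return sorted(kept)


# Toy example: three visual tokens, three queries.
tokens = [[100.0, 0.0], [0.0, 100.0], [0.0, 0.0]]
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(select_tokens(tokens, queries))  # [0, 1]: token 2 is pruned
```

Because two queries pick the same token, only two of the three tokens survive. In this sketch the number of kept tokens is therefore not fixed in advance, which is one way an adaptive budget can arise without a hand-set pruning ratio.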

Key results

Achieves state-of-the-art success rates across all LIBERO benchmark task suites
Reduces FLOPs by 59.1% and inference latency by 38.2% compared to OpenVLA-OFT
Improves task success rate by 2.6% while retaining only ~78 visual tokens on average
Introduces LightVLA*, a learnable-query variant with additional trainable parameters that also achieves satisfactory performance
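To put the reported reductions in perspective, the short sketch below converts a percentage reduction into the equivalent ratio against the baseline. It assumes both figures are measured relative to OpenVLA-OFT, as stated above. The helper name `reduction_to_ratio` is illustrative, and the ratios are derived arithmetic rather than numbers reported by the authors.

```python
def reduction_to_ratio(reduction_pct):
    """Convert 'reduced by X%' into a baseline-to-new ratio."""
    remaining = 1.0 - reduction_pct / 100.0  # fraction of the baseline cost left
    return 1.0 / remaining


flops_ratio = reduction_to_ratio(59.1)    # FLOPs reduction from the results above
latency_ratio = reduction_to_ratio(38.2)  # latency reduction from the results above

print(round(flops_ratio, 2))    # 2.44
print(round(latency_ratio, 2))  # 1.62
```

Under that assumption, the baseline needs roughly 2.44 times the FLOPs and about 1.62 times the latency of LightVLA. The gap between the two ratios is a reminder that FLOPs savings do not translate one-to-one into wall-clock savings.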

Why it matters

Enables efficient, real-time deployment of powerful VLA models on edge robotic hardware without sacrificing task performance, bridging the gap between AI research and practical robotics.

Abstract

We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic “magic numbers” and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems. Project site: https://liauto-research.github.io/LightVLA.

Index terms

Deep Learning Methods, Machine Learning for Robot Control, Imitation Learning