SToRM: Supervised Token Reduction for Multi-Modal LLMs Toward Efficient End-To-End Autonomous Driving
Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, Il Yong Chun
AI summary
Problem
Multi-modal LLMs let autonomous driving systems follow natural language instructions, but processing the many visual tokens produced by sensor inputs is computationally expensive (self-attention cost grows quadratically with token count), which hinders real-time inference on resource-limited vehicles. Existing token reduction methods tend to degrade driving performance relative to using all tokens.
Approach
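SToRM trains a lightweight importance predictor, operating over short-term sliding windows, to score visual tokens. The predictor is supervised with pseudo-supervision signals obtained from an auxiliary all-token pass through the LLM. Guided by the predicted scores, an anchor-context merging module partitions tokens into anchors and context tokens, then merges each context token into its most relevant anchor, reducing redundancy while minimizing information loss.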
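The merging step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `anchor_context_merge`, the top-k anchor selection, the cosine-similarity assignment, and the unweighted averaging are all assumptions chosen for illustration, since the summary does not specify these details.

```python
import numpy as np


def anchor_context_merge(tokens, scores, num_anchors):
    """Hypothetical anchor-context merge.

    tokens: (N, D) visual token features.
    scores: (N,) predicted importance scores (higher = more important).
    Returns (merged, anchor_idx): (num_anchors, D) merged tokens and the
    original indices of the tokens chosen as anchors.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    n = tokens.shape[0]
    if not 0 < num_anchors <= n:
        raise ValueError("num_anchors must be in [1, number of tokens]")

    # Assumption: the top-k tokens by importance become anchors and the
    # rest become context tokens. Indices are sorted to keep token order.
    order = np.argsort(-scores, kind="stable")
    anchor_idx = np.sort(order[:num_anchors])
    context_idx = np.sort(order[num_anchors:])

    merged = tokens[anchor_idx].copy()
    if context_idx.size == 0:
        return merged, anchor_idx

    def unit(x):
        # Row-wise L2 normalization, guarded against zero vectors.
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.maximum(norms, 1e-12)

    # Assumption: "most relevant anchor" means highest cosine similarity.
    context = tokens[context_idx]
    similarity = unit(context) @ unit(merged).T  # (num_context, num_anchors)
    nearest = similarity.argmax(axis=1)

    # Assumption: each anchor becomes the plain average of itself and the
    # context tokens assigned to it. np.add.at accumulates correctly when
    # several context tokens map to the same anchor.
    counts = np.ones(num_anchors)
    np.add.at(merged, nearest, context)
    np.add.at(counts, nearest, 1.0)
    return merged / counts[:, None], anchor_idx
```

Calling `anchor_context_merge(tokens, scores, num_anchors=k)` reduces N visual tokens to k, so the LLM sees a fixed reduced-token budget. Averaging is only one possible merge rule; an importance-weighted average would be an equally plausible choice under these assumptions.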
Key results
- Outperforms state-of-the-art E2E driving MLLMs under an equal reduced-token budget on the LangAuto benchmark
- Maintains all-token driving performance while cutting computational cost by up to 30×
- Enables real-time end-to-end driving on a standard GPU
- Introduces an anchor-context merging module that minimizes information loss
Why it matters
It bridges the gap between high-performance multi-modal reasoning and real-time deployment constraints, making safe, instruction-aware autonomous driving computationally feasible.
Abstract
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe autonomous driving in unexpected scenarios, one may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) in autonomous driving facilitates human–vehicle interaction and may improve driving performance in unexpected scenarios. However, this approach requires substantial computational resources, which are inherently limited in autonomous vehicles, because it relies on an LLM and many visual tokens from sensor inputs. Many MLLM studies have explored reducing the number of visual tokens, but many approaches tend to exhibit some end-task performance degradation compared to using all tokens. To achieve efficient E2E driving while maintaining driving performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for Multi-modal LLMs (SToRM). The proposed SToRM framework consists of three key elements. First, we propose a lightweight importance predictor with short-term sliding windows that predicts the importance scores of visual tokens. Second, we propose a supervised learning approach for the importance predictor that uses an auxiliary path to obtain pseudo-supervision signals from an all-token pass through the LLM. Third, guided by the predicted importance scores, we propose an anchor–context merging module that partitions tokens into “anchor” and “context” tokens, then merges the latter into their most relevant anchors to reduce redundancy while minimizing information loss. Experiments with the LangAuto benchmark dataset show that the proposed SToRM outperforms state-of-the-art E2E driving MLLMs under an equal reduced-token budget and maintains all-token performance while reducing computational cost by up to 30×, enabling real-time E2E driving on a standard GPU.
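The first two elements of the framework (a sliding-window importance predictor trained on pseudo-supervision from an all-token pass) can be sketched as follows. This is a toy illustration under stated assumptions, not the method from the paper: the abstract does not say how the pseudo-supervision signals are computed, so the sketch assumes they are the normalized attention mass that each visual token receives in the all-token pass. The linear scorer, the window averaging, the cross-entropy loss, and all names (`pseudo_labels_from_attention`, `SlidingWindowPredictor`) are likewise hypothetical.

```python
from collections import deque

import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def pseudo_labels_from_attention(attn):
    """Assumed pseudo-supervision: attention mass per visual token.

    attn: (num_queries, N) non-negative attention weights from the
    all-token LLM pass to the N visual tokens (hypothetical input).
    Returns a length-N distribution that sums to 1.
    """
    mass = np.asarray(attn, dtype=np.float64).mean(axis=0)
    return mass / mass.sum()


class SlidingWindowPredictor:
    """Hypothetical linear scorer over the mean of the last `window` frames."""

    def __init__(self, dim, window=3, lr=0.05):
        self.w = np.zeros(dim)
        self.frames = deque(maxlen=window)  # oldest frame drops out automatically
        self.lr = lr

    def features(self, tokens):
        # Push the newest frame (N, D) and average over the short-term window.
        # Assumes every frame has the same number of tokens N.
        self.frames.append(np.asarray(tokens, dtype=np.float64))
        return np.mean(np.stack(self.frames), axis=0)

    def predict(self, feats):
        # Importance scores as a distribution over the N visual tokens.
        return softmax(feats @ self.w)

    def train_step(self, feats, target):
        # Cross-entropy between predicted and pseudo-label distributions.
        pred = self.predict(feats)
        loss = -np.sum(target * np.log(pred + 1e-12))
        # Gradient of softmax cross-entropy w.r.t. the logits is (pred - target).
        self.w -= self.lr * (feats.T @ (pred - target))
        return loss
```

During training, each frame would go through both the all-token pass (to obtain `target`) and the predictor (to obtain `pred`). At inference only the lightweight predictor runs, and its scores select which tokens to keep or merge.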