ICRA 2026

MoE-Powered Fast VLMs Via Curriculum Learning-Based Knowledge Distillation: Taming Regular and Corner Cases in Autonomous Driving

Xue Zhao, Zhou Fang

AI summary

CLKD doubles the inference speed of Vision-Language Models while preserving performance on both common and rare autonomous driving scenarios.
Vision-Language Models, Autonomous Driving, Knowledge Distillation, Mixture-of-Experts, Curriculum Learning, Real-time Inference

Problem

Large Vision-Language Models for autonomous driving suffer from high latency, and simply shrinking them degrades their ability to handle both regular scenarios and rare corner cases.

Approach

The authors propose a Curriculum Learning-based Knowledge Distillation framework that combines a Mixture-of-Experts architecture with a two-granularity hardness mining strategy and a progressive release distillation schedule to balance efficiency and accuracy.
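As a rough illustration of the Mixture-of-Experts component, the sketch below sends one input to the top-k experts picked by a gate and mixes their outputs. The summary does not describe the paper's router, so top-k softmax gating, the function names (`top_k_route`, `moe_forward`), and the toy experts are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest gate scores.

    Assumption: weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts are skipped."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Hypothetical toy experts: each is just a scalar function.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 0.0, lambda x: 4 * x]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_forward(3.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Only the selected experts run for a given input, which is how an MoE layer can add capacity without a proportional increase in per-input compute.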

Key results

  • Twofold increase in inference speed over existing approaches
  • Maintains comparable performance on regular and corner cases
  • MoE architecture preserves the expressiveness of the small student model
  • H2G strategy adaptively mines hard tokens and samples
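One way to picture hardness mining at two granularities is to score each token by how far the student's prediction is from the teacher's, then aggregate those scores per sample. The summary does not state how H2G measures hardness, so the KL-divergence score, the mean aggregation, and the helper names below are assumptions, not the paper's definition.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def token_hardness(teacher_probs, student_probs):
    """Fine granularity: one score per token, KL(teacher || student)."""
    return [kl_divergence(t, s) for t, s in zip(teacher_probs, student_probs)]


def sample_hardness(token_scores):
    """Coarse granularity: mean token score for the whole sample (assumed)."""
    return sum(token_scores) / len(token_scores)


def hardest_tokens(token_scores, n):
    """Indices of the n highest-scoring tokens, hardest first."""
    order = sorted(range(len(token_scores)),
                   key=lambda i: token_scores[i], reverse=True)
    return order[:n]


# Hypothetical two-token sample over a two-word vocabulary.
teacher = [[0.5, 0.5], [0.9, 0.1]]
student = [[0.5, 0.5], [0.1, 0.9]]
scores = token_hardness(teacher, student)
```

In this toy sample the student matches the teacher on the first token and disagrees on the second, so only the second token is flagged as hard.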

Why it matters

Enables real-time, resource-efficient deployment of autonomous driving systems without compromising safety or decision-making accuracy.

Abstract

Autonomous driving has advanced significantly with the integration of large Vision-Language Models (VLMs), which excel in understanding and analyzing driving data. However, existing VLMs face challenges, particularly in terms of latency, which is crucial for real-time driving tasks. While shrinking the model size can reduce latency, it also limits the model’s ability to handle both regular and corner cases effectively. To address this challenge, we propose the Curriculum Learning-based Knowledge Distillation (CLKD) framework. CLKD enhances student model performance through three key innovations: (1) integration of a Mixture-of-Experts (MoE) architecture to preserve model expressiveness; (2) Hardness-explored at Two Granularities (H2G), which dynamically identifies easy and difficult samples at both instance and feature levels; and (3) Progressive Release Distillation strategy that gradually reduces reliance on the teacher model, thereby fostering the student’s autonomy and improving its generalization capability in complex driving scenarios. In real-world data experiments, CLKD has achieved a twofold increase in speed compared to existing approaches while maintaining comparable performance.
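The idea of gradually reducing reliance on the teacher can be sketched as a loss whose distillation weight shrinks over training. The abstract does not give the actual schedule, so the linear decay, the two-term loss, and the names `teacher_weight` and `blended_loss` are assumptions chosen for simplicity.

```python
def teacher_weight(step, total_steps):
    """Assumed linear release: 1.0 at step 0, 0.0 from total_steps onward."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def blended_loss(distill_loss, task_loss, step, total_steps):
    """Mix the teacher-matching loss with the student's own task loss."""
    w = teacher_weight(step, total_steps)
    return w * distill_loss + (1.0 - w) * task_loss


# Hypothetical loss values, evaluated halfway through a 100-step run.
midpoint = blended_loss(distill_loss=2.0, task_loss=4.0, step=50, total_steps=100)
```

Early in training the student mostly imitates the teacher; by the end it is optimized on the task loss alone. Any monotonically decreasing schedule would express the same idea.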

Index terms

Intelligent Transportation Systems, AI-Based Methods, Deep Learning for Visual Perception