VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled Refaat
AI summary
Problem
Standard imitation learning for motion forecasting prioritizes local geometric accuracy, failing to capture holistic human driving preferences. Existing vision-language model approaches often require expensive data curation, risk catastrophic forgetting, or suffer from high inference latency.
Approach
VL-DPO leverages a frozen, zero-shot vision-language model as a reasoning critic to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the forecasting model via Direct Preference Optimization.
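The finetuning step can be illustrated with the standard DPO objective for a single preference pair. This is a minimal sketch and not the authors' implementation. It assumes each trajectory has a scalar log-likelihood under both the model being finetuned and a frozen copy of the pretrained model. The function names, argument names, and the `beta` value are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; the summary does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    "chosen" is the rollout the VLM critic preferred, "rejected" is the other.
    The reference log-probabilities come from the frozen pretrained model.
    """
    # How much the finetuned model has shifted toward each trajectory
    # relative to the reference model.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the finetuned model equals the reference model, the margin is zero and the loss is log 2. The loss falls as the model assigns relatively more probability to the preferred trajectory than the reference does.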
Key results
- VLM trajectory selection is validated as a high-quality proxy for human preference
- 11.94% increase in Rater Feedback Score over pretrained baseline
- 10.01% reduction in Average Displacement Error
- Consistently outperforms High-Level Action supervision strategies
Why it matters
Provides an efficient, interpretable pathway to align autonomous driving systems with nuanced human preferences without retraining large foundational models.
Abstract
The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model’s rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM’s trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
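For readers unfamiliar with the displacement metric, the sketch below computes ADE in its common form: the mean Euclidean distance between predicted and reference waypoints at matching timesteps. It also shows how a relative change such as the reported reduction is calculated. The paper's exact evaluation protocol (prediction horizon, averaging across scenarios) is not given in the abstract, so this is only an assumed, simplified version with hypothetical function names.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


def relative_change_percent(baseline: float, new: float) -> float:
    """Signed percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline
```

A negative value from `relative_change_percent` indicates a reduction, which is the desired direction for ADE. For RFS, higher is better, so an improvement shows up as a positive change.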