ICRA 2026

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled Refaat

AI summary

Using a frozen vision-language model to generate preference pairs enables Direct Preference Optimization, significantly improving motion forecasting alignment with human driving preferences without catastrophic forgetting.

Motion Forecasting, Preference Alignment, Vision-Language Models, Direct Preference Optimization, Autonomous Driving, Human-Centered AI

Problem

Standard imitation learning for motion forecasting prioritizes local geometric accuracy, failing to capture holistic human driving preferences. Existing vision-language model approaches often require expensive data curation, risk catastrophic forgetting, or suffer from high inference latency.

Approach

VL-DPO leverages a frozen, zero-shot vision-language model as a reasoning critic to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the forecasting model via Direct Preference Optimization.
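As a rough illustration of the pair-generation step, the sketch below turns a set of rollouts into (chosen, rejected) pairs given a critic's pick. Everything here is an assumption made for illustration: the `Trajectory` type, the `build_preference_pairs` helper, the choice to pair the selected rollout against every other rollout, and the `toy_critic` stand-in, which replaces the frozen VLM with a trivial geometric rule. The summary does not specify how the paper forms its pairs or prompts the VLM.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a trajectory is a list of (x, y) waypoints.
Trajectory = List[Tuple[float, float]]


def build_preference_pairs(
    rollouts: Sequence[Trajectory],
    critic: Callable[[Sequence[Trajectory]], int],
) -> List[Tuple[Trajectory, Trajectory]]:
    """Pair the critic's selected rollout against each remaining rollout.

    `critic` returns the index of the rollout it prefers. In the described
    approach this role is played by a frozen VLM; here it is any callable.
    """
    if len(rollouts) < 2:
        return []  # a preference pair needs at least two candidates
    best = critic(rollouts)
    if not 0 <= best < len(rollouts):
        raise ValueError("critic returned an out-of-range index")
    chosen = rollouts[best]
    return [(chosen, r) for i, r in enumerate(rollouts) if i != best]


def toy_critic(rollouts: Sequence[Trajectory]) -> int:
    """Stand-in critic (NOT a VLM): prefer the smallest final lateral offset."""
    return min(range(len(rollouts)), key=lambda i: abs(rollouts[i][-1][1]))


rollouts = [
    [(0.0, 0.0), (1.0, 0.5)],
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, -2.0)],
]
pairs = build_preference_pairs(rollouts, toy_critic)
```

With three rollouts this yields two pairs, both sharing the same chosen trajectory. Other pairing schemes (for example, best versus worst only) would fit the same interface.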

Key results

VLM trajectory selection is validated as a high-quality proxy for human preference
11.94% increase in Rater Feedback Score over pretrained baseline
10.01% reduction in Average Displacement Error
Consistently outperforms High-Level Action supervision strategies
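For readers unfamiliar with the metrics, the following minimal sketch shows how Average Displacement Error and a relative percentage change are commonly computed. It assumes 2D waypoints matched one-to-one between prediction and ground truth; the function names and the sample numbers are illustrative only and are not taken from the paper, whose exact evaluation protocol is not given in this summary.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between matched predicted and true waypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_change_percent(baseline: float, new: float) -> float:
    """Signed percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Illustrative values only (not results from the paper).
ade = average_displacement_error([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 1.0)])
change = relative_change_percent(2.0, 1.8)  # a negative value means a reduction
```

Under this convention, a reported "reduction in ADE" corresponds to a negative relative change, while an "increase in RFS" corresponds to a positive one.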

Why it matters

Provides an efficient, interpretable pathway to align autonomous driving systems with nuanced human preferences without retraining large foundational models.

Abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model’s rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM’s trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
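To make the DPO finetuning step concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes each trajectory's log-probability under the finetuned policy and under the frozen pretrained reference is available as a scalar; the `dpo_loss` name and the default `beta=0.1` are illustrative choices, and the abstract does not state how the paper adapts the loss to motion forecasting outputs.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is the log-probability ratio between the policy being
    finetuned and the frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy equals the reference, both margins are zero and the loss is log(2).
initial = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the chosen trajectory's log-probability lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

Because the loss depends only on log-probability ratios against the reference, the policy is rewarded for preferring the chosen trajectory more strongly than the pretrained model does, which is one reason DPO is often described as keeping the finetuned model anchored to its starting point.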

Index terms

Intelligent Transportation Systems, Autonomous Agents, Deep Learning Methods