ICRA 2026

DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models

Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Jingde Chen, Katherine Skinner, Jose Alvarez

AI summary

DriveCritic, a vision-language model fine-tuned with verifiable rewards, significantly outperforms rule-based metrics in aligning with expert human preferences on nuanced driving scenarios.

Autonomous driving evaluation; Vision-language models; Human-aligned metrics; Context-aware reasoning; Reinforcement learning; Trajectory assessment

Problem

Current open-loop evaluation metrics like EPDMS rely on rigid rules that lack context awareness, causing them to misjudge nuanced driving behaviors and diverge from expert human judgment.

Approach

The authors curate a dataset of challenging trajectory pairs annotated with human preferences and fine-tune a vision-language model using a two-stage supervised and reinforcement learning pipeline to adjudicate trajectories based on rich visual and symbolic context.
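As an illustration of what a verifiable reward can look like for pairwise adjudication, the sketch below scores a model response 1.0 when its stated choice matches the human-preferred trajectory and 0.0 otherwise. The response format (`Answer: A` or `Answer: B`), the function name `preference_reward`, and the binary reward values are assumptions made for this example; the summary does not specify the reward the authors use.

```python
import re

# Hypothetical response format: the model closes its reasoning with
# "Answer: A" or "Answer: B". This convention is assumed, not taken from the paper.
_ANSWER_RE = re.compile(r"Answer:\s*([AB])\b")


def preference_reward(response: str, human_choice: str) -> float:
    """Return 1.0 if the last stated answer matches the human label, else 0.0."""
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return 0.0  # unparseable output earns no reward
    return 1.0 if matches[-1] == human_choice else 0.0
```

Because the reward depends only on a string comparison against the stored human label, it can be checked automatically for every sampled response, which is the property that makes a reward "verifiable" in this sense.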

Key results

Exposed EPDMS context-blindness in nuanced scenarios
Curated DriveCritic dataset of 5,730 human-annotated trajectory pairs
Fine-tuned VLM evaluator via RLVR achieving 76% human alignment accuracy
Outperformed rule-based baselines in context-aware trajectory adjudication
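
A human alignment accuracy of this kind is typically the fraction of trajectory pairs on which the evaluator picks the same trajectory as the human annotator. The sketch below shows that computation under the assumption that each pair is labeled `"A"` or `"B"`; the function name `alignment_accuracy` and the label encoding are hypothetical, and the toy labels are not data from the paper.

```python
from typing import Sequence


def alignment_accuracy(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairs where the evaluator's choice equals the human preference."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not human:
        raise ValueError("at least one labeled pair is required")
    agreements = sum(p == h for p, h in zip(predicted, human))
    return agreements / len(human)


# Toy example: the evaluator agrees with the annotator on 3 of 4 pairs.
toy_accuracy = alignment_accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"])
```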

Why it matters

Enables scalable, human-aligned benchmarking for autonomous driving planners, guiding safer and more socially aware policy development.

Abstract

Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment, annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems. The project page for DriveCritic is https://song-jingyu.github.io/DriveCritic.

Index terms

Autonomous Vehicle Navigation; AI-Based Methods; Intelligent Transportation Systems