← Back ICRA 2026

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

Naman Mishra, Shankar Gangisetty, C.V. Jawahar

PDF

AI summary

Key figure (auto-extracted from paper)

Finetuning state-of-the-art vision-language models on a structured question-answering dataset significantly improves both pedestrian intention/trajectory prediction accuracy and the quality of explanatory rationales.

Pedestrian Intention Prediction Trajectory Forecasting Vision-Language Models Explainable AI Autonomous Driving Multimodal Benchmark

Problem

Current pedestrian prediction methods treat the task as a black-box classification or regression problem, underutilizing multimodal reasoning and lacking explainability, which hinders trust and safe deployment in autonomous driving.

Approach

The authors introduce PedestrianQA, a video-based dataset that reformulates pedestrian prediction as question-answering with five structured rationale categories, and demonstrate that finetuning a standard vision-language model on this data yields strong predictive and reasoning performance without architectural changes.

Key results

Introduction of PedestrianQA, a multimodal dataset with 10,251 training and 4,059 test samples featuring structured QA and rationale annotations
Establishment of a strong baseline by finetuning Qwen2.5-VL-3B using parameter-efficient LoRA adapters
Significant improvements in intention classification and trajectory forecasting accuracy across PIE, JAAD, TITAN, and IDD-PeD benchmarks
Generation of high-quality, multi-dimensional explanatory rationales that enhance model transparency and decision-making

Why it matters

It provides autonomous driving researchers and safety engineers with a unified, explainable framework for critical pedestrian behavior modeling, directly supporting safer AV deployment.

Abstract

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision–language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural lan- guage reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question–answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences, in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions without needing specialized ar- chitectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of- the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling. Dataset and models are available at https://github.com/botmahn/PedestrianQA

Index terms

Data Sets for Robotic Vision Multi-Modal Perception for HRI Intelligent Transportation Systems