PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction
Naman Mishra, Shankar Gangisetty, C.V. Jawahar
AI summary
Problem
Current pedestrian prediction methods treat the task as a black-box classification or regression problem, underutilizing multimodal reasoning and lacking explainability, which hinders trust and safe deployment in autonomous driving.
Approach
The authors introduce PedestrianQA, a video-based dataset that reformulates pedestrian prediction as question-answering with five structured rationale categories, and demonstrate that finetuning a standard vision-language model on this data yields strong predictive and reasoning performance without architectural changes.
Key results
- Introduction of PedestrianQA, a multimodal dataset with 10,251 training and 4,059 test samples featuring structured QA and rationale annotations
- Establishment of a strong baseline by finetuning Qwen2.5-VL-3B using parameter-efficient LoRA adapters
- Significant improvements in intention classification and trajectory forecasting accuracy across PIE, JAAD, TITAN, and IDD-PeD benchmarks
- Generation of high-quality, multi-dimensional explanatory rationales that enhance model transparency and decision-making
Why it matters
It provides autonomous driving researchers and safety engineers with a unified, explainable framework for critical pedestrian behavior modeling, directly supporting safer AV deployment.
Abstract
Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision–language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural lan- guage reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question–answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences, in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions without needing specialized ar- chitectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of- the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling. Dataset and models are available at https://github.com/botmahn/PedestrianQA