DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios
Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C.V. Jawahar
AI summary
Problem
General-purpose multimodal large language models lack the fine-grained, spatially grounded understanding required for safety-critical driving scenarios and fail to provide actionable safety guidance. Existing risk prediction methods also overlook the crucial step of generating explicit safety suggestions.
Approach
The framework generates spatially grounded scene captions enriched with optical flow, depth, and lane segmentation cues, then processes them through an LLM for risk assessment and safety suggestions. Performance is further boosted by fine-tuning a lightweight adapter module on caption-risk pairs to inject domain-specific knowledge.
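As a rough illustration of the caption-then-LLM flow described above, the sketch below assembles a spatially grounded caption from per-object cues and wraps it in a prompt for a text-only LLM. All names (`ObjectCue`, `build_caption`, `build_prompt`), the choice of fields, and the prompt wording are hypothetical and not taken from the paper; the perception models that would produce the cues and the LLM call itself are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectCue:
    """Hypothetical per-object record; field names are illustrative assumptions."""
    label: str                        # e.g. "pedestrian"
    box: Tuple[int, int, int, int]    # (x1, y1, x2, y2) in pixels
    depth_m: float                    # estimated distance from a depth model
    motion: str                       # coarse motion summary derived from optical flow
    lane: str                         # lane relation derived from lane segmentation


def build_caption(cues: List[ObjectCue]) -> str:
    """Turn structured cues into one sentence per object, nearest object first."""
    sentences = []
    for cue in sorted(cues, key=lambda c: c.depth_m):
        x1, y1, x2, y2 = cue.box
        sentences.append(
            f"A {cue.label} at box ({x1}, {y1}, {x2}, {y2}), "
            f"about {cue.depth_m:.1f} m away, {cue.motion}, {cue.lane}."
        )
    return " ".join(sentences)


def build_prompt(caption: str) -> str:
    """Wrap the caption in an instruction for an LLM (the wording is an assumption)."""
    return (
        "Scene description: " + caption + "\n"
        "Identify the riskiest object, its location, and the unsafe behavior "
        "it implies, then give one safety suggestion for the ego vehicle."
    )


cues = [
    ObjectCue("car", (400, 220, 520, 300), 25.0, "moving away", "in the ego lane"),
    ObjectCue("pedestrian", (310, 240, 350, 330), 8.5,
              "moving left to right", "entering the ego lane"),
]
prompt = build_prompt(build_caption(cues))
print(prompt)
```

Keeping the scene representation as plain text means the risk-assessment step needs only a language model, which is the conditioning the summary describes; the exact caption template used by the authors may differ.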
Key results
- Outperforms zero-shot MLLMs and domain-specific baselines in risk assessment and safety suggestion prediction
- Achieves state-of-the-art performance on the DRAMA benchmark with adapter fine-tuning
- Extends the DRAMA dataset with explicit safety-suggestion annotations linked to risk keywords
- Demonstrates that lightweight adapter tuning effectively bridges the gap between general VLMs and safety-critical driving tasks
Why it matters
Provides a reliable, actionable framework for autonomous vehicles to anticipate hazards and issue precise safety guidance in complex real-world driving environments.
Abstract
Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision–language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption–risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe.
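The abstract does not say which kind of lightweight adapter is fine-tuned. As one common possibility, the sketch below shows a low-rank (LoRA-style) adapter around a frozen linear layer in plain Python: only the two small matrices would be trained, and the adapted layer starts out identical to the base layer. The class name, the `rank` and `alpha` parameters, and the initialization are assumptions for illustration, not the authors' configuration.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update: y = W x + (alpha / rank) * U (D x)."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, shape d_out x d_in
        # Down-projection D (rank x d_in): small random values.
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # Up-projection U (d_out x rank): zeros, so the update is zero before training.
        self.up = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self) -> int:
        """Count adapter parameters only; the base weights stay frozen."""
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LowRankAdapter(identity, rank=1)
print(layer.forward([1.0, 2.0, 3.0, 4.0]))  # equals the base output before any training
print(layer.num_trainable(), "trainable vs", 4 * 4, "frozen")
```

In this toy 4x4 case the adapter holds 8 parameters against 16 frozen ones; for realistic layer sizes the trainable fraction is far smaller, which is what makes adapter tuning on caption–risk pairs inexpensive relative to full fine-tuning.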