ICRA 2026

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C.V. Jawahar

AI summary

DriveSafe significantly outperforms zero-shot MLLMs and prior baselines in driving risk assessment and safety suggestion generation by leveraging multimodal contextual cues and lightweight adapter fine-tuning.

Driving risk assessment; Multimodal LLMs; Safety suggestions; Autonomous vehicles; Adapter fine-tuning; DRAMA benchmark

Problem

General-purpose multimodal large language models lack the fine-grained, spatially grounded understanding required for safety-critical driving scenarios and fail to provide actionable safety guidance. Existing risk prediction methods also overlook the crucial step of generating explicit safety suggestions.

Approach

The framework generates spatially grounded scene captions enriched with optical flow, depth, and lane segmentation cues, then processes them through an LLM for risk assessment and safety suggestions. Performance is further boosted by fine-tuning a lightweight adapter module on caption-risk pairs to inject domain-specific knowledge.
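As a rough illustration of the caption-then-LLM idea described above, the following sketch composes per-object cues into a spatially grounded caption and wraps it in a risk-assessment prompt. It is not the authors' implementation: the `ObjectCues` fields, the sentence template, the nearest-first ordering, and the prompt wording are all assumptions made for this example, and the cue values would in practice come from separate optical flow, depth, and lane segmentation models.

```python
from dataclasses import dataclass


@dataclass
class ObjectCues:
    """Hypothetical per-object cues; field names are illustrative only."""
    label: str      # object class, e.g. "pedestrian"
    region: str     # coarse image location, e.g. "center-right"
    depth_m: float  # estimated distance, assumed to come from a depth model
    motion: str     # summary of optical flow, e.g. "moving left"
    lane: str       # lane relation from segmentation, e.g. "ego lane"


def describe(obj: ObjectCues) -> str:
    """Turn one object's cues into a single spatially grounded sentence."""
    return (
        f"A {obj.label} at the {obj.region} of the frame, about "
        f"{obj.depth_m:.0f} m away, {obj.motion}, in the {obj.lane}."
    )


def build_prompt(objects: list[ObjectCues]) -> str:
    """Compose the scene caption and wrap it in a risk-assessment prompt."""
    # Assumption: list the nearest objects first so they lead the caption.
    ordered = sorted(objects, key=lambda o: o.depth_m)
    caption = " ".join(describe(o) for o in ordered)
    return (
        "Scene: " + caption + "\n"
        "Identify the riskiest object, explain why it is a risk, "
        "and suggest a safe action for the ego vehicle."
    )
```

The resulting string would then be passed to an LLM of choice; keeping the scene description in plain language makes each cue easy to inspect or ablate.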

Key results

Outperforms zero-shot MLLMs and domain-specific baselines in risk assessment and safety suggestion prediction
Achieves state-of-the-art performance on the DRAMA benchmark with adapter fine-tuning
Extends the DRAMA dataset with explicit safety-suggestion annotations linked to risk keywords
Demonstrates that lightweight adapter tuning effectively bridges the gap between general VLMs and safety-critical driving tasks

Why it matters

Provides a reliable, actionable framework for autonomous vehicles to anticipate hazards and issue precise safety guidance in complex real-world driving environments.

Abstract

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision–language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption–risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe.
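To make the caption–risk pairing concrete, here is a minimal sketch of how one such pair might be serialized as a training record for adapter fine-tuning. The `RiskAssessment` fields mirror the outputs the abstract lists (hazardous object, location, unsafe behavior, safety suggestion), but the field names, the `input`/`target` keys, and the JSON-lines format are assumptions for illustration, not the paper's actual data schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RiskAssessment:
    """Hypothetical target structure; field names are illustrative only."""
    risky_object: str
    location: str
    unsafe_behavior: str
    suggestion: str


def to_training_record(caption: str, risk: RiskAssessment) -> str:
    """Serialize one caption-risk pair as a single JSON line."""
    # Assumption: the caption is the model input, the assessment the target.
    return json.dumps({"input": caption, "target": asdict(risk)}, sort_keys=True)
```

Writing one record per line keeps the dataset easy to stream, and a structured target makes it simple to score the object, location, and suggestion fields separately.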

Index terms

Deep Learning for Visual Perception; Computer Vision for Transportation; Intelligent Transportation Systems