Preventing Robotic Jailbreaking Via Multimodal Domain Adaptation
Francesco Marchiori, Rohan Sinha, Christopher George Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone
AI summary
Problem
Data-driven jailbreak detectors fail in robotics due to scarce domain-specific adversarial data and distribution shifts between general-purpose text benchmarks and embodied environments.
Approach
J-DAPT fuses text and visual embeddings via cross-attention, then aligns general-purpose jailbreak datasets to target robotic domains using importance weighting and CORAL (correlation alignment).
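As a rough illustration of the alignment step, the sketch below computes a CORAL-style loss: the squared Frobenius distance between source and target feature covariances, scaled by 1/(4d²). It is a minimal sketch, not the authors' implementation. The function names, the optional `source_weights` argument (one assumed way importance weights could enter the covariance), and the use of normalized-weight (biased) covariance are all assumptions made for illustration.

```python
import numpy as np

def covariance(x, weights=None):
    """Feature covariance of x with shape (n_samples, n_features).

    Weights are normalized to sum to 1, so this is the biased
    (population-style) estimate. Uniform weights are used if none are given.
    """
    x = np.asarray(x, dtype=float)
    if weights is None:
        weights = np.ones(x.shape[0])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = w @ x                      # weighted feature mean, shape (d,)
    centered = x - mean
    return (centered * w[:, None]).T @ centered

def coral_loss(source, target, source_weights=None):
    """Squared Frobenius distance between covariances, scaled by 1/(4 d^2).

    source_weights is a hypothetical hook for per-sample importance weights
    on the general-purpose (source) data; the target data is unweighted.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    d = source.shape[1]
    diff = covariance(source, source_weights) - covariance(target)
    return float(np.sum(diff ** 2) / (4.0 * d * d))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    general = rng.normal(size=(200, 8))          # stand-in for source embeddings
    robotic = 1.5 * rng.normal(size=(100, 8))    # stand-in for target embeddings
    print("CORAL loss:", coral_loss(general, robotic))
```

A loss of zero means the two feature sets already share second-order statistics. In training, a term like this would typically be added to the classification loss so that the encoder learns features whose covariance matches the target domain.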
Key results
- Mitigates 98.85% of jailbreak attacks across autonomous driving, maritime, and quadruped benchmarks
- Achieves up to 100% detection accuracy in specific scenarios without domain-specific jailbreak training data
- Runs 9.9× faster than the fastest comparable LLM-based detector
- Outperforms existing classifier baselines, which perform near random guessing
Why it matters
Provides a practical, low-latency defense for securing VLM-enabled robots in safety-critical real-world deployments where adversarial data is scarce.
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT raises detection accuracy to as high as 100% in certain scenarios under our evaluation protocol. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications. Additional materials are available at: https://j-dapt.github.io.
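The attention-based fusion of textual and visual embeddings mentioned in the abstract can be illustrated with single-head scaled dot-product cross-attention. The sketch below is a minimal example under stated assumptions, not J-DAPT's actual architecture. Using text tokens as queries and image patches as keys and values, the projection matrices `w_q`, `w_k`, `w_v`, the concatenate-then-mean-pool step, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_emb, image_emb, w_q, w_k, w_v):
    """Fuse text and image embeddings with one cross-attention head.

    text_emb:  (n_text, d_text) token embeddings, used as queries.
    image_emb: (n_img, d_img) patch embeddings, used as keys and values.
    Returns a pooled fused vector and the attention map.
    """
    q = text_emb @ w_q                        # (n_text, d_k)
    k = image_emb @ w_k                       # (n_img, d_k)
    v = image_emb @ w_v                       # (n_img, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)           # each text token attends over patches
    attended = attn @ v                       # (n_text, d_v) visual context per token
    fused = np.concatenate([text_emb, attended], axis=-1)
    return fused.mean(axis=0), attn           # mean-pool tokens into one vector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(6, 16))     # 6 hypothetical text tokens
    image = rng.normal(size=(9, 32))    # 9 hypothetical image patches
    w_q = rng.normal(size=(16, 8))
    w_k = rng.normal(size=(32, 8))
    w_v = rng.normal(size=(32, 8))
    vec, attn = cross_attention_fuse(text, image, w_q, w_k, w_v)
    print(vec.shape, attn.shape)
```

The pooled vector could then feed a small classifier head. Keeping the fusion to a single attention layer over precomputed embeddings is consistent with the lightweight design the abstract describes, though the real model may differ.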