ICRA 2026

Preventing Robotic Jailbreaking Via Multimodal Domain Adaptation

Francesco Marchiori, Rohan Sinha, Christopher George Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone

AI summary

J-DAPT detects robotic jailbreaks with near-perfect accuracy by adapting general-purpose safety data to robotics domains without requiring domain-specific attack examples.
robotic jailbreak detection, vision-language models, domain adaptation, multimodal fusion, AI safety, autonomous systems

Problem

Data-driven jailbreak detectors fail in robotics due to scarce domain-specific adversarial data and distribution shifts between general-purpose text benchmarks and embodied environments.

Approach

J-DAPT fuses text and visual embeddings via cross-attention, then aligns general-purpose jailbreak datasets to target robotic domains using importance weighting and correlation alignment (CORAL).
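
The three ingredients named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the fusion step omits the learned query/key/value projections a real cross-attention layer would have, the importance weights are taken as given because this summary does not say how they are estimated, and the CORAL term follows the standard Deep CORAL formulation (squared Frobenius distance between feature covariances, scaled by 1/(4d^2)).

```python
import numpy as np


def cross_attention_fuse(text_emb, image_tokens):
    """Fuse one text embedding (d,) with n image token embeddings (n, d).

    Simplified single-head attention with no learned projections (an
    assumption of this sketch): the text embedding acts as the query,
    the image tokens act as keys and values.
    """
    d = text_emb.shape[0]
    scores = image_tokens @ text_emb / np.sqrt(d)  # (n,) scaled dot products
    scores = scores - scores.max()                 # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum()                       # softmax over image tokens
    attended = attn @ image_tokens                 # (d,) attended visual summary
    return np.concatenate([text_emb, attended])    # (2d,) fused feature


def coral_loss(source_feats, target_feats):
    """Standard CORAL loss between source (n_s, d) and target (n_t, d) features."""
    d = source_feats.shape[1]
    cov_s = np.cov(source_feats, rowvar=False)     # (d, d) source covariance
    cov_t = np.cov(target_feats, rowvar=False)     # (d, d) target covariance
    return float(np.sum((cov_s - cov_t) ** 2) / (4 * d * d))


def importance_weighted_loss(per_sample_loss, weights):
    """Mean of per-sample losses under self-normalized importance weights.

    The weights are assumed to be supplied by some density-ratio
    estimator; how they are obtained is outside this sketch.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(np.sum(w * np.asarray(per_sample_loss, dtype=float)))
```

In a training loop of this kind, the classification loss on general-purpose jailbreak data would be reweighted by `importance_weighted_loss`, and `coral_loss` would be added as a penalty so that fused source features match the second-order statistics of the robotic target domain.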

Key results

  • Mitigates 98.85% of jailbreak attacks across autonomous driving, maritime, and quadruped benchmarks
  • Achieves up to 100% detection accuracy in specific scenarios without domain-specific jailbreak training data
  • Runs 9.9× faster than the fastest comparable LLM-based detector
  • Outperforms existing classifier baselines, which perform near random guessing

Why it matters

Provides a practical, low-latency defense for securing VLM-enabled robots in safety-critical real-world deployments where adversarial data is scarce.

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to very high levels (up to 100% in certain scenarios) under our evaluation protocol. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications. Additional materials are made available at: https://j-dapt.github.io.

Index terms

AI-Enabled Robotics, Transfer Learning, Robot Safety
