ICRA 2026

I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

Clémence Grislain, Hamed Rahimi, Olivier Sigaud, Mohamed Chetouani

AI summary

I-FailSense enables vision-language models to accurately detect semantic misalignment failures in robotic tasks and generalizes to control errors and real-world settings with minimal training.

Robotic failure detection, Vision-language models, Semantic misalignment, Foundation models, Robot manipulation, Zero-shot generalization

Problem

Vision-language models struggle to detect semantic misalignment errors in robotic manipulation, where the robot performs a meaningful action that mismatches the given instruction, limiting robust deployment in open-world settings.

Approach

I-FailSense post-trains a base VLM using parameter-efficient fine-tuning and attaches lightweight classification heads to multiple internal layers, aggregating their predictions via a weighted voting mechanism to detect success or failure.
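The aggregation step described above can be illustrated with a minimal sketch. It assumes each classification head outputs a failure probability and that each head carries a fixed non-negative weight; the function name `weighted_vote`, the 0.5 cut-offs, and the example weights are hypothetical and not taken from the paper, which may weight or combine its heads differently.

```python
from typing import Sequence


def weighted_vote(
    failure_probs: Sequence[float],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> str:
    """Combine per-head failure probabilities into one success/failure label.

    Hypothetical illustration: each head casts a binary vote (failure if its
    probability is at least 0.5), and votes are combined using the given
    weights. The episode is labeled a failure when the weighted share of
    failure votes exceeds `threshold`.
    """
    if len(failure_probs) != len(weights) or len(failure_probs) == 0:
        raise ValueError("need one weight per head and at least one head")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Sum the weights of all heads that vote "failure".
    failure_weight = sum(w for p, w in zip(failure_probs, weights) if p >= 0.5)
    return "failure" if failure_weight / total_weight > threshold else "success"


if __name__ == "__main__":
    # Three heads attached to different layers (assumed values).
    print(weighted_vote([0.9, 0.2, 0.6], [0.5, 0.2, 0.3]))
    print(weighted_vote([0.9, 0.2, 0.3], [0.2, 0.5, 0.3]))
```

In this sketch a single confident head cannot decide the outcome unless its weight is large, which is one plausible reason to vote over several layers rather than rely on the final layer alone.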

Key results

90% accuracy on semantic misalignment detection in simulation
Zero-shot generalization to control errors and unseen environments, surpassing specialized baselines by 19 points
Real-world transfer with minimal fine-tuning, reaching 74% accuracy
Open-source framework and dedicated semantic misalignment datasets released on HuggingFace

Why it matters

Provides roboticists and AI researchers with a reliable, lightweight tool for autonomous failure detection, enabling safer and more robust deployment of foundation models in real-world environments.

Abstract

Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets that target semantic misalignment failure detection from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on post-training a base VLM, followed by training lightweight classification heads, called FS blocks, which are attached to different internal layers of the VLM and whose predictions are aggregated using an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and transfers effectively to other simulation environments and real-world settings with zero-shot or minimal post-training. The datasets and models are publicly released on HuggingFace.
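The abstract does not spell out how the semantic misalignment datasets are derived. One plausible construction, shown here purely as an assumption, is to keep each recorded episode with its own instruction as a success sample and to pair the same episode with an instruction from a different task as a failure sample. The names `Episode`, `LabeledSample`, and `build_misalignment_samples` are hypothetical and do not come from the paper or its released code.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Episode:
    video_id: str      # identifier of a successful rollout (assumed field)
    instruction: str   # instruction the rollout actually fulfils


@dataclass(frozen=True)
class LabeledSample:
    video_id: str
    instruction: str
    success: bool      # True if the rollout matches the instruction


def build_misalignment_samples(
    episodes: Sequence[Episode], seed: int = 0
) -> List[LabeledSample]:
    """Assumed recipe: one aligned and one mismatched sample per episode."""
    rng = random.Random(seed)
    all_instructions = sorted({ep.instruction for ep in episodes})
    samples: List[LabeledSample] = []
    for ep in episodes:
        # Positive: the episode paired with its own instruction.
        samples.append(LabeledSample(ep.video_id, ep.instruction, True))
        # Negative: the same episode paired with a different instruction,
        # i.e. a meaningful action that does not match what was asked.
        others = [ins for ins in all_instructions if ins != ep.instruction]
        if others:
            samples.append(LabeledSample(ep.video_id, rng.choice(others), False))
    return samples


if __name__ == "__main__":
    demo = [
        Episode("ep_001", "open the drawer"),
        Episode("ep_002", "push the red block"),
    ]
    for sample in build_misalignment_samples(demo):
        print(sample)
```

Under this assumption no new rollouts need to be collected, since negatives reuse existing successful episodes; the paper's actual procedure may differ.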

Index terms

Failure Detection and Recovery; Deep Learning in Grasping and Manipulation