Failure Identification in Imitation Learning Via Statistical and Semantic Filtering
Quentin Rolland, Fabrice Mayran de Chamisso, Jean-Baptiste Mouret
AI summary
Problem
Imitation learning policies struggle with rare, out-of-distribution events during real-world deployment. Existing vision-based anomaly detection methods flag deviations but cannot distinguish harmless scene changes from task-compromising failures.
Approach
FIDeL monitors policies by encoding expert demonstrations into a statistical memory, computing anomaly scores via optimal transport alignment, applying conformal prediction for dynamic thresholds, and using a vision-language model to semantically filter benign deviations from true failures.
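To make the optimal transport step more concrete, here is a minimal sketch of how patch-level anomaly scores could be derived from a transport plan between an observation and a memory of nominal features. It is an illustration under stated assumptions, not FIDeL's implementation: the function names (`sinkhorn_plan`, `patch_anomaly_scores`), the squared Euclidean cost, the uniform marginals, the entropic regularization `eps`, and the fixed iteration count are all hypothetical choices, and the source does not specify which optimal transport solver is used.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Approximate entropic OT plan between uniform marginals.

    Runs a fixed number of Sinkhorn scaling iterations; the result is an
    approximation, not an exactly converged plan. Assumes costs are moderate
    relative to eps (e.g. normalized features) so exp() does not underflow
    an entire row or column to zero.
    """
    n, m = len(cost), len(cost[0])
    r, c = 1.0 / n, 1.0 / m  # uniform mass per observation / memory entry
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [r / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def patch_anomaly_scores(obs_feats, memory_feats, eps=0.1):
    """Score each observation patch by its mean transport cost to the memory."""
    cost = [[sq_dist(o, mem) for mem in memory_feats] for o in obs_feats]
    plan = sinkhorn_plan(cost, eps)
    scores = []
    for row_plan, row_cost in zip(plan, cost):
        mass = sum(row_plan)  # normalize by the mass this patch actually sends
        scores.append(sum(p * c for p, c in zip(row_plan, row_cost)) / mass)
    return scores


if __name__ == "__main__":
    memory = [(0.0, 0.0), (1.0, 0.0)]       # toy nominal features
    observation = [(0.0, 0.0), (3.0, 0.0)]  # second patch is far from memory
    print(patch_anomaly_scores(observation, memory))
```

In this toy example the second patch, which lies far from every memory entry, receives a much higher score than the first. Arranging per-patch scores on the image grid would give a heatmap of the kind the summary mentions.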
Key results
- Novel representation-based anomaly detection using optimal transport alignment
- Extended conformal prediction framework for dynamic spatio-temporal thresholding
- Vision-language model semantic filtering to distinguish benign anomalies from failures
- +17.38% accuracy gain in failure detection over state-of-the-art baselines
Why it matters
Enables safer and more reliable deployment of imitation learning policies in real-world robotics by catching true failures while avoiding costly false alarms.
Abstract
Imitation learning (IL) policies in robotics deliver strong performance in controlled settings but remain brittle in real-world deployments: rare events such as hardware faults, defective parts, unexpected human actions, or any state that lies outside the training distribution can lead to failed executions. Vision-based Anomaly Detection (AD) methods have emerged as a suitable way to detect these anomalous failure states, but they do not distinguish failures from benign deviations. We introduce FIDeL (Failure Identification in Demonstration Learning), a policy-independent failure detection module. Leveraging recent AD methods, FIDeL builds a compact representation of nominal demonstrations and aligns incoming observations via optimal transport matching to produce anomaly scores and heatmaps. Spatio-temporal thresholds are derived with an extension of conformal prediction, and a Vision–Language Model (VLM) performs semantic filtering to discriminate benign anomalies from genuine failures. We also introduce BotFails, a multimodal dataset of real-world tasks for failure detection in robotics. FIDeL consistently outperforms state-of-the-art baselines, yielding +5.30% AUROC in anomaly detection and +17.38% failure-detection accuracy on BotFails compared to existing methods. Videos of FIDeL can be found on our website: https://cea-list.github.io/FIDeL/
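As background for the thresholding step, the sketch below shows standard split conformal prediction applied to scalar anomaly scores. It is not the spatio-temporal extension the abstract refers to, whose details are not given here. The names `conformal_threshold` and `is_failure_candidate`, and the miscoverage level `alpha`, are hypothetical and chosen for illustration only.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal threshold from anomaly scores of nominal runs.

    Assuming the calibration scores and a new nominal score are exchangeable,
    the new score exceeds the returned threshold with probability at most
    alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Rank of the finite-sample-corrected (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def is_failure_candidate(score, threshold):
    """Flag an observation whose anomaly score exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    nominal_scores = [3, 1, 4, 9, 5, 2, 6, 8, 7]  # toy calibration scores
    tau = conformal_threshold(nominal_scores, alpha=0.5)
    print(tau, is_failure_candidate(6, tau))
```

In a pipeline like the one described above, observations flagged this way would only be candidates: the semantic filtering stage would then decide whether a flagged deviation is benign or a genuine failure.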