ICRA 2026

Foundational World Models Accurately Detect Bimanual Manipulator Failures

Isaac Ronald Ward, Michelle Ho, Houjun Liu, Aaron Feldman, Joseph Vincent, Liam Kruse, Sean Cheong, Duncan Eddy, Mykel Kochenderfer, Mac Schwager

AI summary

A lightweight probabilistic world model trained in a foundation model's latent space reliably detects bimanual robot failures at runtime with high accuracy and minimal parameters.

World models, Anomaly detection, Bimanual manipulation, Conformal prediction, Foundation models, Robot safety

Problem

Safe deployment of bimanual manipulators is hindered by the difficulty of defining and detecting anomalous failures in high-dimensional, multimodal sensory data in real time.

Approach

The authors train a probabilistic variational autoencoder world model within the compressed latent space of NVIDIA's pretrained Cosmos Tokenizer, using its prediction uncertainty as a non-conformity score calibrated via conformal prediction to flag failures at runtime.
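The calibration step can be illustrated with a minimal sketch of standard split conformal prediction. It is not the authors' implementation: the function names, the `alpha` miscoverage level, and the idea that each score is a scalar uncertainty value per timestep from failure-free calibration data are all assumptions made for illustration.

```python
import math
import numpy as np


def conformal_threshold(calibration_scores, alpha=0.05):
    """Return a split-conformal threshold from nominal (failure-free) scores.

    Assumption: each score is a scalar non-conformity value, for example
    the world model's predictive uncertainty at one timestep.
    The threshold is the ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration samples to support this alpha: never flag.
        return math.inf
    return float(scores[k - 1])


def flag_failures(runtime_scores, threshold):
    """Mark timesteps whose score exceeds the calibrated threshold."""
    return np.asarray(runtime_scores, dtype=float) > threshold
```

Under the usual exchangeability assumption, a nominal runtime score exceeds this threshold with probability at most `alpha`, so scores above it are treated as evidence of an anomaly.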

Key results

Outperformed five statistical and learning-based baselines in failure detection rate
Required approximately one-twentieth the trainable parameters of the next-best learning-based approach
Successfully detected both visual and dynamic anomalies in simulated Push-T environments
Introduced the Bimanual Cable Manipulation dataset with synchronized multi-view video and annotated real-world failures

Why it matters

Enables safer, scalable deployment of complex robotic systems in high-stakes environments where reliability is critical.

Abstract

Deploying visuomotor robots at scale is challenging due to the potential for anomalous failures to degrade performance, cause damage, or endanger human life. Bimanual manipulators are no exception; these robots have vast state spaces composed of high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history-informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA’s Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions, and these estimates serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures from a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach requires approximately one-twentieth of the trainable parameters of the next-best learning-based approach, yet outperforms it by 3.8% in terms of failure detection rate, paving the way toward safely deploying manipulator robots in real-world environments where reliability is non-negotiable.
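The runtime monitor described in the abstract, which associates periods of high uncertainty with failures, might be sketched as follows. This is a hypothetical illustration rather than the paper's monitor: the function name, the externally supplied `threshold`, and the `min_length` persistence filter are assumptions that do not come from the abstract.

```python
def anomalous_periods(scores, threshold, min_length=3):
    """Return (start, end) index pairs of sustained high-uncertainty runs.

    Assumptions: `scores` holds one scalar uncertainty value per timestep,
    and `threshold` was calibrated beforehand. `min_length` is an
    illustrative persistence filter that ignores brief spikes.
    """
    periods = []
    start = None
    # A trailing sentinel below any threshold closes a run at the end.
    for t, score in enumerate(list(scores) + [float("-inf")]):
        if score > threshold:
            if start is None:
                start = t  # a new run begins here
        elif start is not None:
            if t - start >= min_length:
                periods.append((start, t - 1))
            start = None
    return periods


if __name__ == "__main__":
    stream = [0.1, 0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9]
    print(anomalous_periods(stream, threshold=0.5))
```

Requiring a run to persist for several timesteps trades detection latency for fewer false alarms; setting `min_length=1` flags every exceedance.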

Index terms

Big Data in Robotics and Automation; Failure Detection and Recovery; AI-Based Methods