ICRA 2026

INSIGHT: INference-Time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

Ulas Berk Karli, Ziyao Shangguan, Tesca Fitzgerald

AI summary

Modeling the temporal evolution of token-level uncertainty with a compact transformer reliably predicts when vision-language-action models should request human help, outperforming static uncertainty thresholds.

Vision-Language-Action models, introspection, uncertainty quantification, help triggers, weak supervision, transformer classifiers

Problem

Autonomous vision-language-action models lack mechanisms to recognize their own uncertainty or request human assistance during inference, risking unsafe failures in unstructured environments.

Approach

The method extracts per-token uncertainty metrics during inference and trains a lightweight transformer to predict a help trigger at each step. It is evaluated under both strong (step-level) and weak (episode-level) supervision.
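As a rough illustration of the extraction step, the sketch below computes two of the per-token signals named in the abstract, entropy and the log-probability of the emitted token, from a matrix of decoder logits. It uses NumPy only. The function name `token_uncertainty`, the array shapes, and the toy logits are assumptions for illustration; this is not the paper's code or the π0-FAST API.

```python
import numpy as np

def token_uncertainty(logits, token_ids):
    """Return (entropy, log_prob) for each generated token.

    logits:    (T, V) array of unnormalized scores, one row per token (assumed layout).
    token_ids: (T,) array with the index of the token emitted at each position.
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    # Shannon entropy (in nats) of each token's predictive distribution.
    entropy = -(probs * log_probs).sum(axis=-1)
    # Log-probability of the token that was actually emitted.
    chosen_log_prob = log_probs[np.arange(len(token_ids)), token_ids]
    return entropy, chosen_log_prob

# Toy example: one maximally uncertain token, one confident token.
logits = np.array([[0.0, 0.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0, 0.0]])
token_ids = np.array([1, 0])
entropy, log_prob = token_uncertainty(logits, token_ids)

# Stack the signals into a (T, 2) feature sequence that a sequence
# classifier could consume.
features = np.stack([entropy, log_prob], axis=-1)
```

Keeping the signals as a sequence, rather than reducing them to a single number, is what allows a downstream classifier to look at how uncertainty evolves over the tokens.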

Key results

Sequential transformer modeling of token-level uncertainty significantly outperforms static scores for help detection
Strong supervision yields higher fidelity but requires costly annotation, while weak supervision provides a scalable alternative
Framework generalizes effectively across in-distribution, distribution-shift, and simulated out-of-distribution tasks
Establishes the first systematic evaluation of uncertainty-based introspection for vision-language-action models
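
To make the two supervision regimes concrete, the sketch below builds per-step training targets for one episode in both ways. Broadcasting the episode-level label to every step is an assumption made here for illustration (a common weak-labeling heuristic); the summary does not say how weak labels are converted into step targets, and the function names are hypothetical.

```python
import numpy as np

def strong_targets(step_needs_help):
    """Strong supervision: one annotated label per step."""
    return np.asarray(step_needs_help, dtype=np.int64)

def weak_targets(episode_failed, num_steps):
    """Weak supervision (assumed scheme): copy the episode label to every step."""
    return np.full(num_steps, int(episode_failed), dtype=np.int64)

# Hypothetical five-step episode that ultimately failed; an annotator
# marked only steps 2 and 3 as needing help.
step_needs_help = [0, 0, 1, 1, 0]
strong = strong_targets(step_needs_help)
weak = weak_targets(episode_failed=True, num_steps=len(step_needs_help))

# Fraction of steps where the weak target disagrees with the dense annotation.
label_noise = float((strong != weak).mean())
```

In this toy episode the weak targets disagree with the dense annotation on three of five steps, which illustrates why weak labels are cheaper to collect but noisier.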

Why it matters

Provides a practical pathway for real-time error mitigation and active learning, enabling safer and more reliable human-in-the-loop robot deployment.

Abstract

Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using π0-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore supervision regimes for strong or weak supervision, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
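
For contrast with the sequence models, a static sequence-level score can be as simple as averaging the token entropies of a step and comparing the result to a threshold. The sketch below shows that kind of baseline with hypothetical names, toy entropy values, and an arbitrary threshold; it is not the paper's exact baseline.

```python
import numpy as np

def static_help_trigger(token_entropies, threshold):
    """Static baseline (assumed form): threshold the mean token entropy."""
    score = float(np.mean(token_entropies))
    return score > threshold, score

# Two toy entropy trajectories with the same mean but different dynamics.
flat = [0.9, 1.0, 1.1, 1.0]      # uncertainty stays level
rising = [0.1, 0.3, 1.2, 2.4]    # uncertainty climbs sharply toward the end

flat_trigger, flat_score = static_help_trigger(flat, threshold=1.5)
rising_trigger, rising_score = static_help_trigger(rising, threshold=1.5)
```

Both trajectories receive the same score and therefore the same decision, even though only one shows a sharp rise. A model that reads the full sequence can, in principle, tell them apart.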

Index terms

Learning from Demonstration, Continual Learning, Sensorimotor Learning