INSIGHT: INference-Time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
Ulas Berk Karli, Ziyao Shangguan, Tesca Fitzgerald
AI summary
Problem
Autonomous vision-language-action models lack mechanisms to recognize their own uncertainty or request human assistance during inference, risking unsafe failures in unstructured environments.
Approach
The method extracts per-token uncertainty metrics during inference and trains a lightweight transformer to predict step-by-step help triggers, evaluated under both strong (step-level) and weak (episode-level) supervision.
Key results
- Sequential transformer modeling of token-level uncertainty significantly outperforms static scores for help detection
- Strong supervision yields higher fidelity but requires costly annotation, while weak supervision provides a scalable alternative
- Framework generalizes effectively across in-distribution, distribution-shift, and simulated out-of-distribution tasks
- Establishes the first systematic evaluation of uncertainty-based introspection for vision-language-action models
Why it matters
Provides a practical pathway for real-time error mitigation and active learning, enabling safer and more reliable human-in-the-loop robot deployment.
Abstract
Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework that leverages token-level uncertainty signals to predict when a VLA should request help. Using π0-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore strong and weak supervision regimes and compare them extensively across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
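To make the token-level signals concrete, the following is a minimal sketch of how per-token entropy and log-probability could be computed from raw decoder logits, and how a static sequence-level score differs from the per-token sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy logits, and the choice of mean entropy as the static score are all hypothetical, and the Dirichlet-based aleatoric and epistemic estimates are omitted because the abstract does not specify their formulation.

```python
import math
from typing import List, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_features(logits: List[float], chosen: int) -> Tuple[float, float]:
    """Return (entropy in nats, log-probability of the emitted token)."""
    logp = log_softmax(logits)
    entropy = -sum(math.exp(lp) * lp for lp in logp)
    return entropy, logp[chosen]


def sequence_features(
    step_logits: List[List[float]], chosen_ids: List[int]
) -> List[Tuple[float, float]]:
    """Per-token feature sequence: the kind of input a sequence classifier would consume."""
    return [token_features(l, c) for l, c in zip(step_logits, chosen_ids)]


def mean_entropy(features: List[Tuple[float, float]]) -> float:
    """A static sequence-level score (assumed here: mean entropy) that discards ordering."""
    return sum(e for e, _ in features) / len(features)


# Toy example (hypothetical logits): one maximally uncertain token, one confident token.
feats = sequence_features([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]], [0, 0])
static_score = mean_entropy(feats)
```

In this toy case the first token has uniform logits, so its entropy is ln 4 and its log-probability is -ln 4, while the second token is sharply peaked. The static score collapses both into a single number, whereas the per-token sequence preserves the order in which uncertainty rises or falls, which is the information the abstract reports as more predictive.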