ICRA 2026

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

In its failure explanation, the VLM references the BEV diagram to infer that the cup’s position remains unchanged.

AI summary

Key figure (auto-extracted from paper)

KITE transforms raw robot videos into compact, keyframe-anchored schematics that substantially improve off-the-shelf VLMs' failure detection and explanation without any fine-tuning.

Robot failure analysis, Vision-language models, Keyframe indexing, Pseudo-BEV schematics, Training-free robotics, RoboFAC benchmark

Problem

Off-the-shelf vision-language models struggle to reason over long robot execution videos due to dense visuals and limited temporal memory, while existing failure analysis methods typically require costly task-specific fine-tuning.

Approach

KITE distills long execution videos into motion-salient keyframes paired with pseudo bird’s-eye-view schematics and open-vocabulary detections, serializing this layout-grounded context into a unified prompt for an unmodified VLM.
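The keyframe-distillation step can be illustrated with a minimal sketch. It assumes that motion saliency is approximated by the mean absolute pixel difference between consecutive grayscale frames and that keyframes are chosen greedily with a minimum spacing. The summary does not specify KITE's actual saliency measure or selection rule, so the function names `motion_scores` and `select_keyframes` and the parameters `k` and `min_gap` are hypothetical.

```python
from typing import List, Sequence

# Assumption: a frame is a grayscale image given as rows of pixel intensities.
Frame = Sequence[Sequence[float]]


def motion_scores(frames: List[Frame]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor."""
    scores = [0.0]  # the first frame has no predecessor, so its score is 0
    for prev, cur in zip(frames, frames[1:]):
        total = 0.0
        count = 0
        for row_prev, row_cur in zip(prev, cur):
            for p, c in zip(row_prev, row_cur):
                total += abs(c - p)
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def select_keyframes(scores: List[float], k: int, min_gap: int = 1) -> List[int]:
    """Greedily pick up to k highest-motion frame indices, at least min_gap apart."""
    # Visit frames from highest to lowest motion score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in order:
        if len(chosen) == k:
            break
        # Skip frames that sit too close to an already selected keyframe.
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # return in temporal order


# Toy 2x2 "video": a large change at frame 2 and a small one at frame 4.
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[4, 4], [4, 4]],
    [[4, 4], [4, 4]],
    [[0, 4], [4, 4]],
]
scores = motion_scores(frames)
keyframes = select_keyframes(scores, k=2, min_gap=2)
```

The spacing constraint is one simple way to keep the selected frames from clustering around a single burst of motion. A real pipeline would more likely score motion with optical flow or robot state changes.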

Key results

Training-free KITE + Qwen2.5-VL-7B outperforms vanilla Qwen2.5-VL-7B on RoboFAC, with gains of +36% in failure detection and +33% in localization
Achieves performance competitive with a RoboFAC-tuned baseline without any model training
Ablation studies show that pseudo-BEV schematics and motion-based keyframe selection are critical
Demonstrates qualitative generalization on real-world dual-arm robot rollouts (DART and ALOHA-2)

Why it matters

Enables researchers and roboticists to perform accurate, interpretable failure diagnosis on long-horizon tasks using generalist VLMs without costly fine-tuning or custom architectures.

Abstract

We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird’s-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Project page: https://m80hz.github.io/kite/
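As a rough illustration of the serialization step described in the abstract, the sketch below turns per-keyframe detections into text lines and joins them with robot-profile and scene-context lines into one prompt. It is text-only and does not render the schematic BEV image that KITE pairs with each keyframe. The token layout (`ROBOT:`, `SCENE:`, `[KF ...]`), the ground-plane coordinates, and all class and function names are assumptions made for illustration, not the paper's actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str         # open-vocabulary class name, e.g. "cup"
    x: float           # hypothetical relative layout coordinates (not from the paper)
    y: float
    confidence: float  # detector confidence in [0, 1]


@dataclass
class Keyframe:
    index: int
    timestamp_s: float
    detections: List[Detection]


def serialize_keyframe(kf: Keyframe) -> str:
    """Render one keyframe as a single line of layout-grounded text evidence."""
    parts = [f"[KF {kf.index} t={kf.timestamp_s:.2f}s]"]
    # Sort by label so the same scene always serializes the same way.
    for d in sorted(kf.detections, key=lambda d: d.label):
        parts.append(f"{d.label}@({d.x:.2f},{d.y:.2f}) conf={d.confidence:.2f}")
    return " ".join(parts)


def build_prompt(robot_profile: str, scene_context: str,
                 keyframes: List[Keyframe], question: str) -> str:
    """Join profile, scene, keyframe evidence, and the task question into one prompt."""
    lines = [f"ROBOT: {robot_profile}", f"SCENE: {scene_context}"]
    # Keyframes are emitted in temporal order regardless of input order.
    lines += [serialize_keyframe(kf)
              for kf in sorted(keyframes, key=lambda k: k.timestamp_s)]
    lines.append(f"QUESTION: {question}")
    return "\n".join(lines)


kf_late = Keyframe(1, 3.0, [Detection("cup", 0.2, -0.1, 0.88)])
kf_early = Keyframe(0, 1.5, [Detection("gripper", 0.0, 0.3, 0.95),
                             Detection("cup", 0.2, -0.1, 0.91)])
prompt = build_prompt("dual-arm manipulator", "tabletop with one cup",
                      [kf_late, kf_early], "Did the task fail, and at which keyframe?")
```

In this toy input the cup's coordinates are identical in both keyframes, which is the kind of layout cue a VLM could use to conclude that an object never moved. Changing only the final question line would let the same evidence serve detection, localization, or explanation queries.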

Index terms

Semantic Scene Understanding; Failure Detection and Recovery; Representation Learning