ICRA 2026

Robust Task Planning Via Failure Detection Using Scene Graph from Multi-View Images

Haechan Chong, Jongwon Lee, Hyemin Ahn

AI summary

A multi-view scene graph framework enables accurate failure detection and LLM-driven replanning, significantly improving robot task success in complex environments.

Task Planning, Failure Detection, Scene Graphs, Multi-view Perception, Robot Replanning, Graph Neural Networks

Problem

Existing LLM-based robot planners often assume full environmental understanding or rely on single-view images, causing unreliable failure detection and planning in cluttered or occluded scenes.

Approach

The method generates local 2D scene graphs from multi-view images, fuses them into a unified representation using a graph neural network, and detects failures by comparing actual object relations against LLM-predicted expectations.

Key results

Unified scene graph construction via E-RGCN fusion of multi-view 2D graphs
Matrix-based failure detection comparing actual and LLM-predicted object relations
Closed-loop replanning that feeds failure causes back to an LLM for corrected task sequences
Evaluated on five real-world benchmark tasks, with a separate comparison of failure detection and reasoning performance against other methods
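
The matrix-based comparison of expected and observed object relations can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the relation vocabulary, the integer matrix encoding, and the names `relation_matrix` and `find_deviations` are all hypothetical, and the observed triples are hand-written stand-ins for what a unified scene graph might provide.

```python
import numpy as np

# Hypothetical relation vocabulary; the paper's actual relation set is not given here.
RELATIONS = ["on", "inside", "next_to"]


def relation_matrix(objects, triples):
    """Encode (subject, relation, object) triples as an N x N integer matrix.

    Entry [i, j] stores the relation index + 1, and 0 means "no relation".
    This encoding is an assumption made for the sketch.
    """
    index = {name: i for i, name in enumerate(objects)}
    mat = np.zeros((len(objects), len(objects)), dtype=int)
    for subj, rel, obj in triples:
        mat[index[subj], index[obj]] = RELATIONS.index(rel) + 1
    return mat


def find_deviations(objects, expected, observed):
    """List every object pair whose observed relation differs from the expected one."""

    def label(value):
        return RELATIONS[value - 1] if value else None

    deviations = []
    # np.argwhere yields the differing (row, col) positions in row-major order.
    for i, j in np.argwhere(expected != observed):
        i, j = int(i), int(j)
        deviations.append(
            (objects[i], objects[j], label(int(expected[i, j])), label(int(observed[i, j])))
        )
    return deviations


objects = ["cup", "plate", "table"]

# Expected state after the sub-task (in the paper, predicted by the LLM at planning time).
expected = relation_matrix(objects, [("cup", "on", "plate"), ("plate", "on", "table")])

# Observed state (in the paper, read from the unified scene graph); here the cup missed the plate.
observed = relation_matrix(objects, [("cup", "on", "table"), ("plate", "on", "table")])

deviations = find_deviations(objects, expected, observed)
task_succeeded = not deviations

for subj, obj, want, got in deviations:
    print(f"{subj} -> {obj}: expected {want}, observed {got}")
```

Each deviation names an object pair together with its expected and observed relation, which is the kind of failure cause that could be turned into text and passed back to the LLM for replanning.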

Why it matters

Provides a computationally efficient and robust pathway for robots to autonomously recover from manipulation errors in unstructured real-world settings.

Abstract

Recent robot task planners utilize large language models (LLMs) or vision-language models (VLMs) as failure detectors. These methods perform well by leveraging their semantic reasoning capabilities but often assume full environment understanding, which can lead to unreliable planning in complex scenes lacking explicit structural modeling. To address these limitations, we propose a novel multi-view scene understanding framework that explicitly models object-level relationships, enabling failure detection and effective task replanning. Our approach first captures multi-view images for comprehensive coverage and generates local 2D scene graphs encoding object identities and relational information. Building on this, we introduce a model based on a graph neural network that merges the local 2D scene graphs into a unified representation. The resulting unified scene graph is used to detect task success and identify failure causes. For each sub-task, our framework compares the unified scene graph with the expected scene graph predicted by the LLM during the task planning stage, identifying potential failure causes based on their deviations. These causes are then fed back into the LLM to facilitate effective replanning, thereby reducing repetitive failures and enhancing adaptability. We evaluate our framework on five real-world benchmark tasks to demonstrate its applicability. Separately, we compare failure detection and reasoning performance with other methods, showing the benefits of combining multi-view perception with explicit graph-based reasoning. More information can be found at https://sites.google.com/view/scrutinize-robot-manipulation

Index terms

Task and Motion Planning, Manipulation Planning