ICRA 2026

SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks

Jialiang Li, Wenzheng Wu, Gaojing Zhang, Yifan Han, Wenzhao Lian

AI summary

SAGE bridges symbolic planning and visual control for long-horizon manipulation by using scene graphs to generate precise sub-goal images, achieving state-of-the-art success rates on complex tasks.

Long-horizon manipulation; Scene graphs; Goal-conditioned control; Image editing; Robotic planning; Visuo-motor control

Problem

Long-horizon manipulation requires bridging the semantic gap between high-level symbolic planning and low-level continuous control, but existing methods struggle with generalization, hallucination, or generating reliable goal images for unseen tasks.

Approach

SAGE uses a scene graph task planner to generate physically-grounded step-by-step transition chains, coupled with a decoupled image editing pipeline that converts each graph state into a precise sub-goal image for a goal-conditioned policy.
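As a rough illustration of what a chain of graph-state transitions could look like, the sketch below stores a scene state as a set of (subject, relation, object) triples and applies symbolic moves one at a time. The names SceneGraph, move, and transition_chain are hypothetical and not taken from the paper; the actual planner uses VLMs and LLMs to parse the scene and reason about transitions, which this toy example does not attempt.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical encoding (not from the paper): one edge is (subject, relation, object).
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class SceneGraph:
    edges: FrozenSet[Triple]

    def move(self, obj: str, relation: str, target: str) -> "SceneGraph":
        """Return a new graph in which `obj` has exactly one edge: (obj, relation, target).

        Simplifying assumption: every edge whose subject is `obj` is dropped first.
        """
        kept = {e for e in self.edges if e[0] != obj}
        return SceneGraph(frozenset(kept | {(obj, relation, target)}))


def transition_chain(start: SceneGraph, moves: List[Triple]) -> List[SceneGraph]:
    """Apply the moves in order and return every intermediate sub-goal graph."""
    chain = [start]
    for obj, relation, target in moves:
        chain.append(chain[-1].move(obj, relation, target))
    return chain


# Usage: stack the cup on the plate, then the spoon in the cup.
start = SceneGraph(frozenset({("cup", "on", "table"),
                              ("plate", "on", "table"),
                              ("spoon", "on", "table")}))
chain = transition_chain(start, [("cup", "on", "plate"), ("spoon", "in", "cup")])
```

Each element of chain after the first plays the role of one sub-goal graph; in SAGE, each such graph would then be converted into a sub-goal image by the image editing pipeline.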

Key results

Robust scene graph task planner decomposes long-horizon tasks into physically-grounded transition chains
Decoupled structural image editing pipeline controllably synthesizes accurate sub-goal images
State-of-the-art success rates across sequential, flexible, and hybrid tasks (up to 100% on seen tasks and 91.3% on unseen tasks)
Iterative goal-switching aligns symbolic planning with pixel-level visuo-motor control
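
The iterative goal-switching idea can be pictured as a control loop that follows one sub-goal at a time and advances to the next once the current one is judged reached. The sketch below is a generic version under stated assumptions: run_with_goal_switching, policy_step, and reached are hypothetical names, and the integer states stand in for the images and robot observations used in the paper.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # state / observation
G = TypeVar("G")  # sub-goal (an image in SAGE; any comparable object here)


def run_with_goal_switching(
    state: S,
    goals: List[G],
    policy_step: Callable[[S, G], S],   # hypothetical goal-conditioned policy
    reached: Callable[[S, G], bool],    # hypothetical sub-goal completion check
    max_steps: int = 100,
) -> Tuple[S, int]:
    """Follow sub-goals in order; return the final state and how many were completed."""
    idx = 0
    for _ in range(max_steps):
        # Switch goals as long as the current one is already satisfied.
        while idx < len(goals) and reached(state, goals[idx]):
            idx += 1
        if idx == len(goals):
            return state, idx
        state = policy_step(state, goals[idx])
    # Step budget exhausted: account for a goal reached by the very last step.
    while idx < len(goals) and reached(state, goals[idx]):
        idx += 1
    return state, idx


# Toy usage: a 1-D "robot" that moves one unit per step toward the current goal.
def step_toward(s: int, g: int) -> int:
    return s + (1 if g > s else -1)


final_state, completed = run_with_goal_switching(
    0, [3, 1, 4], step_toward, lambda s, g: s == g
)
# final_state == 4, completed == 3
```

Returning the number of completed sub-goals rather than a single success flag makes it possible to see how far along the chain an episode got before the step budget ran out.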

Why it matters

Enables robots to reliably execute complex, multi-step manipulation tasks in dynamic real-world environments by seamlessly integrating symbolic reasoning with visual goal conditioning.

Abstract

Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. To bridge this gap, two essential capabilities are required: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, including traditional and LLM-based approaches, often exhibit limited generalization or sparse semantic reasoning. Meanwhile, image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE utilizes semantic scene graphs as a structural representation for scene states. A structural scene graph enables bridging task-level semantic reasoning and pixel-level visuo-motor control. This also facilitates the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically-grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments have demonstrated that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.

Index terms

Integrated Planning and Learning; Semantic Scene Understanding; Visual Learning