SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks
Jialiang Li, Wenzheng Wu, Gaojing Zhang, Yifan Han, Wenzhao Lian
AI summary
Problem
Long-horizon manipulation requires bridging the semantic gap between high-level symbolic planning and low-level continuous control, but existing methods struggle with generalization, hallucination, or generating reliable goal images for unseen tasks.
Approach
SAGE uses a scene graph task planner to generate physically-grounded step-by-step transition chains, coupled with a decoupled image editing pipeline that converts each graph state into a precise sub-goal image for a goal-conditioned policy.
Key results
- Robust scene graph task planner decomposes long-horizon tasks into physically-grounded transition chains
- Decoupled structural image editing pipeline controllably synthesizes accurate sub-goal images
- State-of-the-art success rates across sequential, flexible, and hybrid tasks (up to 100% seen, 91.3% unseen)
- Iterative goal-switching aligns symbolic planning with pixel-level visuo-motor control
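As a rough illustration of the iterative goal-switching idea, the sketch below steps a goal-conditioned policy toward the current sub-goal and advances to the next one once it is reached. It is not the SAGE implementation: `run_with_goal_switching`, `policy_step`, `reached`, and `max_steps` are hypothetical names, and integers stand in for the observations and sub-goal images so that the example stays runnable.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Obs = TypeVar("Obs")
Goal = TypeVar("Goal")


def run_with_goal_switching(
    obs: Obs,
    sub_goals: Sequence[Goal],
    policy_step: Callable[[Obs, Goal], Obs],
    reached: Callable[[Obs, Goal], bool],
    max_steps: int = 100,
) -> Tuple[Obs, int]:
    """Follow sub-goals in order; return (final observation, sub-goals completed).

    All names here are illustrative assumptions, not the paper's API.
    """
    idx = 0    # index of the sub-goal currently conditioning the policy
    steps = 0  # number of policy steps taken so far
    while idx < len(sub_goals):
        if reached(obs, sub_goals[idx]):
            idx += 1  # switch to the next sub-goal
            continue
        if steps >= max_steps:
            break  # step budget exhausted before the current sub-goal was reached
        obs = policy_step(obs, sub_goals[idx])
        steps += 1
    return obs, idx


def toy_policy(obs: int, goal: int) -> int:
    """Toy stand-in for a visuo-motor policy: move one unit toward the goal."""
    if obs < goal:
        return obs + 1
    if obs > goal:
        return obs - 1
    return obs


def toy_reached(obs: int, goal: int) -> bool:
    """Toy stand-in for a sub-goal completion check."""
    return obs == goal


final_obs, completed = run_with_goal_switching(0, [3, 1, 4], toy_policy, toy_reached)
```

In a real system the completion check would compare the current camera observation against the sub-goal image or its scene graph; how SAGE performs that check is not specified in this section.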
Why it matters
Enables robots to reliably execute complex, multi-step manipulation tasks in dynamic real-world environments by seamlessly integrating symbolic reasoning with visual goal conditioning.
Abstract
Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. To bridge this gap, two essential capabilities are required: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, including traditional and LLM-based approaches, often exhibit limited generalization or sparse semantic reasoning. Meanwhile, image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE utilizes semantic scene graphs as a structural representation for scene states. A structural scene graph enables bridging task-level semantic reasoning and pixel-level visuo-motor control. This also facilitates the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically-grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments have demonstrated that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.
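To make the idea of a scene state transition sequence concrete, here is a minimal sketch that represents a scene graph as a set of (subject, relation, object) triples and a transition as a set of removed and added relations. The representation and the names `SceneGraph`, `apply`, and `changed_objects` are assumptions for illustration only; the abstract does not specify SAGE's actual graph schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

# A relation triple: (subject, relation, object), e.g. ("cup", "on", "table").
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical scene state: a set of relation triples between objects."""

    triples: FrozenSet[Triple]

    def apply(self, remove: Set[Triple], add: Set[Triple]) -> "SceneGraph":
        """Return the next state after removing and adding relations."""
        missing = remove - self.triples
        if missing:
            # A transition may only remove relations that currently hold.
            raise ValueError(f"cannot remove absent relations: {sorted(missing)}")
        return SceneGraph((self.triples - remove) | add)


def changed_objects(before: SceneGraph, after: SceneGraph) -> Set[str]:
    """Subjects whose relations differ between two consecutive states.

    A sub-goal image editor would plausibly need this set to know which
    objects to move in the image (an assumption, not the paper's method).
    """
    diff = before.triples ^ after.triples  # symmetric difference
    return {subject for subject, _, _ in diff}


# A two-step transition chain for the toy task "stack the cup on the plate,
# then put the spoon in the cup".
s0 = SceneGraph(frozenset({
    ("cup", "on", "table"),
    ("plate", "on", "table"),
    ("spoon", "on", "table"),
}))
s1 = s0.apply({("cup", "on", "table")}, {("cup", "on", "plate")})
s2 = s1.apply({("spoon", "on", "table")}, {("spoon", "in", "cup")})
chain = [s0, s1, s2]
```

Each consecutive pair in `chain` changes a single object's relations, which is the kind of step-by-step decomposition that could be turned into one sub-goal image per state.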