ICRA 2026

Compose by Focus: Scene Graph-Based Atomic Skills

Han Qi, Changhe Chen, Heng Yang

AI summary

Focusing visuomotor policies on task-relevant scene graphs instead of raw visual data drastically improves robustness and success rates for composing atomic skills in cluttered, long-horizon tasks.

scene graphs, skill composition, diffusion policy, visuomotor learning, robotic manipulation, compositional generalization

Problem

Visuomotor policies trained on raw images or point clouds fail to generalize when composing atomic skills in cluttered scenes due to distribution shifts and sensitivity to irrelevant visual noise.

Approach

The method transforms visual observations into dynamic, task-focused 3D scene graphs of relevant objects and relations, processes them with a graph neural network, and conditions a diffusion policy to execute robust atomic skills.
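As a rough illustration of the idea of focusing on task-relevant objects, the sketch below filters a cluttered scene down to the objects a skill cares about, links nearby ones with edges, and runs one round of mean-aggregation message passing over their 3D positions. Everything here is an assumption made for illustration: the names `SceneObject`, `build_focused_graph`, and `message_passing_step`, the distance-based "near" relation, and the 0.3 threshold are hypothetical, and the paper's actual graph construction, relation types, GNN architecture, and diffusion policy are not reproduced.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class SceneObject:
    """Hypothetical container for one detected object."""
    name: str
    position: tuple  # (x, y, z); units assumed to be meters


def build_focused_graph(objects, relevant_names, near_threshold=0.3):
    """Keep only task-relevant objects and connect pairs that are close together.

    The distance-based relation and its threshold are illustrative assumptions.
    """
    nodes = [obj for obj in objects if obj.name in relevant_names]
    edges = [
        (i, j)
        for i, a in enumerate(nodes)
        for j, b in enumerate(nodes)
        if i != j and dist(a.position, b.position) < near_threshold
    ]
    return nodes, edges


def message_passing_step(features, edges):
    """One round of mean aggregation over each node and its incoming neighbors.

    A learned GNN would apply trained weights; this only averages features.
    """
    updated = []
    for i, own in enumerate(features):
        group = [own] + [features[src] for src, dst in edges if dst == i]
        updated.append(tuple(sum(axis) / len(group) for axis in zip(*group)))
    return updated


# A cluttered scene: only the mug and plate matter for this hypothetical skill.
scene = [
    SceneObject("mug", (0.0, 0.0, 0.0)),
    SceneObject("plate", (0.2, 0.0, 0.0)),
    SceneObject("clutter_box", (0.1, 0.0, 0.0)),
    SceneObject("bottle", (1.0, 1.0, 1.0)),
]
nodes, edges = build_focused_graph(scene, {"mug", "plate"})
embedding = message_passing_step([n.position for n in nodes], edges)
```

Because the clutter objects never enter the graph, adding or moving them leaves `embedding` unchanged, which is the intuition behind the claimed robustness to irrelevant visual variation. In the described method, such graph features would condition a diffusion policy rather than be used directly.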

Key results

Near-perfect atomic skill success with stable high performance on compositional tasks
Baseline raw-input policies drop by 50–70% in success rate on composed tasks
Strong robustness to visual clutter and perturbations in simulation and real-world tests
Ablations confirm 3D geometry, explicit graph structure, and GNN processing are critical

Why it matters

Provides a scalable, interpretable framework for training robust low-level manipulation policies that generalize to complex tasks without requiring exhaustive multi-skill demonstrations.

Abstract

A key requirement for generalist robots is compositional generalization, the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine “focused” scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.

Index terms

Deep Learning in Grasping and Manipulation; Imitation Learning; Learning from Demonstration