AI summary
Problem
Visuomotor policies trained on raw images or point clouds fail to generalize when composing atomic skills in cluttered scenes due to distribution shifts and sensitivity to irrelevant visual noise.
Approach
The method transforms visual observations into dynamic, task-focused 3D scene graphs of relevant objects and relations, processes them with a graph neural network, and conditions a diffusion policy to execute robust atomic skills.
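As a rough illustration of the representation described above, the sketch below builds a task-focused graph from a toy scene and runs one round of message passing. It is not the authors' implementation: the names `SceneObject`, `build_focused_graph`, and `message_passing_step` are hypothetical, object positions are assumed to be already extracted from perception, and an untrained mean aggregation stands in for the learned graph neural network. The diffusion policy that would consume these features is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneObject:
    name: str
    position: Vec3  # assumed: 3D centroid in a shared world frame


def build_focused_graph(
    objects: List[SceneObject], relevant: Set[str]
) -> Tuple[Dict[str, Vec3], Dict[Tuple[str, str], Vec3]]:
    """Keep only task-relevant objects and connect each ordered pair
    with an edge holding the relative position (dst - src)."""
    nodes = {o.name: o.position for o in objects if o.name in relevant}
    edges: Dict[Tuple[str, str], Vec3] = {}
    for src, p in nodes.items():
        for dst, q in nodes.items():
            if src != dst:
                edges[(src, dst)] = tuple(b - a for a, b in zip(p, q))
    return nodes, edges


def message_passing_step(
    nodes: Dict[str, Vec3], edges: Dict[Tuple[str, str], Vec3]
) -> Dict[str, Vec3]:
    """One untrained aggregation step: each node's feature becomes the
    mean of the relative offsets on its incoming edges."""
    updated: Dict[str, Vec3] = {}
    for name in nodes:
        incoming = [rel for (_, dst), rel in edges.items() if dst == name]
        if not incoming:
            updated[name] = (0.0, 0.0, 0.0)
            continue
        n = len(incoming)
        updated[name] = tuple(sum(v[i] for v in incoming) / n for i in range(3))
    return updated


# Toy scene: the distractor is dropped before the graph is built.
scene = [
    SceneObject("gripper", (0.0, 0.0, 0.5)),
    SceneObject("mug", (0.4, 0.0, 0.0)),
    SceneObject("clutter_box", (0.2, 0.3, 0.0)),
]
nodes, edges = build_focused_graph(scene, relevant={"gripper", "mug"})
features = message_passing_step(nodes, edges)
```

Because `clutter_box` never enters the graph, moving or adding distractors leaves `features` unchanged. That is the intuition behind the claimed robustness to irrelevant visual variation.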
Key results
- Near-perfect atomic skill success with stable high performance on compositional tasks
- Success rates of baseline raw-input policies drop by 50–70% on composed tasks
- Strong robustness to visual clutter and perturbations in simulation and real-world tests
- Ablations confirm 3D geometry, explicit graph structure, and GNN processing are critical
Why it matters
Provides a scalable, interpretable framework for training robust low-level manipulation policies that generalize to complex tasks without requiring exhaustive multi-skill demonstrations.
Abstract
A key requirement for generalist robots is compositional generalization: the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine “focused” scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.