Learning Composable Skills by Discovering Spatial and Temporal Structure with Foundation Models
Neil Nie, Wenlong Huang, Jiayuan Mao, Li Fei-Fei, Weiyu Liu, Jiajun Wu
AI summary
Problem
Robots struggle with long-horizon manipulation due to poor generalization beyond training data, and existing methods rely on hand-designed structures or lack geometric reasoning for skill composition.
Approach
STACK uses video-language models to segment demonstrations into skills and identify relevant 3D scene elements, then trains diffusion-based trajectory samplers and geometric effect models in each skill's reference frame to enable test-time composition.
Key results
- Zero missing or extra segments in automatic temporal decomposition
- Spatially invariant trajectory samplers and rigid-body effect models per skill
- Generalizes to novel scene configurations, geometric constraints, and longer task horizons across real-world domains
- Outperforms baselines in segmentation accuracy and partial success rates on bimanual tasks
Why it matters
Enables robots to autonomously learn and compose manipulation skills from limited demonstrations, advancing real-world long-horizon task execution without manual annotations.
Abstract
We present STACK, a framework for discovering and learning composable manipulation skills from unsegmented demonstrations by leveraging spatial and temporal structure extracted from foundation models. STACK automatically extracts temporal structure by segmenting raw demonstrations into short-horizon skills using a video-language model, and spatial structure by identifying skill-relevant elements in 3D point cloud observations. For each discovered skill, we learn a diffusion-based trajectory sampler and a skill effect model, both of which operate in the reference frame of the relevant scene element. At test time, given a language goal, STACK segments the 3D scene, samples skill trajectories, and composes them by simulating geometric effects. This enables generalization to new scene configurations, geometric constraints, and longer task horizons beyond training across diverse real-world manipulation tasks. Project page: https://icra-stack.github.io
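The abstract's two geometric ideas, expressing each skill in the reference frame of its relevant scene element and composing skills by simulating their effects, can be illustrated with a small sketch. This is not the STACK implementation: it is a planar (x, y, theta) simplification of the 3D setting, the learned diffusion sampler is replaced by fixed waypoints, and every name (`compose`, `to_world`, `apply_effect`, `rollout`) is hypothetical.

```python
import math


def compose(a, b):
    """Compose two planar poses (x, y, theta): apply b in the frame of a."""
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    return (ax + c * bx - s * by, ay + s * bx + c * by, at + bt)


def to_world(frame, local_waypoints):
    """Map waypoints given in a skill's reference frame into the world frame."""
    return [compose(frame, w) for w in local_waypoints]


def apply_effect(object_pose, effect):
    """Assumed rigid-body effect: the skill moves the object by `effect`,
    expressed in the object's own frame."""
    return compose(object_pose, effect)


def rollout(scene, plan):
    """Chain skills, simulating each effect before placing the next trajectory.

    scene: dict mapping element name -> world pose.
    plan:  list of (target name, local waypoints, effect) tuples.
    """
    scene = dict(scene)  # copy so the caller's scene is not mutated
    trajectories = []
    for target, local_waypoints, effect in plan:
        trajectories.append(to_world(scene[target], local_waypoints))
        scene[target] = apply_effect(scene[target], effect)
    return trajectories, scene


# A hypothetical "push" skill: approach from behind, move through the object.
push = ("block", [(-0.1, 0.0, 0.0), (0.1, 0.0, 0.0)], (0.2, 0.0, 0.0))
trajectories, final_scene = rollout({"block": (1.0, 2.0, math.pi / 2)}, [push, push])
```

Because the waypoints are stored relative to the block, the same skill definition applies wherever the block sits, and the second push is placed against the block's simulated pose after the first push rather than its original pose.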