ICRA 2026

Learning Composable Skills by Discovering Spatial and Temporal Structure with Foundation Models

Neil Nie, Wenlong Huang, Jiayuan Mao, Li Fei-Fei, Weiyu Liu, Jiajun Wu

AI summary

Key figure (auto-extracted from paper)

Foundation models automatically extract spatial and temporal structure from raw demonstrations to learn composable skills that generalize to novel scenes, constraints, and longer horizons.

Composable skills, Foundation models, Spatial structure, Temporal segmentation, Diffusion policies, Robot manipulation

Problem

Robots struggle with long-horizon manipulation due to poor generalization beyond training data, and existing methods rely on hand-designed structures or lack geometric reasoning for skill composition.

Approach

STACK uses video-language models to segment demonstrations into skills and identify relevant 3D scene elements, then trains diffusion-based trajectory samplers and geometric effect models in each skill's reference frame to enable test-time composition.
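To illustrate why operating in a skill's reference frame helps generalization, the following is a minimal sketch, not STACK's implementation. It assumes a trajectory is a list of 3D waypoints expressed relative to the relevant scene element, and that the element's pose is a rotation about the z-axis plus a translation. The names `make_pose`, `apply_pose`, and `to_world`, and all numeric values, are hypothetical.

```python
import math

def make_pose(yaw, tx, ty, tz):
    """Rigid transform (R, t): rotate about z by `yaw` radians, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return R, [tx, ty, tz]

def apply_pose(pose, p):
    """Map one 3D point from the element's frame into the world frame."""
    R, t = pose
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def to_world(pose, local_traj):
    """Map every waypoint of a local-frame trajectory into the world frame."""
    return [apply_pose(pose, p) for p in local_traj]

# Hypothetical trajectory in the element's frame: descend, then slide along local x.
local_traj = [[0.0, 0.0, 0.2], [0.0, 0.0, 0.1], [0.1, 0.0, 0.0]]

# The same local trajectory reused for two different placements of the element.
pose_a = make_pose(0.0, 0.5, 0.0, 0.0)
pose_b = make_pose(math.pi / 2, -0.3, 0.4, 0.0)
world_a = to_world(pose_a, local_traj)
world_b = to_world(pose_b, local_traj)
```

Because the trajectory is stored relative to the element, moving or rotating the element changes only the pose, not the stored waypoints. In this sketch a fixed list stands in for samples that would come from a learned sampler.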

Key results

Zero missing or extra segments in automatic temporal decomposition
Spatially invariant trajectory samplers and rigid-body effect models per skill
Strong generalization to novel scenes, constraints, and longer horizons across real-world domains
Outperforms baselines in segmentation accuracy and partial success rates on bimanual tasks

Why it matters

Enables robots to autonomously learn and compose manipulation skills from limited demonstrations, advancing real-world long-horizon task execution without manual annotations.

Abstract

We present STACK, a framework for discovering and learning composable manipulation skills from unsegmented demonstrations by leveraging spatial and temporal structure extracted from foundation models. STACK automatically extracts temporal structure by segmenting raw demonstrations into short-horizon skills using a video-language model, and spatial structure by identifying skill-relevant elements in 3D point cloud observations. For each discovered skill, we learn a diffusion-based trajectory sampler and a skill effect model, both of which operate in the reference frame of the relevant scene element. At test time, given a language goal, STACK segments the 3D scene, samples skill trajectories, and composes them by simulating geometric effects. This enables generalization to new scene configurations, geometric constraints, and longer task horizons beyond training across diverse real-world manipulation tasks. Project page: https://icra-stack.github.io
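The idea of composing skills by simulating their geometric effects can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's method: each skill's effect is assumed to be a hand-written rigid transform applied to one object's points, standing in for a learned effect model, and the scene is a dictionary of small point lists. The names `rigid_effect`, `rollout`, and `centroid`, the object names, and all values are hypothetical.

```python
import math

def rigid_effect(yaw, translation):
    """Return a function that moves points rigidly: rotate about z, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    def effect(points):
        return [(c * x - s * y + tx, s * x + c * y + ty, z + tz)
                for x, y, z in points]
    return effect

def rollout(scene, plan):
    """Simulate a plan; each step is (object_name, effect) applied to that object."""
    scene = {name: list(pts) for name, pts in scene.items()}  # leave the input untouched
    for name, effect in plan:
        scene[name] = effect(scene[name])
    return scene

def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

# Hypothetical scene: a two-point "block" and a one-point "bin".
scene = {"block": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)], "bin": [(0.5, 0.5, 0.0)]}

# Hypothetical three-skill plan: lift, carry, place.
plan = [
    ("block", rigid_effect(0.0, (0.0, 0.0, 0.2))),
    ("block", rigid_effect(0.0, (0.45, 0.5, 0.0))),
    ("block", rigid_effect(0.0, (0.0, 0.0, -0.2))),
]

final = rollout(scene, plan)
cx, cy, cz = centroid(final["block"])
```

Chaining the predicted effects lets a planner check the outcome of a multi-skill sequence, such as whether the block's centroid ends at the bin, before executing anything, and a sequence can be longer than any single demonstration.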

Index terms

Manipulation Planning; Integrated Planning and Learning; Learning from Demonstration