ICRA 2026

S$^2$-Diffusion: Generalizing from Instance-Level to Category-Level Skills in Robot Manipulation

Quantao Yang, Michael C. Welle, Danica Kragic, Olov Andersson

AI summary

S2-Diffusion enables robot manipulation policies to generalize from a single training instance to an entire category of objects by fusing open-vocabulary semantic masks with depth estimation.

Imitation Learning, Diffusion Policy, Spatial-Semantic Representation, Robot Manipulation, Generalization, Single RGB Camera

Problem

Current imitation learning methods overfit to specific training instances, such as particular object colors or backgrounds, and fail to transfer to new instances within the same category without costly retraining.

Approach

The method replaces raw RGB inputs with a spatial-semantic representation that combines open-vocabulary semantic segmentation masks with normalized depth maps, both obtained from foundation models. This representation then conditions a visuomotor diffusion policy.
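As a rough illustration of the spatial-semantic idea, the sketch below min-max normalizes a depth map and stacks it with a binary semantic mask into a two-channel observation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `normalize_depth` and `fuse_spatial_semantic`, the min-max normalization, and the channel-stacking fusion are all hypothetical choices made here, and the mask and depth values are toy inputs standing in for the outputs of a segmentation model and a depth estimation network. Plain Python lists keep the example dependency-free; a real pipeline would operate on image tensors.

```python
from typing import List

Image = List[List[float]]  # row-major H x W grid of values


def normalize_depth(depth: Image, eps: float = 1e-8) -> Image:
    """Min-max normalize a depth map to the range [0, 1] (assumed scheme)."""
    values = [v for row in depth for v in row]
    lo, hi = min(values), max(values)
    scale = max(hi - lo, eps)  # guard against a constant depth map
    return [[(v - lo) / scale for v in row] for row in depth]


def fuse_spatial_semantic(mask: Image, depth: Image) -> List[List[List[float]]]:
    """Stack a binary mask and normalized depth into an H x W x 2 grid.

    Channel 0 is the semantic mask of the prompted object category,
    channel 1 is the normalized depth. Stacking is an assumption here.
    """
    same_rows = len(mask) == len(depth)
    if not same_rows or any(len(m) != len(d) for m, d in zip(mask, depth)):
        raise ValueError("mask and depth must have the same shape")
    norm = normalize_depth(depth)
    return [
        [[float(m), d] for m, d in zip(mask_row, depth_row)]
        for mask_row, depth_row in zip(mask, norm)
    ]


# Toy 2 x 2 example: the mask marks the prompted object, depth is in arbitrary units.
mask = [[0, 1], [1, 0]]
depth = [[1.0, 2.0], [3.0, 5.0]]
obs = fuse_spatial_semantic(mask, depth)
```

Because neither channel carries color or texture, an observation built this way would look the same for two instances of a category that differ only in appearance, which is the intuition behind conditioning the policy on it instead of raw RGB.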

Key results

Achieves the highest success rates among the compared baselines across six diverse simulation tasks
Demonstrates robust zero-shot generalization to unseen object instances and background variations
Matches multi-view RGB-D baseline performance using only a single RGB camera and proprioception
Validates real-world transferability of category-level manipulation skills across varied instances

Why it matters

Enables robots to learn complex manipulation skills efficiently from limited demonstrations and deploy them across diverse real-world scenarios without retraining.

Abstract

Recent advances in skill learning have propelled robot manipulation to new heights by enabling robots to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment instances shown in the training data, and they transfer poorly to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S2-Diffusion) that generalizes from instance-level training data to the category level, making skills transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks so that only a single RGB camera is required. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S2-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even when it was not trained on those specific instances. Project website: https://s2-diffusion.github.io.

Index terms

Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation