S$^2$-Diffusion: Generalizing from Instance-Level to Category-Level Skills in Robot Manipulation
Quantao Yang, Michael C. Welle, Danica Kragic, Olov Andersson
AI summary
Problem
Current imitation learning methods overfit to specific training instances, such as particular object colors or backgrounds, and fail to transfer to new instances within the same category without costly retraining.
Approach
The method replaces raw RGB inputs with a spatial-semantic representation by combining open-vocabulary semantic segmentation and normalized depth maps from foundation models, which then condition a visuomotor diffusion policy.
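As a rough illustration of how such a spatial-semantic observation might be assembled, the sketch below stacks a semantic mask and a normalized depth map into a two-channel array. It is not the authors' implementation: the function names `normalize_depth` and `spatial_semantic_observation`, the choice of min-max normalization, and the channel-stacking layout are all assumptions made for this example. The segmentation and depth models themselves are treated as external and are not shown.

```python
import numpy as np


def normalize_depth(depth, eps=1e-8):
    """Min-max normalize a depth map to [0, 1] (assumed normalization scheme)."""
    depth = np.asarray(depth, dtype=np.float32)
    d_min, d_max = depth.min(), depth.max()
    # eps guards against division by zero when the depth map is constant.
    return (depth - d_min) / (d_max - d_min + eps)


def spatial_semantic_observation(semantic_mask, depth):
    """Stack a semantic mask and a normalized depth map into a (2, H, W) array.

    semantic_mask: (H, W) array, e.g. 1 where the prompted category is present.
    depth:         (H, W) array of estimated depth values.
    Both inputs are assumed to come from upstream foundation models.
    """
    mask = np.asarray(semantic_mask, dtype=np.float32)
    depth = np.asarray(depth, dtype=np.float32)
    if mask.shape != depth.shape:
        raise ValueError("semantic mask and depth map must have the same shape")
    # Channel 0: semantics, channel 1: spatial (depth) information.
    return np.stack([mask, normalize_depth(depth)], axis=0)


# Hypothetical toy inputs standing in for model outputs.
toy_mask = np.array([[0, 1], [1, 0]])
toy_depth = np.array([[1.0, 3.0], [5.0, 9.0]])
obs = spatial_semantic_observation(toy_mask, toy_depth)
print(obs.shape)
```

An array of this form would replace the raw RGB image as the visual conditioning input, so that color and texture differences between instances never reach the policy.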
Key results
- Achieves the highest success rates among the compared baselines across six diverse simulation tasks
- Demonstrates robust zero-shot generalization to unseen object instances and background variations
- Matches multi-view RGB-D baseline performance using only a single RGB camera and proprioception
- Validates real-world transferability of category-level manipulation skills across varied instances
Why it matters
Enables robots to learn complex manipulation skills efficiently from limited demonstrations and deploy them across diverse real-world scenarios without retraining.
Abstract
Recent advances in skill learning have propelled robot manipulation to new heights by enabling robots to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment instances shown in the training data, and transfer poorly to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S2-Diffusion), which enables generalization from instance-level training data to the category level, making skills transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks to allow the use of only a single RGB camera. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S2-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even those it was not trained on. Project website: https://s2-diffusion.github.io.