SEM: Enhancing Spatial Understanding for Robust Robot Manipulation
Xuewu Lin, Tianwei Lin, Yun Du, Jitao Li, Hongyu Xie, Yiwei Jin, Lichao Huang, Zhizhong Su
AI summary
Problem
Current robot manipulation policies rely on 2D visual inputs and process sensor data independently, lacking explicit 3D spatial reasoning and struggling to generalize across different camera setups and robot embodiments.
Approach
SEM unifies multi-view image features and joint-centric robot states into a shared 3D base-frame using a spatial enhancer and joint graph attention, providing a consistent spatial representation for a diffusion-based action policy.
Key results
- Achieves up to 29.5% higher mean success rate than baselines on the RoboTwin 2.0 benchmark
- Maintains robust performance under varying camera heights, demonstrating strong 3D spatial generalization
- Improves cross-embodiment transfer by 5.5% when trained on mixed robot data, unlike baselines that degrade
- Delivers state-of-the-art results with only ~40M trainable parameters
Why it matters
Provides a practical pathway for deploying generalizable, spatially-aware manipulation policies across heterogeneous real-world robot platforms and sensor configurations.
Abstract
A key challenge in robot manipulation lies in developing policy models with consistent spatial understanding: the ability to reason about 3D geometry, object relations, and robot state. Existing mainstream models take 2D images as input without performing explicit 3D modeling, and thus lack spatial understanding capabilities as well as 3D and embodiment generalization. To address this, we propose SEM (Spatial Enhanced Manipulation), a diffusion-based policy framework that constructs a unified spatial representation by projecting multi-view image features and joint-centric robot states into a shared 3D space. This spatially aligned representation provides a consistent feature space for the diffusion policy to condition on during action generation. Extensive experiments demonstrate that SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks and outperforming existing baselines.
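To make the idea of a shared 3D space concrete, the following is a minimal geometric sketch, not SEM's actual implementation. It assumes a pinhole camera model, per-pixel metric depth, and known camera-to-base extrinsics. None of these are specified in the text above, and the function name `pixels_to_base_frame` and all parameter names are hypothetical. It shows why expressing observations in the robot base frame can make them independent of camera placement: two cameras mounted at different positions map the same physical point to the same base-frame coordinates.

```python
import numpy as np


def pixels_to_base_frame(uv, depth, K, T_base_cam):
    """Lift pixels with depth into the robot base frame (illustrative only).

    uv:         (N, 2) pixel coordinates (u, v)
    depth:      (N,) metric depth along the camera z-axis (assumed available)
    K:          (3, 3) pinhole intrinsics
    T_base_cam: (4, 4) homogeneous camera-to-base transform
    """
    uv = np.asarray(uv, dtype=float)
    depth = np.asarray(depth, dtype=float)
    ones = np.ones((uv.shape[0], 1))

    # Back-project pixels to camera-frame rays with z = 1, then scale by depth.
    rays = np.hstack([uv, ones]) @ np.linalg.inv(K).T
    pts_cam = rays * depth[:, None]

    # Apply the rigid transform to express the points in the base frame.
    pts_base = np.hstack([pts_cam, ones]) @ np.asarray(T_base_cam, dtype=float).T
    return pts_base[:, :3]


# Hypothetical setup: two cameras with identical intrinsics, offset along the
# base z-axis by different amounts, both looking along that axis.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

T_a = np.eye(4)
T_a[2, 3] = 1.0   # camera A sits 1.0 m along base z
T_b = np.eye(4)
T_b[2, 3] = 1.5   # camera B sits 1.5 m along base z

# The same physical point appears at different depths in each camera.
point_from_a = pixels_to_base_frame([[320.0, 240.0]], [2.0], K, T_a)
point_from_b = pixels_to_base_frame([[320.0, 240.0]], [1.5], K, T_b)
```

Here `point_from_a` and `point_from_b` coincide even though the raw depths differ, which is the property a base-frame representation is meant to give a downstream policy. In a learned system the lifted quantities would be image features rather than bare points.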