ICRA 2026

SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

Xuewu Lin, Tianwei Lin, Yun Du, Jitao Li, Hongyu Xie, Yiwei Jin, Lichao Huang, Zhizhong Su

AI summary

Projecting multi-view visual and robot state features into a unified 3D space significantly boosts spatial understanding, enabling robust and generalizable robot manipulation across diverse cameras and embodiments.

Robot manipulation, Spatial understanding, Diffusion policy, 3D representation, Embodiment generalization, Vision-language-action

Problem

Current robot manipulation policies rely on 2D visual inputs and process sensor data independently, lacking explicit 3D spatial reasoning and struggling to generalize across different camera setups and robot embodiments.

Approach

SEM unifies multi-view image features and joint-centric robot states in a shared 3D space anchored at the robot base frame, using a spatial enhancer and joint graph attention. The result is a consistent spatial representation on which the diffusion-based action policy is conditioned.
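To make the idea of a shared base frame concrete, the sketch below lifts every pixel of one camera view to a 3D point in the robot base frame using standard pinhole back-projection. It is not SEM's spatial enhancer, whose internals are not described here; it assumes a per-pixel metric depth map is available, and the names `backproject_to_base`, `K`, and `T_base_cam` are hypothetical.

```python
import numpy as np


def backproject_to_base(depth, K, T_base_cam):
    """Lift each pixel of a depth map to a 3D point in the robot base frame.

    Illustrative only: assumes a pinhole camera and known calibration.
    depth:      (H, W) metric depth along the camera z-axis
    K:          (3, 3) camera intrinsics
    T_base_cam: (4, 4) homogeneous camera-to-base transform
    returns:    (H, W, 3) points expressed in the base frame
    """
    H, W = depth.shape
    # u indexes columns, v indexes rows; both have shape (H, W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Rays through each pixel in the camera frame, normalised so z = 1.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth[..., None]
    # Rigid transform into the base frame: p_base = R @ p_cam + t.
    R, t = T_base_cam[:3, :3], T_base_cam[:3, 3]
    return pts_cam @ R.T + t


# Toy example: a 3x5 depth map at a constant 2 m, camera 1 m above the base origin.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.0, 0.0, 1.0]
points = backproject_to_base(np.full((3, 5), 2.0), K, T_base_cam)
```

Because each camera's features are tagged with base-frame coordinates rather than pixel coordinates, moving or swapping a camera changes only `T_base_cam` and `K`, not the coordinate system the policy sees. That is one plausible reading of why a base-frame representation would help under varying camera heights.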

Key results

Achieves up to 29.5% higher mean success rate than baselines on the RoboTwin 2.0 benchmark
Maintains robust performance under varying camera heights, demonstrating strong 3D spatial generalization
Improves cross-embodiment transfer by 5.5% when trained on mixed robot data, unlike baselines that degrade
Delivers state-of-the-art results with only ~40M trainable parameters

Why it matters

Provides a practical pathway for deploying generalizable, spatially-aware manipulation policies across heterogeneous real-world robot platforms and sensor configurations.

Abstract

A key challenge in robot manipulation lies in developing policy models with consistent spatial understanding: the ability to reason about 3D geometry, object relations, and robot state. Existing mainstream models take 2D images as input without performing explicit 3D modeling, and thus lack spatial understanding as well as 3D and embodiment generalization. To address this, we propose SEM (Spatial Enhanced Manipulation), a diffusion-based policy framework that constructs a unified spatial representation by projecting multi-view image features and joint-centric robot states into a shared 3D space. This spatially aligned representation provides a consistent feature space for the diffusion policy to condition on during action generation. Extensive experiments demonstrate that SEM significantly improves spatial understanding, yielding robust and generalizable manipulation across diverse tasks and outperforming existing baselines.

Index terms

Deep Learning in Grasping and Manipulation; Dual Arm Manipulation; Dexterous Manipulation