CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment
Wenbo Cui, Chengyang Zhao, Yuhui Chen, Haoran Li, Zhizheng Zhang, Dongbin Zhao, He Wang
AI summary
Problem
Existing 3D pre-training methods struggle to balance spatial geometry with semantic understanding, while global alignment fails to capture the fine-grained details required for precise manipulation.
Approach
CLAR unifies point cloud masked autoencoding for spatial reconstruction with global contrastive learning for semantics, augmented by a deformable attention mechanism that adaptively aligns local 3D patches with 2D visual features.
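The global contrastive component can be illustrated with a symmetric InfoNCE loss over paired 3D and 2D embeddings. This is a minimal sketch only: the summary does not state the exact loss CLAR uses, so the function name `info_nce`, the temperature of 0.07, and the symmetric formulation are assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(rows):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(z - m) for z in row))
        total += lse - row[i]
    return total / len(rows)


def info_nce(feats_3d, feats_2d, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (3D, 2D) global embeddings.

    Hypothetical helper: feats_3d[i] and feats_2d[i] are assumed to come
    from the same scene; every other pairing in the batch is a negative.
    """
    a = [l2_normalize(v) for v in feats_3d]
    b = [l2_normalize(v) for v in feats_2d]
    n = len(a)
    # Cosine-similarity logits between every 3D and every 2D embedding.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the 3D-to-2D and 2D-to-3D directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the signal that pulls the 3D encoder toward the 2D foundation model's semantics.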
Key results
- 82.6% success rate on MetaWorld and 82.0% on RLBench
- 83.0% real-world task success, outperforming baselines by 22%
- Novel adaptive local alignment resolves contextual mismatch from point cloud cropping
- Seamlessly transfers semantic knowledge from 2D foundation models to 3D representations
Why it matters
Enables more robust and adaptable visuomotor policies for real-world robotic manipulation by overcoming the spatial and semantic limitations of prior pre-training methods.
Abstract
The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning, while able to distill semantics from 2D foundation models, is ill-suited for the fine-grained details required for manipulation tasks. To address these challenges, we propose CLAR, a novel 3D pre-training framework that synergizes global understanding with fine-grained local alignment. Our framework unifies MAE with global cross-modal contrastive learning to integrate robust spatial awareness with rich semantic understanding. To enhance its focus on fine-grained details, at the local level, we introduce an adaptive alignment mechanism that leverages deformable attention to force precise correspondences between local 3D geometry and 2D visual features, thereby overcoming the limitations of conventional global alignment in manipulation tasks. Extensive experiments in simulation and the real world demonstrate that CLAR achieves state-of-the-art performance, significantly outperforming existing methods in visuomotor policy learning. Our project page is https://cwb0106.github.io/CLAR/.
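The core operation behind deformable attention can be sketched as sampling a 2D feature map at a few offset locations around a reference point and combining the samples with softmax weights. The sketch below is illustrative and not CLAR's implementation: the names `bilinear_sample` and `deformable_aggregate` are hypothetical, the reference point is assumed to be the 2D projection of a local 3D patch, and the offsets and scores are passed in directly, whereas in a trained model they would be predicted by learned layers.

```python
import math


def bilinear_sample(feat_map, x, y):
    """Bilinearly interpolate feat_map[row][col] (a list of C channels)
    at continuous pixel coordinates (x, y), clamped to the map bounds."""
    h, w = len(feat_map), len(feat_map[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(feat_map[0][0])
    return [
        (1 - fx) * (1 - fy) * feat_map[y0][x0][k]
        + fx * (1 - fy) * feat_map[y0][x1][k]
        + (1 - fx) * fy * feat_map[y1][x0][k]
        + fx * fy * feat_map[y1][x1][k]
        for k in range(channels)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_aggregate(feat_map, ref_xy, offsets, scores):
    """Weighted sum of features sampled at ref_xy + offset.

    Assumption: offsets and scores stand in for the outputs of learned
    prediction heads conditioned on the 3D patch feature.
    """
    weights = softmax(scores)
    out = [0.0] * len(feat_map[0][0])
    for (dx, dy), wgt in zip(offsets, weights):
        sample = bilinear_sample(feat_map, ref_xy[0] + dx, ref_xy[1] + dy)
        out = [o + wgt * v for o, v in zip(out, sample)]
    return out
```

Because the sampling locations are not tied to a fixed grid, each 3D patch can attend to whichever 2D features best match it, which is the property the abstract relies on for local alignment.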