ICRA 2026

CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment

Wenbo Cui, Chengyang Zhao, Yuhui Chen, Haoran Li, Zhizheng Zhang, Dongbin Zhao, He Wang

AI summary

CLAR achieves state-of-the-art robotic manipulation performance by fusing 3D masked reconstruction with adaptive local cross-modal alignment.
3D representation, robotic manipulation, masked autoencoding, contrastive learning, cross-modal alignment, deformable attention

Problem

Existing 3D pre-training methods struggle to balance spatial geometry with semantic understanding, while global alignment fails to capture the fine-grained details required for precise manipulation.

Approach

CLAR unifies point cloud masked autoencoding for spatial reconstruction with global contrastive learning for semantics, augmented by a deformable attention mechanism that adaptively aligns local 3D patches with 2D visual features.
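The combination of a reconstruction objective with a global contrastive objective can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the choice of Chamfer distance for reconstruction, symmetric InfoNCE for the contrastive term, the temperature of 0.07, and the weights `w_rec` and `w_con` are all assumptions made for illustration, as are the function names.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    # Pairwise squared distances, shape (N, M).
    d2 = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    # Nearest-neighbour distance in both directions, averaged per point.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def info_nce(z3d, z2d, temperature=0.07):
    """Symmetric InfoNCE over paired 3D/2D global embeddings of shape (B, D)."""
    z3d = z3d / np.linalg.norm(z3d, axis=1, keepdims=True)
    z2d = z2d / np.linalg.norm(z2d, axis=1, keepdims=True)
    logits = z3d @ z2d.T / temperature  # (B, B); diagonal entries are positives

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the 3D-to-2D and 2D-to-3D directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def pretraining_loss(pred_pts, masked_pts, z3d, z2d, w_rec=1.0, w_con=1.0):
    """Hypothetical weighted sum of reconstruction and global contrastive terms."""
    return w_rec * chamfer_distance(pred_pts, masked_pts) + w_con * info_nce(z3d, z2d)
```

In this sketch, `pred_pts` stands for the decoder's reconstruction of the masked point patches and `z2d` for global features from a frozen 2D foundation model; both would come from networks that are not shown here.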

Key results

  • 82.6% success rate on MetaWorld and 82.0% on RLBench
  • 83.0% real-world task success, outperforming baselines by 22%
  • Novel adaptive local alignment resolves contextual mismatch from point cloud cropping
  • Seamlessly transfers semantic knowledge from 2D foundation models to 3D representations

Why it matters

Enables more robust and adaptable visuomotor policies for real-world robotic manipulation by overcoming the spatial and semantic limitations of prior pre-training methods.

Abstract

The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning, while able to distill semantics from 2D foundation models, is ill-suited for the fine-grained details required for manipulation tasks. To address these challenges, we propose CLAR, a novel 3D pre-training framework that synergizes global understanding with fine-grained local alignment. Our framework unifies MAE with global cross-modal contrastive learning to integrate robust spatial awareness with rich semantic understanding. To enhance its focus on fine-grained details, at the local level, we introduce an adaptive alignment mechanism that leverages deformable attention to force precise correspondences between local 3D geometry and 2D visual features, thereby overcoming the limitations of conventional global alignment in manipulation tasks. Extensive experiments in simulation and the real world demonstrate that CLAR achieves state-of-the-art performance, significantly outperforming existing methods in visuomotor policy learning. Our project page is https://cwb0106.github.io/CLAR/.
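To make the idea of deformable-attention-based local alignment concrete, the following NumPy sketch shows one generic way a 3D patch token could gather 2D features adaptively. It follows the standard deformable attention pattern (predicted sampling offsets plus predicted attention weights around a reference point) and is an assumption about the general mechanism, not the architecture described in the paper. The names `W_off`, `W_attn`, and `ref_xy` (the patch centers projected into the image) are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, xy):
    """Sample a (H, W, C) feature map at continuous (x, y) coords of shape (N, 2)."""
    H, W, _ = feat.shape
    x = np.clip(xy[:, 0], 0, W - 1)
    y = np.clip(xy[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def deformable_local_features(tokens, ref_xy, feat2d, W_off, W_attn):
    """
    tokens: (P, D) 3D patch tokens; ref_xy: (P, 2) projected patch centers in pixels;
    feat2d: (H, W, C) 2D feature map; W_off: (D, K*2); W_attn: (D, K).
    Returns (P, C): one adaptively aggregated 2D feature per 3D patch.
    """
    P = tokens.shape[0]
    K = W_attn.shape[1]
    # Each patch token predicts K sampling offsets around its reference point.
    offsets = (tokens @ W_off).reshape(P, K, 2)
    sample_xy = ref_xy[:, None, :] + offsets  # (P, K, 2)
    sampled = bilinear_sample(feat2d, sample_xy.reshape(-1, 2)).reshape(P, K, -1)
    # Each patch token also predicts softmax weights over its K samples.
    logits = tokens @ W_attn
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * sampled).sum(axis=1)
```

The aggregated feature for each patch could then be compared with that patch's 3D token through a cosine or contrastive loss; which loss is used, and how the offsets are parameterized, are left open in this sketch.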

Index terms

Imitation Learning, Representation Learning
