IROS 2024

RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective

Chenxi Wang, Hongjie Fang, Hao-Shu Fang, Cewu Lu

Abstract

Precise robot manipulation requires rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which makes them sensitive to changes in camera view. Policies that use 3D point clouds usually predict keyframes rather than continuous actions, which makes them difficult to apply in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning that predicts continuous actions directly from single-view point clouds. It compresses the point cloud into tokens with a sparse 3D encoder. After sparse positional encoding is added, the tokens are featurized by a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses representative 2D and 3D policies by a large margin, showing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and more robust to environmental change than previous baselines. Project website: rise-policy.github.io.
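To make the first two stages of the pipeline concrete, the following is a minimal NumPy sketch of turning a point cloud into sparse tokens and computing a positional encoding from the occupied voxel coordinates. It is an illustration under stated assumptions, not the RISE implementation: mean pooling per voxel stands in for the learned sparse 3D encoder, the sinusoidal form of the encoding is assumed, and the names `voxelize`, `sparse_positional_encoding`, `voxel_size`, and `dim_per_axis` are hypothetical. The transformer and diffusion head are omitted.

```python
import numpy as np


def voxelize(points, voxel_size=0.05):
    """Group points into occupied voxels (hypothetical helper).

    Returns the integer coordinates of each occupied voxel and the mean
    of the points that fall inside it. Only occupied voxels are kept,
    which is what makes the representation sparse.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flat indices regardless of NumPy version
    sums = np.zeros((len(uniq), points.shape[1]))
    np.add.at(sums, inverse, points)  # accumulate points per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]


def sparse_positional_encoding(coords, dim_per_axis=8):
    """Sinusoidal encoding of integer voxel coordinates (assumed form).

    Each of the x, y, z coordinates is encoded with dim_per_axis values
    (half sine, half cosine), and the three encodings are concatenated.
    """
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim_per_axis, 2) / dim_per_axis))
    angles = coords[:, :, None].astype(np.float64) * freqs  # (N, 3, dim/2)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(coords), -1)  # (N, 3 * dim_per_axis)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.uniform(-0.5, 0.5, size=(2048, 3))  # synthetic stand-in for a camera point cloud
    voxel_coords, voxel_feats = voxelize(cloud)
    pos_enc = sparse_positional_encoding(voxel_coords)
    # One token per occupied voxel: pooled feature plus its positional encoding.
    tokens = np.concatenate([voxel_feats, pos_enc], axis=1)
    print(tokens.shape)
```

Because the encoding depends only on voxel coordinates, the number of tokens follows the number of occupied voxels rather than the size of a dense grid, which is the property a sparse encoder relies on.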

Index terms

Imitation Learning, RGB-D Perception