ICRA 2026

Re-MAE: Rethinking Masked Autoencoders towards Geometry-Aware Self-Supervised LiDAR-Based 3D Object Detection

Youngho Cheon, Jae-Keun Lee, Soon Kwon, Jin-Hee Lee, Yongseob Lim

AI summary

Explicitly modeling LiDAR geometry during self-supervised pre-training improves downstream 3D object detection, with gains of 2.83 mAP on ONCE and 1.53 L2 mAP on Waymo Open Dataset over baselines.

LiDAR, Masked Autoencoder, Self-Supervised Learning, 3D Object Detection, Geometry-Aware, Point Cloud

Problem

Most existing LiDAR masked autoencoders treat point clouds in a geometry-agnostic way: they ignore distance-dependent sparsity, realistic occlusions, and the imbalance between occupied and empty voxels. This limits their ability to learn robust representations for occluded or distant objects.

Approach

Re-MAE replaces standard masking with Geometry-Aware Masking, which simulates realistic occlusions in LiDAR scans. It trains a multi-scale occupancy prediction task guided by a Reconstruction-Contextual BCE loss, and it applies Realistic Object Augmentation, a label-free foreground augmentation, to focus learning on object structures.
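
To make the idea of occlusion-style masking concrete, here is a minimal NumPy sketch. It is not the paper's Geometry-Aware Masking, whose details are not given on this page. It only illustrates one simple assumption: a hypothetical occluder covering an azimuth sector hides every point in that sector that lies farther from the sensor than the occluder. The function name `occlusion_style_mask` and all of its parameters are illustrative.

```python
import numpy as np

def occlusion_style_mask(points, az_center, az_width, occluder_range):
    """Return a boolean keep-mask that drops points hidden behind a hypothetical occluder.

    points: (N, 3) array of x, y, z in the sensor frame (sensor at the origin).
    az_center, az_width: azimuth sector in radians covered by the occluder.
    occluder_range: horizontal distance of the occluder from the sensor, in metres.
    """
    azimuth = np.arctan2(points[:, 1], points[:, 0])
    horizontal_range = np.hypot(points[:, 0], points[:, 1])
    # Wrapped angular difference in [-pi, pi], so sectors crossing +/-pi work too.
    diff = np.angle(np.exp(1j * (azimuth - az_center)))
    in_sector = np.abs(diff) <= az_width / 2
    behind = horizontal_range > occluder_range
    # A point is masked only if it is both inside the sector and behind the occluder.
    return ~(in_sector & behind)

points = np.array([
    [5.0, 0.0, 0.0],    # in sector, in front of the occluder: kept
    [20.0, 0.0, 0.0],   # in sector, behind the occluder: masked
    [0.0, 20.0, 0.0],   # outside the sector: kept
    [-20.0, 0.0, 0.0],  # opposite direction: kept
])
keep = occlusion_style_mask(
    points, az_center=0.0, az_width=np.deg2rad(30), occluder_range=10.0
)
visible = points[keep]
```

Unlike uniform random masking, a sector-and-range rule of this kind removes contiguous regions in the way a real obstacle would, so the encoder sees the partial views it would meet at test time.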

Key results

+2.83 mAP over baselines on the ONCE dataset
+1.53 L2 mAP over baselines on Waymo Open Dataset
State-of-the-art data-efficient fine-tuning performance
Enhanced detection of heavily occluded and distant objects

Why it matters

Provides a practical, annotation-free pre-training strategy that improves LiDAR-based 3D object detection for autonomous driving when labeled data is limited.

Abstract

Self-supervised pre-training with masked autoencoders has shown promise for 3D perception, yet most approaches treat LiDAR point clouds in a geometry-agnostic manner. In this paper, we introduce Re-MAE, a geometry-aware self-supervised learning framework for LiDAR-based 3D object detection that explicitly encodes core properties of LiDAR point clouds: occlusion, distance-driven sparsity, and occupied-empty voxel structure. Re-MAE rethinks the geometric characteristics of LiDAR point clouds from the perspectives of “what to learn” and “how to learn”, and introduces three components: (i) Geometry-Aware Masking, which realistically simulates occlusions in LiDAR scans and enables learning complete object representations from partial observations; (ii) Reconstruction-Contextual BCE loss, which effectively guides a multi-scale occupancy prediction task to mitigate distance-dependent point sparsity and the strong occupied-empty voxel imbalance, improving detection of both large vehicles and small, distant pedestrians; and (iii) Realistic Object Augmentation, a label-free foreground augmentation strategy that promotes object-centric representation learning and yields consistent gains across categories. Experiments on ONCE and Waymo Open Dataset validate the effectiveness of Re-MAE, delivering gains of 2.83 mAP and 1.53 L2 mAP, respectively, over baselines. These results demonstrate that explicitly incorporating the geometric characteristics of LiDAR point clouds enhances the effectiveness of self-supervised learning. The code will be released.
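
The occupied-empty voxel imbalance mentioned in the abstract can be illustrated with a generic class-balanced binary cross-entropy. The sketch below is not the Reconstruction-Contextual BCE loss, whose formulation is not given on this page. It assumes a standard balancing scheme in which occupied and empty voxels are averaged separately, and a hypothetical multi-scale aggregation that takes the mean over resolutions. All function names are illustrative.

```python
import numpy as np

def balanced_occupancy_bce(logits, occupancy, eps=1e-7):
    """Binary cross-entropy in which occupied and empty voxels contribute equally.

    logits: array of raw occupancy scores, any shape.
    occupancy: same shape, 1 for occupied voxels and 0 for empty voxels.
    """
    logits = np.asarray(logits, dtype=np.float64)
    occupancy = np.asarray(occupancy, dtype=np.float64)
    prob = 1.0 / (1.0 + np.exp(-logits))
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    per_voxel = -(occupancy * np.log(prob) + (1.0 - occupancy) * np.log(1.0 - prob))
    n_occ = occupancy.sum()
    n_empty = occupancy.size - n_occ
    # Average each class separately so the many empty voxels cannot dominate.
    occ_term = (per_voxel * occupancy).sum() / max(n_occ, 1.0)
    empty_term = (per_voxel * (1.0 - occupancy)).sum() / max(n_empty, 1.0)
    return 0.5 * (occ_term + empty_term)

def multi_scale_loss(logits_per_scale, occupancy_per_scale):
    """Mean of the balanced loss over a list of resolutions (assumed aggregation)."""
    losses = [
        balanced_occupancy_bce(l, o)
        for l, o in zip(logits_per_scale, occupancy_per_scale)
    ]
    return float(np.mean(losses))

# Toy grid: one occupied voxel among eight.
occupancy = np.zeros((2, 2, 2))
occupancy[0, 0, 0] = 1.0
all_empty_guess = np.full((2, 2, 2), -4.0)          # predicts "empty" everywhere
good_guess = np.where(occupancy == 1.0, 4.0, -4.0)  # predicts the true occupancy
loss_bad = balanced_occupancy_bce(all_empty_guess, occupancy)
loss_good = balanced_occupancy_bce(good_guess, occupancy)
```

With per-class averaging, the trivial "everything is empty" prediction is penalized heavily even though it is correct on seven of the eight voxels. An unweighted mean over all voxels would penalize it far less.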

Index terms

Intelligent Transportation Systems; Deep Learning for Visual Perception; Object Detection, Segmentation and Categorization