CrossVideo: Self-Supervised Cross-Modal Contrastive Learning for Point Cloud Video Understanding
Yunze Liu, Changxi Chen, Zifan Wang, Li Yi
Abstract
This paper introduces a novel approach named CrossVideo, which aims to enhance self-supervised cross-modal contrastive learning in the field of point cloud video understanding. Traditional supervised learning methods encounter limitations due to data scarcity and challenges in label acquisition. To address these issues, we propose a self-supervised learning method that leverages the cross-modal relationship between point cloud videos and image videos to acquire meaningful feature representations. Intra-modal and cross-modal contrastive learning techniques are employed to facilitate effective comprehension of point cloud video. We also propose a multi-level contrastive approach for both modalities. Through extensive experiments, we demonstrate that our method significantly surpasses previous state-of-the-art approaches, and we conduct comprehensive ablation studies to validate the effectiveness of our proposed designs.
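As an illustration of the intra-modal and cross-modal contrastive objectives mentioned above, the following is a minimal sketch and not the CrossVideo implementation. It assumes a standard symmetric InfoNCE loss over batches of precomputed embeddings. The function names (`info_nce`, `combined_loss`), the temperature of 0.07, the averaging of the two point cloud views, and the unweighted sum of the two terms are all hypothetical choices made for this example; the multi-level component is not modeled here.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each (B, D) array forms a positive pair.

    Generic contrastive loss, assumed here for illustration only.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (B, B) similarity matrix

    def diag_cross_entropy(z):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the anchor-to-positive and positive-to-anchor directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


def combined_loss(pc_view1, pc_view2, img_feat, temperature=0.07):
    """Hypothetical combination of an intra-modal and a cross-modal term."""
    # Intra-modal: two augmented views of the same point cloud video clip.
    intra = info_nce(pc_view1, pc_view2, temperature)
    # Cross-modal: mean point cloud embedding against the paired image video embedding.
    pc_mean = 0.5 * (pc_view1 + pc_view2)
    cross = info_nce(pc_mean, img_feat, temperature)
    return intra + cross
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the property a contrastive pretraining objective relies on; in practice the embeddings would come from a point cloud video encoder and an image video encoder rather than from fixed arrays.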