MOSFormer: A Transformer-Based Multi-Modal Fusion Network for Moving Object Segmentation
Zike Cheng, Hengwang Zhao, Qiyuan Shen, Weihao Yan, Chunxiang Wang, Ming Yang
Abstract
3D moving object segmentation (MOS) is vital for autonomous systems, providing essential information for downstream tasks such as mapping and localization. However, current MOS methods are constrained by existing datasets, which contain few moving objects and offer limited scene diversity. Moreover, the prevalent methods are projection-based and suffer from blurred object boundaries. To address the dataset issue, we introduce a nuScenes-based MOS dataset that provides richer scenes and more dynamic instances. To alleviate boundary blurring and further improve accuracy and generalizability, we propose MOSFormer, a dual-branch multimodal fusion MOS network. A Transformer structure is incorporated to better extract spatio-temporal information, while image semantic information is used to refine the boundaries of moving objects. Experiments on two datasets show that our method achieves state-of-the-art performance, and a mapping experiment with our method confirms its effectiveness in downstream tasks such as mapping and localization.