IROS 2024

FM-Fusion: Instance-Aware Semantic Mapping Boosted by Vision-Language Foundation Models

Chuhao Liu, Ke WANG, Jieqi Shi, Zhijian Qiao, Shaojie Shen

Abstract

Semantic mapping based on supervised object detectors is sensitive to image distribution. In real-world environments, object detection and segmentation performance can drop sharply, preventing the use of semantic mapping in a wider domain. On the other hand, vision-language foundation models demonstrate strong zero-shot transferability across data distributions, which provides an opportunity to construct generalizable instance-aware semantic maps. Hence, this work explores how to boost instance-aware semantic mapping using object detections generated by foundation models. We propose a probabilistic label fusion method to predict closed-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our system incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task. It significantly outperforms the traditional semantic mapping method.
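
To illustrate the idea of predicting closed-set classes from open-set label measurements, the following is a minimal sketch of a generic Bayesian label update. It is not the fusion method proposed in the paper: the class lists, the likelihood table, and the function name fuse_labels are all hypothetical values chosen for illustration, and the sketch assumes the open-set labels observed for one instance are conditionally independent given its closed-set class.

```python
import math

# Hypothetical closed-set classes and open-set labels (illustration only).
CLOSED_SET = ["chair", "table", "sofa"]
OPEN_SET = ["chair", "armchair", "desk", "couch"]

# Assumed likelihood table: LIKELIHOOD[open_label][j] = p(open_label | CLOSED_SET[j]).
# Each column sums to 1. The numbers are invented, not taken from the paper.
LIKELIHOOD = {
    "chair":    [0.60, 0.10, 0.10],
    "armchair": [0.30, 0.05, 0.30],
    "desk":     [0.05, 0.75, 0.05],
    "couch":    [0.05, 0.10, 0.55],
}


def fuse_labels(observed_labels, prior=None):
    """Return a posterior over CLOSED_SET given a list of open-set labels."""
    n = len(CLOSED_SET)
    if prior is None:
        prior = [1.0 / n] * n  # uniform prior when nothing else is known
    # Accumulate in log space to avoid underflow over long sequences.
    log_post = [math.log(p) for p in prior]
    for label in observed_labels:
        for j, likelihood in enumerate(LIKELIHOOD[label]):
            log_post[j] += math.log(likelihood)
    # Normalize back to probabilities.
    peak = max(log_post)
    unnormalized = [math.exp(v - peak) for v in log_post]
    total = sum(unnormalized)
    return [v / total for v in unnormalized]


if __name__ == "__main__":
    posterior = fuse_labels(["chair", "chair", "armchair"])
    best = max(range(len(CLOSED_SET)), key=lambda j: posterior[j])
    print(CLOSED_SET[best], posterior)
```

In this sketch each new frame's label multiplies into the running posterior, so a single inconsistent detection shifts the estimate without overriding earlier evidence.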

Index terms

Semantic Scene Understanding, Mapping, RGB-D Perception