ICRA 2024

Open-Fusion: Real-Time Open-Vocabulary 3D Mapping and Queryable Scene Representation

Kashu Yamazaki, Taisei Hanyu, Khoa Vo, Trong Thang Pham, Tran Minh, Gianfranco Doretto, Anh Nguyen, Ngan Le

Abstract

Precise 3D environmental mapping with semantics is essential in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, an approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pretrained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with the 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. In particular, Open-Fusion delivers outstanding annotation-free 3D segmentation for open vocabulary query without the need for additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion’s superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion
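For readers unfamiliar with TSDF reconstruction, the following is a minimal sketch of the standard per-voxel weighted running-average update used in TSDF fusion generally. It is not taken from the Open-Fusion implementation: the names `Voxel`, `truncate`, and `integrate`, and the default truncation distance and weight cap, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    """One voxel of a TSDF volume (hypothetical layout, for illustration only)."""
    tsdf: float = 1.0    # truncated signed distance, normalized to [-1, 1]
    weight: float = 0.0  # accumulated observation weight


def truncate(sdf: float, trunc: float) -> float:
    """Scale a signed distance (metres) by the truncation band and clamp to [-1, 1]."""
    return max(-1.0, min(1.0, sdf / trunc))


def integrate(voxel: Voxel, sdf: float, trunc: float = 0.04,
              obs_weight: float = 1.0, max_weight: float = 64.0) -> Voxel:
    """Fuse one depth observation into a voxel with a weighted running average.

    sdf is the measured depth minus the voxel's depth along the camera ray,
    so it is positive in front of the observed surface and negative behind it.
    The defaults for trunc and max_weight are assumed values, not the paper's.
    """
    # Voxels far behind the observed surface are occluded; leave them unchanged.
    if sdf < -trunc:
        return voxel
    d = truncate(sdf, trunc)
    new_weight = voxel.weight + obs_weight
    voxel.tsdf = (voxel.tsdf * voxel.weight + d * obs_weight) / new_weight
    # Cap the weight so the map can still adapt to later observations.
    voxel.weight = min(new_weight, max_weight)
    return voxel


if __name__ == "__main__":
    v = Voxel()
    integrate(v, 0.02)   # voxel 2 cm in front of the surface
    integrate(v, 0.0)    # voxel exactly on the surface
    integrate(v, -1.0)   # far behind the surface: skipped
    print(v)
```

Each update touches only one voxel and requires no iteration over past frames, which is one reason TSDF fusion suits real-time mapping. How the region-level embeddings are attached to the reconstructed geometry is specific to the paper and is not shown here.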

Index terms

Semantic Scene Understanding, Mapping, Localization