ICRA 2026

Open-Vocabulary Online Semantic Mapping for SLAM

Tomas Berriel Martins, Martin R. Oswald, Javier Civera

AI summary

OVO is an online, open-vocabulary 3D semantic mapping pipeline that integrates with SLAM. It reports better segmentation metrics than both offline and online baselines, and a lower computational and memory footprint than offline ones.

Open-vocabulary mapping, Online SLAM, 3D semantic reconstruction, CLIP descriptors, Neural feature fusion, Real-time robotics

Problem

Existing semantic mapping methods rely on closed-set categories or offline processing, limiting real-world applicability, while current online open-vocabulary approaches lack full SLAM integration with loop closure.

Approach

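The pipeline detects and tracks 3D segments from a sequence of posed RGB-D frames. Each segment is described with CLIP descriptors computed from the viewpoints where it is observed, and a neural network fuses these multi-view descriptors into a single descriptor that supports open-vocabulary labels. The mapping module integrates directly into full SLAM systems and supports loop closure.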
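To make the open-vocabulary labeling step concrete, here is a minimal sketch of how a fused segment descriptor could be matched against text queries by cosine similarity. This is the usual way CLIP-style descriptors are queried, but the summary does not spell out OVO's exact querying procedure, so treat it as an assumption. The function names `cosine` and `label_segment` are hypothetical, and the toy 3-D vectors stand in for real CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_segment(segment_descriptor, text_embeddings):
    """Return the query name whose embedding is most similar to the segment descriptor.

    Assumption: `text_embeddings` maps free-form query strings to vectors that
    live in the same embedding space as the segment descriptor.
    """
    return max(
        text_embeddings,
        key=lambda name: cosine(segment_descriptor, text_embeddings[name]),
    )


# Toy vectors (hypothetical); real CLIP embeddings have hundreds of dimensions.
queries = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
segment = [0.2, 0.9, 0.1]
best = label_segment(segment, queries)
```

Because the label set is just a dictionary of query strings, it can be changed at query time without touching the map, which is the practical meaning of "open-vocabulary" here.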

Key results

First end-to-end open-vocabulary online 3D mapping pipeline with full SLAM and loop closure integration
Novel neural network learns per-dimension weights to fuse multi-view CLIP descriptors
Better segmentation metrics than both offline and online baselines, with a lower computational and memory footprint than offline ones
Demonstrated with two full SLAM backbones, Gaussian-SLAM and ORB-SLAM2
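The per-dimension fusion idea in the list above can be sketched as follows. This is an illustration under stated assumptions, not the paper's architecture: in OVO the weights are predicted by a neural network, whereas here they are passed in by hand as `weight_logits`. The softmax normalization across views and the final L2 normalization are also assumptions, and `fuse_descriptors` is a hypothetical name.

```python
import math


def fuse_descriptors(descriptors, weight_logits):
    """Fuse K view descriptors of dimension D into one descriptor.

    descriptors:   K lists of D floats (one descriptor per viewpoint).
    weight_logits: K lists of D floats (one unnormalized weight per view and
                   per dimension; stand-in for a network's output).
    For every dimension, the logits are softmax-normalized across the K views
    and used as weights in a weighted sum. The result is L2-normalized.
    """
    num_views = len(descriptors)
    dim = len(descriptors[0])
    fused = []
    for j in range(dim):
        logits = [weight_logits[i][j] for i in range(num_views)]
        max_logit = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - max_logit) for value in logits]
        total = sum(exps)
        fused.append(
            sum(exps[i] / total * descriptors[i][j] for i in range(num_views))
        )
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused] if norm > 0.0 else fused


# Two toy 2-D "views" of the same segment (hypothetical values).
views = [[1.0, 0.0], [0.0, 1.0]]

# Equal logits reduce to a plain (normalized) average of the views.
averaged = fuse_descriptors(views, [[0.0, 0.0], [0.0, 0.0]])

# Strongly favoring the first view in every dimension recovers that view.
first_view = fuse_descriptors(views, [[50.0, 50.0], [-50.0, -50.0]])
```

Weighting each dimension separately lets a fusion model trust one viewpoint for some embedding dimensions and another viewpoint for others, which a single scalar weight per view cannot express.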

Why it matters

Online open-vocabulary mapping lets robots and AR/VR systems build a semantic understanding of open-ended environments without a predefined category set, and the loop-closure support makes it usable inside a full SLAM system.

Abstract

This paper presents an Open-Vocabulary Online 3D semantic mapping pipeline, that we denote by its acronym OVO. Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. These are computed from the viewpoints where they are observed by a novel CLIP merging method. Notably, our OVO has a significantly lower computational and memory footprint than offline baselines, while also showing better segmentation metrics than offline and online ones. Along with superior segmentation performance, we also show experimental results of our mapping contributions integrated with two different full SLAM backbones (Gaussian-SLAM and ORB-SLAM2), being the first ones using a neural network to merge CLIP descriptors and demonstrating end-to-end open-vocabulary online 3D mapping with loop closure.

Index terms

Semantic Scene Understanding, Mapping, SLAM