ICRA 2026

OMCL: Open-Vocabulary Monte Carlo Localization

Evgenii Kruzhkov, Raphael Memmesheimer, Sven Behnke

AI summary

OMCL enables robust, cross-modal robot localization and natural language-guided global initialization by storing open-vocabulary vision-language features in a 3D map and evaluating them via Monte Carlo Localization.

Open-vocabulary localization, Monte Carlo Localization, Vision-language models, Cross-modal mapping, Semantic scene understanding, Natural language initialization

Problem

Robot localization becomes difficult when the map and the robot's measurements come from different sensors. Prior methods are often tailored to specific environments and rely on closed-set semantics or fine-tuned features, which limits flexibility and generalization across diverse environments.

Approach

The authors extend Monte Carlo Localization with an Octree Language Map that stores CLIP-based vision-language features, using ray tracing to compute observation likelihoods and natural language prompts to initialize global localization.
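The measurement update described above can be illustrated with a minimal sketch. It assumes that ray tracing through the map has already produced one expected feature per camera ray for each particle, and it uses a Gaussian-shaped kernel on the mean cosine distance as the likelihood. The kernel, the sigma value, and all function names are assumptions made for illustration and are not taken from the paper or its code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def observation_likelihood(observed, expected, sigma=0.5):
    """Hypothetical likelihood of an observation for one particle.

    observed: image features, one per ray.
    expected: map features ray-traced from the particle pose, one per ray
              (None where a ray hits no mapped voxel).
    """
    dists = [1.0 - cosine(o, e)
             for o, e in zip(observed, expected) if e is not None]
    if not dists:
        return 1e-9  # no map support: assign a small floor value (assumption)
    mean_d = sum(dists) / len(dists)
    return math.exp(-(mean_d ** 2) / (2.0 * sigma ** 2))


def update_weights(weights, likelihoods):
    """Standard MCL step: multiply prior weights by likelihoods, then normalize."""
    posterior = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(posterior)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)  # fall back to uniform
    return [p / total for p in posterior]


# Two particles: the first sees the expected features, the second sees them swapped.
observed = [[1.0, 0.0], [0.0, 1.0]]
likelihoods = [
    observation_likelihood(observed, [[1.0, 0.0], [0.0, 1.0]]),
    observation_likelihood(observed, [[0.0, 1.0], [1.0, 0.0]]),
]
weights = update_weights([0.5, 0.5], likelihoods)
```

After the update, the particle whose ray-traced features agree with the image carries most of the weight, and a resampling step would then concentrate particles around it.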

Key results

State-of-the-art localization accuracy on Matterport3D and Replica datasets
Cross-modal mapping and localization with RGB-only cameras
Natural language prompt-based global initialization
Generalization to outdoor scenes (SemanticKITTI) and compatibility with pre-existing maps
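The prompt-based initialization listed above could look roughly like the following sketch. It assumes the map is available as a list of voxel centers with stored features and that the prompt has already been embedded into the same feature space (in practice by a vision-language text encoder; here it is a plain list). The similarity threshold, the sampling radius, and the function names are hypothetical choices for illustration and are not the authors' implementation.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def init_particles_from_prompt(voxels, text_feature, n_particles,
                               threshold=0.8, radius=1.0, rng=None):
    """Sample (x, y, yaw) particles near voxels that match a text embedding.

    voxels: list of ((x, y, z), feature) pairs from the map.
    text_feature: embedding of the prompt (assumed to be given).
    Returns an empty list when nothing matches, so the caller can fall
    back to uniform global initialization.
    """
    rng = rng or random.Random()
    matches = [center for center, feat in voxels
               if cosine(feat, text_feature) >= threshold]
    if not matches:
        return []
    particles = []
    for _ in range(n_particles):
        cx, cy, _cz = rng.choice(matches)
        r = radius * math.sqrt(rng.random())   # uniform over a disc
        phi = rng.uniform(0.0, 2.0 * math.pi)
        yaw = rng.uniform(-math.pi, math.pi)   # heading is unknown
        particles.append((cx + r * math.cos(phi), cy + r * math.sin(phi), yaw))
    return particles


# Toy map with two voxels; the prompt embedding matches only the first one.
toy_map = [((0.0, 0.0, 0.0), [1.0, 0.0]), ((10.0, 10.0, 0.0), [0.0, 1.0])]
particles = init_particles_from_prompt(toy_map, [1.0, 0.0], 100,
                                       rng=random.Random(0))
```

Restricting the initial particle set to regions near matching map elements shrinks the search space compared with uniform sampling over the whole map, which is the intuition behind initializing from a description of nearby objects.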

Why it matters

It enables heterogeneous robotic systems and non-expert users to achieve robust, language-guided localization across diverse environments without retraining or fine-tuning.

Abstract

Robust robot localization is an important prerequisite for navigation, but it becomes challenging when the map and robot measurements are obtained from different sensors. Prior methods are often tailored to specific environments, relying on closed-set semantics or fine-tuned features. In this work, we extend Monte Carlo Localization with vision-language features, allowing OMCL to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. These open-vocabulary features enable us to associate observations and map elements from different modalities, and to natively initialize global localization through natural language descriptions of nearby objects. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes. The code is accessible at https://github.com/AIS-Bonn/omcl.

Index terms

Localization, Semantic Scene Understanding, Mapping