OMCL: Open-Vocabulary Monte Carlo Localization
Evgenii Kruzhkov, Raphael Memmesheimer, Sven Behnke
AI summary
Problem
Traditional robot localization struggles with cross-modal sensor mismatches and relies on closed-set semantics or fine-tuned models, limiting flexibility and generalization across diverse environments.
Approach
The authors extend Monte Carlo Localization with an Octree Language Map that stores CLIP-based vision-language features, using ray tracing to compute observation likelihoods and natural language prompts to initialize global localization.
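The measurement update described above can be illustrated with a minimal sketch. It is not the authors' implementation: the ray tracing step is replaced by a precomputed list `expected_feats` (the map feature each particle's ray would hit), and the likelihood model `exp(cosine / temperature)`, the `temperature` value, and all function names are assumptions made for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def update_weights(weights, expected_feats, observed_feat, temperature=0.1):
    """Reweight particles by how well the map feature matches the observation.

    expected_feats[i] stands in for the feature a ray cast from particle i
    would retrieve from the map (assumed to be computed elsewhere).
    """
    scores = [
        w * math.exp(cosine(f, observed_feat) / temperature)
        for w, f in zip(weights, expected_feats)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def resample(particles, weights, rng):
    """Systematic (low-variance) resampling over normalized weights."""
    n = len(particles)
    step = 1.0 / n
    u = rng.random() * step
    out, i, cumulative = [], 0, weights[0]
    for k in range(n):
        target = u + k * step
        while target > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        out.append(particles[i])
    return out


# Toy usage: three particles, the first one "sees" the observed feature.
new_weights = update_weights(
    [1 / 3, 1 / 3, 1 / 3],
    [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    [1.0, 0.0],
)
survivors = resample(["p0", "p1", "p2"], new_weights, random.Random(0))
```

Particles whose expected map feature agrees with the observed feature gain weight and are more likely to survive resampling, which is the core idea of Monte Carlo Localization regardless of the specific likelihood model used.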
Key results
- State-of-the-art localization accuracy on Matterport3D and Replica datasets
- Cross-modal mapping and localization with RGB-only cameras
- Natural language prompt-based global initialization
- Generalization to outdoor scenes and pre-existing map compatibility
Why it matters
It enables heterogeneous robotic systems and non-expert users to achieve robust, language-guided localization across diverse environments without retraining or fine-tuning.
Abstract
Robust robot localization is an important prerequisite for navigation, but it becomes challenging when the map and robot measurements are obtained from different sensors. Prior methods are often tailored to specific environments, relying on closed-set semantics or fine-tuned features. In this work, we extend Monte Carlo Localization with vision-language features, allowing OMCL to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. These open-vocabulary features enable us to associate observations and map elements from different modalities, and to natively initialize global localization through natural language descriptions of nearby objects. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes. The code is accessible at https://github.com/AIS-Bonn/omcl.
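Initializing global localization from a language description can be sketched as follows. This is a hypothetical illustration, not code from the OMCL repository: the map is reduced to a list of 2D cell positions with toy feature vectors, the prompt embedding `text_feat` is assumed to come from a text encoder that shares the feature space of the map, and `init_particles_from_prompt`, `top_k`, and `radius` are names and parameters invented for this example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def init_particles_from_prompt(map_cells, text_feat, n, radius=1.0, top_k=2, rng=None):
    """Sample n poses (x, y, yaw) near the map cells most similar to the prompt.

    map_cells: list of ((x, y), feature) pairs standing in for the 3D map.
    text_feat: assumed embedding of the prompt, e.g. "near a red sofa".
    """
    rng = rng or random.Random(0)
    ranked = sorted(map_cells, key=lambda c: cosine(c[1], text_feat), reverse=True)
    anchors = ranked[:top_k]
    particles = []
    for _ in range(n):
        (x, y), _feat = rng.choice(anchors)
        r = radius * math.sqrt(rng.random())  # uniform over the disc area
        angle = rng.uniform(0.0, 2.0 * math.pi)
        yaw = rng.uniform(-math.pi, math.pi)  # heading is unknown, so sample it
        particles.append((x + r * math.cos(angle), y + r * math.sin(angle), yaw))
    return particles


# Toy usage: two cells resemble the prompt, one does not.
cells = [((0.0, 0.0), [1.0, 0.0]), ((5.0, 5.0), [0.0, 1.0]), ((9.0, 0.0), [0.9, 0.1])]
particles = init_particles_from_prompt(cells, [1.0, 0.0], n=50)
```

Concentrating the initial particle set around map regions that match the description reduces the search space compared with spreading particles uniformly over the whole map; the subsequent measurement updates then resolve which candidate region is correct.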