ICRA 2026

FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Helen Oleynikova, Stefan Leutenegger

AI summary

FindAnything enables real-time, memory-efficient open-vocabulary semantic mapping on resource-constrained robots by aggregating vision-language features at the object level.

open-vocabulary mapping, object-centric representation, vision-language models, MAV exploration, real-time SLAM, memory-efficient mapping

Problem

Real-time, open-vocabulary semantic understanding of large-scale unknown environments remains prohibitive in both computation and memory for resource-constrained robots such as micro aerial vehicles (MAVs), because high-dimensional vision-language feature embeddings are too costly to store in dense 3D maps.

Approach

The framework integrates a lightweight segmentation model and a vision-language encoder into a submap-based SLAM system, tracking and aggregating high-dimensional features per object segment to drastically cut memory usage while preserving semantic accuracy.
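
The per-segment aggregation described above can be illustrated with a minimal sketch. It assumes mean pooling followed by L2 normalization, which this summary does not confirm as the paper's actual rule; the function name `aggregate_by_segment` and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def aggregate_by_segment(features, segment_ids):
    """Mean-pool per-pixel features into one unit-length vector per segment.

    features:    list of per-pixel feature vectors (lists of floats)
    segment_ids: list of integer segment labels, one per pixel
    Mean pooling is an assumption made for illustration only.
    """
    if len(features) != len(segment_ids):
        raise ValueError("features and segment_ids must have the same length")

    sums = {}
    counts = defaultdict(int)
    for feat, seg in zip(features, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(feat)
        acc = sums[seg]
        for i, value in enumerate(feat):
            acc[i] += value
        counts[seg] += 1

    pooled = {}
    for seg, acc in sums.items():
        mean = [value / counts[seg] for value in acc]
        norm = sqrt(sum(value * value for value in mean))
        # Normalize so that later comparisons can use cosine similarity.
        pooled[seg] = [value / norm for value in mean] if norm > 0 else mean
    return pooled


# Toy input: five pixels with 2-D features, belonging to segments 7 and 9.
pixel_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
pixel_segments = [7, 7, 9, 9, 9]
segment_features = aggregate_by_segment(pixel_features, pixel_segments)
print(segment_features)
```

In this toy case five pixel vectors collapse into two segment vectors. Storing one vector per object rather than one per pixel or voxel is the kind of saving the approach relies on.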

Key results

Semantic accuracy on par with the state of the art on indoor and outdoor benchmarks
Up to 60% reduction in memory usage compared to voxel-level feature mapping
Substantially faster processing than state-of-the-art methods, with real-time performance that enables on-board MAV deployment
Successful natural language-guided autonomous exploration in simulated search-and-rescue scenarios

Why it matters

It empowers resource-constrained aerial robots to perform large-scale, real-time semantic mapping and natural language-guided exploration, critical for time-sensitive applications like search and rescue.

Abstract

Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixel-wise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
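
The mapping from open-vocabulary queries to 3D geometry mentioned in the abstract can be sketched as a nearest-neighbour lookup over per-object features. This is an illustration under stated assumptions, not the paper's implementation: the query is assumed to be already embedded by the text branch of a vision-language model, cosine similarity is an assumed scoring rule, and `object_map`, `query_objects`, and the voxel coordinates are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def query_objects(query_embedding, object_map, top_k=1):
    """Return (object_id, score, voxels) for the top_k best-matching objects."""
    scored = [
        (cosine(query_embedding, obj["feature"]), obj_id)
        for obj_id, obj in object_map.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [
        (obj_id, score, object_map[obj_id]["voxels"])
        for score, obj_id in scored[:top_k]
    ]


# Hypothetical map: each object stores one feature vector and its voxel indices.
object_map = {
    "obj_1": {"feature": [1.0, 0.0, 0.0], "voxels": [(0, 0, 0), (0, 0, 1)]},
    "obj_2": {"feature": [0.0, 1.0, 0.0], "voxels": [(5, 2, 0)]},
}

# Stand-in for a text embedding; a real system would obtain it from the encoder.
query_embedding = [0.1, 0.9, 0.0]
results = query_objects(query_embedding, object_map, top_k=1)
print(results[0][0], results[0][2])
```

Because each object carries a single feature vector, the cost of a query grows with the number of objects rather than the number of voxels.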

Index terms

Semantic Scene Understanding; Mapping; Aerial Systems: Perception and Autonomy