ICRA 2026

QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps

Matti Pekkanen, Francesco Verdoja, Ville Kyrki

AI summary

Augmenting a text query with semantic synonyms and antonyms, and training a classifier on their embeddings, improves the accuracy of matching the query against visual-language embeddings of the environment.

Visual-Language Models, Semantic Mapping, Open-Vocabulary Querying, Embedding Classification, Robotic Perception, Natural Language Heuristics

Problem

Querying visual-language model embeddings in robotic maps or images is difficult because existing methods rely on simple similarity thresholds or single negative queries, which fail to accurately capture the complex shape of the relevant embedding region.
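
To make the two baseline strategies concrete, here is a minimal sketch in plain Python. The 2-D vectors, the threshold value, and the function names are illustrative assumptions, not taken from the paper. The negative-query rule shown (a match is any embedding closer to the query than to the negative) is one common reading of "single negative query" and may differ from the baselines the authors evaluate.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def threshold_query(query_emb, map_embs, threshold):
    """Baseline 1: match every map embedding whose similarity to the query
    reaches a fixed threshold (the threshold is a hand-tuned assumption)."""
    return [cosine_similarity(query_emb, e) >= threshold for e in map_embs]

def negative_query(query_emb, negative_emb, map_embs):
    """Baseline 2: match every map embedding that is more similar to the
    query than to a single negative prompt (assumed decision rule)."""
    return [
        cosine_similarity(query_emb, e) > cosine_similarity(negative_emb, e)
        for e in map_embs
    ]

# Toy 2-D stand-ins for real VLM embeddings (hypothetical values).
query = [1.0, 0.0]
negative = [0.0, 1.0]
map_embs = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]]

threshold_hits = threshold_query(query, map_embs, threshold=0.9)
negative_hits = negative_query(query, negative, map_embs)
print(threshold_hits)  # [True, False, False]
print(negative_hits)   # [True, True, False]
```

The two rules disagree on the middle embedding, which illustrates why a single scalar cut or a single negative prompt can draw the boundary of the matching region in the wrong place.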

Approach

QuASH generates semantic synonyms and antonyms for a text query, embeds them, and trains an off-the-shelf classifier to non-linearly partition the model's latent space into matching and non-matching regions.
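
The pipeline described above can be sketched roughly as follows. This is not the authors' implementation: the toy 3-D embeddings, the hand-written synonym and antonym lists, the `embed` helper, and the use of a k-nearest-neighbour vote as the "off-the-shelf classifier" are all assumptions chosen so that the example runs with the Python standard library alone. In practice the embeddings would come from a VLM text encoder and the classifier from a standard library.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vectors standing in for VLM text-encoder output.
TOY_TEXT_EMBEDDINGS = {
    "chair": [1.0, 0.1, 0.0],
    "seat":  [0.9, 0.2, 0.1],
    "stool": [0.8, 0.3, 0.0],
    "table": [0.1, 1.0, 0.0],
    "floor": [0.0, 0.2, 1.0],
    "wall":  [0.1, 0.0, 0.9],
}

def embed(text):
    """Placeholder for a real text encoder (assumption)."""
    return normalize(TOY_TEXT_EMBEDDINGS[text])

def build_training_set(query, synonyms, antonyms):
    """Label the query and its synonyms 1, and the antonyms 0."""
    positives = [embed(t) for t in [query] + synonyms]
    negatives = [embed(t) for t in antonyms]
    return positives + negatives, [1] * len(positives) + [0] * len(negatives)

def knn_classify(x, X, y, k=3):
    """Majority vote among the k most similar training embeddings.
    A stand-in for whichever non-linear classifier is actually used."""
    ranked = sorted(zip(X, y), key=lambda pair: dot(x, pair[0]), reverse=True)
    votes = sum(label for _, label in ranked[:k])
    return votes * 2 > k

# Synonym and antonym lists are written by hand here (assumption);
# the source only says they are generated for the query.
X, y = build_training_set("chair", ["seat", "stool"], ["table", "floor", "wall"])

# Toy map embeddings to be partitioned into matches and non-matches.
map_embeddings = [
    normalize(v)
    for v in ([0.95, 0.15, 0.05], [0.05, 0.9, 0.1], [0.05, 0.1, 0.95])
]
matches = [knn_classify(e, X, y) for e in map_embeddings]
print(matches)  # [True, False, False]
```

Because the classifier is trained only on text embeddings of the augmented query, nothing in the sketch depends on a particular encoder or map representation, which is consistent with the encoder-agnostic claim in the summary.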

Key results

Formalization of querying latent-semantic maps as a classification problem
QuASH method for encoder-agnostic query augmentation and classification
Up to 227% F1-score improvement on robotic maps and 15% on image benchmarks
Demonstrated robustness across different VLM encoders with minimal training

Why it matters

Provides a practical, low-cost solution for accurate open-vocabulary scene understanding, enabling robots to reliably execute complex text-based commands in real-world environments.

Abstract

Embeddings from Visual-Language Models are increasingly utilized to represent semantics in robotic maps, offering an open-vocabulary scene understanding that surpasses traditional, limited labels. Embeddings enable on-demand querying by comparing embedded user text prompts to map embeddings via a similarity metric. The key challenge in performing the task indicated in a query is that the robot must determine the parts of the environment relevant to the query. This paper proposes a solution to this challenge. We leverage natural-language synonyms and antonyms associated with the query within the embedding space, applying heuristics to estimate the language space relevant to the query, and use that to train a classifier to partition the environment into matches and non-matches. We evaluate our method through extensive experiments, querying both maps and standard image benchmarks. The results demonstrate increased queryability of maps and images. Our querying technique is agnostic to the representation and encoder used, and requires limited training.

Index terms

Object Detection, Segmentation and Categorization; Semantic Scene Understanding; Deep Learning for Visual Perception