ICRA 2026

Open-Vocabulary Object-Goal Navigation by Generalizing Semantic Mapping with Dense CLIP

MENG WEI, Chenyang Wan, Tai Wang, Yuqiang Yang, Wenzhe Cai, Yilun Chen, Hanqing Wang, Jiangmiao Pang, Xihui Liu

AI summary

OVExp enables efficient open-vocabulary object navigation by decoupling visual perception from policy training through cross-modal transfer on semantic maps.

Open-vocabulary navigation, semantic mapping, CLIP, cross-modal transfer, embodied AI, goal prediction

Problem

Existing open-vocabulary navigation methods either rely on inefficient, high-cost LLM inference or suffer from poor generalization due to intensive end-to-end reinforcement learning training.

Approach

OVExp trains a goal prediction network using only text-based semantic maps and CLIP text embeddings, then transfers the learned policy to vision-based maps and goals at test time via a cross-modal transfer strategy.
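The idea behind the transfer can be illustrated with a minimal sketch. It is not the paper's implementation: the learned goal prediction network is replaced by a plain cosine-similarity lookup, and the CLIP embeddings are replaced by random unit vectors, with "visual" embeddings simulated as noisy copies of the text embeddings. All function names, the map size, and the noise level are assumptions made for illustration. The sketch only shows why a goal query written against a text-based map can still work when the map cells hold visual features from the same embedding space.

```python
import numpy as np

def unit(v):
    """Scale vectors along the last axis to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def build_map(shape, placements, dim):
    """Build an (H, W, dim) semantic map; cells without an object stay zero."""
    height, width = shape
    grid = np.zeros((height, width, dim))
    for (row, col), embedding in placements.items():
        grid[row, col] = embedding
    return grid

def predict_goal(grid, goal_embedding):
    """Return the (row, col) cell most similar to the goal embedding.

    Occupied cells hold unit vectors, so the dot product is a cosine
    similarity. This lookup is a stand-in (assumption) for the learned
    goal prediction network described in the text.
    """
    heatmap = grid @ goal_embedding
    flat_index = np.argmax(heatmap)
    return tuple(int(i) for i in np.unravel_index(flat_index, heatmap.shape))

rng = np.random.default_rng(0)
DIM = 64  # hypothetical embedding size, not taken from the paper

# Stand-ins for CLIP text embeddings (random unit vectors, an assumption).
text_emb = {name: unit(rng.normal(size=DIM)) for name in ("chair", "bed", "sofa")}
# Simulated visual embeddings: the text embedding plus small noise,
# mimicking a shared text-image embedding space.
visual_emb = {
    name: unit(emb + 0.05 * rng.normal(size=DIM)) for name, emb in text_emb.items()
}

locations = {"chair": (1, 2), "bed": (4, 4), "sofa": (6, 1)}
text_map = build_map((8, 8), {locations[n]: text_emb[n] for n in locations}, DIM)
visual_map = build_map((8, 8), {locations[n]: visual_emb[n] for n in locations}, DIM)

# "Training-time" setting: text map, text goal.
print(predict_goal(text_map, text_emb["bed"]))
# "Test-time" setting: vision-based map, same text goal.
print(predict_goal(visual_map, text_emb["bed"]))
```

In this toy setup both queries point at the cell where the bed was placed, because the simulated visual features stay close to their text counterparts. In practice the quality of the transfer depends on how well the real text and image embeddings align.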

Key results

Supervised text-only training outperforms RL-based models in efficiency and performance
OVExp surpasses training-free LLM-based methods with lower computational costs
Text-trained policies successfully adapt to vision-only inference at test time
Results are competitive with the state of the art on the HM3D-ObjectNav, HM3D-OVON, and InstanceImageNav benchmarks

Why it matters

It offers a scalable, training-efficient alternative to costly LLM or RL pipelines for embodied agents navigating unseen environments.

Abstract

Object-oriented embodied navigation tasks require agents to locate specific objects, defined either by category or by image, in unseen environments. While recent methods have made progress in extending closed-set models to open-vocabulary scenarios with foundation models, they typically rely on training-free large language models (LLMs) or on finetuning with end-to-end reinforcement learning (RL). However, these approaches suffer from inefficiency (e.g., the overhead and cost of LLM inference) and from limited generalization caused by intensive RL training. In this paper, we propose OVExp, a training-efficient framework for open-vocabulary exploration. We make the first effort to demonstrate the generalization capabilities of semantic map-based goal prediction networks using Dense CLIP models. A major challenge is that preserving both precise point-wise object locations and generalizable visual representations in the semantic map leads to unaffordable training costs. To address this, we design a Cross-Modal Transfer on Semantic Mapping strategy, which trains on text only and transfers to multi-modal semantic maps and goals at test time. Despite relying on text-based spatial layouts with limited objects, OVExp demonstrates robust generalization to unseen targets on established ObjectNav benchmarks.

Index terms

Vision-Based Navigation, RGB-D Perception, Deep Learning for Visual Perception