Open-Vocabulary Object-Goal Navigation by Generalizing Semantic Mapping with Dense CLIP
Meng Wei, Chenyang Wan, Tai Wang, Yuqiang Yang, Wenzhe Cai, Yilun Chen, Hanqing Wang, Jiangmiao Pang, Xihui Liu
AI summary
Problem
Existing open-vocabulary navigation methods either rely on inefficient, high-cost LLM inference or suffer from poor generalization due to intensive end-to-end reinforcement learning training.
Approach
OVExp trains a goal prediction network using only text-based semantic maps and CLIP text embeddings, then transfers the learned policy to vision-based maps and goals at test time via a cross-modal transfer strategy.
Key results
- Supervised text-only training outperforms RL-based models in efficiency and performance
- OVExp surpasses training-free LLM-based methods with lower computational costs
- Text-trained policies successfully adapt to vision-only inference at test time
- Achieves competitive state-of-the-art results across HM3D-ObjectNav, HM3D-OVON, and InstanceImageNav benchmarks
Why it matters
It offers a scalable, training-efficient alternative to costly LLM or RL pipelines for embodied agents navigating unseen environments.
Abstract
Object-oriented embodied navigation tasks require agents to locate specific objects, defined either by category or by images, in unseen environments. While recent methods have made progress in extending closed-set models to open-vocabulary scenarios with foundation models, they typically rely on training-free large language models (LLMs) or on finetuning with end-to-end reinforcement learning (RL). However, these methods face challenges in efficiency (e.g., the overhead and cost of LLM inference) and show limited generalization after intensive RL training. In this paper, we propose OVExp, a training-efficient framework for open-vocabulary exploration. We make the first effort to demonstrate the generalization capabilities of semantic map-based goal prediction networks using Dense CLIP models. A major challenge is that preserving both precise point-wise object locations and generalizable visual representations in the semantic map leads to unaffordable training costs. To address this, we design a Cross-Modal Transfer on Semantic Mapping strategy, which trains on text only and then transfers to multi-modal semantic mapping and goals at test time. Despite relying on text-based spatial layouts with limited objects, OVExp demonstrates robust generalization to unseen targets on established ObjectNav benchmarks.
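The cross-modal idea in the abstract can be illustrated with a toy sketch: if map cells and goals live in one shared embedding space, the same scoring interface accepts a text-built map or a vision-built map. Everything below is an assumption made for illustration and is not OVExp's implementation: random vectors stand in for CLIP embeddings, "visual" embeddings are simulated as text embeddings plus small noise, a cosine-similarity lookup stands in for the learned goal prediction network, and the names `build_map`, `score_map`, and `best_cell` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_map(height, width, dim, placements):
    """Build an (H, W, D) grid; `placements` maps (row, col) -> embedding.

    Cells without an object keep a zero vector (hypothetical convention).
    """
    grid = np.zeros((height, width, dim), dtype=np.float32)
    for (row, col), emb in placements.items():
        grid[row, col] = l2_normalize(np.asarray(emb, dtype=np.float32))
    return grid


def score_map(grid, goal):
    """Cosine similarity between every map cell and the goal embedding."""
    goal = l2_normalize(np.asarray(goal, dtype=np.float32))
    return grid @ goal  # shape (H, W)


def best_cell(grid, goal):
    """Return the (row, col) of the cell most similar to the goal."""
    scores = score_map(grid, goal)
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(i) for i in idx)


rng = np.random.default_rng(0)
dim = 16

# Stand-ins for text embeddings of category names (assumption: random vectors).
text_emb = {name: rng.normal(size=dim) for name in ("chair", "bed", "sofa")}
# Stand-ins for visual embeddings: text embedding plus small noise, mimicking
# an aligned vision-language space (assumption, not a measured property).
vis_emb = {name: v + 0.1 * rng.normal(size=dim) for name, v in text_emb.items()}

# "Training-time" map built from text embeddings only.
text_map = build_map(8, 8, dim, {(1, 2): text_emb["chair"], (5, 6): text_emb["bed"]})
# "Test-time" map built from the simulated visual embeddings.
vision_map = build_map(8, 8, dim, {(1, 2): vis_emb["chair"], (5, 6): vis_emb["bed"]})

print(best_cell(text_map, text_emb["bed"]))    # text goal on text map
print(best_cell(vision_map, vis_emb["bed"]))   # visual goal on vision map
print(best_cell(vision_map, text_emb["bed"]))  # text goal on vision map
```

In this toy setup all three queries resolve to the cell holding the bed, because the simulated modalities are nearly aligned. Real CLIP text and image features are not this closely aligned, and a learned goal prediction network would also reason about unexplored regions rather than look up observed cells.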