CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation
Taeyun Kim, Alvin Jinsung Choi, Dasol Hong, Hyun Myung
AI summary
Problem
Existing zero-shot navigation methods treat all contextual cues equally, ignoring that some targets are better predicted by room types while others are better predicted by co-located objects, leading to inefficient exploration.
Approach
CLUE uses offline LLM queries to extract commonsense knowledge and dynamically balances global room cues with local object cues in a unified semantic map, weighted by the target's contextual ambiguity to guide exploration.
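The weighting idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the ambiguity measure or the fusion rule, so normalized entropy and a linear blend are assumptions chosen for illustration, and the names `room_ambiguity` and `fuse_value_maps` are hypothetical.

```python
import math


def room_ambiguity(room_probs):
    """Return an ambiguity score in [0, 1] for a target object.

    ASSUMPTION: ambiguity is modeled as the normalized Shannon entropy of the
    target's room-type distribution (e.g. precomputed offline from an LLM).
    The paper may define it differently.
    """
    total = sum(room_probs.values())
    if len(room_probs) < 2 or total <= 0:
        return 0.0
    probs = [p / total for p in room_probs.values() if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Dividing by log(number of room types) maps the entropy into [0, 1].
    return min(1.0, max(0.0, entropy / math.log(len(room_probs))))


def fuse_value_maps(room_map, object_map, ambiguity):
    """Blend two equally sized 2D value maps cell by cell.

    ASSUMPTION: a linear blend. Low ambiguity favors room cues,
    high ambiguity favors object cues.
    """
    return [
        [(1.0 - ambiguity) * r + ambiguity * o for r, o in zip(room_row, object_row)]
        for room_row, object_row in zip(room_map, object_map)
    ]


# Hypothetical example: a toilet is strongly tied to one room type,
# so its ambiguity is low and room cues dominate the fused map.
toilet_rooms = {"bathroom": 0.97, "bedroom": 0.01, "kitchen": 0.01, "living room": 0.01}
weight = room_ambiguity(toilet_rooms)
fused = fuse_value_maps([[1.0, 0.0]], [[0.0, 1.0]], weight)
```

Under these assumptions, a target spread evenly across room types gets an ambiguity near 1, and the fused map then follows the object cues almost entirely.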
Key results
- Achieves state-of-the-art success rate and SPL on the HM3D simulation benchmark
- Successfully deployed on a Clearpath Jackal robot in real-world environments
- Eliminates online LLM latency by precomputing commonsense knowledge offline
- Constructs a unified semantic map that adaptively prioritizes global or local cues based on target ambiguity
Why it matters
Enables efficient, real-time zero-shot navigation for robots without costly online reasoning or task-specific training.
Abstract
Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target’s association with room types using an LLM, the agent prioritizes room cues for objects with predictable room associations and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target’s ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.
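For readers unfamiliar with the two metrics, the sketch below computes SR and SPL using the standard definitions from the embodied-navigation literature: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator times the shortest-path length divided by the larger of the path actually taken and the shortest path. The episode dictionary layout and its keys (`success`, `shortest`, `taken`) are hypothetical, and the numbers are made up for illustration, not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0.
    """
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Illustrative episodes only (path lengths in meters).
episodes = [
    {"success": True, "shortest": 5.0, "taken": 10.0},   # reached, but took twice the optimal path
    {"success": False, "shortest": 4.0, "taken": 8.0},   # failed
    {"success": True, "shortest": 3.0, "taken": 3.0},    # reached along the optimal path
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

SPL can never exceed SR, since each successful episode contributes at most 1; the gap between the two reflects how much longer than optimal the successful paths were.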