ICRA 2026

CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

Taeyun Kim, Alvin Jinsung Choi, Dasol Hong, Hyun Myung

AI summary

Adaptively weighting room-level and object-level contextual cues based on target ambiguity significantly improves zero-shot object-goal navigation efficiency and accuracy.

Zero-shot navigation, contextual cues, semantic value map, LLM reasoning, embodied AI, object-goal navigation

Problem

Existing zero-shot navigation methods treat all contextual cues equally, ignoring that some targets are better predicted by room types while others are better predicted by co-located objects, leading to inefficient exploration.

Approach

CLUE uses offline LLM queries to extract commonsense knowledge and dynamically balances global room cues with local object cues in a unified semantic map, weighted by the target's contextual ambiguity to guide exploration.
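The weighting idea can be illustrated with a minimal sketch. The listing does not give CLUE's actual formulas, so everything here is an assumption: the `ROOM_PRIOR` table stands in for precomputed offline LLM output (its numbers are made up for illustration), ambiguity is measured as normalized entropy of the room distribution, and the two cue maps are blended linearly.

```python
import math

# Hypothetical offline table: P(room | target). Illustrative values only,
# not taken from the paper or from any real LLM query.
ROOM_PRIOR = {
    "toilet": {"bathroom": 0.94, "bedroom": 0.02, "kitchen": 0.02, "living room": 0.02},
    "potted plant": {"bathroom": 0.20, "bedroom": 0.25, "kitchen": 0.25, "living room": 0.30},
}


def ambiguity(room_probs):
    """Normalized Shannon entropy in [0, 1] (assumed ambiguity measure).

    Near 0: one room dominates, so room cues are informative.
    Near 1: the target could be in any room, so room cues are weak.
    """
    if len(room_probs) < 2:
        return 0.0
    total = sum(room_probs.values())
    probs = [p / total for p in room_probs.values() if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(room_probs))


def fuse(room_map, object_map, alpha):
    """Blend two same-shaped 2D score grids cell by cell.

    alpha = 0 uses only room cues; alpha = 1 uses only object cues.
    """
    return [
        [(1 - alpha) * r + alpha * o for r, o in zip(room_row, object_row)]
        for room_row, object_row in zip(room_map, object_map)
    ]


if __name__ == "__main__":
    room_map = [[0.9, 0.1], [0.2, 0.0]]
    object_map = [[0.1, 0.8], [0.0, 0.3]]
    for target, prior in ROOM_PRIOR.items():
        alpha = ambiguity(prior)
        print(target, round(alpha, 3), fuse(room_map, object_map, alpha))
```

Under these assumptions, a target such as "toilet" gets a low weight on object cues, while "potted plant" gets a high one. Because the table is built ahead of time, no LLM call is needed during navigation.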

Key results

Achieves state-of-the-art success rate and SPL on the HM3D simulation benchmark
Successfully deployed on a Clearpath Jackal robot in real-world environments
Eliminates online LLM latency by precomputing commonsense knowledge offline
Constructs a unified semantic map that adaptively prioritizes global or local cues based on target ambiguity

Why it matters

Enables efficient, real-time zero-shot navigation for robots without costly online reasoning or task-specific training.

Abstract

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target’s association with room types using an LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target’s ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.
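For readers unfamiliar with the two reported metrics, the following sketch computes them using their standard definitions from the embodied navigation literature: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator times the ratio of the shortest path length to the longer of the shortest and actual path lengths. The episode dictionaries and their field names are hypothetical, and the numbers are illustrative rather than results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by path length.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0. Assumes shortest > 0 for every episode.
    """
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


if __name__ == "__main__":
    # Hypothetical episodes; path lengths in meters.
    episodes = [
        {"success": True, "shortest": 5.0, "taken": 10.0},   # reached, twice the optimal length
        {"success": True, "shortest": 4.0, "taken": 4.0},    # reached along the optimal path
        {"success": False, "shortest": 6.0, "taken": 12.0},  # failed
    ]
    print(success_rate(episodes))
    print(spl(episodes))
```

SPL can never exceed SR, since each successful episode contributes at most 1. An agent that succeeds often but wanders on the way scores well on SR and poorly on SPL.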

Index terms

Autonomous Agents, Semantic Scene Understanding, AI-Enabled Robotics