ICRA 2026

STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation

Haokun Zhu, Zongtai Li, Zhixuan Liu, Wenshan Wang, Ji Zhang, Jonathan Francis, Jean Oh

AI summary

STRIVE achieves state-of-the-art zero-shot object navigation by combining a multi-layer environment representation with a two-stage VLM-guided policy that drastically reduces redundant exploration.

Object Navigation, Vision-Language Models, Multi-layer Representation, Two-stage Policy, Zero-shot Navigation, Robotics

Problem

Existing VLM-guided navigation methods lack structured environmental understanding and over-rely on step-by-step VLM queries, causing inefficient exploration, frequent backtracking, and poor spatial reasoning in large environments.

Approach

The framework incrementally constructs a three-layer graph of objects, viewpoints, and rooms, then uses a two-stage policy that delegates high-level room planning to VLM reasoning while handling low-level exploration with efficient frontier-based navigation and targeted VLM verification.
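As a rough illustration of the layered representation described above, the following minimal sketch links rooms to viewpoints and viewpoints to objects, and grows the graph one observation at a time. All class and method names (`SceneGraph`, `add_observation`, `room_summary`) are hypothetical and not taken from the paper, and the room labels and detections are assumed to come from upstream perception that is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str        # detected category, e.g. "fridge"
    position: tuple   # (x, y) in the map frame


@dataclass
class Viewpoint:
    position: tuple
    objects: list = field(default_factory=list)  # ObjectNodes seen from here


@dataclass
class RoomNode:
    name: str
    viewpoints: list = field(default_factory=list)


class SceneGraph:
    """Hypothetical three-layer graph: rooms -> viewpoints -> objects."""

    def __init__(self):
        self.rooms = {}

    def add_observation(self, room_name, position, detections):
        """Incrementally add one viewpoint and its detections to a room."""
        room = self.rooms.setdefault(room_name, RoomNode(room_name))
        objects = [ObjectNode(label, pos) for label, pos in detections]
        viewpoint = Viewpoint(position, objects)
        room.viewpoints.append(viewpoint)
        return viewpoint

    def find_object(self, label):
        """Return (room name, ObjectNode) for the first match, or None."""
        for room in self.rooms.values():
            for viewpoint in room.viewpoints:
                for obj in viewpoint.objects:
                    if obj.label == label:
                        return room.name, obj
        return None

    def room_summary(self):
        """Unique object labels per room, e.g. as compact text for a VLM prompt."""
        return {
            name: sorted({o.label for v in room.viewpoints for o in v.objects})
            for name, room in self.rooms.items()
        }


graph = SceneGraph()
graph.add_observation("kitchen", (1.0, 2.0), [("fridge", (1.5, 2.5)), ("sink", (0.5, 2.2))])
graph.add_observation("kitchen", (2.0, 2.0), [("fridge", (1.5, 2.5))])
graph.add_observation("bedroom", (6.0, 1.0), [("bed", (6.5, 1.5))])
print(graph.room_summary())
# {'kitchen': ['fridge', 'sink'], 'bedroom': ['bed']}
```

This sketch stores the same fridge twice when it is seen from two viewpoints and only deduplicates labels in the summary; a real system would presumably merge repeated detections into a single object node.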

Key results

State-of-the-art success rate and navigation efficiency on four simulated benchmarks (HM3D v1 and v2, RoboTHOR, and MP3D)
13.1% improvement in success rate and 6.2% in SPL over prior methods
Robust real-world deployment across 120 episodes in 10 diverse indoor environments
Effective zero-shot navigation through structured room-level VLM reasoning and early-stop exploration
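The two metrics named in the results can be made concrete with a short sketch. It assumes the standard definition of SPL (Success weighted by Path Length), in which each successful episode contributes the ratio of the shortest-path length to the longer of the shortest and the actually travelled path; this page does not spell out the formula. The episode values are invented for illustration and are not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the goal object was reached."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            # Failed episodes contribute 0; detours shrink the ratio below 1.
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Illustrative episodes only (path lengths in metres), not data from the paper.
episodes = [
    {"success": True, "shortest": 5.0, "taken": 10.0},
    {"success": True, "shortest": 4.0, "taken": 4.0},
    {"success": False, "shortest": 6.0, "taken": 12.0},
    {"success": True, "shortest": 3.0, "taken": 6.0},
]

print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.5
```

In this toy data, three of four episodes succeed, but two of them take twice the shortest path, so SPL is lower than the success rate. This gap is what redundant exploration and backtracking cost under the metric.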

Why it matters

Enables reliable, efficient zero-shot object navigation for robots in complex real-world settings without requiring task-specific fine-tuning.

Abstract

Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation presents two key challenges: effectively parsing and structuring complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g., querying at every step) can easily lead to unnecessary backtracking and reduced navigation efficiency, especially in large continuous environments. To address these challenges, we propose a novel framework that incrementally constructs a multi-layer environment representation consisting of viewpoints, object nodes, and room nodes during navigation. Viewpoints and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this structured representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently and reliably locate a goal object. We evaluated our approach on four simulated benchmarks (HM3D v1 & v2, RoboTHOR, and MP3D), and achieved state-of-the-art performance on both the success rate (SR% ↑13.1%) and navigation efficiency (SPL% ↑6.2%). We further validate our method on a real robot platform, demonstrating strong robustness across 120 episodes in 10 different indoor environments. Project page is available at: https://zwandering.github.io/STRIVE.github.io/.
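The following toy sketch shows one way a two-stage policy of this kind could limit VLM queries: one query per room-level decision, plain nearest-frontier selection inside the room, and a verification query only when a detector reports a candidate. It is not the authors' implementation. `choose_room` and `verify_target` are hand-written stand-ins for VLM calls, and the 2D "world" of frontiers and detections is an assumption made purely for illustration.

```python
import math


def choose_room(goal, candidates):
    """Stand-in for high-level VLM reasoning: a fixed prior table, NOT a real model."""
    prior = {"bed": "bedroom", "fridge": "kitchen"}
    preferred = prior.get(goal)
    return preferred if preferred in candidates else candidates[0]


def verify_target(goal, detection):
    """Stand-in for targeted VLM verification of a candidate detection."""
    return detection == goal


def nearest_frontier(position, frontiers):
    """Low-level step: pick the closest unexplored frontier, with no VLM involved."""
    return min(frontiers, key=lambda f: math.dist(position, f))


def navigate(goal, rooms, start):
    """rooms maps name -> {"frontiers": [(x, y), ...], "detections": {(x, y): label}}."""
    position, steps, vlm_calls = start, 0, 0
    unexplored = dict(rooms)  # shallow copy so the caller's dict is not modified
    while unexplored:
        # Stage 1: a single high-level query decides which room to explore next.
        room_name = choose_room(goal, list(unexplored))
        vlm_calls += 1
        room = unexplored.pop(room_name)
        frontiers = list(room["frontiers"])
        # Stage 2: frontier-based exploration inside the chosen room.
        while frontiers:
            position = nearest_frontier(position, frontiers)
            frontiers.remove(position)
            steps += 1
            label = room["detections"].get(position)
            if label == goal:  # a cheap detector match triggers the costly check
                vlm_calls += 1
                if verify_target(goal, label):
                    return {"found": True, "room": room_name,
                            "steps": steps, "vlm_calls": vlm_calls}
    return {"found": False, "room": None, "steps": steps, "vlm_calls": vlm_calls}


rooms = {
    "kitchen": {"frontiers": [(1.0, 0.0), (2.0, 0.0)],
                "detections": {(2.0, 0.0): "fridge"}},
    "bedroom": {"frontiers": [(5.0, 0.0), (6.0, 0.0), (7.0, 0.0)],
                "detections": {(6.0, 0.0): "bed"}},
}
result = navigate("bed", rooms, start=(0.0, 0.0))
print(result)
# {'found': True, 'room': 'bedroom', 'steps': 2, 'vlm_calls': 2}
```

In this toy run the agent skips the kitchen entirely and stops as soon as the target is verified, leaving the last bedroom frontier unvisited. The stand-in VLM is called twice, not once per step.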

Index terms

Semantic Scene Understanding, AI-Enabled Robotics, AI-Based Methods