ICRA 2026

VLION: Vision-Language Guided Interactive Object Navigation with Mobile Manipulation

Renming Liu, Hao Ren, Lanxiang Zheng, Yiming Zeng, Ying Wu, Hui Cheng∗

AI summary

VLION enables robots to efficiently locate and access occluded objects in unknown environments by adaptively fusing vision-language scene and object cues for interactive navigation.

Interactive Object Navigation, Vision-Language Models, Mobile Manipulation, Zero-Shot Navigation, Semantic Value Mapping, Embodied AI

Problem

Traditional object navigation assumes visible targets and unobstructed paths, so it fails when objects are hidden behind doors or inside containers. Existing methods lack the long-horizon reasoning and active interaction capabilities needed to reveal and access these occluded targets.

Approach

VLION leverages a vision-language model to generate scene-level and object-level semantic value maps from egocentric RGB-D data, which are adaptively fused based on spatial entropy to guide target selection. A hybrid A* planner and star-convex manipulation regions then ensure safe navigation and interaction with occluded objects.
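The summary does not give the fusion formula, so the following is only a minimal sketch of one way entropy-based fusion of two value maps could work. It assumes that a spatially peaked (low-entropy) map is more informative than a diffuse one and weights each map by one minus its normalized Shannon entropy. The function names `spatial_entropy` and `fuse_value_maps`, the weighting rule, and the `eps` parameter are illustrative assumptions, not the authors' method.

```python
import math


def spatial_entropy(value_map):
    """Normalized Shannon entropy in [0, 1] of a non-negative 2D value map.

    The map is treated as an unnormalized distribution over grid cells.
    """
    cells = [v for row in value_map for v in row]
    total = sum(cells)
    n = len(cells)
    if total <= 0 or n < 2:
        return 1.0  # no usable signal: treat as maximally uncertain
    h = 0.0
    for v in cells:
        if v > 0:
            p = v / total
            h -= p * math.log(p)
    return h / math.log(n)


def fuse_value_maps(scene_map, object_map, eps=1e-9):
    """Fuse two equally sized value maps with entropy-derived weights.

    ASSUMPTION (not from the paper): confidence = 1 - normalized entropy,
    and the fused map is the confidence-weighted average of the two maps.
    """
    c_scene = max(0.0, 1.0 - spatial_entropy(scene_map))
    c_obj = max(0.0, 1.0 - spatial_entropy(object_map))
    w_scene = (c_scene + eps) / (c_scene + c_obj + 2 * eps)
    w_obj = 1.0 - w_scene
    fused = [
        [w_scene * s + w_obj * o for s, o in zip(s_row, o_row)]
        for s_row, o_row in zip(scene_map, object_map)
    ]
    return fused, w_scene, w_obj


# Hypothetical inputs: a diffuse scene-level map and a peaked object-level map.
scene = [[1.0, 1.0], [1.0, 1.0]]
obj = [[0.0, 0.0], [0.0, 1.0]]
fused, w_scene, w_obj = fuse_value_maps(scene, obj)
```

With these inputs the uniform scene map carries almost no weight and the fused map follows the object-level cue. When both maps are equally diffuse, the `eps` term makes the weights fall back to an even split.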

Key results

Unified framework for semantic reasoning and geometric planning in ION
Adaptive value fusion strategy balancing scene and object cues via spatial entropy
State-of-the-art performance in iGibson simulations for zero-shot interactive navigation
Successful real-world deployment demonstrating effective zero-shot transfer and onboard decision-making

Why it matters

Enables mobile manipulators to actively explore and interact with complex, occluded real-world environments, advancing the practical deployment of embodied AI.

Abstract

Object navigation for mobile robots typically assumes that targets are visible and paths are unobstructed. However, real-world scenarios often involve occluded targets like objects hidden behind doors or inside containers. Such scenarios require interactive navigation and manipulation by mobile manipulators. To address this challenge, we propose VLION, a vision-language model-guided framework for interactive object navigation (ION) that enables robots to locate and access such targets efficiently. VLION constructs a probabilistic occupancy map and dynamically identifies frontiers for efficient exploration. It leverages vision-language models (VLMs) to perform joint semantic reasoning at both the scene and object levels, generating Scene-Target and Object-Target Value Maps from egocentric observations. These maps are adaptively fused based on spatial entropy to guide target selection and dynamically balance navigation and manipulation priorities for multi-step decision-making. A hybrid A* planner ensures safe and feasible navigation, while star-convex manipulation regions enable interaction with objects. Extensive experiments in iGibson simulations and real-world environments demonstrate the effectiveness of VLION in zero-shot transfer and on-board deployment, advancing the state of the art in ION.
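The frontier identification step mentioned in the abstract can be illustrated with the classic definition of a frontier: a free cell that borders unknown space. The sketch below is not taken from the paper. The function name `find_frontiers`, the probability thresholds, and the 4-connected neighborhood are assumptions chosen for illustration.

```python
def find_frontiers(occ, free_thresh=0.35, occ_thresh=0.65):
    """Return (row, col) cells that are free and touch an unknown cell.

    `occ` is a 2D list of occupancy probabilities in [0, 1].
    ASSUMPTION: p < free_thresh is free, p > occ_thresh is occupied,
    and anything in between is still unknown.
    """
    rows, cols = len(occ), len(occ[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if occ[r][c] >= free_thresh:
                continue  # only free cells can be frontiers
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                if free_thresh <= occ[nr][nc] <= occ_thresh:
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers


# Hypothetical 3x3 map: 0.1 = free, 0.9 = occupied, 0.5 = unknown.
grid = [
    [0.1, 0.1, 0.5],
    [0.1, 0.9, 0.5],
    [0.1, 0.1, 0.1],
]
print(find_frontiers(grid))  # [(0, 1), (2, 2)]
```

In a full system the frontier cells would typically be clustered and scored, for example with a fused semantic value map, before one is chosen as the next exploration goal.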

Index terms

Vision-Based Navigation, Mobile Manipulation, Semantic Scene Understanding