VLION: Vision-Language Guided Interactive Object Navigation with Mobile Manipulation
Renming Liu, Hao Ren, Lanxiang Zheng, Yiming Zeng, Ying Wu, Hui Cheng∗
AI summary
Problem
Traditional object navigation assumes visible targets and unobstructed paths, so it fails when objects are hidden behind doors or inside containers. Existing methods lack the long-horizon reasoning and active interaction capabilities needed to reveal and access these occluded targets.
Approach
VLION uses a vision-language model to generate scene-level and object-level semantic value maps from egocentric RGB-D data, then fuses them adaptively based on spatial entropy to guide target selection. A hybrid A* planner ensures safe, feasible navigation, while star-convex manipulation regions enable interaction with objects.
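The entropy-based fusion can be illustrated with a minimal sketch. The summary does not state the actual fusion rule, so everything here is an assumption for illustration: the function names `spatial_entropy` and `fuse_value_maps`, the use of Shannon entropy over a normalized map, and the choice to weight each map by `exp(-entropy)` so that the more peaked (lower-entropy) map dominates.

```python
import math

def spatial_entropy(value_map):
    """Shannon entropy of a 2D value map, treated as a distribution over cells.

    Assumption: negative values are clipped to zero and the map is
    normalized to sum to one before the entropy is computed.
    """
    flat = [max(v, 0.0) for row in value_map for v in row]
    if not flat:
        return 0.0
    total = sum(flat)
    if total <= 0.0:
        # An all-zero map carries no preference: treat it as uniform.
        return math.log(len(flat))
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0.0)

def fuse_value_maps(scene_map, object_map):
    """Blend two equally sized maps, favoring the lower-entropy one.

    Hypothetical weighting: each map gets weight exp(-entropy), and the
    weights are normalized to sum to one.
    """
    w_scene = math.exp(-spatial_entropy(scene_map))
    w_object = math.exp(-spatial_entropy(object_map))
    alpha = w_scene / (w_scene + w_object)  # share given to the scene map
    fused = [
        [alpha * s + (1.0 - alpha) * o for s, o in zip(s_row, o_row)]
        for s_row, o_row in zip(scene_map, object_map)
    ]
    return fused, alpha

# A diffuse scene map and a sharply peaked object map.
scene = [[0.25, 0.25], [0.25, 0.25]]
obj = [[1.0, 0.0], [0.0, 0.0]]
fused, alpha = fuse_value_maps(scene, obj)
```

In this example the scene map is uniform and the object map is concentrated in one cell, so the object cue receives most of the weight. Other weightings (for example, a softmax with a temperature) would fit the same description equally well.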
Key results
- Unified framework for semantic reasoning and geometric planning in interactive object navigation (ION)
- Adaptive value fusion strategy balancing scene and object cues via spatial entropy
- State-of-the-art performance in iGibson simulations for zero-shot interactive navigation
- Successful real-world deployment demonstrating effective zero-shot transfer and onboard decision-making
Why it matters
Enables mobile manipulators to actively explore and interact with complex, occluded real-world environments, advancing the practical deployment of embodied AI.
Abstract
Object navigation for mobile robots typically assumes that targets are visible and paths are unobstructed. However, real-world scenarios often involve occluded targets, such as objects hidden behind doors or inside containers. Such scenarios require interactive navigation and manipulation by mobile manipulators. To address this challenge, we propose VLION, a vision-language model-guided framework for interactive object navigation (ION) that enables robots to locate and access such targets efficiently. VLION constructs a probabilistic occupancy map and dynamically identifies frontiers for efficient exploration. It leverages vision-language models (VLMs) to perform joint semantic reasoning at both the scene and object levels, generating Scene-Target and Object-Target Value Maps from egocentric observations. These maps are adaptively fused based on spatial entropy to guide target selection and dynamically balance navigation and manipulation priorities for multi-step decision-making. A hybrid A* planner ensures safe and feasible navigation, while star-convex manipulation regions enable interaction with objects. Extensive experiments in iGibson simulations and real-world environments demonstrate the effectiveness of VLION in zero-shot transfer and on-board deployment, advancing the state of the art in ION.
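As a rough illustration of frontier identification on a probabilistic occupancy map, the sketch below marks free cells that border unknown space. The abstract does not specify how VLION detects frontiers, so the function name `find_frontiers`, the thresholds, the 0.5 "unknown" prior, and the 4-connected neighborhood are all assumptions made for this example.

```python
def find_frontiers(occupancy, free_below=0.3, unknown_tol=0.1):
    """Return (row, col) cells that are free and touch an unknown cell.

    Assumptions: each cell stores an occupancy probability in [0, 1];
    a cell is free if p < free_below, and unknown if p lies within
    unknown_tol of the 0.5 prior. Neighbors are 4-connected.
    """
    rows = len(occupancy)
    cols = len(occupancy[0]) if rows else 0

    def is_free(p):
        return p < free_below

    def is_unknown(p):
        return abs(p - 0.5) < unknown_tol

    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if not is_free(occupancy[r][c]):
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and is_unknown(occupancy[nr][nc]):
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers

# 0.1 = observed free, 0.9 = observed occupied, 0.5 = not yet observed.
grid = [
    [0.1, 0.1, 0.5],
    [0.1, 0.9, 0.5],
    [0.1, 0.1, 0.1],
]
frontier_cells = find_frontiers(grid)
```

Here the two free cells adjacent to the unobserved right-hand column are returned as frontiers. A full system would typically cluster such cells and score each cluster, for instance with the fused value map, before choosing where to explore.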