Fast-SmartWay: Panoramic-Free End-To-End Zero-Shot Vision-And-Language Navigation
Xiangyu Shi, Zerui Li, Yanyuan Qiao, Qi Wu
AI summary
Problem
Current zero-shot VLN-CE methods depend on slow 360-degree panoramic scans and two-stage waypoint predictors, which introduce high latency, limit compatibility with compact robots, and often misalign visual cues with language instructions.
Approach
The framework feeds three frontal RGB-D images and language instructions directly into a multimodal large language model to predict actions in a single step, augmented by an uncertainty-aware reasoning module that handles disambiguation and bidirectional planning.
Key results
- Eliminates panoramic scans and waypoint predictors for direct action prediction
- Integrates uncertainty-aware reasoning to avoid dead-ends and improve planning consistency
- Achieves competitive or superior performance against panoramic-view baselines
- Demonstrates significantly reduced per-step latency in simulated and real-robot tests
Why it matters
It provides a practical, low-latency navigation approach for real-world robots constrained to front-facing cameras, which makes zero-shot embodied navigation easier to deploy on such platforms.
Abstract
Recent advances in Vision-and-Language Navigation in Continuous Environments (VLN-CE) have leveraged multimodal large language models (MLLMs) to achieve zero-shot navigation. However, existing methods often rely on panoramic observations and two-stage pipelines involving waypoint predictors, which introduce significant latency and limit real-world applicability. In this work, we propose Fast-SmartWay, an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions. To enhance decision robustness, we introduce an Uncertainty-Aware Reasoning module that integrates (i) a Disambiguation Module for avoiding local optima, and (ii) a Future-Past Bidirectional Reasoning mechanism for globally coherent planning. Experiments on both simulated and real-robot environments demonstrate that our method significantly reduces per-step latency while achieving competitive or superior performance compared to panoramic-view baselines. These results demonstrate the practicality and effectiveness of Fast-SmartWay for real-world zero-shot embodied navigation.
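The single-step decision described in the abstract (three frontal RGB-D views plus an instruction go to an MLLM, which returns an action directly, with no waypoint predictor in between) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `FrontalView` and `Action` types, the view labels, the prompt wording, the `TURN <deg> MOVE <m>` / `STOP` reply format, and the `mllm` callable standing in for a real model client. The Uncertainty-Aware Reasoning module is not modeled.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FrontalView:
    """One frontal RGB-D view (hypothetical container; payloads are opaque here)."""
    name: str       # assumed label such as "left", "front", "right"
    rgb: object
    depth: object


@dataclass
class Action:
    """Assumed action format: relative turn, forward distance, or stop."""
    heading_deg: float
    distance_m: float
    stop: bool = False


def build_prompt(instruction: str, views: Sequence[FrontalView], history: List[str]) -> str:
    """Assemble a text prompt; the wording is illustrative only."""
    past = "; ".join(history) if history else "none"
    labels = ", ".join(v.name for v in views)
    return (
        f"Instruction: {instruction}\n"
        f"Views provided: {labels}\n"
        f"Previous actions: {past}\n"
        "Reply with 'TURN <degrees> MOVE <meters>' or 'STOP'."
    )


def parse_action(reply: str) -> Action:
    """Parse the assumed reply format into an Action."""
    if reply.strip().upper().startswith("STOP"):
        return Action(0.0, 0.0, stop=True)
    match = re.search(r"TURN\s+(-?\d+(?:\.\d+)?)\s+MOVE\s+(\d+(?:\.\d+)?)", reply, re.I)
    if match is None:
        raise ValueError(f"unparseable reply: {reply!r}")
    return Action(float(match.group(1)), float(match.group(2)))


def step(
    instruction: str,
    views: Sequence[FrontalView],
    history: List[str],
    mllm: Callable[[str, Sequence[FrontalView]], str],
) -> Action:
    """One navigation step: a single model call maps observations to an action."""
    if len(views) != 3:
        raise ValueError("expected exactly three frontal views")
    reply = mllm(build_prompt(instruction, views, history), views)
    action = parse_action(reply)
    history.append(reply)  # keep past decisions available to later steps
    return action
```

In use, `mllm` would wrap whichever model client is available; for a dry run, a stub such as `lambda prompt, views: "TURN -30 MOVE 0.5"` is enough to exercise the loop. Collapsing the decision into one call per step is the point of the sketch: there is no separate stage that first proposes waypoints for the model to choose among.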