ICRA 2026

Fast-SmartWay: Panoramic-Free End-To-End Zero-Shot Vision-And-Language Navigation

Xiangyu Shi, Zerui Li, Yanyuan Qiao, Qi Wu

AI summary

Fast-SmartWay enables efficient, real-world zero-shot navigation by replacing slow panoramic scans and waypoint predictors with a direct frontal-view MLLM pipeline that cuts latency while maintaining competitive performance.

Vision-and-Language Navigation, Zero-Shot Navigation, Multimodal LLMs, End-to-End Navigation, Real-World Robotics, Uncertainty-Aware Reasoning

Problem

Current zero-shot VLN-CE methods depend on slow 360-degree panoramic scans and two-stage waypoint predictors, which introduce high latency, limit compatibility with compact robots, and often misalign visual cues with language instructions.

Approach

The framework feeds three frontal RGB-D images and language instructions directly into a multimodal large language model to predict actions in a single step, augmented by an uncertainty-aware reasoning module that handles disambiguation and bidirectional planning.
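A minimal sketch of what one such single-step decision could look like, under stated assumptions. The action set, the left/center/right view labels, the prompt wording, the `ACTION:` reply format, and every name below (`FrontalView`, `build_prompt`, `predict_action`) are hypothetical and not taken from the paper. The model is passed in as a plain callable so the sketch does not assume any particular MLLM API, and the uncertainty-aware reasoning module is not modeled here.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Hypothetical discrete action set; the source does not list the actions.
ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")


@dataclass
class FrontalView:
    """One frontal RGB-D observation (field names are illustrative)."""
    label: str   # e.g. "left", "center", "right" (assumed arrangement)
    rgb: Any     # color image, opaque to this sketch
    depth: Any   # depth map, opaque to this sketch


# Stand-in type for any multimodal model: (prompt, views) -> text reply.
MLLM = Callable[[str, Sequence[FrontalView]], str]


def build_prompt(instruction: str,
                 views: Sequence[FrontalView],
                 history: Sequence[str]) -> str:
    """Combine the instruction, view labels, and past actions into one prompt."""
    past = ", ".join(history) if history else "none"
    labels = ", ".join(v.label for v in views)
    options = ", ".join(ACTIONS)
    return (
        f"Instruction: {instruction}\n"
        f"Frontal views provided: {labels}\n"
        f"Actions taken so far: {past}\n"
        f"Reply with a line 'ACTION: <one of {options}>'."
    )


def predict_action(mllm: MLLM,
                   instruction: str,
                   views: Sequence[FrontalView],
                   history: Sequence[str]) -> str:
    """One model call per step: no panoramic scan, no waypoint predictor."""
    if len(views) != 3:
        raise ValueError("this sketch expects exactly three frontal views")
    reply = mllm(build_prompt(instruction, views, history), views)
    match = re.search(r"ACTION:\s*([A-Z_]+)", reply)
    action = match.group(1) if match else "STOP"
    # Assumed safety choice: fall back to STOP on an unparseable or unknown action.
    return action if action in ACTIONS else "STOP"
```

In use, `mllm` would wrap whatever multimodal model is available, and the caller would append each returned action to `history` before the next step. Falling back to `STOP` on a malformed reply is a conservative choice made for this sketch, not a behavior described in the source.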

Key results

Eliminates panoramic scans and waypoint predictors for direct action prediction
Integrates uncertainty-aware reasoning to avoid dead-ends and improve planning consistency
Achieves competitive or superior performance against panoramic-view baselines
Demonstrates significantly reduced per-step latency in simulated and real-robot tests

Why it matters

It provides a practical, low-latency navigation solution for real-world robots constrained to front-facing cameras, accelerating the deployment of zero-shot embodied AI.

Abstract

Recent advances in Vision-and-Language Navigation in Continuous Environments (VLN-CE) have leveraged multimodal large language models (MLLMs) to achieve zero-shot navigation. However, existing methods often rely on panoramic observations and two-stage pipelines involving waypoint predictors, which introduce significant latency and limit real-world applicability. In this work, we propose Fast-SmartWay, an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions. To enhance decision robustness, we introduce an Uncertainty-Aware Reasoning module that integrates (i) a Disambiguation Module for avoiding local optima, and (ii) a Future-Past Bidirectional Reasoning mechanism for globally coherent planning. Experiments on both simulated and real-robot environments demonstrate that our method significantly reduces per-step latency while achieving competitive or superior performance compared to panoramic-view baselines. These results demonstrate the practicality and effectiveness of Fast-SmartWay for real-world zero-shot embodied navigation.

Index terms

Human-Robot Collaboration, Vision-Based Navigation