ICRA 2026

ActiveVLN: Towards Active Exploration Via Multi-Turn RL in Vision-And-Language Navigation

Zekai Zhang, Weiye Zhu, Hewei Pan, Xiangchen Wang, Rongtao Xu, XING Sun, Feng Zheng

AI summary

ActiveVLN leverages multi-turn reinforcement learning and active exploration to significantly boost navigation success rates, outperforming larger models with less data and training time.
Vision-and-Language Navigation, Reinforcement Learning, Active Exploration, Multi-Turn RL, Multimodal LLMs, Embodied AI

Problem

Existing MLLM-based navigation methods rely on imitation learning or DAgger, which demand extensive expert data, incur high costs, and struggle with covariate shift. Prior reinforcement learning approaches lack dynamic environmental interaction and open-ended exploration, limiting route discovery and generalization.

Approach

The framework bootstraps a policy with minimal expert demonstrations, then refines it through multi-turn reinforcement learning where the agent actively explores, collects self-generated trajectories, and optimizes them using GRPO with a dynamic early-stopping strategy.
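The group-relative step at the heart of GRPO can be illustrated with a minimal sketch. It assumes a binary success reward per rollout and a population standard deviation for normalization; neither detail is stated in this summary, and the function name is hypothetical rather than taken from the authors' code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    `rewards` holds one scalar per rollout sampled for the same instruction.
    The binary reward and the population std are assumptions for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one instruction: two reach the goal, two do not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, so the policy update favors the better trajectories within each group without needing a learned value function.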

Key results

  • +11.6 SR on R2R and +9.7 SR on RxR over imitation learning baselines
  • Surpasses larger DAgger-based and prior RL models despite using a smaller 3B-parameter model
  • Reduces training time and data collection costs while remaining competitive with state-of-the-art approaches
  • Successfully validated on a real-world wheeled humanoid robot

Why it matters

The results suggest that active exploration via multi-turn RL can efficiently refine navigation policies without heavy expert supervision, offering a scalable and cost-effective path toward robust embodied AI.

Abstract

The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent's ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model.
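The multi-turn collection loop with pruning can be sketched on a toy problem. The abstract does not specify the dynamic early-stopping rule, so the sketch below substitutes a plain turn budget; the environment, policies, and all names are hypothetical stand-ins, not the authors' implementation.

```python
def rollout(policy, step, obs, max_turns):
    """Run one multi-turn episode, cutting it off at a turn budget.

    Returns (trajectory, pruned). `pruned` is True when the budget ran out
    before the episode ended. The fixed budget is an assumed stand-in for
    the paper's dynamic early-stopping strategy.
    """
    trajectory = []
    for _ in range(max_turns):
        action = policy(obs)              # agent predicts an action
        trajectory.append((obs, action))
        obs, done = step(obs, action)     # environment executes it
        if done:
            return trajectory, False
    return trajectory, True


GOAL = 3  # toy 1-D corridor: the goal sits three cells from the start


def toy_step(pos, action):
    """Hypothetical environment: 'forward' advances one cell."""
    new_pos = pos + 1 if action == "forward" else pos
    return new_pos, new_pos == GOAL


def forward_policy(pos):
    return "forward"


def stuck_policy(pos):
    return "wait"


good, good_pruned = rollout(forward_policy, toy_step, 0, max_turns=10)
bad, bad_pruned = rollout(stuck_policy, toy_step, 0, max_turns=10)
```

Here the stalled episode is truncated at the budget instead of running indefinitely, which is the kind of long-tail trajectory the abstract says is pruned to save RL compute.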

Index terms

Vision-Based Navigation, Reinforcement Learning, Visual Learning
