ICRA 2026

Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval

Davide Buoso, Luke Robinson, Giuseppe Averta, Philip Torr, Tim Franzmeyer, Daniele De Martini

AI summary

A training-free framework that boosts off-the-shelf Vision-Language Models for robot navigation by combining structured visual questioning with in-context memory retrieval, eliminating the need for task-specific fine-tuning.

Training-free planning, Vision-Language Models, In-Context Learning, Robot navigation, Visual Question Answering, Memory retrieval

Problem

Learning-based robot navigation methods typically require extensive task-specific training and large-scale data collection, often failing to generalize to unfamiliar or ambiguous deployment scenarios.

Approach

Select2Plan adapts pre-trained Vision-Language Models for navigation by using structured Visual Question Answering to ground action selection and In-Context Learning to retrieve relevant annotated examples from a memory bank for planning.
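
The retrieval step described above can be illustrated with a minimal sketch. It assumes each memory-bank entry stores an embedding of its annotated image alongside a text annotation, and that relevance is scored by cosine similarity; the summary does not specify either choice. All names here (`MemoryEntry`, `retrieve_examples`, `build_icl_prompt`) are hypothetical and are not taken from the S2P codebase.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    # Assumption: each annotated example is represented by a feature vector
    # and a short text annotation describing the demonstrated choice.
    embedding: list[float]
    annotation: str


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_examples(query: list[float], memory: list[MemoryEntry], k: int = 2) -> list[MemoryEntry]:
    """Return the k memory entries most similar to the query embedding."""
    ranked = sorted(
        memory,
        key=lambda entry: cosine_similarity(query, entry.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_icl_prompt(examples: list[MemoryEntry], question: str) -> str:
    """Place the retrieved annotations before the question as in-context examples."""
    lines = [f"Example {i}: {ex.annotation}" for i, ex in enumerate(examples, start=1)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Toy memory bank; the embeddings and annotations are made up for illustration.
memory_bank = [
    MemoryEntry([1.0, 0.0], "corridor scene, chose the left marker"),
    MemoryEntry([0.0, 1.0], "kitchen scene, chose the far marker"),
    MemoryEntry([0.9, 0.1], "hallway scene, chose the centre marker"),
]
selected = retrieve_examples([1.0, 0.0], memory_bank, k=2)
prompt = build_icl_prompt(selected, "Which marker leads towards the chair?")
```

In a real pipeline the retrieved images would presumably be passed to the VLM together with their annotations; the text-only prompt here is a simplification.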

Key results

40% improvement over a baseline VLM in Third-Person View (TPV) navigation
Approximately 24% improvement over end-to-end trained models in First-Person View (FPV) navigation towards unseen objects in novel scenes
Effective generalization to novel scenes with minimal demonstrations
Seamless cross-setup adaptability for both FPV and TPV navigation

Why it matters

Provides a flexible, data-efficient alternative to costly fine-tuning for deploying robust autonomous navigation systems in diverse real-world environments.

Abstract

We introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning that leverages off-the-shelf Vision-Language Models (VLMs) for autonomous navigation. Unlike most learning-based approaches that require extensive task-specific training and large-scale data collection, S2P overcomes the need for fine-tuning by adapting inputs to align with the VLM’s pretraining data. Our method achieves this through a combination of structured Visual Question Answering (VQA) to ground action selection on the image, and In-Context Learning (ICL) to exploit knowledge drawn from relevant examples from a memory bank of (visually) annotated data, which can include diverse, in-the-wild sources. We demonstrate S2P's flexibility by evaluating it in both First-Person View (FPV) and Third-Person View (TPV) navigation. S2P improves the performance of a baseline VLM by 40% in TPV and surpasses end-to-end trained models by approximately 24% in FPV when tasked with navigating towards unseen objects in novel scenes. These results highlight the adaptability, simplicity, and effectiveness of our training-free approach, demonstrating that the use of pre-trained VLMs with structured memory retrieval enables robust high-level robot planning without costly task-specific training. Our experiments also show that retrieving samples from heterogeneous data sources, including online videos of different robots or humans walking, is highly beneficial for navigation. Notably, our method effectively generalizes to novel scenarios, requiring only a handful of demonstrations. Project Page: lambdavi.github.io/select2plan
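
As a rough illustration of how a structured VQA query can ground action selection on the image, the sketch below assumes that candidate positions are drawn on the image as numbered markers and that the VLM replies in free text. The abstract does not specify the annotation scheme or the prompt wording, so the numbering, the question template, and the function names (`build_vqa_question`, `parse_selection`) are all hypothetical.

```python
import re
from typing import Optional


def build_vqa_question(goal: str, candidate_labels: list[int]) -> str:
    """Phrase action selection as a question over numbered markers in the image.

    Assumption: the candidates have already been drawn on the image as numbers.
    """
    options = ", ".join(str(label) for label in candidate_labels)
    return (
        f"The image shows candidate positions marked {options}. "
        f"Which marked position should the robot move to in order to reach the {goal}? "
        "Answer with a single number."
    )


def parse_selection(answer: str, candidate_labels: list[int]) -> Optional[int]:
    """Return the first number in the reply that matches a valid marker, else None."""
    for token in re.findall(r"\d+", answer):
        if int(token) in candidate_labels:
            return int(token)
    return None


# The reply string stands in for a VLM response; no model is called here.
question = build_vqa_question("chair", [1, 2, 3])
choice = parse_selection("I would pick 3 because it is closest to the chair.", [1, 2, 3])
```

Restricting the answer to a closed set of markers keeps the output easy to validate: a reply that names no valid marker yields `None`, which a caller could treat as a signal to re-query.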

Index terms

Motion and Path Planning, Vision-Based Navigation, Autonomous Agents