ICRA 2026

MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation

Runhao Li, Wenkai Guo, Zhenyu Wu, Changyuan Wang, Haoyuan Deng, Zhenyu Weng, Yap-Peng Tan, Ziwei Wang

AI summary

Dynamically retrieving and integrating demonstration-derived memory prompts into frozen VLA models significantly boosts performance on long-horizon robotic manipulation tasks.

Vision-Language-Action models, robotic manipulation, episodic memory, prompt tuning, long-horizon tasks, retrieval-augmented generation

Problem

Current vision-language-action models lack episodic memory and rely solely on immediate sensory inputs, causing them to fail at complex, long-horizon tasks where recalling past expert demonstrations is crucial.

Approach

MAP-VLA constructs a library of stage-specific soft prompts from expert demonstrations and dynamically retrieves the most relevant memory during execution to augment action generation via prompt ensembling, all without updating the base model's weights.
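The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not state which trajectory similarity measure is used, so mean Euclidean distance over aligned recent states is an assumption here, and the names `MemoryUnit`, `trajectory_distance`, and `retrieve` are hypothetical. Each memory unit pairs a reference trajectory segment with a plain list of floats standing in for a learned soft prompt.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryUnit:
    """Hypothetical memory unit: one task stage from an expert demonstration."""
    stage: str                      # illustrative stage label
    trajectory: List[List[float]]   # reference robot states for this stage
    prompt: List[float]             # stand-in for a learned soft prompt


def trajectory_distance(a: List[List[float]], b: List[List[float]]) -> float:
    """Mean Euclidean distance over the last k aligned states (assumed metric)."""
    k = min(len(a), len(b))
    if k == 0:
        raise ValueError("trajectories must be non-empty")
    return sum(math.dist(p, q) for p, q in zip(a[-k:], b[-k:])) / k


def retrieve(library: List[MemoryUnit], recent: List[List[float]],
             top_k: int = 1) -> List[MemoryUnit]:
    """Return the top_k memory units closest to the recent trajectory."""
    ranked = sorted(library,
                    key=lambda m: trajectory_distance(m.trajectory, recent))
    return ranked[:top_k]


# Toy library with two stages and 2-D states.
library = [
    MemoryUnit("reach", [[0.0, 0.0], [1.0, 0.0]], [0.1, 0.2]),
    MemoryUnit("grasp", [[1.0, 0.0], [1.0, 1.0]], [0.3, 0.4]),
]
recent = [[0.9, 0.1], [1.0, 0.9]]
best = retrieve(library, recent)[0]
print(best.stage)
```

In this toy setting the recent states lie closer to the second reference segment, so that unit's prompt is the one that would be passed on to the frozen model.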

Key results

Stage-specific memory prompts encoded via prompt tuning
Trajectory similarity-based memory retrieval mechanism
Dynamic prompt ensembling for robust action generation
Up to 7.0% absolute performance gain in simulation and 25.0% in real-robot evaluations on long-horizon tasks
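One way to picture the dynamic prompt ensembling listed above is a similarity-weighted average of several retrieved soft prompts. The sketch below is an assumption for illustration only: the summary does not specify how MAP-VLA combines prompts, and the softmax weighting, the `temperature` parameter, and the function names are hypothetical. Prompts are represented as plain lists of floats rather than model embeddings.

```python
import math
from typing import List


def softmax_weights(distances: List[float], temperature: float = 1.0) -> List[float]:
    """Turn retrieval distances into weights; smaller distance gives larger weight."""
    scores = [-d / temperature for d in distances]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_prompts(prompts: List[List[float]], distances: List[float],
                     temperature: float = 1.0) -> List[float]:
    """Weighted average of equally sized prompt vectors (assumed ensembling rule)."""
    weights = softmax_weights(distances, temperature)
    dim = len(prompts[0])
    return [sum(w * p[i] for w, p in zip(weights, prompts)) for i in range(dim)]


# Two retrieved prompts at equal distance contribute equally.
blended = ensemble_prompts([[0.0, 2.0], [2.0, 0.0]], [1.0, 1.0])
print(blended)
```

A lower `temperature` concentrates the weight on the closest memory, while a higher one approaches a uniform average; the base model's weights are untouched in either case, since only the prompt input changes.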

Why it matters

Provides a lightweight, plug-and-play solution to equip frozen foundation models with episodic memory, advancing general-purpose robotic manipulation for complex multi-step tasks.

Abstract

Pre-trained Vision-Language-Action (VLA) models have achieved remarkable success in improving robustness and generalization for end-to-end robotic manipulation. However, these models struggle with long-horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory-Augmented Prompting for Vision-Language-Action model (MAP-VLA), a novel framework that empowers pre-trained VLA models with demonstration-derived memory prompts to augment action generation for long-horizon robotic manipulation tasks. To achieve this, MAP-VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real-time task execution, MAP-VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug-and-play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP-VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long-horizon tasks, surpassing the current state-of-the-art methods.

Index terms

Imitation Learning, Deep Learning in Grasping and Manipulation, Deep Learning Methods