ICRA 2026

ExpReS-VLA: Specializing Vision-Language-Action Models through Experience Replay and Retrieval

Shahram Najam Syed, Yatharth Ahuja, Arthur Jakobsson, Jeffrey Ichnowski

AI summary

ExpReS-VLA specializes pre-trained VLA models on-device in 31 seconds from only 12 demonstrations, raising LIBERO simulation success rates by roughly 10 to 11 percentage points over base OpenVLA and reaching 98% success on physical robots while preventing catastrophic forgetting.

Vision-Language-Action Models; Catastrophic Forgetting; Retrieval-Augmented Generation; On-Device Adaptation; Experience Replay; Contrastive Learning

Problem

Pre-trained Vision-Language-Action models struggle to adapt to specific deployment environments due to domain shift and catastrophic forgetting during fine-tuning, while existing adaptation methods require extensive compute or ignore failed attempts.

Approach

ExpReS-VLA enables rapid on-device adaptation by storing compressed embeddings from OpenVLA's frozen vision backbone in a dual-buffer memory, retrieving the k most similar past experiences by cosine similarity to augment training batches, and using a Thresholded Hybrid Contrastive Loss (THCL) to learn from both successful and failed trajectories.
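
The retrieval step can be illustrated with a minimal sketch. It assumes each stored experience is represented by a plain list of floats; the function names `cosine_similarity` and `retrieve_top_k` and the toy three-dimensional vectors are hypothetical and are not taken from the paper or from OpenVLA.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float], memory: List[Sequence[float]], k: int
) -> List[Tuple[float, int]]:
    """Return (similarity, index) pairs for the k stored embeddings
    most similar to the query, most similar first."""
    scored = ((cosine_similarity(query, emb), i) for i, emb in enumerate(memory))
    return heapq.nlargest(k, scored)


# Toy memory of stored embeddings (illustrative values only).
memory = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
query = [1.0, 0.0, 0.0]
top2 = retrieve_top_k(query, memory, k=2)
top2_indices = [index for _, index in top2]
```

In this toy example the two retrieved indices point at the stored vectors closest in direction to the query; in a real pipeline those indices would select the past experiences added to the training batch.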

Key results

97% storage reduction via compressed embedding replay
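10.5 to 11.3 percentage-point gain in LIBERO success rate over base OpenVLA (82.6% to 93.1% on spatial reasoning, 61% to 72.3% on long-horizon tasks)
98% physical robot success on both in-distribution and out-of-distribution tasks (up from 84.7% and 32% with naive fine-tuning)
31-second on-device adaptation using only 12 demonstrations on a single RTX 5090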

Why it matters

Enables practical, rapid specialization of generalist robot policies for real-world deployment on consumer hardware without requiring extensive compute or risking catastrophic forgetting.

Abstract

Vision-Language-Action (VLA) models like OpenVLA demonstrate impressive zero-shot generalization across robotic manipulation tasks but struggle to adapt to specific deployment environments where consistent high performance on a limited set of tasks is more valuable than broad generalization. We present EXPerience replayed, REtrieval augmented, Specialized VLA (ExpReS-VLA), a method that enables rapid on-device adaptation of pre-trained VLAs to target domains while preventing catastrophic forgetting through compressed experience replay and retrieval-augmented generation. Our approach maintains a memory-efficient buffer by storing extracted embeddings from OpenVLA’s frozen vision backbone, reducing storage requirements by 97% compared to raw image-action pairs. During deployment, ExpReS-VLA retrieves the k most similar past experiences using cosine similarity to augment training batches, while a prioritized experience replay buffer preserves recently successful trajectories. To leverage failed attempts, we introduce Thresholded Hybrid Contrastive Loss (THCL), enabling the model to learn from both successful and unsuccessful demonstrations collected during deployment. Experiments on the LIBERO simulation benchmark show that ExpReS-VLA improves success rates from 82.6% to 93.1% on spatial reasoning tasks and from 61% to 72.3% on long-horizon tasks compared to base OpenVLA, with consistent gains across VLA architectures including π0 (+3.2 points) and OpenVLA-OFT (+1.7 points). Physical robot experiments across five manipulation tasks demonstrate that our approach achieves 98% success on both in-distribution and out-of-distribution tasks (with unseen backgrounds and objects), up from 84.7% and 32%, respectively, with naive fine-tuning. ExpReS-VLA accomplishes this adaptation in 31 seconds using only 12 demonstrations on a single RTX 5090, making it practical for real-world deployment where robots must quickly specialize to their specific operating environment.
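
The idea of a replay buffer that preserves recently successful trajectories can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the class name `RecentSuccessBuffer`, the fixed capacity, the exponential recency weighting, and the `decay` parameter are all hypothetical. Failed attempts, which the paper uses through THCL, are simply skipped here and would need their own storage.

```python
import random
from collections import deque
from typing import Deque, List


class RecentSuccessBuffer:
    """Bounded buffer of successful trajectory ids, sampled with a bias
    toward the most recent entries (assumed weighting, not from the paper)."""

    def __init__(self, capacity: int, decay: float = 0.8, seed: int = 0) -> None:
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self.items: Deque[str] = deque(maxlen=capacity)
        self.decay = decay
        self.rng = random.Random(seed)

    def add(self, traj_id: str, success: bool) -> None:
        """Store only successful trajectories in this buffer."""
        if success:
            self.items.append(traj_id)

    def sample(self, n: int) -> List[str]:
        """Draw n ids with replacement; newer entries get larger weights."""
        if not self.items:
            return []
        newest = len(self.items) - 1
        weights = [self.decay ** (newest - i) for i in range(len(self.items))]
        return self.rng.choices(list(self.items), weights=weights, k=n)


buf = RecentSuccessBuffer(capacity=3)
for traj_id, success in [("t1", True), ("t2", False), ("t3", True),
                         ("t4", True), ("t5", True)]:
    buf.add(traj_id, success)

stored = list(buf.items)   # oldest to newest among retained successes
batch = buf.sample(4)      # replay samples to mix into a training batch
```

After the five calls, the buffer holds only the three most recent successes: the failed attempt was never stored and the oldest success was evicted when capacity was reached.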

Index terms

Continual Learning; Machine Learning for Robot Control; Incremental Learning