ExpReS-VLA: Specializing Vision-Language-Action Models through Experience Replay and Retrieval
Shahram Najam Syed, Yatharth Ahuja, Arthur Jakobsson, Jeffrey Ichnowski
AI summary
Problem
Pre-trained Vision-Language-Action models struggle to adapt to specific deployment environments due to domain shift and catastrophic forgetting during fine-tuning, while existing adaptation methods require extensive compute or ignore failed attempts.
Approach
ExpReS-VLA enables rapid on-device adaptation by storing compressed vision embeddings in a dual-buffer memory, retrieving the k most similar past experiences by cosine similarity to augment training batches, and using a Thresholded Hybrid Contrastive Loss (THCL) to learn from both successful and failed trajectories.
Key results
- 97% storage reduction via compressed embedding replay
- 10-11 percentage-point simulation success rate improvement over base OpenVLA (82.6% to 93.1% on spatial reasoning tasks, 61% to 72.3% on long-horizon tasks)
- 98% physical robot success on in-distribution and out-of-distribution tasks
- 31-second on-device adaptation using only 12 demonstrations on a single RTX 5090
Why it matters
Enables practical, rapid specialization of generalist robot policies for real-world deployment on consumer hardware without requiring extensive compute or risking catastrophic forgetting.
Abstract
Vision-Language-Action (VLA) models such as OpenVLA demonstrate impressive zero-shot generalization across robotic manipulation tasks but struggle to adapt to specific deployment environments where consistent high performance on a limited set of tasks is more valuable than broad generalization. We present EXPierence replayed, REtrieval augmented, Specialized VLA (ExpReS-VLA), a method that enables rapid on-device adaptation of pre-trained VLAs to target domains while preventing catastrophic forgetting through compressed experience replay and retrieval-augmented generation. Our approach maintains a memory-efficient buffer by storing extracted embeddings from OpenVLA's frozen vision backbone, reducing storage requirements by 97% compared to raw image-action pairs. During deployment, ExpReS-VLA retrieves the k most similar past experiences using cosine similarity to augment training batches, while a prioritized experience replay buffer preserves recently successful trajectories. To leverage failed attempts, we introduce Thresholded Hybrid Contrastive Loss (THCL), enabling the model to learn from both successful and unsuccessful demonstrations collected during deployment. Experiments on the LIBERO simulation benchmark show that ExpReS-VLA improves success rates from 82.6% to 93.1% on spatial reasoning tasks and from 61% to 72.3% on long-horizon tasks compared to base OpenVLA, with consistent gains across VLA architectures including π0 (+3.2 points) and OpenVLA-OFT (+1.7 points). Physical robot experiments across five manipulation tasks demonstrate that our approach achieves 98% success on both in-distribution and out-of-distribution tasks (with unseen backgrounds and objects), compared with 84.7% and 32%, respectively, for naive fine-tuning. ExpReS-VLA accomplishes this adaptation in 31 seconds using only 12 demonstrations on a single RTX 5090, making it practical for real-world deployment where robots must quickly specialize to their specific operating environment.
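The retrieval step described in the abstract, selecting the k most similar stored experiences by cosine similarity, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `top_k_cosine`, the embedding dimension of 64, the buffer size of 100, and the random embeddings are all assumptions chosen for illustration, standing in for embeddings that would come from the frozen vision backbone.

```python
import numpy as np


def top_k_cosine(query, buffer, k):
    """Return indices and cosine scores of the k buffer rows most similar to query.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    # Normalize so that a dot product equals cosine similarity.
    q = query / (np.linalg.norm(query) + 1e-12)
    b = buffer / (np.linalg.norm(buffer, axis=1, keepdims=True) + 1e-12)
    sims = b @ q
    k = min(k, len(sims))
    # Sort by descending similarity and keep the first k entries.
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]


# Assumed stand-in for a buffer of stored embeddings (100 entries, 64 dims).
rng = np.random.default_rng(0)
buffer = rng.normal(size=(100, 64)).astype(np.float32)

# A query that is a slightly perturbed copy of entry 7.
query = buffer[7] + 0.01 * rng.normal(size=64).astype(np.float32)

idx, scores = top_k_cosine(query, buffer, k=5)
print(idx, scores)
```

In this toy setup the perturbed entry should rank first, and the remaining retrieved indices would be the experiences added to a training batch. A deployed system would likely replace the brute-force scan with an approximate nearest-neighbor index once the buffer grows large.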