ICRA 2026

Notes-To-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

AI summary

Augmenting Vision-Language-Action models with a dynamic language scratchpad dramatically improves their ability to solve long-horizon, memory-dependent robotic tasks.
Vision-Language-Action; robotic memory; language scratchpad; long-horizon manipulation; ClevrSkills-Mem; recurrent policies

Problem

Stateless Vision-Language-Action models fail at long-horizon manipulation tasks because they cannot retain spatial object locations or track temporal task progress across multiple steps.

Approach

The model generates and continuously updates a textual scratchpad containing object coordinates, a task plan, and completed subgoals, which is fed back as context to guide future actions.
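The feedback loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Scratchpad` class, its text layout, the `stub_policy` stand-in, and the observation keys `visible` and `finished` are hypothetical and are not taken from the paper, which may format and update its scratchpad quite differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scratchpad:
    """Hypothetical text memory: object positions, a plan, finished subgoals."""
    objects: dict = field(default_factory=dict)    # name -> (x, y, z)
    plan: list = field(default_factory=list)       # ordered subgoal strings
    completed: list = field(default_factory=list)  # subgoals already done

    def render(self) -> str:
        """Serialize the memory as plain text so it can be added to the prompt."""
        lines = ["OBJECTS:"]
        for name, (x, y, z) in sorted(self.objects.items()):
            lines.append(f"  {name}: ({x:.2f}, {y:.2f}, {z:.2f})")
        lines.append("PLAN:")
        for step in self.plan:
            mark = "x" if step in self.completed else " "
            lines.append(f"  [{mark}] {step}")
        return "\n".join(lines)

    def next_subgoal(self) -> Optional[str]:
        """Return the first plan step that has not been completed yet."""
        for step in self.plan:
            if step not in self.completed:
                return step
        return None


def stub_policy(context: str, observation: dict, pad: Scratchpad) -> str:
    """Stand-in for a VLA forward pass (not a real model).

    A real model would condition on `context`; this stub ignores it and
    applies fixed rules so that the loop is runnable.
    """
    # Spatial memory: remember where objects were when they were visible.
    pad.objects.update(observation.get("visible", {}))
    # Temporal memory: record a subgoal reported as finished.
    done = observation.get("finished")
    if done in pad.plan and done not in pad.completed:
        pad.completed.append(done)
    subgoal = pad.next_subgoal()
    return f"execute: {subgoal}" if subgoal else "stop"


def run_episode(instruction: str, plan: list, observations: list):
    """Feed the rendered scratchpad back as context at every step."""
    pad = Scratchpad(plan=list(plan))
    actions = []
    for obs in observations:
        context = f"{instruction}\n{pad.render()}"  # memory re-enters as text
        actions.append(stub_policy(context, obs, pad))
    return actions, pad
```

Calling `run_episode` with a two-step plan and a short list of observation dictionaries returns the chosen actions together with the final scratchpad. The point of the sketch is only that memory lives in text that is re-read at every step, so the policy itself can remain stateless.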

Key results

  • Introduced ClevrSkills-Mem benchmark with five memory-dependent manipulation tasks
  • Achieved performance gains of roughly 48% and 11% for non-recurrent and recurrent VLAs, respectively
  • Enabled generalist VLAs to match specialized methods on MemoryBench
  • Demonstrated successful real-world pick-and-place execution with scratchpad-augmented policies

Why it matters

Offers a lightweight, language-based memory mechanism that enables state-of-the-art robotic foundation models to handle complex, multi-stage real-world tasks without architectural overhauls.

Abstract

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily “stateless” and struggle with memory-dependent long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.

Index terms

Deep Learning Methods; Machine Learning for Robot Control
