ICRA 2026

PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies

Jesse Zhang, Marius Memmel, Kevin Kim, Dieter Fox, Jesse Thomason, Fabio Ramos, Erdem Bıyık, Abhishek Gupta, Anqi Li

AI summary

PEEK boosts zero-shot robot manipulation generalization by using fine-tuned vision-language models to generate minimal, policy-agnostic path and mask annotations that offload high-level reasoning from low-level policies.

Robot manipulation, Zero-shot generalization, Vision-language models, Policy generalization, Visual masking, Imitation learning

Problem

Robot manipulation policies struggle to generalize to novel objects, clutter, or semantic variations because they must simultaneously learn where to attend, what actions to take, and how to execute them.

Approach

PEEK fine-tunes vision-language models to predict 2D gripper paths and task-relevant masking points, which are drawn directly onto robot observations to provide a simplified, policy-agnostic intermediate representation.
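
To make the overlay idea concrete, here is a minimal sketch of drawing a predicted 2D path and a point-based mask onto an observation. It is an illustration, not the authors' implementation: the function name `overlay_path_and_mask`, the parameters `radius` and `keep`, the circular visible regions, and the red-to-blue path colouring are all assumptions, and the points are hard-coded where PEEK would take them from the fine-tuned VLM.

```python
import numpy as np


def overlay_path_and_mask(image, path_xy, mask_xy, radius=2, keep=12):
    """Return a copy of `image` with a mask and a path drawn onto it.

    image:   (H, W, 3) uint8 array.
    path_xy: list of (x, y) pixel waypoints for the gripper path.
    mask_xy: list of (x, y) pixel points marking task-relevant regions.
    radius:  dot radius for the path, in pixels (hypothetical default).
    keep:    radius kept visible around each mask point (hypothetical default).
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Masking (assumed scheme): black out every pixel that is farther
    # than `keep` pixels from all mask points.
    visible = np.zeros((h, w), dtype=bool)
    for x, y in mask_xy:
        visible |= (xs - x) ** 2 + (ys - y) ** 2 <= keep ** 2
    out = np.where(visible[..., None], image, 0).astype(np.uint8)

    # Path (assumed scheme): interpolate between consecutive waypoints and
    # stamp dots whose colour shifts from red (start) to blue (end).
    pts = np.asarray(path_xy, dtype=float)
    n_seg = len(pts) - 1
    for i in range(n_seg):
        steps = int(np.ceil(np.abs(pts[i + 1] - pts[i]).max())) + 1
        for t in np.linspace(0.0, 1.0, steps):
            x, y = (1 - t) * pts[i] + t * pts[i + 1]
            frac = (i + t) / n_seg
            colour = (int(255 * (1 - frac)), 0, int(255 * frac))
            out[(xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2] = colour
    return out


# Usage: a flat grey frame, one mask point, and a two-waypoint path.
frame = np.full((32, 32, 3), 200, dtype=np.uint8)
annotated = overlay_path_and_mask(
    frame, path_xy=[(4, 4), (20, 20)], mask_xy=[(16, 16)], radius=1, keep=5
)
```

Because the annotation lives in pixel space, the downstream policy receives `annotated` exactly as it would receive a raw frame, which is what lets the same representation feed different policy architectures.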

Key results

41.4× real-world success improvement for a 3D policy trained only in simulation
2–3.5× success rate gains across large VLAs and small transformer policies
Scalable automatic annotation pipeline generating over 2 million training samples
Consistent zero-shot generalization across 535 real-world evaluations

Why it matters

Enables robust, zero-shot robot manipulation in open-world settings by decoupling high-level semantic reasoning from low-level action execution.

Abstract

Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4× real-world improvement for a 3D policy trained only in simulation, and 2–3.5× gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how. Website at https://peek-robot.github.io.

Index terms

Imitation Learning, Transfer Learning, Learning from Demonstration