PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies
Jesse Zhang, Marius Memmel, Kevin Kim, Dieter Fox, Jesse Thomason, Fabio Ramos, Erdem Bıyık, Abhishek Gupta, Anqi Li
AI summary
Problem
Robot manipulation policies struggle to generalize to novel objects, clutter, or semantic variations because they must simultaneously learn where to attend, what actions to take, and how to execute them.
Approach
PEEK fine-tunes vision-language models to predict 2D gripper paths and task-relevant masking points, which are drawn directly onto robot observations to provide a simplified, policy-agnostic intermediate representation.
Key results
- 41.4× real-world success improvement for simulation-trained 3D policies
- 2–3.5× success rate gains across large VLAs and small transformer policies
- Scalable automatic annotation pipeline generating over 2 million training samples
- Consistent zero-shot generalization across 535 real-world evaluations
Why it matters
Enables robust, zero-shot robot manipulation in open-world settings by decoupling high-level semantic reasoning from low-level action execution.
Abstract
Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4× real-world improvement for a 3D policy trained only in simulation, and 2–3.5× gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how. Website at https://peek-robot.github.io.
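To make the overlay idea concrete, the following is a minimal sketch of how a predicted 2D path and a set of mask points might be drawn onto an RGB observation before it is passed to a policy. It is not the authors' implementation: the function names, the circular mask shape, the `radius` parameter, the black fill for masked-out pixels, and the single-color path are all assumptions made for illustration. The abstract states only that paths and masks are overlaid on the image.

```python
import numpy as np


def apply_point_mask(image, mask_points, radius):
    """Keep pixels within `radius` of any mask point; black out the rest.

    Assumption: a circular neighbourhood per point and a black fill.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    keep = np.zeros((h, w), dtype=bool)
    for x, y in mask_points:
        keep |= (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    out = image.copy()
    out[~keep] = 0
    return out


def draw_path(image, path_points, color=(255, 0, 0)):
    """Draw a one-pixel-wide polyline through (x, y) `path_points`."""
    out = image.copy()
    h, w = out.shape[:2]
    for (x0, y0), (x1, y1) in zip(path_points[:-1], path_points[1:]):
        # Sample densely enough that consecutive samples are <= 1 pixel apart.
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.rint(np.linspace(x0, x1, n)).astype(int)
        ys = np.rint(np.linspace(y0, y1, n)).astype(int)
        inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        out[ys[inside], xs[inside]] = color
    return out


def overlay_annotations(image, path_points, mask_points, radius=8):
    """Mask first, then draw the path so the path stays visible everywhere."""
    masked = apply_point_mask(image, mask_points, radius)
    return draw_path(masked, path_points)


# Hypothetical example: a flat gray 64x64 frame, a three-point path,
# and a single mask point at the end of the path.
image = np.full((64, 64, 3), 128, dtype=np.uint8)
path = [(10, 10), (30, 10), (30, 40)]
mask = [(30, 40)]
annotated = overlay_annotations(image, path, mask, radius=8)
```

Because the output is an ordinary image of the same shape as the input, any image-conditioned policy can consume it without architectural changes, which is the sense in which the abstract calls the representation policy-agnostic.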