ICRA 2026

AMPLIFY: Actionless Motion Priors for Robot Learning from Videos

Jeremy Collins, Lorand Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, Animesh Garg

AI summary

AMPLIFY decouples visual motion prediction from action inference using latent keypoint tokens, enabling robots to learn effective policies from large amounts of action-free video and only a small amount of action-labeled data.

Robot Learning, Actionless Priors, Latent Dynamics, Keypoint Tracking, Video Pre-training, Policy Generalization

Problem

Training generalist robot policies typically requires prohibitively large amounts of expensive, action-labeled expert demonstrations, while abundant action-free video data remains underutilized due to the difficulty of translating visual observations into effective control policies.

Approach

AMPLIFY compresses keypoint trajectories from videos into discrete latent motion tokens, trains a forward dynamics model on action-free videos to predict these motions, and uses a separate inverse dynamics model to map predicted motions to robot actions, allowing independent scaling of video and action data.
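The separation described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the 5-entry `CODEBOOK`, the nearest-neighbor quantizer, the bigram-count "forward model", the per-token mean "inverse model", and the scalar actions are all hypothetical stand-ins for the learned components. The point it shows is only the data flow: the forward model is fit from motion tokens alone (no actions), while the inverse model is the only piece that consumes action labels.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical codebook of 2-D motion prototypes (image coordinates, y down):
# index 0 = stay, 1 = right, 2 = left, 3 = down, 4 = up.
CODEBOOK: List[Point] = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]


def tokenize(track: Sequence[Point]) -> List[int]:
    """Map each per-step keypoint displacement to its nearest codebook index."""
    tokens = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        tokens.append(min(
            range(len(CODEBOOK)),
            key=lambda i: (dx - CODEBOOK[i][0]) ** 2 + (dy - CODEBOOK[i][1]) ** 2,
        ))
    return tokens


def fit_forward_model(token_seqs: Sequence[Sequence[int]]) -> Dict[int, int]:
    """Toy forward dynamics: the most frequent next token after each token.

    It sees motion tokens only, so it can be fit on action-free video.
    """
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in token_seqs:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: c.most_common(1)[0][0] for cur, c in counts.items()}


def fit_inverse_model(pairs: Sequence[Tuple[int, float]]) -> Dict[int, float]:
    """Toy inverse dynamics: the mean action observed for each motion token.

    This is the only step that needs action labels.
    """
    grouped: Dict[int, List[float]] = defaultdict(list)
    for token, action in pairs:
        grouped[token].append(action)
    return {t: sum(a) / len(a) for t, a in grouped.items()}


# Action-free "videos": keypoint tracks only (toy values).
video_tracks = [
    [(0.0, 0.0), (0.9, 0.1), (1.8, 0.0), (1.8, 1.1)],
    [(5.0, 5.0), (6.1, 5.0), (6.0, 6.0)],
]
token_seqs = [tokenize(t) for t in video_tracks]
forward = fit_forward_model(token_seqs)

# A small action-labeled set: (motion token, scalar action) pairs (toy values).
inverse = fit_inverse_model([(1, 0.5), (1, 0.7), (3, -0.2)])

# Inference: predict the next motion token, then translate it into an action.
predicted_token = forward[1]
predicted_action = inverse[predicted_token]
```

Because the two tables are fit from disjoint inputs, more video can be added to `video_tracks` without touching the labeled pairs, and vice versa, which is the independent-scaling property the summary refers to.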

Key results

Over 3× lower MSE and 2.5× better pixel accuracy in keypoint trajectory prediction
1.2–2.2× success rate improvement in low-data regimes, and a 1.4× average improvement from learning on action-free human videos
First generalization to LIBERO tasks with zero in-distribution action data (60% success)
Beyond robotic control, the learned latent dynamics also improve conditional video prediction quality
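To make the trajectory-prediction figures above concrete, the sketch below shows how an "N× lower MSE" factor arises as a ratio of two errors. The tracks are toy values invented for illustration, not data from the paper. The `within_threshold_accuracy` function is one plausible, assumed definition of a pixel-level accuracy (fraction of predicted points within a pixel radius of the ground truth); the paper's exact metric may differ.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def track_mse(pred: Sequence[Point], true: Sequence[Point]) -> float:
    """Mean squared error averaged over every x and y coordinate."""
    sq = [(px - tx) ** 2 + (py - ty) ** 2 for (px, py), (tx, ty) in zip(pred, true)]
    return sum(sq) / (2 * len(sq))


def within_threshold_accuracy(
    pred: Sequence[Point], true: Sequence[Point], threshold_px: float = 1.0
) -> float:
    """Fraction of predicted points within threshold_px of the truth.

    Assumed definition for illustration only.
    """
    hits = sum(
        ((px - tx) ** 2 + (py - ty) ** 2) ** 0.5 <= threshold_px
        for (px, py), (tx, ty) in zip(pred, true)
    )
    return hits / len(pred)


# Toy ground-truth track and two hypothetical predictions.
true = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
model_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 1.0)]
model_b = [(0.0, 0.0), (1.0, 3.0), (2.0, 0.0), (3.0, 3.0)]

mse_a = track_mse(model_a, true)
mse_b = track_mse(model_b, true)
mse_factor = mse_b / mse_a  # how many times lower model_a's MSE is

acc_a = within_threshold_accuracy(model_a, true)
acc_b = within_threshold_accuracy(model_b, true)
```

In this toy case `model_a` has a 9× lower MSE than `model_b` and twice the within-threshold accuracy, which is the kind of comparison the reported factors summarize.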

Why it matters

AMPLIFY offers a scalable, data-efficient paradigm for robot learning that bridges the gap between abundant internet-scale video data and scarce action-labeled demonstrations, which is useful to researchers and practitioners building generalist robot policies.

Abstract

Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate our dynamics model achieves over 2× better point track prediction accuracy compared to the prior state-of-the-art. In downstream policy learning, our dynamics predictions enable a 1.2-2.2× success rate improvement in low-data regimes, a 1.4× average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks with zero in-distribution action data. Beyond robotic control, we find the latent dynamics learned by AMPLIFY to enhance video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at amplify-robotics.github.io.

Index terms

Representation Learning, Imitation Learning, Machine Learning for Robot Control