RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
AI summary
Problem
Vision-language-action models are bottlenecked by the scarcity and high cost of large-scale robot manipulation data, creating a gap between visual understanding and low-level action execution.
Approach
The authors propose a two-stage pretraining pipeline that first learns future frame prediction from ego-centric human videos, then aligns visual dynamics with robot actions by predicting human keypoint trajectories. A dedicated ActionVAE further compresses action sequences into compact latent embeddings for efficient control.
Key results
- 90.6% average success rate on manipulation tasks, surpassing GR00T N1.5 and Pi0
- Novel two-stage ego-centric video pretraining bridging visual dynamics and robot actions
- ActionVAE framework for compressing action chunks into smooth, coherent latent embeddings
- Strong generalization across single-target, multi-target, and distractor-heavy scenarios
Why it matters
Offers a scalable, data-efficient pathway to train capable robotic policies without relying on expensive real-world teleoperation datasets.
Abstract
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model to predict future frames based on an image and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
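To make the ActionVAE idea concrete, the following is a minimal sketch of how a variational autoencoder can compress a chunk of actions into one compact latent vector. It is not the authors' implementation. The chunk length, action dimension, latent size, the single linear encoder and decoder, the `beta` weight, and the untrained random weights are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
CHUNK_LEN = 20    # actions per chunk
ACTION_DIM = 6    # dimensions per action
LATENT_DIM = 8    # size of the compact embedding
FLAT_DIM = CHUNK_LEN * ACTION_DIM


def init_linear(in_dim, out_dim):
    """Small random weights and zero bias for one linear layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


# Assumed architecture: one linear layer each for mean, log-variance, decoder.
W_mu, b_mu = init_linear(FLAT_DIM, LATENT_DIM)
W_lv, b_lv = init_linear(FLAT_DIM, LATENT_DIM)
W_dec, b_dec = init_linear(LATENT_DIM, FLAT_DIM)


def encode(chunk):
    """Map an action chunk to the mean and log-variance of q(z | chunk)."""
    x = chunk.reshape(-1)
    return x @ W_mu + b_mu, x @ W_lv + b_lv


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so the sample stays differentiable in mu, sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z):
    """Reconstruct the full action chunk from the latent embedding."""
    return (z @ W_dec + b_dec).reshape(CHUNK_LEN, ACTION_DIM)


def vae_loss(chunk, beta=1e-3):
    """Reconstruction error plus a beta-weighted KL term to a unit Gaussian prior."""
    mu, logvar = encode(chunk)
    z = reparameterize(mu, logvar)
    recon = decode(z)
    recon_loss = np.mean((recon - chunk) ** 2)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon_loss + beta * kl, kl, z, recon


# One synthetic chunk of actions in [-1, 1], standing in for real robot data.
chunk = rng.uniform(-1.0, 1.0, size=(CHUNK_LEN, ACTION_DIM))
loss, kl, z, recon = vae_loss(chunk)
print(z.shape, recon.shape)
```

In this sketch a policy would only need to predict the 8-dimensional `z` instead of all 120 values of the chunk, which is the sense in which a latent action embedding reduces the complexity of the output space. The weights here are never trained, so the reconstruction is not meaningful.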