ICRA 2026

RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li

AI summary

Pretraining a vision-language-action model on ego-centric human videos and keypoint trajectories significantly boosts robot manipulation success rates over state-of-the-art baselines.
vision-language-action models, robot manipulation, ego-centric video pretraining, human demonstrations, action representation, embodied AI

Problem

Vision-language-action models are bottlenecked by the scarcity and high cost of large-scale robot manipulation data, creating a gap between visual understanding and low-level action execution.

Approach

The authors propose a two-stage pretraining pipeline that first learns future frame prediction from ego-centric human videos, then aligns visual dynamics with robot actions by predicting human keypoint trajectories. A dedicated ActionVAE further compresses action sequences into compact latent embeddings for efficient control.
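The staged recipe described above can be written down as plain data to make explicit what each stage conditions on and what it predicts. This is only an illustrative sketch: the `Stage` class, the field names, and the `added_targets` helper are hypothetical, and the inputs and targets of the finetuning stage are an assumption inferred from the summary rather than a detail confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage: where the data comes from and what is predicted."""
    name: str
    data: str
    inputs: tuple
    targets: tuple


# Stage names and the first two rows follow the abstract; the third row is an assumption.
PIPELINE = (
    Stage(
        name="Ego-Centric Video Generative Pretraining",
        data="ego-centric human videos",
        inputs=("image", "language instruction"),
        targets=("future frames",),
    ),
    Stage(
        name="Human-Centric Trajectory-Aware Modeling",
        data="human videos with keypoint trajectories",
        inputs=("image", "language instruction"),
        targets=("future frames", "future keypoint trajectories"),
    ),
    Stage(
        name="Downstream finetuning",
        data="robot manipulation datasets",
        inputs=("image", "language instruction"),  # assumed, not stated in the summary
        targets=("action latent embedding",),      # assumed: the ActionVAE latent
    ),
)


def added_targets(pipeline):
    """Map each stage name to the targets it predicts that the previous stage did not."""
    previous = set()
    added = {}
    for stage in pipeline:
        added[stage.name] = tuple(t for t in stage.targets if t not in previous)
        previous = set(stage.targets)
    return added
```

Calling `added_targets(PIPELINE)` shows how each stage moves the prediction target one step closer to control: frames first, then keypoint trajectories, then (under the assumption above) action embeddings.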

Key results

  • 90.6% average success rate on manipulation tasks, surpassing GR00T N1.5 and Pi0
  • Novel two-stage ego-centric video pretraining bridging visual dynamics and robot actions
  • ActionVAE framework for compressing action chunks into smooth, coherent latent embeddings
  • Strong generalization across single-target, multi-target, and distractor-heavy scenarios
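To make the ActionVAE item above concrete, here is a toy sketch of the general technique: a variational autoencoder that maps a chunk of actions to a small latent vector and back. It is not the authors' architecture. The `ToyActionVAE` class, the single linear layers, the chunk length, action dimension, latent size, and the `beta` weight are all hypothetical choices for illustration, and the weights are random rather than trained.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Small random weights and zero bias; a real model would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Compute y = W x + b for a plain-list vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ToyActionVAE:
    """Hypothetical stand-in: one linear layer each for encoder heads and decoder."""

    def __init__(self, chunk_len, action_dim, latent_dim, seed=0):
        self.rng = random.Random(seed)
        self.chunk_len, self.action_dim = chunk_len, action_dim
        flat = chunk_len * action_dim
        self.enc_mu = init_linear(flat, latent_dim, self.rng)
        self.enc_logvar = init_linear(flat, latent_dim, self.rng)
        self.dec = init_linear(latent_dim, flat, self.rng)

    def encode(self, chunk):
        """Flatten a (chunk_len x action_dim) chunk and return (mu, logvar)."""
        flat = [v for action in chunk for v in action]
        return linear(flat, *self.enc_mu), linear(flat, *self.enc_logvar)

    def sample(self, mu, logvar):
        """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        """Map a latent vector back to a (chunk_len x action_dim) chunk."""
        flat = linear(z, *self.dec)
        d = self.action_dim
        return [flat[i * d:(i + 1) * d] for i in range(self.chunk_len)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def vae_loss(model, chunk, beta=1e-3):
    """Mean squared reconstruction error plus a beta-weighted KL term (beta is assumed)."""
    mu, logvar = model.encode(chunk)
    recon = model.decode(model.sample(mu, logvar))
    n = model.chunk_len * model.action_dim
    mse = sum((a - b) ** 2 for ra, rb in zip(chunk, recon) for a, b in zip(ra, rb)) / n
    return mse + beta * kl_to_standard_normal(mu, logvar)
```

With, say, `ToyActionVAE(chunk_len=8, action_dim=7, latent_dim=4)`, a 56-number chunk is summarized by a 4-number latent, so a policy would only need to predict the latent and let the decoder expand it into the full action sequence.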

Why it matters

Offers a scalable, data-efficient path to capable robot policies: pretraining on human videos reduces reliance on expensive large-scale teleoperation data, although the model is still finetuned on downstream robot datasets.

Abstract

This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model to predict future frames based on an image and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.

Index terms

Deep Learning Methods