ICRA 2026

Scalable Vision-Language-Action Model Pretraining for Robotic Dexterous Manipulation with Real-Life Human Activity Videos

Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, ZhiYuan Feng, Huizhi Liang, Sicheng Xu, Yizhong Zhang, Xi Chen, Hao Chen, Lily Sun, Dong Chen, Jiaolong Yang, Baining Guo

AI summary

Unstructured real-life human videos can be automatically converted into large-scale VLA training data, enabling strong zero-shot robotic manipulation and effective generalization after minimal fine-tuning.

Vision-Language-Action, Robotic Manipulation, Dexterous Hands, Self-Supervised Learning, Embodied AI, Video-to-Robot Data

Problem

Robotic manipulation datasets are expensive to collect, limited in scale, and lack the diversity of real-world objects and environments needed for generalizable AI. The paper asks whether unstructured, unannotated human videos can be transformed into aligned VLA data for scalable robot pretraining.

Approach

The authors treat human hands as robot end-effectors and build a fully automated pipeline that extracts 3D hand and camera poses, segments atomic actions at wrist-speed minima, and generates language instructions with a VLM. The resulting aligned dataset is used to pretrain a dexterous hand VLA model.
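The segmentation step can be pictured with a minimal sketch: cut a wrist trajectory wherever the smoothed wrist speed reaches a local minimum. This is an illustration only, not the authors' implementation. The function names, the moving-average smoothing, and the `fps`, `window`, and `min_len` parameters are assumptions introduced here.

```python
import math


def wrist_speeds(wrist_xyz, fps):
    """Speed between consecutive wrist positions (position units per second)."""
    return [math.dist(a, b) * fps for a, b in zip(wrist_xyz, wrist_xyz[1:])]


def moving_average(values, window):
    """Centered moving average; the window is truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def segment_by_speed_minima(wrist_xyz, fps=30.0, window=5, min_len=10):
    """Split a trajectory into (start, end) frame ranges, end exclusive.

    Cuts are placed at local minima of the smoothed wrist speed.
    window and min_len are hypothetical tuning knobs, not values from the paper.
    """
    speed = moving_average(wrist_speeds(wrist_xyz, fps), window)
    # speed[i] describes the motion between frame i and frame i + 1
    cuts = [
        i for i in range(1, len(speed) - 1)
        if speed[i] < speed[i - 1] and speed[i] <= speed[i + 1]
    ]
    bounds = [0]
    for c in cuts:
        if c - bounds[-1] >= min_len:  # skip cuts that would create tiny segments
            bounds.append(c)
    total = len(wrist_xyz)
    if len(bounds) > 1 and total - bounds[-1] < min_len:
        bounds.pop()  # merge a too-short tail into the previous segment
    bounds.append(total)
    return list(zip(bounds[:-1], bounds[1:]))
```

A real pipeline would first have to express the wrist positions in a common world frame, which is where the estimated camera poses come in; this sketch assumes that has already been done.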

Key results

Fully automated pipeline converting unstructured egocentric videos into 1M aligned VLA episodes
Novel dexterous hand VLA model architecture with diffusion-based action prediction
Strong zero-shot action prediction on completely unseen objects and environments
Significant real-world task success improvements after fine-tuning on minimal robot data

Why it matters

Provides a scalable, low-cost path to training general-purpose robotic manipulation models from abundant real-life human videos, reducing reliance on expensive teleoperation data.

Abstract

This paper presents an approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that “in-the-wild” egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. This approach can generate atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We also design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We believe this work lays a solid foundation for scaling up VLA pretraining towards generalizable embodied intelligence. The project website, which includes additional visualizations, models, datasets, and code, is available at: https://microsoft.github.io/VITRA/.

1 Tsinghua University. 2 Microsoft Research Asia. ∗ Equal contribution. † Intern work done at Microsoft Research Asia. ‡ Corresponding author: jiaoyan@microsoft.com
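As a rough illustration of the episode format the abstract describes (one atomic action segment, its language description, and framewise hand and camera motion), the sketch below defines a hypothetical record type. The class name, field names, and example values are assumptions made here for illustration; the released dataset may be organized quite differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class HandVLAEpisode:
    """Hypothetical container for one atomic-action episode."""
    instruction: str                       # language description of the segment
    frame_ids: List[int]                   # indices of the video frames in the segment
    hand_motion: List[Sequence[float]]     # one 3D hand motion vector per frame
    camera_motion: List[Sequence[float]]   # one camera motion vector per frame

    def __post_init__(self):
        # "Framewise" means every frame carries both hand and camera motion.
        n = len(self.frame_ids)
        if len(self.hand_motion) != n or len(self.camera_motion) != n:
            raise ValueError("hand_motion and camera_motion must have one entry per frame")

    @property
    def num_frames(self) -> int:
        return len(self.frame_ids)


def mean_frames_per_episode(total_frames: int, total_episodes: int) -> float:
    """Average episode length implied by dataset-level totals."""
    return total_frames / total_episodes


# Made-up example values, only to show the shape of one record.
episode = HandVLAEpisode(
    instruction="pick up the cup",
    frame_ids=[0, 1, 2],
    hand_motion=[[0.0, 0.0, 0.0]] * 3,
    camera_motion=[[0.0] * 6] * 3,
)
```

With the rounded totals quoted in the abstract, `mean_frames_per_episode(26_000_000, 1_000_000)` gives about 26 frames per episode on average, which is consistent with short, atomic-level segments.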

Index terms

Imitation Learning, Big Data in Robotics and Automation, Deep Learning in Grasping and Manipulation