ICRA 2026

Developing Vision-Language-Action Model from Egocentric Videos

Tomoya Yoshida, Shuhei Kurita, Taichi Nishimura, Shinsuke Mori

AI summary

Pre-training Vision-Language-Action models on trajectories automatically extracted from raw egocentric videos improves task success rates by over 20% compared to training from scratch and is competitive with pre-training on real-robot teleoperation data.

Vision-Language-Action models; Egocentric videos; Robot learning; VLA pre-training; Object manipulation; Scalable datasets

Problem

Training Vision-Language-Action models (VLAs) typically relies either on costly human teleoperation or on egocentric videos that require auxiliary annotations such as detailed hand-pose recordings. Both requirements limit scalability. It remains unclear whether VLAs can be trained effectively from raw, unannotated egocentric videos.

Approach

The authors apply the EgoScaler framework to extract 6DoF object manipulation trajectories from four large-scale egocentric video datasets without auxiliary recordings, automatically refine noisy or incomplete trajectories, and pre-train a Vision-Language-Action model based on the π0 architecture on the resulting dataset.
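To make the data format concrete, the following is a minimal sketch of what a 6DoF object trajectory and a simple quality filter might look like. It is an illustration only: this summary does not describe EgoScaler's data structures or the authors' refinement criteria, so `Pose6DoF`, `keep_trajectory`, the metre units, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from math import dist
from typing import List


@dataclass
class Pose6DoF:
    """One object pose: 3D position (assumed metres) plus roll/pitch/yaw (radians)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float


def max_step(traj: List[Pose6DoF]) -> float:
    """Largest translation between two consecutive poses in the trajectory."""
    return max(
        dist((a.x, a.y, a.z), (b.x, b.y, b.z))
        for a, b in zip(traj, traj[1:])
    )


def keep_trajectory(traj: List[Pose6DoF],
                    min_len: int = 4,
                    max_jump: float = 0.2) -> bool:
    """Hypothetical filter: drop trajectories that are too short (incomplete)
    or that contain an implausibly large frame-to-frame jump (noisy)."""
    if len(traj) < min_len:
        return False
    return max_step(traj) <= max_jump


# A smooth trajectory moving 5 cm per frame along the x axis.
smooth = [Pose6DoF(0.05 * i, 0.0, 0.0, 0.0, 0.0, 0.0) for i in range(6)]
# The same trajectory with one tracking glitch at frame 3.
jumpy = smooth[:3] + [Pose6DoF(2.0, 0.0, 0.0, 0.0, 0.0, 0.0)] + smooth[4:]

print(keep_trajectory(smooth))      # True
print(keep_trajectory(jumpy))       # False: one step is far above max_jump
print(keep_trajectory(smooth[:2]))  # False: fewer than min_len poses
```

The filter checks only translation. A fuller version would presumably also bound rotation changes between frames, but the criteria actually used are not stated here.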

Key results

Successfully pre-training a VLA on raw egocentric videos without auxiliary labels
Achieving over 20% higher task success rates compared to training from scratch
Matching or slightly outperforming leading real-robot teleoperation datasets
Gaining further performance improvements by combining the new dataset with existing real-robot data

Why it matters

The resulting dataset provides a scalable, low-cost alternative to expensive teleoperation data collection, enabling broader advancement of general-purpose robotic foundation models.

Abstract

Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such as detailed hand-pose recordings. Consequently, it remains unclear whether VLAs can be trained directly from raw egocentric videos. In this work, we address this challenge by leveraging EgoScaler, a framework that extracts 6DoF object manipulation trajectories from egocentric videos without requiring auxiliary recordings. We apply EgoScaler to four large-scale egocentric video datasets and automatically refine noisy or incomplete trajectories, thereby constructing a new large-scale dataset for VLA pre-training. Our experiments with a state-of-the-art π0 architecture in both simulated and real-robot environments yield three key findings: (i) pre-training on our dataset improves task success rates by over 20% compared to training from scratch, (ii) the performance is competitive with that achieved using real-robot datasets, and (iii) combining our dataset with real-robot data yields further improvements. These results demonstrate that egocentric videos constitute a promising and scalable resource for advancing VLA research.

Index terms

Data Sets for Robot Learning; Visual Learning; Learning from Demonstration