Galaxy Open-World Dataset and G0 Dual-System VLA Model
Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, Hang Zhao
AI summary
Problem
Existing robot datasets lack large-scale, high-quality data from authentic, unstructured environments, limiting the real-world generalization of Vision-Language-Action models.
Approach
The authors introduce the Galaxy Open-World Dataset of 500 hours of real-world mobile manipulation data and propose G0, a dual-system framework that pairs a slow-thinking VLM planner with a fast-executing VLA model trained via a three-stage curriculum.
Key results
- Galaxy Open-World Dataset provides 500 hours of subtask-annotated real-world mobile manipulation data, all captured on a single, consistent robot embodiment
- G0 dual-system architecture decouples high-level VLM planning from low-level VLA execution for efficient real-time control
- Single-embodiment pre-training significantly improves action stability and instruction following
- G0 achieves top progress scores on challenging real-world benchmarks such as table bussing and bed making
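The decoupling of slow planning from fast execution can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: the class name, the `plan_subtask` and `act` callables, and the fixed `replan_every` interval are all hypothetical stand-ins, and a real system may well run the planner asynchronously rather than on a fixed step schedule.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class DualSystemController:
    """Hypothetical sketch: a slow planner queried occasionally, a fast policy queried every step."""

    # Stand-in for the VLM planner: (task, observation) -> subtask instruction.
    plan_subtask: Callable[[str, Any], str]
    # Stand-in for the VLA policy: (subtask instruction, observation) -> action vector.
    act: Callable[[str, Any], List[float]]
    # Assumed replanning interval, in control steps (not a value from the paper).
    replan_every: int = 10

    def run(self, task: str, observations: Iterable[Any]) -> List[Tuple[int, str, List[float]]]:
        subtask = ""
        log = []
        for step, obs in enumerate(observations):
            # Slow path: refresh the subtask only every `replan_every` steps.
            if step % self.replan_every == 0:
                subtask = self.plan_subtask(task, obs)
            # Fast path: produce an action at every control step.
            action = self.act(subtask, obs)
            log.append((step, subtask, action))
        return log
```

With toy callables in place of the two models, the planner is invoked once per `replan_every` steps while the policy is invoked at every step, which is the efficiency argument behind the dual-system split.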
Why it matters
Provides a critical benchmark and training paradigm for developing generalizable robot policies capable of handling complex, unstructured domestic and commercial tasks.
Abstract
We present Galaxy Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark (spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation) demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxy Open-World Dataset, plays a critical role in achieving strong performance. Dataset, code and pretrained weights will be made publicly available.
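The three-stage curriculum can be read as a sequential pipeline in which each stage starts from the weights produced by the previous one. The sketch below only illustrates that ordering. The `Stage` type, the `run_curriculum` helper, the `train_stage` callback, and the data descriptions are hypothetical; in particular, the pairing of each stage with a data source is an assumption inferred from the abstract, not a confirmed detail.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Stage:
    name: str
    data: str  # Free-text description only; assumed, not taken from released code.


# Stage names follow the abstract; the data descriptions are assumptions.
CURRICULUM = (
    Stage("cross-embodiment pre-training", "data from multiple robot embodiments (assumed)"),
    Stage("single-embodiment pre-training", "Galaxy Open-World Dataset (assumed pairing)"),
    Stage("task-specific post-training", "demonstrations of the target task (assumed)"),
)


def run_curriculum(train_stage: Callable[[Any, Stage], Any], init_weights: Any) -> Any:
    """Apply a caller-supplied training step once per stage, in curriculum order."""
    weights = init_weights
    for stage in CURRICULUM:
        # Each stage is initialized from the output of the previous stage.
        weights = train_stage(weights, stage)
    return weights
```

Passing a stub such as `lambda w, s: w + [s.name]` with an empty list as `init_weights` returns the stage names in order, which makes the hand-off between stages easy to inspect before wiring in a real training routine.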