ICRA 2026

Galaxy Open-World Dataset and G0 Dual-System VLA Model

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, Hang Zhao

AI summary

Single-embodiment pre-training on high-quality real-world data is essential for robust VLA performance, outperforming cross-embodiment approaches when embodiment gaps are large.

Robot datasets, Vision-Language-Action models, Dual-system robotics, Real-world manipulation, Pre-training strategies, Mobile manipulation

Problem

Existing robot datasets lack large-scale, high-quality data from authentic, unstructured environments, limiting the real-world generalization of Vision-Language-Action models.

Approach

The authors introduce the Galaxy Open-World Dataset of 500 hours of real-world mobile manipulation data and propose G0, a dual-system framework that pairs a slow-thinking VLM planner with a fast-executing VLA model trained via a three-stage curriculum.
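
As a rough illustration of the three-stage curriculum, the stages can be sketched as an ordered list in which each stage warm-starts from the weights of the previous one. This is only a sketch under stated assumptions: the stage names come from the abstract, but `Stage`, `CURRICULUM`, `run_curriculum`, the `train_stage` callback, and the data labels are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Stage:
    name: str  # stage name as given in the abstract
    data: str  # illustrative data label (assumption, not from the paper)


# Stage order follows the abstract; the data labels are assumptions.
CURRICULUM = [
    Stage("cross-embodiment pre-training", "multi-robot data (assumed label)"),
    Stage("single-embodiment pre-training", "Galaxy Open-World Dataset (assumed pairing)"),
    Stage("task-specific post-training", "target-task demonstrations (assumed label)"),
]


def run_curriculum(
    train_stage: Callable[[Stage, Optional[Any]], Any],
    weights: Optional[Any] = None,
) -> Any:
    """Run the stages in order, passing each stage's output weights to the next."""
    for stage in CURRICULUM:
        weights = train_stage(stage, weights)
    return weights
```

The caller supplies `train_stage`, so the sketch captures only the ordering and the warm-start hand-off between stages, not any training detail.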

Key results

Galaxy Open-World Dataset provides 500 hours of uniformly captured, subtask-annotated real-world mobile manipulation data
G0 dual-system architecture decouples high-level VLM planning from low-level VLA execution for efficient real-time control
Single-embodiment pre-training significantly improves action stability and instruction following
G0 achieves top progress scores on challenging real-world benchmarks such as table bussing and bed making
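
The decoupling of slow planning from fast execution noted above can be pictured with a minimal control loop. This is a simplified, synchronous sketch, not the authors' implementation: `run_dual_system`, `planner`, `executor`, and the `replan_every` interval are hypothetical names and values chosen for illustration, and a real system may well run the two models asynchronously.

```python
from typing import Any, Callable, Iterable, List


def run_dual_system(
    planner: Callable[[Any], str],
    executor: Callable[[Any, str], Any],
    observations: Iterable[Any],
    replan_every: int = 10,  # assumed replanning interval, in control steps
) -> List[Any]:
    """Call the slow planner every `replan_every` steps and the fast executor every step."""
    subtask = ""
    actions = []
    for step, obs in enumerate(observations):
        if step % replan_every == 0:
            # Slow path: the planner (a VLM in G0) picks the current subtask.
            subtask = planner(obs)
        # Fast path: the executor (a VLA model in G0) maps observation and subtask to an action.
        actions.append(executor(obs, subtask))
    return actions
```

With 25 observations and `replan_every=10`, the planner is called at steps 0, 10, and 20, while the executor is called on all 25 steps.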

Why it matters

Provides a critical benchmark and training paradigm for developing generalizable robot policies capable of handling complex, unstructured domestic and commercial tasks.

Abstract

We present Galaxy Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxy Open-World Dataset, plays a critical role in achieving strong performance. Dataset, code, and pretrained weights will be made publicly available.

Index terms

AI-Enabled Robotics, Data Sets for Robot Learning, Learning from Demonstration