ICRA 2026

GAIA: Generating Task Instruction Aware Simulation Grounded in Real Contexts Using Vision-Language Models

Dogyu Ko, Chanyoung Yeo, Daeho Kim, Jaeho Kim, Hyoseok Hwang

AI summary

GAIA automatically generates realistic, task-ready simulation environments from a single RGB image and text instruction using a pre-trained VLM, enabling effective sim-to-real policy transfer.

Simulation, Vision-Language Models, Robot Learning, Sim-to-Real Transfer, Task-Aware Generation, Embodied AI

Problem

Manual creation of diverse, task-specific virtual scenes for robot learning is labor-intensive, while existing automatic generation methods either lack real-world fidelity or fail to automatically configure task-relevant objects.

Approach

GAIA uses a pre-trained Vision-Language Model to jointly interpret a real-world RGB image and a natural language task instruction, then automatically retrieves, scales, orients, and places necessary 3D assets to build an interactive simulation without additional training.
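
The retrieve, scale, orient, and place steps can be illustrated with a minimal Python sketch. This is not GAIA's implementation: the `ObjectSpec`, `Asset`, and `PlacedAsset` types, the name-matching retrieval, and the largest-extent scaling rule are all assumptions made for illustration, and the VLM call itself is left out (the sketch starts from object specifications that a VLM is assumed to have produced already).

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ObjectSpec:
    """Hypothetical per-object output of the VLM step (not GAIA's actual schema)."""
    name: str
    size_m: Vec3        # estimated real-world extents in metres
    position_m: Vec3    # estimated position in the scene frame
    yaw_deg: float      # estimated heading about the vertical axis


@dataclass
class Asset:
    """Hypothetical 3D asset library entry."""
    name: str
    extents: Vec3       # native bounding-box extents of the mesh


@dataclass
class PlacedAsset:
    asset: Asset
    scale: float
    position_m: Vec3
    yaw_rad: float


def retrieve(spec: ObjectSpec, library: List[Asset]) -> Optional[Asset]:
    """Naive retrieval (assumption): exact name match, else substring match."""
    wanted = spec.name.lower()
    for asset in library:
        if asset.name.lower() == wanted:
            return asset
    for asset in library:
        if asset.name.lower() in wanted:
            return asset
    return None


def uniform_scale(asset: Asset, spec: ObjectSpec) -> float:
    """Scale so the asset's largest extent matches the spec's largest extent."""
    return max(spec.size_m) / max(asset.extents)


def build_scene(specs: List[ObjectSpec], library: List[Asset]) -> List[PlacedAsset]:
    """Retrieve, scale, orient, and place each requested object; skip misses."""
    placed = []
    for spec in specs:
        asset = retrieve(spec, library)
        if asset is None:
            continue  # no matching asset in this toy library
        placed.append(
            PlacedAsset(
                asset=asset,
                scale=uniform_scale(asset, spec),
                position_m=spec.position_m,
                yaw_rad=math.radians(spec.yaw_deg),
            )
        )
    return placed
```

In this sketch, a list of `PlacedAsset` records is what would be handed to a simulator loader; which simulator GAIA targets and how it resolves collisions between placed objects are not stated in this summary.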

Key results

Automated generation of task-ready scenes from RGB images and text
Zero-shot VLM reasoning for contextual object placement
Successful sim-to-real transfer of learned robot policies
Configurable placement augmentation and distractor generation
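
As a rough illustration of what configurable placement augmentation and distractor generation could look like, the Python sketch below jitters each task object's pose and scatters a chosen number of distractors over a tabletop region. The function name `augment_placements`, its parameters, the default bounds, and the uniform sampling are assumptions for illustration only; the summary does not describe GAIA's actual augmentation procedure, and this sketch performs no collision checking.

```python
import random
from typing import Dict, List, Optional, Tuple


def augment_placements(
    placements: Dict[str, Tuple[float, float, float]],  # name -> (x_m, y_m, yaw_deg)
    distractor_pool: List[str],
    *,
    xy_jitter_m: float = 0.05,
    yaw_jitter_deg: float = 15.0,
    n_distractors: int = 2,
    table_bounds: Tuple[Tuple[float, float], Tuple[float, float]] = ((-0.4, 0.4), (-0.3, 0.3)),
    seed: Optional[int] = None,
) -> List[dict]:
    """Return one randomized scene variant (hypothetical helper, not GAIA's API)."""
    rng = random.Random(seed)
    (x_lo, x_hi), (y_lo, y_hi) = table_bounds

    def clamp(value: float, lo: float, hi: float) -> float:
        return max(lo, min(hi, value))

    scene = []
    # Perturb each task-relevant object around its nominal pose, staying on the table.
    for name, (x, y, yaw) in placements.items():
        scene.append({
            "name": name,
            "x": clamp(x + rng.uniform(-xy_jitter_m, xy_jitter_m), x_lo, x_hi),
            "y": clamp(y + rng.uniform(-xy_jitter_m, xy_jitter_m), y_lo, y_hi),
            "yaw_deg": (yaw + rng.uniform(-yaw_jitter_deg, yaw_jitter_deg)) % 360.0,
            "distractor": False,
        })

    # Sample distinct distractors and drop them at uniform random poses.
    count = min(n_distractors, len(distractor_pool))
    for name in rng.sample(distractor_pool, count):
        scene.append({
            "name": name,
            "x": rng.uniform(x_lo, x_hi),
            "y": rng.uniform(y_lo, y_hi),
            "yaw_deg": rng.uniform(0.0, 360.0),
            "distractor": True,
        })
    return scene
```

Passing a fixed `seed` makes a variant reproducible, which helps when comparing policies trained on the same set of randomized scenes.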

Why it matters

GAIA removes the manual scene setup step from robot learning, which could speed up the development of embodied AI agents whose policies transfer from simulation to the real world.

Abstract

Enabling robots to interact effectively with the real world requires extensive learning from physical interaction data, making simulation crucial for generating such data safely and cost-effectively. Despite the advantages of simulation, manual environment creation remains a laborious process, motivating the development of automated generation approaches. However, the limitations of current automatic virtual scene generation approaches in bridging the sim-to-real gap and achieving task readiness necessitate the creation of automatically generated, realistic, and task-ready virtual scenes. In this paper, we propose GAIA, a novel methodology to automatically generate interactive, task-ready simulation environments grounded in real contexts from only a single RGB image and a task instruction. GAIA utilizes a pre-trained Vision-Language Model (VLM) without requiring explicit training, and jointly understands the visual context and the user’s instruction. Based on this understanding, it infers and places necessary task-aware objects, including unseen ones, to construct an interactive virtual environment that maintains real-scene fidelity while reflecting task requirements without additional manual setup. Qualitative experiments show that GAIA generates spaces consistent with user instructions, and quantitative results show that policies learned within these GAIA-generated environments successfully transfer to target environments. Source code and supplementary materials are available at our project page https://sites.google.com/view/gaia-project-page.

Index terms

Simulation and Animation, Task and Motion Planning, Deep Learning for Visual Perception