GAIA: Generating Task Instruction Aware Simulation Grounded in Real Contexts Using Vision-Language Models
Dogyu Ko, Chanyoung Yeo, Daeho Kim, Jaeho Kim, Hyoseok Hwang
AI summary
Problem
Manual creation of diverse, task-specific virtual scenes for robot learning is labor-intensive, while existing automatic generation methods either lack real-world fidelity or fail to automatically configure task-relevant objects.
Approach
GAIA uses a pre-trained Vision-Language Model to jointly interpret a real-world RGB image and a natural language task instruction, then automatically retrieves, scales, orients, and places necessary 3D assets to build an interactive simulation without additional training.
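As a rough illustration of the retrieve, scale, orient, and place steps, the sketch below parses a VLM reply into per-object placement records that a simulator could consume. The JSON schema, the `Placement` class, and `parse_vlm_response` are hypothetical names chosen for this example; the source does not specify GAIA's actual output format or interfaces.

```python
import json
from dataclasses import dataclass


@dataclass
class Placement:
    """One object to spawn in the simulator (hypothetical schema)."""
    asset_query: str   # text used to retrieve a 3D asset from a library
    scale: float       # uniform scale factor applied to the asset
    yaw_deg: float     # rotation about the vertical axis, in degrees
    position: tuple    # (x, y, z) in metres, scene frame


def parse_vlm_response(response_text: str) -> list:
    """Turn a JSON reply from the VLM into validated Placement records."""
    placements = []
    for item in json.loads(response_text)["objects"]:
        if item["scale"] <= 0:
            raise ValueError(f"non-positive scale for {item['name']!r}")
        placements.append(
            Placement(
                asset_query=item["name"],
                scale=float(item["scale"]),
                yaw_deg=float(item["yaw_deg"]) % 360.0,  # normalise to [0, 360)
                position=tuple(float(v) for v in item["position"]),
            )
        )
    return placements


# Made-up reply for an instruction such as "put the apple in the bowl".
example_reply = json.dumps({
    "objects": [
        {"name": "apple", "scale": 0.08, "yaw_deg": 0, "position": [0.40, 0.10, 0.75]},
        {"name": "bowl", "scale": 0.20, "yaw_deg": 450, "position": [0.55, -0.05, 0.75]},
    ]
})
scene = parse_vlm_response(example_reply)
```

Keeping the VLM output in a small, validated record like this separates the reasoning step from simulator-specific spawning code, which is one plausible design, not necessarily the one used in GAIA.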
Key results
- Automated generation of task-ready scenes from RGB images and text
- Zero-shot VLM reasoning for contextual object placement
- Successful sim-to-real transfer of learned robot policies
- Configurable placement augmentation and distractor generation
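The last item can be pictured with a minimal sketch, assuming that placement augmentation means bounded random jitter of object poses and that distractors are extra objects sampled inside a workspace region. The function `augment_scene`, its parameters, and the default bounds are illustrative assumptions and are not taken from GAIA; the sketch also omits collision checking.

```python
import random


def augment_scene(task_objects, distractor_names=(), num_distractors=0,
                  max_offset=0.05, max_yaw_deg=15.0,
                  workspace=((0.2, 0.8), (-0.3, 0.3)), seed=0):
    """Return a jittered copy of the scene plus randomly placed distractors.

    task_objects: list of (name, (x, y, z), yaw_deg) tuples.
    workspace:    ((x_min, x_max), (y_min, y_max)) region for distractors.
    """
    if num_distractors > 0 and not distractor_names:
        raise ValueError("distractor_names is empty")
    rng = random.Random(seed)  # seeded so each variant is reproducible
    scene = []
    for name, (x, y, z), yaw in task_objects:
        scene.append((
            name,
            (x + rng.uniform(-max_offset, max_offset),
             y + rng.uniform(-max_offset, max_offset),
             z),  # height is kept so objects stay on their support surface
            (yaw + rng.uniform(-max_yaw_deg, max_yaw_deg)) % 360.0,
        ))
    # Distractors share the height of the first task object (a simplification).
    surface_z = task_objects[0][1][2] if task_objects else 0.0
    (x_min, x_max), (y_min, y_max) = workspace
    for _ in range(num_distractors):
        scene.append((
            rng.choice(list(distractor_names)),
            (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max), surface_z),
            rng.uniform(0.0, 360.0),
        ))
    return scene
```

Calling `augment_scene(base, ("mug", "book"), 2, seed=k)` for several values of `k` would yield a family of scene variants from one base layout, with the jitter range and distractor count as the configurable knobs.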
Why it matters
It removes manual scene setup from robot learning, which can accelerate the development of embodied AI agents whose policies transfer from simulation to the real world.
Abstract
Enabling robots to interact effectively with the real world requires extensive learning from physical interaction data, making simulation crucial for generating such data safely and cost-effectively. Despite the advantages of simulation, manual environment creation remains a laborious process, motivating the development of automated generation approaches. However, current automatic virtual scene generation approaches fall short in bridging the sim-to-real gap and in achieving task readiness, which calls for automatically generated virtual scenes that are both realistic and task-ready. In this paper, we propose GAIA, a novel methodology to automatically generate interactive, task-ready simulation environments grounded in real contexts from only a single RGB image and a task instruction. GAIA utilizes a pre-trained Vision-Language Model (VLM) without requiring explicit training, and jointly understands the visual context and the user’s instruction. Based on this understanding, it infers and places the necessary task-aware objects, including unseen ones, to construct an interactive virtual environment that maintains real-scene fidelity while reflecting task requirements without additional manual setup. Qualitative experiments show that GAIA generates spaces consistent with user instructions, and quantitative results show that policies learned within these GAIA-generated environments successfully transfer to target environments. Source code and supplementary materials are available at our project page: https://sites.google.com/view/gaia-project-page.