One Prompt, Many Rooms: A Force-Directed Approach to 3D Scene Generation for Robotics Simulation
Christopher May, Peter Hetzner, Matthias Kalenberg, Ashwin Sajith Nambiar, Jörg Franke, Sebastian Reitelshöfer
AI summary
Problem
Current text-to-scene methods are inefficient for robotics because generating diverse layouts requires repeated, expensive sampling that breaks semantic consistency, hindering the creation of large-scale, physically consistent training datasets for generalizable robotic agents.
Approach
An LLM generates a single declarative scene plan, which a force-directed 2D physics simulation translates into multiple distinct, physically plausible layouts that share identical semantics but vary in geometry.
Key results
- A one-plan-to-many-layouts pipeline generating diverse 3D scenes from a single prompt
- A force-directed placement technique translating semantic rules into forces and torques via 2D physics simulation
- A scalable validation loop combining deterministic physical checks with VLM-based semantic assessment
- A navigation agent trained on generated scenes achieving a 0.84 success rate on Replica environments
Why it matters
It provides a scalable, efficient method for generating controlled, diverse simulation data, addressing a critical bottleneck for training generalizable embodied AI and robotic agents.
Abstract
Training generalizable robotic agents requires large datasets of diverse, physically consistent 3D scenes, yet their generation remains a critical bottleneck. Current text-to-scene methods are inefficient for this task; generating diverse layouts requires repeated, expensive sampling that fails to maintain the semantic consistency required for robust policy learning. In this paper, we address this with our “one-plan-to-many-layouts” method. A Large Language Model (LLM) generates a single declarative plan, which a force-directed physics simulation then realizes into multiple layouts that share semantics but differ in geometry. We validate our method by transfer to photorealistic 3D reconstructions of real environments (Replica) within simulation, where a navigation agent trained on our scenes attains a Success Rate of 0.84. These results establish our pipeline as a scalable method for producing the controlled, diverse data required for embodied AI training.
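To make the one-plan-to-many-layouts idea concrete, the following is a minimal sketch of force-directed placement in 2D. It is not the authors' implementation: the plan format, the circular footprints, the room size, and every constant (spring stiffness, damping, step size) are assumptions chosen for illustration. It also models forces only and omits the torques and orientation handling that the paper describes. A single fixed plan is relaxed from several random starting configurations, so every layout obeys the same rules while the geometry differs.

```python
import math
import random

# Hypothetical plan format: footprint radii in metres plus "near" rules.
PLAN = {
    "radius": {"bed": 1.0, "nightstand": 0.3, "desk": 0.7, "chair": 0.35},
    "near": [("nightstand", "bed"), ("chair", "desk")],
}
ROOM = (5.0, 4.0)  # assumed room width and depth in metres


def apply_pair_force(force, pos, a, b, magnitude):
    """Push b away from a with `magnitude` (negative pulls them together)."""
    dx = pos[b][0] - pos[a][0]
    dy = pos[b][1] - pos[a][1]
    dist = math.hypot(dx, dy) or 1e-6  # avoid division by zero
    fx, fy = magnitude * dx / dist, magnitude * dy / dist
    force[a][0] -= fx
    force[a][1] -= fy
    force[b][0] += fx
    force[b][1] += fy


def gap(pos, radius, a, b):
    """Signed distance between two circular footprints (negative = overlap)."""
    dist = math.hypot(pos[b][0] - pos[a][0], pos[b][1] - pos[a][1])
    return dist - radius[a] - radius[b]


def simulate(plan, room, seed, steps=2000, dt=0.02):
    """Relax one layout from a seeded random start; returns {name: (x, y)}."""
    radius = plan["radius"]
    names = list(radius)
    rng = random.Random(seed)
    pos = {n: [rng.uniform(radius[n], room[k] - radius[n]) for k in (0, 1)]
           for n in names}
    vel = {n: [0.0, 0.0] for n in names}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in names}
        # Physical rule: overlapping footprints repel each other.
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                g = gap(pos, radius, a, b)
                if g < 0:
                    apply_pair_force(force, pos, a, b, 50.0 * -g)
        # Semantic rule: "near" pairs attract until their footprints touch.
        for a, b in plan["near"]:
            g = gap(pos, radius, a, b)
            if g > 0:
                apply_pair_force(force, pos, a, b, -5.0 * g)
        # Damped integration, then clamp each footprint inside the room.
        for n in names:
            for k in (0, 1):
                vel[n][k] = 0.9 * (vel[n][k] + dt * force[n][k])
                lo, hi = radius[n], room[k] - radius[n]
                pos[n][k] = min(max(pos[n][k] + dt * vel[n][k], lo), hi)
    return {n: tuple(pos[n]) for n in names}


# One plan, several seeds: same semantics, different geometry.
layouts = [simulate(PLAN, ROOM, seed) for seed in range(3)]
```

Because the plan is fixed and only the seed changes, producing another layout costs one more relaxation run rather than another LLM call. A fuller version would add oriented footprints, torque terms, and a validation pass before a layout is accepted.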