SAGrid: Scaling Robot Simulation through Automatic Affordance Annotation on In-The-Wild 3D Assets
Cem Gokmen, Yalcin Tur, Aditesh Kumar, Auddithio Nag, Li Fei-Fei
AI summary
Problem
Scaling robot simulation is bottlenecked by a shortage of simulation-ready 3D assets. In-the-wild models lack the specialized annotations that simulators need for complex phenomena such as fluids and heat, so assets must be annotated manually, which is costly and limits asset diversity.
Approach
SAGrid uses pretrained 2D visual and 3D geometric features to predict a dense distance field to the nearest simulation affordance on a voxelized mesh, requiring as few as 10 labeled examples per feature type.
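To make the "dense distance field" idea concrete, here is a minimal sketch of how such a field could be built as a regression target on a voxel grid and how a location could be read back from it. This is an illustration under stated assumptions, not the authors' implementation. The function names, grid size, voxel size, and the argmin decoding step are all hypothetical, and the learned predictor that consumes the pretrained features is omitted.

```python
import numpy as np

def distance_field(grid_shape, voxel_size, affordance_points):
    """Distance in metres from each voxel centre to the nearest affordance point.

    Hypothetical helper: the grid is assumed axis-aligned with its corner at the origin.
    """
    # Voxel centre coordinates, shape (X, Y, Z, 3).
    axes = [(np.arange(n) + 0.5) * voxel_size for n in grid_shape]
    centres = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    points = np.asarray(affordance_points, dtype=float)  # shape (K, 3)
    # Distance from every voxel centre to every point, then the minimum over points.
    diffs = centres[..., None, :] - points
    return np.linalg.norm(diffs, axis=-1).min(axis=-1)

def locate_affordance(field, voxel_size):
    """Assumed decoding step: return the centre of the voxel with the smallest distance."""
    index = np.unravel_index(np.argmin(field), field.shape)
    return (np.array(index) + 0.5) * voxel_size

# Toy example: one annotated point inside a 16 x 16 x 16 grid of 5 cm voxels.
true_point = np.array([0.12, 0.41, 0.71])
field = distance_field((16, 16, 16), 0.05, [true_point])
estimate = locate_affordance(field, 0.05)
```

A dense target of this kind gives every voxel a training signal, not only the few voxels that contain an annotation, which is one plausible reason such a formulation suits a low-data regime.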
Key results
- Achieves 4.8 cm mean localization error, outperforming VLM and embedding-based baselines
- Successfully annotates and integrates in-the-wild Objaverse-XL assets into the BEHAVIOR-1K simulator
- Expanding training assets with automated annotations significantly improves robot policy generalization to unseen objects
- Operates effectively in a low-data regime with only 10 training objects per affordance type
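The summary does not define the localization error metric. As a hedged illustration, one common reading is the mean Euclidean distance between predicted and ground-truth affordance positions. The sketch below assumes exactly one prediction per ground-truth point, with coordinates in metres. The function name and the toy coordinates are hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_localization_error(predicted, ground_truth):
    """Mean Euclidean distance between paired predicted and true positions (assumed metric)."""
    pred = np.asarray(predicted, dtype=float)   # shape (N, 3)
    truth = np.asarray(ground_truth, dtype=float)  # shape (N, 3)
    return float(np.linalg.norm(pred - truth, axis=1).mean())

# Toy values only: the two predictions are off by 5 cm and 12 cm respectively.
predicted = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
ground_truth = [[0.03, 0.04, 0.0], [1.0, 1.0, 1.12]]
error_m = mean_localization_error(predicted, ground_truth)
```

Under this definition the toy pair averages to 8.5 cm. An object with several affordances of the same type would need a matching step before averaging, which is left out here.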
Why it matters
It removes the manual annotation bottleneck, allowing researchers and developers to scale up diverse, simulation-ready environments for training robust, real-world robot policies.
Abstract
Robot simulation is a highly efficient approach for scaling data collection for robot learning, but scaling for most household tasks remains bottlenecked by a shortage of simulation-ready 3D assets. While modern robot simulators can model complex phenomena like temperature and fluids, most in-the-wild 3D models lack “simulation affordances” (specialized annotations such as fluid source and heat emitter positions) that are required for these features. As a result, costly manual annotation is required, severely limiting asset scale and variety. We introduce Simulation Affordance Grids (SAGrid), a method that automates the annotation of simulation affordances on in-the-wild 3D meshes. SAGrid leverages pretrained representations (DINOv2, TRELLIS) to predict a dense 3D distance field to the nearest affordance. Our approach operates effectively in a low-data regime, requiring as few as 10 training objects per affordance type to accurately locate these features. We validate our method by processing Objaverse-XL models and integrating them into the BEHAVIOR-1K simulator. Training robot policies on this automatically expanded asset suite significantly improves generalization to unseen objects in complex tasks, demonstrating that automated affordance annotation is crucial for scaling robot learning.