Subsecond 3D Mesh Generation for Robot Manipulation
Qian Wang, Omar Abdellall, Tony Gao, Xiatao Sun, Daniel Rakita
AI summary
Problem
High-fidelity 3D mesh generation is currently too slow for real-time robotics, and existing methods lack efficient pipelines that simultaneously segment objects from the scene and register them with correct scale and pose.
Approach
The system processes a single RGB-D image through three optimized stages: open-vocabulary segmentation to isolate objects, a distilled diffusion model with hierarchical decoding to rapidly generate 3D geometry, and RANSAC/ICP registration to align the mesh with the sensor's point cloud.
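The registration stage can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes point-to-point ICP with a closed-form similarity fit (Umeyama's method), brute-force nearest neighbours, and a starting pose that is already roughly correct, which in the pipeline above would come from a RANSAC-style global alignment. The function names `estimate_similarity` and `icp` are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t.

    Closed-form solution (Umeyama, 1991) for paired (N, 3) point arrays.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                         # avoid returning a reflection
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)    # total variance of the source points
    s = (S * np.diag(D)).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def icp(src, dst, iters=30):
    """Align mesh samples `src` (N, 3) to observed points `dst` (M, 3).

    Assumption: the initial misalignment is small. Brute-force matching
    builds an (N, M) distance matrix, so this only suits small clouds.
    """
    s, R, t = 1.0, np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = s * src @ R.T + t
        # Squared distance from every moved source point to every target point.
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
        matched = dst[d2.argmin(axis=1)]       # nearest target for each source point
        # Refit the full transform from the original source points.
        s, R, t = estimate_similarity(src, matched)
    return s, R, t
```

Here `src` would hold points sampled from the generated mesh and `dst` the segmented depth points, so the returned `s`, `R`, and `t` place the mesh in the sensor frame. A depth image sees only part of an object, so a practical version would likely add outlier rejection and a spatial index such as a k-d tree; both are omitted here for brevity.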
Key results
- End-to-end pipeline runtime of ~0.82 seconds per object
- Geometric fidelity matching full diffusion models while using only three denoising steps
- Successful real-world robotic grasping and placement using generated meshes
- Quantitative ablation confirming optimal speed-accuracy trade-offs across pipeline stages
Why it matters
Provides robotics researchers and engineers with a practical, on-demand 3D representation tool that bridges the gap between high-quality generative modeling and real-time physical interaction.
Abstract
3D meshes are a fundamental representation widely used in computer science and engineering. In robotics, they are particularly valuable because they capture objects in a form that aligns directly with how robots interact with the physical world, enabling core capabilities such as predicting stable grasps, detecting collisions, and simulating dynamics. Although automatic 3D mesh generation methods have shown promising progress in recent years, potentially offering a path toward real-time robot perception, two critical challenges remain. First, generating high-fidelity meshes is prohibitively slow for real-time use, often requiring tens of seconds per object. Second, mesh generation by itself is insufficient. In robotics, a mesh must be contextually grounded, i.e., correctly segmented from the scene and registered with the proper scale and pose. Additionally, unless these contextual grounding steps remain efficient, they simply introduce new bottlenecks. In this work, we introduce an end-to-end system that addresses these challenges, producing a high-quality, contextually grounded 3D mesh from a single RGB-D image in under one second. Our contribution is a system-level design that integrates open-vocabulary object segmentation, accelerated diffusion-based mesh generation, and robust point cloud registration, each optimized for both speed and accuracy. We demonstrate its effectiveness in a real-world manipulation task, showing that it enables meshes to be used as a practical, on-demand representation for robotics perception and planning. Open-source code and videos are located at the paper website.
* Equal contribution.
This work was supported by Office of Naval Research award N00014-24-1-2124.