ICRA 2026

Subsecond 3D Mesh Generation for Robot Manipulation

Qian Wang, Omar Abdellall, Tony Gao, Xiatao Sun, Daniel Rakita

AI summary

A unified pipeline combining open-vocabulary segmentation, accelerated diffusion modeling, and point cloud registration generates high-fidelity, contextually grounded 3D meshes from a single RGB-D image in under one second, enabling real-time robotic manipulation.

3D mesh generation, real-time robotics, accelerated diffusion, open-vocabulary segmentation, point cloud registration, robotic manipulation

Problem

High-fidelity 3D mesh generation is currently too slow for real-time robotics, and existing methods lack efficient pipelines that simultaneously segment objects from the scene and register them with correct scale and pose.

Approach

The system processes a single RGB-D image through three optimized stages: open-vocabulary segmentation to isolate objects, a distilled diffusion model with hierarchical decoding to rapidly generate 3D geometry, and RANSAC/ICP registration to align the mesh with the sensor's point cloud.
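To make the registration stage more concrete, the sketch below shows the closed-form least-squares similarity alignment (the Umeyama method) that correspondence-based registration schemes such as RANSAC and ICP typically solve as an inner step. It is an illustration, not the paper's implementation: it assumes point correspondences are already known, whereas RANSAC/ICP must estimate them, and the function name `similarity_align` and all numeric values are hypothetical.

```python
import numpy as np


def similarity_align(src, dst):
    """Least-squares scale s, rotation R, translation t so that dst ~= s * R @ src + t.

    Assumes src[i] corresponds to dst[i] (a simplification; RANSAC/ICP
    would have to find these correspondences first).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)

    # Center both point sets.
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt

    # Scale from the ratio of aligned covariance to source variance.
    var_src = (src_c ** 2).sum() / n
    s = (S * d).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Hypothetical data: points sampled from a generated mesh in a normalized frame,
# and the same points as they would appear in the sensor's metric frame.
rng = np.random.default_rng(0)
mesh_pts = rng.uniform(-0.5, 0.5, size=(100, 3))
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
s_true = 0.12
t_true = np.array([0.4, -0.1, 0.6])
scene_pts = s_true * mesh_pts @ R_true.T + t_true

s_est, R_est, t_est = similarity_align(mesh_pts, scene_pts)
aligned = s_est * mesh_pts @ R_est.T + t_est
```

With noise-free correspondences the estimate recovers the scale, rotation, and translation up to floating-point error. With a real depth sensor the correspondences are noisy and partial, which is why an outlier-robust stage such as RANSAC and an iterative refinement such as ICP are used around a solver of this kind.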

Key results

End-to-end pipeline runtime of ~0.82 seconds per object
Geometric fidelity matching full diffusion models while using only three denoising steps
Successful real-world robotic grasping and placement using generated meshes
Quantitative ablation confirming optimal speed-accuracy trade-offs across pipeline stages

Why it matters

Provides robotics researchers and engineers with a practical, on-demand 3D representation tool that bridges the gap between high-quality generative modeling and real-time physical interaction.

Abstract

3D meshes are a fundamental representation widely used in computer science and engineering. In robotics, they are particularly valuable because they capture objects in a form that aligns directly with how robots interact with the physical world, enabling core capabilities such as predicting stable grasps, detecting collisions, and simulating dynamics. Although automatic 3D mesh generation methods have shown promising progress in recent years, potentially offering a path toward real-time robot perception, two critical challenges remain. First, generating high-fidelity meshes is prohibitively slow for real-time use, often requiring tens of seconds per object. Second, mesh generation by itself is insufficient. In robotics, a mesh must be contextually grounded, i.e., correctly segmented from the scene and registered with the proper scale and pose. Additionally, unless these contextual grounding steps remain efficient, they simply introduce new bottlenecks. In this work, we introduce an end-to-end system that addresses these challenges, producing a high-quality, contextually grounded 3D mesh from a single RGB-D image in under one second. Our contribution is a system-level design that integrates open-vocabulary object segmentation, accelerated diffusion-based mesh generation, and robust point cloud registration, each optimized for both speed and accuracy. We demonstrate its effectiveness in a real-world manipulation task, showing that it enables meshes to be used as a practical, on-demand representation for robotics perception and planning. Open-source code and videos are located at the paper website.

* Equal contribution. This work was supported by Office of Naval Research award N00014-24-1-2124.

Index terms

Perception for Grasping and Manipulation