ICRA 2026

UM3D: Towards a Unified Multimodal 3D Shape Generation Model

Xian-Feng Han, Zecheng Zhang, Xuran He, Ming-Jie Wang

AI summary

UM3D unifies text, image, and sketch inputs into a single generative model that produces high-fidelity, geometrically accurate 3D shapes, outperforming existing multimodal baselines.

3D shape generation, multimodal conditioning, sketch-enhanced synthesis, finite scalar quantization, CLIP-guided generation, flow-based models

Problem

Extending 2D vision-language generation to 3D is hindered by scarce text-3D paired data and the inherent ambiguity of text prompts, which often fail to capture complete geometric details.

Approach

The method compresses 3D shapes into a compact discrete space using a Finite Scalar Quantization Autoencoder, then aligns sketch features with CLIP’s multimodal embedding space to condition a Glow-based flow model for unified input synthesis.
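To make the quantization step concrete, here is a minimal sketch of finite scalar quantization in plain Python. It is not the authors' FSQ-AE: the function names, the restriction to odd level counts, and the example levels are assumptions chosen for illustration. Each latent coordinate is bounded with tanh and rounded to one of a few integer values, so the codebook is implicit (the product of the per-dimension level counts) and every code is reachable, which is the usual argument for good codebook utilization.

```python
import math

def fsq_quantize(z, levels):
    """Round each latent coordinate to one of levels[i] integer values.

    Assumption: every entry of `levels` is odd, so the grid is symmetric
    around zero. Training would typically pass gradients through the
    rounding with a straight-through estimator, which is not shown here.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half  # squashed into [-half, half]
        codes.append(int(round(bounded)))
    return codes

def fsq_index(codes, levels):
    """Map a tuple of per-dimension codes to one integer (mixed radix)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)
    return index

def codebook_size(levels):
    """Size of the implicit codebook: product of per-dimension levels."""
    return math.prod(levels)

# Hypothetical 3-dimensional latent with 5 levels per dimension.
example_levels = [5, 5, 5]
example_codes = fsq_quantize([0.0, 10.0, -10.0], example_levels)
example_index = fsq_index(example_codes, example_levels)
```

With these hypothetical levels the implicit codebook has 125 entries, and no separate embedding table has to be learned or kept from collapsing.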

Key results

Superior FID and MMD scores on ShapeNet compared to CLIP-Forge
Robust zero-shot text-to-3D generation without text supervision
Sketch integration corrects geometric ambiguity and prevents model collapse
High structural fidelity maintained across single and multimodal conditioning

Why it matters

Enables scalable, unified 3D content creation for CAD, gaming, and robotics by effectively bridging 2D vision-language models with 3D geometry generation.

Abstract

Vision-Language Pre-training models (VLMs) have emerged as a highly promising solution to the generative problem, achieving remarkable success in the field of 2D image generation. However, extending these 2D paradigms to 3D domains is still unexplored due to the scarcity of text-3D pairs and shape ambiguity. To address this challenge, we introduce UM3D, a two-stage pre-training architecture towards unified multimodal 3D shape generation. Our approach first optimizes a Finite Scalar Quantization based Autoencoder (FSQ-AE) to learn a compact yet powerful implicit representation with improved codebook utilization. We then encode sketch features into CLIP’s multimodal embedding space to incorporate additional geometric information. This unified space conditions our well-designed Instance-Normalized Glow model (Glow-IN) to model the distribution of 3D shape representations while mitigating distribution shift issues. During inference, UM3D can accept individual text, image, sketch, or combined inputs to generate corresponding 3D shapes. Quantitative and qualitative evaluations confirm our method’s effectiveness in synthesizing high-fidelity, input-consistent 3D geometries.
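As a rough illustration of how a flow model can be conditioned on an embedding, the following sketch implements a single conditional affine coupling step, the invertible building block used in Glow-style flows. It is a generic example under stated assumptions, not the Glow-IN architecture: the `conditioner` function is a hypothetical fixed stand-in for a learned network, the instance normalization described above is omitted, and the vectors are toy values rather than CLIP embeddings or FSQ-AE latents.

```python
import math

def conditioner(x_a, cond, n_out):
    """Hypothetical stand-in for a learned network.

    Maps the untouched half of the input plus the condition vector to a
    log-scale and a shift for each transformed coordinate.
    """
    h = sum(x_a) + sum(cond)
    log_scale = [math.tanh(0.5 * h)] * n_out  # bounded for numerical stability
    shift = [0.1 * h] * n_out
    return log_scale, shift

def coupling_forward(x, cond):
    """Transform the second half of x, conditioned on the first half and cond.

    Returns the output vector and the log-determinant of the Jacobian.
    """
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = conditioner(x_a, cond, len(x_b))
    y_b = [xb * math.exp(ls) + ti for xb, ls, ti in zip(x_b, log_s, t)]
    return x_a + y_b, sum(log_s)

def coupling_inverse(y, cond):
    """Exactly undo coupling_forward given the same condition vector."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = conditioner(y_a, cond, len(y_b))
    x_b = [(yb - ti) * math.exp(-ls) for yb, ls, ti in zip(y_b, log_s, t)]
    return y_a + x_b

# Toy latent and toy condition vector (both hypothetical values).
latent = [0.3, -1.2, 0.7, 2.0]
condition = [0.5, -0.25]
transformed, log_det = coupling_forward(latent, condition)
recovered = coupling_inverse(transformed, condition)
```

Because the first half of the vector passes through unchanged, the inverse can recompute the same scale and shift from it, so sampling amounts to drawing from the base distribution and running the inverse with whichever condition embedding is supplied.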

Index terms

Deep Learning for Visual Perception, Visual Learning