UM3D: Towards a Unified Multimodal 3D Shape Generation Model
Xian-Feng Han, Zecheng Zhang, Xuran He, Ming-Jie Wang
AI summary
Problem
Extending 2D vision-language generation to 3D is hindered by scarce text-3D paired data and the inherent ambiguity of text prompts, which often fail to capture complete geometric details.
Approach
The method compresses 3D shapes into a compact discrete space using a Finite Scalar Quantization based Autoencoder (FSQ-AE), then aligns sketch features with CLIP’s multimodal embedding space. The resulting embedding conditions a Glow-based flow model, so a single model can generate shapes from text, image, sketch, or combined inputs.
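To make the quantization step concrete, here is a minimal NumPy sketch of finite scalar quantization in general, not the authors' FSQ-AE. It assumes an odd number of levels per channel and uses a hypothetical setting of three channels with five levels each; the function names `fsq_quantize` and `fsq_index` are illustrative. Each latent channel is bounded and rounded to a small integer grid, so the codebook is implicit (the product of the per-channel level counts) rather than learned.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each channel of z to one of a fixed number of integer levels.

    Assumption for this sketch: every entry of `levels` is odd, so the grid
    for L levels is the integers in [-(L-1)/2, (L-1)/2].
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half      # squash each channel into (-half, half)
    return np.round(bounded)         # snap to the nearest grid point

def fsq_index(codes, levels):
    """Map a vector of per-channel codes to a single implicit codebook index."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = (codes + half).astype(np.int64)                 # shift to 0..L-1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))   # mixed-radix weights
    return (digits * basis).sum(axis=-1)

# Hypothetical configuration: 3 channels x 5 levels -> 5**3 = 125 implicit codes.
levels = [5, 5, 5]
z = np.random.default_rng(0).normal(size=(4, 3))
codes = fsq_quantize(z, levels)
indices = fsq_index(codes, levels)
```

During training, the rounding step is usually paired with a straight-through estimator so gradients can reach the encoder; that part is omitted here.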
Key results
- Superior FID and MMD scores on ShapeNet compared to CLIP-Forge
- Robust zero-shot text-to-3D generation without text supervision
- Sketch integration corrects geometric ambiguity and prevents model collapse
- High structural fidelity maintained across single and multimodal conditioning
Why it matters
Enables scalable, unified 3D content creation for CAD, gaming, and robotics by effectively bridging 2D vision-language models with 3D geometry generation.
Abstract
Vision-Language Pre-training models (VLMs) have emerged as a highly promising solution to the generative problem, achieving remarkable success in the field of 2D image generation. However, extending these 2D paradigms to 3D domains is still unexplored due to the scarcity of text-3D pairs and shape ambiguity. To address this challenge, we introduce UM3D, a two-stage pre-training architecture towards unified multimodal 3D shape generation. Our approach first optimizes a Finite Scalar Quantization based Autoencoder (FSQ-AE) to learn a compact yet powerful implicit representation with improved codebook utilization. We then encode sketch features into CLIP’s multimodal embedding space to incorporate additional geometric information. This unified space conditions our well-designed Instance-Normalized Glow model (Glow-IN) to model the distribution of 3D shape representations while mitigating distribution shift issues. During inference, UM3D can accept individual text, image, sketch, or combined inputs to generate corresponding 3D shapes. Quantitative and qualitative evaluations confirm our method’s effectiveness in synthesizing high-fidelity, input-consistent 3D geometries.
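As a rough illustration of how an embedding can condition a Glow-style flow, the sketch below implements one generic conditional affine coupling layer in NumPy. It is not the authors' Glow-IN: instance normalization is not included, random linear maps stand in for learned networks, and the class name, dimensions, and the vector `c` (standing in for a CLIP-space embedding) are all assumptions made for the example.

```python
import numpy as np

class ConditionalAffineCoupling:
    """One affine coupling layer whose scale and shift depend on a condition c."""

    def __init__(self, dim, cond_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.d = dim // 2                       # size of the untouched half
        in_dim = self.d + cond_dim
        out_dim = dim - self.d
        # Random linear maps as placeholders for learned networks.
        self.W_s = rng.normal(scale=0.1, size=(in_dim, out_dim))
        self.W_t = rng.normal(scale=0.1, size=(in_dim, out_dim))

    def _params(self, x_a, c):
        h = np.concatenate([x_a, c], axis=-1)
        log_s = np.tanh(h @ self.W_s)           # bounded log-scale for stability
        t = h @ self.W_t
        return log_s, t

    def forward(self, x, c):
        x_a, x_b = x[..., :self.d], x[..., self.d:]
        log_s, t = self._params(x_a, c)
        y_b = x_b * np.exp(log_s) + t
        log_det = log_s.sum(axis=-1)            # log |det Jacobian| of the layer
        return np.concatenate([x_a, y_b], axis=-1), log_det

    def inverse(self, y, c):
        y_a, y_b = y[..., :self.d], y[..., self.d:]
        log_s, t = self._params(y_a, c)
        x_b = (y_b - t) * np.exp(-log_s)
        return np.concatenate([y_a, x_b], axis=-1)

rng = np.random.default_rng(42)
x = rng.normal(size=(4, 8))      # stand-in for shape latents
c = rng.normal(size=(4, 16))     # stand-in for a conditioning embedding
layer = ConditionalAffineCoupling(dim=8, cond_dim=16)
y, log_det = layer.forward(x, c)
x_rec = layer.inverse(y, c)
```

Because the first half of the input passes through unchanged, the inverse can recompute the same scale and shift from it, which is what makes the layer exactly invertible for any fixed condition.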