Audio-2-Shape: 3D Generation from What You Hear
Xuran He, Xian-Feng Han, Shi-Jie Sun
AI summary
Problem
Direct synthesis of 3D geometry from audio remains unexplored due to the semantic gap between acoustic signals and spatial data, compounded by a lack of high-quality audio-3D paired datasets.
Approach
The framework aligns audio, image, and 3D features in a unified space, uses a latent diffusion model to generate coarse shapes from audio cues, and refines them into high-fidelity 3D structures via 3D Gaussian Splatting optimization.
Key results
- First audio-to-3D shape generation framework
- Tri-modal contrastive alignment of audio, vision, and point clouds
- Coarse-to-fine 3D synthesis using latent diffusion and 3D Gaussian Splatting
- New multimodal dataset of aligned audio, image, and point cloud triplets
Why it matters
Advances embodied AI and multimodal interaction by enabling robots and autonomous systems to construct 3D environmental understanding from auditory cues alone.
Abstract
Audio serves as an important bridge connecting humans to their surroundings, providing a unique modality for perceiving the world. For embodied AI systems such as robots and autonomous vehicles, understanding the world through sound is a promising and significant research direction. In this paper, we investigate the underexplored domain of audio-driven 3D shape generation and propose a novel architecture for audio-conditioned 3D shape synthesis. Specifically, our framework comprises three key modules: cross-modal alignment, a latent diffusion model for generation, and a 3D Gaussian Splatting (3DGS) based optimization module. We first align audio and 3D shape representations within a unified embedding space using a contrastive learning strategy; the aligned audio embedding then conditions a latent diffusion model to generate an initial coarse 3D structure. Subsequently, we introduce a refinement stage utilizing 3D Gaussian Splatting to produce high-fidelity 3D shapes. Extensive qualitative and quantitative experiments validate the effectiveness of our proposed method, demonstrating its capability to generate semantically coherent 3D shapes from audio input.
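The abstract does not spell out the contrastive objective used for alignment. As an illustration only, the sketch below uses a symmetric InfoNCE loss, a common choice for pulling paired embeddings together in a shared space. The function names, the temperature of 0.07, and the toy 2-D embeddings are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] within the batch.

    Every other target in the batch acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, target)) / temperature
                  for target in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(audio_emb, shape_emb, temperature=0.07):
    """Average of the audio-to-shape and shape-to-audio InfoNCE terms (assumed form)."""
    return 0.5 * (info_nce(audio_emb, shape_emb, temperature)
                  + info_nce(shape_emb, audio_emb, temperature))


# Toy batch: two audio embeddings and two candidate sets of shape embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_shapes = [[0.9, 0.1], [0.1, 0.9]]   # pair i matches audio i
swapped_shapes = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

loss_aligned = symmetric_contrastive_loss(audio, aligned_shapes)
loss_swapped = symmetric_contrastive_loss(audio, swapped_shapes)
```

In this toy batch the correctly paired embeddings give a lower loss than the swapped ones, which is the behavior a contrastive alignment stage relies on. Under the same assumptions, a tri-modal variant would sum this loss over the audio-image, audio-shape, and image-shape pairs.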