ICRA 2026

Audio-2-Shape: 3D Generation from What You Hear

Xuran He, Xian-Feng Han, Shi-Jie Sun

AI summary

The paper proposes a framework that generates semantically coherent 3D shapes directly from audio input by bridging the acoustic and geometric domains.

audio-driven 3D generation, cross-modal alignment, latent diffusion, 3D Gaussian Splatting, point cloud synthesis, multimodal AI

Problem

Direct synthesis of 3D geometry from audio remains largely unexplored because of the semantic gap between acoustic signals and spatial data, compounded by a lack of high-quality paired audio-3D datasets.

Approach

The framework aligns audio, image, and 3D features in a unified space, uses a latent diffusion model to generate coarse shapes from audio cues, and refines them into high-fidelity 3D structures via 3D Gaussian Splatting optimization.

Key results

First audio-to-3D shape generation framework
Tri-modal contrastive alignment of audio, vision, and point clouds
Coarse-to-fine 3D synthesis using latent diffusion and 3D Gaussian Splatting
New multimodal dataset of aligned audio, image, and point cloud triplets
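
The tri-modal contrastive alignment listed above can be illustrated with a small sketch. This summary does not state the exact loss the authors use, so the code below assumes a CLIP-style symmetric InfoNCE objective summed over the three modality pairs. The function names, the temperature of 0.07, and the toy embeddings are all hypothetical, and plain Python lists stand in for encoder outputs.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match targets[i].

    Assumed loss form; the paper may use a different contrastive objective.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every target, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index i as the positive.
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_info_nce(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x directions."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def tri_modal_loss(audio, image, points, temperature=0.07):
    """Sum the symmetric loss over the three modality pairs (an assumption)."""
    modalities = (audio, image, points)
    return sum(
        symmetric_info_nce(p, q, temperature)
        for p, q in combinations(modalities, 2)
    )


# Toy batch of three aligned triplets (hypothetical 3-D embeddings).
audio_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
point_emb = [[0.8, 0.0, 0.2], [0.0, 0.8, 0.2], [0.2, 0.0, 0.8]]

aligned_loss = tri_modal_loss(audio_emb, image_emb, point_emb)
# Rotating the image batch breaks the pairing, so the loss should increase.
shuffled_loss = tri_modal_loss(audio_emb, image_emb[1:] + image_emb[:1], point_emb)
```

In this toy setup the correctly paired batch yields a lower loss than the batch with rotated image embeddings, which is the behavior a contrastive alignment stage relies on: matched audio, image, and point cloud embeddings are pulled together while mismatched ones are pushed apart.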

Why it matters

Advances embodied AI and multimodal interaction by enabling robots and autonomous systems to construct 3D environmental understanding from auditory cues alone.

Abstract

Audio serves as an important bridge connecting humans to their surroundings, providing a unique modality for perceiving the world. For embodied AI systems, such as robots and autonomous vehicles, enabling them to understand the world through sound is a promising and significant research direction. In this paper, we explore the underexplored domain of audio-driven 3D shape generation and propose a novel architecture for audio-conditioned 3D shape synthesis. Specifically, our framework comprises three key modules: cross-modal alignment, a latent diffusion model for generation, and a 3D Gaussian Splatting (3DGS) based optimization module. We first align audio and 3D shape representations within a unified embedding space using a contrastive learning strategy, which conditions a latent diffusion model to generate an initial coarse 3D structure. Subsequently, we introduce a refinement stage utilizing 3D Gaussian Splatting to produce high-fidelity 3D shapes. Extensive qualitative and quantitative experiments validate the effectiveness of our proposed method, demonstrating its capability to generate semantically coherent 3D shapes from audio input.
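
The abstract does not give the training objective of the audio-conditioned latent diffusion model. As a rough guide, the standard noise-prediction formulation for conditional latent diffusion is written out below. All notation is assumed rather than taken from the paper: z_0 is the latent code of a 3D shape, c is the audio embedding from the alignment stage, \bar{\alpha}_t is the cumulative noise schedule, and \epsilon_\theta is the denoising network. The authors may use a different parameterization.

```latex
% Assumed notation (not from the paper):
%   z_0              latent code of a 3D shape
%   c                audio embedding produced by the alignment stage
%   \bar{\alpha}_t   cumulative noise schedule at step t
%   \epsilon_\theta  denoising network
\begin{aligned}
z_t &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
      \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}_{\mathrm{LDM}} &= \mathbb{E}_{z_0,\, c,\, \epsilon,\, t}
      \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right].
\end{aligned}
```

Under this formulation, sampling starts from Gaussian noise in the latent space and iteratively denoises it while conditioning on c, which would yield the coarse structure that the 3DGS stage then refines.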

Index terms

Deep Learning for Visual Perception, Visual Learning