Zero-Shot Denoiser for Enhanced Acoustic Inspection: Blind Signal Separation and Text-Guided Audio Reconstruction
Koki Shoda, Jun Younes Louhi Kasahara, Qi An, Atsushi Yamashita
AI summary
Problem
Conventional acoustic denoising relies on pre-collected target sound samples or labeled training data, limiting its use in dynamic real-world inspections where anomaly sounds are unknown. Additionally, blind signal separation requires complex, manual hyperparameter tuning and struggles to identify target components without spatial priors.
Approach
The method decomposes mixed audio using blind signal separation, then uses a frozen audio-language model to semantically weight and reconstruct target sounds based on user-provided text prompts. It automatically optimizes separation quality by maximizing a model-derived pseudo-SNR metric.
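The weighting-and-reconstruction step can be illustrated with a minimal sketch. It is not the paper's AR-Attention: it assumes the frozen audio-language model has already produced one embedding per separated component and one for the text prompt, and it swaps in a plain temperature-scaled softmax over cosine similarities. The function names, the `temperature` parameter, and the toy embeddings are all hypothetical.

```python
import numpy as np

def text_guided_weights(component_emb, text_emb, temperature=0.1):
    """Softmax over cosine similarities between each component and the prompt.

    Assumption: embeddings come from a frozen audio-language model; here they
    are hand-written placeholders. This is a stand-in for AR-Attention.
    """
    c = component_emb / np.linalg.norm(component_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    logits = (c @ t) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def reconstruct(components, weights):
    """Weighted sum of separated components: (K, T) array -> (T,) signal."""
    return weights @ components

# Toy data: three "separated" components and placeholder embeddings.
rng = np.random.default_rng(0)
components = rng.standard_normal((3, 16000))
text_emb = np.array([1.0, 0.0, 0.0, 0.0])
component_emb = np.array([
    [0.9, 0.1, 0.0, 0.0],   # close to the prompt
    [0.0, 1.0, 0.0, 0.0],   # orthogonal to the prompt
    [-1.0, 0.0, 0.2, 0.0],  # roughly opposite to the prompt
])

weights = text_guided_weights(component_emb, text_emb)
denoised = reconstruct(components, weights)
```

In this toy case the first component receives almost all of the weight because its embedding is closest to the prompt embedding; a lower `temperature` sharpens that preference.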
Key results
- Artifact-Resilient Attention mechanism for text-guided audio reconstruction
- Automatic BSS hyperparameter optimization via pseudo-SNR maximization
- Denoising performance comparable to state-of-the-art supervised methods on public datasets, in a true zero-shot setting
- Validated effectiveness on real-world hammering test acoustic inspection data
Why it matters
Enables reliable, data-free acoustic noise reduction for robotic inspection and open-set audio applications where target sounds cannot be pre-recorded.
Abstract
Acoustic inspection is crucial for infrastructure maintenance, but its effectiveness is often hampered by environmental noise. Conventional denoising methods rely on prior knowledge or training data, limiting their practicality. This paper presents Zero-Shot Denoiser, a novel approach achieving noise reduction without pre-collected target sound samples or noise knowledge. Our method combines Blind Signal Separation (BSS) for unsupervised audio decomposition and Artifact-Resilient Attention (AR-Attention) for text-guided audio reconstruction. AR-Attention leverages pre-trained audio-language models and dual normalization to mitigate BSS artifacts and identify target sounds semantically. We introduce pseudo Signal-to-Noise Ratio, derived from the audio-language model, for automatic BSS hyperparameter optimization. In experiments using public datasets, our method, operating in a true zero-shot setting, achieved performance comparable to that of state-of-the-art supervised denoising methods, and experiments targeting hammering tests confirmed the effectiveness of our approach for real-world acoustic inspections. Our approach overcomes the limitations of data-dependent techniques and offers a versatile noise reduction solution for acoustic inspection and broader acoustic tasks.
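The idea of choosing BSS hyperparameters by maximizing a pseudo-SNR can be sketched as a simple selection loop. The abstract does not give the metric's definition, so the score below is an assumed stand-in: the energy of the weight-selected mixture divided by the energy of the complementary mixture, in dB. The candidate names `sharp` and `flat`, the weights, and the signals are hypothetical toy values, not outputs of a real separator or audio-language model.

```python
import numpy as np

def pseudo_snr_db(components, weights, eps=1e-12):
    """Assumed stand-in score: energy kept vs. energy rejected, in dB."""
    target = weights @ components            # what the weights keep
    residual = (1.0 - weights) @ components  # what the weights reject
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))

def select_best(candidates):
    """candidates: name -> (components, weights). Returns the best name and all scores."""
    scores = {name: pseudo_snr_db(c, w) for name, (c, w) in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy signals: a strong tone (the "target") and weaker broadband noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone = 2.0 * np.sin(2 * np.pi * 440.0 * t)
noise = 0.5 * rng.standard_normal(t.size)
components = np.stack([tone, noise])

# Two hypothetical hyperparameter settings, represented by the weights
# they would lead to: one separates cleanly, the other cannot decide.
candidates = {
    "sharp": (components, np.array([0.9, 0.1])),
    "flat": (components, np.array([0.5, 0.5])),
}

best, scores = select_best(candidates)
```

With uninformative weights the kept and rejected mixtures coincide and the score is 0 dB, so the setting that yields a cleaner split wins. In a real pipeline each candidate would come from rerunning BSS with different hyperparameters and rescoring the resulting components.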