ICRA 2026

Zero-Shot Denoiser for Enhanced Acoustic Inspection: Blind Signal Separation and Text-Guided Audio Reconstruction

Koki Shoda, Jun Younes Louhi Kasahara, Qi An, Atsushi Yamashita

AI summary

A training-free denoising framework that combines blind signal separation with text-guided attention to achieve noise reduction comparable to supervised methods without pre-collected target sound samples.

Zero-Shot Denoising, Blind Signal Separation, Audio-Language Models, Acoustic Inspection, Artifact-Resilient Attention, Hyperparameter Optimization

Problem

Conventional acoustic denoising relies on pre-collected target sound samples or labeled training data, limiting its use in dynamic real-world inspections where anomaly sounds are unknown. Additionally, blind signal separation requires complex, manual hyperparameter tuning and struggles to identify target components without spatial priors.

Approach

The method decomposes mixed audio using blind signal separation (BSS), then uses a frozen audio-language model to semantically weight the separated components and reconstruct the target sound from a user-provided text prompt. It tunes the BSS hyperparameters automatically by maximizing a pseudo-SNR metric derived from the audio-language model.
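The weighting and selection steps can be illustrated with a minimal sketch. It is not the paper's AR-Attention: a plain softmax over cosine similarities is assumed here, and dual normalization is not modeled. The function names (`text_guided_weights`, `reconstruct`, `select_hyperparameter`), the `temperature` parameter, and the hand-written embeddings are all hypothetical. In practice the embeddings would come from a frozen audio-language model and the components from a BSS algorithm. The score passed to `select_hyperparameter` is a placeholder for the pseudo-SNR, whose definition this summary does not give.

```python
import numpy as np


def text_guided_weights(component_emb, text_emb, temperature=0.1):
    """Softmax over cosine similarity between each component and the prompt.

    Assumption: this is an illustrative weighting, not the paper's AR-Attention.
    """
    comp = component_emb / np.linalg.norm(component_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    logits = (comp @ text) / temperature
    logits = logits - logits.max()  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()


def reconstruct(components, weights):
    """Weighted sum of separated components: (K, T) components -> (T,) signal."""
    return weights @ components


def select_hyperparameter(candidates, score_fn):
    """Return the candidate with the highest score, plus all scores."""
    scores = [float(score_fn(c)) for c in candidates]
    return candidates[int(np.argmax(scores))], scores


# Toy data: two "separated" components, a 440 Hz tone and white noise.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
rng = np.random.default_rng(0)
components = np.stack([np.sin(2 * np.pi * 440 * t), rng.standard_normal(t.size)])

# Placeholder embeddings (hypothetical); a real system would take these
# from the frozen audio-language model.
component_emb = np.array([[0.9, 0.1], [0.1, 0.9]])
text_emb = np.array([1.0, 0.0])

weights = text_guided_weights(component_emb, text_emb)
denoised = reconstruct(components, weights)

# Placeholder score that peaks at 3; it stands in for the pseudo-SNR.
best, scores = select_hyperparameter([2, 3, 4], lambda k: -(k - 3) ** 2)
```

In this toy setup the component whose embedding aligns with the prompt receives nearly all of the weight, so the reconstruction is dominated by the tone. A lower `temperature` makes the selection sharper, while a higher one blends components more evenly.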

Key results

Artifact-Resilient Attention mechanism for text-guided audio reconstruction
Automatic BSS hyperparameter optimization via pseudo-SNR maximization
Denoising performance comparable to state-of-the-art supervised methods on public datasets, in a true zero-shot setting
Validated effectiveness on real-world hammering test acoustic inspection data

Why it matters

Enables reliable, data-free acoustic noise reduction for robotic inspection and open-set audio applications where target sounds cannot be pre-recorded.

Abstract

Acoustic inspection is crucial for infrastructure maintenance, but its effectiveness is often hampered by environmental noise. Conventional denoising methods rely on prior knowledge or training data, limiting their practicability. This paper presents Zero-Shot Denoiser, a novel approach achieving noise reduction without pre-collected target sound samples or noise knowledge. Our method synergistically combines Blind Signal Separation (BSS) for unsupervised audio decomposition and Artifact-Resilient Attention (AR-Attention) for text-guided audio reconstruction. AR-Attention leverages pre-trained audio-language models and dual normalization to mitigate BSS artifacts and identify target sounds semantically. We introduce pseudo Signal-to-Noise Ratio, derived from the audio-language model, for automatic BSS hyperparameter optimization. In experiments using public datasets, our method, operating in a true zero-shot setting, achieved performance comparable to that of state-of-the-art supervised denoising methods, and experiments targeting hammering tests confirmed the effectiveness of our approach for real-world acoustic inspections. Our approach overcomes the limitations of data-dependent techniques and offers a versatile noise reduction solution for acoustic inspection and broader acoustic tasks.

Index terms

Robotics and Automation in Construction, Industrial Robots, Surveillance Robotic Systems