Contrastive Auditory Knowledge Transfer for Tool-Mediated Robot Interaction with Granular Objects
Si Liu, Jindan Huang, Zhengyan Huan, Michael Hughes, Jivko Sinapov
AI summary
Problem
Transferring audio-based object recognition knowledge across different robotic tools and interaction behaviors traditionally requires costly, extensive data collection for each new context.
Approach
The authors project tool-mediated audio into a shared latent space using two contrastive strategies: a supervised method leveraging shared objects and a zero-shot method aligning audio with natural language context descriptions.
Key results
- Latent embeddings cluster by object identity independent of tool or behavior variations
- Transfer models match or exceed supervised baselines despite limited target-context data
- Zero-shot method successfully recognizes entirely novel objects via audio-text alignment
- Effective cross-tool and cross-behavior knowledge transfer demonstrated on real-world granular object data
Why it matters
Offers a scalable, data-efficient perception framework for robots operating in dynamic, real-world environments where collecting context-specific data is impractical.
Abstract
Tool-mediated interactions enable robots to manipulate and explore granular objects, producing informative auditory signals. A central challenge is transferring this perceptual knowledge across different tools and behaviors without costly data collection for each new context. We address this problem in the domain of audio-based recognition of granular and liquid-like objects. In this work, we leverage audio signals from tool-mediated interactions and learn context-agnostic representations for object recognition. We propose two contrastive learning approaches: a shared-object transfer method that performs supervised contrastive learning using audio data, and a zero-shot transfer method that integrates both audio and natural language descriptions of interaction contexts. Experiments on real-world data show that both methods achieve strong object recognition performance in unseen contexts, sometimes matching or exceeding a supervised baseline despite limited target-context data. Furthermore, the learned latent spaces exhibit clearly separable clusters by object identity, and the zero-shot method successfully recognizes novel objects, offering a practical solution for robot perception in data-scarce scenarios. The code for this paper is available at: https://github.com/siliu6487/AuditoryKnowledgeTransfer.
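As a rough illustration of the supervised contrastive learning mentioned for the shared-object transfer method, the sketch below computes a generic supervised contrastive loss over a batch of embeddings. The abstract does not give the exact loss, encoder, or hyperparameters, so the function name `supervised_contrastive_loss`, the `temperature` value, and the toy data are all assumptions made for illustration. The idea it shows is that embeddings sharing an object label are pulled together and all others are pushed apart, whatever tool or behavior produced the audio.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the paper's code).

    embeddings: list of equal-length, non-zero float vectors, one per audio clip.
    labels: object identity for each clip; clips with the same label are positives.
    temperature: assumed softmax temperature (hypothetical default).
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Scaled cosine similarity from anchor i to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two clips per object, imagined as recorded in different contexts.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["rice", "rice", "water", "water"]  # hypothetical object names
print(supervised_contrastive_loss(toy_embeddings, toy_labels))
```

In this toy batch the embeddings already cluster by object, so the loss is small. Assigning the same labels across clusters instead would raise it, which is the signal that would drive an encoder toward context-agnostic representations.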