ICRA 2026

CAVER: Curious AudioVisual Exploring Robot

Luca Macesanu, Boueny Folefack, Ruchira Ray, Ben Abbatematteo, Roberto Martín-Martín

AI summary

A robot can autonomously and efficiently learn audiovisual object representations through curiosity-driven exploration, significantly boosting material classification and audio-based imitation.

Audiovisual learning, Curious exploration, Robotic manipulation, Interactive perception, Self-supervised learning

Problem

Robots struggle to autonomously learn the correlations between an object’s visual appearance and its acoustic properties, typically relying on large, manually curated datasets rather than efficient interactive exploration.

Approach

CAVER taps objects with a custom 3D-printed impact tool and uses an uncertainty-guided curiosity strategy to select the most visually distinct interaction points, building a retrieval-based audiovisual representation via a KNN model.
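The retrieval and selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `AudioVisualMemory`, the function `pick_next_point`, the Euclidean metric, and the use of mean neighbor distance as the uncertainty score are all assumptions made for illustration.

```python
import numpy as np


class AudioVisualMemory:
    """Hypothetical store of paired (visual, audio) features with KNN retrieval."""

    def __init__(self, k=3):
        self.k = k
        self.visual = []  # visual feature vectors of tapped points
        self.audio = []   # audio feature vectors recorded for those taps

    def add(self, visual_feat, audio_feat):
        """Record one interaction: what the point looked like and how it sounded."""
        self.visual.append(np.asarray(visual_feat, dtype=float))
        self.audio.append(np.asarray(audio_feat, dtype=float))

    def _neighbors(self, query):
        """Return indices and distances of the (up to) k nearest stored visual features."""
        stored = np.stack(self.visual)
        dists = np.linalg.norm(stored - np.asarray(query, dtype=float), axis=1)
        idx = np.argsort(dists)[: self.k]
        return idx, dists[idx]

    def predict_audio(self, query):
        """Predict audio features as the mean over the nearest visual neighbors."""
        idx, _ = self._neighbors(query)
        return np.stack(self.audio)[idx].mean(axis=0)

    def uncertainty(self, query):
        """Assumed uncertainty proxy: mean visual distance to the nearest neighbors."""
        if not self.visual:
            return float("inf")  # nothing explored yet, so everything is unknown
        _, dists = self._neighbors(query)
        return float(dists.mean())


def pick_next_point(memory, candidates):
    """Return the index of the candidate that is least like anything tapped so far."""
    scores = [memory.uncertainty(c) for c in candidates]
    return int(np.argmax(scores))
```

In this sketch, a candidate that looks unlike every previously tapped point receives a high score and is chosen next, which is one simple way to read "visually distinct" as a curiosity signal.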

Key results

Custom 3D-printed spring-loaded impact end-effector
Multi-scale audiovisual representation with aligned visual-audio KNN mapping
Uncertainty-guided exploration prioritizing visually distinct object parts
87% material classification accuracy, 66% audio imitation accuracy, and faster audio prediction convergence than the exploration baselines

Why it matters

Enables robots to autonomously build rich multimodal perceptions for robust manipulation and imitation without relying on large-scale pretraining or manual data collection.

Abstract

Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object’s visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D-printed end-effector, attachable to parallel grippers, that excites objects’ audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high-uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations. More information: https://robin-lab.cs.utexas.edu/CAVER
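As a rough illustration of the second contribution, the sketch below shows one way local and global appearance features could be combined into a single key and paired with sound features. The abstract does not specify the combination rule, so the function names `l2_normalize` and `audiovisual_entry`, the weighted concatenation, and the `local_weight` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length; eps guards against zero vectors."""
    v = np.asarray(v, dtype=float)
    return v / (np.linalg.norm(v) + eps)


def audiovisual_entry(local_feat, global_feat, audio_feat, local_weight=0.5):
    """Build one hypothetical (visual key, audio value) pair.

    The key concatenates a local descriptor (the patch around the contact point)
    and a global descriptor (the whole object). Square-root weights make the
    squared distance between two keys equal to
    local_weight * d_local^2 + (1 - local_weight) * d_global^2.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1]")
    key = np.concatenate([
        np.sqrt(local_weight) * l2_normalize(local_feat),
        np.sqrt(1.0 - local_weight) * l2_normalize(global_feat),
    ])
    return key, np.asarray(audio_feat, dtype=float)
```

Under this assumed scheme, two points on the same object share the global half of the key and differ only in the local half, so `local_weight` controls how strongly part-level appearance separates them during retrieval.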

Index terms

Robot Audition; Perception for Grasping and Manipulation; Perception-Action Coupling