ICRA 2026

Visual-Auditory Extrinsic Contact Estimation

Xili Yi, Jayjun Lee, Nima Fazeli

AI summary

Fusing active audio sensing with vision enables robots to accurately detect hidden extrinsic contacts, significantly boosting downstream manipulation performance.

active audio sensing, extrinsic contact estimation, multimodal perception, sim-to-real transfer, robotic manipulation, contact-rich tasks

Problem

Vision-based methods struggle to perceive extrinsic contacts between a grasped tool and its environment due to occlusions, limited resolution, and ambiguous near-contact states, while tactile sensors cannot detect indirect interactions.

Approach

The system equips a robotic gripper with conduction speakers and contact microphones that emit and receive acoustic signals through the grasped object. A multimodal U-Net fuses this active audio feedback with visual depth and optical flow; it is trained entirely in simulation using a novel real-to-sim audio hallucination technique.
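
The summary does not say how the audio features enter the network, so the following is only a minimal sketch of one common option, early fusion by channel concatenation: a per-probe audio feature vector is tiled over the image grid and stacked with the depth and optical-flow channels before being passed to a U-Net-style encoder. The names `tile_audio_feature` and `fuse_inputs`, the two-channel flow layout, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Grid = List[List[float]]  # one H x W channel stored as nested lists


def tile_audio_feature(audio_feat: List[float], height: int, width: int) -> List[Grid]:
    """Turn each audio feature value into a constant H x W channel (assumed design)."""
    return [[[value] * width for _ in range(height)] for value in audio_feat]


def fuse_inputs(depth: Grid, flow_u: Grid, flow_v: Grid,
                audio_feat: List[float]) -> List[Grid]:
    """Stack depth, optical flow, and tiled audio features along the channel axis."""
    height, width = len(depth), len(depth[0])
    for name, grid in (("flow_u", flow_u), ("flow_v", flow_v)):
        if len(grid) != height or any(len(row) != width for row in grid):
            raise ValueError(f"{name} must match the depth map shape")
    return [depth, flow_u, flow_v] + tile_audio_feature(audio_feat, height, width)


# Toy 2 x 2 example with a hypothetical 3-dimensional audio feature.
depth = [[0.5, 0.6], [0.7, 0.8]]
flow_u = [[0.0, 0.1], [0.0, 0.1]]
flow_v = [[0.0, 0.0], [0.2, 0.2]]
audio_feat = [0.9, -0.3, 0.4]

fused = fuse_inputs(depth, flow_u, flow_v, audio_feat)
print(len(fused), len(fused[0]), len(fused[0][0]))  # channels, height, width
```

Tiling keeps the audio cue available at every pixel, so a convolutional network can combine the local contact signal with the global scene geometry; fusing at the bottleneck of the network instead would be an equally plausible alternative.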

Key results

Zero-shot sim-to-real transfer of a multimodal contact perception model
Real-to-sim audio hallucination technique bridging the sim-to-real gap
Accurate estimation of extrinsic contact location and size under heavy occlusion
Explicit contact prediction significantly improves downstream policy learning for contact-rich tasks

Why it matters

Provides a low-cost, hardware-efficient sensing solution for reliable tool-environment interaction perception, advancing robust manipulation in cluttered or occluded environments.

Abstract

Robust manipulation often hinges on a robot’s ability to perceive extrinsic contacts, that is, contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and transfer it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique, injecting real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks. Project webpage: va2contact.github.io
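
The real-to-sim audio hallucination step described in the abstract can be pictured with a minimal sketch. It assumes, purely for illustration, that real recordings are grouped by a binary contact state and that each simulated scene exposes a ground-truth `in_contact` flag; the abstract does not state how recordings are indexed or matched. `SimScene`, `TrainingSample`, `hallucinate_audio`, and the clip file names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SimScene:
    scene_id: int
    in_contact: bool  # ground-truth contact label provided by the simulator


@dataclass
class TrainingSample:
    scene_id: int
    in_contact: bool
    audio_clip: str  # identifier of the real recording paired with the scene


def hallucinate_audio(scenes: List[SimScene],
                      audio_bank: Dict[bool, List[str]],
                      rng: random.Random) -> List[TrainingSample]:
    """Pair each simulated scene with a real clip recorded in the same contact state."""
    samples = []
    for scene in scenes:
        clips = audio_bank.get(scene.in_contact, [])
        if not clips:
            raise ValueError(f"no real audio for in_contact={scene.in_contact}")
        samples.append(TrainingSample(scene.scene_id, scene.in_contact, rng.choice(clips)))
    return samples


# Hypothetical bank of real recordings, keyed by contact state (assumed layout).
audio_bank = {
    True: ["contact_000.wav", "contact_001.wav"],
    False: ["free_000.wav"],
}
scenes = [SimScene(0, True), SimScene(1, False), SimScene(2, True)]

samples = hallucinate_audio(scenes, audio_bank, random.Random(0))
for sample in samples:
    print(sample.scene_id, sample.in_contact, sample.audio_clip)
```

Because the simulator supplies the contact labels and the audio comes from real hardware, a model trained on such pairs never has to rely on simulated acoustics, which is one way to read the abstract's claim that the technique bridges the sim-to-real gap.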

Index terms

Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Deep Learning for Visual Perception