AI summary
Problem
Vision-based methods struggle to perceive extrinsic contacts between a grasped tool and its environment due to occlusions, limited resolution, and ambiguous near-contact states, while tactile sensors cannot detect indirect interactions.
Approach
The system equips a robotic gripper with conduction speakers and contact microphones that emit and receive acoustic signals through the grasped object, fusing this active audio feedback with visual depth and optical flow in a multimodal UNet trained entirely in simulation using a real-to-sim audio hallucination technique.
Key results
- Zero-shot sim-to-real transfer of a multimodal contact perception model
- Real-to-sim audio hallucination technique bridging the sim-to-real gap
- Accurate estimation of extrinsic contact location and size in cluttered and occluded scenes
- Explicit contact prediction significantly improves downstream policy learning for contact-rich tasks
Why it matters
Provides a low-cost, hardware-efficient sensing solution for reliable tool-environment interaction perception, advancing robust manipulation in cluttered or occluded environments.
Abstract
Robust manipulation often hinges on a robot’s ability to perceive extrinsic contacts: contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and transfer it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique that injects real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks. Project webpage: va2contact.github.io
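The real-to-sim audio hallucination idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a bank of real recordings grouped only by a binary contact / no-contact label, and simulated scenes that carry a ground-truth contact mask. The function name `hallucinate_audio`, the dictionary keys, and the toy data are all hypothetical, and the actual method may condition the audio choice on more than a binary label.

```python
import random


def hallucinate_audio(sim_scenes, real_audio_bank, rng):
    """Pair each simulated scene with a real audio clip matching its contact state.

    Assumption: real clips are bucketed only as "contact" / "no_contact".
    """
    paired = []
    for scene in sim_scenes:
        mask = scene["contact_mask"]
        # The simulator supplies the label: any nonzero mask pixel means contact.
        in_contact = any(any(row) for row in mask)
        bucket = "contact" if in_contact else "no_contact"
        # Inject a randomly drawn real recording from the matching bucket.
        clip = rng.choice(real_audio_bank[bucket])
        paired.append({"depth": scene["depth"], "audio": clip, "contact_mask": mask})
    return paired


# Toy stand-ins for real recordings (short waveforms) and simulated scenes.
real_audio_bank = {
    "contact": [[0.9, -0.8, 0.7], [0.6, -0.5, 0.4]],
    "no_contact": [[0.1, -0.1, 0.0]],
}
sim_scenes = [
    {"depth": [[1.0, 1.0], [1.0, 0.5]], "contact_mask": [[0, 0], [0, 1]]},
    {"depth": [[1.0, 1.0], [1.0, 1.0]], "contact_mask": [[0, 0], [0, 0]]},
]
paired = hallucinate_audio(sim_scenes, real_audio_bank, random.Random(0))
```

Each entry in `paired` combines simulated visual input and a simulated contact label with real audio, so a model trained on such samples is exposed to real acoustic signals without needing real-world contact annotation.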