Symmetry-Aware Fusion of Vision and Tactile Sensing Via Bilateral Force Priors for Robotic Manipulation
Wonju Lee, Matteo Grimaldi, Tao Yu
AI summary
Problem
Vision-only policies lack local contact feedback for precise alignment, while naive visuo-tactile fusion often dilutes modality-specific cues and fails to synchronize heterogeneous representations.
Approach
The method fuses wrist-camera vision and tactile signals using a Cross-Modal Transformer with hierarchical attention, guided by a physics-informed regularization that enforces bilateral force symmetry.
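As a rough illustration of the self- and cross-attention pattern described above, the following pure-Python sketch lets each modality attend to itself and then to the other modality. It is not the authors' implementation: it omits learned projections, multiple heads, and positional encodings. The function names (`attend`, `mean_pool`, `fuse`) and the ordering of self-attention before cross-attention are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with a residual connection.

    Simplification (assumption): keys and values are the same tokens,
    and no learned projection matrices are applied.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * kv[j] for w, kv in zip(weights, keys_values))
                 for j in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])  # residual add
    return out


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(tok[j] for tok in tokens) / n for j in range(len(tokens[0]))]


def fuse(vision_tokens, tactile_tokens):
    """Hypothetical two-stage fusion: self-attention, then cross-attention."""
    # Stage 1: each modality refines its own tokens.
    vision = attend(vision_tokens, vision_tokens)
    tactile = attend(tactile_tokens, tactile_tokens)
    # Stage 2: each modality queries the other.
    vision_x = attend(vision, tactile)
    tactile_x = attend(tactile, vision)
    # Concatenate pooled features into one fused vector.
    return mean_pool(vision_x) + mean_pool(tactile_x)


if __name__ == "__main__":
    vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 toy image tokens
    tactile = [[0.5, 0.5], [0.0, 1.0]]              # 2 toy tactile tokens
    print(len(fuse(vision, tactile)))               # fused dim = 2 * token dim
```

Keeping the two modalities as separate token streams until the cross-attention stage is one way to avoid diluting modality-specific cues, which is the failure mode attributed to naive fusion above.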
Key results
- 96.59% insertion success rate on the TacSL benchmark
- Surpasses naive and gated fusion baselines while closely matching the privileged wrist + contact-force configuration (96.09%)
- Resolves visuo-tactile synchronization via hierarchical self- and cross-attention
- Stabilizes grasps and reduces lateral misalignment through symmetry regularization
Why it matters
Offers a scalable, physics-informed fusion framework that bridges the gap between privileged sensing and real-world contact-rich robotic manipulation.
Abstract
Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged “wrist + contact force” configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.
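The abstract does not state the exact form of the symmetry regularizer. One plausible reading, offered purely as an assumption, is a penalty on the imbalance between the force magnitudes estimated for the left and right fingers, added to the task loss with a small weight. In the sketch below, the names `bilateral_symmetry_penalty` and `total_loss`, the squared-difference form, and the default weight of 0.1 are all hypothetical.

```python
from typing import Sequence


def bilateral_symmetry_penalty(left_forces: Sequence[float],
                               right_forces: Sequence[float]) -> float:
    """Mean squared imbalance between left- and right-finger force magnitudes.

    Assumption: each entry is a per-sample scalar force estimate derived
    from the tactile signal of one finger. The penalty is zero when the
    two fingers are perfectly balanced.
    """
    if len(left_forces) != len(right_forces) or len(left_forces) == 0:
        raise ValueError("force sequences must be non-empty and equal length")
    squared = [(l - r) ** 2 for l, r in zip(left_forces, right_forces)]
    return sum(squared) / len(squared)


def total_loss(task_loss: float,
               left_forces: Sequence[float],
               right_forces: Sequence[float],
               weight: float = 0.1) -> float:
    """Task loss plus the weighted symmetry penalty (hypothetical weighting)."""
    return task_loss + weight * bilateral_symmetry_penalty(left_forces,
                                                           right_forces)


if __name__ == "__main__":
    left = [1.0, 3.0]    # toy left-finger force estimates
    right = [0.0, 1.0]   # toy right-finger force estimates
    print(bilateral_symmetry_penalty(left, right))  # (1 + 4) / 2 = 2.5
    print(total_loss(1.0, left, right))
```

A soft penalty of this kind nudges the policy toward balanced grasps without forbidding the asymmetric forces that a real insertion may briefly require.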