ICRA 2026

Symmetry-Aware Fusion of Vision and Tactile Sensing Via Bilateral Force Priors for Robotic Manipulation

Wonju Lee, Matteo Grimaldi, Tao Yu

AI summary

Physics-informed bilateral force regularization within a Cross-Modal Transformer enables robust visuo-tactile fusion, achieving near-privileged insertion success rates.

visuo-tactile fusion; robotic insertion; cross-modal transformer; bilateral force symmetry; physics-informed learning; tactile sensing

Problem

Vision-only policies lack local contact feedback for precise alignment, while naive visuo-tactile fusion often dilutes modality-specific cues and fails to synchronize heterogeneous representations.

Approach

The method fuses wrist-camera vision and tactile signals using a Cross-Modal Transformer with hierarchical attention, guided by a physics-informed regularization that enforces bilateral force symmetry.
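
One way to picture the bilateral force symmetry regularizer is as an extra loss term that grows when the two fingers press with unequal force. The sketch below is a minimal illustration under stated assumptions, not the authors' formulation: the function names, the use of force magnitudes, the squared-difference form, and the `weight` value are all hypothetical.

```python
import math


def bilateral_symmetry_penalty(left_forces, right_forces):
    """Mean squared difference between left and right finger force magnitudes.

    Assumed form for illustration only; the paper's exact regularizer
    is not given in this summary.
    Each argument is a list of (fx, fy, fz) tuples, one per time step.
    """
    if len(left_forces) != len(right_forces):
        raise ValueError("left and right force sequences must have equal length")
    if not left_forces:
        return 0.0
    total = 0.0
    for f_left, f_right in zip(left_forces, right_forces):
        mag_left = math.sqrt(sum(c * c for c in f_left))
        mag_right = math.sqrt(sum(c * c for c in f_right))
        total += (mag_left - mag_right) ** 2
    return total / len(left_forces)


def regularized_loss(task_loss, left_forces, right_forces, weight=0.1):
    """Task loss plus the weighted symmetry penalty (weight is hypothetical)."""
    return task_loss + weight * bilateral_symmetry_penalty(left_forces, right_forces)
```

With equal and opposite finger forces the penalty vanishes, so the term only affects training when the grasp is unbalanced.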

Key results

96.59% insertion success rate on the TacSL benchmark
Surpasses naive and gated fusion baselines while closely matching the privileged "wrist + contact force" configuration (96.09%)
Resolves visuo-tactile synchronization via hierarchical self- and cross-attention
Stabilizes grasps and reduces lateral misalignment through symmetry regularization

Why it matters

Offers a scalable, physics-informed fusion framework that bridges the gap between privileged sensing and real-world contact-rich robotic manipulation.

Abstract

Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged "wrist + contact force" configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.
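
The cross-attention step mentioned in the abstract can be illustrated with generic scaled dot-product attention, in which tokens from one modality query tokens from the other. The following is a minimal single-head sketch without learned projections, written in plain Python. It shows the standard mechanism only and is not the CMT architecture itself; treating vision tokens as queries and tactile tokens as keys and values is an assumption made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: tokens from one modality (assumed here: vision), list of vectors.
    keys, values: tokens from the other modality (assumed here: tactile).
    Returns one fused vector per query token.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused
```

Each output vector is a convex combination of the value tokens, so a vision token that aligns strongly with one tactile token draws most of its fused representation from that token.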

Index terms

Force and Tactile Sensing; Sensor Fusion; Deep Learning in Grasping and Manipulation