ICRA 2026

Symmetry-Aware Fusion of Vision and Tactile Sensing Via Bilateral Force Priors for Robotic Manipulation

Wonju Lee, Matteo Grimaldi, Tao Yu

AI summary

Physics-informed bilateral force regularization within a Cross-Modal Transformer enables robust visuo-tactile fusion, achieving near-privileged insertion success rates.
visuo-tactile fusion, robotic insertion, cross-modal transformer, bilateral force symmetry, physics-informed learning, tactile sensing

Problem

Vision-only policies lack local contact feedback for precise alignment, while naive visuo-tactile fusion often dilutes modality-specific cues and fails to synchronize heterogeneous representations.

Approach

The method fuses wrist-camera vision and tactile signals using a Cross-Modal Transformer with hierarchical attention, guided by a physics-informed regularization that enforces bilateral force symmetry.
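The cross-attention step at the heart of this kind of fusion can be illustrated with a minimal, dependency-free sketch. It is not the paper's architecture: the learned projections, multiple heads, self-attention layers, and hierarchy are all omitted, and the token values and names (`vision_tokens`, `tactile_tokens`, `cross_attention`) are hypothetical. It only shows how tokens from one modality can query tokens from the other.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Returns one output vector per query, each a weighted average of `values`.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


# Hypothetical toy embeddings (assumption: both modalities share dimension 2).
vision_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tactile_tokens = [[0.5, 0.5], [2.0, 0.0]]

# Tactile tokens query vision, and vision tokens query tactile.
tactile_from_vision = cross_attention(tactile_tokens, vision_tokens, vision_tokens)
vision_from_tactile = cross_attention(vision_tokens, tactile_tokens, tactile_tokens)

# One simple way to form a fused sequence: concatenate both attended streams.
fused = tactile_from_vision + vision_from_tactile
```

Because each output is a weighted average of the other modality's tokens, every fused token is conditioned on both streams, in contrast to naive fusion by plain concatenation.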

Key results

  • 96.59% insertion success rate on the TacSL benchmark
  • Surpasses naive and gated fusion baselines and closely matches the privileged wrist + contact force configuration (96.09%)
  • Resolves visuo-tactile synchronization via hierarchical self- and cross-attention
  • Stabilizes grasps and reduces lateral misalignment through symmetry regularization
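As a rough illustration of the symmetry regularization mentioned in the results above, the sketch below penalizes the net force from a two-finger grasp. The exact formulation used in the paper is not given in this summary, so everything here is an assumption: the function names, the convention that a balanced grasp has equal and opposite finger forces in a shared frame, and the `weight` value.

```python
def symmetry_penalty(f_left, f_right):
    """Squared norm of the net finger force.

    Assumption: both force vectors are expressed in the same frame, so a
    perfectly balanced bilateral grasp gives f_left + f_right = 0.
    """
    return sum((l + r) ** 2 for l, r in zip(f_left, f_right))


def regularized_loss(task_loss, f_left, f_right, weight=0.1):
    """Task loss plus a weighted symmetry penalty (weight is a hypothetical value)."""
    return task_loss + weight * symmetry_penalty(f_left, f_right)


# Balanced grasp: equal and opposite forces, so no penalty is added.
balanced = symmetry_penalty([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])

# A lateral component on one finger breaks the balance and is penalized.
unbalanced = symmetry_penalty([1.0, 0.5, 0.0], [-1.0, 0.0, 0.0])
```

A penalty of this shape grows with the unbalanced lateral component, which is consistent with the reported effect of reducing lateral misalignment, though the actual loss in the paper may differ.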

Why it matters

Offers a scalable, physics-informed fusion framework that bridges the gap between privileged sensing and real-world contact-rich robotic manipulation.

Abstract

Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged “wrist + contact force” configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.

Index terms

Force and Tactile Sensing; Sensor Fusion; Deep Learning in Grasping and Manipulation
