ICRA 2026

MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

Sirine Bhouri, Lan Wei, Jian-Qing Zheng, Dandan Zhang

AI summary

MultiDiffSense unifies multi-modal visuo-tactile image generation in a single diffusion model, outperforming a Pix2Pix cGAN baseline in SSIM on all three sensors and halving the real data required for downstream robotic pose estimation.

visuo-tactile synthesis, diffusion models, multi-modal tactile sensing, synthetic data generation, robotic perception, CAD conditioning

Problem

Acquiring large-scale, spatially aligned visuo-tactile datasets across heterogeneous tactile sensors is costly and slow, while existing synthetic generation methods are typically limited to a single modality.

Approach

The model conditions on CAD-derived depth maps and structured text prompts encoding sensor type and 4-DoF contact pose to generate physically consistent, aligned tactile images across multiple sensors.
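As a rough illustration of the text half of this dual conditioning, the sketch below packs a sensor type and four pose values into one prompt string. The prompt layout, the `ContactCondition` class, the `build_prompt` function, and the generic `p0` to `p3` pose labels are all assumptions made for illustration; the summary does not specify the paper's actual prompt format or which four degrees of freedom are used.

```python
from dataclasses import dataclass

# Sensor names as listed in the paper's abstract.
SENSORS = ("ViTac", "TacTip", "ViTacTip")


@dataclass(frozen=True)
class ContactCondition:
    """Hypothetical container for the text-side conditioning inputs."""
    sensor: str
    pose: tuple  # four pose components; meaning and units are not specified here


def build_prompt(cond: ContactCondition) -> str:
    """Format a condition as a structured prompt (hypothetical layout)."""
    if cond.sensor not in SENSORS:
        raise ValueError(f"unknown sensor: {cond.sensor!r}")
    if len(cond.pose) != 4:
        raise ValueError("expected a 4-DoF pose (four values)")
    # Placeholder labels p0..p3 stand in for the unspecified pose axes.
    pose_text = ", ".join(f"p{i}={v:.2f}" for i, v in enumerate(cond.pose))
    return f"sensor: {cond.sensor}; pose: {pose_text}"


example = build_prompt(ContactCondition("TacTip", (1.0, -2.5, 0.0, 15.0)))
print(example)
```

In a setup like this, the depth map would supply the spatial conditioning, while a string of this kind would tell the model which sensor to render and at which contact pose.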

Key results

Outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +64.7% (TacTip), and +134.6% (ViTacTip)
Maintains robust generation quality on unseen objects and contact poses
Halves required real data volume for downstream 3-DoF pose estimation while maintaining competitive accuracy
Enables cross-modal synthesis within a single unified architecture
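
The SSIM gains above are percentages measured against the baseline. A minimal sketch of that arithmetic follows, assuming the usual definition of relative improvement; the function name and the example scores are made up for illustration and are not values from the paper.

```python
def relative_gain_percent(baseline: float, model: float) -> float:
    """Relative improvement of `model` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (model - baseline) / baseline


# Made-up scores: a baseline SSIM of 0.50 and a model SSIM of 0.75.
gain = relative_gain_percent(0.50, 0.75)
print(f"{gain:+.1f}%")
```

Under this definition, a gain above +100% means the model's SSIM is more than double the baseline's.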

Why it matters

It alleviates the data-collection bottleneck in tactile robotics by enabling scalable, controllable multi-modal dataset generation for cross-sensor learning and deployment.

Abstract

Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance (R²: ViTac 0.940 vs. 0.919 real-only; ViTacTip 0.937 vs. 0.982; TacTip 0.784 vs. 0.794). MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
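
To make the 50% synthetic / 50% real mix concrete, here is a minimal sketch of assembling such a training set. The function name, the uniform random sampling, and the toy sample tuples are assumptions for illustration; the abstract does not describe how the paper selects or orders its samples.

```python
import random


def mix_half_real_half_synthetic(real, synthetic, total, seed=0):
    """Build a training list of `total` samples, half real and half synthetic."""
    if total % 2 != 0:
        raise ValueError("total must be even for an exact 50/50 split")
    half = total // 2
    if len(real) < half or len(synthetic) < half:
        raise ValueError("not enough samples in one of the pools")
    rng = random.Random(seed)  # seeded so the split is reproducible
    mixed = rng.sample(real, half) + rng.sample(synthetic, half)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


# Toy pools: each sample is a (source, index) tuple standing in for an image/pose pair.
real_pool = [("real", i) for i in range(10)]
synthetic_pool = [("synthetic", i) for i in range(10)]
train_set = mix_half_real_half_synthetic(real_pool, synthetic_pool, total=10)
print(sum(1 for source, _ in train_set if source == "real"), "real samples of", len(train_set))
```

With a split like this, a training set of a given size needs only half as many real samples as a real-only set of the same size.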

Index terms

Force and Tactile Sensing; Data Sets for Robot Learning; Sensor Fusion