ICRA 2026

How to Train Your Tactile Model: Tactile Perception with Multi-Fingered Robot Hands

Christopher Ford, Kaichen Shi, Laura Elizabeth Butcher, Nathan Lepora, Efi Psomopoulou

AI summary

TacViT, a Vision Transformer-based model, generalizes to unseen tactile sensors without retraining, significantly outperforming CNNs in accuracy and robustness.

Tactile sensing, Vision Transformers, robotic hands, sensor generalization, tactile perception, deep learning

Problem

CNN-based tactile perception models require extensive sensor-specific data collection and retraining for each new sensor due to manufacturing variations and wear, limiting scalable deployment in multi-fingered robotic hands.

Approach

TacViT applies a pre-trained Vision Transformer fine-tuned with Low-Rank Adaptation to tactile images, using global self-attention to extract robust features that transfer across different sensor domains.
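
To illustrate the Low-Rank Adaptation idea mentioned above, here is a minimal sketch in plain Python. It is not the authors' implementation: the class name `LoRALinear`, the rank, the `alpha` scaling, and the initialization are all assumptions chosen for illustration. The sketch shows the general technique only: the pre-trained weight stays frozen, and a small trainable update `B @ A` is added on top of it.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, shape d_out x d_in
        self.scale = alpha / rank     # assumed scaling convention
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying the frozen weight."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to an input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count entries of A and B, the only parameters that fine-tuning would update."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

With a 4 x 6 weight and rank 2, only 20 adapter values would be trained instead of all 24 weight entries. In a real Vision Transformer the layers are far larger, so the saving grows accordingly.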

Key results

Maintains high pose and force prediction accuracy on unseen sensors
Reduces mean absolute error spread and prediction noise compared to CNNs
Eliminates the need for sensor-specific retraining and data collection
Demonstrates consistent generalization across five distinct TacTip sensors
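
The summary does not say how the "mean absolute error spread" is computed. As one plausible reading, the sketch below computes the mean absolute error per sensor and then reports how much those values vary across sensors, both as a range and as a population standard deviation. The function names, the sensor labels, and the numbers in the usage example are made up for illustration and are not results from the paper.

```python
from statistics import mean, pstdev


def mae(predictions, targets):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(p - t) for p, t in zip(predictions, targets))


def mae_spread(per_sensor):
    """Summarize per-sensor MAE and its variation across sensors.

    per_sensor maps a sensor name to a (predictions, targets) pair.
    Returns (scores, value_range, std), where the last two are assumed
    definitions of "spread".
    """
    scores = {name: mae(p, t) for name, (p, t) in per_sensor.items()}
    values = list(scores.values())
    return scores, max(values) - min(values), pstdev(values)


# Toy usage with invented numbers (not data from the paper).
scores, value_range, std = mae_spread({
    "sensor_a": ([1.0, 2.0], [1.5, 2.5]),
    "sensor_b": ([0.0, 4.0], [1.0, 2.0]),
})
```

Under this reading, a model that generalizes well across sensors would show a small range and a small standard deviation.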

Why it matters

Enables scalable, rapid deployment of vision-based tactile sensors in multi-fingered robotic hands without costly per-sensor retraining.

Abstract

Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining for each new sensor due to differences in lens properties, illumination, and sensor wear. Here we introduce TacViT, a novel tactile perception model based on Vision Transformers, designed to generalize on new sensor data. TacViT leverages global self-attention mechanisms to extract robust features from tactile images, enabling accurate contact property inference even on previously unseen sensors. This capability significantly reduces the need for data collection and retraining, accelerating the deployment of new sensors. We evaluate TacViT on sensors for a five-fingered robot hand and demonstrate its superior generalization performance compared to CNNs. Our results highlight TacViT’s potential to make tactile sensing more scalable and practical for real-world robotic applications.

Index terms

Force and Tactile Sensing