ICRA 2026

TactEx: An Explainable Multimodal Robotic Interaction Framework for Human-Like Touch and Hardness Estimation

Felix Verstraete, Lan Wei, Wen Fan, Dandan Zhang

AI summary

TactEx unifies vision, touch, and language to enable accurate, explainable, and human-like hardness estimation in robotics, achieving 90% task success on simple queries.

Tactile sensing, Hardness estimation, Multimodal robotics, Explainable AI, Grounded-SAM, LLM grounding

Problem

Fine-grained hardness estimation in robotics requires controlled contact and statistical validation, yet existing methods often lack explainability, demand large datasets, and fail to provide transparent reasoning for human-facing applications.

Approach

The framework fuses GelSight-Mini tactile streams with RGB vision and language prompts, using a ResNet50+LSTM regressor for hardness prediction, Grounded-SAM for touch placement, and a lightweight LLM to parse user instructions and generate sensor-grounded explanations.

Key results

Data-efficient visuo-tactile regression achieving RMSE 4.3 and ρ=0.88 with only 280 fine-tuning samples
Grounded-SAM outperforms YOLO in touch placement accuracy and fine-grained segmentation
90% end-to-end task success on simple queries with generalization to novel tasks
Statistically significant ripeness ranking across five fruit types with sensor-grounded LLM explanations
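
The regression metrics quoted above can be illustrated with a minimal sketch. The summary does not say whether ρ is a Pearson or a Spearman coefficient, nor which hardness scale is used, so both correlations are shown and the sample values are made up purely for illustration. The function names (`rmse`, `pearson`, `average_ranks`, `spearman`) are hypothetical helpers, not part of TactEx.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only (not data from the paper).
actual_hardness = [20, 35, 50, 65, 80]
predicted_hardness = [24, 33, 47, 70, 78]

error = rmse(predicted_hardness, actual_hardness)
rho_rank = spearman(predicted_hardness, actual_hardness)
rho_linear = pearson(predicted_hardness, actual_hardness)
print(f"RMSE={error:.2f}, Spearman={rho_rank:.2f}, Pearson={rho_linear:.2f}")
```

RMSE is expressed in the same units as the hardness label, so a value such as 4.3 is only meaningful relative to the range of the scale, whereas a correlation coefficient is scale-free.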

Why it matters

Provides a deployable, transparent interface for safe and dexterous robotic manipulation in human-facing settings without relying on extensive task-specific data collection.

Abstract

Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. We evaluate TactEx on fruit-ripeness assessment, a representative task that requires both tactile sensing and contextual understanding. The system fuses GelSight-Mini tactile streams with RGB observations and language prompts. A ResNet50+LSTM model estimates hardness from sequential tactile data, while a cross-modal alignment module combines visual cues with guidance from a large language model (LLM). This explainable multimodal interface allows users to distinguish ripeness levels with statistically significant class separation (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM to be more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
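
As an illustration of what a pairwise class-separation check with a p < 0.01 threshold can look like, the sketch below runs an exact two-sided permutation test on the difference of group means. The abstract does not state which statistical test the authors used, so this is one common nonparametric option rather than their method; the function name `exact_permutation_p` and the hardness values are hypothetical.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        candidate = abs(mean_diff(sum(pooled[i] for i in indices)))
        # Small tolerance guards against floating-point round-off.
        if candidate >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Made-up hardness estimates for two ripeness classes of one fruit.
unripe = [72, 75, 70, 78, 74]
ripe = [41, 45, 39, 44, 48]

p_value = exact_permutation_p(unripe, ripe)
print(f"p = {p_value:.4f}, significant at 0.01: {p_value < 0.01}")
```

With five samples per class there are 252 possible splits, so the smallest attainable two-sided p-value is 2/252, just under 0.01. Smaller groups cannot reach that threshold with this test, which is worth keeping in mind when planning how many touches to collect per class.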

Index terms

Force and Tactile Sensing; Sensor Fusion; Visual Servoing