ICRA 2026

TacUMI: A Multi-Modal Universal Manipulation Interface for Contact-Rich Tasks

Tailai Cheng, Kejia Chen, Lingyun Chen, Liding Zhang, Yue Zhang, Yao Ling, Mahdi Hamad, Zhenshan Bing, Fan Wu, Karan Sharma, Alois Knoll

AI summary

Integrating tactile, force-torque, and pose sensing into a handheld gripper enables highly accurate segmentation of long-horizon, contact-rich manipulation tasks.

Multi-modal sensing; Task segmentation; Contact-rich manipulation; Handheld demonstration; Imitation learning; Tactile robotics

Problem

Learning long-horizon manipulation tasks is hindered by the lack of tactile feedback in existing handheld data collection devices and the inability of vision-only methods to capture critical contact dynamics during complex physical interactions.

Approach

The authors developed a compact handheld gripper that synchronously captures tactile, force-torque, and pose data during human demonstrations, paired with a multi-modal temporal model that automatically segments long tasks into meaningful skill phases.
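The summary does not say how the modalities are synchronized. As a minimal sketch of one common approach, assuming each sensor stream carries its own timestamps and runs at its own rate, every stream can be resampled onto a reference timeline by nearest-timestamp lookup. The function name `align_nearest` and the sample values are hypothetical and not taken from the paper.

```python
import numpy as np

def align_nearest(ref_t, src_t, src_x):
    """For each reference timestamp, return the source sample closest in time.

    Assumes src_t is sorted in ascending order and has at least two entries.
    """
    ref_t = np.asarray(ref_t, dtype=float)
    src_t = np.asarray(src_t, dtype=float)
    src_x = np.asarray(src_x)
    # Index of the first source timestamp >= each reference timestamp.
    idx = np.searchsorted(src_t, ref_t)
    # Clamp so that both idx - 1 and idx are valid neighbours.
    idx = np.clip(idx, 1, len(src_t) - 1)
    left, right = src_t[idx - 1], src_t[idx]
    # Step back by one where the left neighbour is at least as close.
    use_left = (ref_t - left) <= (right - ref_t)
    return src_x[idx - use_left.astype(int)]

# Hypothetical example: a slow reference clock and a faster force-torque stream.
ref_times = [0.0, 0.1, 0.2]
ft_times = [0.0, 0.04, 0.09, 0.16, 0.21]
ft_values = [0, 1, 2, 3, 4]
aligned = align_nearest(ref_times, ft_times, ft_values)
```

Nearest-neighbour lookup keeps raw sensor readings intact. Interpolation would be an alternative for smooth signals such as pose, but it is less suitable for tactile images.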

Key results

A robot-compatible handheld gripper integrating ViTac tactile sensors, 6D F/T sensing, and drift-free pose tracking
A continuous self-locking mechanism that eliminates trigger-induced interference from force-torque measurements
A multi-modal segmentation framework achieving over 90% accuracy on a challenging cable mounting task
Demonstrated cross-dataset transferability, maintaining comparable segmentation accuracy when applied to teleoperation-collected data
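
The summary reports segmentation accuracy without defining the metric. The sketch below assumes a frame-wise definition (the fraction of time steps whose predicted skill label matches the annotation) and also shows how event boundaries can be read off a label sequence. The function names and the toy labels are illustrative only, not the authors' evaluation code.

```python
def framewise_accuracy(pred, true):
    """Fraction of time steps where the predicted label equals the annotation."""
    if len(pred) != len(true):
        raise ValueError("label sequences must have the same length")
    if not true:
        raise ValueError("label sequences must be non-empty")
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def boundaries(labels):
    """Indices at which the label changes, i.e. where a new segment starts."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

# Toy example: three skill phases over ten time steps (made-up labels).
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

acc = framewise_accuracy(pred_labels, true_labels)
true_bounds = boundaries(true_labels)
pred_bounds = boundaries(pred_labels)
```

In this toy case the first predicted boundary arrives one step early, which costs a single frame of accuracy while the second boundary matches exactly.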

Why it matters

It provides a practical, scalable foundation for collecting and decomposing high-quality multi-modal demonstrations, accelerating the development of robots that can master complex, contact-rich real-world tasks.

Abstract

Task decomposition is critical for understanding and learning complex long-horizon manipulation tasks. Especially for tasks involving rich physical interactions, relying solely on visual observations and robot proprioceptive information often fails to reveal the underlying event transitions. This calls for efficient collection of high-quality multi-modal data as well as a robust segmentation method to decompose demonstrations into meaningful modules. Building on the idea of the handheld demonstration device Universal Manipulation Interface (UMI), we introduce TacUMI, a multi-modal data collection system that additionally integrates ViTac sensors, a force–torque sensor, and a pose tracker into a compact, robot-compatible gripper design, enabling synchronized acquisition of all these modalities during human demonstrations. We then propose a multi-modal segmentation framework that leverages temporal models to detect semantically meaningful event boundaries in sequential manipulations. Evaluation on a challenging cable mounting task shows more than 90% segmentation accuracy and highlights a remarkable improvement with more modalities, validating that TacUMI establishes a practical foundation for both scalable collection and segmentation of multi-modal demonstrations in contact-rich tasks. The design and experiment results are available at our project website: https://tac-umi.github.io/TacUMI/.

Index terms

Force and Tactile Sensing; Bimanual Manipulation; Methods and Tools for Robot System Design