ICRA 2026

TransDexNet: An End-To-End Motion Retargeting Network with Transformer for Dexterous Hand Teleoperation from RGB Images

Jiaying Tan, Qing Gao, Yuanchuan Lai

AI summary

TransDexNet directly maps a single human hand RGB image to dexterous hand joint angles with high accuracy and real-time speed, eliminating the need for intermediate pose estimation or specialized hardware.

Dexterous hand teleoperation; Motion retargeting; Vision Transformer; End-to-end learning; RGB-based control; Robot manipulation

Problem

Existing vision-based dexterous hand teleoperation methods rely on costly specialized hardware, multi-stage pipelines that cause latency and error accumulation, or depth sensors that limit real-world deployment.

Approach

The authors introduce TransDexNet, a dual-branch Transformer network that aligns latent features from human and robotic hand images to directly regress the dexterous hand's joint angles from a single RGB input.
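One way to read "aligns latent features" is as a training objective with two terms: a regression term on the joint angles and an alignment term that pulls the human-branch and robot-branch latent vectors together. The sketch below illustrates that reading only. The source does not state the actual loss, so the squared-error form, the function names, and `align_weight` are all assumptions made for illustration.

```python
def mean_squared_error(a, b):
    """Mean squared difference between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def retargeting_loss(pred_angles, target_angles,
                     human_latent, robot_latent, align_weight=1.0):
    """Hypothetical two-term objective (not taken from the paper).

    regression: error between predicted and ground-truth joint angles (rad)
    alignment:  distance between the latent vectors of the two branches
    """
    regression = mean_squared_error(pred_angles, target_angles)
    alignment = mean_squared_error(human_latent, robot_latent)
    return regression + align_weight * alignment


# Made-up numbers, chosen only to show the call shape.
example_loss = retargeting_loss(
    pred_angles=[0.2, 0.4],
    target_angles=[0.1, 0.4],
    human_latent=[1.0, 0.0],
    robot_latent=[0.5, 0.0],
    align_weight=0.5,
)
```

Under this reading, `align_weight` would trade off how strongly the two branches are forced to share a latent space against how closely the predicted angles fit the labels.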

Key results

Achieves 0.076 rad average joint angle error
Enables real-time inference at 0.22 seconds per frame
Introduces TransDexData, a 91,000-sample paired RGB dataset
Demonstrates accurate retargeting in simulation and real-world experiments
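To put the first two figures in more familiar units, the sketch below computes an average joint angle error over a small batch and converts the reported numbers to degrees and frames per second. It assumes the metric is a mean absolute error over all joints and frames, which the source does not specify. The sample angles are invented for illustration, and the function names are hypothetical.

```python
import math


def mean_joint_angle_error(predicted, target):
    """Mean absolute error in radians over all frames and joints."""
    total = 0.0
    count = 0
    for pred_frame, true_frame in zip(predicted, target):
        if len(pred_frame) != len(true_frame):
            raise ValueError("each frame must have the same number of joints")
        for p, t in zip(pred_frame, true_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("no joint angles to compare")
    return total / count


def frames_per_second(seconds_per_frame):
    """Convert a per-frame latency into a frame rate."""
    return 1.0 / seconds_per_frame


# Invented joint angles (rad): two frames, three joints each.
predicted = [[0.10, 0.50, 1.20], [0.00, 0.30, 0.90]]
target = [[0.15, 0.45, 1.10], [0.05, 0.40, 0.80]]
error_rad = mean_joint_angle_error(predicted, target)

# Conversions of the figures reported in the list above.
reported_error_deg = math.degrees(0.076)   # roughly 4.35 degrees
reported_fps = frames_per_second(0.22)     # roughly 4.5 frames per second
```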

Why it matters

Enables low-cost, real-time dexterous hand teleoperation without specialized hardware, with potential applications in rehabilitation, manufacturing, and hazardous-environment rescue.

Abstract

Dexterous hand teleoperation is becoming increasingly common, yet existing methods rarely provide both efficiency and convenience. The core challenge is to achieve motion retargeting from the human hand to a dexterous hand. To address this, we introduce TransDexNet, an end-to-end vision-based motion retargeting architecture for dexterous hands. Equipped with a Vision Transformer backbone, it takes a single RGB image of a human hand and directly regresses the joint angles of a dexterous hand without any intermediate pose estimation. The architecture employs dual branches bridged by an alignment layer to close the gaps in degrees of freedom (DoFs), geometry, and kinematics between the human and dexterous hands, enabling domain-invariant latent features. To train TransDexNet, we built a dataset named TransDexData, consisting of 91,000 RGB images of human hands paired with the corresponding dexterous hand RGB images and joint angles. In evaluation, the proposed network achieves an average joint angle error of 0.076 rad. Both simulation and real-world experiments demonstrate accurate and efficient performance. The project page is available at: https://joyyyy-gaint.github.io/TransDexNet.

Index terms

Telerobotics and Teleoperation; Dexterous Manipulation