ICRA 2026

Robust Hand Tracking from Visual-Inertial Fusion

Hyelim Choi, Hyunreal Park, Harim Ji, Somang Lee, Youngseon Lee, Yongseok Lee, Dongjun Lee

AI summary

Fusing lightweight vision transformer keypoint detection with IMU propagation enables robust, marker-free, real-time hand tracking even under severe occlusion.

Visual-inertial fusion, Hand tracking, Vision transformer, Factor graph optimization, Marker-free sensing, Dexterous teleoperation

Problem

Hand tracking is hindered by IMU drift, by visual occlusion during object manipulation, and by reliance on markers or depth sensors. Prior fusion methods attach markers to the wearable or use depth data to sidestep the domain gap in RGB imagery, and they require costly manual data annotation.

Approach

The framework combines glove-mounted IMU data with RGB camera images. A lightweight vision transformer (ViT) predicts keypoint likelihoods from each image, and probabilistic inference uses the IMU-propagated pose as a prior to estimate keypoint positions and uncertainties. Tracking relies mainly on high-rate IMU updates, while the pose is corrected through factor graph optimization.
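To make the fusion step concrete, here is a minimal sketch of combining an image-based keypoint likelihood with an IMU-derived prior. It is not the paper's implementation: it assumes a single 2D keypoint, a Gaussian prior in pixel coordinates, and a heatmap standing in for the network output. The names `fuse_keypoint` and `gaussian_blob` and all numeric values are hypothetical.

```python
import numpy as np

def fuse_keypoint(likelihood, prior_mean, prior_cov):
    """Posterior mean and covariance of one 2D keypoint on a pixel grid.

    likelihood: (H, W) non-negative heatmap (stand-in for a network output).
    prior_mean, prior_cov: assumed Gaussian prior in (x, y) pixel coordinates,
    e.g. the projection of an IMU-propagated pose.
    """
    h, w = likelihood.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (H*W, 2)

    # Evaluate the unnormalized Gaussian prior at every pixel.
    diff = pix - prior_mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(prior_cov), diff)
    prior = np.exp(-0.5 * maha)

    # Bayes rule on the grid: posterior is proportional to likelihood * prior.
    post = likelihood.ravel() * prior
    total = post.sum()
    if total <= 0:
        raise ValueError("posterior has no mass; prior and likelihood do not overlap")
    post /= total

    # First and second moments give a position estimate and its uncertainty.
    mean = post @ pix
    centered = pix - mean
    cov = (centered * post[:, None]).T @ centered
    return mean, cov

def gaussian_blob(shape, center, sigma):
    """Synthetic heatmap peak at center = (x, y), for illustration only."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))

# Ambiguous heatmap with two equally strong peaks (e.g. two similar fingertips).
heat = gaussian_blob((32, 32), (8, 8), 1.5) + gaussian_blob((32, 32), (24, 24), 1.5)
mean, cov = fuse_keypoint(heat, np.array([23.0, 25.0]), 9.0 * np.eye(2))
```

In this toy case the heatmap alone cannot decide between the two peaks, while the prior places almost all posterior mass on the peak near (24, 24). The returned covariance is the kind of per-keypoint uncertainty that a later optimization stage could use as a weight.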

Key results

Efficient synthetic-to-real dataset generation without manual annotation
Lightweight ViT-based keypoint detection network for real-time processing
IMU-aided probabilistic inference for robust keypoint estimation
Factor graph optimization refining pose with anatomical constraints
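
The last item can be illustrated with a toy least-squares problem. This is only a sketch under strong assumptions, not the paper's factor graph: two 2D keypoints, an IMU prior factor on both, a vision factor on one of them, and a fixed bone length as the only anatomical constraint. It uses a plain Gauss-Newton loop with a finite-difference Jacobian instead of a factor graph library, and the noise scales `s_imu`, `s_vis`, `s_len` are hypothetical.

```python
import numpy as np

def residuals(x, x_imu, z_vis, bone_len, s_imu=1.0, s_vis=0.2, s_len=0.05):
    """Stacked, noise-weighted residuals for keypoints p_a = x[:2], p_b = x[2:]."""
    p_a, p_b = x[:2], x[2:]
    r_prior = (x - x_imu) / s_imu            # IMU-propagated prior on both points
    r_vis = (p_b - z_vis) / s_vis            # vision measurement of p_b only
    r_len = np.array([(np.linalg.norm(p_b - p_a) - bone_len) / s_len])  # bone length
    return np.concatenate([r_prior, r_vis, r_len])

def gauss_newton(x0, args, iters=20, eps=1e-6):
    """Minimize the sum of squared residuals with a numerical Jacobian."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        r = residuals(x, *args)
        jac = np.empty((r.size, x.size))
        for k in range(x.size):
            dx = np.zeros_like(x)
            dx[k] = eps
            jac[:, k] = (residuals(x + dx, *args) - r) / eps
        # Solve the linearized least-squares problem jac @ step = -r.
        step = np.linalg.lstsq(jac, -r, rcond=None)[0]
        x = x + step
        if np.linalg.norm(step) < 1e-9:
            break
    return x

# Hypothetical values: the IMU estimate of p_b has drifted, vision sees p_b near (4, 0).
x_imu = np.array([0.0, 0.0, 4.5, 0.5])
z_vis = np.array([4.0, 0.0])
bone_len = 4.0
x_hat = gauss_newton(x_imu, (x_imu, z_vis, bone_len))
```

With these weights the refined `p_b` moves from the drifted IMU estimate toward the vision measurement, and the distance between the two keypoints stays close to the assumed bone length.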

Why it matters

Provides a reliable, marker-free tracking solution essential for high-quality demonstration data acquisition and teleoperation in dexterous robotic manipulation.

Abstract

Hand tracking plays a key role in capturing and transferring dexterous human manipulation skills to robots. However, achieving reliable tracking across diverse conditions and during complex interactions (e.g., object manipulation) remains challenging. A promising solution is to combine wearable sensors such as IMUs with vision, where previous studies have handled the vision input by attaching markers to wearables or by relying on depth data to avoid the domain gap in color images. In this work, we present a hand tracking framework that fuses inertial measurements with state-of-the-art vision methods, eliminating the need for markers while fully exploiting visual cues. For this, we introduce a dataset generation scheme that produces synthetic and real data for the target glove using a compact setup, without manual annotation. Using the dataset, we train the keypoint detection network that predicts the likelihood of an image for keypoints, designed based on a lightweight vision transformer (ViT) for real-time usage. Based on the network prediction, the IMU-propagated pose is used as a prior in probabilistic inference to estimate the keypoint positions and uncertainties. Tracking primarily relies on high-rate IMU updates for fast motion estimation, while the pose is corrected through factor graph optimization. The framework is validated in challenging scenarios, demonstrating its robustness and accuracy, and can be used for high-quality demonstration data acquisition and teleoperation for dexterous manipulation.

Index terms

Sensor Fusion, Human Detection and Tracking, Multifingered Hands