Robust Hand Tracking from Visual-Inertial Fusion
Hyelim Choi, Hyunreal Park, Harim Ji, Somang Lee, Youngseon Lee, Yongseok Lee, Dongjun Lee
AI summary
Problem
Hand tracking suffers from IMU drift and from visual occlusion during object manipulation. Prior IMU-vision fusion methods handle the vision input by attaching markers to the wearable or by relying on depth data to avoid the domain gap in RGB imagery, and building training data typically requires costly manual annotation.
Approach
The framework combines glove-mounted IMU data with RGB camera images. A lightweight vision transformer predicts keypoint likelihoods from each image, and the IMU-propagated pose serves as a prior in probabilistic inference that estimates keypoint positions and their uncertainties. Tracking relies mainly on high-rate IMU updates, and the pose is corrected through factor graph optimization.
Key results
- Efficient synthetic-to-real dataset generation without manual annotation
- Lightweight ViT-based keypoint detection network for real-time processing
- IMU-aided probabilistic inference for robust keypoint estimation
- Factor graph optimization refining pose with anatomical constraints
Why it matters
Provides a reliable, marker-free tracking solution essential for high-quality demonstration data acquisition and teleoperation in dexterous robotic manipulation.
Abstract
Hand tracking plays a key role in capturing and transferring dexterous human manipulation skills to robots. However, achieving reliable tracking across diverse conditions and during complex interactions (e.g., object manipulation) remains challenging. A promising solution is to combine wearable sensors such as IMUs with vision, where previous studies have handled the vision input by attaching markers to wearables or by relying on depth data to avoid the domain gap in color images. In this work, we present a hand tracking framework that fuses inertial measurements with state-of-the-art vision methods, eliminating the need for markers while fully exploiting visual cues. To this end, we introduce a dataset generation scheme that produces synthetic and real data for the target glove using a compact setup, without manual annotation. Using the dataset, we train a keypoint detection network, built on a lightweight vision transformer (ViT) for real-time use, that predicts the likelihood of an image given the keypoints. Based on the network prediction, the IMU-propagated pose is used as a prior in probabilistic inference to estimate the keypoint positions and uncertainties. Tracking primarily relies on high-rate IMU updates for fast motion estimation, while the pose is corrected through factor graph optimization. The framework is validated in challenging scenarios, demonstrating its robustness and accuracy, and can be used for high-quality demonstration data acquisition and teleoperation for dexterous manipulation.
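The prior-times-likelihood step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes, for illustration only, that the network output is a per-pixel likelihood heatmap for a single keypoint and that the IMU-propagated pose projects to a 2D Gaussian prior in the image. The function name `fuse_heatmap_with_prior` and all numeric values are hypothetical.

```python
import numpy as np


def fuse_heatmap_with_prior(likelihood, prior_mean, prior_cov):
    """Grid-based Bayes update for one 2D keypoint (illustrative assumption).

    likelihood: (H, W) non-negative heatmap, assumed proportional to the
                image likelihood at each pixel location.
    prior_mean: (x, y) pixel location predicted from the IMU-propagated pose.
    prior_cov:  2x2 covariance of that prediction, in pixels^2.
    Returns the posterior mean (x, y) and the 2x2 posterior covariance.
    """
    h, w = likelihood.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

    # Log of the Gaussian prior, evaluated at every pixel (up to a constant).
    diff = pixels - np.asarray(prior_mean, dtype=float)
    inv_cov = np.linalg.inv(np.asarray(prior_cov, dtype=float))
    log_prior = -0.5 * np.einsum("ni,ij,nj->n", diff, inv_cov, diff)

    # Posterior is proportional to likelihood times prior; work in log space.
    log_post = log_prior + np.log(likelihood.ravel() + 1e-12)
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()

    # First and second moments give the keypoint estimate and its uncertainty.
    mean = post @ pixels
    centered = pixels - mean
    cov = (centered * post[:, None]).T @ centered
    return mean, cov


def gaussian_blob(shape, center, sigma):
    """Synthetic heatmap peak used only for the demo below."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-0.5 * d2 / sigma**2)


# Ambiguous heatmap with two equally strong peaks (e.g., two similar fingertips).
heatmap = gaussian_blob((40, 40), (10, 10), 2.0) + gaussian_blob((40, 40), (30, 30), 2.0)

# Hypothetical IMU-based prediction close to the second peak.
mean, cov = fuse_heatmap_with_prior(heatmap, prior_mean=(28.0, 31.0), prior_cov=25.0 * np.eye(2))
print("posterior mean (x, y):", mean)
print("posterior covariance:\n", cov)
```

In this toy setup the prior resolves the ambiguity between the two peaks, and the posterior covariance is tighter than the prior covariance. When the heatmap is flat, as under occlusion, the estimate falls back to the IMU prediction.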