Pose Retargeting from a Single RGB Camera: Optimization-Based Hand Pose Retargeting and Wrist Pose Estimation
Longrui Chen, Lipeng Chen, Kunpeng Yao, Mehmet R Dogar
AI summary
Problem
Accurate wrist pose estimation and general teleoperation typically require depth sensors, specialized hardware, or extensive neural network training, limiting accessibility and scalability.
Approach
The authors propose OAT, a lightweight pipeline that uses MediaPipe for hand joint detection and solves a 2D/3D projection optimization problem to infer wrist pose and retarget hand motions to any robot.
Key results
- Achieves 30 Hz real-time control on standard laptops without depth sensors or GPU dependency
- Introduces a grasp stabilization mechanism that eliminates wrist drift during pick-and-place tasks
- Enables camera intrinsic self-calibration by jointly optimizing wrist position and intrinsics from a single hand image
- Demonstrates superior accuracy and lower latency compared to state-of-the-art vision-based teleoperation methods
Why it matters
Provides a low-cost, hardware-agnostic, and easily deployable teleoperation solution for researchers and practitioners collecting imitation learning data or controlling robots remotely.
Abstract
Robot teleoperation plays a crucial role in collecting data for large-scale imitation learning. Inferring the operator’s hand pose is essential for vision-based teleoperation, and current solutions rely on either additional neural network training or extra hardware to infer the operator’s wrist pose. To our knowledge, there is no open-source, general teleoperation toolkit that can be easily deployed to retarget both hand and wrist poses from a single RGB camera. In this paper, we propose OAT (Optimization-based hAnd pose retargeting and wrisT pose estimation), a streamlined approach to retarget human hand and wrist poses to the robot. We leverage the off-the-shelf MediaPipe framework to estimate the operator’s hand pose and employ an optimization-based method to infer the operator’s wrist pose within the camera frame by 2D/3D hand joint matching. This integrated pipeline facilitates teleoperation from virtually any location using any device equipped with an RGB camera, offering a highly accessible and easily implementable solution. Furthermore, a hand-based camera calibration optimization is proposed to improve the accuracy of wrist pose estimation. In addition to minimal hardware requirements and deployment convenience, our system also demonstrates superior real-time performance compared to state-of-the-art vision-based teleoperation methods.
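The idea of 2D/3D hand joint matching can be illustrated with a small sketch. This is not the authors' implementation: it assumes the camera intrinsics (fx, fy, cx, cy) are known, that the 3D joint offsets relative to the wrist are already expressed in camera-aligned axes (so only the wrist translation is unknown), and it minimizes a linearized (algebraic) reprojection residual rather than whatever objective the paper uses. The function names `project`, `solve_3x3`, and `estimate_wrist_translation` are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def solve_3x3(a, b):
    """Solve the 3x3 system a * t = b by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    t = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = sum(m[r][c] * t[c] for c in range(r + 1, 3))
        t[r] = (m[r][3] - s) / m[r][r]
    return t


def estimate_wrist_translation(joints_3d, joints_2d, fx, fy, cx, cy):
    """Least-squares wrist position t so that project(p_i + t) matches (u_i, v_i).

    joints_3d: joint offsets p_i relative to the wrist, in metres (assumed
               camera-aligned axes, e.g. from a hand-pose estimator).
    joints_2d: matching pixel coordinates (u_i, v_i).
    """
    rows, rhs = [], []
    for (px, py, pz), (u, v) in zip(joints_3d, joints_2d):
        du, dv = u - cx, v - cy
        # From du * (pz + tz) = fx * (px + tx):  fx*tx - du*tz = du*pz - fx*px
        rows.append([fx, 0.0, -du])
        rhs.append(du * pz - fx * px)
        # From dv * (pz + tz) = fy * (py + ty):  fy*ty - dv*tz = dv*pz - fy*py
        rows.append([0.0, fy, -dv])
        rhs.append(dv * pz - fy * py)
    # Normal equations (A^T A) t = A^T b for the stacked linear system.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve_3x3(ata, atb)


if __name__ == "__main__":
    # Synthetic check with made-up intrinsics and joint offsets (illustrative only).
    fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
    offsets = [(0.0, 0.0, 0.0), (0.03, -0.02, 0.01), (0.05, -0.06, -0.01),
               (-0.02, -0.08, 0.02), (0.0, -0.09, 0.0)]
    wrist = (0.05, -0.02, 0.5)
    pixels = [project((x + wrist[0], y + wrist[1], z + wrist[2]), fx, fy, cx, cy)
              for x, y, z in offsets]
    print(estimate_wrist_translation(offsets, pixels, fx, fy, cx, cy))
```

With noise-free synthetic data the recovered translation matches the wrist position used to generate the pixels. On real detections the estimate is only as good as the 2D landmarks, the assumed joint offsets, and the intrinsics, which is presumably why the abstract pairs wrist estimation with a hand-based calibration step.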