ICRA 2026

Pose Retargeting from a Single RGB Camera: Optimization-Based Hand Pose Retargeting and Wrist Pose Estimation

Longrui Chen, Lipeng Chen, Kunpeng Yao, Mehmet R Dogar


AI summary

OAT enables accurate, real-time robot teleoperation using only a single RGB camera by combining off-the-shelf hand tracking with optimization-based wrist pose estimation and grasp stabilization.
Robot teleoperation, Monocular RGB, Wrist pose estimation, Hand retargeting, Optimization-based tracking, Grasp stabilization

Problem

Accurate wrist pose estimation and general teleoperation typically require depth sensors, specialized hardware, or extensive neural network training, limiting accessibility and scalability.

Approach

The authors propose OAT, a lightweight pipeline that uses MediaPipe for hand joint detection and solves a 2D/3D projection optimization problem to infer wrist pose and retarget hand motions to any robot.
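A 2D/3D matching problem of this kind is commonly written as a reprojection-error minimization. The form below is a generic sketch and not necessarily the authors' exact cost. The notation is introduced here for illustration: $R$ and $\mathbf{t}$ are the wrist rotation and translation in the camera frame, $K$ is the camera intrinsic matrix, $\mathbf{p}_i$ is the 3D position of joint $i$ in the hand frame, $\mathbf{x}_i$ is its detected 2D pixel location, and $N$ is the number of joints (MediaPipe reports 21 hand landmarks).

```latex
\min_{R \in SO(3),\ \mathbf{t} \in \mathbb{R}^3}
\sum_{i=1}^{N}
\left\| \pi\!\left( K \left( R\,\mathbf{p}_i + \mathbf{t} \right) \right) - \mathbf{x}_i \right\|_2^2,
\qquad
\pi\!\left( [a,\ b,\ c]^\top \right) = \left[ a/c,\ b/c \right]^\top
```

Here $\pi$ is the perspective division that turns homogeneous image coordinates into pixel coordinates.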

Key results

  • Achieves 30 Hz real-time control on standard laptops without depth sensors or GPU dependency
  • Introduces a grasp stabilization mechanism that eliminates wrist drift during pick-and-place tasks
  • Enables camera intrinsic self-calibration by jointly optimizing wrist position and intrinsics from a single hand image
  • Demonstrates superior accuracy and lower latency compared to state-of-the-art vision-based teleoperation methods
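The self-calibration result listed above can be illustrated with a toy version of the problem. The sketch below is not the authors' solver. It assumes the 3D hand joints are already expressed in camera-aligned axes (rotation known), a single focal length `f` for both image axes, and a known principal point `(cx, cy)`. Under those assumptions the pinhole equations become linear in `[f, f*tx, f*ty, tz]` and can be solved by ordinary least squares. All function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("singular system: joints do not constrain the unknowns")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def project(joint, t, f, cx, cy):
    """Pinhole projection of a joint translated by t (rotation assumed identity)."""
    X, Y, Z = (joint[k] + t[k] for k in range(3))
    return (f * X / Z + cx, f * Y / Z + cy)


def calibrate_from_hand(joints_3d, pixels, cx, cy):
    """Estimate focal length f and wrist translation t from 2D/3D joint pairs.

    Pinhole model: u - cx = f * (X + tx) / (Z + tz), and likewise for v.
    Multiplying through by the depth (Z + tz) gives equations that are
    linear in the unknown vector [f, f*tx, f*ty, tz].
    """
    s = max(abs(cx), abs(cy), 1.0)  # pixel scale, used only for conditioning
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(joints_3d, pixels):
        du, dv = (u - cx) / s, (v - cy) / s
        rows.append([X, 1.0, 0.0, -du])
        rhs.append(du * Z)
        rows.append([Y, 0.0, 1.0, -dv])
        rhs.append(dv * Z)
    # Least squares via the 4x4 normal equations (A^T A) x = A^T b.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    Atb = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(4)]
    f_scaled, a, b, tz = solve_linear(AtA, Atb)
    return f_scaled * s, (a / f_scaled, b / f_scaled, tz)


# Synthetic check: made-up, non-coplanar joint positions in metres.
joints = [
    (0.00, 0.00, 0.00), (0.03, -0.02, -0.01), (0.05, -0.06, 0.02),
    (0.01, -0.09, -0.03), (-0.02, -0.08, 0.04), (-0.04, -0.05, 0.01),
]
true_f, true_t = 600.0, (0.05, -0.02, 0.5)
pixels = [project(j, true_t, true_f, 320.0, 240.0) for j in joints]
f_est, t_est = calibrate_from_hand(joints, pixels, 320.0, 240.0)
print(f_est, t_est)
```

For this noise-free synthetic input the estimate should match the ground truth up to floating-point error. The formulation depends on depth variation across the joints: if every joint lies at the same depth, focal length and distance cannot be separated and the system becomes singular or ill-conditioned.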

Why it matters

Provides a low-cost, hardware-agnostic, and easily deployable teleoperation solution for researchers and practitioners collecting imitation learning data or controlling robots remotely.

Abstract

Robot teleoperation plays a crucial role in collecting data for large-scale imitation learning. Inferring the operator’s hand pose is crucial for vision-based teleoperation, and current solutions rely on either additional neural network training or additional hardware to infer the operator’s wrist pose. To our knowledge, there is no open-source, general teleoperation toolkit that can be easily deployed to retarget both hand and wrist poses from a single RGB camera. In this paper, we propose OAT (Optimization-based hAnd pose retargeting and wrisT pose estimation), a streamlined approach to retarget human hand and wrist pose to the robot. We leverage the off-the-shelf MediaPipe framework to estimate the operator’s hand pose and employ an optimization-based method to infer the operator’s wrist pose within the camera frame by 2D/3D hand joint matching. This integrated pipeline facilitates teleoperation from virtually any location using any device equipped with an RGB camera, offering a highly accessible and easily implementable solution. Furthermore, a hand-based camera calibration optimization is proposed to improve the accuracy of wrist pose estimation. In addition to minimal hardware requirements and deployment convenience, our system also demonstrates superior real-time performance compared to state-of-the-art vision-based teleoperation methods.
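As a much simpler stand-in for the optimization-based hand retargeting described in the abstract, the sketch below maps the thumb-to-index pinch distance to a parallel-gripper opening with a clamped linear rule. It is an illustration only, not the authors' method. The function name and the `max_pinch` and `max_width` defaults are hypothetical. In MediaPipe's hand model, the thumb tip and index fingertip are landmarks 4 and 8.

```python
import math


def pinch_to_gripper_width(thumb_tip, index_tip, max_pinch=0.10, max_width=0.08):
    """Map thumb-index distance (metres) to a gripper opening (metres).

    max_pinch: hypothetical hand opening that corresponds to a fully open gripper.
    max_width: hypothetical maximum jaw opening of the robot gripper.
    """
    distance = math.dist(thumb_tip, index_tip)          # Euclidean distance
    ratio = min(max(distance / max_pinch, 0.0), 1.0)    # clamp to [0, 1]
    return ratio * max_width


# Example with made-up fingertip positions 5 cm apart.
width = pinch_to_gripper_width((0.0, 0.0, 0.0), (0.05, 0.0, 0.0))
print(width)
```

A multi-fingered robot hand would need a richer mapping, such as per-finger joint targets, which is where an optimization-based formulation becomes useful.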

Index terms

Telerobotics and Teleoperation; Grasping; Visual Tracking
