ICRA 2026

DemoBot: Efficient Learning of Bimanual Manipulation with Dexterous Hands from Third-Person Human Videos

Yucheng Xu, Xiaofeng Mao, Elle Miller, Yang Li, Xinyu Yi, Zhibin (Alex) Li, Robert Fisher

AI summary

A robot can master complex, long-horizon bimanual dexterous tasks from a single unannotated human video by refining extracted motion priors through a corrective residual reinforcement learning pipeline.

Bimanual manipulation, Learning from demonstration, Residual reinforcement learning, Dexterous hands, Video-to-robot transfer, Long-horizon tasks

Problem

Learning scalable robot skills from internet-scale human videos is hindered by the lack of specialized hardware, embodiment mismatches, and the difficulty of extracting robust, physically accurate policies from noisy, unannotated RGB-D footage. Specifically, acquiring long-horizon bimanual dexterous manipulation from just one video remains an open challenge due to temporal misalignment and missing physical dynamics.

Approach

DemoBot extracts structured 3D hand-object motion priors from a single RGB-D video and feeds them into a corrective residual reinforcement learning framework. The agent learns local policy corrections to account for physical dynamics, guided by temporal-segment training, success-gated environment resets, and an event-driven reward curriculum.
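
As a rough illustration of the residual idea, the sketch below adds a bounded learned correction to a reference action taken from the video-derived motion prior. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the per-dimension clipping, and the `max_residual` bound are all hypothetical, and the paper may parameterize the correction differently.

```python
from typing import List, Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def compose_action(
    prior: Sequence[float],
    residual: Sequence[float],
    max_residual: float = 0.05,  # assumed bound; not taken from the paper
) -> List[float]:
    """Return the prior action plus a per-dimension clipped correction.

    `prior` stands in for the reference command from the extracted
    trajectory; `residual` stands in for the output of the learned policy.
    """
    if len(prior) != len(residual):
        raise ValueError("prior and residual must have the same length")
    return [
        p + clip(r, -max_residual, max_residual)
        for p, r in zip(prior, residual)
    ]
```

Bounding the correction keeps the executed motion close to the demonstration, so the policy only has to learn local adjustments for contact and dynamics rather than the whole trajectory.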

Key results

Robust pipeline converting unannotated RGB-D videos into high-quality 3D motion priors in minutes
Novel residual RL framework with temporal segmentation, success-gated resets, and event-driven rewards
Successful learning of long-horizon synchronous and asynchronous bimanual assembly tasks
First demonstration of efficient bimanual dexterous skill acquisition from a single visual demonstration
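
One plausible reading of success-gated resets is sketched below: episodes start at the earliest task segment that is not yet reliable, with occasional restarts at an earlier segment so that acquired skills keep being refined. This is an assumption-laden sketch; the `gate` and `revisit_prob` parameters, the per-segment success rates, and the selection rule are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Optional


def choose_reset_segment(
    success_rates: List[float],
    gate: float = 0.8,          # assumed success rate needed to unlock the next segment
    revisit_prob: float = 0.2,  # assumed chance of restarting at an earlier segment
    rng: Optional[random.Random] = None,
) -> int:
    """Return the index of the segment at which the next episode starts."""
    if not success_rates:
        raise ValueError("success_rates must not be empty")
    rng = rng or random.Random()

    # The frontier is the first segment whose success rate is below the gate.
    # If every segment passes the gate, the frontier is the last segment.
    frontier = len(success_rates) - 1
    for i, rate in enumerate(success_rates):
        if rate < gate:
            frontier = i
            break

    # Occasionally revisit a segment before the frontier to keep refining it.
    if frontier > 0 and rng.random() < revisit_prob:
        return rng.randrange(frontier)
    return frontier
```

With `success_rates=[0.9, 0.95, 0.3, 0.0]` and `revisit_prob=0.0`, the function returns segment 2, the first stage that has not yet passed the gate.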

Why it matters

Provides a scalable, hardware-light pathway for general-purpose robots to acquire complex manipulation skills directly from massive internet-scale video datasets.

Abstract

This work presents DemoBot, a learning framework that enables a dual-arm, multi-finger robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that learns to refine them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) Temporal-segment-based RL to enforce temporal alignment of the current state with demonstrations; (2) a Success-Gated Reset strategy to balance the refinement of readily acquired skills and the exploration of subsequent task stages; and (3) an Event-Driven Reward curriculum with adaptive thresholding to guide the learning of high-precision manipulation. The novel video processing and RL framework successfully achieved long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos. Visual materials are available on our project website: https://demobot-seed.github.io/
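
To make the idea of an event-driven reward with adaptive thresholding more concrete, the sketch below pays a sparse bonus when a task event occurs (here, an alignment error dropping below a tolerance) and tightens that tolerance once the event is achieved reliably. Every name and number is an assumption for illustration: the abstract does not specify the events, the schedule, or the reward values.

```python
class AdaptiveThreshold:
    """Tolerance that shrinks toward a floor as the success rate improves."""

    def __init__(
        self,
        start: float = 0.05,       # assumed initial tolerance
        floor: float = 0.005,      # assumed tightest tolerance
        shrink: float = 0.8,       # assumed multiplicative tightening factor
        target_rate: float = 0.8,  # assumed success rate that triggers tightening
    ) -> None:
        self.value = start
        self.floor = floor
        self.shrink = shrink
        self.target_rate = target_rate

    def update(self, success_rate: float) -> float:
        """Tighten the tolerance if the recent success rate meets the target."""
        if success_rate >= self.target_rate:
            self.value = max(self.floor, self.value * self.shrink)
        return self.value


def event_reward(error: float, threshold: float, bonus: float = 1.0) -> float:
    """Return the bonus if the event fires (error within tolerance), else 0."""
    return bonus if error <= threshold else 0.0
```

A loose initial tolerance lets the agent collect reward early; tightening it only after the event fires reliably pushes the policy toward higher precision without starving it of learning signal.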

Index terms

Learning from Demonstration, Reinforcement Learning, Dexterous Manipulation