Deep Sensorimotor Control by Imitating Predictive Models of Human Motion
Himanshu Gaurav Singh, Pieter Abbeel, Jitendra Malik, Antonio Loquercio
AI summary
Problem
Leveraging large-scale human interaction datasets for robot learning is hindered by the need for per-sample kinematic retargeting, accurate environment replicas, or unstable adversarial losses, making it difficult to scale.
Approach
Train a causal transformer on human data to predict future hand keypoints from scene observations, then apply it zero-shot to robot states and use reinforcement learning to train robot policies that track its predictions while optimizing a relatively sparse task reward.
Key results
- Eliminates gradient-based kinematic retargeting and adversarial losses
- Enables zero-shot transfer of a single motion predictor across diverse robots and tasks
- Substitutes dense reward engineering with a simple keypoint tracking reward
- Outperforms demonstration-guided RL baselines by a large margin
Why it matters
Provides a scalable pathway to leverage massive human interaction datasets for training dexterous robot manipulation policies without manual reward design.
Abstract
As the embodiment gap between a robot and a human narrows, new opportunities arise to leverage datasets of humans interacting with their surroundings for robot learning. We propose a novel technique for training sensorimotor policies with reinforcement learning by imitating predictive models of human motions. Our key insight is that the motion of keypoints on human-inspired robot end-effectors closely mirrors the motion of corresponding human body keypoints. This enables us to take a model trained on human data to predict future motion and apply it zero-shot to robot data. We train sensorimotor policies to track the predictions of such a model, conditioned on a history of past robot states, while optimizing a relatively sparse task reward. This approach entirely bypasses gradient-based kinematic retargeting and adversarial losses, which prevent existing methods from fully leveraging the scale and diversity of modern human-scene interaction datasets. Empirically, we find that our approach can work across robots and tasks, outperforming existing baselines by a large margin. In addition, we find that tracking a human motion model can substitute for carefully designed dense rewards and curricula in manipulation tasks.
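To make the training signal described above concrete, the following is a minimal sketch of how a keypoint tracking term might be combined with a sparse task reward. The abstract does not specify the reward's functional form, so everything here is an assumption made for illustration: the exponential kernel, the function names `tracking_reward` and `total_reward`, the `scale` parameter, and the weights `w_track` and `w_task` are all hypothetical. The predicted keypoints would come from the human motion model; here they are passed in as a plain array.

```python
import numpy as np


def tracking_reward(robot_kp, predicted_kp, scale=10.0):
    """Hypothetical tracking term: close to 1 when robot keypoints match the prediction.

    robot_kp, predicted_kp: arrays of shape (num_keypoints, 3), in the same frame.
    scale: assumed sharpness of the exponential kernel (not from the paper).
    """
    robot_kp = np.asarray(robot_kp, dtype=float)
    predicted_kp = np.asarray(predicted_kp, dtype=float)
    if robot_kp.shape != predicted_kp.shape:
        raise ValueError("robot and predicted keypoints must have the same shape")
    # Euclidean distance per keypoint, averaged over all keypoints.
    mean_dist = np.linalg.norm(robot_kp - predicted_kp, axis=-1).mean()
    return float(np.exp(-scale * mean_dist))


def total_reward(robot_kp, predicted_kp, task_success, w_track=1.0, w_task=10.0):
    """Assumed combination: weighted tracking term plus a sparse success bonus."""
    sparse_term = 1.0 if task_success else 0.0
    return w_track * tracking_reward(robot_kp, predicted_kp) + w_task * sparse_term


if __name__ == "__main__":
    # Placeholder keypoints standing in for a motion-model prediction.
    predicted = np.zeros((5, 3))
    robot = predicted + 0.01  # robot keypoints slightly off the prediction
    print(total_reward(robot, predicted, task_success=False))
    print(total_reward(predicted, predicted, task_success=True))
```

In this sketch the tracking term supplies a learning signal at every step, while the sparse term only pays out on task success, which is one plausible reading of how tracking a motion model could stand in for a hand-designed dense reward.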