OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-To-Robot Action Transfer
Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo, Daniel Seita, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue
AI summary
Problem
Current human-to-robot imitation methods often fail to filter out background distractions, lack the rich 3D geometry needed to capture object interactions, or require costly teleoperation. Vision alone also cannot perceive tactile properties such as texture and weight.
Approach
OCRA extracts object-centric 3D point clouds from multi-view human demonstration videos, fuses them with tactile priors via a ResFiLM module, and conditions a diffusion policy to generate precise manipulation actions.
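As a rough illustration of the object-centric step, the sketch below assumes each view already provides a per-pixel 3D point map in a shared world frame, plus a binary object mask from a segmentation model. The function name and data layout are hypothetical and are not taken from the OCRA implementation.

```python
def object_centric_cloud(point_maps, masks):
    """Merge masked 3D points from several views into one point cloud.

    point_maps: one H x W grid of (x, y, z) tuples per view (assumed to be
                expressed in a shared world frame).
    masks:      one H x W grid of booleans per view, True on object pixels.
    """
    cloud = []
    for point_map, mask in zip(point_maps, masks):
        for point_row, mask_row in zip(point_map, mask):
            for point, keep in zip(point_row, mask_row):
                if keep:  # background pixels are dropped
                    cloud.append(point)
    return cloud


# Two toy 1x2 views; each one keeps exactly one object pixel.
views = [[[(0.0, 0.0, 1.0), (5.0, 5.0, 9.0)]],
         [[(7.0, 7.0, 9.0), (0.1, 0.0, 1.0)]]]
masks = [[[True, False]], [[False, True]]]
cloud = object_centric_cloud(views, masks)
```

Only points that fall inside an object mask survive, so the resulting cloud contains the task-relevant objects and none of the background.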
Key results
- Extracts object-centric 3D representations directly from multi-view human videos without teleoperation
- Pretrains a tactile encoder on a novel dataset of over one million tactile images
- Fuses visual and tactile priors to accurately perceive object properties like texture and weight
- Outperforms baselines on 7 vision-only and visuo-tactile manipulation tasks
Why it matters
Provides a scalable, low-cost framework for teaching robots complex manipulation skills directly from human videos, advancing practical imitation learning.
Abstract
We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
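The abstract does not specify how ResFiLM is built. One plausible reading, assumed here purely for illustration, is feature-wise linear modulation (FiLM) with a residual connection: the tactile feature predicts a per-channel scale and shift that are applied to the visual feature, and the result is added back to it. The function names, shapes, and residual form below are assumptions and may differ from the authors' module.

```python
def linear(weights, bias, x):
    """Dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def res_film(visual, tactile, w_gamma, b_gamma, w_beta, b_beta):
    """Residual FiLM sketch: out = visual + (gamma * visual + beta).

    gamma and beta are predicted from the tactile feature, so touch
    modulates each visual channel (assumed form, not the paper's).
    """
    gamma = linear(w_gamma, b_gamma, tactile)
    beta = linear(w_beta, b_beta, tactile)
    return [v + g * v + b for v, g, b in zip(visual, gamma, beta)]


visual = [1.0, 2.0]   # toy 2-channel visual feature
tactile = [1.0]       # toy 1-dimensional tactile feature
fused = res_film(visual, tactile,
                 w_gamma=[[0.5], [0.0]], b_gamma=[0.0, 0.0],
                 w_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0])
```

With all modulation weights set to zero the module returns the visual feature unchanged, which is one common motivation for a residual form: tactile input can refine the visual representation without overwriting it.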