ICRA 2026

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-To-Robot Action Transfer

Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo, Daniel Seita, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

AI summary

Fusing object-centric 3D visual priors with tactile data from human videos enables robust human-to-robot action transfer that significantly outperforms vision-only baselines.

Object-centric learning, human-to-robot transfer, tactile sensing, 3D reconstruction, diffusion policy, imitation learning

Problem

Current human-to-robot imitation methods often fail to filter out background distractions, lack the rich 3D geometry needed to capture object interactions, or require costly teleoperation. Vision alone also cannot perceive critical tactile properties such as texture and weight.

Approach

OCRA extracts object-centric 3D point clouds from multi-view human demonstration videos, fuses them with tactile priors via a ResFiLM module, and conditions a diffusion policy to generate precise manipulation actions.
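The fusion step can be illustrated with a small sketch. This summary does not describe the internals of ResFiLM, so the code below assumes it is a FiLM-style layer (feature-wise scale and shift predicted from the tactile embedding) with a residual skip connection. The function name `res_film`, the feature sizes, and the linear projections are hypothetical and chosen only for illustration.

```python
import numpy as np


def res_film(visual_feat, tactile_feat, w_gamma, b_gamma, w_beta, b_beta):
    """Residual FiLM fusion (assumed form, not the paper's exact module).

    visual_feat:  (batch, d_vis) features from the 3D point-cloud encoder
    tactile_feat: (batch, d_tac) features from the tactile encoder
    Returns:      (batch, d_vis) fused features
    """
    # Predict a per-channel scale and shift from the tactile features.
    gamma = tactile_feat @ w_gamma + b_gamma
    beta = tactile_feat @ w_beta + b_beta
    # FiLM modulation plus a skip connection that preserves the visual signal.
    return visual_feat + gamma * visual_feat + beta


rng = np.random.default_rng(0)
batch, d_vis, d_tac = 4, 64, 32  # hypothetical sizes
visual = rng.normal(size=(batch, d_vis))
tactile = rng.normal(size=(batch, d_tac))

# With zero-initialised projections the layer reduces to the identity,
# so the policy initially sees the unmodified visual features.
w_gamma = np.zeros((d_tac, d_vis))
b_gamma = np.zeros(d_vis)
w_beta = np.zeros((d_tac, d_vis))
b_beta = np.zeros(d_vis)

fused = res_film(visual, tactile, w_gamma, b_gamma, w_beta, b_beta)
```

In this assumed form, the fused vector would then serve as the conditioning input to the diffusion policy. The zero initialisation is a common design choice for residual modulation layers, not something stated in the source.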

Key results

Extracts object-centric 3D representations directly from multi-view human videos without teleoperation
Pretrains a tactile encoder on a novel dataset of over one million tactile images
Fuses visual and tactile priors to accurately perceive object properties like texture and weight
Outperforms baselines on 7 vision-only and visuo-tactile manipulation tasks

Why it matters

Provides a scalable, low-cost framework for teaching robots complex manipulation skills directly from human videos, advancing practical imitation learning.

Abstract

We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
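The object-centric reconstruction step described in the abstract can be sketched as follows. This is a minimal illustration, assuming the 3D model yields a per-pixel 3D point map for each view and the segmentation model yields a binary object mask. It makes no calls to VGGT or to any segmentation library; the arrays are random stand-ins, and the function name `object_centric_points` and the point budget are hypothetical.

```python
import numpy as np


def object_centric_points(point_map, mask, num_points=1024, seed=0):
    """Keep only the 3D points that fall on the segmented object.

    point_map: (H, W, 3) per-pixel 3D coordinates (assumed reconstruction output)
    mask:      (H, W) boolean object mask (assumed segmentation output)
    Returns:   (num_points, 3) fixed-size object point cloud
    """
    points = point_map[mask]  # (N, 3): background pixels are dropped
    if len(points) == 0:
        raise ValueError("mask selects no pixels")
    rng = np.random.default_rng(seed)
    # Sample with replacement only when the object has too few points.
    replace = len(points) < num_points
    idx = rng.choice(len(points), size=num_points, replace=replace)
    return points[idx]


# Toy stand-ins for one camera view.
rng = np.random.default_rng(1)
point_map = rng.normal(size=(48, 64, 3))
mask = np.zeros((48, 64), dtype=bool)
mask[10:30, 20:40] = True  # a 20 x 20 pixel object region

cloud = object_centric_points(point_map, mask, num_points=256)
```

With multiple views, the per-view clouds could be concatenated in a shared world frame before resampling. How OCRA actually merges views is not specified in this summary.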

Index terms

Learning from Demonstration; Perception for Grasping and Manipulation; Sensor Fusion