ICRA 2026

RoboPCA: Pose-Centered Affordance Learning from Human Demonstrations for Robot Manipulation

Zhanqi Xiao, Ruiping Wang, Xilin Chen

AI summary

RoboPCA jointly predicts contact points and poses from human videos using a diffusion model, significantly improving robotic manipulation success rates over existing methods.

Pose-centered affordance, Robot manipulation, Diffusion models, Human demonstrations, 3D pose estimation, Robotic learning

Problem

Current affordance prediction methods separate contact-region localization from pose estimation, and the resulting inconsistencies between predicted contact regions and candidate poses can cause robotic task failures. Learning from human demonstrations is further limited by the lack of reliable 3D and pose annotations.

Approach

The authors introduce Human2Afford to automatically extract contact points and poses from unlabeled human videos, and RoboPCA, a diffusion-based model that jointly predicts these pose-centered affordances conditioned on RGB-D scenes, object masks, and language instructions.
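To make the idea of a diffusion model over a joint contact-and-pose target concrete, here is a minimal sketch of the standard DDPM forward (noising) step applied to a single affordance vector. It is not the authors' implementation: the 9-dimensional layout (3D contact point followed by a 6D rotation representation), the linear noise schedule, and all function names are assumptions made for illustration, and the conditioning on RGB-D, masks, and language is omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noise_affordance(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


# Hypothetical joint target: contact point (x, y, z) in metres, then the
# first two columns of a rotation matrix (a 6D rotation representation).
x0 = [0.42, -0.10, 0.85, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x_t, eps = noise_affordance(x0, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

In a setup like this, a denoising network would be trained to recover eps from x_t, the timestep, and the scene conditioning. Because the contact point and the orientation sit in one vector, they are denoised together rather than by separate models.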

Key results

Automated Human2Afford pipeline for extracting pose-centered affordances from human videos
Diffusion-based RoboPCA framework for joint contact point and pose prediction
18.6% improvement in contact point prediction on AGD20K dataset
Manipulation success rate gains of 38.5% in simulation and 24.9% in real-world tests

Why it matters

Enables robots to learn reliable, pose-accurate manipulation skills from abundant human videos without costly 3D annotations, advancing general-purpose robotic manipulation.

Abstract

Understanding spatial affordances—comprising the contact regions of object interaction and the corresponding contact poses—is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object’s mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand–object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry–appearance cues through an RGB-D encoder and incorporating mask-enhanced features to emphasize task-relevant object regions into the diffusion-based framework, RoboPCA outperforms baseline methods on image datasets, simulation, and real robots, and exhibits strong generalization across tasks and categories.

Index terms

Learning from Demonstration; Perception for Grasping and Manipulation