Multi-Keypoint Affordance Representation for Functional Dexterous Grasping
Kailun Yang, Zhiyong Li, and Yaonan Wang
AI summary
Problem
Existing affordance methods only predict coarse interaction regions, failing to constrain precise dexterous grasping postures and creating a disconnect between visual perception and manipulation.
Approach
The method uses weakly-supervised keypoint localization guided by human interaction images and Large Vision Models, then computes hand-object relative poses via a geometric keypoint transformation to directly drive dexterous grasping.
Key results
- 45.35% improvement in the KLD metric over the state of the art on the FAH dataset
- Direct geometric mapping of visual keypoints to executable dexterous hand poses
- Successful generalization to unseen tools and complex functional tasks, both in simulation and on real robots
- Elimination of manual keypoint annotation costs through weak supervision and LVMs
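For readers unfamiliar with the KLD metric cited above, the sketch below shows one common way Kullback-Leibler divergence is computed between a predicted affordance heatmap and a ground-truth heatmap (lower is better). The function names, the epsilon value, and the nested-list heatmap format are assumptions made for illustration; the exact evaluation protocol used on FAH is not given here.

```python
import math

def normalize(heatmap, eps=1e-12):
    """Flatten a 2D heatmap and scale it so its entries sum to (about) 1."""
    flat = [float(v) for row in heatmap for v in row]
    if any(v < 0 for v in flat):
        raise ValueError("heatmap values must be non-negative")
    total = sum(flat)
    return [v / (total + eps) for v in flat]

def kld(pred, gt, eps=1e-12):
    """KL divergence of the ground-truth map from the predicted map.

    Both maps are normalized to probability distributions first.
    eps guards against division by zero and log(0); it is an assumed value.
    """
    p = normalize(pred, eps)
    q = normalize(gt, eps)
    return sum(qi * math.log(eps + qi / (pi + eps)) for pi, qi in zip(p, q))

# Hypothetical 2x2 heatmaps, for illustration only.
gt_map = [[0.0, 1.0], [1.0, 2.0]]
uniform_pred = [[1.0, 1.0], [1.0, 1.0]]
score_same = kld(gt_map, gt_map)        # close to zero for identical maps
score_uniform = kld(uniform_pred, gt_map)  # positive for a mismatched map
```

A prediction that matches the ground truth scores near zero, and the score grows as the predicted mass moves away from the annotated region.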
Why it matters
Enables robots to perform complex, task-specific dexterous manipulations reliably without costly manual annotations, bridging the vision-to-action gap.
Abstract
Functional dexterous grasping requires precise hand-object interaction, going beyond simple gripping. Existing affordance-based methods primarily predict coarse interaction regions and cannot directly constrain the grasping posture, leading to a disconnection between visual perception and manipulation. To address this issue, we propose a multi-keypoint affordance representation for functional dexterous grasping, which directly encodes task-driven grasp configurations by localizing functional contact points. Our method introduces Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping experience images for weak supervision combined with Large Vision Models for fine affordance feature extraction, achieving generalization while avoiding manual keypoint annotations. Additionally, we present a Keypoint-based Grasp matrix Transformation (KGT) method, ensuring spatial consistency between hand keypoints and object contact points, thus providing a direct link between visual perception and dexterous grasping actions. Experiments on public real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks demonstrate that our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks, bridging the gap between visual affordance learning and dexterous robotic manipulation. The source code and demo videos are publicly available at https://github.com/PopeyePxx/MKA.
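To illustrate how localized keypoints can determine a hand-object relative pose, here is a minimal sketch that builds an orthonormal frame from three object contact points and another from three hand keypoints, then composes the rigid transform that makes the two frames coincide. This is a generic three-point frame construction, not the paper's KGT formulation; the function names, the choice of three keypoints, and the sample coordinates are all assumptions.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-9:
        raise ValueError("keypoints are coincident or collinear")
    return [c / n for c in v]

def frame_from_keypoints(p0, p1, p2):
    """Return (R, t): the columns of R are the frame axes, t is the origin p0."""
    x = _unit(_sub(p1, p0))               # x axis points from p0 to p1
    z = _unit(_cross(x, _sub(p2, p0)))    # z axis is normal to the keypoint plane
    y = _cross(z, x)                      # y axis completes a right-handed frame
    return [[x[i], y[i], z[i]] for i in range(3)], list(p0)

def relative_pose(obj_kps, hand_kps):
    """Rigid transform (R, t) taking hand coordinates to object coordinates
    so that the hand keypoint frame lands on the object contact frame."""
    Ro, to = frame_from_keypoints(*obj_kps)
    Rh, th = frame_from_keypoints(*hand_kps)
    # R = Ro * Rh^T
    R = [[sum(Ro[i][k] * Rh[j][k] for k in range(3)) for j in range(3)]
         for i in range(3)]
    # t = to - R * th
    t = [to[i] - sum(R[i][j] * th[j] for j in range(3)) for i in range(3)]
    return R, t

def apply_pose(R, t, p):
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

# Hypothetical keypoints: the object triangle is the hand triangle rotated
# 90 degrees about z and shifted by (1, 2, 3).
hand_kps = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
obj_kps = [(1.0, 2.0, 3.0), (1.0, 3.0, 3.0), (0.0, 2.0, 3.0)]
R, t = relative_pose(obj_kps, hand_kps)
mapped = [apply_pose(R, t, p) for p in hand_kps]
```

When the two keypoint triangles are congruent, as in this example, every hand keypoint lands exactly on its object contact point. Otherwise only the first keypoint and the frame axes align, and a least-squares fit over all keypoints would be the more robust choice.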