ICRA 2026

Multi-Keypoint Affordance Representation for Functional Dexterous Grasping

Kailun Yang, Zhiyong Li, and Yaonan Wang

AI summary

A multi-keypoint affordance framework directly links visual perception to precise dexterous grasping postures, significantly improving grasp consistency and generalization.

dexterous grasping, affordance representation, multi-keypoint learning, vision-to-action, robotic manipulation, weakly supervised learning

Problem

Existing affordance methods only predict coarse interaction regions, failing to constrain precise dexterous grasping postures and creating a disconnect between visual perception and manipulation.

Approach

The method uses weakly-supervised keypoint localization guided by human interaction images and Large Vision Models, then computes hand-object relative poses via a geometric keypoint transformation to directly drive dexterous grasping.
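To make the idea of a geometric keypoint transformation concrete, the following is a minimal sketch, not the paper's KGT implementation. It assumes that matched 3D hand keypoints and object contact points are already available and swaps in a standard least-squares rigid alignment (the Kabsch algorithm) to recover a hand-to-object pose. The function name `rigid_transform_from_keypoints` and the point layout are hypothetical.

```python
import numpy as np


def rigid_transform_from_keypoints(hand_pts, obj_pts):
    """Return a 4x4 transform T such that T applied to hand_pts best matches obj_pts.

    Assumption: hand_pts[i] corresponds to obj_pts[i], both of shape (N, 3), N >= 3.
    This is the generic Kabsch alignment, used here only as an illustration.
    """
    hand_pts = np.asarray(hand_pts, dtype=float)
    obj_pts = np.asarray(obj_pts, dtype=float)

    # Center both point sets on their centroids.
    hand_centroid = hand_pts.mean(axis=0)
    obj_centroid = obj_pts.mean(axis=0)

    # 3x3 cross-covariance between the centered sets.
    H = (hand_pts - hand_centroid).T @ (obj_pts - obj_centroid)

    # The SVD of H gives the optimal rotation; the sign term prevents a reflection.
    U, _, Vt = np.linalg.svd(H)
    det = np.linalg.det(Vt.T @ U.T)
    D = np.diag([1.0, 1.0, 1.0 if det >= 0 else -1.0])
    R = Vt.T @ D @ U.T

    # Translation that maps the hand centroid onto the object centroid.
    t = obj_centroid - R @ hand_centroid

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


if __name__ == "__main__":
    # Hypothetical keypoints: four hand points and their targets on the object.
    hand = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
    rot_z_90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
    obj = hand @ rot_z_90.T + np.array([0.1, -0.2, 0.3])
    print(rigid_transform_from_keypoints(hand, obj))
```

A closed-form alignment like this needs at least three non-collinear correspondences, which is one reason a multi-keypoint representation can constrain a grasp pose where a single coarse region cannot.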

Key results

45.35% improvement over the state of the art on the KLD (Kullback-Leibler divergence) metric on the FAH dataset
Direct geometric mapping of visual keypoints to executable dexterous hand poses
Successful generalization to unseen tools and complex functional tasks in simulation and real robots
Elimination of manual keypoint annotation costs through weak supervision and LVMs
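For readers unfamiliar with the KLD metric, the sketch below shows one common way it is computed between a predicted affordance heatmap and a ground-truth heatmap, and how a relative improvement could be derived from two scores. The paper's exact evaluation code may differ. The function names are hypothetical, and reading "improvement" as a relative reduction in KLD (lower is better) is an assumption.

```python
import numpy as np


def kld(pred, gt, eps=1e-12):
    """Kullback-Leibler divergence KL(gt || pred) between two heatmaps.

    Both maps are normalized to sum to 1 so they can be treated as distributions.
    Lower values mean the prediction is closer to the ground truth.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.sum(g * np.log(eps + g / (p + eps))))


def relative_improvement(baseline_score, our_score):
    """Percentage reduction of a lower-is-better score relative to a baseline."""
    return 100.0 * (baseline_score - our_score) / baseline_score


if __name__ == "__main__":
    # Hypothetical 2x2 heatmaps, for illustration only (not data from the paper).
    gt = np.array([[0.0, 0.0], [0.0, 1.0]])
    uniform_pred = np.ones((2, 2))
    sharp_pred = np.array([[0.05, 0.05], [0.05, 0.85]])
    print(kld(uniform_pred, gt), kld(sharp_pred, gt))
    print(relative_improvement(kld(uniform_pred, gt), kld(sharp_pred, gt)))
```

Because KLD penalizes probability mass placed away from the annotated region, a prediction concentrated on the correct contact area scores lower than a diffuse one.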

Why it matters

Enables robots to perform complex, task-specific dexterous manipulations reliably without costly manual annotations, bridging the vision-to-action gap.

Abstract

Functional dexterous grasping requires precise hand-object interaction, going beyond simple gripping. Existing affordance-based methods primarily predict coarse interaction regions and cannot directly constrain the grasping posture, leading to a disconnection between visual perception and manipulation. To address this issue, we propose a multi-keypoint affordance representation for functional dexterous grasping, which directly encodes task-driven grasp configurations by localizing functional contact points. Our method introduces Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping experience images for weak supervision combined with Large Vision Models for fine affordance feature extraction, achieving generalization while avoiding manual keypoint annotations. Additionally, we present a Keypoint-based Grasp matrix Transformation (KGT) method, ensuring spatial consistency between hand keypoints and object contact points, thus providing a direct link between visual perception and dexterous grasping actions. Experiments on public real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks demonstrate that our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks, bridging the gap between visual affordance learning and dexterous robotic manipulation. The source code and demo videos are publicly available at https://github.com/PopeyePxx/MKA.

Index terms

Computer Vision for Automation; Dexterous Manipulation; Deep Learning in Grasping and Manipulation