ICRA 2026

Visual Category-Guided One-Shot Open Affordance Grounding

Yangfan Wang, Hongyang Yu, Xiying Li

AI summary

A dynamic category-conditioned prompt and a coarse-to-fine Transformer decoder enable robust affordance localization at less than 1% of the training cost of related methods, with stronger generalization to unseen objects.

affordance grounding, one-shot learning, vision-language models, category-conditioned prompts, Transformer decoder, robotic perception

Problem

Existing one-shot open affordance grounding methods often fail to locate functional regions in complex scenes. The causes are a lack of fine-grained perception, function-appearance heterogeneity, and static affordance prompts that overfit to known categories and generalize poorly to unseen ones.

Approach

The method dynamically encodes instance-level visual features into category tokens to build semantic prompts, paired with a coarse-to-fine Transformer decoder that progressively aligns textual affordance cues with multi-scale visual features for precise part localization.
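
To make the prompt construction concrete, here is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes the prompt is a token sequence made of shared context tokens, one category token derived from the image, and one affordance word embedding. The function `build_prompt`, the projection `w_proj`, the `tanh` nonlinearity, and all dimensions are hypothetical choices for illustration.

```python
import numpy as np

def build_prompt(visual_feat, affordance_emb, context, w_proj):
    """Assemble one prompt: [context tokens..., category token, affordance word]."""
    # Hypothetical projection (an assumption, not from the paper): map the
    # instance-level visual feature into the text embedding space.
    category_token = np.tanh(visual_feat @ w_proj)
    return np.vstack([context, category_token[None, :], affordance_emb[None, :]])

rng = np.random.default_rng(0)
d_vis, d_txt, n_ctx = 8, 4, 2                       # illustrative sizes only
context = rng.normal(size=(n_ctx, d_txt))           # shared context tokens
w_proj = rng.normal(size=(d_vis, d_txt)) / np.sqrt(d_vis)
affordance_emb = rng.normal(size=d_txt)             # stands in for one affordance word

feat_a = rng.normal(size=d_vis)                     # visual feature of image A
feat_b = rng.normal(size=d_vis)                     # visual feature of image B
prompt_a = build_prompt(feat_a, affordance_emb, context, w_proj)
prompt_b = build_prompt(feat_b, affordance_emb, context, w_proj)

print(prompt_a.shape)                               # (n_ctx + 2, d_txt)
```

In this sketch the two prompts share their context tokens and affordance word and differ only in the category token, which is what makes the prompt dynamic: an unseen object still yields a category token computed from its own visual features instead of a fixed class name.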

Key results

Category-conditioned affordance prompt learning for dynamic semantic alignment
Coarse-to-fine semantic-guided Transformer decoder for complex structure localization
Competitive performance on AGD20K and UMD benchmarks at less than 1% of the training cost
Superior generalization to unseen object categories and novel affordances compared to SOTA baselines
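
The coarse-to-fine decoding named in the list above can be illustrated with a small NumPy sketch. This is an assumed simplification, not the paper's decoder: a single text query is refined by single-head cross-attention over feature maps ordered from coarse to fine, and the refined query then scores the finest map to produce a heatmap. The helper names, the residual update, and the shared channel width across scales are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, feats):
    """Single-head cross-attention: query (d,) attends over feats (n, d)."""
    weights = softmax(feats @ query / np.sqrt(query.shape[0]))
    return query + weights @ feats            # residual update of the query

def coarse_to_fine(query, pyramid):
    """Refine the query over maps ordered coarse -> fine, then score the finest map."""
    for fmap in pyramid:
        h, w, d = fmap.shape
        query = cross_attend(query, fmap.reshape(h * w, d))
    h, w, d = pyramid[-1].shape
    scores = pyramid[-1].reshape(h * w, d) @ query
    return softmax(scores).reshape(h, w)      # normalized affordance heatmap

rng = np.random.default_rng(0)
d = 4                                          # assumed shared channel width
pyramid = [rng.normal(size=(s, s, d)) for s in (2, 4, 8)]
text_query = rng.normal(size=d)                # stands in for an affordance word embedding
heatmap = coarse_to_fine(text_query, pyramid)

print(heatmap.shape)
```

The ordering is the point of the sketch: the query first aggregates context from the low-resolution map, so by the time it reaches the high-resolution map it already carries a coarse estimate of where the relevant part lies.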

Why it matters

This approach enables robots and vision systems to accurately identify functional object parts with minimal training data, advancing efficient human-object interaction and robotic manipulation.

Abstract

Affordance grounding is a challenging task that aims to locate functional regions in object images enabling potential human-object interactions. One-shot open affordance grounding leverages the generalization capability of visual foundation models to overcome limitations of training data scale. However, existing methods often fail to locate functional regions in complex scenarios due to the lack of fine-grained perception, function-appearance heterogeneity, and the overfitting of affordance prompts to known categories. To improve generalization to unseen categories, we introduce a category-conditioned affordance prompt learning, which constructs a complete semantic category-affordance prompt from instance-level visual features. To further improve the accuracy of affordance localization for objects with complex structures, we propose a coarse-to-fine semantic-guided Transformer decoder. This design enhances the decoder’s ability to understand the semantic mapping between the affordance words and corresponding object part-level regions. On multiple standard benchmarks, our method achieves competitive performance compared to related methods with less than 1% of the training cost. Notably, our approach shows more robust generalization to unseen objects and novel affordances than the recent SOTA baseline methods.

Index terms

Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Recognition