Visual Category-Guided One-Shot Open Affordance Grounding
Yangfan Wang, Hongyang Yu, Xiying Li
AI summary
Problem
Existing one-shot open affordance grounding methods struggle in complex scenarios: they lack fine-grained perception, are misled by function-appearance heterogeneity, and generalize poorly to unseen categories because their static prompt designs overfit to known ones.
Approach
The method dynamically encodes instance-level visual features into category tokens to build semantic prompts, paired with a coarse-to-fine Transformer decoder that progressively aligns textual affordance cues with multi-scale visual features for precise part localization.
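The prompt construction described above can be illustrated with a minimal sketch. It assumes a single linear projection from an instance-level visual feature to a category token, and a prompt layout of context tokens, then the category token, then the affordance token. The names `linear` and `build_prompt`, the toy dimensions, and the random stand-in parameters are all hypothetical and are not taken from the paper.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector (lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def build_prompt(visual_feat, context_tokens, affordance_token, weight, bias):
    """Return the token sequence [context..., category token, affordance token].

    The category token is computed from the image instance, so the prompt
    changes per image instead of being a fixed (static) template.
    """
    category_token = linear(visual_feat, weight, bias)
    return context_tokens + [category_token, affordance_token]

random.seed(0)
VIS_DIM, TXT_DIM, N_CTX = 8, 4, 2  # toy sizes, chosen for illustration only

# Stand-ins for learned parameters (random here; learned in a real model).
weight = [[random.gauss(0, 0.1) for _ in range(VIS_DIM)] for _ in range(TXT_DIM)]
bias = [0.0] * TXT_DIM
context_tokens = [[random.gauss(0, 1) for _ in range(TXT_DIM)] for _ in range(N_CTX)]
affordance_token = [random.gauss(0, 1) for _ in range(TXT_DIM)]  # e.g. an embedding of "hold"

# Two different image instances yield two different category tokens.
feat_a = [random.gauss(0, 1) for _ in range(VIS_DIM)]
feat_b = [random.gauss(0, 1) for _ in range(VIS_DIM)]
prompt_a = build_prompt(feat_a, context_tokens, affordance_token, weight, bias)
prompt_b = build_prompt(feat_b, context_tokens, affordance_token, weight, bias)
```

In this sketch only the category slot differs between `prompt_a` and `prompt_b`; the context and affordance tokens are shared, which is one simple way to make a prompt depend on the instance rather than on a fixed category name.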
Key results
- Category-conditioned affordance prompt learning for dynamic semantic alignment
- Coarse-to-fine semantic-guided Transformer decoder for complex structure localization
- Competitive performance on AGD20K and UMD benchmarks at less than 1% of the training cost
- Superior generalization to unseen object categories and novel affordances compared to SOTA baselines
Why it matters
This approach enables robots and vision systems to accurately identify functional object parts with minimal training data, advancing efficient human-object interaction and robotic manipulation.
Abstract
Affordance grounding is a challenging task that aims to locate the functional regions of objects in images that enable potential human-object interactions. One-shot open affordance grounding leverages the generalization capability of visual foundation models to overcome the limited scale of training data. However, existing methods often fail to locate functional regions in complex scenarios because of a lack of fine-grained perception, function-appearance heterogeneity, and the overfitting of affordance prompts to known categories. To improve generalization to unseen categories, we introduce category-conditioned affordance prompt learning, which constructs a complete semantic category-affordance prompt from instance-level visual features. To further improve the accuracy of affordance localization for objects with complex structures, we propose a coarse-to-fine semantic-guided Transformer decoder. This design enhances the decoder’s ability to understand the semantic mapping between affordance words and the corresponding part-level object regions. On multiple standard benchmarks, our method achieves performance competitive with related methods at less than 1% of the training cost. Notably, our approach generalizes more robustly to unseen objects and novel affordances than recent SOTA baseline methods.
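The coarse-to-fine decoding idea can be illustrated with a small sketch. It assumes a single text query that is refined by single-head cross-attention over feature levels ordered from coarse to fine, with a residual update and no learned projections, and a sigmoid over dot products at the finest level as the heatmap. The function names and toy features are hypothetical; this is not the authors' decoder.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, feats):
    """One single-head cross-attention step with a residual update of the query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in feats])
    attended = [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(len(query))]
    return [q + a for q, a in zip(query, attended)]

def coarse_to_fine(query, pyramid):
    """Refine the query over levels ordered coarse -> fine, then score the finest level."""
    for feats in pyramid:
        query = cross_attend(query, feats)
    finest = pyramid[-1]
    return [1.0 / (1.0 + math.exp(-dot(query, f))) for f in finest]  # sigmoid heatmap

# Toy inputs: a 2x2 coarse level and a 4x4 fine level, flattened, feature dim 3.
query = [1.0, 0.0, 0.0]  # stand-in for an affordance text embedding
background = [-1.0, 0.5, 0.0]
coarse = [[1.0, 0.0, 0.0]] + [background[:] for _ in range(3)]
fine = [background[:] for _ in range(16)]
fine[5] = [2.0, 0.0, 0.0]  # the one fine position aligned with the query

heatmap = coarse_to_fine(query, [coarse, fine])
peak = max(range(len(heatmap)), key=heatmap.__getitem__)  # index of the strongest response
```

Here the query first absorbs context from the coarse level and is then sharpened against the fine level, so the final heatmap is scored at the highest available resolution.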