RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment
Qiyuan Zhuang, He-Yang Xu, Yijun Wang, Xin-Yang Zhao, Yang-Yang Li, Xiu-Shen Wei
AI summary
Problem
Existing affordance prediction methods struggle with robust generalization to unseen objects due to fragile single-example retrieval or the high data demands and mislocalization errors of large-scale models.
Approach
RAAP decouples static contact localization and dynamic action direction, transferring contact points via dense feature correspondence and predicting action directions by aggregating multiple retrieved references through a dual-weighted cross-image alignment model.
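As a rough illustration of how several retrieved references might be consolidated into one action direction, the sketch below fuses reference direction vectors using two sets of scores. It is a minimal sketch under stated assumptions, not the RAAP model: this section does not define the two weights, so the names `retrieval_scores` and `alignment_scores`, the softmax normalization, and the product rule for combining them are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_directions(directions, retrieval_scores, alignment_scores):
    """Fuse reference directions into one unit vector.

    Assumption (not from the paper): each reference carries two scores,
    and the final weight is the renormalized product of their softmaxes.
    """
    w_retrieval = softmax(retrieval_scores)
    w_alignment = softmax(alignment_scores)
    combined = [a * b for a, b in zip(w_retrieval, w_alignment)]
    norm = sum(combined)
    weights = [c / norm for c in combined]

    # Weighted sum of the reference direction vectors, per coordinate.
    dim = len(directions[0])
    fused = [sum(w * d[i] for w, d in zip(weights, directions)) for i in range(dim)]

    # Renormalize so the output is a unit direction.
    length = math.sqrt(sum(x * x for x in fused))
    if length == 0.0:
        raise ValueError("reference directions cancel out")
    return [x / length for x in fused]
```

With equal scores the references contribute equally; when one reference scores much higher on both criteria, the fused direction stays close to that reference, which is one way a weighting scheme could reduce ambiguity among conflicting references.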
Key results
- Achieves strong zero-shot generalization on unseen objects and cross-category tasks with as few as tens of samples per task.
- Outperforms retrieval and diffusion baselines in dynamic affordance prediction accuracy (lower Mean Angular Error).
- Enables robust zero-shot robotic manipulation in both simulation and real-world environments.
- Introduces a dual-weighted attention mechanism that consolidates multiple references to reduce directional prediction ambiguity.
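The Mean Angular Error mentioned in the key results can be illustrated with a short sketch. This section does not give the paper's exact definition, so the code assumes the common one: the angle between predicted and ground-truth direction vectors, averaged over samples and reported in degrees. The function names are illustrative.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two non-zero direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_pred * norm_target)))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error(preds, targets):
    """Average angular error over paired predictions and targets."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Under this definition, a lower value means predicted action directions point closer to the annotated ones, independent of vector magnitude.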
Why it matters
It enables data-efficient, robust fine-grained robotic manipulation for novel tasks and objects, reducing reliance on massive training datasets.
Abstract
Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval-Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment-based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval-augmented alignment model that consolidates multiple references with dual-weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero-shot robotic manipulation in both simulation and the real world. Project website: github.com/SEU-VIPGroup/RAAP.
This work was supported by National Natural Science Foundation of China under Grant (62522602), Basic Research Program of Jiangsu under Grant (BK20250073), CIE-Tencent Robotics X Rhino-Bird Focused Research Program, and the Fundamental Research Funds for the Central Universities (4009002401, 2242025K30024). This work was also supported by the Big Data Computing Center of Southeast University.
† For Correspondence: weixs@seu.edu.cn
1 School of Computer Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing 211189, China
2 Southeast University-Monash University Joint Graduate School, Southeast University, Suzhou 215123, China
3 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China