ICRA 2026

RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment

Qiyuan Zhuang, He-Yang Xu, Yijun Wang, Xin-Yang Zhao, Yang-Yang Li, Xiu-Shen Wei

AI summary

RAAP accurately predicts where and how to interact with unseen objects by combining dense feature matching with multi-reference alignment, enabling robust zero-shot manipulation from just tens of examples.

affordance prediction, retrieval-augmented learning, zero-shot manipulation, cross-image alignment, robotic manipulation, few-shot learning

Problem

Existing affordance prediction methods struggle with robust generalization to unseen objects due to fragile single-example retrieval or the high data demands and mislocalization errors of large-scale models.

Approach

RAAP decouples static contact localization and dynamic action direction, transferring contact points via dense feature correspondence and predicting action directions by aggregating multiple retrieved references through a dual-weighted cross-image alignment model.
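The two decoupled steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the nearest-neighbour cosine matching, and the way two per-reference scores are combined into one softmax weight are all assumptions made for illustration. The summary does not specify the feature extractor or the exact form of the dual-weighted attention, and the paper learns its weights with an alignment model rather than taking scores as given.

```python
import numpy as np


def transfer_contact_point(ref_feats, tgt_feats, ref_contact):
    """Map a contact pixel from a reference image onto a target image.

    ref_feats, tgt_feats: (H, W, C) dense feature maps, assumed precomputed.
    ref_contact: (row, col) contact pixel in the reference image.
    Returns the (row, col) target pixel with the highest cosine similarity.
    """
    query = ref_feats[ref_contact[0], ref_contact[1]]
    query = query / (np.linalg.norm(query) + 1e-8)
    h, w, c = tgt_feats.shape
    flat = tgt_feats.reshape(-1, c)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sims = flat @ query                      # cosine similarity per target pixel
    return divmod(int(np.argmax(sims)), w)   # flat index -> (row, col)


def aggregate_directions(ref_dirs, retrieval_scores, alignment_scores,
                         temperature=0.1):
    """Fuse K reference action directions into one unit vector.

    Hypothetical stand-in for dual weighting: each reference gets a weight
    from the softmax of the sum of two scores, which is equivalent to
    multiplying two exponential weights and renormalising.
    """
    dirs = np.asarray(ref_dirs, dtype=float)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    logits = (np.asarray(retrieval_scores, dtype=float)
              + np.asarray(alignment_scores, dtype=float)) / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    fused = weights @ dirs                   # weighted mean direction
    return fused / (np.linalg.norm(fused) + 1e-8)
```

Averaging over several references rather than copying the single best match is what lets one poorly matched reference be outvoted, which is the motivation the summary gives for multi-reference aggregation.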

Key results

Achieves strong zero-shot generalization on unseen objects and cross-category tasks with as few as tens of samples per task.
Outperforms retrieval and diffusion baselines in dynamic affordance prediction accuracy (lower Mean Angular Error).
Enables robust zero-shot robotic manipulation in both simulation and real-world environments.
Introduces a dual-weighted attention mechanism that consolidates multiple references to reduce directional prediction ambiguity.
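For reference, Mean Angular Error is commonly computed as the average angle between predicted and ground-truth direction vectors. The sketch below follows that usual definition; reporting in degrees and the function name are assumptions, since the summary does not state how the paper defines the metric.

```python
import numpy as np


def mean_angular_error(pred_dirs, gt_dirs):
    """Average angle in degrees between paired direction vectors.

    pred_dirs, gt_dirs: (N, D) arrays of non-zero vectors; they are
    normalised here, so inputs need not be unit length.
    """
    pred = np.asarray(pred_dirs, dtype=float)
    gt = np.asarray(gt_dirs, dtype=float)
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    # Clip guards against round-off pushing the cosine outside [-1, 1].
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```

Lower values mean the predicted post-contact direction is closer to the demonstrated one.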

Why it matters

It enables data-efficient, robust fine-grained robotic manipulation for novel tasks and objects, reducing reliance on massive training datasets.

Abstract

Understanding object affordances is essential for enabling robots to perform purposeful and fine-grained interactions in diverse and unstructured environments. However, existing approaches either rely on retrieval, which is fragile due to sparsity and coverage gaps, or on large-scale models, which frequently mislocalize contact points and mispredict post-contact actions when applied to unseen categories, thereby hindering robust generalization. We introduce Retrieval-Augmented Affordance Prediction (RAAP), a framework that unifies affordance retrieval with alignment-based learning. By decoupling static contact localization and dynamic action direction, RAAP transfers contact points via dense correspondence and predicts action directions through a retrieval-augmented alignment model that consolidates multiple references with dual-weighted attention. Trained on compact subsets of DROID and HOI4D with as few as tens of samples per task, RAAP achieves consistent performance across unseen objects and categories, and enables zero-shot robotic manipulation in both simulation and the real world. Project website: github.com/SEU-VIPGroup/RAAP.

This work was supported by National Natural Science Foundation of China under Grant (62522602), Basic Research Program of Jiangsu under Grant (BK20250073), CIE-Tencent Robotics X Rhino-Bird Focused Research Program, and the Fundamental Research Funds for the Central Universities (4009002401, 2242025K30024). This work was also supported by the Big Data Computing Center of Southeast University.

† For correspondence: weixs@seu.edu.cn

1 School of Computer Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing 211189, China
2 Southeast University-Monash University Joint Graduate School, Southeast University, Suzhou 215123, China
3 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

Index terms

Deep Learning for Visual Perception, Visual Learning, RGB-D Perception