ICRA 2026

FunCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltan-Csaba Marton, Zhenshan Bing, Zhaopeng Chen, Alois Knoll, Jianwei Zhang

AI summary

Decomposing manipulation into functionally aligned action primitives enables pose-aware, category-generalizable policies with robust sim-to-real transfer.

Functional alignment, Action primitives, Diffusion policy, Sim-to-real transfer, Object-centric learning, Vision-language models

Problem

End-to-end robotic policies struggle to generalize across unseen objects, poses, and tasks because they treat manipulation as monolithic, task-specific trajectories rather than modular, reusable behaviors.

Approach

FunCanon breaks long-horizon tasks into reusable actor-verb-object action primitives, uses vision-language models to extract affordance cues for functional object canonicalization, and trains a diffusion policy on functionally aligned data to automatically transfer trajectories across objects.
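As a rough illustration of the actor-verb-object decomposition described above, the sketch below represents each primitive as a small record and checks which verbs two tasks have in common. The class name, field names, task lists, and the `shared_verbs` helper are all assumptions made for this example; they are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionPrimitive:
    """One actor-verb-object chunk (field names are illustrative)."""
    actor: str  # what performs the action, e.g. the gripper or a held tool
    verb: str   # the reusable behavior
    obj: str    # the object being acted on


# Hypothetical decompositions, written by hand for illustration only.
POUR_TEA = [
    ActionPrimitive("gripper", "grasp", "teapot"),
    ActionPrimitive("teapot", "pour", "cup"),
    ActionPrimitive("gripper", "place", "teapot"),
]
WATER_PLANT = [
    ActionPrimitive("gripper", "grasp", "watering_can"),
    ActionPrimitive("watering_can", "pour", "flower_pot"),
    ActionPrimitive("gripper", "place", "watering_can"),
]


def shared_verbs(task_a, task_b):
    """Return the verbs that both tasks use, in sorted order."""
    return sorted({p.verb for p in task_a} & {p.verb for p in task_b})


reusable = shared_verbs(POUR_TEA, WATER_PLANT)
```

In this toy example both tasks reduce to the same three verbs, which is the kind of overlap that would let a policy trained on one task's primitives be reused for another.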

Key results

Decomposes manipulation into reusable actor-verb-object action primitives
Enables automatic trajectory transfer across object instances and categories
Achieves instance-level and category-level generalization in simulation
Demonstrates robust sim-to-real transfer on a physical Franka robot
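
One way to picture trajectory transfer across objects is as a change of reference frame: express each end-effector pose relative to the source object's functional frame, then replay it relative to the target object's functional frame. The sketch below does this with plain 4x4 homogeneous transforms. It assumes the functional frames are already known; how FunCanon actually estimates them from affordance cues is not shown here, and every function name is hypothetical.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def transfer_trajectory(world_poses, src_frame, dst_frame):
    """Map world-frame poses recorded on a source object onto a target object.

    src_frame and dst_frame are the world poses of each object's
    functional frame (assumed to be given).
    """
    src_inv = invert_rigid(src_frame)
    return [matmul(dst_frame, matmul(src_inv, pose)) for pose in world_poses]


def translation(x, y, z):
    """Pure translation as a 4x4 homogeneous transform."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


# Source object at x = 1; target object at y = 2, rotated 90 degrees about z.
src = translation(1.0, 0.0, 0.0)
dst = [[0.0, -1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 2.0],
       [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
demo = [translation(1.5, 0.0, 0.25)]  # one recorded end-effector pose

transferred = transfer_trajectory(demo, src, dst)
position = [row[3] for row in transferred[0][:3]]  # [0.0, 2.5, 0.25]
```

The recorded pose sits 0.5 along the source object's x-axis and 0.25 above it; after transfer it sits 0.5 along the target's rotated x-axis (world y) and 0.25 above it, so the offset relative to the object is preserved.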

Why it matters

Offers a scalable inductive bias for imitation learning, allowing robots to generalize manipulation skills across diverse objects and tasks without extensive retraining or manual labeling.

Abstract

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision–language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim-to-real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

* The first three authors contribute equally to this paper.
† Corresponding author: zhanglei.cn.de@gmail.com, lei.zhang-1@studium.uni-hamburg.de
1 TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.
2 Technical University of Munich, Germany.
3 Agile Robots SE, Munich, Germany.
This work is supported by New Generation Artificial Intelligence-National Science and Technology Major Project (2025ZD0122903).

Index terms

Imitation Learning, Visual Learning, Transfer Learning