FunCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation
Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltan-Csaba Marton, Zhenshan Bing, Zhaopeng Chen, Alois Knoll, Jianwei Zhang
AI summary
Problem
End-to-end robotic policies struggle to generalize across unseen objects, poses, and tasks because they treat manipulation as monolithic, task-specific trajectories rather than modular, reusable behaviors.
Approach
FunCanon breaks long-horizon tasks into reusable actor-verb-object action primitives, uses vision-language models to extract affordance cues for functional object canonicalization, and trains a diffusion policy on functionally aligned data to automatically transfer trajectories across objects.
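As a rough illustration of the actor-verb-object decomposition described above, the sketch below represents two long-horizon tasks as sequences of primitives and finds the verbs they share. The class name, field names, task contents, and helper functions are all hypothetical and chosen for illustration; they are not taken from the FunCanon code base.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionPrimitive:
    """One actor-verb-object chunk (field names are illustrative assumptions)."""
    actor: str  # what performs the action, e.g. the gripper or a held object
    verb: str   # the action itself, e.g. "grasp" or "pour"
    obj: str    # the object the action is applied to


# Two hypothetical long-horizon tasks written as primitive sequences.
pour_water = [
    ActionPrimitive("gripper", "grasp", "mug"),
    ActionPrimitive("mug", "pour", "bowl"),
    ActionPrimitive("gripper", "place", "mug"),
]
water_plant = [
    ActionPrimitive("gripper", "grasp", "watering_can"),
    ActionPrimitive("watering_can", "pour", "pot"),
    ActionPrimitive("gripper", "place", "watering_can"),
]


def shared_verbs(*tasks):
    """Return the verbs that occur in every task, i.e. candidates for reuse."""
    verb_sets = [{p.verb for p in task} for task in tasks]
    return sorted(set.intersection(*verb_sets))


def verb_counts(*tasks):
    """Count how often each verb appears across all given tasks."""
    return Counter(p.verb for task in tasks for p in task)
```

In this toy setup, `shared_verbs(pour_water, water_plant)` returns all three verbs, which is the sense in which a policy trained per action rather than per task could be reused across tasks.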
Key results
- Decomposes manipulation into reusable actor-verb-object action primitives
- Enables automatic trajectory transfer across object instances and categories
- Achieves instance-level and category-level generalization in simulation
- Demonstrates robust sim-to-real transfer on a physical Franka robot
Why it matters
Offers a scalable inductive bias for imitation learning, allowing robots to generalize manipulation skills across diverse objects and tasks without extensive retraining or manual labeling.
Abstract
Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision-language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim-to-real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.
* The first three authors contribute equally to this paper. †Corresponding author. zhanglei.cn.de@gmail.com, lei.zhang-1@studium.uni-hamburg.de
1 TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany. 2 Technical University of Munich, Germany. 3 Agile Robots SE, Munich, Germany.
This work is supported by New Generation Artificial Intelligence-National Science and Technology Major Project (2025ZD0122903).
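The abstract's idea of transferring a trajectory through a shared functional frame can be sketched with rigid transforms. This is a minimal sketch, assuming each object's functional frame is already available as a 4x4 homogeneous pose in the world frame (in the paper these frames come from vision-language affordance cues, which are not modeled here). The function names and the yaw-only pose builder are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 homogeneous transform from a yaw angle (rad) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def transfer_trajectory(traj_world, T_world_src, T_world_tgt):
    """Map end-effector poses recorded on a source object onto a target object.

    Each world-frame pose is first expressed in the source object's
    functional frame, then mapped back to the world through the target
    object's functional frame.
    """
    T_src_inv = np.linalg.inv(T_world_src)
    return [T_world_tgt @ T_src_inv @ T for T in traj_world]


# Hypothetical example: a waypoint 0.1 m along the source frame's x-axis.
T_src = make_pose(0.0, [0.5, 0.0, 0.0])
T_tgt = make_pose(np.pi / 2, [0.2, 0.3, 0.0])
offset = make_pose(0.0, [0.1, 0.0, 0.0])
transferred = transfer_trajectory([T_src @ offset], T_src, T_tgt)
```

Because the waypoint keeps the same pose relative to the functional frame, the transferred waypoint sits 0.1 m along the target frame's x-axis, which points along the world y-axis after the 90 degree rotation.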