Efficient Alignment of Unconditioned Action Prior for Language-Conditioned Pick and Place in Clutter
Kechun Xu, Xunlong Xia, Kaixuan Wang, Yifei Yang, Yunxuan Mao, Bing Deng, Jieping Ye, Rong Xiong, Yue Wang
AI summary
Problem
Existing language-conditioned pick-and-place methods either require massive demonstration data for end-to-end learning or suffer from cascading errors in modular zero-shot systems, particularly when handling cluttered scenes.
Approach
The method generates unconditioned action candidates and 3D vision-language scene representations from foundation models, then aligns them using a single cross-attention layer to efficiently predict task-specific action probabilities.
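The alignment step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that action candidates act as queries, that per-point vision-language features act as keys and values, and that a linear head turns each attended feature into one logit. The names `align_actions`, `params`, and `w_out`, the feature sizes, and the random weights are all hypothetical stand-ins for a trained policy.

```python
import math
import random


def linear(x, w):
    """Multiply a row vector x (len d_in) by a weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical; a trained policy would load these)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def align_actions(action_feats, scene_feats, params):
    """Score each action candidate against the scene with one cross-attention layer.

    action_feats: list of N candidate embeddings (queries)
    scene_feats:  list of M vision-language point features (keys and values)
    Returns one probability per candidate.
    """
    d_k = len(params["w_q"][0])
    keys = [linear(f, params["w_k"]) for f in scene_feats]
    values = [linear(f, params["w_v"]) for f in scene_feats]
    logits = []
    for a in action_feats:
        q = linear(a, params["w_q"])
        # Scaled dot-product attention of this candidate over all scene points.
        attn = softmax(
            [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        )
        context = [
            sum(w * v[j] for w, v in zip(attn, values)) for j in range(len(values[0]))
        ]
        # Project the attended feature to a single logit for this candidate.
        logits.append(sum(c * wo for c, wo in zip(context, params["w_out"])))
    # Normalize over candidates to obtain task-specific action probabilities.
    return softmax(logits)


rng = random.Random(0)
d_action, d_scene, d_k = 6, 8, 4  # assumed feature sizes, for illustration only
params = {
    "w_q": random_matrix(d_action, d_k, rng),
    "w_k": random_matrix(d_scene, d_k, rng),
    "w_v": random_matrix(d_scene, d_k, rng),
    "w_out": [rng.gauss(0.0, 1.0) for _ in range(d_k)],
}
action_feats = [[rng.gauss(0, 1) for _ in range(d_action)] for _ in range(5)]
scene_feats = [[rng.gauss(0, 1) for _ in range(d_scene)] for _ in range(20)]

probs = align_actions(action_feats, scene_feats, params)
best = max(range(len(probs)), key=probs.__getitem__)
print("selected candidate:", best)
```

In this sketch the candidates come from an action prior that does not see the instruction, so the only learned parameters sit in the single attention layer and its output head. This is the sense in which such a policy could be trained with comparatively little data.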
Key results
- Proposes A2, an action prior alignment method requiring only one attention layer for policy learning
- Leverages MaskCLIP to construct zero-shot generalizable 3D vision-language priors
- Achieves higher task success rates with fewer planning steps in both simulation and real-world cluttered environments
- Demonstrates effective zero-shot generalization to unseen objects and language instructions
Why it matters
It advances generalizable robotic manipulation by reducing data requirements and improving real-world performance for language-driven pick-and-place tasks in cluttered environments.
Abstract
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A2, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and code are available at https://xukechun.github.io/papers/A2
Note to Practitioners: This research is motivated by the challenge of generalizable policy learning of language-conditioned pick and place in clutter. Solving such a challenge could significantly improve the robot’s level of automation and intelligence in household and industrial pick and place tasks. Existing methods struggle with large data requirements, poor generalization to unseen scenarios, and cascading errors across individual components. To overcome these limitations, we propose to integrate priors from vision, language, and action foundation models by learning-based alignment.
Our policy aligns action priors with 3D vision-language priors by learning one attention layer, requiring less data and preserving zero-shot generalization capabilities from foundation models. Experiments show that our method can improve both task success rate and generalization for pick and place tasks in simulation and the real world. In future work, we will incorporate more action foundation models to extend our approach of action prior alignment to a wider range of tasks, offering a promising direction for general manipulation.
Received 12 March 2025; revised 24 July 2025; accepted 26 August 2025. Date of publication 5 September 2025; date of current version 26 September 2025. This article was recommended for publication by Associate Editor Z. Liu and Editor J. Yi upon evaluation of the reviewers’ comments. This work was supported in part by the Joint Funds of the National Natural Science Foundation of China under Grant U24A20128 and in part by Zhejiang Provincial Natural Science Foundation of China under Grant LD25F030001. (Corresponding author: Yue Wang.)
Kechun Xu is with Zhejiang University, Hangzhou 310027, China, and also with Alibaba Cloud, Hangzhou 311121, China. Xunlong Xia, Bing Deng, and Jieping Ye are with Alibaba Cloud, Hangzhou 311121, China. Kaixuan Wang, Yifei Yang, Yunxuan Mao, Rong Xiong, and Yue Wang are with Zhejiang University, Hangzhou 310027, China (e-mail: wangyue@iipc.zju.edu.cn).
This article has supplementary downloadable material available at https://doi.org/10.1109/TASE.2025.3606549, provided by the authors.
Digital Object Identifier 10.1109/TASE.2025.3606549