ICRA 2024

Learning Distributional Demonstration Spaces for Task-Specific Cross-Pose Estimation

Jenny Wang, Octavian Donca, David Held

Abstract

Relative placement tasks are an important category of tasks in which one object needs to be placed in a desired pose relative to another object. Previous work has shown success in learning relative placement tasks from just a small number of demonstrations when using relational reasoning networks with geometric inductive biases. However, such methods cannot flexibly represent multimodal tasks, like a mug hanging on any of n racks. We propose a method that incorporates additional properties that enable learning multimodal relative placement solutions, while retaining the provably translation-invariant and relational properties of prior work. We show that our method is able to learn precise relative placement tasks from only 10-20 multimodal demonstrations and no human annotations, across a diverse set of objects within a category. Supplementary information can be found on the website: https://sites.google.com/view/tax-posed/home.

I. INTRODUCTION

Many robotic manipulation tasks can be framed as relative placement tasks. For example, hanging a mug on a mug rack requires placing the mug in a position relative to one of the pegs of the rack. Even complex, long-horizon tasks such as organizing a cluttered table can be framed as a series of relative placements: first, predict an SE(3) transformation that stacks one book on top of another, then predict a transformation that puts the pencil in the pencil box, and then predict a transformation that centers the keyboard in front of the monitor.

Previous work such as TAX-Pose [1] has shown that, for relative placement tasks, network architectures that explicitly reason about object relationships generalize significantly better across object poses and instances. However, this previous work outputs only a single relative placement prediction for each observation. In multimodal settings, this leads to a prediction that is the mean of the valid placement modes, which may itself be an invalid placement.
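The averaging failure described above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes placements are reduced to 3D translations (rotations are ignored), uses two hypothetical peg positions, and takes the mean of the demonstrations as a stand-in for what a single-output regressor trained with a squared-error loss tends to predict. All names and numbers are illustrative.

```python
import numpy as np


def mean_placement(demos: np.ndarray) -> np.ndarray:
    """Single-output prediction: the mean of all demonstrated placements."""
    return demos.mean(axis=0)


def distance_to_nearest_mode(point: np.ndarray, modes: np.ndarray) -> float:
    """Euclidean distance from a placement to the closest valid mode."""
    return float(np.min(np.linalg.norm(modes - point, axis=1)))


# Hypothetical peg positions in metres (assumption, not from the paper).
modes = np.array([[-0.2, 0.0, 0.3],
                  [0.2, 0.0, 0.3]])

rng = np.random.default_rng(0)
labels = np.arange(20) % 2  # 10 demonstrations per peg
demos = modes[labels] + rng.normal(scale=0.005, size=(20, 3))

# Each individual demonstration is close to one of the pegs ...
worst_demo_error = max(distance_to_nearest_mode(d, modes) for d in demos)

# ... but the averaged prediction lands between the pegs.
prediction = mean_placement(demos)
prediction_error = distance_to_nearest_mode(prediction, modes)

print(f"worst demonstration error: {worst_demo_error:.3f} m")
print(f"mean-prediction error:     {prediction_error:.3f} m")
```

In this toy setup every demonstration lies within a few centimetres of a peg, while the averaged prediction sits roughly midway between the two pegs, far from both.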
For example, suppose that a set of demonstrations place a mug on any of n racks in the scene. The average of these demonstrations will be a point in the middle of the racks, which is not a valid placement. In contrast, many tasks are defined by a distribution of relationships: a robot may be tasked to grasp anywhere along the rim of a bowl, place a fork on the left of any of the plates (e.g. when setting the table), or grasp any one of a cabinet’s drawers.

This material is based upon work supported by the United States Air Force and DARPA under Contract No. FA8750-18-C-0092, NIST under Grant No. 70NANB23H178, and the Uber Presidential Fellowship.

All authors are with the Robotics Institute, Carnegie Mellon University. * represents equal contribution. (jennyw2@andrew.cmu.edu, odonca@andrew.cmu.edu, dheld@andrew.cmu.edu)

Fig. 1: Our method’s learned prior p(z | ·) learns a distribution over modalities for the task, for example placing a mug on the left rack or the right rack. At inference time, this allows the model to predict a diverse set of ways to perform the task.

To address these challenges, we present TAX-PoseD, a Distributional variant of TAX-Pose [1]. Our method predicts task-specific object relationships from just a few demonstrations and no human annotations, while robustly accounting for multimodal demonstration distributions. Our core technical contributions include:

• A method for efficiently learning distributional relative placement tasks; our approach extends TAX-Pose [1] to handle multimodal, distributional demonstrations.

• A novel spatially-grounded architecture for a cVAE [2] that represents the latent variable distribution as a categorical distribution over 3D points; this results in a grounded and interpretable latent space that avoids the smoothing effect commonly found in cVAEs, leading to significantly improved performance on multimodal placement tasks.
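To make the idea of a categorical latent distribution over 3D points more concrete, the following is a rough sketch under stated assumptions; it is not the TAX-PoseD architecture. It assumes a hypothetical scene point cloud with two peg regions, replaces the network output with hand-written per-point logits, and draws latent samples as point indices. The function and variable names are illustrative only.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D array of per-point logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def sample_latent_points(points, logits, num_samples, rng):
    """Sample point indices from a categorical distribution over the cloud."""
    probs = softmax(logits)
    idx = rng.choice(len(points), size=num_samples, p=probs)
    return idx, points[idx], probs


rng = np.random.default_rng(0)

# Hypothetical scene (assumption): indices 0-49 lie on a left peg,
# 50-99 on a right peg, and 100-199 are background points.
left_peg = rng.normal(loc=[-0.2, 0.0, 0.3], scale=0.01, size=(50, 3))
right_peg = rng.normal(loc=[0.2, 0.0, 0.3], scale=0.01, size=(50, 3))
background = rng.uniform(-0.5, 0.5, size=(100, 3))
points = np.concatenate([left_peg, right_peg, background])

# Hand-written stand-in for per-point logits a network might predict:
# high on both pegs, low on the background.
logits = np.concatenate([np.full(100, 5.0), np.full(100, -5.0)])

idx, latent_points, probs = sample_latent_points(points, logits, 1000, rng)

left_fraction = float(np.mean(idx < 50))
right_fraction = float(np.mean((idx >= 50) & (idx < 100)))
print(f"samples on left peg:  {left_fraction:.2f}")
print(f"samples on right peg: {right_fraction:.2f}")
```

Because each sample is an actual point in the scene, the two modes stay separate: samples fall on one peg or the other rather than on an interpolated location between them, and the probabilities can be inspected directly on the point cloud.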
We evaluate our method on challenging multimodal tasks and assess its generalization across diverse objects within a category. We demonstrate that our method is interpretable and achieves strong performance on distributional relative placement tasks.

II. RELATED WORK

Action Representations. Many action representations have been explored that enable robots to learn manipulation tasks more efficiently. For example, representations can be per-point [3], [4], [5] or consist of keypoints [6]. Additionally, architectures that leverage local geometry have been shown to provide useful priors for learning

2024 IEEE International Conference on Robotics and Automation (ICRA 2024), May 13-17, 2024, Yokohama, Japan. ©2024 IEEE

Index terms

Learning from Demonstration; Deep Learning Methods; Representation Learning