Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks
Gabriela Sejnova, Michal Vavrecka, Karla Stepanova
Abstract
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced output. A more lightweight alternative is to use multimodal Variational Autoencoders (VAEs), which can extract the latent features of the data and integrate them into a joint representation, as state-of-the-art models have demonstrated mostly on image-image or image-text data. Here, we explore whether and how multimodal VAEs can be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the results obtained, we propose a model-invariant training alternative that improves the models’ performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges raised by individual tasks, such as object or robot position variability, number of distractors, or task length. Our work thus also sheds light on the potential benefits and limitations of using current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.