SII 2025

Conditional NewtonianVAE to Generate Pre-Grasping Actions in Physical Latent Spaces

Masaki Ito, Gustavo Alfonso Garcia Ricardez, Ryo Okumura, Tadahiro Taniguchi

Abstract

To make robotic grasping scalable, vision-based control with high data efficiency and accuracy is needed. World models can build representations of physical environments from sensory information. In particular, NewtonianVAE is a world model that, given input images, can control targets in physical environments by applying proportional control in its latent space. However, NewtonianVAE entangles information of each object in separate state subspaces, making control infeasible when it is trained with multiple objects. In this paper, we introduce Conditional NewtonianVAE, a novel framework designed to generate pre-grasping actions by disentangling object-type information from the state space in physical latent spaces. Our method incorporates a conditioning variable to achieve this disentanglement, so that the learned state space can be used for control tasks. Through simulation and real-robot experiments, we demonstrate that Conditional NewtonianVAE accurately positions the end-effector into a pre-grasping pose, thereby improving the success rate of robotic grasping. Conditional NewtonianVAE achieves a grasping success rate of 83% for known objects and 78% for unseen objects in the real-robot experiments.
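
As an illustration of the proportional control mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a toy latent state that behaves like a position and an action that is integrated directly as a velocity; the names `p_control_step`, `rollout`, `gain`, and `dt` are hypothetical. In the actual method, the current and goal latent states would presumably come from encoding images, which is omitted here.

```python
def p_control_step(x, x_goal, gain=0.5):
    """Proportional control in a (toy) latent space: u = gain * (x_goal - x)."""
    return [gain * (g - xi) for xi, g in zip(x, x_goal)]


def rollout(x0, x_goal, gain=0.5, dt=1.0, steps=20):
    """Apply the P-controller repeatedly under idealized dynamics.

    Assumption: the latent state is updated as x <- x + dt * u, i.e. the
    action acts as a velocity command. This is a stand-in for the robot
    and the learned world model, not a description of either.
    """
    x = list(x0)
    for _ in range(steps):
        u = p_control_step(x, x_goal, gain)
        x = [xi + dt * ui for xi, ui in zip(x, u)]
    return x


# Drive a 2-D toy latent state toward a goal latent state.
final_state = rollout([0.0, 0.0], [1.0, -2.0])
```

With `gain * dt = 0.5`, the distance to the goal halves at every step in this toy setting, so the state converges geometrically. Control of this kind only works if the latent coordinates correspond to physical positions, which is why the entanglement described above matters.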

Index terms

Robotic hands and grasping; Vision Systems