ICRA 2024

Commonsense Spatial Knowledge-Aware 3-D Human Motion and Object Interaction Prediction

Sang Uk Lee

Abstract

We propose a novel 3-D human motion and object interaction prediction model that is aware of commonsense knowledge about human–object interaction. The model jointly predicts human joint motion and human–object interactions. The two predictions are combined to make commonsense knowledge explicit to the model, for example: "if the human right hand is predicted to be in contact with an object after 1 second, the distance between the right hand and the object should also be predicted to be small." The model takes the raw point cloud of the surrounding objects in the environment as input. Using the raw point cloud makes commonsense knowledge easy to model and improves accuracy. In particular, unlike previous studies, it does not require a separate perception system (e.g., object classification or object pose estimation) and is therefore robust to perception errors. The surrounding environment context and the past human joint poses are two heterogeneous inputs, and the model fuses them with a cross-attention mechanism, which is well suited to combining inputs of this kind. The model is validated on the KIT Whole-Body Human Motion (WBHM) dataset.
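The commonsense rule quoted in the abstract can be illustrated with a small consistency penalty. The sketch below is not the formulation used in the paper, which the abstract does not specify; it is a minimal illustration assuming a predicted contact probability for one joint, a predicted 3-D position for that joint, and an object point cloud given as an (N, 3) array. The function names, the `margin` parameter, and the hinge-style penalty are all hypothetical.

```python
import numpy as np


def min_distance_to_cloud(joint_xyz, cloud_xyz):
    """Distance from one joint position, shape (3,), to the nearest point of an (N, 3) cloud."""
    diffs = np.asarray(cloud_xyz, dtype=float) - np.asarray(joint_xyz, dtype=float)
    return float(np.min(np.linalg.norm(diffs, axis=1)))


def contact_consistency_penalty(contact_prob, joint_xyz, cloud_xyz, margin=0.05):
    """Hypothetical penalty: large when contact is predicted but the joint is far from the cloud.

    contact_prob: predicted probability (0..1) that the joint touches the object.
    margin: assumed distance in meters below which the joint counts as "close".
    """
    distance = min_distance_to_cloud(joint_xyz, cloud_xyz)
    # Hinge on the distance, weighted by how confident the contact prediction is.
    return contact_prob * max(0.0, distance - margin)


# Toy object point cloud and two candidate predictions for the right hand.
cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
far_hand = np.array([0.0, 0.0, 0.5])    # 0.5 m from the nearest point
near_hand = np.array([0.0, 0.0, 0.02])  # within the margin

penalty_far = contact_consistency_penalty(0.8, far_hand, cloud)
penalty_near = contact_consistency_penalty(0.8, near_hand, cloud)
```

In this toy setup the inconsistent prediction (contact likely, hand far away) receives a positive penalty, while the consistent one receives none. Because the distance is computed directly against the raw points, no object class or pose estimate is needed, which matches the motivation given in the abstract.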

Index terms

Deep Learning for Visual Perception; Human-Robot Collaboration; Deep Learning Methods