Preliminary Experiments of Inferring Human Intention by Analyzing Time-Series Images from Multiple Views
Masae Yokota, Sarthak Pathak, Mihoko Niitsuma, Kazunori Umeda
Abstract
The objective of this research is to construct an intelligent human-robot environment that can infer human behavioral intentions and adjust the space accordingly. As a preliminary study, we verify whether human behavioral intention can be inferred from image information alone. First, a Vision and Language Model (VLM) and object detection methods are used to infer possible human actions for each object detected in the images. We identify the differences between the inference results and the actual behavior, and discuss the methods needed for more accurate inference. Next, by observing the spatial relationship between skeletal points and an object, we determine which skeletal points to focus on in order to predict the behavior. We confirmed that, for actions performed with the clear intention of sitting on or passing by a chair, the behavior can be predicted by focusing on the neck point. Parameters of the neck skeletal point are selected, and each behavior is predicted by a Temporal Convolutional Network (TCN) with 91% performance. Through these preliminary experiments, we discuss the methods necessary for inferring human behavioral intentions from images.
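As a rough illustration of the last step, the sketch below shows the causal dilated 1D convolution that forms the basic building block of a TCN, applied to a time series derived from the neck keypoint. It is not the authors' implementation: the per-frame neck speed feature, the function names, and the kernel values are all assumptions chosen for illustration, since the abstract does not state which neck parameters were selected or how the network was configured.

```python
from typing import List, Sequence


def neck_speed(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Per-frame displacement of the neck keypoint (hypothetical feature)."""
    speeds = [0.0]  # no previous frame exists for the first sample
    for t in range(1, len(xs)):
        dx = xs[t] - xs[t - 1]
        dy = ys[t] - ys[t - 1]
        speeds.append((dx * dx + dy * dy) ** 0.5)
    return speeds


def causal_dilated_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before the start of the sequence are treated as zero, so the
    output at time t depends only on inputs at time t or earlier.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Hypothetical neck trajectory in image coordinates (pixels).
neck_x = [0.0, 3.0, 3.0, 6.0]
neck_y = [0.0, 4.0, 4.0, 8.0]
features = neck_speed(neck_x, neck_y)
# Arbitrary illustrative kernel; a trained TCN would learn these weights.
response = causal_dilated_conv1d(features, kernel=[0.5, 0.5], dilation=2)
```

A full TCN stacks several such layers with increasing dilation, adds nonlinearities and residual connections, and ends in a classifier over the behaviors (here, sitting on or passing by a chair). The causal structure means a prediction at a given frame uses only past and current observations.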