SII 2025

Object Positions Interpretation System for Service Robots through Targeted Object Marking

Kosei Yamao, Daiju Kanaoka, Kosei Isomoto, Hakaru Tamukoh

Abstract

Service robots are typically required to interpret and execute a variety of complex tasks in home environments. To do so, they must recognize elements of the environment, such as furniture, and understand the positional relationships between objects. Set-of-Mark (SoM) is a visual prompting method that helps a model interpret the relationships between semantic regions by overlaying a mark on each region. However, SoM also marks segmented regions that are not objects, such as walls and floors, and these extra marks act as noise when interpreting object positions. To address this problem, we propose a novel object-position interpretation system that combines an object detection model with a vision-language model (VLM). The proposed system uses the object detection model to mark only objects, allowing the VLM to interpret object positions efficiently. Furthermore, the proposed system improves its accuracy by including the original image and the labels output by the object detection model in the input to the VLM. The experimental results show that the proposed system outperforms SoM in interpreting object positions.
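
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` type, the `assign_marks` and `build_prompt` helpers, the score threshold, and the prompt wording are all hypothetical, and the detector output is hard-coded rather than produced by a real model. The sketch only shows the idea of marking detected objects alone and passing their labels to the VLM alongside the original and marked images.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """One detector output (hypothetical structure, not from the paper)."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    score: float


def assign_marks(detections: List[Detection], min_score: float = 0.5) -> List[Dict]:
    """Keep confident detections and give each a numeric mark at its box centre."""
    kept = [d for d in detections if d.score >= min_score]
    marks = []
    for mark_id, det in enumerate(kept, start=1):
        x0, y0, x1, y1 = det.box
        centre = ((x0 + x1) // 2, (y0 + y1) // 2)
        marks.append({"id": mark_id, "label": det.label, "centre": centre})
    return marks


def build_prompt(marks: List[Dict], question: str) -> str:
    """Compose the text part of the VLM input (assumed wording)."""
    legend = "\n".join(f"[{m['id']}] {m['label']}" for m in marks)
    return (
        "Two images are attached: the original image and a copy with numbered marks.\n"
        "Marked objects:\n"
        f"{legend}\n"
        f"Question: {question}"
    )


# Hard-coded detections standing in for a real object detector.
detections = [
    Detection("cup", (10, 20, 30, 60), 0.9),
    Detection("table", (0, 50, 200, 150), 0.8),
    Detection("shadow", (40, 40, 60, 60), 0.2),  # below threshold, not marked
]
marks = assign_marks(detections)
prompt = build_prompt(marks, "Where is the cup relative to the table?")
print(prompt)
```

Because only detector outputs receive marks, background regions such as walls and floors never enter the mark legend, which is the difference from SoM that the abstract emphasizes.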

Index terms

Vision Systems, Machine Learning, Human-Robot/System Interaction