Grid-Based Marking Prompt Framework for Spatial Understanding in Vision-Language Models
Ryo Terashima, Yuga Yano, Koshun Arimura, Hakaru Tamukoh
Abstract
In this study, we propose a grid-based marking prompt framework to enhance spatial understanding in vision-language models (VLMs). The framework integrates object detection, background masking, and number overlaying so that VLMs can interpret spatial and contextual instructions more effectively. Given a numbered image and a natural language instruction, the VLM selects the number corresponding to the most semantically appropriate location. The framework requires no prior information such as 3D models or physical markers. Moreover, its rules can be adapted flexibly through prompt engineering alone, making it generally applicable across various objects and environments. We conducted two experiments on an object placement task. In Experiment 1, shelf images captured by a service robot were used to evaluate the placement selection accuracy of a VLM. In Experiment 2, the framework was implemented on a service robot, which performed the object placement task at positions selected by a VLM in a real-world environment. The framework achieved a high success rate in both experiments, demonstrating its effectiveness and practical utility in real-world environments.
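The numbering and selection steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform grid, the rule that a cell is excluded when it overlaps a detected object box, the prompt wording, and every name (`Box`, `numbered_free_cells`, `build_prompt`, `parse_choice`) are assumptions made for illustration. Object detection, background masking, drawing the numbers onto the image, and the VLM call itself are omitted.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel rectangle from (x0, y0) to (x1, y1), upper bound exclusive."""
    x0: int
    y0: int
    x1: int
    y1: int

    def overlaps(self, other: "Box") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def numbered_free_cells(width, height, rows, cols, occupied):
    """Split the image into rows x cols cells and number, from 1 in row-major
    order, the cells that overlap no detected object box (assumed rule)."""
    cells = {}
    number = 1
    for r in range(rows):
        for c in range(cols):
            cell = Box(c * width // cols, r * height // rows,
                       (c + 1) * width // cols, (r + 1) * height // rows)
            if not any(cell.overlaps(b) for b in occupied):
                cells[number] = cell
                number += 1
    return cells


def build_prompt(instruction, cells):
    """Compose a hypothetical text prompt to accompany the numbered image."""
    numbers = ", ".join(str(n) for n in sorted(cells))
    return (f"The image shows candidate locations marked with numbers {numbers}. "
            f"Instruction: {instruction} "
            "Answer with the single number of the most appropriate location.")


def parse_choice(reply, cells):
    """Return the first number in the reply that names a valid cell, else None."""
    for token in re.findall(r"\d+", reply):
        if int(token) in cells:
            return int(token)
    return None
```

In this sketch, changing the placement rule only requires editing the instruction string passed to `build_prompt`, which mirrors the claim that rules can be adapted through prompt engineering alone. The selected number maps back to a pixel region through the returned dictionary.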