Texting-While-Walking Detection in Real-World Environments Using Vision-Language Models with Prompt Engineering
Seung pyo Choi, Jiaxu Wu, Qi An, Atsushi Yamashita
Abstract
Smartphone-induced “texting while walking” poses growing safety risks not only in shared public spaces but also in robot navigation scenarios where humans and robots coexist. To mitigate these risks, recent studies have developed pedestrian behavior detection models that aim to recognize when people are distracted by their smartphones. However, these models still suffer from high false-positive rates and reduced detection accuracy when visually similar poses or occlusions occur. To address this issue, we propose a Vision-Language Model (VLM)-based behavior detector that exploits VLMs pretrained on large image-text datasets and capable of global-context inference. Specifically, we leverage LLaVA-7B and systematically evaluate three prompt-engineering schemes: chain-of-thought and self-consistency in the zero-shot setting, and few-shot prompting. We collected the dataset in a typical indoor hall with a centrally placed table that intermittently occluded the robot’s view. During each session, four to six participants walked freely while performing nine everyday actions, resulting in 11,815 annotated pedestrian images captured from the robot’s perspective. Experimental results show that our VLM-based pipeline significantly reduces false-positive detections and improves both precision and overall F1-score compared to a conventional pose-based LSTM baseline. These gains demonstrate that combining large-scale VLM reasoning with specially designed prompts can overcome long-standing misclassification issues in existing approaches. Our curated dataset and prompt-analysis results provide a foundation for extending VLM-based perception to a wide range of camera-based monitoring and navigation systems.
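To illustrate the self-consistency scheme named above, the following is a minimal sketch of majority voting over several sampled model responses. It is not the authors' implementation. The function names, the assumption that each response ends with a line of the form "Answer: yes" or "Answer: no", the default of five samples, and the fallback to "not texting" on ties are all hypothetical choices made for illustration. The VLM call is abstracted as a callable so the sketch runs without any model.

```python
from typing import Callable, Optional


def parse_answer(response: str) -> Optional[bool]:
    """Read the final 'Answer: yes/no' of a response (assumed output format)."""
    lowered = response.lower()
    idx = lowered.rfind("answer:")
    if idx == -1:
        return None  # response did not follow the assumed format
    tail = lowered[idx + len("answer:"):].strip()
    if tail.startswith("yes"):
        return True
    if tail.startswith("no"):
        return False
    return None


def self_consistency_vote(sample_response: Callable[[], str], n_samples: int = 5) -> bool:
    """Sample several reasoning paths and return the majority decision.

    Unparseable responses are ignored; ties and empty votes fall back to
    False (not texting), an assumed conservative default.
    """
    votes = [parse_answer(sample_response()) for _ in range(n_samples)]
    yes = sum(v is True for v in votes)
    no = sum(v is False for v in votes)
    return yes > no


# Canned responses stand in for sampled VLM outputs (hypothetical text).
canned = iter([
    "The person looks down at a phone held in both hands. Answer: yes",
    "The hand is near the face, likely a call. Answer: no",
    "Head tilted down, thumbs on screen. Answer: Yes.",
    "I cannot tell.",
    "Phone visible at chest height while walking. Answer: yes",
])
is_texting = self_consistency_vote(lambda: next(canned), n_samples=5)
print(is_texting)
```

In practice, `sample_response` would wrap a stochastic (temperature above zero) call to the VLM with the same image and chain-of-thought prompt, so that each call can follow a different reasoning path before the votes are aggregated.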