SII 2026

EVLOD: Ensemble Vision-Language Open-Vocabulary Detection for Construction Site Object Recognition

Yongdong Wang, Runze Xiao, Jun Younes Louhi Kasahara, Shota Chikushi, Keiji Nagatani, Atsushi Yamashita, Hajime Asama

Abstract

The construction industry faces severe labor shortages, driving the need for robotic automation solutions. However, effective deployment of construction robots requires robust environmental perception, particularly accurate identification of diverse objects in complex, dynamic construction environments. Closed-set object detection methods are limited to predefined categories and are therefore inadequate for the highly varied object types encountered on construction sites. This paper introduces EVLOD (Ensemble Vision-Language Open-vocabulary Detection), an ensemble framework that integrates multiple state-of-the-art vision-language models to enable open-vocabulary object detection in construction scenarios. EVLOD employs a voting-based fusion strategy that combines predictions from GroundingDINO and GroundingDINO-CLIP detectors, exploiting their complementary strengths while mitigating individual model weaknesses. The ensemble approach incorporates confidence voting, object name voting, and bounding box voting to produce reliable detections with fewer false positives. Evaluated on a dataset of 825 Unmanned Aerial Vehicle (UAV)-captured construction images with 5,020 annotated objects, EVLOD achieves an Average Precision (AP) of 0.49 at an Intersection over Union (IoU) threshold of 0.5, representing a 36.1% improvement over the best-performing baseline. The method also reduces detection noise, cutting the number of detections from 5,495 to 3,232. Qualitative analysis reveals that the primary limitations lie in detecting small-scale objects and low-contrast elements.
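To make the voting-based fusion described in the abstract concrete, the following is a minimal sketch of how predictions from two detectors might be combined. The abstract does not specify the actual voting rules, so everything here is an assumption for illustration: the `Detection` type, the `fuse` function, the 0.5 IoU matching threshold, the score-summed label vote, the mean-confidence rule, and the score-weighted box average are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixels
    label: str    # predicted object name
    score: float  # detector confidence in [0, 1]


def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(dets_a, dets_b, iou_thr=0.5):
    """Hypothetical two-detector fusion: keep only boxes both detectors agree on."""
    fused, used_b = [], set()
    for da in dets_a:
        # Find the best-overlapping, not yet matched detection from detector B.
        best_j, best_iou = None, iou_thr
        for j, db in enumerate(dets_b):
            if j not in used_b and iou(da.box, db.box) >= best_iou:
                best_j, best_iou = j, iou(da.box, db.box)
        if best_j is None:
            continue  # no agreement: treated as noise and dropped (assumption)
        used_b.add(best_j)
        group = [da, dets_b[best_j]]

        # Object name voting (assumed rule): label with the highest summed score.
        votes = defaultdict(float)
        for d in group:
            votes[d.label] += d.score
        label = max(votes, key=votes.get)

        # Confidence voting (assumed rule): mean of the matched scores.
        score = sum(d.score for d in group) / len(group)

        # Bounding box voting (assumed rule): score-weighted coordinate average.
        total = sum(d.score for d in group)
        box = tuple(sum(d.box[k] * d.score for d in group) / total for k in range(4))

        fused.append(Detection(box, label, score))
    return fused


if __name__ == "__main__":
    a = [Detection((0, 0, 10, 10), "excavator", 0.8),
         Detection((50, 50, 60, 60), "truck", 0.4)]
    b = [Detection((1, 1, 11, 11), "excavator", 0.6)]
    for d in fuse(a, b):
        print(d)
```

In this sketch the unmatched "truck" box is discarded because only one detector reports it, which illustrates how agreement-based voting can lower the detection count and suppress false positives, at the cost of possibly dropping true objects that only one model finds.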

Index terms

Robotics, Machine Learning, Automation