Vision-Language Feature Alignment for Road Anomaly Segmentation
Zhuolin He, Jiacheng Tang, Jian Pu, Xiangyang Xue
AI summary
Problem
Existing road anomaly segmentation methods rely on pixel-level statistics or visual features, causing high false-positive rates on normal backgrounds like sky and vegetation while missing true out-of-distribution obstacles, which threatens autonomous driving safety.
Approach
VL-Anomaly aligns segmentation features with CLIP text embeddings of known categories using a learnable, prompt-driven aligner (PL-Aligner) at both the pixel and mask levels. At inference, it fuses detector confidence, text-guided similarity, and CLIP image-text similarity for more robust anomaly scoring.
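The fusion step can be illustrated with a minimal sketch. The summary does not give the exact fusion rule, so the weighted average below is an assumption made purely for illustration, as are the function name `fuse_anomaly_scores`, the equal default weights, and the convention that each cue lies in [0, 1] with higher values meaning "looks like a known class".

```python
from typing import List, Sequence


def fuse_anomaly_scores(
    detector_conf: Sequence[float],
    text_sim: Sequence[float],
    clip_sim: Sequence[float],
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Fuse three per-pixel normality cues into one anomaly score.

    Assumption (not from the paper): each cue is in [0, 1], higher means
    "more like a known class", and the cues are combined by a weighted
    average. The anomaly score is one minus the fused normality.
    """
    if not (len(detector_conf) == len(text_sim) == len(clip_sim)):
        raise ValueError("all three cues must cover the same pixels")
    w_det, w_txt, w_clip = weights
    total = w_det + w_txt + w_clip
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    scores = []
    for d, t, c in zip(detector_conf, text_sim, clip_sim):
        normality = (w_det * d + w_txt * t + w_clip * c) / total
        scores.append(1.0 - normality)
    return scores


# Toy example: pixel 0 looks normal to every cue, pixel 1 to none,
# pixel 2 fools the detector but not the two language-based cues.
scores = fuse_anomaly_scores(
    detector_conf=[1.0, 0.0, 0.2],
    text_sim=[1.0, 0.0, 0.9],
    clip_sim=[1.0, 0.0, 0.9],
)
```

In this toy setting the third pixel, which the detector alone would flag, receives a low fused score because the two language-based cues still consider it normal. This is the kind of background false positive the approach aims to suppress.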
Key results
- State-of-the-art performance on RoadAnomaly, SMIYC, and Fishyscapes benchmarks
- Novel PL-Aligner module for joint pixel- and mask-level vision-language alignment
- Multi-source inference strategy fusing detector confidence, text similarity, and CLIP scores
- Significant reduction in false positives on semantically normal background regions
Why it matters
Provides autonomous driving and robotic perception systems with a more reliable, semantically grounded method for detecting unknown road obstacles, directly addressing critical safety gaps in open-world navigation.
Abstract
Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is available at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
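To make the idea of matching visual features against text embeddings of known categories concrete, here is a minimal sketch of a text-guided similarity. It is not the paper's alignment module: the 3-dimensional vectors stand in for real CLIP text embeddings, and the helper names `cosine` and `best_known_class` are hypothetical. The sketch assumes only that, once features and text embeddings share a space, a feature that matches no known class well is a candidate anomaly.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_known_class(
    feature: Sequence[float],
    text_embeddings: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the known class whose text embedding is closest to the feature."""
    if not text_embeddings:
        raise ValueError("need at least one known-class embedding")
    best_name, best_sim = "", -2.0  # cosine similarity is never below -1
    for name, embedding in text_embeddings.items():
        sim = cosine(feature, embedding)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim


# Toy stand-ins for text embeddings of known classes (assumed values).
known = {"sky": [1.0, 0.0, 0.0], "vegetation": [0.0, 1.0, 0.0]}

sky_like = best_known_class([0.9, 0.1, 0.0], known)   # close to "sky"
unknown = best_known_class([0.0, 0.0, 1.0], known)    # matches nothing well
```

A low best-match similarity, as for the second feature, would translate into a high text-guided anomaly cue. A background feature such as the first one stays close to a known category and is therefore not flagged.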