ICRA 2026

Vision-Language Feature Alignment for Road Anomaly Segmentation

Zhuolin He, Jiacheng Tang, Jian Pu, Xiangyang Xue

AI summary

Integrating vision-language priors through prompt learning and multi-source inference reduces false positives on semantically normal background regions and improves out-of-distribution anomaly detection in road scenes.
Road anomaly segmentation; Vision-language alignment; Out-of-distribution detection; Prompt learning; Autonomous driving; Multi-source inference

Problem

Existing road anomaly segmentation methods often rely on pixel-level statistics or purely visual features to decide whether a region is anomalous. This produces high false-positive rates on normal background regions such as sky and vegetation and poor recall of true out-of-distribution obstacles, both of which threaten autonomous driving safety.

Approach

VL-Anomaly aligns segmentation features with CLIP text embeddings using a learnable prompt-driven aligner at both pixel and mask levels, then fuses detector confidence, text-guided similarity, and CLIP image-text similarity during inference for robust anomaly scoring.
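The fusion step described above can be illustrated with a minimal sketch. It is not the paper's formula: the weighted average, the equal default weights, the assumption that each cue is already scaled to [0, 1] with high values meaning "looks like a known class", and the function name `fuse_anomaly_scores` are all hypothetical choices made here for illustration. Plain Python lists stand in for per-pixel maps so the example has no dependencies.

```python
from typing import List, Sequence, Tuple


def fuse_anomaly_scores(
    detector_conf: Sequence[float],
    text_sim: Sequence[float],
    clip_sim: Sequence[float],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Fuse three per-pixel normality cues into one anomaly score.

    Assumption (not from the paper): every cue lies in [0, 1] and a high
    value means the pixel resembles a known class. The fused anomaly score
    is one minus the weighted mean of the three cues.
    """
    if not (len(detector_conf) == len(text_sim) == len(clip_sim)):
        raise ValueError("all three cue maps must have the same length")
    w_det, w_txt, w_clip = weights
    total = w_det + w_txt + w_clip
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    scores = []
    for d, t, c in zip(detector_conf, text_sim, clip_sim):
        normality = (w_det * d + w_txt * t + w_clip * c) / total
        scores.append(1.0 - normality)
    return scores


# Toy input: pixel 0 is sky-like (the detector is unsure, but both
# text-based cues agree it is a known class); pixel 1 is an unknown obstacle.
scores = fuse_anomaly_scores(
    detector_conf=[0.2, 0.1],
    text_sim=[0.9, 0.1],
    clip_sim=[0.9, 0.2],
)
print(scores)
```

In this toy input the sky-like pixel ends up with a lower anomaly score than the obstacle pixel even though the detector alone is uncertain about it, which is the kind of false-positive suppression the summary attributes to combining complementary sources.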

Key results

  • State-of-the-art performance on RoadAnomaly, SMIYC, and Fishyscapes benchmarks
  • Novel PL-Aligner module for joint pixel- and mask-level vision-language alignment
  • Multi-source inference strategy fusing detector confidence, text similarity, and CLIP scores
  • Significant reduction in false positives on semantically normal background regions

Why it matters

Provides autonomous driving and robotic perception systems with a more reliable, semantically grounded method for detecting unknown road obstacles, directly addressing critical safety gaps in open-world navigation.

Abstract

Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true Out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former’s visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and Fishyscapes. Code is released on https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
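To make the idea of text-guided similarity concrete, the sketch below scores a single visual feature against text embeddings of known categories using cosine similarity. It is an assumption-laden illustration, not the released implementation: the three-dimensional vectors are toy stand-ins for real CLIP embeddings, and the rule "anomaly score = 1 minus the best similarity, clamped at zero" and the names `cosine` and `text_guided_anomaly` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def text_guided_anomaly(
    feature: Sequence[float],
    class_text_embeddings: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the closest known class and a similarity-based anomaly score.

    Assumption (not from the paper): a feature that matches no known-class
    text embedding well is treated as anomalous, with score
    1 - max(best similarity, 0).
    """
    best_class, best_sim = max(
        ((name, cosine(feature, emb)) for name, emb in class_text_embeddings.items()),
        key=lambda pair: pair[1],
    )
    return best_class, 1.0 - max(best_sim, 0.0)


# Toy embeddings standing in for CLIP text embeddings of known categories.
known = {"sky": [1.0, 0.0, 0.0], "vegetation": [0.0, 1.0, 0.0]}

print(text_guided_anomaly([0.9, 0.1, 0.0], known))  # feature close to "sky"
print(text_guided_anomaly([0.0, 0.0, 1.0], known))  # feature unlike any known class
```

A feature that lies close to the "sky" embedding receives a score near zero, while one orthogonal to every known category receives the maximum score. This mirrors, in a simplified form, how aligning visual features with known-category text embeddings can suppress spurious responses on background regions.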

Index terms

Computer Vision for Automation; Object Detection, Segmentation and Categorization; Deep Learning for Visual Perception
