ICRA 2026

Language-Guided Attribute Alignment and Semantic Consistency for Zero-Shot Domain Adaptation

Junhong Pan, Chenyi Jiang, Minxian Li, Haofeng Zhang

AI summary

LASC dynamically generates attribute-aware text prompts and enforces semantic consistency via a memory bank, significantly boosting zero-shot cross-domain segmentation without target data.

zero-shot domain adaptation, cross-modal alignment, attribute-driven prompts, semantic consistency, vision-language models, semantic segmentation

Problem

Existing zero-shot domain adaptation methods rely on rigid, fixed prompts that miss fine-grained domain-specific attributes and suffer from unstable visual-linguistic alignment, limiting cross-domain transfer.

Approach

The authors propose LASC, which dynamically combines category labels with domain-relevant attributes to create adaptive text prompts, aligns them with visual features through cross-modal attention, and stabilizes representations using a memory bank that enforces intra-class compactness and inter-class separation.
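
To make the prompt-construction idea concrete, here is a minimal sketch of pairing category labels with domain-relevant attributes. It is not the authors' module: the function name `build_attribute_prompts`, the template string, and the example attributes are all hypothetical, and the sketch enumerates fixed combinations, whereas the paper describes a dynamic generation step whose details are not given on this page.

```python
def build_attribute_prompts(class_names, domain_attributes,
                            template="a photo of a {cls}, {attr}"):
    """Pair every class name with every attribute phrase.

    Returns a dict mapping each class name to its list of text prompts.
    The template is an illustrative assumption, not taken from the paper.
    """
    prompts = {}
    for cls in class_names:
        prompts[cls] = [template.format(cls=cls, attr=attr)
                        for attr in domain_attributes]
    return prompts


# Hypothetical classes and target-domain attributes.
prompts = build_attribute_prompts(["road", "car"], ["at night", "in heavy fog"])
for cls, texts in prompts.items():
    print(cls, texts)
```

In a full pipeline, each prompt would be passed through a text encoder to obtain the embeddings that are later aligned with visual features.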

Key results

Dynamic attribute-driven prompt generation captures fine-grained domain variations
Memory-based consistency constraint enforces intra-class compactness and inter-class separation
Significant performance gains over state-of-the-art baselines on multiple cross-domain benchmarks
Robust zero-shot adaptation achieved without requiring any target-domain supervision
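
The memory-based consistency constraint listed above can be illustrated with a class-prototype bank. This is a sketch under assumptions, not the paper's formulation: the exponential-moving-average update, the cosine-based loss, the `margin` value, and the name `PrototypeMemoryBank` are all chosen here for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


class PrototypeMemoryBank:
    """Keeps one unit-length prototype per class, updated by a moving average."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.prototypes = {}

    def update(self, label, feature):
        feature = l2_normalize(feature)
        if label not in self.prototypes:
            self.prototypes[label] = feature
            return
        m = self.momentum
        mixed = [m * o + (1 - m) * f
                 for o, f in zip(self.prototypes[label], feature)]
        self.prototypes[label] = l2_normalize(mixed)

    def consistency_loss(self, label, feature, margin=0.5):
        feature = l2_normalize(feature)
        # Intra-class compactness: pull the feature toward its own prototype.
        intra = 1.0 - dot(feature, self.prototypes[label])
        # Inter-class separation: penalize similarity to other prototypes
        # only when it exceeds the margin.
        inter = sum(max(0.0, dot(feature, proto) - margin)
                    for other, proto in self.prototypes.items()
                    if other != label)
        return intra + inter


bank = PrototypeMemoryBank()
bank.update("road", [1.0, 0.0])
bank.update("car", [0.0, 1.0])
loss_aligned = bank.consistency_loss("road", [0.9, 0.1])
loss_confused = bank.consistency_loss("road", [0.1, 0.9])
print(loss_aligned < loss_confused)
```

A feature close to its own class prototype incurs a small loss, while one that drifts toward another class is penalized by both terms.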

Why it matters

Enables more reliable cross-domain visual understanding for safety-critical applications like autonomous driving and medical imaging, where target-domain data is impractical to collect.

Abstract

In cross-domain visual understanding tasks, models often achieve strong performance on the source domain but suffer severe degradation when applied to target domains with substantial distribution shifts. This challenge is particularly prominent under the zero-shot domain adaptation setting, where adaptation must be achieved without access to target-domain samples and instead relies on language guidance to bridge the gap. However, existing approaches typically depend on fixed class names or handcrafted prompt templates, which fail to capture fine-grained semantic attributes present in the target domain. Moreover, the insufficient alignment between visual and linguistic modalities further constrains the transferability of semantic knowledge. To address these issues, we propose an attribute-driven cross-modal feature modulation framework, termed Language-guided Attribute alignment and Semantic Consistency (LASC). On the semantic side, we introduce an attribute-driven prompt generation module that dynamically combines category information with domain-relevant attributes to construct adaptive text prompts, which are aligned with visual features through cross-modal attention for enhanced semantic stability. Furthermore, we incorporate a semantic consistency constraint, where a memory bank enforces intra-class compactness and inter-class separation, ensuring robust discriminability across domains. Extensive experiments demonstrate that our approach achieves significant improvements over state-of-the-art baselines on multiple cross-domain benchmarks, and maintains strong adaptation ability without requiring any target-domain data. The code is available at https://github.com/JHP-3/LASC.
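
The cross-modal attention mentioned in the abstract can be sketched as plain scaled dot-product attention. This is an assumption-laden illustration rather than the released implementation: it assumes visual features act as queries over text prompt embeddings (keys and values), omits the learned projection matrices a real layer would have, and uses the hypothetical name `cross_modal_attention`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(visual_feats, text_embeds):
    """Each visual feature (query) attends over text embeddings (keys/values).

    Returns the attended outputs and the attention weights per query.
    """
    dim = len(text_embeds[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for query in visual_feats:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in text_embeds]
        weights = softmax(scores)
        attended = [sum(w * value[i] for w, value in zip(weights, text_embeds))
                    for i in range(dim)]
        outputs.append(attended)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: two visual features, two text prompt embeddings.
visual = [[4.0, 0.0], [0.0, 4.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
attended, attn = cross_modal_attention(visual, text)
print(attn)
```

Each visual feature puts most of its attention weight on the text embedding it is most similar to, so the output is a text-conditioned version of that feature.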

Index terms

Semantic Scene Understanding, Autonomous Vehicle Navigation, Vision-Based Navigation