Multi-State Consistency Visual Language Model Combine Wavelet Transform for Weakly Supervised Robot Visual Segmentation
Feng Xiao, Peihua Han, Guoyuan Li, Houxiang Zhang
AI summary
Problem
Weakly supervised semantic segmentation struggles with noisy pseudo-labels, boundary blurring, and domain shift when adapting large visual models to robotics, while dense pixel annotations remain prohibitively costly.
Approach
The method uses a dual-branch encoder (CLIP and DINOv3) aligned through consistency learning to reduce representation gaps, combined with a wavelet transform decoder that simultaneously captures global context and high-frequency spatial details for sharper boundaries.
Key results
- 82.6% mIoU on PASCAL VOC2012 test set, surpassing prior single-stage methods by ~5%
- Consistency learning suppresses cross-domain noise and aligns dual-branch feature spaces
- Wavelet-based decoder recovers fine-grained details and sharpens boundaries without multi-stage post-processing
- Maintains computational efficiency suitable for real-time robotic deployment
Why it matters
Provides a scalable, annotation-efficient solution for high-precision visual perception in dynamic robotic environments.
Abstract
Robotic visual segmentation is essential for en- abling robots to operate in complex environments. Although supervised methods have achieved remarkable progress, their dependence on dense annotations hinders scalability. Weakly supervised semantic segmentation (WSSS) alleviates this issue but suffers from sparse supervision, leading to noisy pseudo- labels and boundary errors. Large visual models (LVMs), pre- trained on diverse data, provide rich semantic priors that can strengthen weak supervision and address these limitations. To this end, we designed a dual-branch architecture, introducing two large pre-trained models with complementary characteris- tics. We align the feature spaces of the two branches through consistency learning to alleviate the representation differences and weakly supervised noise problems caused by cross-domain migration, thereby obtaining more robust and fine-grained semantic features. Furthermore, to effectively restore spatial details and improve the quality of segmentation boundaries, we introduce a wavelet transform in the decoder. Wavelet decomposition can simultaneously capture low-frequency global information and high-frequency local details at multiple scales, allowing the model to enhance spatial restoration capabilities while maintaining semantic consistency. Experimental results show that our method improves the performance by 7.7% compared with the state-of-the-art methods in WSSS.