ICRA 2026

Multi-State Consistency Visual Language Model Combine Wavelet Transform for Weakly Supervised Robot Visual Segmentation

Feng Xiao, Peihua Han, Guoyuan Li, Houxiang Zhang

AI summary

A dual-branch large vision model aligned via consistency learning and decoded with wavelet transforms achieves state-of-the-art weakly supervised segmentation performance while preserving fine-grained boundaries for robotics.
Weakly supervised segmentation, Large vision models, Wavelet transform, Consistency learning, Robotic vision, Semantic segmentation

Problem

Weakly supervised semantic segmentation struggles with noisy pseudo-labels, boundary blurring, and domain shift when adapting large visual models to robotics, while dense pixel annotations remain prohibitively costly.

Approach

The method uses a dual-branch encoder (CLIP and DINOv3) aligned through consistency learning to reduce representation gaps, combined with a wavelet transform decoder that simultaneously captures global context and high-frequency spatial details for sharper boundaries.
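
As a rough illustration of what aligning two branches through consistency learning can look like, the sketch below projects per-patch features from two encoders into a shared space and penalizes their cosine distance. The summary does not state the paper's actual loss. The cosine form, the linear projections `proj_a` and `proj_b`, the feature shapes, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def consistency_loss(feat_a, feat_b, proj_a, proj_b):
    """Mean (1 - cosine similarity) between projected per-patch features.

    feat_a: (N, Da) patch features from branch A (stand-in for a CLIP-like encoder)
    feat_b: (N, Db) patch features from branch B (stand-in for a DINO-like encoder)
    proj_a: (Da, D) and proj_b: (Db, D) are hypothetical linear projections
            into a shared D-dimensional space.
    """
    za = l2_normalize(feat_a @ proj_a)
    zb = l2_normalize(feat_b @ proj_b)
    cos = np.sum(za * zb, axis=-1)      # per-patch cosine similarity
    return float(np.mean(1.0 - cos))    # 0 when aligned, up to 2 when opposed

# Random stand-ins for encoder outputs (assumed shapes: 196 patches).
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(196, 512))
feat_b = rng.normal(size=(196, 768))
proj_a = rng.normal(size=(512, 128))
proj_b = rng.normal(size=(768, 128))

loss_random = consistency_loss(feat_a, feat_b, proj_a, proj_b)
loss_same = consistency_loss(feat_a, feat_a, proj_a, proj_a)
```

With identical inputs the loss is essentially zero, while unrelated random features give a value near 1. In training, a term like this would be minimized alongside the segmentation objective.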

Key results

  • 82.6% mIoU on PASCAL VOC2012 test set, surpassing prior single-stage methods by ~5%
  • Consistency learning suppresses cross-domain noise and aligns dual-branch feature spaces
  • Wavelet-based decoder recovers fine-grained details and sharpens boundaries without multi-stage post-processing
  • Maintains computational efficiency suitable for real-time robotic deployment
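
For readers unfamiliar with the mIoU figure quoted above, the following sketch shows one common way to compute it: accumulate a confusion matrix over the labeled pixels, then average the per-class intersection-over-union. The function names, the `ignore_index=255` convention, and the tiny 2x2 example are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count (target, pred) pairs over pixels whose label is not ignored."""
    mask = target != ignore_index
    idx = num_classes * target[mask].astype(np.int64) + pred[mask].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)  # rows: target, cols: pred

def mean_iou(conf):
    """Average IoU over classes that appear in the prediction or the target."""
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0
    return float(np.mean(tp[valid] / union[valid]))

# Toy two-class example with one mislabeled pixel.
target = np.array([[0, 0],
                   [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])
miou = mean_iou(confusion_matrix(pred, target, num_classes=2))
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so `miou` is 7/12. On a full benchmark the confusion matrix would be summed over all images before calling `mean_iou`.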

Why it matters

Provides a scalable, annotation-efficient solution for high-precision visual perception in dynamic robotic environments.

Abstract

Robotic visual segmentation is essential for enabling robots to operate in complex environments. Although supervised methods have achieved remarkable progress, their dependence on dense annotations hinders scalability. Weakly supervised semantic segmentation (WSSS) alleviates this issue but suffers from sparse supervision, leading to noisy pseudo-labels and boundary errors. Large visual models (LVMs), pre-trained on diverse data, provide rich semantic priors that can strengthen weak supervision and address these limitations. To this end, we designed a dual-branch architecture, introducing two large pre-trained models with complementary characteristics. We align the feature spaces of the two branches through consistency learning to alleviate the representation differences and weakly supervised noise problems caused by cross-domain migration, thereby obtaining more robust and fine-grained semantic features. Furthermore, to effectively restore spatial details and improve the quality of segmentation boundaries, we introduce a wavelet transform in the decoder. Wavelet decomposition can simultaneously capture low-frequency global information and high-frequency local details at multiple scales, allowing the model to enhance spatial restoration capabilities while maintaining semantic consistency. Experimental results show that our method improves the performance by 7.7% compared with the state-of-the-art methods in WSSS.
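
To make the low-frequency/high-frequency split concrete, here is a minimal single-level 2D Haar decomposition and its inverse. The abstract does not say which wavelet the decoder uses or how the subbands are consumed, so the Haar choice, the function names, and the toy step-edge input are assumptions. A multi-scale version would re-apply `haar_dwt2` to the `low` band.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns four (H/2, W/2) subbands: a low-frequency average plus detail
    bands responding to changes across columns, across rows, and diagonally.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    low = (a + b + c + d) / 2.0
    detail_cols = (a - b + c - d) / 2.0   # left vs right: vertical edges
    detail_rows = (a + b - c - d) / 2.0   # top vs bottom: horizontal edges
    detail_diag = (a - b - c + d) / 2.0
    return low, detail_cols, detail_rows, detail_diag

def haar_idwt2(low, detail_cols, detail_rows, detail_diag):
    """Invert haar_dwt2 exactly (up to floating-point rounding)."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + detail_cols + detail_rows + detail_diag) / 2.0
    x[0::2, 1::2] = (low - detail_cols + detail_rows - detail_diag) / 2.0
    x[1::2, 0::2] = (low + detail_cols - detail_rows - detail_diag) / 2.0
    x[1::2, 1::2] = (low - detail_cols - detail_rows + detail_diag) / 2.0
    return x

# Toy input: a vertical step edge between column 0 and column 1.
image = np.zeros((4, 4))
image[:, 1:] = 1.0

low, detail_cols, detail_rows, detail_diag = haar_dwt2(image)
restored = haar_idwt2(low, detail_cols, detail_rows, detail_diag)
```

The step edge shows up only in `detail_cols`, while `detail_rows` and `detail_diag` stay zero, and `restored` matches `image`. This is the sense in which the detail bands carry boundary information that a decoder can draw on when upsampling.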

Index terms

Deep Learning for Visual Perception, Computer Vision for Automation, Deep Learning Methods