ICRA 2026

Instance-Guided Unsupervised Domain Adaptation for Robotic Semantic Segmentation

Michele Antonazzi, Lorenzo Signorelli, Matteo Luperto, Nicola Basilico

AI summary

Refining multi-view pseudo-labels with a foundation model's zero-shot instance segmentation consistently improves unsupervised domain adaptation for robotic semantic segmentation, without requiring any target-domain labels.

Unsupervised Domain Adaptation, Robotic Semantic Segmentation, Foundation Models, Multi-View Consistency, Self-Supervised Learning, Instance Segmentation

Problem

Semantic segmentation models degrade due to domain shift when deployed in new environments. Existing unsupervised adaptation methods leveraging multi-view consistency suffer from instance-level incoherence and rendering artifacts in their pseudo-labels.

Approach

The method aggregates per-frame predictions into a volumetric 3D map to generate multi-view consistent pseudo-labels. It then refines these labels with the zero-shot instance segmentation of a foundation model (SAM), driven by automated prompting, so that the pixels of each instance share a coherent label. The refined labels serve as supervision for self-supervised fine-tuning.
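The refinement step can be illustrated with a minimal sketch. The summary does not state the exact rule used to combine instance masks with pseudo-labels, so the majority vote below is an assumption, not the authors' implementation. The function name refine_with_instances, the min_agreement threshold, and the ignore label 255 are all hypothetical. The instance masks are assumed to be boolean arrays produced beforehand by an instance segmenter such as SAM; no SAM call is made here.

```python
import numpy as np


def refine_with_instances(pseudo_labels, instance_masks,
                          ignore_label=255, min_agreement=0.5):
    """Give every pixel of an instance mask the mask's majority class.

    Hypothetical helper: the majority-vote rule is an assumption.

    pseudo_labels : (H, W) integer array of class ids; pixels equal to
                    ignore_label carry no pseudo-label.
    instance_masks: iterable of (H, W) boolean arrays, one per instance.
    """
    refined = pseudo_labels.copy()
    for mask in instance_masks:
        # Collect the labelled pixels that fall inside this instance.
        labels = pseudo_labels[mask]
        labels = labels[labels != ignore_label]
        if labels.size == 0:
            continue  # nothing to vote with, leave the region untouched
        counts = np.bincount(labels)
        winner = int(counts.argmax())
        # Overwrite only when the winning class is supported by enough
        # of the labelled pixels; otherwise keep the original labels.
        if counts[winner] / labels.size >= min_agreement:
            refined[mask] = winner
    return refined
```

Under this rule, unlabeled pixels inside a confidently labelled instance are filled in, and isolated disagreeing pixels are overwritten. Masks with no clear majority are left unchanged, so a poor instance mask does not propagate a wrong class.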

Key results

Proposes a novel UDA framework combining multi-view consistency with instance-aware pseudo-label refinement
Integrates the SAM foundation model with two complementary automated prompting strategies
Demonstrates consistent performance improvements over state-of-the-art UDA baselines on real-world data
Mitigates instance-level incoherence and rendering artifacts in pseudo-labels without requiring target-domain ground truth

Why it matters

Enables mobile robots to adapt their perception systems to novel environments at deployment time, reducing reliance on costly manual annotation and improving robustness during long-term operation.

Abstract

Semantic segmentation networks, which are essential for robotic perception, often suffer from performance degradation when the visual distribution of the deployment environment differs from that of the source dataset on which they were trained. Unsupervised Domain Adaptation (UDA) addresses this challenge by adapting the network to the robot’s target environment without external supervision, leveraging the large amounts of data a robot might naturally collect during long-term operation. In such settings, UDA methods can exploit multi-view consistency across the environment’s map to fine-tune the model in an unsupervised fashion and mitigate domain shift. However, these approaches remain sensitive to cross-view instance-level inconsistencies. In this work, we propose a method that starts from a volumetric 3D map to generate multi-view consistent pseudo-labels. We then refine these labels using the zero-shot instance segmentation capabilities of a foundation model, enforcing instance-level coherence. The refined annotations serve as supervision for self-supervised fine-tuning, enabling the robot to adapt its perception system at deployment time. Experiments on real-world data demonstrate that our approach consistently improves performance over state-of-the-art UDA baselines based on multi-view consistency, without requiring any ground-truth labels in the target domain.

Index terms

Deep Learning for Visual Perception; Object Detection, Segmentation and Categorization