Instance-Guided Unsupervised Domain Adaptation for Robotic Semantic Segmentation
Michele Antonazzi, Lorenzo Signorelli, Matteo Luperto, Nicola Basilico
AI summary
Problem
Semantic segmentation models degrade due to domain shift when deployed in new environments. Existing unsupervised adaptation methods leveraging multi-view consistency suffer from instance-level incoherence and rendering artifacts in their pseudo-labels.
Approach
The method aggregates per-frame predictions into a 3D map to generate multi-view consistent pseudo-labels, then refines them using the zero-shot instance segmentation of a foundation model (SAM) via automated prompting to enforce instance-level coherence before self-supervised fine-tuning.
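The aggregation step can be illustrated with a small voting sketch. This is not the authors' implementation: it assumes each pixel has already been associated with a voxel index (for example by projecting depth with known camera poses, which is not shown), and the names `aggregate_votes`, `voxel_labels`, and `ignore_index` are hypothetical.

```python
import numpy as np

def aggregate_votes(voxel_ids, frame_labels, num_voxels, num_classes):
    """Count, per voxel, how many pixel observations voted for each class.

    voxel_ids and frame_labels are flat integer arrays of equal length,
    pooled over all frames (assumed inputs, one entry per observed pixel).
    """
    votes = np.zeros((num_voxels, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(votes, (voxel_ids, frame_labels), 1)
    return votes

def voxel_labels(votes, ignore_index=255):
    """Pick the most voted class per voxel; unobserved voxels are ignored."""
    labels = votes.argmax(axis=1)
    labels[votes.sum(axis=1) == 0] = ignore_index
    return labels

# Toy example: 3 voxels, 3 classes, 5 pixel observations.
voxel_ids = np.array([0, 0, 0, 1, 1])
frame_labels = np.array([2, 2, 1, 0, 0])
votes = aggregate_votes(voxel_ids, frame_labels, num_voxels=3, num_classes=3)
map_labels = voxel_labels(votes)
```

Rendering the voxel labels back into each camera view would then give pseudo-labels that agree across views. Simple majority voting is only one possible fusion rule and is used here for brevity.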
Key results
- Proposes a novel UDA framework combining multi-view consistency with instance-aware pseudo-label refinement
- Integrates the SAM foundation model with two complementary automated prompting strategies
- Demonstrates consistent performance improvements over state-of-the-art UDA baselines on real-world data
- Reduces instance-level incoherence and rendering artifacts in pseudo-labels without requiring target-domain ground truth
Why it matters
Enables mobile robots to autonomously adapt their perception systems to novel environments at deployment time, reducing reliance on costly manual annotation and improving long-term operational robustness.
Abstract
Semantic segmentation networks, which are essential for robotic perception, often suffer from performance degradation when the visual distribution of the deployment environment differs from that of the source dataset on which they were trained. Unsupervised Domain Adaptation (UDA) addresses this challenge by adapting the network to the robot’s target environment without external supervision, leveraging the large amounts of data a robot might naturally collect during long-term operation. In such settings, UDA methods can exploit multi-view consistency across the environment’s map to fine-tune the model in an unsupervised fashion and mitigate domain shift. However, these approaches remain sensitive to cross-view instance-level inconsistencies. In this work, we propose a method that starts from a volumetric 3D map to generate multi-view consistent pseudo-labels. We then refine these labels using the zero-shot instance segmentation capabilities of a foundation model, enforcing instance-level coherence. The refined annotations serve as supervision for self-supervised fine-tuning, enabling the robot to adapt its perception system at deployment time. Experiments on real-world data demonstrate that our approach consistently improves performance over state-of-the-art UDA baselines based on multi-view consistency, without requiring any ground-truth labels in the target domain.
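One way to picture the instance-level refinement is a per-mask majority vote. The sketch below is an assumption-laden illustration, not the paper's procedure: it takes boolean instance masks as given (producing them with a foundation model is not shown), and `refine_with_instance_masks` and `ignore_index` are hypothetical names.

```python
import numpy as np

def refine_with_instance_masks(pseudo_labels, instance_masks, num_classes,
                               ignore_index=255):
    """Give every pixel of an instance mask the mask's majority class.

    pseudo_labels: 2D integer array of class ids (ignore_index = unlabeled).
    instance_masks: iterable of 2D boolean arrays with the same shape.
    """
    refined = pseudo_labels.copy()
    for mask in instance_masks:
        labels_in_mask = pseudo_labels[mask]
        valid = labels_in_mask[labels_in_mask != ignore_index]
        if valid.size == 0:
            continue  # no labeled pixel inside this mask, leave it untouched
        counts = np.bincount(valid, minlength=num_classes)
        # If masks overlap, later masks overwrite earlier ones.
        refined[mask] = counts.argmax()
    return refined

# Toy example: one 2x2 instance whose pixels disagree (three 1s, one 2).
labels = np.array([[1, 1, 0, 0],
                   [1, 2, 0, 255],
                   [3, 3, 3, 3],
                   [3, 3, 3, 3]])
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
refined = refine_with_instance_masks(labels, [mask], num_classes=4)
```

In this toy case the single pixel labeled 2 is overwritten with class 1, while pixels outside the mask keep their original pseudo-labels.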