MSG-Loc: Multi-Label Likelihood-Based Semantic Graph Matching for Object-Level Global Localization
Gihyeon Lee, Jungwoo Lee, Juwon Kim, Young-Sik Shin, Younggun Cho
AI summary
Problem
Existing object-based localization methods rely on single-label predictions, which discard semantic uncertainty. They break down in open-set or viewpoint-varying scenarios, where misclassification causes frequent data association errors and inaccurate pose estimates.
Approach
The framework builds semantic graphs using multi-label detection frequencies and confidence scores, then propagates context-aware maximum likelihoods from neighboring nodes to compute robust similarity scores for reliable object matching and pose estimation.
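As a rough illustration of the multi-label node representation described above, the sketch below accumulates detection confidences per label for one object and normalizes them into a distribution. The function name `multilabel_distribution`, the `(label, confidence)` input format, and the confidence-weighted sum are assumptions made for illustration; the paper's exact weighting of frequency and confidence may differ.

```python
from collections import defaultdict


def multilabel_distribution(detections):
    """Turn repeated detections of one object into a multi-label distribution.

    detections: list of (label, confidence) pairs gathered across frames.
    Hypothetical weighting: each detection adds its confidence to its label,
    so both how often and how confidently a label appears contribute.
    """
    weight = defaultdict(float)
    for label, confidence in detections:
        weight[label] += confidence
    total = sum(weight.values())
    if total <= 0.0:
        return {}
    # Normalize so the label weights sum to one.
    return {label: w / total for label, w in weight.items()}


# An object seen twice as "chair" with low confidence and once as "sofa".
dist = multilabel_distribution([("chair", 0.5), ("chair", 0.5), ("sofa", 1.0)])
```

Keeping every observed label, rather than only the top one, is what lets an ambiguous object still match a map node whose dominant label differs.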
Key results
- First to exploit multi-label likelihoods for capturing semantic uncertainty in object correspondence
- Stabilizes data association and pose estimation under sparse and ambiguous observations via 1-hop neighbor propagation
- Maintains reliable performance across large-vocabulary and open-set detection configurations
- Demonstrates cross-paradigm compatibility with both zero-shot and supervised classification tasks
Why it matters
Enables robust robot relocalization and loop closure in real-world settings where object detection is inherently uncertain or open-vocabulary.
Abstract
Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.
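The context-aware likelihood propagation mentioned in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the node likelihood is taken as the inner product of two label distributions, the neighbor term averages the best match found for each 1-hop neighbor, and the two are combined additively with a hypothetical weight `alpha`. All function and parameter names are illustrative.

```python
def label_likelihood(p, q):
    """Likelihood that two multi-label distributions describe the same object.

    Assumed here to be the inner product over shared labels.
    """
    return sum(prob * q.get(label, 0.0) for label, prob in p.items())


def context_score(i, j, labels_a, labels_b, nbrs_a, nbrs_b, alpha=0.5):
    """Similarity of node i (graph A) and node j (graph B) with 1-hop context.

    labels_*: dict mapping node id to a {label: probability} dict.
    nbrs_*:   dict mapping node id to a list of neighbor node ids.
    alpha:    hypothetical weight on the neighbor term.
    """
    own = label_likelihood(labels_a[i], labels_b[j])
    # For each neighbor of i, keep the maximum likelihood over neighbors of j.
    best = [
        max(
            (label_likelihood(labels_a[u], labels_b[v]) for v in nbrs_b[j]),
            default=0.0,
        )
        for u in nbrs_a[i]
    ]
    context = sum(best) / len(best) if best else 0.0
    return own + alpha * context


# Query graph A: a chair next to a table.
labels_a = {0: {"chair": 1.0}, 1: {"table": 1.0}}
nbrs_a = {0: [1], 1: [0]}
# Map graph B: two ambiguous chair/sofa objects, only one of them near a table.
labels_b = {
    0: {"chair": 0.5, "sofa": 0.5},
    1: {"table": 1.0},
    2: {"chair": 0.5, "sofa": 0.5},
}
nbrs_b = {0: [1], 1: [0], 2: []}

score_with_context = context_score(0, 0, labels_a, labels_b, nbrs_a, nbrs_b)
score_without_context = context_score(0, 2, labels_a, labels_b, nbrs_a, nbrs_b)
```

In this toy case the two candidate nodes in graph B have identical label distributions, so the node likelihood alone cannot separate them; the neighbor term favors the candidate that also sits next to a table.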