ICRA 2026

SG-Reg: Generalizable and Efficient Scene Graph Registration

Chuhao Liu, Zhijian Qiao, Jieqi Shi, Ke WANG, Peize LIU, Shaojie Shen

AI summary

SG-Reg enables robust, bandwidth-efficient semantic scene graph registration for multi-agent SLAM by learning multi-modal node features and eliminating reliance on ground-truth annotations.

Semantic scene graph registration; Multi-agent SLAM; Vision foundation models; Graph neural networks; Robust pose estimation; Bandwidth-efficient mapping

Problem

Classical semantic registration relies on fragile handcrafted descriptors, while learning-based methods depend on ground-truth annotations, which creates a domain gap and leads to poor generalization in noisy real-world environments. Additionally, vision-based multi-agent registration demands excessive communication bandwidth.

Approach

The method encodes open-set semantic labels, local topology via a triplet-boosted GNN, and geometric shape features into compact node representations. Nodes are then matched in a coarse-to-fine manner using optimal transport, and a robust pose estimator computes the transformation from the resulting correspondences. The network is trained on scene graphs generated automatically with vision foundation models.
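To make the optimal-transport matching step concrete, here is a minimal sketch, not the paper's implementation. It assumes each node is already summarized by a feature vector, uses cosine similarity as the matching score, and runs a plain Sinkhorn iteration with uniform marginals. The function names, the regularization value `eps`, and the mutual-argmax selection rule are all illustrative assumptions. It also has no mechanism for nodes without a counterpart in the other graph.

```python
import numpy as np


def sinkhorn(scores: np.ndarray, n_iters: int = 50, eps: float = 0.1) -> np.ndarray:
    """Entropy-regularized optimal transport with uniform marginals.

    `scores` is an (n, m) similarity matrix; higher means a better match.
    Returns an (n, m) transport plan whose entries sum to 1.
    """
    # Gibbs kernel; subtracting the max keeps the exponentials bounded.
    kernel = np.exp((scores - scores.max()) / eps)
    n, m = kernel.shape
    row_mass = np.full(n, 1.0 / n)
    col_mass = np.full(m, 1.0 / m)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to meet the marginals.
        u = row_mass / (kernel @ v)
        v = col_mass / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def match_nodes(feat_a: np.ndarray, feat_b: np.ndarray) -> list[tuple[int, int]]:
    """Match nodes of graph A to graph B from their feature vectors.

    Hypothetical helper: keeps only pairs that are each other's best
    assignment in the transport plan (mutual argmax).
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    plan = sinkhorn(a @ b.T)  # cosine similarity as the score
    best_col = plan.argmax(axis=1)
    best_row = plan.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(best_col) if best_row[j] == i]
```

In a coarse-to-fine scheme, a step like this would produce the coarse node-level matches, which a finer stage could then refine within each matched pair.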

Key results

Achieves significantly higher registration recall than SG-PGM on real-world graphs
Requires only 52 kB of communication bandwidth per query frame
Outperforms handcrafted descriptors and visual loop closure networks in two-agent SLAM
Eliminates ground-truth annotation dependency via a novel vision foundation model data pipeline

Why it matters

Enables scalable, robust multi-agent mapping and localization in real-world environments where communication bandwidth is limited and semantic data is inherently noisy.

Abstract

This article addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. The handcrafted descriptors in classical semantic-aided registration, or the ground-truth annotation reliance in learning-based scene graph registration, impede their application in practical real-world environments. To address the challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open-set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse-to-fine manner. In the back end, we employ a robust pose estimator to decide the transformation according to the correspondences. We manage to maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and less communication bandwidth in multiagent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method in a two-agent simultaneous localization and mapping benchmark. It significantly outperforms the handcrafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 kB of communication bandwidth for each query frame.
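As an illustration of the back-end step, the sketch below estimates a rigid transformation from 3D point correspondences that may contain outliers. It is a stand-in, not the estimator used in the paper: it pairs the closed-form Kabsch (SVD) alignment with a basic RANSAC loop. The function names, the inlier threshold, and the iteration count are assumptions chosen for the example.

```python
import numpy as np


def kabsch(src: np.ndarray, dst: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def ransac_pose(src, dst, n_iters=200, inlier_thresh=0.2, seed=0):
    """Fit (R, t) while tolerating wrong correspondences.

    Hypothetical helper: samples 3 correspondences per iteration, keeps the
    hypothesis with the most inliers, then refits on all of its inliers.
    """
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("not enough inlier correspondences")
    R, t = kabsch(src[best], dst[best])
    return R, t, best
```

Here `src` and `dst` would be the matched 3D points (for example, node centroids) from the two scene graphs, given as arrays of shape (N, 3) in corresponding order.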

Index terms

SLAM; Deep Learning in Robotics and Automation; Multi-Robot Systems; Semantic Scene Understanding