SSMG-Nav: Enhancing Lifelong Object Navigation with Semantic Skeleton Memory Graph
Haochen Niu, Lantao Zhang, Xingwu Ji, Rendong Ying, Peilin Liu, Fei Wen
AI summary
Problem
Existing object navigation methods lack persistent, reusable memory, rely on single-modality inputs, and employ myopic greedy policies that cause inefficient back-and-forth maneuvers, limiting their effectiveness in lifelong settings.
Approach
The framework constructs a Semantic Skeleton Memory Graph anchored by topological keypoints to consolidate past observations, then uses a vision-language model to estimate target beliefs and a long-horizon planner to optimize visitation sequences and minimize backtracking.
Key results
- Novel Semantic Skeleton Memory Graph unifying entity and spatial semantics
- Long-horizon planner balancing VLM belief and traversal costs
- State-of-the-art performance on GOAT-Bench lifelong navigation benchmarks
- Substantial gains in success rates and path efficiency over zero-shot baselines
Why it matters
Enables service robots to navigate more reliably and efficiently across diverse, unseen environments by leveraging reusable spatial memory and multimodal reasoning.
Abstract
Navigating to out-of-sight targets from human instructions in unfamiliar environments is a core capability for service robots. Despite substantial progress, most approaches underutilize reusable, persistent memory, constraining performance in lifelong settings. Many are additionally limited to single-modality inputs and employ myopic greedy policies, which often induce inefficient back-and-forth maneuvers (BFMs). To address such limitations, we introduce SSMG-Nav, a framework for object navigation built on a Semantic Skeleton Memory Graph (SSMG) that consolidates past observations into a spatially aligned, persistent memory anchored by topological keypoints (e.g., junctions, room centers). SSMG clusters nearby entities into subgraphs, unifying entity- and space-level semantics to yield a compact set of candidate destinations. To support multimodal targets (images, objects, and text), we integrate a vision-language model (VLM). For each subgraph, a multimodal prompt synthesized from memory guides the VLM to infer a target belief over destinations. A long-horizon planner then trades off this belief against traversability costs to produce a visit sequence that minimizes expected path length, thereby reducing backtracking. Extensive experiments on challenging lifelong benchmarks and standard ObjectNav benchmarks demonstrate that, compared to strong baselines, our method achieves higher success rates and greater path efficiency, validating the effectiveness of SSMG-Nav.
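To make the belief-versus-cost trade-off concrete, the following is a minimal Python sketch and not the paper's planner. It assumes a handful of candidate destinations, a normalized belief per destination, and known pairwise travel distances, and it brute-forces the visit order with the lowest expected path length. The function names, the 1-D corridor layout, and the belief values are all hypothetical and chosen only for illustration.

```python
from itertools import permutations


def expected_path_length(order, belief, dist, start="start"):
    """Expected distance travelled until the target is found.

    Assumes the target sits at exactly one candidate, with probability
    proportional to its belief, and that the search stops on arrival there.
    """
    total_belief = sum(belief.values())
    expected, travelled, prev = 0.0, 0.0, start
    for node in order:
        travelled += dist[prev][node]  # cumulative distance to reach node
        expected += (belief[node] / total_belief) * travelled
        prev = node
    return expected


def plan_visit_order(belief, dist, start="start"):
    """Brute-force the order minimizing expected path length.

    Only feasible for a small candidate set; a real planner would need
    something more scalable.
    """
    return min(
        permutations(belief),
        key=lambda order: expected_path_length(order, belief, dist, start),
    )


# Hypothetical 1-D corridor: the robot starts at 0, candidates lie on both sides.
positions = {"start": 0.0, "A": 2.0, "B": -1.0, "C": 3.0}
dist = {
    u: {v: abs(pu - pv) for v, pv in positions.items()}
    for u, pu in positions.items()
}

# Hypothetical beliefs, standing in for VLM-estimated target beliefs.
belief = {"A": 0.40, "B": 0.35, "C": 0.25}

greedy_order = tuple(sorted(belief, key=belief.get, reverse=True))
best_order = plan_visit_order(belief, dist)

print("greedy by belief:", greedy_order,
      expected_path_length(greedy_order, belief, dist))
print("long-horizon:    ", best_order,
      expected_path_length(best_order, belief, dist))
```

In this toy layout, visiting purely by descending belief gives the order A, B, C, which crosses the corridor twice. The exhaustive search instead picks B, A, C: it clears the nearby, moderately likely candidate first and then sweeps the other side without doubling back, which is the kind of backtracking reduction the abstract describes.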