Multi-Modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments
Laura Alejandra Encinar Gonzalez, John Folkesson, Rudolph Triebel, Riccardo Giubilato
AI summary
Problem
Visual place recognition fails in unstructured, feature-sparse terrains due to weak textures and aliasing, while existing multi-modal pipelines typically output only similarity scores without the explicit 6-DoF pose constraints required for direct SLAM integration.
Approach
The pipeline uses a two-stage DINOv2-based visual retrieval strategy for efficient candidate screening, followed by SONATA-based LiDAR descriptors to compute explicit 6-DoF relative poses through RANSAC geometric verification.
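The two-stage screening idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each frame already has one global descriptor (for example, an aggregated DINOv2 embedding) and a set of patch-level local descriptors. The function names, the `top_k` parameter, and the choice of mutual nearest-neighbour counting as the re-ranking score are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def coarse_retrieve(query_global, db_global, top_k=5):
    """Stage 1: rank database frames by cosine similarity of global descriptors."""
    sims = l2_normalize(db_global) @ l2_normalize(query_global)
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

def mutual_nn_score(query_local, cand_local):
    """Stage 2 score (assumed): count mutual nearest-neighbour local matches."""
    sim = l2_normalize(query_local) @ l2_normalize(cand_local).T
    q_to_c = sim.argmax(axis=1)  # best candidate patch for each query patch
    c_to_q = sim.argmax(axis=0)  # best query patch for each candidate patch
    mutual = c_to_q[q_to_c] == np.arange(sim.shape[0])
    return int(mutual.sum())

def two_stage_retrieve(query_global, query_local, db_global, db_local, top_k=5):
    """Screen with global descriptors, then re-rank the shortlist with local ones."""
    candidates, _ = coarse_retrieve(query_global, db_global, top_k)
    scores = [mutual_nn_score(query_local, db_local[i]) for i in candidates]
    best = int(candidates[int(np.argmax(scores))])
    return best, candidates
```

The inexpensive global comparison limits the costlier local matching to `top_k` candidates, which is how a two-stage design keeps per-query runtime bounded.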
Key results
- Achieves 75.7% Precision@1 on S3LI Etna and 78.3% on Vulcano sequences
- Maintains end-to-end retrieval runtime under 500 ms per query
- Delivers reliable 6-DoF pose estimates with over 69% of yaw predictions within 10° of ground truth
- Outperforms uni-modal and retrieval-only baselines in the trade-off between accuracy and efficiency
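For readers who want to reproduce metrics of this kind, the sketch below shows one way Precision@1 and the share of yaw predictions within a threshold could be computed. It is illustrative only: the summary does not state how a correct retrieval is defined, so the distance-based criterion, the `radius_m` parameter, and all function names are assumptions.

```python
import numpy as np

def precision_at_1(top1_positions, query_positions, radius_m):
    """Fraction of queries whose top-1 match lies within radius_m (assumed criterion)."""
    diff = np.asarray(top1_positions, float) - np.asarray(query_positions, float)
    dist = np.linalg.norm(diff, axis=1)
    return float(np.mean(dist <= radius_m))

def yaw_error_deg(yaw_pred_deg, yaw_gt_deg):
    """Absolute yaw error in degrees, wrapped so that 350 vs 5 gives 15, not 345."""
    diff = np.asarray(yaw_pred_deg, float) - np.asarray(yaw_gt_deg, float)
    return np.abs((diff + 180.0) % 360.0 - 180.0)

def fraction_within(errors, threshold):
    """Share of errors at or below the threshold, e.g. 10 degrees of yaw."""
    return float(np.mean(np.asarray(errors) <= threshold))
```

Wrapping the angular difference matters because headings near 0° and 360° are physically close; without it, small errors across the wrap-around would be counted as failures.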
Why it matters
Provides a reliable, interpretable loop closure solution for autonomous planetary rovers and GNSS-denied SLAM systems navigating severely unstructured terrains.
Abstract
Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences compatible with SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF.
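To make the geometric verification step concrete, the following sketch estimates a 6-DoF rigid transform from putative 3D point correspondences using RANSAC with a Kabsch (SVD) solver. It is a generic textbook formulation under stated assumptions, not the MPRF code: correspondences are taken as given (in the paper they would come from matched LiDAR descriptors), and the function names, iteration count, and inlier threshold are hypothetical.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ R @ src + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def ransac_rigid(src, dst, iters=200, inlier_thresh=0.2, seed=0):
    """RANSAC over 3-point samples; returns (R, t, inlier_mask) or None."""
    rng = np.random.default_rng(seed)
    n = len(src)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(iters):
        idx = rng.choice(n, size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        return None  # too few consistent matches: reject the loop candidate
    # Refit on all inliers of the best hypothesis for the final estimate.
    R, t = kabsch(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers
```

The inlier mask doubles as a verification signal: a candidate with few geometrically consistent matches can be rejected, while an accepted candidate yields a relative pose that a SLAM back-end can use directly as a constraint.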