ICRA 2026

WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments

Joshua Knights, Joseph Reid, Kaushik Roy, David Hall, Mark Cox, Peyman Moghadam

AI summary

State-of-the-art place recognition and depth estimation models struggle in unstructured natural environments, highlighting the need for robust cross-modal benchmarks and visibility-aware annotation pipelines.

place recognition, metric depth estimation, cross-modal benchmark, natural environments, autonomous robotics, semi-dense depth

Problem

Existing robotics datasets are predominantly captured in structured urban or indoor settings, leaving a critical gap for evaluating autonomous perception and navigation in complex, unstructured natural environments.

Approach

The authors extend the Wild-Places dataset by regenerating traversals with accurate 6DoF camera poses, synchronizing them with dense lidar submaps, and introducing a visibility-aware pipeline to generate semi-dense metric depth and surface normal annotations for over 476K RGB frames.
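Why visibility matters when turning lidar points into depth labels can be illustrated with a small sketch. The paper's pipeline applies hidden point removal to accumulated point clouds; the code below substitutes a much simpler per-pixel z-buffer, so it is an illustration under stated assumptions and not the authors' implementation. The function name `render_depth`, the pinhole intrinsics, and the toy points are all hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward


def render_depth(points_cam: List[Point], fx: float, fy: float,
                 cx: float, cy: float, width: int, height: int
                 ) -> Dict[Tuple[int, int], float]:
    """Project camera-frame points with a pinhole model.

    Keeps only the nearest depth per pixel (a z-buffer), so points hidden
    behind closer geometry do not leak into the depth map. This is a
    simplification, not the hidden point removal step named in the paper.
    """
    depth: Dict[Tuple[int, int], float] = {}
    for x, y, z in points_cam:
        if z <= 0.0:  # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        if (u, v) not in depth or z < depth[(u, v)]:
            depth[(u, v)] = z
    return depth


# Toy accumulated cloud (hypothetical values): a near surface at 2 m
# occludes a far surface at 10 m along the same viewing ray.
points = [
    (0.0, 0.0, 2.0),    # near surface, visible
    (0.0, 0.0, 10.0),   # far surface on the same ray, occluded
    (1.0, 0.0, 5.0),    # visible, different pixel
    (0.0, 0.0, -1.0),   # behind the camera
    (10.0, 0.0, 1.0),   # outside the field of view
]
depth_map = render_depth(points, fx=100.0, fy=100.0, cx=50.0, cy=50.0,
                         width=100, height=100)
```

The result is semi-dense: only pixels hit by at least one visible point receive a depth value. Without the nearest-depth check, the 10 m point would overwrite or compete with the 2 m point at the same pixel, which is the kind of occlusion artifact a visibility-aware pipeline is meant to avoid.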

Key results

Release of WildCross dataset with 476K+ RGB frames, semi-dense depth, and accurate 6DoF poses across eight natural traversals.
Novel visibility-aware annotation pipeline using accumulated point clouds and hidden point removal to eliminate depth occlusion artifacts.
Benchmarking shows leading place recognition and depth models degrade sharply in natural environments, even after fine-tuning.
Introduction of a four-fold evaluation protocol to robustly test visual place recognition (VPR) and cross-modal place recognition (CMPR) generalization without data leakage.
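The idea behind a leakage-free four-fold protocol can be sketched as follows. This summary does not state how the paper defines its folds, so the grouping here is an assumption: data is partitioned into hypothetical named groups, each group is held out for testing exactly once, and no group appears in both the training and test side of a split. The helper `make_folds` and the `group_XX` names are illustrative only.

```python
from typing import List, Tuple


def make_folds(groups: List[str], n_folds: int = 4
               ) -> List[Tuple[List[str], List[str]]]:
    """Return (train, test) splits in which every group is tested exactly once."""
    # Round-robin assignment of groups to folds (an assumed scheme).
    folds = [groups[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [g for j, fold in enumerate(folds) if j != k for g in fold]
        splits.append((train, test))
    return splits


# Hypothetical group identifiers; the real benchmark may split differently.
groups = [f"group_{i:02d}" for i in range(1, 9)]
splits = make_folds(groups)

for train, test in splits:
    # No group may appear on both sides of a split (no data leakage).
    assert not set(train) & set(test)
```

Reporting the mean over the four held-out folds then measures generalization to data the model has never seen during training or fine-tuning.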

Why it matters

It equips roboticists and computer vision researchers with a challenging, real-world benchmark to develop and validate perception systems for autonomous robots operating in forests, agriculture, and other unstructured natural settings.

Abstract

Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.

Index terms

Data Sets for Robotic Vision; Deep Learning for Visual Perception; RGB-D Perception