← Back ICRA 2026

Zero-Shot Metric Depth Estimation Via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation

Steven Yang, Xiaoyu Tian, Kshitij Goel, Wennie Tabib

PDF

AI summary

Key figure (auto-extracted from paper)

Monotonic spline rescaling of monocular depth using visual-inertial features enables accurate, real-time metric depth estimation for collision avoidance on constrained quadrotors without fine-tuning.

Monocular depth estimation Visual-inertial navigation Zero-shot learning Autonomous aerial navigation Metric depth rescaling SWaP-constrained robotics

Problem

Autonomous aerial systems typically rely on heavy sensors or require domain-specific fine-tuning to achieve metrically accurate depth, which is impractical for SWaP-constrained robots operating in unknown environments.

Approach

The method rescales relative depth predictions from a monocular neural network using a sparse 3D feature map generated by a visual-inertial navigation system, evaluating multiple mathematical fitting strategies for accuracy and efficiency.

Key results

Monotonic spline rescaling achieves highest accuracy across confined and open simulated environments
On-board deployment yields metric depth at 15 Hz with successful real-world collision avoidance
TensorRT optimization increases inference speed from 5 to 20 FPS with negligible accuracy loss
Open-source release of the rescaling pipeline

Why it matters

Enables reliable, low-cost autonomous navigation in cluttered environments for resource-constrained aerial robots without requiring heavy sensors or environment-specific training.

Abstract

This paper presents a methodology to predict metric depth from monocular RGB images and an inertial measurement unit (IMU). To enable collision avoidance during autonomous flight, prior works either leverage heavy sensors (e.g., LiDARs or stereo cameras) or data-intensive and domain- specific fine-tuning of monocular metric depth estimation methods. In contrast, we propose several lightweight zero- shot rescaling strategies to obtain metric depth from relative depth estimates via the sparse 3D feature map created using a visual-inertial navigation system. These strategies are compared for their accuracy in diverse simulation environments. The best performing approach, which leverages monotonic spline fitting, is deployed in the real-world on a compute-constrained quadrotor. We obtain on-board metric depth estimates at 15 Hz and demonstrate successful collision avoidance after integrating the proposed method with a motion primitives-based planner.

Index terms

Field Robots Aerial Systems: Applications