Zero-Shot Metric Depth Estimation Via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation
Steven Yang, Xiaoyu Tian, Kshitij Goel, Wennie Tabib
AI summary
Problem
Autonomous aerial systems typically rely on heavy sensors or require domain-specific fine-tuning to achieve metrically accurate depth, which is impractical for SWaP-constrained robots operating in unknown environments.
Approach
The method rescales relative depth predictions from a monocular neural network using a sparse 3D feature map generated by a visual-inertial navigation system, evaluating multiple mathematical fitting strategies for accuracy and efficiency.
Key results
- Monotonic spline rescaling achieves highest accuracy across confined and open simulated environments
- On-board deployment yields metric depth at 15 Hz with successful real-world collision avoidance
- TensorRT optimization increases inference speed from 5 to 20 FPS with negligible accuracy loss
- Open-source release of the rescaling pipeline
Why it matters
Enables reliable, low-cost autonomous navigation in cluttered environments for resource-constrained aerial robots without requiring heavy sensors or environment-specific training.
Abstract
This paper presents a methodology to predict metric depth from monocular RGB images and an inertial measurement unit (IMU). To enable collision avoidance during autonomous flight, prior works either leverage heavy sensors (e.g., LiDARs or stereo cameras) or data-intensive and domain- specific fine-tuning of monocular metric depth estimation methods. In contrast, we propose several lightweight zero- shot rescaling strategies to obtain metric depth from relative depth estimates via the sparse 3D feature map created using a visual-inertial navigation system. These strategies are compared for their accuracy in diverse simulation environments. The best performing approach, which leverages monotonic spline fitting, is deployed in the real-world on a compute-constrained quadrotor. We obtain on-board metric depth estimates at 15 Hz and demonstrate successful collision avoidance after integrating the proposed method with a motion primitives-based planner.