ICRA 2026

Scale-Invariant and View-Relational Representation Learning for Full Surround Monocular Depth

Kyumin Hwang, Wonhyeok Choi, Kiljoon Han, Wonjoon Choi, Minwoo Choi, Yongcheon Na, Minwoo Park, Sunghoon Im

AI summary

A novel knowledge distillation framework transfers scale-invariant and multi-view relational depth knowledge from foundation models to lightweight networks, enabling real-time, metric-scale full surround depth estimation.

monocular depth estimation, knowledge distillation, full surround cameras, autonomous driving, scale-invariant learning, multi-view consistency

Problem

Directly applying foundation models to full surround monocular depth estimation is hindered by prohibitive computational costs and an inability to consistently estimate metric-scale depth across multiple camera views.

Approach

The method uses cross-interaction and view-relational knowledge distillation to transfer scale-invariant depth bin probabilities and inter-camera structural relationships from a teacher foundation model to a lightweight student network.
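
The bin-based idea above can be illustrated for a single pixel. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the KL-divergence distillation term, the L1 supervised term, and the `kd_weight` parameter are all hypothetical choices made for illustration. The student predicts a probability over depth bins plus metric bin centers; the probabilities are pulled toward the teacher's (scale-invariant) distribution, while the expected depth is supervised by ground truth so that the centers carry metric scale.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of bin logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_depth(probs, centers):
    """Depth as the probability-weighted sum of bin centers."""
    return sum(p * c for p, c in zip(probs, centers))

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over bin probabilities (assumed KD term)."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

def pixel_loss(student_logits, student_centers, teacher_probs, gt_depth,
               kd_weight=1.0):
    """Hypothetical per-pixel loss: distill bin probabilities from the
    teacher, supervise the metric depth with ground truth (L1)."""
    student_probs = softmax(student_logits)
    depth = expected_depth(student_probs, student_centers)
    kd_term = kl_divergence(teacher_probs, student_probs)
    supervised_term = abs(depth - gt_depth)
    return depth, kd_weight * kd_term + supervised_term

# Example with three bins; all numbers are made up for illustration.
depth, loss = pixel_loss(
    student_logits=[0.0, 1.0, 0.0],
    student_centers=[5.0, 20.0, 60.0],   # metric bin centers in meters
    teacher_probs=[0.1, 0.7, 0.2],
    gt_depth=22.0,
)
```

Because the teacher only contributes a distribution over bins, it never needs to produce metric depth itself; in this sketch the metric scale enters solely through the student's bin centers and the ground-truth term.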

Key results

Average accuracy improvements of 5.88% on DDAD and 11.87% on nuScenes over supervised baselines
5.13–11.14× faster inference speed compared to the teacher foundation model
Superior performance over existing knowledge distillation methods in full surround settings
Successful real-time metric-scale depth estimation for autonomous driving

Why it matters

Provides a practical, efficient pathway for deploying robust depth perception in real-time autonomous driving systems without relying on expensive LiDAR hardware.

Abstract

Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme (traditionally used in classification) with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
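
One way to picture the view-relational idea is to compare how adjacent camera views relate to each other in the teacher versus the student. The sketch below is an assumption-laden illustration rather than the paper's formulation: it assumes one descriptor vector per camera view, a ring of cameras in which view i is adjacent to view i+1 (wrapping around), cosine similarity as the relation, and a mean squared error between teacher and student relations. All function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def adjacent_relations(view_features):
    """Relation between each view and its neighbor in a camera ring
    (assumed layout: view i is adjacent to view (i + 1) mod N)."""
    n = len(view_features)
    return [
        cosine_similarity(view_features[i], view_features[(i + 1) % n])
        for i in range(n)
    ]

def relation_distillation_loss(teacher_features, student_features):
    """Mean squared difference between teacher and student relations."""
    teacher_rel = adjacent_relations(teacher_features)
    student_rel = adjacent_relations(student_features)
    return sum(
        (t - s) ** 2 for t, s in zip(teacher_rel, student_rel)
    ) / len(teacher_rel)

# Toy example with three views; the values are invented for illustration.
teacher = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.4], [0.1, 0.9, 0.3]]
student = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
loss = relation_distillation_loss(teacher, student)
```

Since only scalar relations are matched, the teacher and student descriptors may have different dimensionality, which suits a large teacher paired with a lightweight student.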

Index terms

Computer Vision for Automation, Deep Learning for Visual Perception, Recognition