ICRA 2026

Semi-SMD: Semi-Supervised Metric Depth Estimation Via Surrounding Cameras for Autonomous Driving

Yusen Xie, Zhenmin Huang, Shaojie Shen, Jun Ma

AI summary

Semi-SMD achieves state-of-the-art metric depth estimation for autonomous driving by fusing spatial-temporal-semantic features and leveraging extrinsic camera parameters to resolve scale ambiguity.

Keywords: metric depth estimation, autonomous driving, surrounding cameras, semi-supervised learning, spatial-temporal fusion, curvature loss

Problem

Existing visual-only depth estimation methods suffer from scale ambiguity, high computational costs, and poor semantic boundary clarity, limiting their reliability for precise autonomous driving perception.

Approach

The framework unifies visual, temporal, and semantic features through a lightweight transformer, jointly estimates camera pose using extrinsic parameters, and applies a curvature loss guided by a pre-trained depth world model to refine metric depth.
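To illustrate why known extrinsics help with scale, here is a minimal sketch of a standard rigid-body relation, not the authors' network. It assumes each extrinsic matrix maps camera coordinates to a shared vehicle (ego) frame and that one ego-motion estimate is shared across all cameras. The helper names `se3` and `camera_motion_from_ego` are hypothetical.

```python
import numpy as np

def se3(yaw_rad, translation):
    """Build a 4x4 rigid transform from a yaw angle (about z) and a translation in metres."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def camera_motion_from_ego(T_ego, E_cam_to_ego):
    """Express a shared ego motion in one camera's frame.

    Assumption: E_cam_to_ego maps points from the camera frame to the ego frame.
    A camera-frame point is lifted to the ego frame, moved, and mapped back.
    """
    return np.linalg.inv(E_cam_to_ego) @ T_ego @ E_cam_to_ego

# Illustrative values only: a side camera yawed 90 degrees and offset from the ego origin.
E_side = se3(np.pi / 2, [1.0, 0.5, 1.6])
T_ego = se3(0.0, [1.5, 0.0, 0.0])  # vehicle moves 1.5 m forward, no rotation

T_side = camera_motion_from_ego(T_ego, E_side)
print(np.linalg.norm(T_side[:3, 3]))  # translation magnitude seen by the side camera
```

Because the extrinsics are expressed in metres, every per-camera pose derived this way carries the same metric scale, which is the intuition behind using them to resolve scale ambiguity.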

Key results

Unified spatial-temporal-semantic transformer reduces computation while boosting accuracy
Joint pose estimation network integrates depth and extrinsic parameters for improved interpretability and precision
Curvature loss from a depth world model accelerates convergence and sharpens depth boundaries
State-of-the-art performance on DDAD and nuScenes datasets for surrounding-camera depth estimation
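This summary does not define the curvature loss, so the following is only one plausible reading, written as a sketch under stated assumptions. It assumes a discrete Laplacian as the curvature proxy, mean normalization to compare against a teacher that may predict relative depth, and an L1 penalty. The function names are hypothetical.

```python
import numpy as np

def laplacian(depth):
    """4-neighbour discrete Laplacian on interior pixels (assumed curvature proxy)."""
    return (depth[:-2, 1:-1] + depth[2:, 1:-1]
            + depth[1:-1, :-2] + depth[1:-1, 2:]
            - 4.0 * depth[1:-1, 1:-1])

def curvature_loss(pred, teacher):
    """Mean absolute difference between the curvature of two depth maps.

    Assumption: both maps are divided by their mean first, so a teacher
    that is correct only up to a global scale still gives zero loss.
    """
    pred_n = pred / pred.mean()
    teacher_n = teacher / teacher.mean()
    return float(np.abs(laplacian(pred_n) - laplacian(teacher_n)).mean())

# Illustrative data: a tilted plane as teacher, and a prediction with a local bump.
ys, xs = np.mgrid[0:8, 0:8].astype(float)
teacher = 5.0 + 0.2 * xs + 0.1 * ys
pred = teacher.copy()
pred[4, 4] += 1.0  # spurious depth spike

print(curvature_loss(pred, teacher) > 0.0)
```

A second-order term like this ignores constant offsets and planar slopes and reacts mainly to creases and spikes, which is consistent with the claim that the loss sharpens depth boundaries.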

Why it matters

Provides a scalable, vision-only solution for precise 3D environmental perception, critical for safe autonomous navigation and motion planning.

Abstract

We introduce Semi-SMD, a novel metric depth estimation framework tailored for surrounding-camera setups in autonomous driving. The input consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct fused visual features. Cross-attention components for surrounding cameras and adjacent frames focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework that uses the surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively addresses the scale ambiguity in multi-camera setups. Moreover, a semantic world model and a monocular depth estimation world model are integrated to supervise the depth estimation, which improves its quality. We evaluate our algorithm on the DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in surrounding-camera depth estimation quality. The source code is available on GitHub.
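As a rough illustration of cross-attention between views, the sketch below shows generic single-head scaled dot-product attention and is not the paper's fusion module. It assumes queries come from one camera's feature tokens and keys and values from a neighbouring camera or adjacent frame. Learned projections are omitted, and the names are hypothetical.

```python
import numpy as np

def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention without learned projections.

    queries: (Nq, d) tokens from the view being refined.
    context: (Nc, d) tokens from a neighbouring camera or adjacent frame.
    Returns the attended features (Nq, d) and the attention weights (Nq, Nc).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context, weights

# Illustrative shapes only: 6 tokens from the front camera attend to 10 tokens
# from an overlapping neighbouring camera, each with 16 channels.
rng = np.random.default_rng(0)
front_tokens = rng.normal(size=(6, 16))
neighbour_tokens = rng.normal(size=(10, 16))

fused, attn = cross_attention(front_tokens, neighbour_tokens)
print(fused.shape, attn.shape)
```

Each output token is a convex combination of the context tokens, so features in the overlap between cameras, or matching regions in adjacent frames, can be pulled into the view being refined.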

Index terms

Deep Learning for Visual Perception, Visual Learning, Visual Tracking