Semi-SMD: Semi-Supervised Metric Depth Estimation Via Surrounding Cameras for Autonomous Driving
Yusen Xie, Zhenmin Huang, Shaojie Shen, Jun Ma
AI summary
Problem
Existing visual-only depth estimation methods suffer from scale ambiguity, high computational costs, and poor semantic boundary clarity, limiting their reliability for precise autonomous driving perception.
Approach
The framework unifies visual, temporal, and semantic features through a lightweight transformer, jointly estimates camera pose using extrinsic parameters, and applies a curvature loss guided by a pre-trained depth world model to refine metric depth.
Key results
- Unified spatial-temporal-semantic transformer reduces computation while boosting accuracy
- Joint pose estimation network integrates depth and extrinsic parameters for improved interpretability and precision
- Curvature loss from a depth world model accelerates convergence and sharpens depth boundaries
- State-of-the-art performance on DDAD and nuScenes datasets for surrounding-camera depth estimation
Why it matters
Provides a scalable, vision-only solution for precise 3D environmental perception, critical for safe autonomous navigation and motion planning.
Abstract
In this paper, we introduce Semi-SMD, a novel metric depth estimation framework tailored for surrounding-camera setups in autonomous driving. The input consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct fused visual features. Cross-attention components for surrounding cameras and adjacent frames focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework that uses the surrounding cameras, their corresponding estimated depths, and extrinsic parameters, and that effectively addresses the scale ambiguity in multi-camera setups. Moreover, a semantic world model and a monocular depth estimation world model are integrated to supervise depth estimation, which improves its quality. We evaluate our algorithm on the DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in surrounding-camera-based depth estimation quality. The source code is available on GitHub.
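
The abstract states that pose estimation uses the extrinsic parameters of the surrounding cameras to address scale ambiguity, but it does not spell out the mechanism. One common way extrinsics enter such a pipeline, shown here only as an illustrative sketch and not as the authors' implementation, is to express a single ego-vehicle motion in each camera's frame by conjugating it with that camera's extrinsic matrix. The function names, the yaw-only rotation, and the numeric values below are all assumptions made for the example.

```python
import numpy as np

def make_transform(yaw_rad, translation):
    """Build a 4x4 rigid transform from a yaw angle (about z) and a translation.

    Hypothetical helper for this sketch; real extrinsics carry a full 3D rotation.
    """
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def camera_motion_from_ego(ego_motion, cam_to_ego):
    """Express an ego-frame motion in one camera's frame.

    A point p in the camera frame is mapped to the ego frame (cam_to_ego),
    moved by the ego motion, then mapped back: inv(E) @ T @ E.
    Assumes the camera is rigidly mounted, so E is the same at both times.
    """
    return np.linalg.inv(cam_to_ego) @ ego_motion @ cam_to_ego

# Assumed example: the vehicle moves 1 m forward along its x-axis.
ego_motion = make_transform(0.0, [1.0, 0.0, 0.0])

# Assumed extrinsics: a rear camera rotated 180 degrees and mounted behind the origin.
rear_cam_to_ego = make_transform(np.pi, [-1.5, 0.0, 1.6])

rear_motion = camera_motion_from_ego(ego_motion, rear_cam_to_ego)
print(np.round(rear_motion[:3, 3], 6))
```

In this example the rear camera sees the same 1 m displacement, only with the opposite sign along its own x-axis. Because the extrinsics are given in metric units, every camera's motion derived this way shares one metric scale, which is the general reason extrinsics help with scale ambiguity in multi-camera rigs.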