DASP: Self-Supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors
Yiheng Huang, Junhong Chen, Anqi Ning, Zhanhong Liang, Nick Michiels, Luc Claesen, Wenyin Liu
AI summary
Problem
Self-supervised monocular depth estimation degrades sharply at night because of low visibility and varying illumination: insufficient light leaves textureless regions, and moving objects produce blurry regions. Existing methods lack robust mechanisms to transfer daytime structural cues and maintain temporal consistency in dark, dynamic scenes.
Approach
The authors introduce DASP, a self-supervised framework with two branches. The adversarial branch uses a discriminator built from four spatiotemporal priors learning blocks (SPLB) to extract daytime priors and adapt them to nighttime depth maps. The self-supervised branch adds a 3D consistency projection loss that enforces 3D structural consistency between the target and source frames.
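How a discriminator can push nighttime depth maps toward daytime priors can be illustrated with a generic adversarial objective. The sketch below is a minimal NumPy example that assumes the standard binary cross-entropy GAN losses; the exact adversarial loss used by DASP is not stated here, and the function names `bce`, `discriminator_loss`, and `generator_loss` are hypothetical.

```python
import numpy as np

def bce(prob, label):
    """Binary cross-entropy between predicted probabilities and a constant label."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-(label * np.log(prob) + (1 - label) * np.log(1 - prob)).mean())

def discriminator_loss(p_day, p_night):
    """Assumed objective: score daytime depth maps as real (1), nighttime ones as fake (0)."""
    return bce(p_day, 1.0) + bce(p_night, 0.0)

def generator_loss(p_night):
    """Assumed objective: the depth network is rewarded when its nighttime
    depth maps are scored as daytime-like by the discriminator."""
    return bce(p_night, 1.0)
```

Here `p_day` and `p_night` stand for the discriminator's output probabilities on daytime and nighttime depth maps. Minimizing `generator_loss` is what would carry daytime structure into the nighttime predictions under this assumed setup.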
Key results
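- Novel spatiotemporal priors learning block (SPLB) that combines a temporal module (STLM) for motion-related variations with a spatial module (ASLM) for multiscale structural features
- 3D consistency projection loss that projects the target and source frames bilaterally into a shared 3D space and penalizes their 3D discrepancy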
- State-of-the-art depth estimation accuracy on Oxford RobotCar and nuScenes nighttime benchmarks
- Effective recovery of textureless regions and robust handling of motion blur in dynamic scenes
Why it matters
Provides a label-free, robust depth perception solution critical for safe autonomous navigation and robotics in challenging nighttime conditions.
Abstract
Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.
Manuscript received: June 17, 2025; revised September 29, 2025; accepted December 4, 2025. This paper was recommended for publication by Editor M. Vincze upon evaluation of the Associate Editor and Reviewers' comments.
The work of Chen Junhong was supported by China Scholarship Council under Grant 202208440309. This work was supported in part by the National Natural Science Foundation of China under Grant 91748107, in part by the Special Research Fund (BOF) of Hasselt University under Grant BOF23DOCBL11, and in part by the Guangdong Innovative Research Team Program under Grant 2014ZT05G157. (Corresponding author: Junhong Chen and Wenyin Liu.)
Yiheng Huang, Zhanhong Liang, and Wenyin Liu are with the College of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China (e-mail: huangyiheng.gdut@gmail.com; cw252128385@gmail.com; liuwy@gdut.edu.cn).
Junhong Chen is with the College of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China, and also with the Digital Future Lab, Flanders Make, Hasselt University, 3590 Diepenbeek, Belgium (e-mail: CSChenjunhong@hotmail.com).
Anqi Ning is with the College of Engineering, Shantou University, Shantou 515063, China (e-mail: ninganqi.stu@gmail.com).
Nick Michiels is with the Digital Future Lab, Flanders Make, Hasselt University, 3590 Diepenbeek, Belgium.
Luc Claesen is with Hasselt University, 3530 Diepenbeek, Belgium.
Fig. 1. The first row shows a set of consecutive image frames from the RobotCar dataset. The next three rows show the depth maps predicted by RNW [9], STEPS [10], and our method. The green boxes mark a tree, while the red boxes indicate a moving vehicle. From the figures, we can observe that our method effectively captures spatial structure and maintains consistency in dynamic scenes.
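The 3D consistency projection loss described in the abstract (project the target and source frames into a shared 3D space, then penalize their 3D discrepancy) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' formulation: it assumes a pinhole intrinsic matrix `K`, a known relative pose (`R_ts`, `t_ts`) from target to source camera, nearest-neighbor sampling, and a mean L1 distance. All function names are hypothetical.

```python
import numpy as np

def backproject(depth, K):
    """Lift an (H, W) depth map to an (H, W, 3) point map in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))            # pixel grids, each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                           # K^-1 [u, v, 1]^T per pixel
    return rays * depth[..., None]

def project(points, K):
    """Project (H, W, 3) camera-frame points to (H, W, 2) pixel coordinates."""
    z = np.clip(points[..., 2:3], 1e-6, None)                 # guard against division by zero
    return ((points / z) @ K.T)[..., :2]

def one_way(depth_a, depth_b, K, R_ab, t_ab):
    """Mean L1 distance between frame A's points, moved into frame B,
    and frame B's own points sampled at the reprojected pixels."""
    pts_a_in_b = backproject(depth_a, K) @ R_ab.T + t_ab      # A's points in B's frame
    pts_b = backproject(depth_b, K)
    uv = np.rint(project(pts_a_in_b, K)).astype(int)          # nearest-neighbor sampling (assumption)
    h, w = depth_b.shape
    valid = ((uv[..., 0] >= 0) & (uv[..., 0] < w) &
             (uv[..., 1] >= 0) & (uv[..., 1] < h) &
             (pts_a_in_b[..., 2] > 0))                        # in view and in front of the camera
    if not valid.any():
        return 0.0
    sampled = pts_b[uv[..., 1][valid], uv[..., 0][valid]]     # (N, 3) matching points in B
    return float(np.abs(pts_a_in_b[valid] - sampled).mean())

def consistency_3d(depth_t, depth_s, K, R_ts, t_ts):
    """Average the target-to-source and source-to-target 3D discrepancies."""
    forward = one_way(depth_t, depth_s, K, R_ts, t_ts)
    # Inverse rigid motion: p_t = R^T p_s - R^T t
    backward = one_way(depth_s, depth_t, K, R_ts.T, -R_ts.T @ t_ts)
    return 0.5 * (forward + backward)
```

With identical depth maps and an identity pose, the discrepancy is zero, and it grows once the two depth maps disagree about the scene geometry. A training implementation would typically use differentiable bilinear sampling in place of the nearest-neighbor lookup assumed here.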