FR-CDNet: Unified Scene Change Detection Model across Viewpoint Variations and Different Temporal Conditions
Yilin Peng, Yingchun Fu, Xiangru Li, Zhenhao Li, Shuqi Chen, Shunping Ji
AI summary
Problem
Existing scene change detection methods rely on ideal image alignment and consistent temporal conditions, causing severe performance degradation when faced with real-world viewpoint variations and reversed image orders.
Approach
The method employs a bidirectional Cross Fusion architecture paired with Spatial Prior-guided Cross Attention to align features and preserve spatial information, enabling mutual comparison of image pairs regardless of input order or viewpoint differences.
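The order-independence described above can be illustrated with a minimal sketch. It is not the FR-CDNet implementation: it assumes each image is already encoded as a set of feature tokens, uses plain unlearned dot-product attention, and leaves out the spatial prior entirely. The function names (`cross_attend`, `directional_change`, `bidirectional_change`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_feats, key_feats):
    """For each query token, build a soft match from the other image's tokens.

    Assumption: plain scaled dot-product attention with no learned projections.
    """
    d = query_feats.shape[-1]
    weights = softmax(query_feats @ key_feats.T / np.sqrt(d), axis=-1)
    return weights @ key_feats


def directional_change(src, ref):
    """Per-token score for src: how poorly ref explains it (higher = more changed)."""
    aligned = cross_attend(src, ref)
    return np.linalg.norm(src - aligned, axis=-1)


def bidirectional_change(feats_a, feats_b):
    """Compare the pair in both directions with the same function.

    Returns one change map per image, so each change stays attributed
    to the image it appears in.
    """
    change_a = directional_change(feats_a, feats_b)
    change_b = directional_change(feats_b, feats_a)
    return change_a, change_b


# Toy pair: B holds A's tokens in a different order (a crude stand-in for a
# viewpoint shift), with one token replaced by new content.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(6, 8))
feats_b = feats_a[rng.permutation(6)].copy()
feats_b[0] = rng.normal(size=8)

change_a, change_b = bidirectional_change(feats_a, feats_b)
swapped_b, swapped_a = bidirectional_change(feats_b, feats_a)
```

Because both directions go through the same function, swapping the input order only swaps the two returned maps. In this sketch the matching is by feature similarity rather than pixel position, which is why a reordering of tokens does not by itself force a mismatch.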
Key results
- Significant F1-score gains on unaligned scenes with large viewpoint differences
- Maintains state-of-the-art performance on aligned scenes
- Enables weakly-supervised change disentanglement without additional labels
- Framework seamlessly transfers to mainstream baselines to boost performance without extra parameters
Why it matters
It enables reliable, label-efficient urban monitoring and autonomous navigation in real-world environments where camera viewpoints and capture times are unpredictable.
Abstract
Scene Change Detection (SCD) is a critical task for building smart cities, yet its practical application faces two challenges: existing methods typically depend on the temporal conditions present in the training data and assume small viewpoint differences. Consequently, they struggle with the large viewpoint variations common in real-world scenarios and are highly sensitive to temporal conditions, so performance degrades drastically under unseen temporal settings. To address these challenges, we propose the Fusion-Refinement Change Detection Network (FR-CDNet). By modeling correspondences between objects and preserving spatial prior information from ideally aligned scenes during the disentangled processing of different temporal directions, our network handles varying degrees of viewpoint variation and different temporal conditions in a unified way, a capability existing methods lack. Furthermore, FR-CDNet can automatically distinguish the temporal attribution of change entities to better support downstream tasks. To better evaluate performance in real-world settings, we further construct the URSCD dataset, which includes larger viewpoint differences and more diverse change scenarios. Extensive experiments demonstrate the universal scene change detection capability of our method: it achieves significant improvement in F1-score on unaligned scenes while maintaining performance comparable to SOTA on aligned scenes. Ablation studies further demonstrate that the proposed framework can be migrated to enhance various mainstream models, effectively eliminating temporal condition dependency while improving overall performance.
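For reference, the F1-score cited above is the harmonic mean of precision and recall on the "changed" class. The sketch below computes it for a pair of binary change masks. It assumes pixel-wise evaluation on a single image pair, and it returns 0.0 when neither mask contains a changed pixel; both are choices made for this illustration. The abstract does not specify the evaluation protocol used in the paper, and `f1_score` here is a hypothetical helper.

```python
import numpy as np


def f1_score(pred_mask, gt_mask):
    """Pixel-wise F1 for binary change masks (nonzero/True = changed)."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # changed pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   # false alarms
    fn = np.logical_and(~pred, gt).sum()   # missed changes
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN), equivalent to 2*P*R / (P + R).
    # Assumption: return 0.0 when there are no changed pixels in either mask.
    return float(2 * tp / denom) if denom else 0.0


pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [1, 0]]
score = f1_score(pred, gt)  # TP=1, FP=1, FN=1 -> 2 / 4 = 0.5
```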