ICRA 2026

FR-CDNet: Unified Scene Change Detection Model across Viewpoint Variations and Different Temporal Conditions

Yilin Peng, Yingchun Fu, Xiangru Li, Zhenhao Li, Shuqi Chen, Shunping Ji

AI summary

FR-CDNet achieves robust scene change detection across varying viewpoints and reversed temporal orders without extra labels, significantly outperforming existing methods on unaligned scenes while matching state-of-the-art on aligned ones.

Scene Change Detection, Viewpoint Variation, Temporal Invariance, Cross Attention, Change Disentanglement, URSCD Dataset

Problem

Existing scene change detection methods rely on ideal image alignment and consistent temporal conditions, causing severe performance degradation when faced with real-world viewpoint variations and reversed image orders.

Approach

The method employs a bidirectional Cross Fusion architecture paired with Spatial Prior-guided Cross Attention to align features and preserve spatial information, enabling mutual comparison of image pairs regardless of input order or viewpoint differences.
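The idea of comparing an image pair in both directions can be illustrated with a toy sketch. This is not the FR-CDNet architecture: it leaves out the learned projections and the spatial prior guidance entirely, and the names `cross_attend` and `bidirectional_change` are hypothetical. It only shows, under those simplifying assumptions, how plain dot-product cross attention lets each image query the other without pixel alignment, and why a symmetric two-branch design yields outputs that simply swap when the input order is reversed.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys):
    """For each query feature, return the attention-weighted average of `keys`.

    Assumption: keys double as values and there are no learned projections.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    aligned = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        aligned.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return aligned


def bidirectional_change(feats_a, feats_b):
    """Return (change_a, change_b): one change score per location of each image.

    Each branch aligns the other image's features to its own locations and
    takes an L1 difference, so the two images may have different numbers of
    locations (no pixel-wise alignment is assumed).
    """
    b_to_a = cross_attend(feats_a, feats_b)  # B's features gathered at A's locations
    a_to_b = cross_attend(feats_b, feats_a)  # A's features gathered at B's locations
    change_a = [
        sum(abs(x - y) for x, y in zip(fa, fb)) for fa, fb in zip(feats_a, b_to_a)
    ]
    change_b = [
        sum(abs(x - y) for x, y in zip(fb, fa)) for fb, fa in zip(feats_b, a_to_b)
    ]
    return change_a, change_b


# Toy features: image A has 3 locations, image B has 2, each with 2 channels.
feats_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats_b = [[0.0, 1.0], [1.0, 0.0]]

change_a, change_b = bidirectional_change(feats_a, feats_b)
swapped_b, swapped_a = bidirectional_change(feats_b, feats_a)
print(change_a == swapped_a and change_b == swapped_b)
```

Because both branches run the same computation with the roles exchanged, reversing the input order only exchanges the two outputs. Keeping one output per image is also what would make it possible to say which image a change belongs to.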

Key results

Significant F1-score gains on unaligned scenes with large viewpoint differences
Maintains state-of-the-art performance on aligned scenes
Enables weakly-supervised change disentanglement without additional labels
Framework seamlessly transfers to mainstream baselines to boost performance without extra parameters
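The F1-score cited in these results is the harmonic mean of precision and recall. A minimal sketch for binary change masks follows, assuming masks are nested lists of 0/1 values and that an empty prediction against an empty target scores 1.0. Both choices are assumptions made for illustration; the paper's exact evaluation protocol is not stated on this page, and `f1_score` here is a hypothetical helper, not the authors' code.

```python
def f1_score(pred, target):
    """F1 = 2*TP / (2*TP + FP + FN) over two same-shaped binary masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # changed pixel correctly detected
            elif p and not t:
                fp += 1  # false alarm
            elif t and not p:
                fn += 1  # missed change
    denom = 2 * tp + fp + fn
    # Assumption: no predicted and no true change counts as a perfect score.
    return 2 * tp / denom if denom else 1.0


pred = [[1, 1, 0],
        [0, 0, 0]]
target = [[1, 0, 0],
          [0, 1, 0]]
print(f1_score(pred, target))  # TP=1, FP=1, FN=1 -> 0.5
```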

Why it matters

It enables reliable, label-efficient urban monitoring and autonomous navigation in real-world environments where camera viewpoints and capture times are unpredictable.

Abstract

Scene Change Detection (SCD) is a critical task for building smart cities, yet its practical application faces dual challenges: existing methods typically rely on temporal conditions present in the training data and the ideal assumption of small viewpoint differences. Consequently, they struggle to handle the common and significant viewpoint variations in real-world scenarios and exhibit strong sensitivity to temporal conditions, leading to drastic performance degradation under unseen temporal settings. To address these challenges, we propose the Fusion-Refinement Change Detection Network (FR-CDNet). By modeling correspondences between objects and preserving spatial prior information from ideally aligned scenes during the disentangled processing of different temporal directions, our network achieves a unified handling of varying degrees of viewpoint variations and different temporal conditions, a capability existing methods lack. Furthermore, FR-CDNet can automatically distinguish the temporal attribution of change entities to better support downstream tasks. To better evaluate performance in real-world settings, we further construct the URSCD dataset, which includes larger viewpoint differences and more diverse change scenarios. Extensive experiments demonstrate the universal scene detection capability of our method: it achieves significant improvement in F1-score on unaligned scenes while maintaining performance comparable to SOTA on aligned scenes. Ablation studies further demonstrate that the proposed framework can be migrated to enhance various mainstream models, effectively eliminating temporal condition dependency while improving overall performance.

Index terms

Deep Learning for Visual Perception, Computer Vision for Automation, Computer Vision for Transportation