ICRA 2026

TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection

Jiae Yoon, Ue-Hwan Kim

AI summary

TERDNet achieves state-of-the-art scene change detection by combining a transformer encoder with a recurrent 3-gate-GRU decoder for iterative refinement, producing more accurate and detailed change masks than prior methods.

Scene Change Detection, Transformer Encoder, Recurrent Decoder, 3-gate-GRU, Feature Fusion, Robotic Perception

Problem

Existing Scene Change Detection models overlook the varying importance of multi-level features, rely on single-step decoders that limit output refinement, and lack systematic analysis of encoder pretraining strategies.

Approach

TERDNet integrates a transformer-based encoder with a correlation-driven feature fusion module and a recurrent 3-gate-GRU decoder that iteratively refines change predictions by dynamically weighting layer importance.
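To make the idea of iterative refinement concrete, the following is a minimal sketch of a recurrent decoder loop. It assumes a standard two-gate GRU (update and reset gates) on a scalar hidden state, not the paper's 3-gate-GRU, whose gate definitions are not given in this summary. The names gru_step, refine, and the weight keys are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(h, x, w):
    """One standard GRU update for a scalar hidden state h and scalar input x.

    Assumption: this is the textbook two-gate GRU, used here as a stand-in
    for the paper's 3-gate variant.
    """
    z = sigmoid(w["wz"] * x + w["uz"] * h)               # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h)               # reset gate
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h))  # candidate state
    return (1.0 - z) * h + z * h_cand


def refine(fused_feature, weights, num_iters=4):
    """Apply the GRU repeatedly to the same fused feature.

    Returns the change probability read out after each iteration, so the
    prediction is revised step by step rather than produced in one pass.
    """
    h = 0.0
    probs = []
    for _ in range(num_iters):
        h = gru_step(h, fused_feature, weights)
        probs.append(sigmoid(h))
    return probs


# Hypothetical weights and input, for illustration only.
weights = {"wz": 1.0, "uz": 0.5, "wr": 1.0, "ur": 0.5, "wh": 2.0, "uh": 1.0}
probs = refine(fused_feature=1.5, weights=weights)
```

In a real decoder the hidden state would be a feature map and the gates would be convolutions, but the control flow is the same: each pass reuses the previous hidden state to update the estimate.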

Key results

State-of-the-art F1-score and mIoU across four public benchmarks
More precise change masks with sharper boundaries and complete regions
Ablation confirms benefits of segmentation-based pretraining and fusion design
Robust performance under viewpoint misalignment for robotic deployment
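
For readers unfamiliar with the two reported metrics, the sketch below computes F1-score and mIoU for a pair of binary change masks. It assumes masks are nested lists of 0/1 values and that mIoU averages the IoU of the "change" and "no change" classes; the paper's exact evaluation protocol may differ. The function names are hypothetical.

```python
def confusion(pred, target):
    """Count true/false positives and negatives over two binary masks."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def f1_score(pred, target):
    """F1 of the 'change' class: 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def mean_iou(pred, target):
    """Mean of the IoU for the 'change' and 'no change' classes."""
    tp, fp, fn, tn = confusion(pred, target)
    iou_change = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    iou_static = tn / (tn + fp + fn) if (tn + fp + fn) else 1.0
    return (iou_change + iou_static) / 2


# Toy 2x3 masks: 1 marks a changed pixel.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
f1 = f1_score(pred, target)      # 2 TP, 1 FP, 1 FN -> 4/6
miou = mean_iou(pred, target)    # both class IoUs are 2/4 -> 0.5
```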

Why it matters

Provides a reliable perception backbone for autonomous robots and vehicles navigating dynamic real-world environments.

Abstract

In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder–Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution–interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet’s potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is at https://github.com/AutoCompSysLab/TERDNet.
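
The abstract mentions correlation volumes without defining them. A common formulation, assumed here rather than taken from the paper, is the all-pairs cosine similarity between the feature vectors of the two images. The sketch below uses flattened lists of per-location feature vectors; correlation_volume and normalize are hypothetical names.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def correlation_volume(feat_a, feat_b):
    """All-pairs cosine similarity between two sets of feature vectors.

    feat_a, feat_b: lists of per-location feature vectors (a flattened
    H*W grid). Entry [i][j] compares location i in the first image with
    location j in the second.
    """
    a = [normalize(v) for v in feat_a]
    b = [normalize(v) for v in feat_b]
    return [[sum(x * y for x, y in zip(va, vb)) for vb in b] for va in a]


# Toy features: two locations per image, two channels each.
feat_t0 = [[1.0, 0.0], [0.0, 1.0]]
feat_t1 = [[1.0, 0.0], [1.0, 1.0]]
corr = correlation_volume(feat_t0, feat_t1)
```

Under this assumption, low similarity between a location and its counterpart in the other image suggests a change there. Comparing each location against all locations, rather than only its counterpart, is one way such a volume could tolerate viewpoint misalignment.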

Index terms

Semantic Scene Understanding, Recognition