Is Pre-Training Applicable to the Decoder for Dense Prediction?
Chao Ning, Wanshui Gan, Weihao Xuan, Naoto Yokoya
AI summary
Problem
Traditional dense prediction networks rely on pre-trained encoders but randomly initialize decoders, leaving the potential benefits of decoder pre-training unexplored despite advances in large-scale model training.
Approach
×Net repurposes a pre-trained classification model as a decoder by reversing its structure, introducing a Re-Shape module and Post Feature Pyramid, and mixing encoded features with original images to align decoding optimization with pre-training objectives.
Key results
- First framework to successfully load pre-trained weights into a dense prediction decoder
- Eliminates traditional feature pyramid connections to simplify optimization
- Surpasses advanced methods on monocular depth estimation and semantic segmentation benchmarks
- Demonstrates that decoder pre-training injects vital semantic information for sharper, more detailed predictions
Why it matters
It challenges conventional encoder-decoder design paradigms and provides a streamlined, highly effective approach for robotics and computer vision applications requiring precise dense perception.
Abstract
Encoder-decoder networks are commonly used model architectures for dense prediction tasks, where the encoder typically employs a model pre-trained on upstream tasks, while the decoder is often either randomly initialized or pre-trained on other tasks. In this paper, we introduce ×Net, a novel framework that leverages a model pre-trained on upstream tasks as the decoder, fostering a “pre-trained en- coder × pre-trained decoder” collaboration within the encoder- decoder network. ×Net effectively addresses the challenges associated with using pre-trained models in the decoding, applying the learned representations to enhance the decoding process. This enables the model to achieve more precise and high-quality dense predictions. By simply coupling the pre- trained encoder and pre-trained decoder, ×Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task- specific algorithms. Despite its streamlined design, ×Net out- performs advanced methods in tasks such as monocular depth estimation and semantic segmentation. The code is available at https://2j472no.github.io/xNet/.