← Back ICRA 2026

Is Pre-Training Applicable to the Decoder for Dense Prediction?

Chao Ning, Wanshui Gan, Weihao Xuan, Naoto Yokoya

PDF

AI summary

Key figure (auto-extracted from paper)

Pre-training the decoder alongside the encoder significantly improves dense prediction accuracy without requiring task-specific architectural modifications.

Dense prediction Pre-trained decoder Encoder-decoder networks Monocular depth estimation Semantic segmentation ×Net

Problem

Traditional dense prediction networks rely on pre-trained encoders but randomly initialize decoders, leaving the potential benefits of decoder pre-training unexplored despite advances in large-scale model training.

Approach

×Net repurposes a pre-trained classification model as a decoder by reversing its structure, introducing a Re-Shape module and Post Feature Pyramid, and mixing encoded features with original images to align decoding optimization with pre-training objectives.

Key results

First framework to successfully load pre-trained weights into a dense prediction decoder
Eliminates traditional feature pyramid connections to simplify optimization
Surpasses advanced methods on monocular depth estimation and semantic segmentation benchmarks
Demonstrates that decoder pre-training injects vital semantic information for sharper, more detailed predictions

Why it matters

It challenges conventional encoder-decoder design paradigms and provides a streamlined, highly effective approach for robotics and computer vision applications requiring precise dense perception.

Abstract

Encoder-decoder networks are commonly used model architectures for dense prediction tasks, where the encoder typically employs a model pre-trained on upstream tasks, while the decoder is often either randomly initialized or pre-trained on other tasks. In this paper, we introduce ×Net, a novel framework that leverages a model pre-trained on upstream tasks as the decoder, fostering a “pre-trained en- coder × pre-trained decoder” collaboration within the encoder- decoder network. ×Net effectively addresses the challenges associated with using pre-trained models in the decoding, applying the learned representations to enhance the decoding process. This enables the model to achieve more precise and high-quality dense predictions. By simply coupling the pre- trained encoder and pre-trained decoder, ×Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task- specific algorithms. Despite its streamlined design, ×Net out- performs advanced methods in tasks such as monocular depth estimation and semantic segmentation. The code is available at https://2j472no.github.io/xNet/.

Index terms

Deep Learning for Visual Perception RGB-D Perception Semantic Scene Understanding