ICRA 2026

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

Shintaro Nakaoka, Takayuki Kanai, Kazuhito Tanaka

AI summary

A ControlNet-inspired depth-conditioned fine-tuning method lets Navigation Foundation Models adapt to novel environments while preserving their pretrained knowledge, improving real-world navigation robustness.

Navigation Foundation Models, Depth Conditioning, ControlNet, Catastrophic Forgetting, Visual Navigation, Fine-Tuning

Problem

Navigation Foundation Models struggle to adapt to novel scenes and camera configurations due to geometric domain shifts, while standard fine-tuning often causes catastrophic forgetting that degrades obstacle avoidance and goal-reaching.

Approach

D-CLING attaches a trainable RGB-D branch to a frozen pretrained backbone, injecting dense depth cues via zero-initialized residual pathways to learn in-domain geometry without overwriting pretrained priors.
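The zero-initialized residual pathway can be illustrated with a toy sketch. It is not the authors' implementation: the class `DepthConditionedPolicy`, its single linear layers, the dimensions, and the random stand-in weights are all hypothetical, chosen only to show why the fine-tuned model starts out identical to the pretrained one. The frozen backbone processes RGB alone, the trainable branch starts as a copy of the backbone and additionally receives depth, and the branch output reaches the frozen features only through a projection whose weights start at zero.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, v) for v in vector]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class DepthConditionedPolicy:
    """Toy stand-in: frozen RGB backbone plus a trainable RGB-D branch."""

    def __init__(self, rgb_dim, depth_dim, hidden_dim, action_dim, seed=0):
        rng = random.Random(seed)
        # Stand-ins for pretrained weights; these stay frozen in fine-tuning.
        self.backbone = random_matrix(hidden_dim, rgb_dim, rng)
        self.head = random_matrix(action_dim, hidden_dim, rng)
        # Trainable branch: a copy of the backbone plus new depth weights.
        self.branch_rgb = [row[:] for row in self.backbone]
        self.branch_depth = random_matrix(hidden_dim, depth_dim, rng)
        # Zero-initialized projection that gates the branch into the backbone.
        self.zero_proj = [[0.0] * hidden_dim for _ in range(hidden_dim)]

    def frozen_features(self, rgb):
        return relu(matvec(self.backbone, rgb))

    def pretrained_forward(self, rgb):
        """Action prediction of the untouched pretrained model."""
        return matvec(self.head, self.frozen_features(rgb))

    def forward(self, rgb, depth):
        """Action prediction with the depth branch added as a residual."""
        base = self.frozen_features(rgb)
        rgb_part = matvec(self.branch_rgb, rgb)
        depth_part = matvec(self.branch_depth, depth)
        branch = relu([a + b for a, b in zip(rgb_part, depth_part)])
        residual = matvec(self.zero_proj, branch)
        fused = [b + r for b, r in zip(base, residual)]
        return matvec(self.head, fused)


policy = DepthConditionedPolicy(rgb_dim=6, depth_dim=3, hidden_dim=4, action_dim=2)
rgb = [0.2, -0.1, 0.5, 0.3, -0.4, 0.1]
depth = [1.5, 2.0, 0.8]
# The zero-initialized pathway contributes nothing yet, so both calls agree.
print(policy.forward(rgb, depth) == policy.pretrained_forward(rgb))
```

In this sketch, fine-tuning would update only `branch_rgb`, `branch_depth`, and `zero_proj`. Because the residual is exactly zero at the start, training begins from the pretrained behavior, and depth information enters only as far as the projection weights move away from zero.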

Key results

70% success rate in basic obstacle avoidance, surpassing zero-shot and full fine-tuning baselines
Over 50% reduction in human interventions during 50-meter long-range navigation
Maintains offline action prediction accuracy across pretraining domains, confirming prior preservation
Outperforms early RGB-D fusion strategies in dynamic corridor scenarios with unmapped obstacles

Why it matters

Enables reliable, real-world deployment of large-scale vision navigation models in novel settings without costly retraining or catastrophic forgetting.

Abstract

Navigation Foundation Models (NFMs) trained on large, cross-embodiment datasets have demonstrated strong generalization across various scenarios. In-domain fine-tuning of an NFM efficiently calibrates the visuomotor policy and promises further improvement even in novel scenarios. However, fine-tuned models still suffer from poor obstacle avoidance or fail to reach the provided goals. Furthermore, updating the model on a small subset of data typically erodes the pretrained prior, compromising the generalization gained in pretraining. Consequently, fine-tuning instead deteriorates the model’s capacity for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pretraining while learning efficiently in novel setups, such as new environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pretrained backbone via zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pretrained knowledge across various behaviors. Despite the method's simplicity, our comprehensive real-world navigation evaluation suggests that it enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed strategy maintains or further improves action prediction capability beyond the fine-tuning dataset, providing a key insight into continual learning for general navigation. Project page: https://toyotafrc.github.io/DCLING-Proj/

Index terms

Vision-Based Navigation, Transfer Learning, Imitation Learning