D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models
Shintaro Nakaoka, Takayuki Kanai, Kazuhito Tanaka
AI summary
Problem
Navigation Foundation Models struggle to adapt to novel scenes and camera configurations due to geometric domain shifts, while standard fine-tuning often causes catastrophic forgetting that degrades obstacle avoidance and goal-reaching.
Approach
D-CLING attaches a trainable RGB-D branch to a frozen pretrained backbone, injecting dense depth cues via zero-initialized residual pathways to learn in-domain geometry without overwriting pretrained priors.
Key results
- 70% success rate in basic obstacle avoidance, surpassing zero-shot and full fine-tuning baselines
- Over 50% reduction in human interventions during 50-meter long-range navigation
- Maintains offline action prediction accuracy across pretraining domains, confirming prior preservation
- Outperforms early RGB-D fusion strategies in dynamic corridor scenarios with unmapped obstacles
Why it matters
Enables reliable, real-world deployment of large-scale vision navigation models in novel settings without costly retraining or catastrophic forgetting.
Abstract
Navigation Foundation Models (NFMs) trained on large, cross-embodiment datasets have demonstrated strong generalization across a variety of scenarios. In-domain fine-tuning of an NFM efficiently calibrates the visuomotor policy and promises further improvement even in a novel scenario. However, fine-tuned models still suffer from poor obstacle avoidance or fail to reach the provided goals. Furthermore, updating the model on a small subset of data typically erodes the pretrained prior, compromising the generalization gained in pretraining. Consequently, fine-tuning can degrade, rather than improve, the model’s capability for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pretraining while efficiently learning in novel setups, such as new environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pretrained backbone through zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pretrained knowledge across various behaviors. Despite its simplicity, our comprehensive real-world navigation evaluation suggests that our proposal enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed strategy maintains or even improves action prediction capability beyond the fine-tuning dataset, providing a key insight into continual learning for general navigation. Project page: https://toyotafrc.github.io/DCLING-Proj/
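The zero-initialized residual pathway described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `FrozenBlock`, `DepthBranch`, and `forward`, the single-layer layout, and all dimensions are hypothetical, chosen only to show why the fine-tuned model starts out reproducing the pretrained output exactly.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class FrozenBlock:
    """Stand-in for one pretrained backbone block (assumption: a single
    linear layer plus ReLU). Its weights are never updated."""

    def __init__(self, dim, rng):
        self.W = random_matrix(dim, dim, rng)

    def __call__(self, x):
        return relu(matvec(self.W, x))


class DepthBranch:
    """Trainable copy of the frozen block that additionally receives depth
    features. The layout is hypothetical, for illustration only."""

    def __init__(self, frozen, depth_dim, rng):
        dim = len(frozen.W)
        # Start from a copy of the pretrained weights.
        self.W = [row[:] for row in frozen.W]
        # New projection that feeds depth features into the branch.
        self.W_depth = random_matrix(dim, depth_dim, rng)
        # Zero-initialized output projection: the branch contributes
        # nothing until training moves these weights away from zero.
        self.W_zero = [[0.0] * dim for _ in range(dim)]

    def __call__(self, x, depth):
        pre = [a + b for a, b in zip(matvec(self.W, x),
                                     matvec(self.W_depth, depth))]
        return matvec(self.W_zero, relu(pre))


def forward(frozen, branch, x, depth):
    """Frozen output plus the branch's residual contribution."""
    base = frozen(x)
    residual = branch(x, depth)
    return [b + r for b, r in zip(base, residual)]


rng = random.Random(0)
frozen = FrozenBlock(4, rng)
branch = DepthBranch(frozen, 2, rng)

x = [0.5, -1.0, 0.25, 2.0]   # toy RGB feature vector
depth = [1.5, 0.3]           # toy depth feature vector

# At initialization the combined model matches the pretrained block.
out_init = forward(frozen, branch, x, depth)
assert out_init == frozen(x)
```

In this sketch only the weights of `DepthBranch` would be updated during fine-tuning, while `FrozenBlock` stays fixed. Because the output projection starts at zero, training begins from the pretrained behavior and depth cues enter only as a learned residual.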