ICRA 2026

DINO-VO: A Feature-Based Visual Odometry Leveraging a Visual Foundation Model

Maulana Bisyir Azhari, David Hyunchul Shim

AI summary

DINO-VO achieves state-of-the-art, real-time visual odometry by effectively integrating the DINOv2 foundation model with tailored keypoint detection and fine-grained geometric features.

Visual Odometry, DINOv2, Visual Foundation Model, Feature Matching, Real-time Tracking, Deep Learning

Problem

Learning-based monocular visual odometry faces robustness and generalization challenges, while directly applying visual foundation models like DINOv2 is limited by their coarse feature granularity.

Approach

DINO-VO combines a patch-aligned salient keypoint detector with a lightweight CNN for fine-grained geometric features, using a transformer matcher and differentiable pose layer to estimate camera motion efficiently.
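As a rough illustration of what "patch-aligned" keypoint detection can mean, the sketch below selects at most one keypoint per ViT patch, so every keypoint maps to exactly one coarse DINOv2 feature. It is not the authors' detector: using gradient magnitude as the saliency score, the function name `patch_aligned_keypoints`, and the `min_score` threshold are all assumptions made for this example. The 14-pixel patch size is the one used by the DINOv2 ViT backbones.

```python
import numpy as np

PATCH = 14  # patch size (pixels) of the DINOv2 ViT backbones


def patch_aligned_keypoints(gray, patch=PATCH, min_score=0.0):
    """Pick at most one keypoint per patch (hypothetical helper).

    Saliency is assumed to be the image gradient magnitude; the paper's
    actual detector may differ.

    gray: 2D array of shape (H, W).
    Returns (K, 2) integer (x, y) pixel coordinates and (K,) scores.
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    gh, gw = h // patch, w // patch  # number of whole patches per axis

    # Crop to whole patches, then compute gradient magnitude per pixel.
    gy, gx = np.gradient(gray[: gh * patch, : gw * patch])
    mag = np.hypot(gx, gy)

    # Regroup pixels so the last axis holds one patch's patch*patch scores.
    cells = (
        mag.reshape(gh, patch, gw, patch)
        .transpose(0, 2, 1, 3)
        .reshape(gh, gw, patch * patch)
    )
    best = cells.argmax(axis=-1)       # flat index of the strongest pixel
    score = cells.max(axis=-1)
    dy, dx = np.divmod(best, patch)    # offset of that pixel inside its patch

    rows, cols = np.mgrid[0:gh, 0:gw]
    ys = rows * patch + dy
    xs = cols * patch + dx

    keep = score > min_score           # drop flat, textureless patches
    return np.stack([xs[keep], ys[keep]], axis=1), score[keep]
```

Because each surviving keypoint lies in a distinct patch, the coarse descriptor for a keypoint at `(x, y)` can be looked up at grid cell `(y // 14, x // 14)` of the DINOv2 feature map, and a finer descriptor from a separate CNN can be sampled at the exact pixel.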

Key results

Outperforms prior frame-to-frame VO methods on TartanAir and KITTI datasets
Competitive accuracy on the EuRoC dataset
Runs efficiently at 72 FPS with under 1 GB of GPU memory
Demonstrates superior generalization and robustness over standalone DINOv2 and traditional descriptors

Why it matters

Enables reliable, real-time camera tracking for autonomous robots and vehicles in texture-poor or dynamic environments without heavy computational costs.

Abstract

Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoint detector tailored to DINOv2’s coarse features. Furthermore, we complement DINOv2’s robust-semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and a differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1 GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.

Index terms

Deep Learning Methods, Localization