ForestVO: Enhancing Visual Odometry in Forest Environments through ForestGlue
Thomas Pritchard, Saifullah Ijaz, Ronald Clark, Basaran Bahadir Kocer
AI summary
Problem
Dense foliage, variable lighting, and repetitive textures degrade traditional visual odometry feature matching, while GPS is unreliable and LiDAR is too computationally heavy and expensive for lightweight drones.
Approach
ForestGlue adapts SuperPoint and retrained LightGlue/SuperGlue for forest-specific feature detection and matching, feeding the results into a lightweight transformer model that directly regresses relative camera poses from 2D keypoint coordinates.
Key results
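- Matches baseline LightGlue/SuperGlue pose accuracy using only 512 keypoints, 25% of the 2048 the baselines use
- Outperforms direct methods such as DSO by 40% in dynamic scenes on TartanAir forest sequences
- Remains competitive with TartanVO while being a lighter model trained on only 10% of the dataset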
- Delivers superior accuracy-efficiency trade-off over dense methods
Why it matters
Enables reliable, real-time autonomous navigation for resource-constrained drones and robots in GPS-denied forest environments.
Abstract
Recent advancements in visual odometry systems have improved autonomous navigation, yet challenges persist in complex environments like forests, where dense foliage, variable lighting, and repetitive textures compromise the accuracy of feature correspondences. To address these challenges, we introduce ForestGlue. ForestGlue enhances the SuperPoint feature detector through four configurations (grayscale, RGB, RGB-D, and stereo-vision inputs) optimised for various sensing modalities. For feature matching, we employ LightGlue or SuperGlue, both of which have been retrained using synthetic forest data. ForestGlue achieves pose estimation accuracy comparable to baseline LightGlue and SuperGlue models, yet requires only 512 keypoints, 25% of the 2048 keypoints used by the baseline models, to achieve an LO-RANSAC AUC score of 0.745 at a 10° threshold. Because it needs only a quarter of the keypoints, ForestGlue has the potential to reduce computational overhead whilst remaining effective in dynamic forest environments, making it a promising candidate for real-time deployment on resource-constrained platforms such as drones or mobile robots. By combining ForestGlue with a novel transformer-based pose estimation model, we propose ForestVO, which estimates relative camera poses using the 2D pixel coordinates of matched features between frames. On challenging TartanAir forest sequences, ForestVO achieves an average relative pose error (RPE) of 1.09 m and a KITTI score of 2.33%, outperforming direct methods such as DSO by 40% in dynamic scenes, while maintaining competitive performance with TartanVO despite being a significantly lighter model trained on only 10% of the dataset. This work establishes an end-to-end deep learning pipeline tailored for visual odometry in forested environments, leveraging forest-specific training data to optimise feature correspondence and pose estimation for improved accuracy and robustness in autonomous navigation systems.
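The abstract reports an AUC score at a 10° threshold but does not say how it is computed. The following is a minimal sketch of one common convention for pose-error AUC in feature-matching evaluation: the area under the cumulative recall curve of per-pair pose errors, normalised by the threshold. The function name `pose_auc`, its parameters, and the assumption that each error is a single angle in degrees per image pair are illustrative and not taken from the paper.

```python
import numpy as np

def pose_auc(errors_deg, thresholds_deg=(5.0, 10.0, 20.0)):
    """Area under the recall-vs-error curve, normalised to [0, 1].

    errors_deg: one pose error per image pair, in degrees (assumed input).
    thresholds_deg: error thresholds at which the AUC is reported.
    """
    errors = np.sort(np.asarray(errors_deg, dtype=float))
    n = len(errors)
    # Recall after the k-th smallest error is k / n.
    recall = np.arange(1, n + 1) / n
    # Prepend the origin so the curve starts at (0, 0).
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))

    aucs = []
    for t in thresholds_deg:
        # Number of curve points with error <= t.
        last = np.searchsorted(errors, t, side="right")
        # Clip the curve at t, holding recall constant up to the threshold.
        x = np.concatenate((errors[:last], [t]))
        y = np.concatenate((recall[:last], [recall[last - 1]]))
        # Trapezoidal integration, written out explicitly.
        area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
        aucs.append(float(area / t))
    return aucs
```

Under this convention, an AUC of 1.0 at 10° would mean every pair has zero error, and 0.0 would mean no pair falls under 10°; a score such as 0.745 sits between those extremes.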