IROS 2024

Sharing Attention Mechanism in V-SLAM: Relative Pose Estimation with Messenger Tokens on Small Datasets

Dun Dai, Quan Quan, Kai-Yuan Cai

Abstract

In V-SLAM, relative camera pose estimation is crucial for determining the spatial relationship between consecutive camera images, which helps to accurately track the camera's movement through its environment. In small indoor scenes the training set is often limited, a situation that is very common in robot SLAM, and learning-based methods may then fail to converge. This is especially true of the Transformer architecture, which requires a larger dataset to match the performance of CNN-based models. This work addresses the problem with a sharing attention mechanism that incorporates messenger tokens, building on recent advances in training vision Transformers on small datasets. In addition, double-embedding is introduced to capture both the spatial information within images and the order of the images. In summary, we introduce an intuitive end-to-end relative pose estimation solution and demonstrate its accuracy on the two smallest sub-datasets of 7Scenes. The proposed method is compared against CNN-based and Transformer-based end-to-end relative pose estimation models, as well as a robust non-learning feature-matching method, and our model outperforms all of them. Furthermore, ablation studies show that these innovations are crucial for the accuracy of relative pose estimation on small datasets.
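For readers unfamiliar with the quantity being estimated, the following is a minimal sketch of how a relative pose between two frames can be derived from two absolute poses, and how a rotation can be reduced to a single angle. It is not the authors' model or evaluation code. It assumes each pose is given as a camera-to-world rotation matrix R and translation vector t; the function names (`relative_pose`, `rotation_angle_deg`, `rot_z`) and the example values are hypothetical and chosen only for illustration.

```python
import math

def transpose(R):
    # Transpose of a 3x3 matrix stored as nested lists.
    return [[R[j][i] for j in range(3)] for i in range(3)]

def matmul(A, B):
    # Product of two 3x3 matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    # Product of a 3x3 matrix and a 3-vector.
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_pose(R1, t1, R2, t2):
    # Assumption: (R, t) are camera-to-world poses.
    # Pose of camera 2 expressed in camera 1's frame:
    #   R_rel = R1^T R2,  t_rel = R1^T (t2 - t1)
    R1_T = transpose(R1)
    R_rel = matmul(R1_T, R2)
    t_rel = matvec(R1_T, [t2[i] - t1[i] for i in range(3)])
    return R_rel, t_rel

def rotation_angle_deg(R):
    # Rotation angle from the trace: cos(theta) = (trace(R) - 1) / 2.
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))

def rot_z(deg):
    # Hypothetical helper: rotation about the z-axis by `deg` degrees.
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Illustrative values only: two cameras rotated 10 and 40 degrees about z,
# with the second camera displaced one unit along the world x-axis.
R_rel, t_rel = relative_pose(rot_z(10.0), [0.0, 0.0, 0.0],
                             rot_z(40.0), [1.0, 0.0, 0.0])
angle = rotation_angle_deg(R_rel)
```

Applying `rotation_angle_deg` to the product of a predicted rotation's transpose and the ground-truth rotation gives one common way to score rotation error for a relative pose estimate; whether the paper uses exactly this metric is not stated in the abstract.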

Index terms

SLAM; Deep Learning for Visual Perception; Visual Learning