Masked Mutual Guidance Transformer Tracking
Baojie Fan, Zhiquan Wang, Jiajun Ai, Caiyu Zhang
Abstract
Visual mask learning has received increasing attention in the field of visual object tracking. However, most existing studies merely use visual mask learning models as pre-trained models, without fully exploiting their potential for visual representation. In this paper, we present a novel approach for learning tracking target features, leveraging an encoder-decoder architecture in a masked mutual guidance (MMG) tracking framework. Initially, we perform joint visual feature extraction on both the template and search areas. Subsequently, these features undergo separate self-decoding processes, followed by mutual guidance decoding to reconstruct the original search and template images. This process fosters mutual understanding between the images, facilitating improved learning of object states and shapes across different frames. During the inference phase, we discard the decoder and implement a simple and effective tracker. Experimental results indicate that the proposed mutual guidance strategy is effective and achieves state-of-the-art performance on five tracking datasets.
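The training-time data flow described above (joint encoding, separate self-decoding, then mutual guidance decoding with a reconstruction objective) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the masking scheme, the attention design, or the loss, so the random token masking, the 0.5 mask ratio, the zero-valued mask token, the projection-free single-head attention, and the mean squared error below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into sorted (visible, masked) lists. Assumed scheme."""
    idx = list(range(num_tokens))
    rng.shuffle(idx)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(idx[num_masked:]), sorted(idx[:num_masked])


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention with no learned projections (a stand-in)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def self_decode(encoded, visible_idx, num_tokens, dim):
    """Restore the full token sequence (zeros as mask tokens), then self-attend."""
    full = [[0.0] * dim for _ in range(num_tokens)]
    for token, i in zip(encoded, visible_idx):
        full[i] = token
    return attend(full, full)


def mse(pred, target):
    """Mean squared error over all token features (assumed reconstruction loss)."""
    diffs = [(p - t) ** 2 for pt, tt in zip(pred, target) for p, t in zip(pt, tt)]
    return sum(diffs) / len(diffs)


def mmg_forward(template, search, mask_ratio=0.5, seed=0):
    """Sketch of one training-time forward pass; inputs are lists of token vectors."""
    rng = random.Random(seed)
    dim = len(template[0])
    t_vis, _ = random_mask(len(template), mask_ratio, rng)
    s_vis, _ = random_mask(len(search), mask_ratio, rng)

    # 1) Joint feature extraction over the concatenated visible tokens.
    joint = [template[i] for i in t_vis] + [search[i] for i in s_vis]
    encoded = attend(joint, joint)
    t_enc, s_enc = encoded[:len(t_vis)], encoded[len(t_vis):]

    # 2) Separate self-decoding for each branch.
    t_dec = self_decode(t_enc, t_vis, len(template), dim)
    s_dec = self_decode(s_enc, s_vis, len(search), dim)

    # 3) Mutual guidance decoding: each branch queries the other branch.
    t_rec = attend(t_dec, s_dec)
    s_rec = attend(s_dec, t_dec)
    return t_rec, s_rec


# Toy inputs: 4 template tokens and 9 search tokens, each of dimension 3.
template = [[float(i + d) for d in range(3)] for i in range(4)]
search = [[float(i * d) for d in range(3)] for i in range(9)]
t_rec, s_rec = mmg_forward(template, search)
loss = mse(t_rec, template) + mse(s_rec, search)
```

In this sketch the decoding steps (2) and (3) exist only to produce the reconstruction loss, which mirrors the statement that the decoder is not needed at inference time; a tracker built on it would keep only the joint encoding step and attach a prediction head, which is outside the scope of this illustration.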