SDTrack: Spatially Decoupled Tracker for Visual Tracking
Zihao Xia, Xin Bi, Baojie Fan, Zhiquan Wang
Abstract
Recent models based on encoder-decoder architectures have shown excellent performance in visual object tracking. The encoder models the global spatiotemporal feature correlation between the template and the search regions, while the decoder learns query embeddings to predict the spatial location of the target. However, in previous methods the decoder shares its queries between the classification and regression tasks, which may lead to suboptimal results. We observe that different regions of the visual feature map are suited to different tasks: salient regions of the object provide important information for the classification task, while the boundaries around it are more beneficial for the box localization task. We therefore propose a spatially decoupled tracker called SDTrack. The tracker contains a carefully designed query selection module that selects appropriate queries for the classification and regression tasks. We split the cross-attention module in the decoder and add a box-to-pixel relative position bias (BoxRPB) term to the cross-attention, so that each attention branch focuses on its own area of interest while introducing little overhead. Finally, we propose an alignment loss to address the misalignment between accurate classification and precise localization, further improving tracking performance. Extensive experiments demonstrate that SDTrack achieves new state-of-the-art performance on multiple benchmarks compared with previous work while running at real-time speed.
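As a rough illustration of how an additive position term can steer cross-attention toward a box, the following minimal NumPy sketch computes single-head attention weights as softmax(QK^T / sqrt(d) + B). It is not the SDTrack implementation: the abstract does not specify how the BoxRPB term is computed, so the function `toy_box_bias`, its `scale` parameter, and the rule "zero inside the box, increasingly negative outside" are hypothetical stand-ins for a learned bias.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def toy_box_bias(boxes, height, width, scale=4.0):
    """Hypothetical stand-in for a learned box-to-pixel bias.

    boxes: (Q, 4) array of (x1, y1, x2, y2) in feature-map pixel units.
    Returns a (Q, height * width) array that is 0 for pixel centers inside
    the box and grows more negative with distance outside it.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    xs = xs.reshape(1, -1) + 0.5  # pixel-center x coordinates, shape (1, H*W)
    ys = ys.reshape(1, -1) + 0.5  # pixel-center y coordinates, shape (1, H*W)
    x1, y1, x2, y2 = (boxes[:, i:i + 1] for i in range(4))
    # Distance from each pixel center to the box along each axis (0 if inside).
    dx = np.maximum(np.maximum(x1 - xs, xs - x2), 0.0)
    dy = np.maximum(np.maximum(y1 - ys, ys - y2), 0.0)
    return -(dx + dy) / scale


def biased_cross_attention(queries, keys, values, bias):
    """Single-head cross-attention with an additive bias on the logits.

    queries: (Q, d), keys: (N, d), values: (N, d_v), bias: (Q, N).
    Returns the attended features (Q, d_v) and the attention weights (Q, N).
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d) + bias
    weights = softmax(logits, axis=-1)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, d = 8, 8, 16
    keys = rng.normal(size=(H * W, d))
    values = rng.normal(size=(H * W, d))
    queries = rng.normal(size=(2, d))  # e.g. one query per task branch
    boxes = np.array([[2.0, 2.0, 6.0, 6.0], [1.0, 1.0, 7.0, 7.0]])
    bias = toy_box_bias(boxes, H, W)
    out, weights = biased_cross_attention(queries, keys, values, bias)
    print(out.shape, weights.shape)
```

Because the bias is added to the logits before the softmax, it reweights attention without changing the size of the attention computation. In a decoupled decoder, each branch could in principle receive its own bias; that pairing is likewise an assumption of this sketch.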