ICRA 2026

DMTrack: Spatio-Temporal Multimodal Tracking Via Dual-Adapter

Weihong Li, Shaohua Dong, Haonan Lu, Yanhao Zhang, Heng Fan, Libo Zhang

AI summary

DMTrack achieves state-of-the-art multimodal tracking with only 0.93M trainable parameters by leveraging a dual-adapter architecture for efficient spatio-temporal modeling.

multimodal tracking, parameter-efficient tuning, spatio-temporal modeling, adapter tuning, cross-modal fusion, vision transformers

Problem

Existing multimodal trackers either use parameter-efficient tuning but ignore temporal dynamics, or rely on full fine-tuning, which demands prohibitive memory and computational resources.

Approach

The method freezes a pre-trained RGB foundation model and injects two lightweight adapters: a spatio-temporal modality adapter for self-prompting within each modality, and a progressive modality complementary adapter for pixel-wise cross-modal prompting.
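The frozen-backbone-plus-adapter idea can be illustrated with a minimal sketch. It is not the DMTrack implementation: `FrozenLayer`, `BottleneckAdapter`, the toy dimensions, and the zero-initialised up-projection are all assumptions chosen for illustration, and a generic residual bottleneck adapter stands in for the paper's two adapters.

```python
import random


def matvec(weights, vec):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    """Build a rows x cols matrix with small random entries."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenLayer:
    """Hypothetical stand-in for one pre-trained backbone layer (never updated)."""

    def __init__(self, dim, rng):
        self.weight = random_matrix(dim, dim, rng)

    def __call__(self, x):
        return matvec(self.weight, x)


class BottleneckAdapter:
    """Generic residual adapter: x + up(relu(down(x))).

    Only these weights would be trained; the backbone stays fixed.
    """

    def __init__(self, dim, bottleneck, rng):
        self.down = random_matrix(bottleneck, dim, rng)
        # Assumption: zero-initialised up-projection, so the adapter starts
        # as an identity map and does not disturb the frozen features.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        delta = matvec(self.up, hidden)
        return [xi + di for xi, di in zip(x, delta)]

    def num_parameters(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


rng = random.Random(0)
DIM, BOTTLENECK = 8, 2  # toy sizes, not the paper's configuration

layer = FrozenLayer(DIM, rng)
adapter = BottleneckAdapter(DIM, BOTTLENECK, rng)

features = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
frozen_out = layer(features)
adapted_out = adapter(frozen_out)

frozen_params = DIM * DIM
trainable_params = adapter.num_parameters()
```

At initialisation the adapted features equal the frozen features, and only the adapter's two small projections would receive gradient updates during tuning.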

Key results

State-of-the-art performance on five benchmarks
Requires only 0.93M trainable parameters
Converges to optimal performance within 5 hours
Operates at approximately 39.21 FPS during inference
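
To put these figures in context, the short calculation below converts the reported frame rate into per-frame latency and estimates the trainable share of the model. The backbone size is an assumption (roughly a ViT-Base-sized model); this summary does not state the actual backbone parameter count.

```python
# Figures reported in the summary above.
TRAINABLE_PARAMS = 0.93e6
FPS = 39.21

# Assumption for illustration only: a frozen backbone of about 86M
# parameters (roughly ViT-Base). The real backbone size is not given here.
ASSUMED_BACKBONE_PARAMS = 86e6

# Per-frame latency implied by the reported throughput.
latency_ms = 1000.0 / FPS

# Fraction of all parameters that would be trained under the assumption.
trainable_share = TRAINABLE_PARAMS / (TRAINABLE_PARAMS + ASSUMED_BACKBONE_PARAMS)

print(f"latency per frame: {latency_ms:.2f} ms")
print(f"trainable share (assumed backbone): {trainable_share:.2%}")
```

Under that assumption, the adapters would account for only about one percent of the total parameters.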

Why it matters

Enables robust, video-level multimodal tracking on resource-constrained hardware without sacrificing accuracy, advancing practical deployment of foundation models in computer vision.

Abstract

In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key to DMTrack lies in two simple yet effective modules: a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA). The former, applied to each modality alone, adjusts spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent bridges the gap between different modalities and thus allows better cross-modality fusion. The latter facilitates cross-modality prompting progressively with two specially designed pixel-wise adapters, one shallow and one deep. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches and thereby lay the foundation for the subsequent modality fusion. The deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks demonstrate that DMTrack achieves state-of-the-art results. Our code and models will be available at https://github.com/Nightwatch-Fox11/DMTrack.
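
The abstract does not specify the form of the pixel-wise inter-modal attention, so the sketch below shows only one generic reading: at each pixel, a query modality attends over the two modality features located at that same pixel. The function name `pixelwise_inter_modal_prompt`, the scaled dot-product scoring, and the absence of learned projections are all assumptions, not DMTrack's actual design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pixelwise_inter_modal_prompt(query_feat, other_feat):
    """Hypothetical per-pixel attention across two modalities.

    query_feat, other_feat: lists of per-pixel feature vectors with the
    same number of pixels and the same channel dimension.
    Returns one prompt vector per pixel for the query modality.
    """
    prompts = []
    for q, o in zip(query_feat, other_feat):
        scale = math.sqrt(len(q))
        # Two scores per pixel: query vs. itself and query vs. the other modality.
        weights = softmax([dot(q, q) / scale, dot(q, o) / scale])
        prompts.append([weights[0] * qi + weights[1] * oi for qi, oi in zip(q, o)])
    return prompts


# Toy input: two pixels, two channels per modality.
rgb = [[1.0, 0.0], [0.5, 0.5]]
aux = [[0.0, 1.0], [0.5, 0.5]]
rgb_prompts = pixelwise_inter_modal_prompt(rgb, aux)
```

Because attention is restricted to the same spatial location, the cost grows linearly with the number of pixels rather than quadratically, which is one plausible reason to prefer a pixel-wise formulation in a lightweight adapter.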

Index terms

Visual Tracking, Deep Learning for Visual Perception, Sensor Fusion