TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, wu kui, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang
AI summary
Problem
Existing language-guided tracking models lack explicit spatial reasoning and robust temporal memory, causing frequent target loss during prolonged occlusions or in crowded scenes with similar distractors.
Approach
TrackVLA++ integrates a Polar Chain-of-Thought mechanism to predict the target's relative position as a compact token, paired with a Target Identification Memory module that uses confidence-aware gating to preserve target identity over long horizons.
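To make the idea of a compact polar-coordinate token concrete, the following is a minimal sketch of how a relative target position could be quantized into a single token id. It is an illustration only, not the authors' implementation: the bin counts, the maximum range, the extra "not visible" token, and the function names `polar_token` and `token_to_polar` are all assumptions introduced here.

```python
import math

# Hypothetical discretization; the summary does not state the real bin layout.
NUM_ANGLE_BINS = 36          # 10-degree sectors around the robot
NUM_DIST_BINS = 8            # equal-width range bins
MAX_DIST_M = 8.0             # distances beyond this fall into the last bin
INVALID_TOKEN = NUM_ANGLE_BINS * NUM_DIST_BINS  # assumed id for "target not visible"


def polar_token(dx, dy, visible=True):
    """Map a target offset (dx forward, dy left, in metres) to one token id."""
    if not visible:
        return INVALID_TOKEN
    angle = math.atan2(dy, dx)                          # in [-pi, pi]
    dist = min(math.hypot(dx, dy), MAX_DIST_M - 1e-9)   # clamp into the last bin
    # The modulo wraps angle == +pi back to sector 0.
    a_bin = int((angle + math.pi) / (2 * math.pi) * NUM_ANGLE_BINS) % NUM_ANGLE_BINS
    d_bin = int(dist / MAX_DIST_M * NUM_DIST_BINS)
    return a_bin * NUM_DIST_BINS + d_bin


def token_to_polar(token):
    """Return the bin-centre (angle_rad, distance_m), or None for the invalid token."""
    if token == INVALID_TOKEN:
        return None
    a_bin, d_bin = divmod(token, NUM_DIST_BINS)
    angle = (a_bin + 0.5) / NUM_ANGLE_BINS * 2 * math.pi - math.pi
    dist = (d_bin + 0.5) / NUM_DIST_BINS * MAX_DIST_M
    return angle, dist
```

Under these assumptions the whole spatial estimate fits in one vocabulary entry, which is consistent with the summary's point that single-token reasoning keeps inference fast: decoding one extra token costs far less than generating a free-form textual description of where the target is.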
Key results
- State-of-the-art success rates on EVT-Bench DT split (+5.1% egocentric, +12% multi-camera)
- New SOTA performance on the Gym-UnrealCV benchmark
- Robust zero-shot generalization in dynamic real-world tracking scenarios
- Efficient single-token spatial reasoning that maintains high inference speed
Why it matters
Provides a reliable foundation for companion and service robots to maintain continuous target tracking in complex, unstructured real-world environments.
Abstract
Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision–Language–Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target’s relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1% and 12% in the egocentric and multi-camera settings, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
∗Equal Contribution, †Equal Advising
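The gated update strategy mentioned in the abstract can be illustrated with a small sketch. This is not the paper's TIM module: it assumes the memory and the per-frame target feature are plain vectors, that a scalar confidence in [0, 1] is available for each frame, and that a fixed threshold closes the gate. The function name `gated_update` and the `threshold` parameter are hypothetical.

```python
def gated_update(memory, observation, confidence, threshold=0.5):
    """Blend a new target feature into memory, weighted by confidence.

    Below `threshold` the gate is closed and the memory values are kept,
    so an occluded or ambiguous frame cannot overwrite the stored identity.
    """
    if len(memory) != len(observation):
        raise ValueError("memory and observation must have the same length")
    gate = confidence if confidence >= threshold else 0.0
    return [(1.0 - gate) * m + gate * o for m, o in zip(memory, observation)]


memory = [1.0, 0.0]

# Low-confidence frame (for example, the target is occluded): memory is preserved.
memory = gated_update(memory, [0.0, 1.0], confidence=0.1)

# High-confidence frame: memory moves most of the way toward the observation.
memory = gated_update(memory, [0.0, 1.0], confidence=0.8)
```

The design point this sketch tries to capture is that the stored target representation changes only when the current estimate is trustworthy, which is one plausible way to keep a target's identity stable through long occlusions or when similar-looking distractors are in view.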