ICRA 2026

TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, wu kui, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang

AI summary

TrackVLA++ achieves state-of-the-art embodied visual tracking by combining compact spatial reasoning with confidence-gated long-horizon memory, enabling robust tracking under severe occlusions and distractors.

Embodied Visual Tracking, Vision-Language-Action, Spatial Reasoning, Long-Horizon Memory, Polar Chain-of-Thought, Robot Navigation

Problem

Existing language-guided tracking models lack explicit spatial reasoning and robust temporal memory, causing frequent target loss during prolonged occlusions or in crowded scenes with similar distractors.

Approach

TrackVLA++ integrates a Polar Chain-of-Thought mechanism to predict the target's relative position as a compact token, paired with a Target Identification Memory module that uses confidence-aware gating to preserve target identity over long horizons.
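To make the idea of a compact polar-coordinate token concrete, here is a minimal sketch of how a relative target position could be quantized into a single token id. The function names, bin counts, maximum range, and axis convention are illustrative assumptions; the summary does not specify how TrackVLA++ discretizes the polar coordinates.

```python
import math

# Hypothetical discretization parameters (not taken from the paper).
N_ANGLE = 60      # number of bearing bins over the full circle
N_DIST = 30       # number of distance bins
MAX_DIST = 6.0    # distances beyond this (in metres) are clamped


def polar_token(x: float, y: float) -> int:
    """Map a target offset (x forward, y left, in metres) to one token id."""
    theta = math.atan2(y, x)                 # bearing in [-pi, pi]
    r = min(math.hypot(x, y), MAX_DIST)      # clamp range to MAX_DIST
    a = min(int((theta + math.pi) / (2 * math.pi) * N_ANGLE), N_ANGLE - 1)
    d = min(int(r * N_DIST / MAX_DIST), N_DIST - 1)
    return a * N_DIST + d                    # single id in [0, N_ANGLE * N_DIST)


def token_to_polar(token: int) -> tuple[float, float]:
    """Decode a token id back to the (bearing, distance) at its bin centre."""
    a, d = divmod(token, N_DIST)
    theta = (a + 0.5) / N_ANGLE * 2 * math.pi - math.pi
    r = (d + 0.5) / N_DIST * MAX_DIST
    return theta, r


if __name__ == "__main__":
    tok = polar_token(2.0, 1.0)
    print(tok, token_to_polar(tok))
```

Under these assumptions the whole spatial estimate occupies one vocabulary entry, so the reasoning step adds a single decoded token rather than a multi-token textual rationale.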

Key results

State-of-the-art success rates on EVT-Bench DT split (+5.1% egocentric, +12% multi-camera)
New SOTA performance on the Gym-UnrealCV benchmark
Robust zero-shot generalization in dynamic real-world tracking scenarios
Efficient single-token spatial reasoning that maintains high inference speed

Why it matters

Provides a reliable foundation for companion and service robots to maintain continuous target tracking in complex, unstructured real-world environments.

Abstract

Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision–Language–Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target’s relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1% and 12%, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.

∗Equal Contribution, † Equal Advising
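The gated update strategy described for the TIM can be illustrated with a small sketch. The blending rule, the threshold, and the name `gated_update` are assumptions made for illustration; the abstract states only that the memory update is gated, not how the gate is computed.

```python
def gated_update(memory, feature, confidence, threshold=0.5):
    """Blend a new target feature into memory only when confidence is high.

    Hypothetical rule: below `threshold` the gate closes and the memory is
    kept unchanged (e.g. while the target is occluded); otherwise the new
    feature is mixed in with weight equal to the confidence.
    """
    gate = confidence if confidence >= threshold else 0.0
    return [(1.0 - gate) * m + gate * f for m, f in zip(memory, feature)]


if __name__ == "__main__":
    memory = [1.0, 0.0]
    # Low confidence (target likely occluded): memory is preserved.
    print(gated_update(memory, [0.0, 1.0], confidence=0.2))
    # High confidence: memory moves toward the new observation.
    print(gated_update(memory, [0.0, 1.0], confidence=0.8))
```

The point of such a gate is that frames in which the target is hidden or confused with a distractor do not overwrite the stored identity, which is the behaviour the abstract attributes to the TIM during extended occlusions.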

Index terms

Visual Tracking, Vision-Based Navigation, Learning from Demonstration