The Temporal Trap: Entanglement in Pre-Trained Visual Representations for Visuomotor Policy Learning
Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Chris Xiaoxuan Lu, Oisin Mac Aodha
AI summary
Problem
Pre-trained visual representations optimized for static images fail to capture temporal dependencies in sequential robot tasks, leading to temporal entanglement that obscures critical task-progression cues.
Approach
The authors quantify short- and long-range temporal entanglement across multiple models and introduce a simple baseline that augments policy inputs with timestep-based positional encodings to explicitly signal task progression.
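As a rough illustration of what augmenting policy inputs with timestep-based positional encodings could look like, the sketch below concatenates a sinusoidal encoding of the timestep to frozen per-frame features. The sinusoidal form, the function names (`timestep_encoding`, `augment_features`), the encoding width of 64, and the 384-dimensional random features are all assumptions made for illustration; the summary above does not specify the encoding used in the paper.

```python
import numpy as np


def timestep_encoding(t, dim=64, max_period=10000.0):
    """Sinusoidal encoding of timestep(s) t; returns an array of shape (len(t), dim).

    Assumption: a Transformer-style sinusoidal encoding. The paper's exact
    encoding is not given in the summary.
    """
    assert dim % 2 == 0, "dim must be even"
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def augment_features(features, timesteps, dim=64):
    """Concatenate per-frame features (T, D) with timestep encodings (T, dim)."""
    features = np.asarray(features, dtype=np.float64)
    return np.concatenate([features, timestep_encoding(timesteps, dim)], axis=1)


# Hypothetical episode: 50 frames of 384-dimensional features standing in
# for the output of a frozen pre-trained visual encoder.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 384))
policy_input = augment_features(features, np.arange(50))
print(policy_input.shape)  # (50, 448)
```

Under these assumptions, two visually similar frames from different stages of a task receive different policy inputs, because the appended encoding differs even when the visual features nearly coincide.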
Key results
- Quantified short- and long-range temporal entanglement across 14 pre-trained models
- Established strong negative correlation between entanglement/task-progression loss and policy success
- Proposed a timestep-encoding baseline that injects explicit task-progression signals
- Demonstrated that traditional feature augmentation methods are insufficient compared to explicit temporal signaling
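One hedged way to picture a task-progression loss is a probe that tries to predict normalised task progress (0 at the first frame, 1 at the last) from frozen features; a high held-out error suggests the features carry little progression information. The sketch below is an assumption-laden stand-in, not the paper's metric: the ridge-regression probe, the helper names, and the synthetic episodes are all hypothetical.

```python
import numpy as np


def progress_targets(length):
    """Normalised task progress in [0, 1] for an episode of the given length."""
    return np.linspace(0.0, 1.0, length)


def fit_ridge(X, y, ridge=1e-3):
    """Closed-form ridge regression with a bias column appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def probe_loss(train_eps, test_eps, ridge=1e-3):
    """Held-out mean squared error of a linear probe predicting task progress."""
    X = np.vstack(train_eps)
    y = np.concatenate([progress_targets(len(e)) for e in train_eps])
    w = fit_ridge(X, y, ridge)
    Xt = np.vstack(test_eps)
    yt = np.concatenate([progress_targets(len(e)) for e in test_eps])
    pred = np.hstack([Xt, np.ones((Xt.shape[0], 1))]) @ w
    return float(np.mean((pred - yt) ** 2))


def make_episode(rng, drift, length=40, dim=16):
    """Synthetic features: small noise plus an optional time-dependent drift."""
    noise = 0.1 * rng.normal(size=(length, dim))
    return noise + drift * progress_targets(length)[:, None] * np.ones(dim)


rng = np.random.default_rng(0)
# Features with no time-dependent component (hypothetical "entangled" case).
static = [make_episode(rng, drift=0.0) for _ in range(10)]
# Features that drift as the task progresses (hypothetical "disentangled" case).
drifting = [make_episode(rng, drift=1.0) for _ in range(10)]

entangled_loss = probe_loss(static[:8], static[8:])
disentangled_loss = probe_loss(drifting[:8], drifting[8:])
print(entangled_loss > disentangled_loss)  # True
```

In this toy setting the probe error is much lower when the features change systematically over the episode, which is the kind of signal the reported correlation with policy success relies on.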
Why it matters
Provides robotics researchers with a clear diagnostic metric and a simple architectural fix to overcome a fundamental limitation of static vision models in sequential control tasks.
Abstract
The integration of pre-trained visual representations (PVRs) has significantly advanced visuomotor policy learning. However, effectively leveraging these models remains a challenge. We identify temporal entanglement as a critical, inherent issue when using these time-invariant models in sequential decision-making tasks. This entanglement arises because PVRs, optimised for static image understanding, struggle to represent the temporal dependencies crucial for visuomotor control. In this work, we quantify the impact of temporal entanglement, demonstrating a strong correlation between a policy’s success rate and the ability of its latent space to capture task-progression cues. Based on these insights, we propose a simple, yet effective disentanglement baseline designed to mitigate temporal entanglement. Our empirical results show that traditional methods aimed at enriching features with temporal components are insufficient on their own, highlighting the necessity of explicitly addressing temporal disentanglement for robust visuomotor policy learning. Project Page: tsagkas.github.io/te.