ICRA 2026

The Temporal Trap: Entanglement in Pre-Trained Visual Representations for Visuomotor Policy Learning

Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Chris Xiaoxuan Lu, Oisin Mac Aodha

AI summary

Static pre-trained visual representations suffer from temporal entanglement that hinders robot control, but explicitly injecting a task-progression signal significantly boosts policy performance.

Visuomotor Policy Learning, Pre-trained Visual Representations, Temporal Entanglement, Task Progression, Robot Learning, Feature Disentanglement

Problem

Pre-trained visual representations optimized for static images fail to capture temporal dependencies in sequential robot tasks, leading to temporal entanglement that obscures critical task-progression cues.

Approach

The authors quantify short- and long-range temporal entanglement across multiple models and introduce a simple baseline that augments policy inputs with timestep-based positional encodings to explicitly signal task progression.
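To make the baseline concrete, here is a minimal sketch of augmenting a frozen visual feature vector with a timestep-based positional encoding. The summary does not say which encoding the authors use, so the Transformer-style sinusoidal form, the function names `timestep_encoding` and `augment_features`, the encoding width of 64, and the 384-dimensional feature size are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def timestep_encoding(t: int, dim: int = 64, max_period: float = 10000.0) -> np.ndarray:
    """Sinusoidal encoding of an integer timestep t.

    Assumption: a Transformer-style encoding; the paper may use a different form.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    # First half holds sines, second half holds cosines.
    return np.concatenate([np.sin(angles), np.cos(angles)])


def augment_features(features: np.ndarray, t: int, dim: int = 64) -> np.ndarray:
    """Concatenate a (hypothetical) PVR feature vector with the encoding of timestep t."""
    return np.concatenate([features, timestep_encoding(t, dim)])


if __name__ == "__main__":
    # Stand-in for the output of a frozen pre-trained encoder (assumed 384-d).
    pvr_features = np.zeros(384)
    policy_input = augment_features(pvr_features, t=3)
    print(policy_input.shape)
```

With this input, two visually near-identical frames taken at different points in an episode map to different policy inputs, which is the explicit task-progression signal the approach describes.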

Key results

Quantified short- and long-range temporal entanglement across 14 pre-trained models
Established strong negative correlation between entanglement/task-progression loss and policy success
Proposed a timestep-encoding baseline that injects explicit task-progression signals
Demonstrated that traditional feature augmentation methods are insufficient compared to explicit temporal signaling

Why it matters

Provides robotics researchers with a clear diagnostic metric and a simple architectural fix to overcome a fundamental limitation of static vision models in sequential control tasks.

Abstract

The integration of pre-trained visual representations (PVRs) has significantly advanced visuomotor policy learning. However, effectively leveraging these models remains a challenge. We identify temporal entanglement as a critical, inherent issue when using these time-invariant models in sequential decision-making tasks. This entanglement arises because PVRs, optimised for static image understanding, struggle to represent the temporal dependencies crucial for visuomotor control. In this work, we quantify the impact of temporal entanglement, demonstrating a strong correlation between a policy’s success rate and the ability of its latent space to capture task-progression cues. Based on these insights, we propose a simple yet effective disentanglement baseline designed to mitigate temporal entanglement. Our empirical results show that traditional methods aimed at enriching features with temporal components are insufficient on their own, highlighting the necessity of explicitly addressing temporal disentanglement for robust visuomotor policy learning. Project Page: tsagkas.github.io/te.

Index terms

Visual Learning, Imitation Learning, Learning from Demonstration