ICRA 2026

History-Aware Visuomotor Policy Learning Via Point Tracking

Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, Cewu Lu

AI summary

Object-centric point tracking compresses arbitrary-length visual history into compact tokens, enabling visuomotor policies to reliably execute long-horizon manipulation tasks without the Markov assumption.

Visuomotor policies, History representation, Point tracking, Object-centric memory, Long-horizon manipulation, Markov assumption

Problem

Most visuomotor policies rely on the Markov assumption and therefore fail on manipulation tasks that require remembering past actions, disambiguating repeated states, or tracking long-horizon dependencies. Existing history-aware methods suffer from computational inefficiency, redundancy, or limited scalability.

Approach

The method tracks 3D points on task-relevant objects across time to build object-centric trajectories, then compresses these unbounded tracks into compact feature tokens using a patch-based transformer encoder for seamless integration into standard policies.
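To make the compression step concrete, the following is a minimal NumPy sketch of how a variable-length 3D point track could be cut into temporal patches and reduced to one fixed-size object-level feature. It is an illustration under stated assumptions, not the authors' implementation: the function names, the patch length, the padding rule, and the random projection matrix are all hypothetical, and a plain linear embedding with mean pooling stands in for the paper's patch-based transformer encoder.

```python
import numpy as np


def patchify_tracks(tracks, patch_len):
    """Split a (T, N, 3) track array into temporal patches.

    Returns an array of shape (num_patches, N, patch_len * 3).
    Assumption: the last patch is padded by repeating the final frame.
    """
    T, N, _ = tracks.shape
    pad = (-T) % patch_len
    if pad:
        last = np.repeat(tracks[-1:], pad, axis=0)
        tracks = np.concatenate([tracks, last], axis=0)
    num_patches = tracks.shape[0] // patch_len
    # (num_patches, patch_len, N, 3) -> (num_patches, N, patch_len, 3)
    patches = tracks.reshape(num_patches, patch_len, N, 3).transpose(0, 2, 1, 3)
    return patches.reshape(num_patches, N, patch_len * 3)


def encode_object_history(tracks, weight, patch_len):
    """Map a track of any length T to a single (D,) object-level feature.

    `weight` has shape (patch_len * 3, D). The linear embedding and the
    mean pooling below are stand-ins (assumptions) for a learned
    transformer encoder and its aggregation step.
    """
    patches = patchify_tracks(tracks, patch_len)   # (P, N, patch_len * 3)
    tokens = patches @ weight                      # (P, N, D) patch tokens
    point_feats = tokens.mean(axis=0)              # (N, D) pooled over time
    return point_feats.mean(axis=0)                # (D,) pooled over points
```

Calling `encode_object_history` on a 10-frame track and on a 37-frame track with the same `weight` returns vectors of identical size, which is the property that lets a downstream policy consume history of arbitrary length. Mean pooling discards temporal order, so a real encoder would need positional information to support tasks such as action counting.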

Key results

Significantly outperforms Markovian and prior history-based baselines across diverse real-world manipulation tasks
Accurately handles varied memory demands including action counting, spatial memorization, and task stage identification
Achieves full history awareness with high computational efficiency by avoiding redundant frame processing
Maintains robustness to asynchronous tracking delays through training-time random dropping augmentation
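The random dropping augmentation in the last result can be sketched as follows. The summary does not say what exactly is dropped, so this assumes one plausible reading: during training, a random number of the most recent track frames is removed so that the policy learns to act on history that lags behind the current observation, as it would with an asynchronous tracker. The function name and the `max_drop` parameter are hypothetical.

```python
import numpy as np


def drop_recent_frames(tracks, max_drop, rng):
    """Randomly remove up to `max_drop` trailing frames from a (T, N, 3) track.

    Assumption: tracker delay is simulated by truncating the newest frames.
    At least one frame is always kept.
    """
    T = tracks.shape[0]
    limit = min(max_drop, T - 1)
    # rng.integers has an exclusive upper bound, so k is in [0, limit].
    k = int(rng.integers(0, limit + 1))
    return tracks[: T - k]
```

Applied once per training sample with a generator such as `np.random.default_rng(seed)`, this yields histories that are between zero and `max_drop` frames stale, while the current image observation is left untouched.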

Why it matters

Provides a scalable, memory-efficient solution for robots to execute complex, long-horizon tasks, bridging the gap between human-like abstraction and practical visuomotor policy deployment.

Abstract

Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements (such as task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory) and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: https://tonyfang.net/history/.

Index terms

Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation