History-Aware Visuomotor Policy Learning Via Point Tracking
Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, Cewu Lu
AI summary
Problem
Most visuomotor policies rely on the Markov assumption and therefore fail in manipulation tasks that require remembering past actions, disambiguating repeated states, or handling long-horizon dependencies. Existing history-aware methods struggle with computational inefficiency, redundancy, or limited scalability.
Approach
The method tracks 3D points on task-relevant objects across time to build object-centric trajectories, then compresses these unbounded tracks into compact feature tokens using a patch-based transformer encoder for seamless integration into standard policies.
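As a rough illustration of the patching step that would feed such an encoder, the sketch below splits one tracked 3D point trajectory into fixed-length temporal patches and flattens each patch into a vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `patchify_track`, the patch length of 4, patching along the time axis, and padding by repeating the last point are all hypothetical choices, and the transformer encoder itself is omitted.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # one tracked 3D point position (x, y, z)


def patchify_track(track: List[Point], patch_len: int = 4) -> List[List[float]]:
    """Split one point track into flattened temporal patches.

    Each patch covers `patch_len` consecutive timesteps and is flattened
    into a vector of length 3 * patch_len. A patch-based encoder would
    embed each vector as one token (the embedding is not shown here).
    """
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    if not track:
        return []
    patches = []
    for start in range(0, len(track), patch_len):
        chunk = list(track[start:start + patch_len])
        # Assumption: pad a short final chunk by repeating its last point.
        chunk += [chunk[-1]] * (patch_len - len(chunk))
        patches.append([coord for point in chunk for coord in point])
    return patches


# Hypothetical track: a point moving along x over 10 timesteps.
track = [(0.1 * t, 0.0, 0.5) for t in range(10)]
patches = patchify_track(track, patch_len=4)
```

In this sketch the number of patches grows with the track length while each patch keeps a fixed size, which is one simple way an arbitrarily long track could be turned into a sequence a transformer can consume.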
Key results
- Significantly outperforms Markovian and prior history-based baselines across diverse real-world manipulation tasks
- Accurately handles varied memory demands including action counting, spatial memorization, and task stage identification
- Achieves full history awareness with high computational efficiency by avoiding redundant frame processing
- Maintains robustness to asynchronous tracking delays through training-time random dropping augmentation
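The random dropping augmentation mentioned in the last result can be pictured with a small sketch. The summary does not specify what is dropped, so this is only one plausible reading: during training, a random number of the most recent track steps are removed so the policy sees histories that lag behind the current frame, as they might with an asynchronous tracker. The function name `drop_recent_steps` and the `max_drop` parameter are hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def drop_recent_steps(
    track: Sequence[T],
    max_drop: int,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return a copy of `track` with 0..max_drop of its newest steps removed.

    Assumption: tracker delay is simulated by truncating the end of the
    history. At least one step is always kept for a non-empty track.
    """
    if max_drop < 0:
        raise ValueError("max_drop must be non-negative")
    rng = rng or random.Random()
    limit = min(max_drop, len(track) - 1)
    if limit <= 0:
        return list(track)
    num_dropped = rng.randint(0, limit)  # inclusive on both ends
    return list(track[:len(track) - num_dropped])


# Hypothetical usage on a 10-step track with a fixed seed for repeatability.
full_track = list(range(10))
augmented = drop_recent_steps(full_track, max_drop=3, rng=random.Random(0))
```

Applying such a function only at training time would leave the inference pipeline unchanged while exposing the policy to delayed histories.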
Why it matters
Provides a scalable, memory-efficient solution for robots to execute complex, long-horizon tasks, bridging the gap between human-like abstraction and practical visuomotor policy deployment.
Abstract
Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements (such as task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory) and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: https://tonyfang.net/history/.