ICRA 2026

MTIL: Encoding Full History with Mamba for Temporal Imitation Learning

AI summary

MTIL leverages Mamba's linear-time recurrent dynamics to encode full trajectory history, enabling robots to resolve long-term temporal ambiguities and outperform state-of-the-art imitation learning methods.

Imitation learning · Mamba · State Space Models · temporal ambiguity · long-horizon control · robotics

Problem

Standard imitation learning often relies on the Markov assumption, which fails in long-horizon tasks where history is needed to resolve perceptual ambiguity. History-aware architectures such as Transformers scale quadratically with sequence length, which often makes long, high-dimensional observation sequences infeasible to process.
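To make the scaling argument concrete, here is a rough count of how the work grows with history length. It is only an illustration: the function names are hypothetical, and the counts ignore feature dimension, constants, and any attention optimizations, so they show growth rates rather than measured costs.

```python
def attention_pair_count(seq_len: int) -> int:
    """Token-to-token interactions in full self-attention over seq_len steps."""
    return seq_len * seq_len


def recurrent_step_count(seq_len: int) -> int:
    """State updates in a recurrent model that folds in one step at a time."""
    return seq_len


if __name__ == "__main__":
    for seq_len in (100, 1_000, 10_000):
        pairs = attention_pair_count(seq_len)
        steps = recurrent_step_count(seq_len)
        # The ratio equals seq_len, so the gap widens as the history grows.
        print(f"T={seq_len:>6}: attention pairs={pairs:>11}, "
              f"recurrent steps={steps:>6}, ratio={pairs // steps}")
```

At 10,000 steps the quadratic count is 10,000 times the linear one, which is the gap the Problem statement refers to.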

Approach

MTIL uses the Mamba-2 State Space Model to maintain a compressed hidden state that efficiently encodes the entire observation history, conditioning action predictions on this full temporal context rather than just the current frame.
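The idea of conditioning actions on a compressed state can be sketched with a toy linear state space recurrence. This is not the MTIL or Mamba-2 implementation: it uses fixed random parameters, a diagonal transition, and no input-dependent (selective) dynamics. All names (`make_ssm`, `step`, `rollout`) and dimensions are assumptions chosen for illustration.

```python
import numpy as np


def make_ssm(obs_dim, state_dim, act_dim, seed=0):
    """Random, untrained parameters for a diagonal linear SSM (illustrative only)."""
    rng = np.random.default_rng(seed)
    return {
        # Per-channel decay in (0, 1) keeps the hidden state bounded.
        "a": rng.uniform(0.8, 0.99, size=state_dim),
        "B": rng.normal(0.0, 0.5, size=(state_dim, obs_dim)),
        "C": rng.normal(0.0, 0.5, size=(act_dim, state_dim)),
    }


def step(params, h, obs):
    """One recurrent update: h_t = a * h_{t-1} + B x_t, action_t = C h_t."""
    h_next = params["a"] * h + params["B"] @ obs
    action = params["C"] @ h_next
    return h_next, action


def rollout(params, observations):
    """Fold a whole observation sequence into a fixed-size state, step by step."""
    h = np.zeros(params["a"].shape[0])
    actions = []
    for obs in observations:
        h, action = step(params, h, obs)
        actions.append(action)
    return h, np.stack(actions)


params = make_ssm(obs_dim=3, state_dim=8, act_dim=2)

# Two histories that end in the same observation but differ earlier.
shared_last = np.array([1.0, 0.0, 0.0])
history_a = np.stack([np.array([0.0, 1.0, 0.0]), shared_last])
history_b = np.stack([np.array([0.0, 0.0, 1.0]), shared_last])

state_a, actions_a = rollout(params, history_a)
state_b, actions_b = rollout(params, history_b)

# A policy that sees only the current frame would act identically here;
# the recurrent state carries the earlier difference into the final action.
print("final actions differ:", not np.allclose(actions_a[-1], actions_b[-1]))
```

The hidden state keeps the same size no matter how long the history is, and each new observation costs one update, so per-step compute and memory stay constant as the trajectory grows.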

Key results

Achieves perfect or near-perfect success rates on ACT benchmark tasks
Outperforms state-of-the-art methods like ACT and Diffusion Policy
Enables efficient training on commodity hardware without out-of-memory errors
Demonstrates superior lifelong learning performance on the LIBERO benchmark

Why it matters

It makes history-aware imitation learning computationally feasible and highly effective for complex, long-horizon robotic manipulation tasks.

Abstract

Standard imitation learning (IL) methods have achieved considerable success in robotics, yet often rely on the Markov assumption, which falters in long-horizon tasks where history is crucial for resolving perceptual ambiguity. This limitation stems not only from a conceptual gap but also from a fundamental computational barrier: prevailing architectures like Transformers are often constrained by quadratic complexity, rendering the processing of long, high-dimensional observation sequences infeasible. To overcome this dual challenge, we introduce Mamba Temporal Imitation Learning (MTIL). Our approach represents a new paradigm for robotic learning, which we frame as a practical synthesis of World Model and Dynamical System concepts. By leveraging the linear-time recurrent dynamics of State Space Models (SSMs), MTIL learns an implicit, action-oriented world model that efficiently encodes the entire trajectory history into a compressed, evolving state. This allows the policy to be conditioned on a comprehensive temporal context, transcending the confines of Markovian approaches. Through extensive experiments on simulated benchmarks (ACT, Robomimic, LIBERO) and on challenging real-world tasks, MTIL demonstrates superior performance against SOTA methods like ACT and Diffusion Policy, particularly in resolving long-term temporal ambiguities. Our findings not only affirm the necessity of full temporal context but also validate MTIL as a powerful and computationally feasible approach for learning long-horizon, non-Markovian behaviors from high-dimensional observations.

Index terms

Imitation Learning; Deep Learning in Grasping and Manipulation; Learning from Demonstration