ICRA 2026

T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation

Xingzu Zhan, Chen Xie, Honghang Chen, Yixun Lin, Xiaochun Mai

AI summary

Coupling keyframe saliency with motion periodicity in a Mamba-based architecture eliminates long-sequence drift and stabilizes text-to-motion generation against textual paraphrases.

Text-to-motion generation, Mamba, state-space models, motion periodicity, keyframe saliency, cross-modal alignment, diffusion models

Problem

Current text-to-motion models suffer from generation drift in long sequences by treating keyframe saliency and motion periodicity as independent, and they are highly sensitive to minor textual paraphrases that destabilize cross-modal alignment.

Approach

T2M Mamba explicitly weights physically meaningful keyframes and injects estimated motion phases into a state-space model, while using a periodic differential alignment module to robustly match text and motion embeddings across varying time scales.
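The paper's abstract attributes the keyframe weights to an enhanced Density Peaks Clustering step, but this page gives no details of the enhancement. As a rough illustration only, the sketch below computes the standard (unenhanced) Density Peaks decision score, local density times distance to the nearest denser point, and normalizes it into per-frame weights. The function name `density_peak_weights`, the `cutoff` parameter, the Gaussian density kernel, and the toy one-dimensional "pose features" are all assumptions made for this example, not the authors' algorithm.

```python
import math


def density_peak_weights(frames, cutoff):
    """Per-frame weights from the standard Density Peaks score rho * delta.

    frames: list of equal-length feature tuples (hypothetical pose features).
    cutoff: kernel width for the local density (hypothetical parameter).
    """
    n = len(frames)
    dist = [[math.dist(a, b) for b in frames] for a in frames]

    # Local density: Gaussian-kernel count of neighbours, excluding the frame itself.
    rho = [
        sum(math.exp(-((dist[i][j] / cutoff) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]

    # Delta: distance to the nearest frame with strictly higher density.
    # A frame with no denser neighbour gets its largest distance instead.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))

    # Frames that are both dense and far from denser frames score highest.
    gamma = [r * d for r, d in zip(rho, delta)]
    total = sum(gamma)
    if total == 0:
        return [1.0 / n] * n
    return [g / total for g in gamma]


# Toy input: two tight groups of frames; the middle frame of each group
# should receive the largest weight within its group.
toy_frames = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
toy_weights = density_peak_weights(toy_frames, cutoff=0.5)
```

This version builds the full pairwise distance matrix, so it is quadratic in the number of frames. It is meant only to show what a density-peak score measures, not how the paper keeps its overhead low.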

Key results

Achieves an FID of 0.068 on HumanML3D, surpassing state-of-the-art baselines
Eliminates historical forgetting in long sequences via explicit keyframe weighting and phase injection
Stabilizes cross-modal alignment against semantic paraphrases using phase-rotated differential attention
Maintains linear-time computational complexity with negligible overhead

Why it matters

Provides a stable, efficient foundation for high-fidelity avatar animation and humanoid robotic interaction by ensuring consistent motion synthesis from natural language.

Abstract

Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Although existing models achieve high fidelity, they still suffer from two core limitations: (i) they treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences; and (ii) they are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, and the distortion propagates through the decoder to produce unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) introducing Periodicity-Saliency Aware Mamba, which uses novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and for motion periodicity estimation via FFT-accelerated autocorrelation to capture the coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to make the alignment of textual and motion embeddings more robust. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, which achieves an FID of 0.068 and consistent gains on all other metrics.
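To make the phrase "FFT-accelerated autocorrelation" concrete, here is a minimal sketch of the general technique: the autocorrelation of a signal is the inverse transform of its power spectrum (the Wiener-Khinchin relation), so it can be computed in O(n log n) rather than O(n^2). The function names, the peak-picking rule, and the single-channel sine input are assumptions for illustration; the paper's actual estimator may differ.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    The inverse transform is left unnormalized (the caller divides by n).
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def autocorrelation(signal):
    """Normalized linear autocorrelation of a 1-D signal via the FFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    # Zero-pad to a power of two >= 2n so circular correlation equals linear.
    size = 1
    while size < 2 * n:
        size *= 2
    padded = [complex(c) for c in centered] + [0j] * (size - n)
    power = [abs(v) ** 2 for v in fft(padded)]
    corr = fft(power, inverse=True)
    acf = [corr[lag].real / size for lag in range(n)]
    if acf[0] == 0:
        return [0.0] * n  # constant signal: no periodic structure
    return [a / acf[0] for a in acf]


def estimate_period(signal, min_lag=2):
    """Lag (in frames) of the strongest local autocorrelation peak, or None."""
    acf = autocorrelation(signal)
    best_lag, best_val = None, 0.0
    for lag in range(max(min_lag, 1), len(acf) // 2):
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if is_peak and acf[lag] > best_val:
            best_lag, best_val = lag, acf[lag]
    return best_lag


# Hypothetical single joint channel: a 20-frame cycle sampled for 200 frames.
joint_angle = [math.sin(2 * math.pi * t / 20) for t in range(200)]
period = estimate_period(joint_angle)
```

On real motion data, one would presumably apply this per joint channel, or to a pooled feature, and then convert the estimated period into a phase signal. How T2M Mamba does that is not described on this page.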

Index terms

Human and Humanoid Motion Analysis and Synthesis; Modeling and Simulating Humans; Virtual Reality and Interfaces