T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation
Xingzu Zhan, Chen Xie, Honghang Chen, Yixun Lin, Xiaochun Mai
AI summary
Problem
Current text-to-motion models suffer from generation drift in long sequences because they treat keyframe saliency and motion periodicity as independent factors. They are also highly sensitive to minor textual paraphrases, which destabilize cross-modal alignment.
Approach
T2M Mamba explicitly weights physically meaningful keyframes and injects estimated motion phases into a state-space model, while using a periodic differential alignment module to robustly match text and motion embeddings across varying time scales.
Key results
- Achieves an FID of 0.068 on HumanML3D, surpassing state-of-the-art baselines
- Reduces historical forgetting and drift in long sequences via explicit keyframe weighting and phase injection
- Stabilizes cross-modal alignment against semantic paraphrases using phase-rotated differential attention
- Maintains linear-time computational complexity with negligible overhead
Why it matters
Provides a stable, efficient foundation for high-fidelity avatar animation and humanoid robotic interaction by ensuring consistent motion synthesis from natural language.
Abstract
Text-to-motion generation, which converts natural-language descriptions of motion into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Although existing models achieve high fidelity, they still suffer from two core limitations: (i) they treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences; (ii) they are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort the textual embeddings, and the distortion propagates through the decoder to produce unstable or erroneous motions. In this work, we propose T2M Mamba, which addresses these limitations with two components: (i) Periodicity-Saliency Aware Mamba, which estimates keyframe weights via enhanced Density Peaks Clustering and motion periodicity via FFT-accelerated autocorrelation to capture the coupled dynamics with minimal computational overhead, and (ii) a Periodic Differential Cross-modal Alignment Module (PDCAM), which makes the alignment of textual and motion embeddings more robust. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, which achieves an FID of 0.068 and consistent gains on all other metrics.
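The idea of estimating motion periodicity with FFT-accelerated autocorrelation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `estimate_period`, the choice of a 1-D input signal (for example, one joint coordinate or a per-frame motion-energy curve), and the peak-picking rule are all assumptions made for illustration. The sketch relies only on the Wiener-Khinchin relation, under which the autocorrelation is the inverse FFT of the power spectrum, so the cost is O(n log n) rather than O(n^2).

```python
import numpy as np


def estimate_period(signal, min_lag=2):
    """Estimate the dominant period (in frames) of a 1-D signal.

    Hypothetical helper, not taken from the paper. Returns
    (period, strength), where strength is the normalized
    autocorrelation at the chosen lag, or (None, 0.0) if no
    periodicity is detected.
    """
    x = np.asarray(signal, dtype=np.float64)
    x = x - x.mean()  # remove the DC offset so it does not dominate
    n = x.size

    # Zero-pad to at least 2n - 1 samples so the FFT yields a linear
    # (not circular) autocorrelation.
    n_fft = 1 << (2 * n - 1).bit_length()
    spectrum = np.fft.rfft(x, n=n_fft)
    acf = np.fft.irfft(np.abs(spectrum) ** 2, n=n_fft)[:n]

    if acf[0] <= 0.0:  # constant signal: no energy, no period
        return None, 0.0
    acf = acf / acf[0]  # normalize so that acf[0] == 1

    # Skip the main lobe around lag 0: start searching after the first
    # lag at which the autocorrelation becomes negative.
    negative = np.flatnonzero(acf < 0.0)
    if negative.size == 0:
        return None, 0.0
    start = max(min_lag, int(negative[0]))
    stop = n // 2  # lags beyond n/2 are estimated from too few samples
    if start >= stop:
        return None, 0.0

    lag = start + int(np.argmax(acf[start:stop]))
    return lag, float(acf[lag])


# Usage: a synthetic signal that repeats every 40 frames.
frames = np.arange(400)
period, strength = estimate_period(np.sin(2 * np.pi * frames / 40))
print(period, round(strength, 3))
```

On a clean sinusoid such as the one above, the estimate should land at or near the true cycle length. Real motion features are noisier and multi-dimensional, so a practical version would likely need smoothing, per-channel aggregation, and a threshold on `strength` before treating the sequence as periodic.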