SHAF: Small Language Model Integrated with Motion Modality for Multimodal Interaction
Aamir Ahmad Ansari, Nguyen Tan Viet Tuyen, Sarvapali Ramchurn
AI summary
Problem
Human motion remains underexplored in multimodal LLMs, with existing frameworks lacking native motion integration, full multimodal output capabilities, and publicly available multi-turn conversational datasets for daily activities.
Approach
SHAF converts images and 3D human motion into discrete tokens via vector quantization, aligns them with text in a shared embedding space, and fine-tunes a small 3B-parameter LLM on a newly created multi-turn multimodal dataset.
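The tokenization step can be illustrated with a minimal sketch of nearest-neighbour vector quantization. This is not the authors' implementation: the codebook, the feature vectors, the `TEXT_VOCAB_SIZE` value, and the helper names `quantize` and `to_shared_vocab` are all hypothetical. Placing motion codes after the text vocabulary by a fixed offset is one common way to share a token space, assumed here for illustration only.

```python
from typing import List, Sequence


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance)."""
    indices = []
    for vec in features:
        best_index = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
        )
        indices.append(best_index)
    return indices


def to_shared_vocab(indices: Sequence[int], offset: int) -> List[int]:
    """Shift codebook indices so they do not collide with text token ids.
    The offset scheme is an assumption, not a detail taken from the paper."""
    return [offset + i for i in indices]


# Hypothetical toy values, chosen only to make the example easy to follow.
TEXT_VOCAB_SIZE = 100
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8]]  # stand-ins for encoded motion frames

motion_codes = quantize(frames, codebook)
motion_tokens = to_shared_vocab(motion_codes, TEXT_VOCAB_SIZE)
print(motion_codes)   # [0, 1, 2]
print(motion_tokens)  # [100, 101, 102]
```

In a real system the codebook would be learned and the features would come from a trained encoder; the lookup itself stays the same.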
Key results
- Introduction of SHAF, a unified vision-language-motion LLM framework
- Creation of the SHAF dataset containing 14,296 multi-turn multimodal conversational samples
- Competitive performance in text-to-motion and motion-to-text tasks versus specialized models
- Demonstrated capability for multi-turn cross-modal tasks including motion-to-image and image reasoning
Why it matters
SHAF provides a lightweight, cost-effective foundation for researchers and developers building context-aware human-robot interaction systems that natively understand and generate human motion.
Abstract
Multimodal interaction plays a vital role in human–AI interaction, enabling robots or AI agents to interpret human input from multiple sensory channels and respond through diverse communication modalities. This paper introduces SHAF, an LLM-based multimodal model capable of handling text, image, and human motion as both input and output modalities across different multi-turn conversational settings. In SHAF, vector quantization is employed to convert images and human motion into an aligned set of tokens, followed by pre-training and instruction fine-tuning of a small Large Language Model (LLM) on our newly created SHAF dataset. Experimental results demonstrate that SHAF achieves competitive performance in text-to-motion and motion-to-text tasks in comparison to relevant works, while handling an additional modality and supporting a broader range of tasks. This research contributes an LLM-based multimodal approach, with the aim of fostering deeper exploration of the human motion modality in LLMs within the context of human-robot interaction (HRI) and related domains.