ICRA 2026

SHAF: Small Language Model Integrated with Motion Modality for Multimodal Interaction

Aamir Ahmad Ansari, Nguyen Tan Viet Tuyen, Sarvapali Ramchurn

AI summary

A lightweight LLM equipped with vector-quantized image and motion tokens achieves competitive cross-modal performance while enabling multi-turn human-AI conversations.
Multimodal LLMs, Human-Robot Interaction, Motion Generation, Vector Quantization, Multimodal Datasets, Small Language Models

Problem

Human motion remains underexplored in multimodal LLMs, with existing frameworks lacking native motion integration, full multimodal output capabilities, and publicly available multi-turn conversational datasets for daily activities.

Approach

SHAF converts images and 3D human motion into discrete tokens via vector quantization, aligns them with text in a shared embedding space, and fine-tunes a small 3B-parameter LLM on a newly created multi-turn multimodal dataset.
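To make the tokenization step concrete, here is a minimal sketch of vector quantization as a nearest-codebook lookup. It is not the paper's implementation: the toy 2-D codebook, the function names quantize and to_tokens, and the <motion_k> token format are all assumptions chosen for illustration. In practice the codebook is learned and far larger, and the features come from an encoder over images or 3D motion.

```python
from math import dist

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        for frame in frames
    ]

def to_tokens(indices, prefix="motion"):
    """Render code indices as special-token strings (hypothetical format)."""
    return [f"<{prefix}_{i}>" for i in indices]

# Toy codebook with three 2-D entries (assumed; real codebooks are learned).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Toy per-frame features standing in for encoder outputs (assumed values).
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

indices = quantize(frames, codebook)  # discrete code indices, one per frame
tokens = to_tokens(indices)           # strings an LLM vocabulary could be extended with
print(indices, tokens)
```

Once motion and images are expressed as discrete indices like these, they can be added to the language model's vocabulary as extra tokens and trained alongside ordinary text tokens, which is the general idea behind the shared embedding space described above.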

Key results

  • Introduction of SHAF, a unified vision-language-motion LLM framework
  • Creation of the SHAF dataset containing 14,296 multi-turn multimodal conversational samples
  • Competitive performance in text-to-motion and motion-to-text tasks versus specialized models
  • Demonstrated capability for multi-turn cross-modal tasks including motion-to-image and image reasoning

Why it matters

Provides a lightweight, cost-effective foundation for researchers and developers building context-aware human-robot interaction systems that natively understand and generate human motion.

Abstract

Multimodal interaction plays a vital role in human–AI interaction, enabling robots or AI agents to interpret human input from multiple sensory channels and respond through diverse communication modalities. This paper introduces SHAF, an LLM-based multimodal model capable of handling text, image, and human motion as both input and output modalities across different multi-turn conversational settings. In SHAF, vector quantization is employed to convert images and human motion into an aligned set of tokens, followed by pre-training and instruction fine-tuning of a small Large Language Model (LLM) on our newly created SHAF dataset. Experimental results demonstrate that SHAF achieves competitive performance in text-to-motion and motion-to-text tasks in comparison to relevant works, while handling an additional modality and supporting a broader range of tasks. This research contributes an LLM-based multimodal approach, with the aim of fostering deeper exploration of human motion modality in LLMs within the context of HRI and related domains.

Index terms

Multi-Modal Perception for HRI; Gesture, Posture and Facial Expressions; Deep Learning Methods
