ICRA 2026

Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind

Lance Ying, Xinyi Li, Shivam Aarya, Yizirui Fang, Jason Xinyu Liu, Yifan Yin, Stefanie Tellex, Joshua Tenenbaum, Tianmin Shu

AI summary

SIFToM enables robots to pragmatically interpret noisy spoken instructions by leveraging collaborative context and Theory of Mind reasoning, outperforming state-of-the-art VLMs and approaching human-level accuracy.

Theory of Mind, Spoken Instruction Following, Neurosymbolic AI, Human-Robot Collaboration, Vision-Language Models, Noisy Speech Robustness

Problem

Real-world human-robot collaboration is hindered by noisy or ambiguous spoken instructions that standard speech recognition and vision-language models fail to decode correctly. Humans naturally overcome this using pragmatic reasoning and shared context, but robots currently lack this capability.

Approach

The authors introduce SIFToM, a neurosymbolic framework that uses a vision-language model to parse multimodal inputs into symbolic representations like scene graphs and action sequences, then applies probabilistic Theory of Mind inference to deduce the human's true intent despite speech corruption.
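The inference step can be illustrated with a small Bayesian sketch. This is not the authors' implementation: the function names, the candidate instructions, and the prior values are hypothetical, and a plain string-similarity score stands in for a real speech-noise model. It only shows the general idea of weighing how well a noisy transcript matches each candidate instruction against how plausible that instruction is in the collaborative context.

```python
from difflib import SequenceMatcher


def transcript_likelihood(transcript: str, instruction: str) -> float:
    """Stand-in for P(transcript | instruction).

    Assumption: character-level similarity approximates how likely a
    corrupted transcript is to have come from a given intended instruction.
    A real system would use an acoustic or phonetic noise model instead.
    """
    similarity = SequenceMatcher(None, transcript.lower(), instruction.lower()).ratio()
    return similarity + 1e-6  # small floor so no candidate gets exactly zero


def infer_intent(transcript: str, context_prior: dict[str, float]) -> dict[str, float]:
    """Posterior over candidate instructions via Bayes' rule.

    context_prior maps each candidate instruction to P(instruction | context),
    e.g. derived from the scene and the human's recent actions (hypothetical here).
    """
    unnormalized = {
        instruction: transcript_likelihood(transcript, instruction) * prior
        for instruction, prior in context_prior.items()
    }
    total = sum(unnormalized.values())
    return {instruction: weight / total for instruction, weight in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical scenario: the human is setting a table, so a plate is more plausible.
    prior = {"bring me the plate": 0.7, "bring me the tape": 0.3}
    posterior = infer_intent("bring me the late", prior)
    for instruction, probability in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"{probability:.3f}  {instruction}")
```

With an uninformative prior, the result depends only on how closely the transcript matches each candidate. A strong contextual prior can override a slightly better acoustic match, which is the pragmatic behavior the summary describes.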

Key results

Novel neurosymbolic framework (SIFToM) for robust instruction following under noisy speech
New simulated dataset (UnclearInstruct) with real human speech and injected noise for evaluation
SIFToM with a lightweight VLM (Gemini 2.5 Flash) outperforms a larger state-of-the-art VLM baseline (Gemini 2.5 Pro)
Model performance approaches human-level accuracy in challenging spoken instruction tasks

Why it matters

This work bridges the gap between fragile AI speech processing and robust human-like pragmatic reasoning, advancing reliable human-robot collaboration in unstructured, real-world environments.

Abstract

Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings with human evaluations. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.

Index terms

Human-Robot Collaboration, Embodied Cognitive Science, Human-Robot Teaming