ICRA 2026

MIMO: A Multimodal Imitation Learning Framework for Mobile Manipulation with Exoskeleton-VR Teleoperation

Jie Mei, Xinkai Wu, Yue Zhang, Tao Song, Zhongxia Xiong

AI summary

MIMO combines a low-cost exoskeleton-VR teleoperation system with a novel multimodal imitation learning policy to enable precise, long-horizon whole-body mobile manipulation by a single operator.

Mobile Manipulation, Imitation Learning, Exoskeleton Teleoperation, VR Control, Long-Horizon Planning, Multimodal Alignment

Problem

Current teleoperation systems for whole-body mobile manipulation are costly, complex, and lack intuitive force feedback, hindering high-quality data collection. Meanwhile, imitation learning algorithms struggle with long-horizon error accumulation and fail to align multi-scale visual features with precise motion phases.

Approach

The authors introduce an integrated exoskeleton-VR teleoperation platform for single-operator whole-body control, paired with MIMO, an encoder-decoder imitation learning framework. MIMO uses a linear-complexity temporal model to mitigate long-sequence errors and a dual-path attention network to align multi-scale visual cues with corresponding motion phases.
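
The summary does not say which linear-complexity temporal model ECM-Net uses. As one generic illustration of how temporal mixing can cost O(T) in sequence length T instead of the O(T^2) of standard attention, the sketch below implements causal linear attention with running sums in NumPy. The feature map, function names, and array shapes are assumptions chosen for illustration; none of them come from the paper, and this is not the authors' ECM-Net.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention.
    # (Assumed for illustration; the paper's choice is not stated.)
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def causal_linear_attention(q, k, v):
    """Causal attention in O(T) time.

    q, k: arrays of shape (T, d); v: array of shape (T, d_v).
    Returns an array of shape (T, d_v).
    """
    T, d = q.shape
    d_v = v.shape[1]
    phi_q, phi_k = feature_map(q), feature_map(k)

    S = np.zeros((d, d_v))  # running sum of outer(phi_k[t], v[t])
    z = np.zeros(d)         # running sum of phi_k[t], used to normalise
    out = np.empty((T, d_v))
    for t in range(T):
        # Fold step t into the fixed-size state, then read it out.
        S += np.outer(phi_k[t], v[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z)
    return out


def causal_quadratic_reference(q, k, v):
    """Same result computed the O(T^2) way, for comparison only."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        w = phi_k[: t + 1] @ phi_q[t]   # similarity to every earlier step
        out[t] = (w / w.sum()) @ v[: t + 1]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 4))
    k = rng.normal(size=(8, 4))
    v = rng.normal(size=(8, 3))
    fast = causal_linear_attention(q, k, v)
    slow = causal_quadratic_reference(q, k, v)
    print(np.allclose(fast, slow))
```

Because the state (S, z) has a fixed size, each step costs the same no matter how many earlier steps exist. This constant per-step cost is what makes long action sequences tractable for this family of models.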

Key results

Low-cost exoskeleton-VR teleoperation platform enabling single-operator whole-body control with force feedback
Efficient Context Modeling Network (ECM-Net) for linear-complexity long-horizon temporal modeling
Multi-Receptive Field Fusion Network (MRF-Net) aligning multi-scale visual features with action semantics via dual-path attention
Superior success rates over state-of-the-art baselines in real-world whole-body mobile manipulation tasks

Why it matters

Enables scalable, high-quality data collection and precise long-horizon control for mobile robots, accelerating the deployment of complex manipulation skills in unstructured real-world environments.

Abstract

In whole-body mobile manipulation, existing teleoperation systems often suffer from high complexity and cost, while imitation learning approaches are frequently limited by insufficient modeling of long-horizon action sequences and inadequate fusion of multi-receptive-field visual features. These constraints significantly hinder the collection of high-quality demonstration data and the effective transfer of complex robotic skills. To address these challenges, this paper proposes an integrated exoskeleton-VR teleoperation system that enables single-operator whole-body control of mobile manipulators with basic force feedback, substantially reducing the cost of data collection while improving demonstration quality. Furthermore, we introduce MIMO, an encoder–decoder imitation learning framework, which incorporates an Efficient Context Modeling Network (ECM-Net) based on linear-complexity temporal modeling to mitigate error accumulation in long-horizon tasks, and a Multi-Receptive Field Fusion Network (MRF-Net) that employs dual-path attention to achieve precise alignment between multi-scale visual cues and motion phases. Real-world experiments on a mobile manipulator demonstrate that MIMO consistently outperforms state-of-the-art baselines across multiple whole-body mobile manipulation tasks, confirming its effectiveness in long-horizon, fine-grained robotic control.

Index terms

Mobile Manipulation, Imitation Learning, Wheeled Robots