MIMO: A Multimodal Imitation Learning Framework for Mobile Manipulation with Exoskeleton-VR Teleoperation
Jie Mei, Xinkai Wu, Yue Zhang, Tao Song, Zhongxia Xiong
AI summary
Problem
Existing teleoperation systems for whole-body mobile manipulation are complex and costly, which hinders the collection of high-quality demonstration data. Meanwhile, imitation learning methods model long-horizon action sequences insufficiently, so errors accumulate over time, and they fuse multi-receptive-field visual features inadequately, so visual cues are poorly aligned with motion phases.
Approach
The authors introduce an integrated exoskeleton-VR teleoperation platform for single-operator whole-body control, paired with MIMO, an encoder-decoder imitation learning framework. MIMO uses a linear-complexity temporal model to mitigate long-sequence errors and a dual-path attention network to align multi-scale visual cues with corresponding motion phases.
Key results
- Low-cost exoskeleton-VR teleoperation platform enabling single-operator whole-body control with basic force feedback
- Efficient Context Modeling Network (ECM-Net) for linear-complexity long-horizon temporal modeling
- Multi-Receptive Field Fusion Network (MRF-Net) aligning multi-scale visual features with action semantics via dual-path attention
- Consistently outperforms state-of-the-art baselines across multiple real-world whole-body mobile manipulation tasks
Why it matters
Enables scalable, high-quality data collection and precise long-horizon control for mobile robots, accelerating the deployment of complex manipulation skills in unstructured real-world environments.
Abstract
In whole-body mobile manipulation, existing teleoperation systems often suffer from high complexity and cost, while imitation learning approaches are frequently limited by insufficient modeling of long-horizon action sequences and inadequate fusion of multi-receptive-field visual features. These constraints significantly hinder the collection of high-quality demonstration data and the effective transfer of complex robotic skills. To address these challenges, this paper proposes an integrated exoskeleton-VR teleoperation system that enables single-operator whole-body control of mobile manipulators with basic force feedback, substantially reducing the cost of data collection while improving demonstration quality. Furthermore, we introduce MIMO, an encoder–decoder imitation learning framework, which incorporates an Efficient Context Modeling Network (ECM-Net) based on linear-complexity temporal modeling to mitigate error accumulation in long-horizon tasks, and a Multi-Receptive Field Fusion Network (MRF-Net) that employs dual-path attention to achieve precise alignment between multi-scale visual cues and motion phases. Real-world experiments on a mobile manipulator demonstrate that MIMO consistently outperforms state-of-the-art baselines across multiple whole-body mobile manipulation tasks, confirming its effectiveness in long-horizon, fine-grained robotic control.
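The abstract does not describe the internals of ECM-Net, so the sketch below is not the authors' method. It only illustrates what "linear-complexity temporal modeling" can mean, using causal kernelized linear attention with an `elu(x) + 1` feature map, a well-known technique swapped in here as an assumption. All function names are hypothetical. The point is that each time step updates a fixed-size running state instead of revisiting the whole history, so cost grows linearly with sequence length.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1, applied element-wise to one vector."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def causal_linear_attention(queries, keys, values):
    """Causal attention in O(T) time using running sums (illustrative only).

    queries, keys: T vectors of length d; values: T vectors of length d_v.
    The state (S, z) has fixed size, independent of the sequence length T.
    """
    d, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d)]  # running sum of outer(phi(k_s), v_s)
    z = [0.0] * d                        # running sum of phi(k_s)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        q, k = feature_map(q), feature_map(k)
        for i in range(d):
            z[i] += k[i]
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        norm = sum(q[i] * z[i] for i in range(d))
        outputs.append(
            [sum(q[i] * S[i][j] for i in range(d)) / norm for j in range(d_v)]
        )
    return outputs


def causal_quadratic_reference(queries, keys, values):
    """Same computation done the O(T^2) way, used only to check the fast path."""
    fq = [feature_map(q) for q in queries]
    fk = [feature_map(k) for k in keys]
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(fq):
        # Unnormalized weights over all steps up to and including t.
        w = [sum(a * b for a, b in zip(q, fk[s])) for s in range(t + 1)]
        total = sum(w)
        outputs.append(
            [sum(w[s] * values[s][j] for s in range(t + 1)) / total
             for j in range(d_v)]
        )
    return outputs
```

Both functions return the same values up to floating-point error. The first keeps only `S` and `z` between steps, which is the property that makes very long action sequences tractable. Whether ECM-Net uses this mechanism, a state-space model, or something else is not stated in the text above.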