Seeing Motion, Generating Action: Explicit Motion-Aware Policy for Robotic Action Generation
Yixiong Li, Ye Zhang, Yun Pei, Yongjian Zhang, Ruimao Zhang, Yulan Guo
AI summary
Problem
End-to-end visuomotor imitation learning struggles with modality mismatch and RGB redundancy, causing brittle policies that fail under visual perturbations like lighting changes.
Approach
The Motion-Aware Two-Stream Policy (MTP) separates spatial RGB features and temporal optical flow, fuses them via a TempoFormer module, and predicts actions using conditional flow matching.
Key results
- Highest average success rate (63.1%) across five ManiSkill simulation tasks
- Significantly outperforms baselines under green and blue ambient lighting shifts
- Maintains stable performance under extreme saturation jitter where other policies fail
- Novel TempoFormer module effectively fuses multi-step optical flow features
Why it matters
Provides a robust, appearance-invariant foundation for deploying reliable visuomotor policies in visually diverse real-world environments.
Abstract
Imitation learning (IL) offers a scalable framework for teaching robots complex manipulation skills from human demonstrations. However, conventional end-to-end visuomotor IL models often suffer from poor performance and robustness due to the significant modality mismatch between high-dimensional visual inputs and low-dimensional motor actions. Redundant information in RGB images, such as the color of ambient light, leads models to depend on strong yet brittle task-irrelevant priors that ultimately degrade performance across diverse visual environments. To address these limitations, we propose the Motion-Aware Two-Stream Policy (MTP), a novel imitation learning architecture that explicitly incorporates motion priors via optical flow alongside RGB observations. MTP employs a two-stream perception module that separately encodes spatial (RGB) and temporal (optical flow) information. These spatial-temporal features are fused and fed into a conditional flow matching module to generate actions. We evaluate MTP extensively in both simulation and real-world robot tasks. Results show that MTP significantly outperforms state-of-the-art baselines in terms of success rate and robustness to visual perturbations, demonstrating its effectiveness in generalizable robotic manipulation.
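To make the action-generation step more concrete, here is a minimal sketch of conditional flow matching with a straight-line probability path, written in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the path, the loss, or the integrator, and the names `fuse_features`, `cfm_training_pair`, `cfm_loss`, and `sample_action` are hypothetical. Plain concatenation stands in for the fusion step (the summary attributes fusion to a TempoFormer module, which is not reproduced here), and `velocity_fn` is a placeholder for a learned network.

```python
import numpy as np


def fuse_features(rgb_feat, flow_feat):
    """Stand-in fusion (assumption): concatenate spatial and temporal features."""
    return np.concatenate([rgb_feat, flow_feat], axis=-1)


def cfm_training_pair(action, noise, t):
    """Straight-line path from noise (t=0) to the demonstrated action (t=1).

    Returns the interpolated point x_t and the constant target velocity.
    """
    x_t = (1.0 - t) * noise + t * action
    target_velocity = action - noise
    return x_t, target_velocity


def cfm_loss(velocity_fn, action, noise, t, cond):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t, target = cfm_training_pair(action, noise, t)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))


def sample_action(velocity_fn, cond, action_dim, steps=10, rng=None):
    """Generate an action by Euler-integrating the velocity field from t=0 to t=1."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(action_dim)  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In use, `cond` would be the fused observation feature, for example `fuse_features(rgb_feat, flow_feat)`, and `velocity_fn(x, t, cond)` would be trained by minimizing `cfm_loss` over demonstrated actions with `t` drawn uniformly from [0, 1). At inference time only `sample_action` is needed, so the number of Euler steps trades off speed against integration accuracy.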