ICRA 2026

Seeing Motion, Generating Action: Explicit Motion-Aware Policy for Robotic Action Generation

Yixiong Li, Ye Zhang, Yun Pei, Yongjian Zhang, Ruimao Zhang, Yulan Guo

AI summary

Explicitly incorporating optical flow into a two-stream architecture significantly boosts the success rate and visual robustness of robotic imitation learning policies.

Imitation Learning, Visuomotor Policy, Optical Flow, Two-Stream Architecture, Conditional Flow Matching, Robotic Manipulation

Problem

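End-to-end visuomotor imitation learning suffers from a modality mismatch between high-dimensional visual inputs and low-dimensional motor actions, and from redundant RGB information such as ambient light color. The result is brittle policies that fail under visual perturbations like lighting changes.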

Approach

The Motion-Aware Two-Stream Policy (MTP) separates spatial RGB features and temporal optical flow, fuses them via a TempoFormer module, and predicts actions using conditional flow matching.
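The action-generation step can be illustrated with a minimal sketch of conditional flow matching. It assumes a linear interpolation path between Gaussian noise (t = 0) and the demonstrated action (t = 1), flat action vectors of shape (batch, action_dim), and Euler integration at inference. These choices are common defaults, not details confirmed by this summary. The names `velocity_fn`, `cfm_training_pair`, `cfm_loss`, and `euler_sample` are hypothetical, and `cond` stands in for the fused RGB and optical-flow features.

```python
import numpy as np

def cfm_training_pair(noise, action, t):
    """Return the interpolated sample x_t and its target velocity.

    Assumes the linear path x_t = (1 - t) * noise + t * action,
    whose velocity along the path is the constant (action - noise).
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1)  # one time value per batch row
    x_t = (1.0 - t) * noise + t * action
    target_velocity = action - noise
    return x_t, target_velocity

def cfm_loss(velocity_fn, noise, action, t, cond):
    """Mean squared error between predicted and target velocities."""
    x_t, target = cfm_training_pair(noise, action, t)
    pred = velocity_fn(x_t, t, cond)  # hypothetical network: (x_t, t, cond) -> velocity
    return float(np.mean((pred - target) ** 2))

def euler_sample(velocity_fn, noise, cond, steps=10):
    """Generate actions by integrating the velocity field from t=0 to t=1."""
    x = np.array(noise, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full(x.shape[0], k * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In training, `t` would typically be drawn uniformly from [0, 1] for each sample and `velocity_fn` would be a neural network conditioned on the perception features. At inference, a small number of Euler steps turns a noise sample into an action.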

Key results

Highest average success rate (63.1%) across five ManiSkill simulation tasks
Significantly outperforms baselines under green and blue ambient lighting shifts
Maintains stable performance under extreme saturation jitter where other policies fail
Novel TempoFormer module effectively fuses multi-step optical flow features
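For readers unfamiliar with the perturbations named in these results, the sketch below shows one common way to simulate an ambient lighting shift (per-channel gains) and a saturation change (blending with a grayscale version of the image). The summary does not specify the paper's exact perturbation procedure, so the function names `tint` and `adjust_saturation`, the gain values, and the luma weights are assumptions for illustration only.

```python
import numpy as np

def tint(image, gains):
    """Scale each RGB channel by a gain to mimic colored ambient light.

    `image` is a float array in [0, 1] with shape (H, W, 3).
    """
    return np.clip(image * np.asarray(gains, dtype=float), 0.0, 1.0)

def adjust_saturation(image, factor):
    """Blend the image with its grayscale version.

    factor = 1 leaves the image unchanged, factor = 0 removes all color,
    and factor > 1 exaggerates color (clipped back to [0, 1]).
    """
    luma = np.array([0.299, 0.587, 0.114])  # assumed Rec. 601 luma weights
    gray = (image @ luma)[..., None]        # shape (H, W, 1)
    return np.clip(gray + factor * (image - gray), 0.0, 1.0)

# Example: a green-tinted frame, then an extreme saturation boost.
frame = np.random.default_rng(0).uniform(size=(4, 4, 3))
green_frame = tint(frame, gains=(0.6, 1.0, 0.6))
vivid_frame = adjust_saturation(frame, factor=3.0)
```

Perturbations like these change pixel colors without changing scene motion, which is the intuition behind pairing RGB with an optical flow stream.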

Why it matters

Provides a robust, appearance-invariant foundation for deploying reliable visuomotor policies in visually diverse real-world environments.

Abstract

Imitation learning (IL) offers a scalable framework for teaching robots complex manipulation skills from human demonstrations. However, conventional end-to-end visuomotor IL models often suffer from poor performance and robustness due to the significant modality mismatch between high-dimensional visual inputs and low-dimensional motor actions. Redundant information in RGB images, such as the color of ambient light, leads models to depend on strong yet brittle task-irrelevant priors that ultimately degrade performance across diverse visual environments. To address these limitations, we propose the Motion-Aware Two-Stream Policy (MTP), a novel imitation learning architecture that explicitly incorporates motion priors via optical flow alongside RGB observations. MTP employs a two-stream perception module that separately encodes spatial (RGB) and temporal (optical flow) information. These spatial-temporal features are fused and fed into a conditional flow matching module to generate actions. We evaluate MTP extensively in both simulation and real-world robot tasks. Results show that MTP significantly outperforms state-of-the-art baselines in terms of success rate and robustness to visual perturbations, demonstrating its effectiveness in generalizable robotic manipulation.

Index terms

Imitation Learning, Learning from Demonstration