ICRA 2026

M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation

Ju Dong, Lei Zhang, Liding Zhang, Yao Ling, Yu Fu, Kaixin Bai, Zoltan-Csaba Marton, Zhenshan Bing, Zhaopeng Chen, Alois Knoll, Jianwei Zhang

AI summary

M4Diffuser significantly improves mobile manipulation success rates and reduces collisions by fusing multi-view diffusion policies with a computationally efficient, manipulability-aware whole-body controller.

Mobile manipulation; Diffusion policy; Whole-body control; Quadratic programming; Multi-view perception; Manipulability-aware control

Problem

Single-view learning policies lack robustness and generalization in unstructured environments, while classical whole-body controllers suffer from high computational overhead, trajectory jerk, and poor manipulability near singularities.

Approach

The framework combines a multi-view diffusion transformer policy that generates robust end-effector goals from complementary camera feeds with a novel ReM-QP controller that removes slack variables for speed and adds manipulability preferences for stability.
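To make the idea of a manipulability preference concrete, the sketch below computes the classical Yoshikawa manipulability measure, m = sqrt(det(J Jᵀ)), and its gradient for a planar two-link arm. This summary does not state which manipulability measure or arm model ReM-QP uses, so the Yoshikawa measure, the toy arm, and every function name here are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian (2x2) of a planar two-link arm.

    The arm model and link lengths are illustrative assumptions,
    not the robot used in the paper.
    """
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ])

def yoshikawa_manipulability(J):
    """Yoshikawa measure m = sqrt(det(J J^T)); it vanishes at a singularity."""
    det = np.linalg.det(J @ J.T)
    return float(np.sqrt(max(det, 0.0)))  # clamp tiny negative round-off

def manipulability_gradient(q, eps=1e-6):
    """Central finite-difference gradient of m with respect to the joint angles.

    A controller could use such a gradient as a preference term that nudges
    joint velocities away from singular configurations (hypothetical usage).
    """
    q = np.asarray(q, dtype=float)
    grad = np.zeros_like(q)
    for i in range(q.size):
        dq = np.zeros_like(q)
        dq[i] = eps
        m_plus = yoshikawa_manipulability(planar_2link_jacobian(*(q + dq)))
        m_minus = yoshikawa_manipulability(planar_2link_jacobian(*(q - dq)))
        grad[i] = (m_plus - m_minus) / (2 * eps)
    return grad

if __name__ == "__main__":
    for q2 in (0.0, np.pi / 4, np.pi / 2):
        m = yoshikawa_manipulability(planar_2link_jacobian(0.3, q2))
        print(f"elbow angle {q2:.3f} rad -> manipulability {m:.4f}")
    print("gradient at (0.3, pi/4):", manipulability_gradient([0.3, np.pi / 4]))
```

For this toy arm with unit link lengths the measure reduces to |sin(q2)|, so it drops to zero when the arm is fully stretched (q2 = 0) and peaks when the elbow is bent at a right angle. That is the kind of signal a manipulability-aware controller can use to steer away from singular configurations.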

Key results

7%–56% higher success rates and 3%–31% fewer collisions over baselines
28% faster task execution and 35% lower end-effector jerk via ReM-QP
23%–56% success rate gain from multi-view over single-view diffusion policies
Strong generalization to unseen objects and novel task configurations

Why it matters

Enables reliable, real-time whole-body coordination for autonomous robots operating in complex, unstructured environments.

Abstract

Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view, exploration, and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware Quadratic Programming (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives with both close-range object details and global scene context to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 7%–56% higher success rates and reduces collisions by 3%–31% over baselines. Our approach demonstrates robust performance for smooth whole-body coordination, and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/m4diffuser.

†Corresponding author: lei.zhang-1@studium.uni-hamburg.de
1 TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.
2 Technical University of Munich, Germany.
3 Agile Robots SE, Munich, Germany.
This work is supported by National Key Research and Development Program of China (2025YFE0217000).

Index terms

Mobile Manipulation; Task and Motion Planning; Deep Learning Methods