ICRA 2026

Multi-Modal Manipulation Via Multi-Modal Policy Consensus

Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell

AI summary

Factorizing policies into modality-specific diffusion experts and dynamically weighting them via a learned router outperforms feature concatenation and adapts to occlusion and sensor loss.

Multi-modal learning; Policy composition; Diffusion models; Robotic manipulation; Sensor fusion; Adaptive routing

Problem

Monolithic feature concatenation in robotic policies suppresses sparse but critical sensory signals like touch and cannot flexibly incorporate new or missing modalities without full retraining.

Approach

The method factorizes the policy into separate diffusion models for each sensory modality and uses a learned router network to dynamically weight and combine their outputs at the policy level.
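
The policy-level combination described above can be illustrated with a minimal sketch. It assumes each modality expert returns a noise prediction of the same dimension at a given denoising step, and that the router emits one logit per modality that is turned into consensus weights by a softmax. The function names, the softmax, and the handling of missing modalities by renormalizing over the remaining experts are illustrative assumptions, not details confirmed by the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_predictions(expert_preds, router_logits, available=None):
    """Weighted sum of per-modality predictions (hypothetical helper).

    expert_preds:  dict mapping modality name -> list of floats
                   (e.g. one expert's noise prediction at one step).
    router_logits: dict mapping modality name -> float score.
    available:     optional set of modality names that are present;
                   absent modalities are dropped and the weights are
                   renormalized (an assumption made for this sketch).
    Returns (combined prediction, dict of consensus weights).
    """
    names = [n for n in expert_preds if available is None or n in available]
    if not names:
        raise ValueError("at least one modality must be available")
    weights = softmax([router_logits[n] for n in names])
    dim = len(expert_preds[names[0]])
    combined = [0.0] * dim
    for w, n in zip(weights, names):
        for i, v in enumerate(expert_preds[n]):
            combined[i] += w * v
    return combined, dict(zip(names, weights))


# Toy usage: two experts with equal router scores.
preds = {"vision": [1.0, 0.0], "touch": [0.0, 1.0]}
logits = {"vision": 0.0, "touch": 0.0}
combined, weights = combine_predictions(preds, logits)
touch_only, touch_weights = combine_predictions(preds, logits, available={"touch"})
```

In this sketch, removing a modality only changes which experts enter the weighted sum, which is one way a factorized policy could tolerate sensor loss without retraining the remaining experts.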

Key results

18% relative improvement in simulation success rate over feature concatenation baselines
Highest success rates across real-world occluded picking, in-hand reorientation, and puzzle insertion tasks
Robust performance under physical perturbations, runtime disturbances, and sensor corruption
Quantitative analysis confirms adaptive, context-dependent shifts between vision and tactile reliance
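
One way to picture the perturbation-based importance analysis named in the abstract is the sketch below. It assumes importance is measured by perturbing one modality's input at a time and recording how far the policy output moves, then normalizing the scores so they sum to one. The `policy` callable, the `perturb` function, and the Euclidean distance are hypothetical choices for illustration; the paper's exact procedure may differ.

```python
import math


def modality_importance(policy, obs, perturb):
    """Relative sensitivity of the policy output to each modality.

    policy:  callable taking a dict (modality name -> list of floats)
             and returning an action as a list of floats.
    obs:     dict of modality name -> list of floats.
    perturb: callable applied to a single modality's input
             (e.g. zeroing it or adding noise).
    Returns a dict of modality name -> normalized importance score.
    """
    base = policy(obs)
    scores = {}
    for name in obs:
        perturbed = dict(obs)  # shallow copy; only one modality changes
        perturbed[name] = perturb(obs[name])
        out = policy(perturbed)
        scores[name] = math.sqrt(sum((a - b) ** 2 for a, b in zip(base, out)))
    total = sum(scores.values())
    if total == 0.0:
        return {n: 0.0 for n in scores}
    return {n: s / total for n, s in scores.items()}


# Toy policy that leans three times harder on vision than on touch.
def toy_policy(o):
    return [3.0 * o["vision"][0] + 1.0 * o["touch"][0]]


def zero_out(x):
    return [0.0 for _ in x]


importance = modality_importance(
    toy_policy, {"vision": [1.0], "touch": [1.0]}, zero_out
)
```

Running the same computation at different phases of a task would show whether the scores shift between vision and touch, which is the kind of context-dependent behavior the key results describe.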

Why it matters

Enables flexible, robust, and interpretable multi-modal robot learning that adapts to changing task phases and sensor availability, advancing reliable embodied AI.

Abstract

Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We also conduct a perturbation-based importance analysis, which reveals adaptive shifts between modalities. Project website: https://adaptivescene.github.io

Index terms

Agent-Based Systems; Force and Tactile Sensing; AI-Enabled Robotics