Multi-Modal Manipulation Via Multi-Modal Policy Consensus
Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell
AI summary
Problem
Monolithic feature concatenation in robotic policies suppresses sparse but critical sensory signals like touch and cannot flexibly incorporate new or missing modalities without full retraining.
Approach
The method factorizes the policy into separate diffusion models for each sensory modality and uses a learned router network to dynamically weight and combine their outputs at the policy level.
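As a rough illustration of what combining outputs at the policy level could look like, the sketch below takes a weighted sum of per-modality denoising predictions at a single diffusion step. It is not the authors' implementation. The function names, the softmax normalization of router scores, and the assumption that every modality-specific model predicts a noise term of the same shape are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of router scores."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def combine_noise_predictions(noise_preds, router_logits):
    """Blend per-modality noise predictions with router-derived weights.

    noise_preds:   dict mapping modality name -> predicted noise array
                   (hypothetical: all arrays share one action-shaped layout).
    router_logits: dict mapping modality name -> unnormalized router score
                   (hypothetical stand-in for the learned router's output).
    Returns the weighted prediction and the normalized weights.
    """
    names = sorted(noise_preds)
    weights = softmax(np.array([router_logits[n] for n in names], dtype=float))
    combined = sum(w * noise_preds[n] for w, n in zip(weights, names))
    return combined, dict(zip(names, weights))


# Toy example: two modalities, equal router scores, action horizon 4, action dim 2.
preds = {"vision": np.ones((4, 2)), "touch": np.zeros((4, 2))}
combined, weights = combine_noise_predictions(preds, {"vision": 0.0, "touch": 0.0})

# Dropping a modality only changes the set being normalized over.
vision_only, vision_weights = combine_noise_predictions(
    {"vision": preds["vision"]}, {"vision": 0.0}
)
```

Because the weights are renormalized over whichever modalities are present, removing or adding an entry does not require changing the other models in this sketch, which mirrors the flexibility the summary describes.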
Key results
- 18% relative improvement in simulation success rate over feature concatenation baselines
- Highest success rates across real-world occluded picking, in-hand reorientation, and puzzle insertion tasks
- Robust performance under physical perturbations, runtime disturbances, and sensor corruption
- Quantitative analysis confirms adaptive, context-dependent shifts between vision and tactile reliance
Why it matters
Enables flexible, robust, and interpretable multi-modal robot learning that adapts to changing task phases and sensor availability, advancing reliable embodied AI.
Abstract
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We also conduct a perturbation-based importance analysis, which reveals adaptive shifts between modalities. Project website: https://adaptivescene.github.io
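The abstract does not spell out how the perturbation-based importance analysis is carried out. One generic way to do such an analysis, sketched below under stated assumptions, is to add noise to one modality's input at a time and measure how much the policy's action changes. The function name, the Gaussian noise model, the L2 distance, and the toy policy are all hypothetical and chosen only to make the idea concrete.

```python
import numpy as np


def modality_importance(policy, observation, noise_scale=0.1, num_samples=32, seed=0):
    """Estimate each modality's relative influence on the policy output.

    policy:      callable mapping an observation dict to an action array
                 (hypothetical interface, not the paper's API).
    observation: dict mapping modality name -> feature array.
    Returns scores normalized to sum to 1 (all zeros if nothing changes).
    """
    rng = np.random.default_rng(seed)
    baseline = policy(observation)
    scores = {}
    for name, value in observation.items():
        deltas = []
        for _ in range(num_samples):
            # Perturb only this modality; leave the others untouched.
            perturbed = dict(observation)
            perturbed[name] = value + noise_scale * rng.standard_normal(value.shape)
            deltas.append(np.linalg.norm(policy(perturbed) - baseline))
        scores[name] = float(np.mean(deltas))
    total = sum(scores.values())
    return {n: (s / total if total > 0 else 0.0) for n, s in scores.items()}


# Toy policy that ignores vision entirely and acts on touch alone.
def touch_only_policy(obs):
    return 2.0 * obs["touch"][:2]


obs = {"vision": np.zeros(8), "touch": np.ones(4)}
scores = modality_importance(touch_only_policy, obs)
```

For this toy policy the touch score dominates because perturbing vision never changes the action. Running the same measurement at different points in an episode is one way the shifts between modalities mentioned above could be tracked.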