Multimodal Diffusion Forcing for Forceful Manipulation
Zixuan Huang, Huaidian Hou, Dmitry Berenson
AI summary
Problem
Standard imitation learning typically maps fixed observations directly to actions, ignoring complex cross-modal dependencies and lacking robustness to partial or corrupted sensory inputs at inference time.
Approach
The authors train a diffusion model using a 2D time-modality noise matrix to randomly corrupt and reconstruct multimodal trajectory sequences, capturing temporal and cross-modal dependencies while allowing flexible conditioning at inference.
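As a rough illustration of the time-modality noise matrix idea, the sketch below draws an independent noise level for every (timestep, modality) cell and corrupts a toy trajectory accordingly. It is not the authors' implementation: the function names, the cosine noise schedule, the uniform sampling of levels, and the flattened per-modality feature vectors are all assumptions made for this example.

```python
import numpy as np

def sample_noise_matrix(num_steps, num_modalities, num_levels, rng):
    """Draw an independent integer noise level for every (timestep, modality) cell."""
    return rng.integers(0, num_levels, size=(num_steps, num_modalities))

def corrupt(trajectory, noise_matrix, num_levels, rng):
    """Add Gaussian noise to each modality at each timestep, per its own level.

    trajectory:   dict of modality name -> array of shape (T, D_m) (hypothetical layout)
    noise_matrix: int array of shape (T, M); columns follow the dict's key order
    Returns the noisy trajectory and the noise that was added (a common regression target).
    """
    # Assumed cosine schedule: level 0 keeps the signal intact,
    # the highest level leaves almost pure noise.
    alpha_bar = np.cos(0.5 * np.pi * noise_matrix / num_levels) ** 2  # shape (T, M)
    noisy, noise = {}, {}
    for m, (name, x) in enumerate(trajectory.items()):
        eps = rng.standard_normal(x.shape)
        a = alpha_bar[:, m:m + 1]  # shape (T, 1), broadcasts over features
        noisy[name] = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps
        noise[name] = eps
    return noisy, noise

rng = np.random.default_rng(0)
traj = {"image": np.ones((8, 4)), "force": np.ones((8, 6)), "action": np.ones((8, 2))}
levels = sample_noise_matrix(num_steps=8, num_modalities=3, num_levels=10, rng=rng)
noisy_traj, added_noise = corrupt(traj, levels, num_levels=10, rng=rng)
```

Because every cell gets its own level, a single training batch can mix clean and heavily corrupted entries across both time and modality, which is what would let a denoiser learn to fill in any subset of the trajectory from the rest.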
Key results
- Matches or exceeds the success rates of specialized models in simulated and real-world forceful manipulation tasks
- Demonstrates strong robustness to noisy, partial, or missing sensory observations at inference
- Enables zero-shot test-time flexibility for policy execution, world modeling, and fine-grained anomaly detection
- Introduces a time-modality noise matrix that enables fine-grained anomaly localization across modalities and timesteps
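One way to picture the test-time flexibility listed above is as a choice of which cells of the noise matrix are held clean (observed) and which start from full noise (to be generated). The sketch below is a hypothetical illustration only: the modality names, the `conditioning_matrix` helper, the two mode names, and the convention that level 0 means "observed" are assumptions, not details taken from the paper.

```python
import numpy as np

MODALITIES = ("image", "force", "action")  # hypothetical modality order

def conditioning_matrix(mode, horizon, num_context, max_level, missing=()):
    """Build a (horizon, M) matrix of noise levels for inference.

    0 marks a cell that is observed and kept fixed; max_level marks a cell
    the model must generate. `missing` lists modalities with no sensor data.
    """
    k = np.full((horizon, len(MODALITIES)), max_level, dtype=int)
    k[:num_context, :] = 0  # context steps: every modality is observed
    if mode == "world_model":
        # The full action sequence is given; future observations are predicted.
        k[:, MODALITIES.index("action")] = 0
    elif mode != "policy":
        # In "policy" mode future actions stay at max_level and are generated.
        raise ValueError(f"unknown mode: {mode}")
    for name in missing:
        # An unavailable sensor is treated as fully unknown at every step.
        k[:, MODALITIES.index(name)] = max_level
    return k

policy_k = conditioning_matrix("policy", horizon=6, num_context=2, max_level=9)
world_k = conditioning_matrix("world_model", horizon=6, num_context=2, max_level=9,
                              missing=("force",))
```

Under this framing, switching between acting, predicting outcomes, and coping with a missing sensor changes only the matrix handed to the sampler, not the trained model.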
Why it matters
Provides a single, robust framework for multimodal robot learning that adapts to varying sensor configurations and downstream tasks without retraining.
Abstract
Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities (i.e., sensory inputs, actions, and rewards), which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities but also achieves strong performance and robustness under noisy observations. More visualizations can be found on our website: https://unified-df.github.io