ICRA 2026

Multimodal Diffusion Forcing for Forceful Manipulation

Zixuan Huang, Huaidian Hou, Dmitry Berenson

AI summary

A unified diffusion model trained with time-modality noise masking learns multimodal robot trajectories, enabling flexible inference and robust performance in contact-rich manipulation tasks.

Multimodal learning, diffusion models, robot manipulation, masked training, anomaly detection, forceful control

Problem

Standard imitation learning typically maps fixed observations directly to actions, ignoring complex cross-modal dependencies and lacking robustness to partial or corrupted sensory inputs at inference time.

Approach

The authors train a diffusion model using a 2D time-modality noise matrix to randomly corrupt and reconstruct multimodal trajectory sequences, capturing temporal and cross-modal dependencies while allowing flexible conditioning at inference.
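As a rough illustration of the corruption step described above, the sketch below draws one noise level per (timestep, modality) cell and uses it to noise each modality of a trajectory independently. It is a minimal sketch, not the authors' implementation: the function names, the uniform sampling of noise levels, the variance-preserving mixing rule, and the modality names and dimensions are all assumptions made for illustration.

```python
import numpy as np


def sample_noise_matrix(num_steps, num_modalities, rng):
    """Draw an independent noise level in [0, 1] for every (timestep, modality) cell.

    Assumption: levels are uniform; the paper may use a different schedule.
    """
    return rng.uniform(0.0, 1.0, size=(num_steps, num_modalities))


def corrupt(trajectory, noise_matrix, rng):
    """Noise each modality of a trajectory according to its column of the matrix.

    trajectory: dict mapping modality name -> array of shape (T, D_m).
    Column m of noise_matrix applies to the m-th key of the dict.
    Level 0 leaves a cell clean; level 1 replaces it with pure Gaussian noise.
    """
    noisy, eps_used = {}, {}
    for m, (name, x) in enumerate(trajectory.items()):
        level = noise_matrix[:, m][:, None]  # shape (T, 1), broadcast over features
        eps = rng.standard_normal(x.shape)
        # Assumed variance-preserving mix of signal and noise.
        noisy[name] = np.sqrt(1.0 - level) * x + np.sqrt(level) * eps
        eps_used[name] = eps
    return noisy, eps_used


# Hypothetical trajectory: 8 timesteps, 6-D force readings and 7-D actions.
rng = np.random.default_rng(0)
trajectory = {"force": np.zeros((8, 6)), "action": np.zeros((8, 7))}
noise_matrix = sample_noise_matrix(8, len(trajectory), rng)
noisy_trajectory, eps = corrupt(trajectory, noise_matrix, rng)
```

A denoising network would then be trained to recover the clean trajectory (or the added noise) given `noisy_trajectory` and `noise_matrix`; that network is omitted here.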

Key results

Achieves on-par or superior success rates to specialized models in simulated and real-world forceful manipulation tasks
Demonstrates strong robustness to noisy, partial, or missing sensory observations at inference
Enables zero-shot test-time flexibility for policy execution, world modeling, and fine-grained anomaly detection
Introduces a time-modality noise matrix that enables fine-grained anomaly localization across modalities and timesteps
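One way to picture the test-time flexibility listed above is that each use case corresponds to a different choice of noise matrix at inference: cells marked clean are treated as given, and cells marked fully noisy are generated. The sketch below is a hedged illustration of that idea only. The modality names, their ordering, the two modes, and the 0/1 convention are assumptions, not details confirmed by the summary.

```python
import numpy as np

# Hypothetical modality ordering; the real model may use other modalities.
MODALITIES = ("image", "force", "action")


def conditioning_matrix(mode, horizon, t_now):
    """Build a (horizon, num_modalities) matrix: 0 = given (clean), 1 = to be generated."""
    k = np.ones((horizon, len(MODALITIES)))
    k[:t_now, :] = 0.0  # everything before t_now has been observed or executed
    if mode == "policy":
        # Future observations and actions are all generated from the history.
        pass
    elif mode == "world_model":
        # A planned action sequence is given for the whole horizon;
        # future images and forces are predicted from it.
        k[:, MODALITIES.index("action")] = 0.0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return k


policy_k = conditioning_matrix("policy", horizon=4, t_now=2)
world_k = conditioning_matrix("world_model", horizon=4, t_now=2)
```

Under this convention, handling a missing sensor would amount to setting that modality's column to 1 so the model fills it in rather than conditioning on it.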

Why it matters

Provides a single, robust framework for multimodal robot learning that adapts to varying sensor configurations and downstream tasks without retraining.

Abstract

Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities (sensory inputs, actions, and rewards), which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities but also achieves strong performance and robustness under noisy observations. More visualizations can be found on our website: https://unified-df.github.io

Index terms

Sensorimotor Learning; Deep Learning in Grasping and Manipulation; Force and Tactile Sensing