ICRA 2026

Mini Diffuser: Fast Multi-Task Diffusion Policy Training Using Two-Level Mini-Batches

Yutong Hu, Pinhao Song, Kehan Wen, Renaud Detry

AI summary

Mini Diffuser cuts multi-task diffusion policy training time and memory by an order of magnitude while preserving 95% of state-of-the-art performance.

Diffusion Policies; Multi-task Learning; Training Efficiency; Robotic Manipulation; Two-level Batching; Transformer Architecture

Problem

Training generalist robotic diffusion policies is computationally prohibitive due to the repeated processing of high-dimensional visual and language conditions for each action sample.

Approach

The method exploits the dimensionality asymmetry between conditions and actions by reusing shared conditions across multiple noised action samples via two-level mini-batching, supported by a masked-attention transformer architecture.
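The batching idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_two_level_batch`, the cosine-style noise schedule, and the dictionary layout are all assumptions made for illustration. The sketch only shows the pairing pattern, in which each condition appears once and is reused by `k` independently noised copies of its action.

```python
import math
import random


def make_two_level_batch(conditions, actions, k, num_steps=100, seed=0):
    """Pair each condition with k independently noised copies of its action.

    conditions: list of B stand-ins for encoded vision-language conditions
    actions:    list of B clean action vectors (lists of floats)
    Returns a list of B items; each holds one condition and k noised samples.
    """
    rng = random.Random(seed)
    batch = []
    for cond, action in zip(conditions, actions):  # level 1: conditions
        samples = []
        for _ in range(k):  # level 2: noised action samples per condition
            t = rng.randrange(1, num_steps + 1)
            # Hypothetical cosine-style schedule (an assumption, not from the paper).
            alpha_bar = math.cos(0.5 * math.pi * t / num_steps) ** 2
            noise = [rng.gauss(0.0, 1.0) for _ in action]
            noisy = [
                math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
                for a, e in zip(action, noise)
            ]
            samples.append({"t": t, "noisy_action": noisy, "noise": noise})
        batch.append({"condition": cond, "samples": samples})
    return batch
```

With `B` conditions and `k` samples each, the expensive condition encoder would run `B` times per step while the denoising loss is computed over `B * k` action samples; under conventional one-to-one sampling, the same number of loss terms would require `B * k` condition encodings.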

Key results

Achieves 95.4% of 3D Diffuser Actor performance on RLBench
Reduces training time to 13 hours on a single RTX 4090
Cuts memory consumption to 7% of baseline methods
Preserves multimodal action generation in real-world experiments

Why it matters

Lowers computational barriers for training generalist robotic policies, enabling rapid iteration and deployment on consumer-grade hardware.

Abstract

We present a method that reduces, by an order of magnitude, the time and memory needed to train multi-task vision-language robotic diffusion policies. This improvement arises from a previously underexplored distinction between action diffusion and the image diffusion techniques that inspired it: in image generation, the target is high-dimensional. By contrast, in action generation, the dimensionality of the target is comparatively small, and only the image condition is high-dimensional. Our approach, Mini Diffuser, exploits this asymmetry by introducing two-level minibatching, which pairs multiple noised action samples with each vision-language condition, instead of the conventional one-to-one sampling strategy. To support this batching scheme, we introduce architectural adaptations to the diffusion transformer that prevent information leakage across samples while maintaining full conditioning access. In RLBench simulations, Mini-Diffuser achieves 95% of the performance of state-of-the-art multi-task diffusion policies, while using only 5% of the training time and 7% of the memory. Real-world experiments further validate that Mini-Diffuser preserves the key strengths of diffusion-based policies, including the ability to model multimodal action distributions and produce behavior conditioned on diverse perceptual inputs. Code available at mini-diffuse-actor.github.io along with videos and training logs.
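One way to picture "preventing information leakage across samples while maintaining full conditioning access" is an attention mask over a sequence that holds the shared condition tokens followed by several noised action samples. The sketch below is an assumption about how such a mask could look, not the architecture described in the paper; the name `build_sample_mask` and the token layout are hypothetical.

```python
def build_sample_mask(num_cond, num_samples, tokens_per_sample=1):
    """Return a boolean matrix where mask[q][k] is True if query q may attend to key k.

    Assumed token layout: [condition tokens | sample 0 | sample 1 | ...].
    """
    total = num_cond + num_samples * tokens_per_sample

    def sample_id(i):
        # Condition tokens get id -1; action tokens get the index of their sample.
        return -1 if i < num_cond else (i - num_cond) // tokens_per_sample

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if sample_id(k) == -1:
                # Every token keeps full access to the shared condition.
                mask[q][k] = True
            else:
                # Action tokens are visible only within their own sample, so
                # samples cannot see each other and condition tokens stay
                # independent of any noised action.
                mask[q][k] = sample_id(q) == sample_id(k)
    return mask
```

For example, `build_sample_mask(3, 2, 2)` describes a sequence of 3 condition tokens and 2 samples of 2 tokens each: token 3 (first sample) may attend to tokens 0 to 4 but not to tokens 5 and 6, which belong to the second sample.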

Index terms

Deep Learning in Grasping and Manipulation; Learning from Demonstration; Imitation Learning