Mini Diffuser: Fast Multi-Task Diffusion Policy Training Using Two-Level Mini-Batches
Yutong Hu, Pinhao Song, Kehan Wen, Renaud Detry
AI summary
Problem
Training generalist robotic diffusion policies is computationally prohibitive due to the repeated processing of high-dimensional visual and language conditions for each action sample.
Approach
The method exploits the dimensionality asymmetry between conditions and actions by reusing shared conditions across multiple noised action samples via two-level mini-batching, supported by a masked-attention transformer architecture.
Key results
- Achieves 95.4% of 3D Diffuser Actor performance on RLBench
- Reduces training time to 13 hours on a single RTX 4090
- Cuts memory consumption to 7% of what baseline methods require
- Preserves multimodal action generation in real-world experiments
Why it matters
Lowers computational barriers for training generalist robotic policies, enabling rapid iteration and deployment on consumer-grade hardware.
Abstract
We present a method that reduces, by an order of magnitude, the time and memory needed to train multi-task vision-language robotic diffusion policies. This improvement arises from a previously underexplored distinction between action diffusion and the image diffusion techniques that inspired it: in image generation, the target is high-dimensional. By contrast, in action generation, the dimensionality of the target is comparatively small, and only the image condition is high-dimensional. Our approach, Mini-Diffuser, exploits this asymmetry by introducing two-level minibatching, which pairs multiple noised action samples with each vision-language condition, instead of the conventional one-to-one sampling strategy. To support this batching scheme, we introduce architectural adaptations to the diffusion transformer that prevent information leakage across samples while maintaining full conditioning access. In RLBench simulations, Mini-Diffuser achieves 95% of the performance of state-of-the-art multi-task diffusion policies, while using only 5% of the training time and 7% of the memory. Real-world experiments further validate that Mini-Diffuser preserves the key strengths of diffusion-based policies, including the ability to model multimodal action distributions and produce behavior conditioned on diverse perceptual inputs. Code is available at mini-diffuse-actor.github.io, along with videos and training logs.
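To make the two-level minibatching idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy noise schedule, a token layout of condition tokens followed by one token per noised action sample, and one plausible mask rule: every token may read the condition tokens, while each action sample may otherwise read only itself. The function names `noise_actions`, `build_attention_mask`, and `encoder_passes` are hypothetical and chosen for illustration only.

```python
import math
import random


def noise_actions(action, num_samples, rng):
    """Level 2: draw several noised copies of ONE ground-truth action.

    All copies share the same vision-language condition, so the expensive
    condition only needs to be processed once for the whole group.
    The schedule below is an assumed toy interpolation, not the paper's.
    """
    samples = []
    for _ in range(num_samples):
        t = rng.random()  # hypothetical noise level in [0, 1)
        eps = [rng.gauss(0.0, 1.0) for _ in action]
        noised = [
            math.sqrt(1.0 - t) * a + math.sqrt(t) * e
            for a, e in zip(action, eps)
        ]
        samples.append({"t": t, "noised": noised, "eps": eps})
    return samples


def build_attention_mask(num_cond_tokens, num_samples):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token layout: [cond_0 .. cond_{C-1}, sample_0 .. sample_{K-1}].
    """
    n = num_cond_tokens + num_samples
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_cond_tokens:
                # Full conditioning access: every token can read the condition.
                mask[q][k] = True
            elif q >= num_cond_tokens and q == k:
                # No leakage: an action sample sees itself, never its siblings.
                mask[q][k] = True
    return mask


def encoder_passes(num_conditions, samples_per_condition, two_level):
    """Count condition-processing passes needed for one training batch."""
    if two_level:
        return num_conditions
    return num_conditions * samples_per_condition


rng = random.Random(0)
action = [0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 1.0]  # toy 7-D action (assumed size)
group = noise_actions(action, num_samples=4, rng=rng)
mask = build_attention_mask(num_cond_tokens=3, num_samples=4)
```

In this sketch, condition tokens never attend to action tokens, so their representation does not depend on which noised samples are attached; that independence is what allows one condition to serve many samples. With 64 conditions and 4 samples each, `encoder_passes` returns 64 under two-level batching and 256 under one-to-one sampling.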