ICRA 2026

Multimodal Variational DeepMDP: An Efficient Approach for Industrial Assembly in High-Mix, Low-Volume Production

Grzegorz Bartyzel

AI summary

MVDeepMDP fuses multimodal sensor data via a generalized Product-of-Experts to enable robust, sample-efficient robotic insertion across diverse parts and layouts without retraining.

Reinforcement Learning, Multimodal Representation, Industrial Assembly, Transferability, DeepMDP, Robotics

Problem

High-mix, low-volume manufacturing requires robots to rapidly adapt to new components and layouts, but existing reinforcement learning methods lack the transferability and sample efficiency needed for contact-rich assembly tasks.

Approach

The method learns separate latent dynamic representations for each sensor modality (vision, pose, force/torque) and combines them using a weighted generalized Product-of-Experts mechanism to create a unified state representation for reinforcement learning.
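
The fusion step described above can be illustrated with a minimal sketch. It assumes each modality encoder outputs a diagonal Gaussian over a shared latent space and that the balancing weights are given; neither detail is stated in this summary, and the function name `gpoe_fuse`, the example values, and the weights are hypothetical. Under those assumptions, a weighted generalized Product-of-Experts has a closed form: the fused precision is the weighted sum of the expert precisions, and the fused mean is the precision-weighted average of the expert means.

```python
import numpy as np


def gpoe_fuse(means, variances, weights):
    """Fuse diagonal-Gaussian experts with a weighted generalized Product-of-Experts.

    means, variances: array-like of shape (n_experts, latent_dim)
    weights: array-like of shape (n_experts,) or (n_experts, latent_dim),
             non-negative, with at least one positive weight per dimension
    Returns the fused mean and variance, each of shape (latent_dim,).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.ndim == 1:
        # One scalar weight per expert: broadcast it over the latent dimensions.
        weights = weights[:, None]

    # Each expert contributes its precision (1 / variance) scaled by its weight.
    weighted_precisions = weights / variances
    fused_precision = weighted_precisions.sum(axis=0)

    fused_var = 1.0 / fused_precision
    fused_mean = fused_var * (weighted_precisions * means).sum(axis=0)
    return fused_mean, fused_var


# Hypothetical per-modality posteriors (vision, pose, force/torque), latent_dim = 2.
mu = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
var = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
alpha = [0.5, 0.25, 0.25]  # assumed balancing weights, not taken from the paper

fused_mu, fused_var = gpoe_fuse(mu, var, alpha)
print(fused_mu, fused_var)
```

A weight of zero removes a modality from the fused estimate, while a larger weight or a smaller variance pulls the fused mean toward that modality. This is one plausible reading of how such a mechanism could balance modality confidence; how the weights are actually obtained is not specified in this summary.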

Key results

Generalized Product-of-Experts effectively balances modality confidence for better task-relevant feature extraction
Per-modality dynamics and reward prediction significantly improves policy transferability across unseen parts and layouts
Independent processing of sensor modalities yields more informative latent states than direct concatenation
Successfully generalizes to diverse electronic components and 3D-printed blocks under background disturbances

Why it matters

Reduces production downtime and retooling costs by enabling flexible, data-efficient robotic assembly for customized manufacturing lines.

Abstract

Transferability, along with sample efficiency, is a critical factor for a reinforcement learning (RL) agent’s successful application in real-world contact-rich manipulation tasks, such as product assembly. For instance, in the case of the industrial insertion task on high-mix, low-volume (HMLV) production lines, transferability could eliminate the need for machine retooling, thus reducing production line downtime. In our work, we introduce a method called Multimodal Variational DeepMDP (MVDeepMDP) that demonstrates the ability to generalize to various environmental variations not encountered during training. The key feature of our approach is learning a multimodal latent dynamic representation. We demonstrate the effectiveness of our method on an electronic parts insertion task, which is challenging for RL agents due to the diverse physical properties of the non-standardized components, as well as on the insertion of simple 3D-printed blocks. Furthermore, we evaluate the transferability of MVDeepMDP and analyze the impact of the balancing mechanism of the generalized Product-of-Experts, which is used to combine observable modalities. Finally, we explore the influence of separately processing state modalities of different physical quantities, such as pose and 6D force/torque (F/T) data.

Index terms

Reinforcement Learning, Assembly, Representation Learning