ICRA 2026

UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception

Xinrui Zu, Kevin Sebastian Luck, Shujian Yu

AI summary

UniOMA closes the structural alignment gap in multimodal learning by using a Gromov-Wasserstein barycenter regularizer, consistently improving downstream robotic tasks across 3+ modalities.

Multimodal alignment, Gromov-Wasserstein, Robot perception, Structural alignment, Optimal transport, Contrastive learning

Problem

Contrastive objectives align multimodal representations at the instance level but fail to preserve intra-modal geometric structures, creating a structural alignment gap that hinders performance in robotics where trajectories, contacts, and physical constraints matter.
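To make the instance-level nature of contrastive objectives concrete, here is a minimal pure-Python sketch of a one-directional InfoNCE loss (the objective named in the abstract). The function name `info_nce`, the `temperature` default, and the toy embeddings are illustrative assumptions, not taken from the paper. Note that the loss only reads cross-modal similarities between `xs` and `ys`; it never looks at distances between samples within one modality, which is where a structural gap could arise.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(xs, ys, temperature=0.1):
    """One-directional InfoNCE: xs[i] should match ys[i] among all ys.

    Hypothetical helper for illustration; temperature is an assumed value.
    """
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    total = 0.0
    for i, x in enumerate(xs):
        # Cross-modal cosine similarities, scaled by the temperature.
        logits = [dot(x, y) / temperature for y in ys]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(xs)

# Toy embeddings for two modalities (made-up values).
modality_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shuffled_b = [aligned_b[1], aligned_b[2], aligned_b[0]]

print(info_nce(modality_a, aligned_b))   # small: each pair matches
print(info_nce(modality_a, shuffled_b))  # large: pairs are mismatched
```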

Approach

UniOMA augments contrastive learning with a Gromov-Wasserstein barycenter regularizer that computes a shared structural consensus and aligns each modality's embedding geometry to it, scaling linearly to three or more modalities.
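The structural term described above can be sketched in a heavily simplified form. This is not the paper's implementation: it assumes the samples are paired across modalities, so every Gromov-Wasserstein coupling is fixed to the identity matching instead of being solved for. Under that assumption and a squared loss, the barycenter reduces to a weighted average of the intra-modal distance matrices, and each modality's discrepancy reduces to a mean squared difference against it. All function names are hypothetical, and the weights are fixed here, whereas the summary says they are learned.

```python
import math

def pairwise_distances(embeddings):
    """Intra-modal Euclidean distance matrix for one modality."""
    return [[math.dist(a, b) for b in embeddings] for a in embeddings]

def weighted_consensus(matrices, weights):
    """Barycenter of distance matrices under fixed identity couplings
    (an assumption of this sketch): a weighted elementwise average."""
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]

def structural_gap(matrix, consensus):
    """Squared-loss GW objective evaluated at the identity coupling."""
    n = len(matrix)
    return sum(
        (matrix[i][j] - consensus[i][j]) ** 2
        for i in range(n) for j in range(n)
    ) / (n * n)

def structural_regularizer(modalities, weights):
    """One alignment term per modality, each against the shared consensus."""
    matrices = [pairwise_distances(e) for e in modalities]
    consensus = weighted_consensus(matrices, weights)
    return sum(w * structural_gap(m, consensus) for w, m in zip(weights, matrices))

# Toy data (made-up): `force` is a rotated, shifted copy of `vision`,
# so its geometry is identical; `audio` stretches one axis.
vision = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
force = [[5.0, 5.0], [5.0, 6.0], [4.0, 5.0]]
audio = [[0.0, 0.0], [3.0, 0.0], [0.0, 1.0]]

same_geometry = structural_regularizer([vision, force], [0.5, 0.5])
distorted = structural_regularizer([vision, audio], [0.5, 0.5])
print(same_geometry, distorted)
```

Because only distance matrices are compared, the modalities do not need to share an embedding dimension. A full GW solver would additionally optimize the couplings, so the values here are upper bounds on the true GW discrepancies.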

Key results

Consistent performance gains across five robotic benchmarks
Linear scaling to 3+ modalities versus quadratic pairwise complexity
Plug-and-play GW regularizer boosts existing contrastive baselines
Learned modality weights provide interpretable per-dataset salience diagnostics
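The linear-versus-quadratic claim in the list above comes down to counting alignment terms: aligning every pair of modalities needs one term per pair, while aligning each modality to a shared barycenter needs one term per modality. The short sketch below only counts terms, with hypothetical function names; it says nothing about the cost of each individual term.

```python
def pairwise_terms(num_modalities):
    """Number of terms when every pair of modalities is aligned directly."""
    return num_modalities * (num_modalities - 1) // 2

def barycenter_terms(num_modalities):
    """Number of terms when each modality is aligned to one shared consensus."""
    return num_modalities

for k in (3, 6, 12):
    print(k, pairwise_terms(k), barycenter_terms(k))
```

With three modalities the two counts coincide; the gap only opens up beyond that, for example 15 pairwise terms versus 6 barycenter terms for six modalities.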

Why it matters

It enables scalable, structure-preserving multimodal alignment that directly improves perception and control for contact-rich robotic systems.

Abstract

Contrastive objectives such as InfoNCE align multimodal representations at the instance level but do not preserve intra-modal geometry, a shortcoming referred to as the structural alignment gap. We propose UniOMA, a multimodal structural alignment method that uses a Gromov–Wasserstein (GW) barycenter regularizer to align each modality to a shared structural consensus and scales linearly to three or more modalities. Experiments on five robotic benchmarks spanning vision, force, depth, audio, tactile, and proprioceptive modalities show consistent improvements on downstream tasks such as regression, classification, and cross-modal retrieval.

Index terms

Representation Learning, Sensor Fusion, Perception-Action Coupling