UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception
Xinrui Zu, Kevin Sebastian Luck, Shujian Yu
AI summary
Problem
Contrastive objectives align multimodal representations at the instance level but fail to preserve intra-modal geometric structures, creating a structural alignment gap that hinders performance in robotics where trajectories, contacts, and physical constraints matter.
Approach
UniOMA augments contrastive learning with a Gromov-Wasserstein barycenter regularizer that computes a shared structural consensus and aligns each modality's embedding geometry to it, scaling linearly to three or more modalities.
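To make the idea of aligning each modality to a shared structural consensus concrete, the following is a minimal NumPy sketch and not the authors' implementation. It rests on a strong simplifying assumption: the coupling between samples is fixed to the identity (sample i in one modality corresponds to sample i in every other). Under that assumption the squared-loss GW barycenter reduces to a weighted mean of the per-modality distance matrices. The function names `pairwise_sq_dists` and `structural_regularizer` are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(z):
    """Squared Euclidean distance matrix between the rows of z, shape (n, n)."""
    sq = np.sum(z * z, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off

def structural_regularizer(embeddings, weights):
    """Weighted discrepancy between each modality's geometry and a consensus.

    embeddings: list of K arrays, each (n, d_k); row i is the same sample
                in every modality (embedding sizes d_k may differ).
    weights:    K non-negative modality weights (normalised internally).

    ASSUMPTION: identity coupling between samples, so the squared-loss
    GW barycenter is the weighted mean of the distance matrices.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mats = [pairwise_sq_dists(z) for z in embeddings]
    consensus = sum(wk * dk for wk, dk in zip(w, mats))
    n = consensus.shape[0]
    # One term per modality: K comparisons instead of K*(K-1)/2 pairwise ones.
    per_modality = np.array(
        [np.sum((dk - consensus) ** 2) / n**2 for dk in mats]
    )
    return float(w @ per_modality), per_modality, consensus
```

Because each modality is compared only with the consensus, the number of discrepancy terms grows with the number of modalities K rather than with the number of modality pairs. A full GW formulation would also optimise the couplings instead of fixing them.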
Key results
- Consistent performance gains across five robotic benchmarks
- Linear scaling to 3+ modalities versus quadratic pairwise complexity
- Plug-and-play GW regularizer boosts existing contrastive baselines
- Learned modality weights provide interpretable per-dataset salience diagnostics
Why it matters
It enables scalable, structure-preserving multimodal alignment that directly improves perception and control for contact-rich robotic systems.
Abstract
Contrastive objectives such as InfoNCE align multimodal representations at the instance level but do not preserve intra-modal geometry, a shortcoming referred to as the structural alignment gap. We propose UniOMA, a multimodal structural alignment method that uses a Gromov–Wasserstein (GW) barycenter regularizer to align each modality to a shared structural consensus, scaling linearly to 3+ modalities. Experiments on five robotic benchmarks covering vision, force, depth, audio, tactile, and proprioceptive data show consistent improvements on downstream tasks such as regression, classification, and cross-modal retrieval.
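For readers unfamiliar with InfoNCE, here is a minimal NumPy sketch of the instance-level objective for two paired modalities. The symmetric form, the cosine normalisation, and the temperature value of 0.1 are assumptions chosen for illustration; the abstract does not specify them. The function name `info_nce` is hypothetical.

```python
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE loss for paired embeddings, za[i] <-> zb[i].

    za, zb: arrays of shape (n, d) from two modalities.
    temperature: softmax temperature (illustrative value, an assumption).
    """
    # Cosine similarity: L2-normalise each row before the dot product.
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = (za @ zb.T) / temperature  # (n, n); diagonal holds positives

    def cross_entropy_diag(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # The target for row i is column i (the matching sample).
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

The loss depends only on cross-modal similarities between `za[i]` and `zb[j]`. Distances among the rows of `za` alone, or of `zb` alone, never enter it directly, which is the sense in which the objective operates at the instance level.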