ICRA 2026

M2R2: MultiModal Robotic Representation for Temporal Action Segmentation

Daniel Sliwowski, Dongheui Lee

AI summary

M2R2 decouples multimodal feature extraction from segmentation models, achieving state-of-the-art robotic action segmentation while enabling feature reuse across diverse architectures.

Temporal Action Segmentation, Multimodal Learning, Robotic Perception, Feature Extraction, Transformer Fusion, Surgical Robotics

Problem

Existing robotic temporal action segmentation methods either fuse modalities inside the segmentation model, which makes the learned features hard to reuse across models, or rely on pretrained vision-only feature extractors, which struggle when objects are only partially visible.

Approach

M2R2 uses a late-fusion transformer that processes proprioceptive and exteroceptive sensor data independently and then combines them into reusable features. It is trained with a novel objective that aligns temporal windows with action descriptions and includes boundary regression.
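To make the window-level training signal more concrete, here is a minimal sketch of how per-window targets could be derived from frame-level labels. It is an illustration under stated assumptions, not the authors' implementation: the function name `window_targets`, the majority-vote label, and the normalised offset to the nearest boundary are all hypothetical choices, and the summary does not specify the actual loss or target definition.

```python
from collections import Counter


def window_targets(frame_labels, window_size, stride):
    """Return (start, majority_label, boundary_offset) for each sliding window.

    Hypothetical target definition, for illustration only:
    - majority_label: most frequent frame label inside the window
      (ties go to the label seen first in the window).
    - boundary_offset: signed distance from the window centre to the nearest
      action boundary, divided by window_size; None if there is no boundary.
    """
    # A boundary sits at index i when the label changes between frames i-1 and i.
    boundaries = [
        i for i in range(1, len(frame_labels))
        if frame_labels[i] != frame_labels[i - 1]
    ]
    targets = []
    for start in range(0, len(frame_labels) - window_size + 1, stride):
        window = frame_labels[start:start + window_size]
        majority = Counter(window).most_common(1)[0][0]
        centre = start + window_size / 2
        if boundaries:
            nearest = min(boundaries, key=lambda b: abs(b - centre))
            offset = (nearest - centre) / window_size
        else:
            offset = None
        targets.append((start, majority, offset))
    return targets


# Toy sequence with made-up action names: 16 frames, boundaries at 5 and 11.
labels = ["reach"] * 5 + ["grasp"] * 6 + ["insert"] * 5
targets = window_targets(labels, window_size=4, stride=4)
for start, label, offset in targets:
    print(start, label, offset)
```

In a setup like this, the label would serve as the text a window is aligned with, and the offset would be the regression target; a real pipeline would compute both from synchronised sensor streams rather than a toy label list.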

Key results

Achieves state-of-the-art performance on REASSEMBLE, (Im)PerfectPour, and JIGSAWS datasets
Enables seamless feature reuse across multiple state-of-the-art temporal action segmentation models
Demonstrates strong cross-domain generalization across different robotic embodiments and tasks
Quantifies modality contributions through extensive ablation studies on sensor fusion impact

Why it matters

Provides a flexible, high-performance foundation for robotic perception that overcomes the rigidity of end-to-end models and the fragility of vision-only approaches in complex manipulation and surgical environments.

Abstract

Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on three robotic datasets: REASSEMBLE, (Im)PerfectPour, and JIGSAWS. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.

Index terms

Representation Learning, Deep Learning Methods, Sensor Fusion