ICRA 2026

Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities

Dustin Carrión-Ojeda, Maria Santos-Villafranca, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, Simone Schaub-Meyer

AI summary

KARMMA distills a large multimodal teacher into a lightweight student that maintains high accuracy and robustness even when sensor inputs are missing.

Multimodal learning, Knowledge distillation, Egocentric action recognition, Missing modalities, Robotics, Edge deployment

Problem

Existing multimodal egocentric action recognition models assume all modalities are available at inference, causing significant accuracy drops or failure when inputs are missing. They also typically require modality-aligned training data and are too computationally heavy for on-robot deployment.

Approach

The authors introduce a multimodal-to-multimodal knowledge distillation framework that uses modality dropout and learnable tokens to train a flexible student capable of processing any subset of available modalities without retraining or aligned data.
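As a rough illustration of how modality dropout and learnable tokens can work together, the sketch below randomly hides modalities during training and substitutes a per-modality placeholder vector for anything that is absent. It is a minimal sketch under stated assumptions, not the authors' implementation: the modality names, `FEATURE_DIM`, and the helpers `apply_modality_dropout` and `fill_missing` are all hypothetical, and the tokens are fixed constants here, whereas in a real model they would be trainable parameters.

```python
import random

# Hypothetical modality names and feature width, chosen only for illustration.
MODALITIES = ("rgb", "flow", "audio")
FEATURE_DIM = 4

# One placeholder vector per modality. In a real model these would be trainable
# parameters updated by the optimizer; here they are fixed constants.
missing_tokens = {
    name: [0.1 * (i + 1)] * FEATURE_DIM for i, name in enumerate(MODALITIES)
}


def apply_modality_dropout(features, drop_prob, rng):
    """Randomly hide available modalities, always keeping at least one."""
    present = [name for name, value in features.items() if value is not None]
    kept = [name for name in present if rng.random() >= drop_prob]
    if not kept and present:
        # Never drop everything: keep one randomly chosen available modality.
        kept = [rng.choice(present)]
    return {name: (features[name] if name in kept else None) for name in features}


def fill_missing(features):
    """Substitute the per-modality token wherever a modality is absent."""
    filled = {}
    for name in MODALITIES:
        value = features.get(name)
        filled[name] = value if value is not None else missing_tokens[name]
    return filled


# Example: audio is unavailable for this sample, and dropout may hide more.
rng = random.Random(0)
sample = {"rgb": [1.0] * FEATURE_DIM, "flow": [2.0] * FEATURE_DIM, "audio": None}
student_input = fill_missing(apply_modality_dropout(sample, drop_prob=0.5, rng=rng))
```

Because every modality slot is always filled, either with real features or with its token, the downstream network sees the same input layout for any subset of sensors, which is what would let one student serve different sensor configurations without retraining.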

Key results

Novel multimodal-to-multimodal distillation framework eliminating the need for modality-aligned data
Lightweight student model requiring approximately 50% fewer computational resources than the teacher
Competitive accuracy on Epic-Kitchens and Something-Something datasets across all modality combinations
Substantial reduction in performance degradation under missing modality conditions compared to baselines

Why it matters

Enables reliable, efficient egocentric perception for real-world robotics and edge devices where sensor availability is unpredictable.

Abstract

Egocentric action recognition enables robots to facilitate human-robot interactions and monitor task progress. Existing methods often rely solely on RGB videos, although additional modalities, such as audio, can improve accuracy under challenging conditions. However, most multimodal approaches assume that all modalities are available at inference time, leading to significant accuracy drops, or even failure, when inputs are missing. To address this limitation, we introduce KARMMA, a multimodal Knowledge distillation framework for egocentric Action Recognition robust to Missing ModAlities that does not require modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that leverages all available modalities while remaining robust to missing ones, enabling deployment across diverse sensor configurations without retraining. Our student uses approximately 50% fewer computational resources than the teacher, resulting in a lightweight and fast model that is well suited for on-robot deployment. Experiments on Epic-Kitchens and Something-Something demonstrate that our student achieves competitive accuracy while significantly reducing performance degradation under missing modality conditions. Project page available at: https://visinf.github.io/KARMMA/
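For readers unfamiliar with teacher-to-student distillation, the following sketch shows a generic temperature-softened distillation loss of the kind commonly used in knowledge distillation. It is an assumption for illustration only: the abstract does not specify KARMMA's training objective, and the function names, the temperature value, and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes moderate logit ranges so that no student probability
    underflows to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Hypothetical logits over three action classes.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 0.8, -0.5]
loss = distillation_loss(student_logits, teacher_logits)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge. In practice it is usually combined with a standard classification loss on the ground-truth labels.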

Index terms

Recognition; Deep Learning for Visual Perception; Sensor Fusion