ICRA 2026

D-CAT: Decoupled Cross-Attention Knowledge Transfer between Sensor Modalities for Unimodal Inference

Leen Daher, Zhaobo Wang, Malcolm Mielle

AI summary

D-CAT enables accurate single-sensor inference by transferring knowledge from multi-modal training data through a novel cross-attention alignment loss.

Cross-modal transfer, unimodal inference, decoupled learning, human activity recognition, cross-attention loss, sensor fusion

Problem

Existing cross-modal transfer methods require paired sensor data at both training and inference, which limits their deployment in resource-constrained environments where a full sensor suite is economically or technically infeasible.

Approach

D-CAT aligns modality-specific feature spaces using a novel cross-attention loss while keeping classification pipelines decoupled, allowing knowledge transfer from a frozen source modality to a target modality without requiring joint sensor input at inference.
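The idea of adding an alignment term to an otherwise independent target classifier can be sketched in a few lines. This summary does not give the paper's exact loss, so everything below is an assumption made for illustration: the function names, the use of target features as queries and source features as keys and values, the mean-squared-error distance, and the `weight` parameter are all hypothetical. The sketch uses only the standard library so it runs anywhere; a real implementation would use a tensor library with automatic differentiation, and the authors' repository is the authoritative reference.

```python
import math

Matrix = list[list[float]]  # rows are tokens, columns are feature dimensions


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(target: Matrix, source: Matrix) -> Matrix:
    """Each target token attends over all source tokens.

    Assumption: queries come from the target modality, keys and values
    from the (frozen) source modality, with scaled dot-product scores.
    """
    dim = len(target[0])
    attended = []
    for query in target:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in source
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, source))
             for j in range(dim)]
        )
    return attended


def alignment_loss(target: Matrix, source: Matrix) -> float:
    """Hypothetical alignment term: mean squared distance between the
    target features and their source-attended counterparts."""
    attended = cross_attention(target, source)
    count = len(target) * len(target[0])
    return sum(
        (t - a) ** 2
        for t_row, a_row in zip(target, attended)
        for t, a in zip(t_row, a_row)
    ) / count


def cross_entropy(logits: list[float], label: int) -> float:
    """Classification loss computed from the target pipeline only."""
    return -math.log(softmax(logits)[label])


def target_training_loss(target_feats: Matrix, source_feats: Matrix,
                         target_logits: list[float], label: int,
                         weight: float = 1.0) -> float:
    """Training objective for the target model. The source features are
    treated as constants, and the source classifier is never used."""
    return (cross_entropy(target_logits, label)
            + weight * alignment_loss(target_feats, source_feats))
```

In this sketch the source modality appears only inside `alignment_loss`, which is a training-time term. At inference, prediction would depend on `target_logits` alone, which is what allows the source sensor to be absent at deployment.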

Key results

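Up to +10% F1-score gain over unimodal training when transferring from a high-performing modality to a weaker one (e.g., video to IMU) in in-distribution scenarios
Even weaker source modalities (e.g., IMU to video) improve target performance in out-of-distribution settings, provided the target model is not overfitted to the training data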
Accurate single-sensor inference across IMU, video, and audio datasets
Generalizable framework that reduces hardware redundancy for scalable perception
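To make the F1-score comparison concrete, the following sketch shows how a gain between a unimodal baseline and a transfer-trained model could be measured. It assumes macro-averaged F1 and an absolute difference between the two scores; the summary states neither choice. The labels and predictions are made-up toy values, not results from the paper.

```python
def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy activity labels and predictions (illustrative only)
y_true = [0, 0, 1, 1, 2, 2]
unimodal_pred = [0, 1, 1, 1, 2, 0]   # hypothetical baseline output
transfer_pred = [0, 0, 1, 1, 2, 0]   # hypothetical output after transfer

gain = macro_f1(y_true, transfer_pred) - macro_f1(y_true, unimodal_pred)
print(f"absolute macro-F1 gain: {gain:+.3f}")
```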

Why it matters

It enables scalable, cost-effective robotic perception systems that leverage rich multi-modal training data while operating with minimal hardware during real-world deployment.

Abstract

Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically or technically viable. To address this, we propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring both sensor modalities during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of the sensors’ feature spaces without coupling the classification pipelines of the two modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to +10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model is not overfitted to the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at https://github.com/Schindler-EPFL-Lab/D-CAT.

Index terms

Multi-Modal Perception for HRI; Gesture, Posture and Facial Expressions; Sensor Fusion