ICRA 2026

TUNI: Real-Time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion

Xiaodong Guo, TONG LIU, Yike Li, Zi'ang Lin, Zhihong Deng

AI summary

TUNI unifies feature extraction and cross-modal fusion in a compact encoder with RGB-T pre-training, achieving segmentation accuracy competitive with state-of-the-art models and real-time speed on edge devices.

RGB-thermal segmentation, real-time inference, cross-modal fusion, edge deployment, pseudo-thermal pre-training, adaptive cosine similarity

Problem

Existing RGB-thermal segmentation models rely on separate encoders and fusion modules, causing suboptimal thermal feature extraction, insufficient cross-modal integration, and high computational redundancy that prevents real-time deployment on edge devices.

Approach

TUNI employs a unified encoder with stacked blocks that simultaneously extract features and fuse cross-modal information, pre-trained on RGB and pseudo-thermal data, and enhanced by an RGB-T local module using adaptive cosine similarity for precise local fusion.
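
As a rough illustration of how cosine similarity can steer local fusion, the sketch below gates the sum of an RGB and a thermal feature vector at one spatial location by how well the two agree. It is a minimal sketch under stated assumptions, not the paper's RGB-T local module: the function names and the scalars `w_consistent` and `w_distinct` are hypothetical stand-ins for whatever learned, adaptive weighting TUNI actually uses.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def fuse_local(rgb_feat, thermal_feat, w_consistent=1.0, w_distinct=1.0):
    """Fuse the RGB and thermal feature vectors of one spatial location.

    w_consistent and w_distinct are hypothetical scalars (assumptions made
    for this sketch); in a trained model they would be learned.
    """
    sim = cosine_similarity(rgb_feat, thermal_feat)  # in [-1, 1]
    agree = (1.0 + sim) / 2.0   # close to 1 when the modalities agree
    differ = 1.0 - agree        # close to 1 when they point in opposite directions
    gate = w_consistent * agree + w_distinct * differ
    return [gate * (r + t) for r, t in zip(rgb_feat, thermal_feat)]


# Identical features: the consistent weight dominates the gate.
same = fuse_local([1.0, 0.0], [1.0, 0.0], w_consistent=2.0, w_distinct=0.5)
# Orthogonal features: both weights contribute equally.
orthogonal = fuse_local([1.0, 0.0], [0.0, 1.0], w_consistent=2.0, w_distinct=0.5)
```

With both weights set to 1.0 the gate is always 1 and the function reduces to plain element-wise addition, which makes the effect of the similarity term easy to isolate.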

Key results

Competitive mIoU on FMB, PST900, and CART with drastically reduced parameters and FLOPs
10.63M parameters and 17.16G FLOPs, cutting complexity by up to 9× compared to SOTA
Real-time inference at 120 FPS on RTX 4090 and 27 FPS on Jetson Orin NX
RGB-T pre-training and adaptive cosine similarity module consistently boost segmentation accuracy
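
To put the reported frame rates in terms of a per-frame time budget, the short calculation below converts FPS to latency. It uses only the two FPS figures quoted above; the helper name `frame_latency_ms` is introduced here for illustration.

```python
def frame_latency_ms(fps):
    """Average time per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


for device, fps in [("RTX 4090", 120), ("Jetson Orin NX", 27)]:
    print(f"{device}: {frame_latency_ms(fps):.1f} ms per frame")
# RTX 4090: 8.3 ms per frame
# Jetson Orin NX: 37.0 ms per frame
```

At 27 FPS the Jetson Orin NX spends about 37 ms on each frame, so the model keeps pace with a camera delivering up to 27 frames per second.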

Why it matters

Enables robust, real-time environmental perception for resource-constrained autonomous robots and drones operating in challenging lighting and weather conditions.

Abstract

RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model’s real-time efficiency. To address these issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder’s capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Code is available at https://github.com/xiaodonguo/TUNI.

Index terms

Deep Learning for Visual Perception; Sensor Fusion; Semantic Scene Understanding