TUNI: Real-Time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion
Xiaodong Guo, Tong Liu, Yike Li, Zi'ang Lin, Zhihong Deng
AI summary
Problem
Existing RGB-thermal segmentation models rely on separate encoders and fusion modules, causing suboptimal thermal feature extraction, insufficient cross-modal integration, and high computational redundancy that prevents real-time deployment on edge devices.
Approach
TUNI employs a unified encoder with stacked blocks that simultaneously extract features and fuse cross-modal information, pre-trained on RGB and pseudo-thermal data, and enhanced by an RGB-T local module using adaptive cosine similarity for precise local fusion.
Key results
- Competitive mIoU on FMB, PST900, and CART with drastically reduced parameters and FLOPs
- 10.63M parameters and 17.16G FLOPs, up to 9× lower complexity than state-of-the-art models
- Real-time inference at 120 FPS on RTX 4090 and 27 FPS on Jetson Orin NX
- RGB-T pre-training and adaptive cosine similarity module consistently boost segmentation accuracy
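To put the reported frame rates in terms of a per-frame time budget, the short sketch below converts FPS to latency in milliseconds. Only the two FPS figures come from the results above; the helper name `frame_latency_ms` and the dictionary layout are illustrative assumptions.

```python
def frame_latency_ms(fps: float) -> float:
    """Convert a throughput in frames per second to per-frame latency in ms."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# FPS figures as reported in the key results; the dict itself is illustrative.
reported_fps = {"RTX 4090": 120.0, "Jetson Orin NX": 27.0}

for device, fps in reported_fps.items():
    # 120 FPS is about 8.3 ms per frame; 27 FPS is about 37.0 ms per frame.
    print(f"{device}: {frame_latency_ms(fps):.1f} ms per frame")
```

A budget of roughly 37 ms per frame on the Jetson Orin NX leaves little room for additional per-frame processing, which is why parameter and FLOP counts matter for edge deployment.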
Why it matters
Enables robust, real-time environmental perception for resource-constrained autonomous robots and drones operating in challenging lighting and weather conditions.
Abstract
RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model’s real-time efficiency. To address these issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder’s capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Code is available at https://github.com/xiaodonguo/TUNI.
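The abstract does not spell out how the adaptive cosine similarity is computed, so the following is only a minimal sketch of the general idea, not the authors' module. It assumes a per-pixel cosine similarity between RGB and thermal feature vectors, plus two scalars `alpha` and `beta` (hypothetical stand-ins for learnable parameters) that weight consistent and distinct features respectively. The function names and the gating formula are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors (eps avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def fuse_pixel(rgb: Sequence[float], thermal: Sequence[float],
               alpha: float = 1.0, beta: float = 1.0) -> List[float]:
    """Hypothetical gated fusion of one pixel's RGB and thermal features.

    ASSUMPTION: this gating rule is illustrative and not taken from the paper.
    alpha scales the contribution where the modalities agree (similarity near 1);
    beta scales it where they disagree (similarity near -1).
    """
    sim = cosine_similarity(rgb, thermal)
    consistent = (1.0 + sim) / 2.0  # in [0, 1], high when features align
    distinct = (1.0 - sim) / 2.0    # in [0, 1], high when features oppose
    gate = alpha * consistent + beta * distinct
    return [r + gate * t for r, t in zip(rgb, thermal)]


def fuse_map(rgb_map, thermal_map, alpha: float = 1.0, beta: float = 1.0):
    """Apply fuse_pixel at every spatial location of two H x W x C nested lists."""
    return [
        [fuse_pixel(r, t, alpha, beta) for r, t in zip(rgb_row, thermal_row)]
        for rgb_row, thermal_row in zip(rgb_map, thermal_map)
    ]
```

In this sketch, setting `alpha` and `beta` to different values is what makes the weighting adaptive: with `alpha > beta` the thermal signal is admitted mainly where it agrees with RGB, and with `beta > alpha` mainly where it differs. The released code at the repository above is the authoritative reference for the actual module.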