ICRA 2026

Detection of EMU Components Based on Optical Flow Attention Prior and Multi-Modal RGBD RTDETR

Mingjun Cong, Gang Peng, Yongchang Tang, Chaowei Song, Chaoze Wang

AI summary

The proposed RTDETR-FAMC network significantly improves high-speed train component defect detection accuracy by fusing RGB-D data with optical flow-guided spatial priors.

EMU inspection, RGB-D detection, Optical flow attention, RTDETR, Multi-modal fusion, Defect detection

Problem

Manual inspection of high-speed rail EMU chassis is inefficient and error-prone due to complex backgrounds and diverse, compact components. Existing AI detectors struggle to accurately localize and classify these parts under varying conditions.

Approach

RTDETR-FAMC combines a dual-branch CSwin Transformer for RGB-D feature extraction with Sea-RAFT optical flow to generate dynamic spatial attention masks, enhanced by wavelet-based multi-scale fusion and channel-space attention modules.
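
The summary does not say how the flow field becomes an attention mask, so the following is only a minimal sketch of one plausible reading: the per-pixel flow magnitude is normalized to [0, 1] and used to reweight a feature map. The function names, the `floor` parameter, and the min-max normalization are illustrative assumptions, not the paper's method, and the flow field is supplied as a toy array rather than computed by Sea-RAFT.

```python
import numpy as np

def flow_to_attention_mask(flow, eps=1e-8):
    """Map a dense flow field of shape (H, W, 2) to a mask in [0, 1].

    Assumption: larger displacement means a more relevant region.
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # (H, W) per-pixel flow length
    lo, hi = magnitude.min(), magnitude.max()
    return (magnitude - lo) / (hi - lo + eps)  # min-max normalization

def apply_spatial_prior(features, mask, floor=0.5):
    """Reweight (C, H, W) features with an (H, W) mask.

    `floor` (hypothetical) keeps low-flow regions attenuated, not zeroed.
    """
    weights = floor + (1.0 - floor) * mask     # values in [floor, 1]
    return features * weights[None, :, :]      # broadcast over channels

# Toy flow: only the central 2x2 patch moves, by (3, 4) pixels.
flow = np.zeros((4, 4, 2))
flow[1:3, 1:3] = [3.0, 4.0]

mask = flow_to_attention_mask(flow)
features = np.ones((8, 4, 4))
weighted = apply_spatial_prior(features, mask)
```

With this toy input the central patch keeps nearly full weight while the static border is scaled by `floor`.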

Key results

Achieves 0.952 mAP50 on a custom high-resolution EMU chassis dataset
Outperforms YOLO series and standard RTDETR by at least 3% in mAP50
Reduces model parameters to 46.2M while maintaining high detection accuracy
Effectively localizes and classifies 34 distinct EMU component types across 28 camera positions
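
For readers unfamiliar with the metric, mAP50 averages per-class precision with a predicted box counted as correct when its intersection-over-union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below shows only that matching criterion, with made-up boxes. It is not the paper's evaluation code, and the helper names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match_at_50(pred_box, gt_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough for mAP50."""
    return iou(pred_box, gt_box) >= threshold

# Illustrative boxes only (not from the dataset).
ground_truth = (0.0, 0.0, 10.0, 10.0)
close_pred = (2.0, 0.0, 12.0, 10.0)   # overlap 80, union 120
far_pred = (6.0, 0.0, 16.0, 10.0)     # overlap 40, union 160

close_ok = is_match_at_50(close_pred, ground_truth)
far_ok = is_match_at_50(far_pred, ground_truth)
```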

Why it matters

Enables safer, faster, and more reliable automated maintenance for high-speed rail networks, reducing reliance on labor-intensive manual inspections.

Abstract

To address challenges in high-speed train inspection such as complex backgrounds, diverse component types, and compact dimensions, this paper proposes a defect detection method called RTDETR-FAMC (RTDETR with Optical Flow Attention and Multimodal CSwin Transformer). The approach integrates RGB images and depth data through a dual-branch CSwin Transformer backbone that exploits both visual and depth information. In parallel, an improved Sea-RAFT optical flow estimator is applied to standard images and test images to generate dynamic spatial prior attention that guides the network toward target regions. A Mask Feature Fusion (MFF) module jointly optimizes channel and spatial attention, while HWD wavelet transform downsampling and CSP-PAC multi-scale feature fusion modules enhance detection accuracy. Experimental results on a self-built high-speed rail EMU fine-grained scanning dataset (containing 3,881 high-resolution images) demonstrate significant accuracy improvements over mainstream detection algorithms. Compared with the YOLO series and standard RTDETR, the proposed approach improves mAP50 by at least 3%, validating its effectiveness as a reliable technical solution for intelligent EMU inspection.
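
As an illustration of wavelet-based downsampling, the sketch below applies a single-level 2D Haar transform that halves the spatial resolution and stacks the four sub-bands along the channel axis. It assumes that HWD refers to Haar wavelet downsampling, which the abstract does not spell out, and it omits any learned projection that would normally follow. The function name is hypothetical.

```python
import numpy as np

def haar_downsample(x):
    """Single-level orthonormal Haar transform on a (C, H, W) array.

    Returns (4C, H/2, W/2): one average band followed by three detail bands.
    H and W must be even.
    """
    a = x[:, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    average = (a + b + c + d) / 2.0
    col_diff = (a - b + c - d) / 2.0   # left minus right
    row_diff = (a + b - c - d) / 2.0   # top minus bottom
    diag_diff = (a - b - c + d) / 2.0  # diagonal difference
    return np.concatenate([average, col_diff, row_diff, diag_diff], axis=0)

x = np.arange(32, dtype=float).reshape(2, 4, 4)
y = haar_downsample(x)

flat = haar_downsample(np.ones((1, 4, 4)))
```

Unlike strided convolution or pooling, this transform is orthonormal, so the total energy of `x` is preserved in `y`; spatial detail moves into extra channels instead of being discarded. A constant input such as `flat` produces zeros in all three detail bands.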

Index terms

RGB-D Perception, Computer Vision for Transportation, Visual Learning