ICRA 2026

Detection of EMU Components Based on Optical Flow Attention Prior and Multi-Modal RGBD RTDETR

Mingjun Cong, Gang Peng, Yongchang Tang, Chaowei Song, Chaoze Wang

AI summary

Key figure (auto-extracted from paper)
The proposed RTDETR-FAMC network significantly improves high-speed train component defect detection accuracy by fusing RGB-D data with optical flow-guided spatial priors.
EMU inspection, RGB-D detection, Optical flow attention, RTDETR, Multi-modal fusion, Defect detection

Problem

Manual inspection of high-speed rail EMU chassis is inefficient and error-prone due to complex backgrounds and diverse, compact components. Existing AI detectors struggle to accurately localize and classify these parts under varying conditions.

Approach

RTDETR-FAMC combines a dual-branch CSwin Transformer for RGB-D feature extraction with Sea-RAFT optical flow to generate dynamic spatial attention masks, enhanced by wavelet-based multi-scale fusion and channel-space attention modules.
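The idea of turning optical flow into a spatial attention prior can be illustrated with a small sketch. It is not the paper's implementation. It assumes a dense flow field is already available (for example from an estimator such as Sea-RAFT run between a standard image and a test image), that the flow magnitude is min-max normalized into a mask, and that the mask reweights a single-channel feature map residually. The function names `flow_attention_mask` and `apply_spatial_prior`, the nested-list data layout, and the `1 + mask` reweighting are all hypothetical choices made for illustration.

```python
import math


def flow_attention_mask(flow):
    """Turn a dense flow field into a spatial mask with values in [0, 1].

    flow: H x W nested list of (dx, dy) tuples (hypothetical layout).
    """
    # Per-pixel flow magnitude.
    mag = [[math.hypot(dx, dy) for dx, dy in row] for row in flow]
    lo = min(min(row) for row in mag)
    hi = max(max(row) for row in mag)
    span = hi - lo
    if span == 0:
        # No motion contrast anywhere: return an all-zero mask.
        return [[0.0 for _ in row] for row in mag]
    # Min-max normalization (an assumption, not taken from the paper).
    return [[(m - lo) / span for m in row] for row in mag]


def apply_spatial_prior(feature, mask):
    """Residual reweighting: feature * (1 + mask) at every location."""
    return [
        [f * (1.0 + a) for f, a in zip(f_row, m_row)]
        for f_row, m_row in zip(feature, mask)
    ]


# Toy 2 x 2 example: only the top-right pixel moves.
flow = [[(0.0, 0.0), (3.0, 4.0)],
        [(0.0, 0.0), (0.0, 0.0)]]
feature = [[1.0, 1.0],
           [1.0, 1.0]]

mask = flow_attention_mask(flow)
boosted = apply_spatial_prior(feature, mask)
```

In this toy case the location with the largest flow magnitude receives the largest weight, while static locations keep their original feature values. A residual form is used here so that regions with zero flow are not suppressed entirely.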

Key results

  • Achieves 0.952 mAP50 on a custom high-resolution EMU chassis dataset
  • Outperforms YOLO series and standard RTDETR by at least 3% in mAP50
  • Reduces model parameters to 46.2M while maintaining high detection accuracy
  • Effectively localizes and classifies 34 distinct EMU component types across 28 camera positions
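For readers unfamiliar with the metric, mAP50 counts a predicted box as a true positive when its intersection over union (IoU) with a ground-truth box of the same class is at least 0.5, and then averages the per-class average precision. The sketch below shows only the IoU test. The `(x1, y1, x2, y2)` box format and the helper names `iou` and `is_true_positive` are assumptions for illustration and are not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """IoU test used by mAP50 (class matching is assumed done elsewhere)."""
    return iou(pred, gt) >= threshold


ground_truth = (0, 0, 10, 10)
half_overlap = (0, 0, 10, 5)     # covers the top half of the ground truth
corner_overlap = (5, 5, 15, 15)  # shares only one quadrant

half_ok = is_true_positive(half_overlap, ground_truth)
corner_ok = is_true_positive(corner_overlap, ground_truth)
```

Here `half_overlap` reaches an IoU of exactly 0.5 and passes the test, while `corner_overlap` has an IoU of 25/175 and does not.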

Why it matters

Enables safer, faster, and more reliable automated maintenance for high-speed rail networks, reducing reliance on labor-intensive manual inspections.

Abstract

To address challenges in high-speed train inspection such as complex backgrounds, diverse component types, and compact dimensions, this paper proposes a defect detection method called RTDETR-FAMC (RTDETR with Optical Flow Attention and Multimodal CSwin Transformer). The approach integrates RGB images and depth data through a dual-branch CSwin Transformer backbone that uses both visual and depth information. In parallel, an improved Sea-RAFT optical flow estimator is used to generate dynamic spatial prior attention for standard images and test images, guiding the network to focus on target regions. A Mask Feature Fusion (MFF) module jointly optimizes channel and spatial attention, while HWD wavelet transform downsampling and CSP-PAC multi-scale feature fusion modules enhance detection accuracy. Experimental results on a self-built high-speed rail EMU fine-grained scanning dataset (containing 3,881 high-resolution images) demonstrate significant accuracy improvements over mainstream detection algorithms. Compared with the YOLO series and standard RTDETR methods, the proposed approach achieves at least a 3% improvement in the mAP50 metric, validating its effectiveness as a reliable technical solution for intelligent EMU inspection.
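As a rough illustration of wavelet-based downsampling, the sketch below applies a single-level 2D Haar transform to a single-channel image. It assumes that "HWD" refers to Haar wavelet downsampling, which the abstract does not spell out, and it omits the learned layers that a real module would apply after the transform. The function name `haar_downsample` and the sub-band keys are hypothetical.

```python
def haar_downsample(image):
    """Single-level 2D Haar transform of an H x W nested list.

    Returns four sub-bands, each of size H/2 x W/2. Spatial resolution
    is halved, but the four bands together keep all the information.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands = {"ll": [], "col_diff": [], "row_diff": [], "diag": []}
    for i in range(0, h, 2):
        rows = {key: [] for key in bands}
        for j in range(0, w, 2):
            # One 2 x 2 block: a b / c d
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            # Orthonormal Haar coefficients (scale factor 1/2).
            rows["ll"].append((a + b + c + d) / 2)        # local average
            rows["col_diff"].append((a - b + c - d) / 2)  # change across columns
            rows["row_diff"].append((a + b - c - d) / 2)  # change across rows
            rows["diag"].append((a - b - c + d) / 2)      # diagonal detail
        for key in bands:
            bands[key].append(rows[key])
    return bands


image = [[1, 2],
         [3, 4]]
bands = haar_downsample(image)
```

Unlike strided convolution or pooling, this transform is invertible: the sum of squared coefficients across the four bands equals the sum of squared pixel values in the input. In a detection network the four bands would typically be stacked along the channel axis and passed through a learned convolution, a step not shown here.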

Index terms

RGB-D Perception; Computer Vision for Transportation; Visual Learning
