IMH-MOT: Interactive Multi-Hierarchical Image and Point Cloud Fusion for Multi-Object Tracking
Wenyuan Qin, Zhiyan Zhou, Jiong Luo, Chengwei Pan, Hao XU, Xiwang Dong, Danwei Wang
AI summary
Problem
Single-modality tracking degrades under occlusion and sparsity, while existing fusion methods struggle with non-overlapping observations and rely on unreliable single-frame appearance features.
Approach
The framework aligns 2D and 3D detections to recover missing spatial data, fuses cross-modal motion cues, and encodes temporal appearance consistency via a transformer, all integrated through a multi-hierarchical data association strategy.
Key results
- 80.90% HOTA and 89.73% MOTA on the KITTI MOT benchmark
- 470 ID switches, outperforming state-of-the-art methods
- Effective spatial recovery for image-only targets via guided point cloud clustering
- Validated effectiveness of alignment and long-term appearance modules through ablation studies
Why it matters
Enables reliable object tracking for autonomous driving and surveillance in complex, real-world conditions where sensors frequently fail or provide incomplete data.
Abstract
Multi-object tracking (MOT) plays a critical role in applications such as autonomous driving and surveillance. Camera-based approaches offer rich texture features for object association, while LiDAR-based methods provide accurate geometric information for spatial reasoning. Although each modality addresses different challenges, their intrinsic discrepancies hinder effective cross-modal fusion and unified representation learning. To overcome these limitations, we propose IMH-MOT, an interactive multi-hierarchical MOT framework comprising three key modules. The Multi-modality Alignment Module (MMAM) enhances spatial representations by sampling and clustering instance-level point clouds. The Multi-modality Motion Estimation Module (MMEM) integrates motion cues from different modalities to build a unified motion model. To mitigate the impact of occlusion on single-frame appearance features, the Long-term Appearance Module (LAM) captures temporal appearance consistency by constructing a long-term appearance embedding. Guided by modality-aware cues from MMAM, MMEM generates reliable spatial representations, while LAM encodes robust long-term appearance features. These components are jointly integrated through a Multi-hierarchical Data Association (MHDA) strategy, enabling stable and accurate tracking. Extensive experiments on the KITTI MOT benchmark demonstrate the effectiveness of our framework, achieving 80.90% HOTA, 89.73% MOTA, and 470 IDSW, outperforming state-of-the-art methods in both standard and challenging scenarios.
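The abstract does not specify how MHDA orders or scores its matching stages. As a rough illustration of the general idea of hierarchical (cascaded) association, the sketch below matches tracks first against detections confirmed by both modalities and then matches the leftover tracks against single-modality detections under a looser threshold. Everything here is an assumption for illustration: the function names, the two-stage ordering, the thresholds, the greedy matcher, and the use of 2D IoU as a stand-in for the paper's motion and appearance cues.

```python
from typing import Dict, Hashable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical 2D box format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[Hashable, Box], dets: Dict[Hashable, Box],
                 thresh: float) -> Dict[Hashable, Hashable]:
    """Greedily pair tracks with detections in order of decreasing IoU."""
    pairs = sorted(
        ((iou(tb, db), t, d) for t, tb in tracks.items() for d, db in dets.items()),
        key=lambda p: p[0],
        reverse=True,
    )
    matches: Dict[Hashable, Hashable] = {}
    used = set()
    for score, t, d in pairs:
        if score < thresh:
            break  # remaining pairs are all below the threshold
        if t in matches or d in used:
            continue
        matches[t] = d
        used.add(d)
    return matches


def hierarchical_associate(tracks: Dict[Hashable, Box],
                           fused_dets: Dict[Hashable, Box],
                           single_dets: Dict[Hashable, Box],
                           t_fused: float = 0.5,
                           t_single: float = 0.3):
    """Two-stage cascade (assumed ordering, not taken from the paper)."""
    # Stage 1: detections seen by both camera and LiDAR, strict threshold.
    matches = greedy_match(tracks, fused_dets, t_fused)
    # Stage 2: only the tracks left over from stage 1, looser threshold.
    leftover = {t: b for t, b in tracks.items() if t not in matches}
    matches.update(greedy_match(leftover, single_dets, t_single))
    unmatched = [t for t in tracks if t not in matches]
    return matches, unmatched
```

A real tracker would typically replace the greedy matcher with an optimal assignment solver and the IoU score with a cost that combines motion and appearance terms; the cascade structure is the only point this sketch is meant to convey.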