ICRA 2026

IMH-MOT: Interactive Multi-Hierarchical Image and Point Cloud Fusion for Multi-Object Tracking

Wenyuan Qin, Zhiyan Zhou, Jiong Luo, Chengwei Pan, Hao XU, Xiwang Dong, Danwei Wang

AI summary

IMH-MOT achieves state-of-the-art multi-object tracking by dynamically fusing camera and LiDAR data through hierarchical alignment, joint motion estimation, and long-term appearance modeling.

Multi-object tracking, LiDAR-camera fusion, point cloud clustering, long-term appearance, motion estimation, autonomous driving

Problem

Single-modality tracking degrades under occlusion and sparsity, while existing fusion methods struggle with non-overlapping observations and rely on unreliable single-frame appearance features.

Approach

The framework aligns 2D and 3D detections to recover missing spatial data, fuses cross-modal motion cues, and encodes temporal appearance consistency via a transformer, all integrated through a multi-hierarchical data association strategy.
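To make the idea of a multi-hierarchical association concrete, here is a minimal two-stage sketch. It is not the authors' MHDA implementation: the dictionary layout, the function names (`cascade_associate`, `greedy_match`), the thresholds, and the use of greedy matching in place of an optimal assignment solver are all assumptions made for illustration. Stage 1 pairs tracks with detections that carry a 3D position, and stage 2 falls back to 2D box overlap for the detections and tracks left over.

```python
from math import dist


def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(cost, max_cost):
    """Pair rows and columns in ascending cost order; cost maps (row, col) to a value."""
    pairs, used_rows, used_cols = [], set(), set()
    for (r, c), value in sorted(cost.items(), key=lambda kv: kv[1]):
        if value <= max_cost and r not in used_rows and c not in used_cols:
            pairs.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return pairs


def cascade_associate(tracks, detections, max_dist=2.0, min_iou=0.3):
    """Two-stage association (hypothetical thresholds, illustrative only)."""
    # Stage 1: 3D center distance, only for detections that have a 3D position.
    cost_3d = {
        (t, d): dist(tracks[t]["center3d"], detections[d]["center3d"])
        for t in tracks
        for d in detections
        if detections[d].get("center3d") is not None
    }
    stage1 = greedy_match(cost_3d, max_dist)
    matched_tracks = {t for t, _ in stage1}
    matched_dets = {d for _, d in stage1}

    # Stage 2: 2D IoU for whatever stage 1 left unmatched (e.g. image-only detections).
    cost_2d = {
        (t, d): 1.0 - iou_2d(tracks[t]["box2d"], detections[d]["box2d"])
        for t in tracks if t not in matched_tracks
        for d in detections if d not in matched_dets
    }
    stage2 = greedy_match(cost_2d, 1.0 - min_iou)
    return stage1, stage2
```

A tracker built this way would typically pass the predicted track state into each stage, so that targets seen by only one sensor can still be matched in a later stage instead of being dropped.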

Key results

80.90% HOTA and 89.73% MOTA on the KITTI MOT benchmark
470 ID switches, outperforming state-of-the-art methods
Effective spatial recovery for image-only targets via guided point cloud clustering
Validated effectiveness of alignment and long-term appearance modules through ablation studies
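
For readers unfamiliar with the metrics, the sketch below shows how ID switches enter MOTA under the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT. The function name `mota` is hypothetical, and the counts in the usage note are made up for illustration; they are not taken from the paper or from KITTI.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT, summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt
```

With illustrative counts of 50 misses, 30 false positives, and 20 ID switches over 1000 ground-truth boxes, `mota(50, 30, 20, 1000)` evaluates to about 0.9. Each ID switch costs the same as one missed or spurious detection, which is why MOTA is usually reported alongside identity-focused measures such as HOTA and the raw IDSW count.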

Why it matters

Enables reliable object tracking for autonomous driving and surveillance in complex, real-world conditions where sensors frequently fail or provide incomplete data.

Abstract

Multi-object tracking (MOT) plays a critical role in applications such as autonomous driving and surveillance. Camera-based approaches offer rich texture features for object association, while LiDAR-based methods provide accurate geometric information for spatial reasoning. Although each modality addresses different challenges, their intrinsic discrepancies hinder effective cross-modal fusion and unified representation learning. To overcome these limitations, we propose IMH-MOT, an interactive multi-hierarchical MOT framework comprising three key modules. The Multi-modality Alignment Module (MMAM) enhances spatial representations by sampling and clustering instance-level point clouds. The Multi-modality Motion Estimation Module (MMEM) integrates motion cues from different modalities to build a unified motion model. To mitigate the impact of occlusion on single-frame appearance features, the Long-term Appearance Module (LAM) captures temporal appearance consistency by constructing a long-term appearance embedding. Guided by modality-aware cues from MMAM, MMEM generates reliable spatial representations, while LAM encodes robust long-term appearance features. These components are jointly integrated through a Multi-hierarchical Data Association (MHDA) strategy, enabling stable and accurate tracking. Extensive experiments on the KITTI MOT benchmark demonstrate the effectiveness of our framework, achieving 80.90% HOTA, 89.73% MOTA, and 470 ID switches (IDSW), outperforming state-of-the-art methods in both standard and challenging scenarios.
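
The abstract does not spell out how MMAM samples and clusters instance-level points, so the following is only one plausible sketch of image-guided point selection, not the authors' method. It assumes points already expressed in the camera frame, a pinhole model with hypothetical intrinsics `fx, fy, cx, cy`, and a simple depth-gap clustering rule with an assumed `max_gap`. Points that project inside a 2D detection box are kept, and the largest depth cluster is taken as the object.

```python
def points_in_box(points, box, fx, fy, cx, cy):
    """Keep camera-frame points (x, y, z) whose pinhole projection falls in box (u1, v1, u2, v2)."""
    kept = []
    for x, y, z in points:
        if z <= 0:  # behind the camera, cannot be projected
            continue
        u, v = fx * x / z + cx, fy * y / z + cy
        if box[0] <= u <= box[2] and box[1] <= v <= box[3]:
            kept.append((x, y, z))
    return kept


def largest_depth_cluster(points, max_gap=0.5):
    """Split depth-sorted points where consecutive depths differ by more than max_gap."""
    if not points:
        return []
    ordered = sorted(points, key=lambda p: p[2])
    clusters = [[ordered[0]]]
    for p in ordered[1:]:
        if p[2] - clusters[-1][-1][2] > max_gap:
            clusters.append([p])  # large jump in depth: start a new cluster
        else:
            clusters[-1].append(p)
    return max(clusters, key=len)


def estimate_center(points):
    """Mean of the points, used here as a rough 3D position for the target."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

Chaining the three helpers gives a rough 3D position for a target that only the image detector reported. The depth-gap step is there to discard background points that fall inside the same 2D box but lie far behind the object.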

Index terms

Visual Tracking; Sensor Fusion; Computer Vision for Automation