ICRA 2026

EIMC: Efficient Instance-Aware Multi-Modal Collaborative Perception

Kang Yang, Peng Wang, Lantao Li, Tianci Bu, Chen Sun, Deying Li, Yongcai Wang

AI summary

EIMC achieves state-of-the-art 3D detection accuracy while cutting communication bandwidth by 87.98% through early voxel injection and heatmap-driven instance communication.

collaborative perception, multi-modal fusion, 3D object detection, bandwidth efficiency, instance-aware communication, autonomous driving

Problem

Current multi-modal collaborative perception methods rely on dense intermediate fusion or late fusion, which either demand prohibitive communication bandwidth or sacrifice detection accuracy for occluded objects.

Approach

The framework injects lightweight collaborative voxels early into local fusion to form compact 3D priors, then uses a heatmap-driven consensus protocol to query and refine only critical instance vectors from neighboring agents via cross-attention.
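
The heatmap-driven query step can be illustrated with a minimal sketch. It assumes each agent holds a per-pixel confidence heatmap over the same grid, and it ranks cells by a hypothetical score, (1 - ego confidence) × |ego confidence - peer confidence|, so that low-confidence, high-discrepancy cells come first. The function name, the scoring rule, and the toy heatmaps are illustrative assumptions, not the paper's formulation.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # per-pixel confidence in [0, 1], row-major grid


def select_query_cells(ego: Heatmap, peer: Heatmap, k: int) -> List[Tuple[int, int]]:
    """Return the k cells where the ego is unsure and disagrees most with a peer.

    The score (1 - ego) * |ego - peer| is an illustrative assumption,
    not the rule used by EIMC.
    """
    scored = []
    for r, (ego_row, peer_row) in enumerate(zip(ego, peer)):
        for c, (e, p) in enumerate(zip(ego_row, peer_row)):
            score = (1.0 - e) * abs(e - p)
            scored.append((score, r, c))
    # Highest score first; ties are broken by grid position for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in scored[:k]]


# Toy 2x3 heatmaps (hypothetical values).
ego_heatmap = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
]
peer_heatmap = [
    [0.9, 0.9, 0.2],
    [0.1, 0.3, 0.8],
]
query_cells = select_query_cells(ego_heatmap, peer_heatmap, k=2)
print(query_cells)
```

Only the instance vectors at the selected cells would then be requested from the neighbor, which is what keeps the message small compared with sending a dense feature map.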

Key results

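State-of-the-art 3D detection in evaluation on the OPV2V and DAIR-V2X benchmarks, with 73.01% AP@0.5 reported in the abstract
87.98% reduction in byte bandwidth usage relative to the best published multi-modal collaborative detector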
Novel Mix-Voxel and Heterogeneous Modality Fusion modules bridge LiDAR-camera gaps
Heatmap-driven instance completion and refinement recover occluded objects efficiently
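
To put the bandwidth figure in perspective, the short calculation below converts a percentage reduction into the fraction of bytes still transmitted and the corresponding reduction factor. The helper names are hypothetical; the only input taken from the results above is the 87.98% figure.

```python
def remaining_fraction(reduction_percent: float) -> float:
    """Fraction of the baseline bytes that is still transmitted."""
    return 1.0 - reduction_percent / 100.0


def reduction_factor(reduction_percent: float) -> float:
    """How many times smaller the message is than the baseline."""
    return 1.0 / remaining_fraction(reduction_percent)


remaining = remaining_fraction(87.98)
factor = reduction_factor(87.98)
print(f"remaining: {remaining:.4f} of baseline, factor: {factor:.2f}x")
```

An 87.98% reduction therefore corresponds to sending roughly 12% of the baseline's bytes, a little over eight times less data.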

Why it matters

Enables real-time, bandwidth-efficient multi-agent perception for autonomous driving and robotics where communication constraints are critical.

Abstract

Multi-modal collaborative perception has drawn great attention for its potential to enhance the safety of autonomous driving. However, current multi-modal approaches follow a “local fusion → communication” sequence, which fuses multi-modal data locally and needs high bandwidth to transmit each agent’s feature data before collaborative fusion. EIMC proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego’s local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers and then fused via cross-attention for completion. Afterwards, a refinement fusion collects the top-K most confident instances from each agent and enhances their features using self-attention. This instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code is publicly released at https://github.com/sidiangongyuan/EIMC.
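
The completion step described in the abstract fuses queried peer instance vectors into the ego's representation via cross-attention. The sketch below shows the general mechanism only: single-head scaled dot-product attention with a residual connection, with no learned projections. The function names, the absence of projection weights, and the toy vectors are simplifying assumptions and do not reflect the released EIMC code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, peer_vectors: List[Vector]) -> Vector:
    """Fuse peer instance vectors into an ego query (illustrative only).

    Keys and values are the raw peer vectors; a real model would apply
    learned linear projections first.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, peer)) / math.sqrt(dim)
        for peer in peer_vectors
    ]
    weights = softmax(scores)
    attended = [
        sum(w * peer[i] for w, peer in zip(weights, peer_vectors))
        for i in range(dim)
    ]
    # Residual connection: keep the ego information and add the peer context.
    return [q + a for q, a in zip(query, attended)]


# Toy ego query and two identical peer vectors (hypothetical values).
ego_query = [1.0, 0.0]
peers = [[0.0, 2.0], [0.0, 2.0]]
fused = cross_attend(ego_query, peers)
print(fused)
```

The refinement stage uses self-attention, which follows the same pattern except that the queries, keys, and values all come from one pooled set of instance vectors.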

Index terms

Computer Vision for Transportation; Deep Learning for Visual Perception; Object Detection, Segmentation and Categorization