EIMC: Efficient Instance-Aware Multi-Modal Collaborative Perception
Kang Yang, Peng Wang, Lantao Li, Tianci Bu, Chen Sun, Deying Li, Yongcai Wang
AI summary
Problem
Current multi-modal collaborative perception methods rely on dense intermediate fusion or late fusion, which either demand prohibitive communication bandwidth or sacrifice detection accuracy for occluded objects.
Approach
The framework injects lightweight collaborative voxels early into local fusion to form compact 3D priors, then uses a heatmap-driven consensus protocol to query and refine only critical instance vectors from neighboring agents via cross-attention.
Key results
- State-of-the-art 3D detection on OPV2V and DAIR-V2X benchmarks
- 87.98% reduction in communication bandwidth
- Novel Mix-Voxel and Heterogeneous Modality Fusion modules bridge LiDAR-camera gaps
- Heatmap-driven instance completion and refinement recover occluded objects efficiently
Why it matters
Enables real-time, bandwidth-efficient multi-agent perception for autonomous driving and robotics where communication constraints are critical.
Abstract
Multi-modal collaborative perception calls for great attention to enhancing the safety of autonomous driving. However, current multi-modal approaches remain a “local fusion → communication” sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual’s feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego’s local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that involves collecting the top-K most confident instances from each agent and enhancing their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code publicly released at https://github.com/sidiangongyuan/EIMC.
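The heatmap-driven query and cross-attention completion described above can be illustrated with a minimal pure-Python sketch. It is not taken from the released code. The scoring rule `(1 - ego) * |peer - ego|`, the function names `need_scores`, `top_k_cells`, and `cross_attend`, and the toy 2x2 heatmaps are all assumptions chosen only to show the idea of selecting low-confidence, high-discrepancy cells and fusing peer instance vectors with single-head scaled dot-product attention.

```python
import math

def need_scores(ego_conf, peer_conf):
    """Assumed scoring rule: a cell scores high when the ego agent's
    confidence is low AND the peer's confidence differs strongly."""
    return [
        [(1.0 - e) * abs(p - e) for e, p in zip(ego_row, peer_row)]
        for ego_row, peer_row in zip(ego_conf, peer_conf)
    ]

def top_k_cells(scores, k):
    """Return (row, col) indices of the k highest-scoring cells,
    breaking ties by row and then column for determinism."""
    cells = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in cells[:k]]

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.
    The ego instance vector is the query; peer vectors are keys/values."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy per-cell confidence heatmaps (hypothetical values).
ego_conf = [[0.9, 0.2], [0.8, 0.1]]
peer_conf = [[0.9, 0.9], [0.7, 0.8]]

# Only the K cells where cooperation is most needed are queried from the peer.
queried = top_k_cells(need_scores(ego_conf, peer_conf), k=2)

# Hypothetical peer instance vectors stored at the queried cells.
peer_vectors = {(1, 1): [1.0, 0.0], (0, 1): [0.0, 1.0]}
peer_feats = [peer_vectors[cell] for cell in queried]

# Fuse the peer vectors into the ego instance vector via cross-attention.
ego_vector = [1.0, 0.0]
fused = cross_attend(ego_vector, peer_feats, peer_feats)
```

In this sketch only two instance vectors cross the link instead of a dense feature map, which is the intuition behind the bandwidth saving. A real system would likely use learned projections and multi-head attention rather than the raw dot product shown here.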