CMG3D: Compensation towards Modality Gap for Open-Vocabulary Indoor 3D Object Detection
Sheng Zhang, Lian Huai, Yuyu Liu, Xingqun Jiang
AI summary
Problem
Existing open-vocabulary indoor 3D object detection methods ignore the significant feature gap between images and point clouds, especially for distant objects, which degrades detection accuracy and pseudo-label matching.
Approach
CMG3D fuses image features into the point cloud voxel space to compensate for distant objects, filters noisy proposals via confidence thresholds, and refines 2D detector outputs with the SigLIP multimodal LLM to generate high-quality 3D pseudo labels.
Key results
- Achieves state-of-the-art mAP on SUN RGB-D and ScanNet benchmarks
- Multimodal compensation successfully enriches features for distant objects
- SigLIP-based refinement reduces the impact of erroneous pseudo labels
- Outperforms prior methods like CoDA and OV-3DET across novel and base categories
Why it matters
Enables scalable, accurate open-vocabulary 3D perception for indoor robotics and autonomous navigation without costly category-specific retraining.
Abstract
In open-vocabulary indoor three-dimensional (3D) object detection (OVI3DOD), there is a modality gap between images and point clouds of indoor scenes, especially for distant objects. Existing algorithms ignore this problem, which weakens detection performance. We therefore propose Compensation towards the Modal Gap for open-vocabulary indoor 3D object detection (CMG3D). CMG3D consists of three modules: multimodal compensation (MC), object proposal filtering (OPF), and pseudo label refinement and generation (PLRG). In MC, image features are converted into a pseudo voxel space and summed with the voxel features of the point cloud to compensate for the modality gap. OPF filters the object proposals to avoid confusion between foreground and background. In PLRG, the predictions of a two-dimensional (2D) detector are refined by the multimodal large language model (LLM) SigLIP and then transformed into 3D pseudo labels for training. We evaluate CMG3D on two indoor datasets, SUN RGB-D and ScanNet, and achieve state-of-the-art results.
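As a rough illustration of the MC and OPF ideas described in the abstract, the following NumPy sketch mean-pools features into a dense voxel grid, sums an image-derived pseudo voxel grid with the point cloud grid, and drops low-confidence proposals. It is not the authors' implementation: the function names (`scatter_to_voxels`, `compensate`, `filter_proposals`), the mean pooling, the dense grid layout, and the 0.5 threshold are all assumptions made for illustration, and the sketch assumes image features have already been lifted to 3D positions (for example via depth), a step the abstract does not detail.

```python
import numpy as np

def scatter_to_voxels(points, feats, grid_shape, voxel_size, origin):
    """Mean-pool per-point features into a dense voxel grid (hypothetical helper).

    points: (N, 3) 3D positions; feats: (N, C) features at those positions.
    Returns an array of shape (*grid_shape, C).
    """
    idx = np.floor((points - origin) / voxel_size).astype(int)
    # Discard points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, feats = idx[inside], feats[inside]

    grid = np.zeros((*grid_shape, feats.shape[1]), dtype=np.float32)
    count = np.zeros(grid_shape, dtype=np.float32)
    # Unbuffered accumulation so repeated voxel indices are all summed.
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)
    np.add.at(count, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid / np.maximum(count, 1.0)[..., None]

def compensate(point_voxels, image_pseudo_voxels):
    """Sum the image pseudo voxel grid with the point cloud voxel grid."""
    return point_voxels + image_pseudo_voxels

def filter_proposals(boxes, scores, threshold=0.5):
    """Keep proposals whose confidence reaches the (assumed) threshold."""
    keep = scores >= threshold
    return boxes[keep], scores[keep]
```

In this sketch, a voxel that holds few or no LiDAR points (as is common for distant objects) still receives a non-zero feature whenever image-derived features land in it, which is the intuition behind summing the two grids.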