ICRA 2026

CMG3D: Compensation towards Modality Gap for Open-Vocabulary Indoor 3D Object Detection

Sheng Zhang, Lian Huai, Yuyu Liu, Xingqun Jiang

AI summary

CMG3D compensates for the modality gap between images and point clouds, which is most pronounced for distant objects, and reports state-of-the-art open-vocabulary indoor 3D detection.

open-vocabulary detection, indoor 3D object detection, modality gap, multimodal fusion, pseudo-label refinement, SigLIP

Problem

Existing open-vocabulary indoor 3D object detection methods ignore the significant feature gap between images and point clouds, especially for distant objects, which degrades detection accuracy and pseudo-label matching.

Approach

CMG3D fuses image features into the point cloud voxel space to compensate for distant objects, filters noisy object proposals with confidence thresholds, and refines the 2D detector outputs with the SigLIP vision-language model before lifting them into 3D pseudo labels for training.
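The pseudo-label step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the detection dictionary layout, the 0.5 threshold, the pinhole intrinsics, and the axis-aligned 3D box are all assumptions made for the example. The `logit` field stands in for an image-text matching score from a SigLIP-style model, which scores each image-text pair independently with a sigmoid.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def refine_detections(detections, min_score=0.5):
    """Keep 2D detections whose image-text match probability is high enough.

    Each detection is a dict with 'box' (u0, v0, u1, v1), 'label', and
    'logit', a hypothetical image-text matching logit for the cropped box.
    """
    return [d for d in detections if sigmoid(d["logit"]) >= min_score]


def lift_box_to_3d(box, depth, fx, fy, cx, cy):
    """Back-project valid depth pixels inside a 2D box to camera coordinates.

    Returns an axis-aligned box (min_xyz, max_xyz), or None when the box
    contains no pixel with positive depth.
    """
    u0, v0, u1, v1 = box
    points = []
    for v in range(v0, v1):
        for u in range(u0, u1):
            z = depth[v][u]
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    if not points:
        return None
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    return mins, maxs


def make_pseudo_labels(detections, depth, intrinsics, min_score=0.5):
    """Refine 2D detections, then lift the survivors to 3D pseudo labels."""
    labels = []
    for det in refine_detections(detections, min_score):
        box3d = lift_box_to_3d(det["box"], depth, *intrinsics)
        if box3d is not None:
            labels.append({"label": det["label"], "box3d": box3d})
    return labels
```

A real pipeline would likely fit a tighter, oriented box and reject depth outliers inside each 2D box; the axis-aligned box here only shows the order of operations: refine first, then lift.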

Key results

Achieves state-of-the-art mAP on SUN RGB-D and ScanNet benchmarks
Multimodal compensation successfully enriches features for distant objects
SigLIP-based refinement reduces the impact of erroneous pseudo labels
Outperforms prior methods like CoDA and OV-3DET across novel and base categories

Why it matters

Enables scalable, accurate open-vocabulary 3D perception for indoor robotics and autonomous navigation without costly category-specific retraining.

Abstract

For open-vocabulary indoor three-dimensional (3D) object detection (OVI3DOD), there is a gap between the image and the point cloud for indoor scenes, especially on distant objects. However, existing algorithms ignore this problem, which weakens the detection performance. Therefore, we propose Compensation towards the Modal Gap for open-vocabulary indoor 3D object detection (CMG3D). CMG3D consists of three modules: multimodal compensation (MC), object proposal filtering (OPF), and pseudo label refinement and generation (PLRG). In the MC, features from images are converted into the pseudo voxel space and then summed with the voxel space of the point cloud, which compensates for the modality gap, while the OPF filters the object proposals to avoid confusion between the foreground and background. In the PLRG, the predictions from the two-dimensional (2D) detector are refined by the multimodal large language model (LLM) SigLIP and then transformed into 3D pseudo labels for the training process. Finally, we evaluate CMG3D on two indoor datasets, SUN RGB-D and ScanNet, and achieve state-of-the-art results.
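The multimodal compensation step (image features converted into a pseudo voxel space, then summed with the point cloud voxels) might look roughly like the sketch below. It is an illustration under stated assumptions rather than the paper's method: the sparse dictionary representation, the pinhole back-projection, mean pooling per voxel, and the function names `image_to_pseudo_voxels` and `fuse_voxels` are all hypothetical. It also assumes that a voxel with image features but no points simply takes the image feature, which is one way the image branch could fill in sparsely sampled distant objects.

```python
import math


def image_to_pseudo_voxels(pixels, fx, fy, cx, cy, voxel_size):
    """Average per-pixel image features into sparse pseudo voxels.

    pixels: iterable of (u, v, depth, feature_list); entries with
    non-positive depth are skipped.
    Returns {(i, j, k): mean_feature_list}.
    """
    sums, counts = {}, {}
    for u, v, depth, feat in pixels:
        if depth <= 0:
            continue
        # Pinhole back-projection to camera coordinates (assumed model).
        point = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
        idx = tuple(math.floor(c / voxel_size) for c in point)
        if idx in sums:
            sums[idx] = [s + f for s, f in zip(sums[idx], feat)]
        else:
            sums[idx] = list(feat)
        counts[idx] = counts.get(idx, 0) + 1
    return {idx: [s / counts[idx] for s in total] for idx, total in sums.items()}


def fuse_voxels(point_voxels, pseudo_voxels):
    """Element-wise sum of point cloud voxel features and pseudo voxel features.

    Voxels present in only one input keep that input's feature unchanged.
    """
    fused = {idx: list(feat) for idx, feat in point_voxels.items()}
    for idx, feat in pseudo_voxels.items():
        if idx in fused:
            fused[idx] = [a + b for a, b in zip(fused[idx], feat)]
        else:
            fused[idx] = list(feat)
    return fused
```

Both inputs must share the same feature dimension and voxel grid for the sum to be meaningful; in a learned model the image features would typically pass through a projection layer first.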

Index terms

Deep Learning for Visual Perception; RGB-D Perception; Computer Vision for Automation