Boosting 3D Visual Grounding by Object-Centric Referring Network
Ruilong Ren, Jian Cao, Weichen Xu, Tianhao Fu, Yilei Dong, Xinxin Xu, Zicong Hu, Xing Zhang
Abstract
3D visual grounding aims to locate a specific object within a 3D scene, as described by a given textual reference. This task is challenging because it requires (1) accurate recognition of the various objects in a 3D scene and (2) an understanding of the spatial relations in the description. However, current methods struggle when multiple similar objects are present or when the descriptions involve intricate and abstract relations. In this paper, we present a novel, simple, and efficient Object-Centric Referring network, namely 3D-OCR, which takes both high-quality semantic representation and deep relation modeling into account. Specifically, an offline Fine-grained Semantic Enhancement (FSE) module is designed to reinforce object-centric semantic awareness with fine-grained, high-quality object semantic representations. To achieve superior object-centric relation awareness, we propose a Deep Relation Modeling (DRM) module with explicit and implicit relation self-attention, which enriches object features with relational context. Moreover, we utilize a vision-language contrastive loss to further improve the matching between point cloud and language. Comprehensive experiments on the challenging ScanRefer and Nr3D datasets corroborate the strong performance of our method, with improvements of +1.47% on ScanRefer and +1.2% on Nr3D.
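
The abstract does not specify the form of the vision-language contrastive loss. As an illustration only, the sketch below assumes the widely used symmetric InfoNCE formulation, in which the i-th object feature and the i-th text feature form the positive pair and all other pairings in the batch act as negatives. The function names, the temperature value of 0.07, and the batch pairing are assumptions made for this example and are not taken from 3D-OCR.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(object_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired object and text features.

    object_feats[i] and text_feats[i] are assumed to describe the same
    target; the temperature is a hypothetical default, not a value
    reported by the paper.
    """
    objs = [l2_normalize(v) for v in object_feats]
    txts = [l2_normalize(v) for v in text_feats]
    # Cosine similarity between every object and every description,
    # scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(o, t)) / temperature for t in txts]
        for o in objs
    ]
    # Transpose to obtain the text-to-object direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

Under this formulation the loss is small when each object feature is most similar to its own description and large when the pairing is mismatched, which is the behavior a matching objective between point cloud and language features would need.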