← Back IROS 2024

EMBOSR: Embodied Spatial Reasoning for Enhanced Situated Question Answering in 3D Scenes

Yu Hao, Fan Yang, Nicholas Fang, Yu-Shen Liu

PDF

Abstract

3D Embodied Spatial Reasoning, emphasizing an agent’s interaction with its surroundings for spatial information inference, is adeptly facilitated by the process of Situated Question Answering in 3D Scenes (SQA3D). SQA3D requires an agent to comprehend its position and orientation within a 3D scene based on a textual situation and then utilize this understanding to answer questions about the surrounding environment in that context. Previous methods in this field face substantial challenges, including a dependency on constant retraining on limited datasets, which leads to poor performance in unseen scenarios, limited expandability, and inadequate generalization. To address these challenges, we present a new embodied spatial reasoning paradigm for enhanced SQA3D, fusing the capabilities of foundation models with the chain of thought methodology. This approach is designed to elevate adaptability and scalability in a wide array of 3D environments. A new aspect of our model is the integration of a chain of thought reasoning process, which significantly augments the model’s capability for spatial reasoning and complex query handling in diverse 3D environments. In our structured exper- iments, we compare our approach against other methods with varying architectures, demonstrating its efficacy in multiple tasks including SQA3D and 3D captioning. We also assess the informativeness contained in the generated answers for com- plex queries. Ablation studies further delineate the individual contributions of our method to its overall performance. The results consistently affirm our proposed method’s effectiveness and efficiency.

Index terms

Semantic Scene Understanding Computer Vision for Automation