ICRA 2026

Structured Interfaces for Automated Reasoning with 3D Scene Graphs

Aaron Ray, Jacob Arkin, Harel Biggie, Chuchu Fan, Luca Carlone, Nicholas Roy

AI summary

A Cypher-based graph database interface enables LLMs to efficiently retrieve and reason over large 3D scene graphs, drastically cutting token usage while boosting performance on language grounding tasks.

3D Scene Graphs, Large Language Models, GraphRAG, Cypher, Robot Navigation, Natural Language Grounding

Problem

Serializing an entire 3D scene graph into an LLM's context window exceeds token limits and degrades performance in large environments, so natural language grounding does not yet scale to complex, real-world robot operating areas.
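To make the scaling issue concrete, the following toy sketch (not the authors' code) serializes a synthetic scene graph to JSON text and compares the length of the full serialization with that of a single-room subset. The node schema (`id`, `label`, `position`), the `CONTAINS` edge type, and the use of character count as a rough proxy for token count are all assumptions made for illustration; the sizes it prints describe this toy data only, not the paper's experiments.

```python
import json


def make_toy_scene_graph(num_rooms, objects_per_room):
    """Build a synthetic scene graph as plain dicts (hypothetical schema)."""
    nodes, edges = [], []
    for r in range(num_rooms):
        room_id = f"room_{r}"
        nodes.append({"id": room_id, "label": "room", "position": [float(r), 0.0, 0.0]})
        for o in range(objects_per_room):
            obj_id = f"obj_{r}_{o}"
            label = "chair" if o % 2 == 0 else "table"
            nodes.append({"id": obj_id, "label": label, "position": [float(r), float(o), 0.0]})
            edges.append({"source": room_id, "target": obj_id, "type": "CONTAINS"})
    return {"nodes": nodes, "edges": edges}


def serialize(graph):
    """Compact JSON text, standing in for what would be pasted into a prompt."""
    return json.dumps(graph, separators=(",", ":"))


def subgraph_for_room(graph, room_id):
    """Keep only one room and the objects it contains."""
    keep = {room_id} | {e["target"] for e in graph["edges"] if e["source"] == room_id}
    return {
        "nodes": [n for n in graph["nodes"] if n["id"] in keep],
        "edges": [e for e in graph["edges"] if e["source"] == room_id],
    }


small = make_toy_scene_graph(num_rooms=5, objects_per_room=10)
large = make_toy_scene_graph(num_rooms=500, objects_per_room=10)
relevant = subgraph_for_room(large, "room_42")

# Character counts: small graph, large graph, and the one-room subset of the large graph.
print(len(serialize(small)), len(serialize(large)), len(serialize(relevant)))
```

The full serialization grows with the number of nodes and edges, while the one-room subset stays the same size regardless of how large the surrounding graph is.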

Approach

The authors store the 3D scene graph in a graph database and provide the LLM with a Cypher query interface as an agentic tool, allowing dynamic retrieval of relevant subgraphs and offloading of spatial reasoning.
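A minimal sketch of what such a tool interface could look like is given below. It is an illustration under stated assumptions, not the authors' implementation: the tool name `query_scene_graph`, the node labels `Room` and `Object`, the `CONTAINS` relationship, and the property names are all hypothetical, and `fake_run_query` stands in for a real graph database session so that the example runs without a server.

```python
import json

# Hypothetical tool description in the JSON-schema style used by many
# LLM tool-calling APIs; the name and wording are illustrative only.
CYPHER_TOOL = {
    "name": "query_scene_graph",
    "description": "Run a Cypher query against the 3D scene graph.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def handle_tool_call(arguments_json, run_query):
    """Dispatch one tool call and return a JSON string for the LLM.

    `run_query` is any callable mapping a Cypher string to a list of row
    dicts; in a real system it would wrap a graph database session.
    """
    query = json.loads(arguments_json)["query"]
    try:
        rows = run_query(query)
    except Exception as exc:
        # Return the error text so the LLM can revise its query.
        return json.dumps({"error": str(exc)})
    return json.dumps({"rows": rows})


# Example of a query an LLM might emit. The labels, relationship type,
# and property names are assumed, not taken from the paper.
EXAMPLE_QUERY = (
    "MATCH (r:Room)-[:CONTAINS]->(o:Object) "
    "WHERE o.label = 'chair' "
    "RETURN r.id AS room, o.id AS object LIMIT 5"
)


def fake_run_query(query):
    """Stand-in for a database call; always returns one made-up row."""
    return [{"room": "room_3", "object": "chair_17"}]


print(handle_tool_call(json.dumps({"query": EXAMPLE_QUERY}), fake_run_query))
```

Only the rows returned by the query enter the LLM's context, and filtering such as the `WHERE` clause is evaluated by the database rather than by the model.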

Key results

Scales effectively to kilometer-scale 3D scene graphs with thousands of nodes
Achieves higher success rates on instruction grounding and scene question-answering tasks
Substantially reduces token count compared to text-serialization baselines
Outperforms code-generation baselines in both efficiency and accuracy

Why it matters

Provides a scalable, token-efficient pathway for robots to understand and execute complex natural language commands in large, structured environments.

Abstract

In order to provide a robot with the ability to understand and react to a user’s natural language inputs, the natural language must be connected to the robot’s underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM’s context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content.

Index terms

Semantic Scene Understanding, AI-Based Methods, Human-Robot Teaming