ICRA 2026

KeySG: Hierarchical Keyframe-Based 3D Scene Graphs

Abdelrhman Werby, Dennis Rotondi, Fabio Scaparro, Kai Oliver Arras

AI summary

KeySG enables scalable, task-agnostic robotic reasoning by representing environments as a hierarchical graph of keyframes and using a multi-modal RAG pipeline to query LLMs efficiently without exceeding their context windows.

3D scene graphs, hierarchical representation, keyframe sampling, retrieval-augmented generation, vision-language models, robotic reasoning

Problem

Current 3D scene graphs rely on predefined relationships and become unmanageable for LLMs in large environments due to context window limits and attention degradation.

Approach

The framework builds a hierarchical graph of floors, rooms, objects, and functional elements. Nodes are augmented with multi-modal descriptions extracted from keyframes selected to optimize geometric and visual coverage, and the graph is queried through a hierarchical retrieval-augmented generation pipeline.
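To make the hierarchy and the coarse-to-fine retrieval concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Node` class, the `retrieve` function, the toy bag-of-words similarity (standing in for learned multi-modal embeddings), and the example scene are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class Node:
    """Hypothetical graph node: floor, room, object, or functional element."""
    name: str
    level: str
    description: str  # stand-in for a description extracted from keyframes
    children: list["Node"] = field(default_factory=list)


def bow(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would use a learned encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(root: Node, query: str, top_k: int = 1) -> list[Node]:
    """Descend one level at a time, keeping the top_k best-matching children."""
    q = bow(query)
    frontier, context = [root], []
    while frontier:
        candidates = [c for n in frontier for c in n.children]
        if not candidates:
            break
        candidates.sort(key=lambda c: cosine(q, bow(c.description)), reverse=True)
        frontier = candidates[:top_k]
        context.extend(frontier)
    return context


# Assumed example scene, invented for this sketch.
handle = Node("fridge_handle", "functional_element", "handle to open the fridge door")
fridge = Node("fridge", "object", "white fridge with a door handle", [handle])
stove = Node("stove", "object", "gas stove with four burners")
kitchen = Node("kitchen", "room", "kitchen with stove sink and fridge", [fridge, stove])
living = Node("living_room", "room", "living room with sofa and television")
floor_0 = Node("floor_0", "floor", "ground floor with kitchen and living room", [kitchen, living])
floor_1 = Node("floor_1", "floor", "upper floor with bedroom and bathroom")
building = Node("building", "root", "", [floor_0, floor_1])

context = retrieve(building, "open the fridge in the kitchen")
names = [n.name for n in context]
```

Only the nodes on the retrieved path would be passed to the LLM, so the prompt size depends on the depth of the hierarchy and `top_k` rather than on the total number of nodes in the scene.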

Key results

Surpasses prior methods on open-vocabulary 3D semantic segmentation benchmarks
Achieves superior functional element segmentation accuracy
Accurately grounds objects from complex, hierarchical natural language queries
Mitigates context window bottlenecks through hierarchical summarization and targeted RAG retrieval

Why it matters

Provides robots with a scalable, semantically rich world model that supports complex reasoning and planning in large-scale human environments.

Abstract

In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM’s context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLMs to extract scene information, alleviating the need to explicitly model relationship edges between objects and enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical multi-modal retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across three distinct benchmarks (3D object semantic segmentation, functional element segmentation, and complex query retrieval), KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency. See our project page at https://keysg-lab.github.io/
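One simple way to picture keyframe selection for coverage is a greedy maximum-coverage heuristic, sketched below. This is a stand-in technique, not the selection procedure from the paper: the function `select_keyframes`, the frame names, and the point IDs are hypothetical, and the sketch models geometric coverage only (visual coverage is left out).

```python
def select_keyframes(visible: dict[str, set[int]], budget: int) -> list[str]:
    """Greedily pick up to `budget` frames, each adding the most unseen points.

    `visible` maps a frame ID to the set of 3D point IDs seen in that frame
    (an assumed input format for this sketch).
    """
    covered: set[int] = set()
    chosen: list[str] = []
    for _ in range(budget):
        best = max(
            (f for f in visible if f not in chosen),
            key=lambda f: len(visible[f] - covered),
            default=None,
        )
        # Stop early when no remaining frame adds new points.
        if best is None or not (visible[best] - covered):
            break
        chosen.append(best)
        covered |= visible[best]
    return chosen


# Assumed toy data: four frames observing seven points.
frames = {
    "f0": {1, 2, 3},
    "f1": {3, 4},
    "f2": {1, 2},
    "f3": {4, 5, 6, 7},
}
keyframes = select_keyframes(frames, budget=3)
```

With this toy data two frames already cover every point, so the loop stops before the budget of three is used.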

Index terms

Semantic Scene Understanding, Mapping, RGB-D Perception