ICRA 2026

Relationship-Aware Hierarchical 3D Scene Graph for Task Reasoning

Albert Gassol Puigjaner, Angelos Zacharia, Kostas Alexis

AI summary

REASONINGGRAPH enables autonomous robots to incrementally build relationship-aware 3D scene graphs and execute complex, language-grounded tasks through integrated vision-language reasoning.

3D scene graphs, hierarchical mapping, open-vocabulary perception, task reasoning, quadruped robotics, vision-language models

Problem

Traditional metric-semantic maps and existing 3D scene graphs lack scalable hierarchical structures, open-vocabulary semantics, and online relational reasoning, limiting autonomous agents' ability to understand and interact with dynamic environments.

Approach

The framework incrementally constructs a five-layer hierarchical 3D scene graph enriched with open-vocabulary CLIP embeddings and VLM-inferred object relationships, then uses an LLM-VLM module to decompose natural language tasks and evaluate interaction feasibility.
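To make the representation concrete, the following is a minimal Python sketch of a layered scene graph with per-node embeddings and relationship triples. It is an illustration under stated assumptions, not the authors' implementation: the class names, the numeric layer indices, the toy three-dimensional vectors standing in for CLIP embeddings, and the hard-coded "on top of" relation standing in for a VLM-inferred one are all hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Node:
    node_id: str
    layer: int               # 0 = lowest layer, 4 = highest (layer meanings are assumed)
    embedding: list[float]   # toy stand-in for an open-vocabulary feature vector
    children: list[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    num_layers: int = 5
    nodes: dict[str, Node] = field(default_factory=dict)
    # (subject_id, predicate, object_id) triples, e.g. as proposed by a VLM
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node, parent_id=None):
        """Insert a node incrementally, optionally under a higher-layer parent."""
        if not 0 <= node.layer < self.num_layers:
            raise ValueError(f"layer {node.layer} is outside 0..{self.num_layers - 1}")
        if parent_id is not None:
            parent = self.nodes[parent_id]
            if parent.layer <= node.layer:
                raise ValueError("a parent must sit on a higher layer than its child")
            parent.children.append(node.node_id)
        self.nodes[node.node_id] = node

    def add_relation(self, subject_id, predicate, object_id):
        """Record a relationship edge between two existing nodes."""
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must already be in the graph")
        self.relations.append((subject_id, predicate, object_id))

    def query(self, text_embedding, layer, top_k=1):
        """Return the ids of the nodes on one layer most similar to a text embedding."""
        candidates = [n for n in self.nodes.values() if n.layer == layer]
        ranked = sorted(candidates, key=lambda n: cosine(n.embedding, text_embedding),
                        reverse=True)
        return [n.node_id for n in ranked[:top_k]]

    def relations_of(self, node_id):
        """All relationship triples in which the node takes part."""
        return [r for r in self.relations if node_id in (r[0], r[2])]


# Usage: build a tiny graph, then retrieve the object closest to a query vector.
graph = SceneGraph()
graph.add_node(Node("room_1", layer=3, embedding=[0.9, 0.1, 0.0]))
graph.add_node(Node("mug_1", layer=1, embedding=[0.1, 0.9, 0.1]), parent_id="room_1")
graph.add_node(Node("table_1", layer=1, embedding=[0.2, 0.1, 0.9]), parent_id="room_1")
graph.add_relation("mug_1", "on top of", "table_1")

best = graph.query([0.0, 1.0, 0.0], layer=1)
print(best, graph.relations_of(best[0]))
```

In a real system the query vector would come from a text encoder and the relation triples from a vision-language model; here both are fixed by hand so the sketch runs without external dependencies.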

Key results

Extends hierarchical 3D scene graphs with multi-level open-vocabulary features
Integrates VLM-derived visual features to model explicit object relationships
Combines LLMs and VLMs to decompose tasks, identify relevant objects, and assess interaction feasibility
Demonstrates online graph construction and task reasoning on a quadruped robot across multiple environments

Why it matters

Provides a scalable, online representation that bridges low-level spatial perception and high-level language reasoning for autonomous robotic navigation and manipulation.

Abstract

Representing and understanding 3D environments in a structured manner is crucial for autonomous agents to navigate and reason about their surroundings. While traditional Simultaneous Localization and Mapping (SLAM) methods generate metric reconstructions and can be extended to metric-semantic mapping, they lack a higher level of abstraction and relational reasoning. To address this gap, 3D scene graphs have emerged as a powerful representation for capturing hierarchical structures and object relationships. In this work, we propose an enhanced hierarchical 3D scene graph that integrates open-vocabulary features across multiple abstraction levels and supports object-relational reasoning. Our approach leverages a Vision Language Model (VLM) to infer semantic relationships. Notably, we introduce a task reasoning module that combines Large Language Models (LLMs) and a VLM to interpret the scene graph’s semantic and relational information, enabling agents to reason about tasks and interact with their environment more intelligently. We validate our method by deploying it on a quadruped robot in multiple environments and tasks, highlighting its ability to reason about them.

Index terms

Semantic Scene Understanding