ICRA 2026

Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression

Nikolaos Stathoulopoulos, Christoforos Kanellakis, George Nikolakopoulos

AI summary

A scene graph-conditioned deep autoencoder compresses 3D point clouds by up to 98% while preserving geometric and semantic fidelity for efficient multi-robot data transmission.

Point Cloud Compression, Semantic Scene Graphs, Deep Learning, Multi-Robot Systems, Edge Computing, 3D Perception

Problem

The massive size and irregular structure of LiDAR point clouds strain bandwidth and storage, degrading performance in multi-agent robotic systems. Existing compression pipelines fail to leverage semantic or relational scene structure to guide efficient encoding.

Approach

The framework partitions point clouds into semantically coherent patches via a scene graph, encodes them with a FiLM-conditioned transformer into compact latent vectors, and reconstructs them using a bounding-box-guided folding decoder.
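
As a rough illustration of the FiLM conditioning mentioned above, the sketch below scales and shifts each feature channel of a patch using a conditioning vector. It is not the authors' implementation: the one-hot semantic class vector, the projection matrices `w_gamma` and `w_beta`, the `1 + ...` offset on the scale, and all sizes are assumptions chosen only for the example.

```python
import numpy as np

def film(features, cond, w_gamma, w_beta):
    """Feature-wise Linear Modulation: per-channel scale and shift.

    features: (n_tokens, d) patch features
    cond:     (c,) conditioning vector (assumed here: one-hot semantic class)
    w_gamma, w_beta: (c, d) projections (hypothetical; learned in practice)
    """
    gamma = 1.0 + cond @ w_gamma    # (d,) scale, centred on identity (assumption)
    beta = cond @ w_beta            # (d,) shift
    return gamma * features + beta  # broadcasts over the token axis

rng = np.random.default_rng(0)
n_tokens, d, n_classes = 4, 8, 3            # illustrative sizes only
features = rng.normal(size=(n_tokens, d))
cond = np.eye(n_classes)[1]                 # pretend the patch belongs to class 1
w_gamma = rng.normal(scale=0.1, size=(n_classes, d))
w_beta = rng.normal(scale=0.1, size=(n_classes, d))

out = film(features, cond, w_gamma, w_beta)
print(out.shape)
```

With zero projection matrices the function returns the features unchanged, which is why the scale is written as an offset from one in this sketch.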

Key results

Up to 98% data reduction on SemanticKITTI and nuScenes
Outperforms Draco and MPEG codecs in compression metrics
Preserves geometric and semantic fidelity in reconstructed patches
Maintains raw-LiDAR-level accuracy for multi-robot pose graph optimization and map merging
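
To put the headline figure in perspective, a data reduction percentage can be converted to a compression ratio. The snippet below does this arithmetic for a hypothetical scan; the point count, the 12 bytes per point, and the compressed size are made-up values for illustration, not numbers from the paper.

```python
def reduction_and_ratio(raw_bytes, compressed_bytes):
    """Return (fractional size reduction, compression ratio)."""
    reduction = 1.0 - compressed_bytes / raw_bytes
    ratio = raw_bytes / compressed_bytes
    return reduction, ratio

# Hypothetical scan: 120,000 points, each stored as three float32 coordinates.
raw = 120_000 * 3 * 4      # 1,440,000 bytes
compressed = 28_800        # hypothetical compressed payload in bytes

reduction, ratio = reduction_and_ratio(raw, compressed)
print(f"{reduction:.0%} reduction, {ratio:.0f}:1 ratio")
```

In other words, a 98% reduction corresponds to a compressed payload of one fiftieth of the original size.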

Why it matters

Enables reliable, bandwidth-efficient 3D data sharing for edge-computing multi-robot systems without compromising navigation or perception tasks.

Abstract

Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially given the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.
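
A folding-based decoder can be pictured in the spirit of FoldingNet-style decoding: a fixed 2D grid is concatenated with the latent vector and mapped to 3D by a small network. The sketch below is a toy under stated assumptions, not the paper's decoder: the random weights, the single folding step, and the use of a node's bounding box (`box_center`, `box_size`) to rescale the output are hypothetical choices made only to show the idea.

```python
import numpy as np

def make_grid(m):
    """Regular m x m lattice on [-1, 1]^2, returned with shape (m*m, 2)."""
    u = np.linspace(-1.0, 1.0, m)
    return np.stack(np.meshgrid(u, u, indexing="ij"), axis=-1).reshape(-1, 2)

def fold(grid, latent, w1, w2):
    """One folding step: attach the latent to every grid point, then a 2-layer MLP."""
    tiled = np.broadcast_to(latent, (grid.shape[0], latent.shape[0]))
    hidden = np.tanh(np.concatenate([grid, tiled], axis=1) @ w1)
    return np.tanh(hidden @ w2)  # (n, 3), each coordinate in [-1, 1]

def decode_patch(latent, box_center, box_size, w1, w2, m=8):
    """Fold a grid into 3D and rescale it into the given bounding box (assumption)."""
    unit = fold(make_grid(m), latent, w1, w2)
    return box_center + 0.5 * box_size * unit

rng = np.random.default_rng(0)
latent = rng.normal(size=16)              # illustrative latent size
w1 = rng.normal(size=(2 + 16, 32))        # untrained weights, for shape only
w2 = rng.normal(size=(32, 3))
center = np.array([10.0, -2.0, 0.5])      # hypothetical node bounding box
size = np.array([4.0, 2.0, 1.5])

points = decode_patch(latent, center, size, w1, w2)
print(points.shape)
```

Bounding the network output with `tanh` and rescaling keeps every reconstructed point inside the box in this toy; a trained decoder would learn the weights from data.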

Index terms

Range Sensing; Deep Learning for Visual Perception