ICRA 2026

Open-Vocabulary Spatio-Temporal Scene Graph for Robot Perception and Teleoperation Planning

Yi Wang, Zeyu Xue, Mujie Liu, Tongqin Zhang, Yan Hu, Zhou Zhao, Chenguang Yang, Zhenyu Lu

AI summary

ST-OVSG mitigates teleoperation command delays by aligning operator intent with historical scene states, reaching a 70.5% planning success rate under latency.

Teleoperation, Scene Graphs, Latency Compensation, Open-Vocabulary Perception, Spatio-Temporal Reasoning, Large Vision-Language Models

Problem

Communication latency in teleoperation creates mismatches between operator commands and remote robot states, while existing scene representations are static and lack temporal dynamics or redundancy filtering.

Approach

The authors propose ST-OVSG, a spatio-temporal open-vocabulary scene graph that tracks objects across time using Hungarian assignment with a temporal matching cost. The graph embeds a latency tag so that a large vision-language model (LVLM) planner can retrospectively query past scene states, and a task-oriented subgraph filtering strategy keeps the planner's input compact.
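
To make the tracking step concrete, the following is a minimal sketch of frame-to-frame node association with Hungarian assignment. The summary does not specify the paper's temporal matching cost, so the cost used here (3D centroid distance plus cosine distance between feature embeddings), the weights `w_spatial` and `w_semantic`, the gating threshold `max_cost`, and the function name `match_nodes` are all illustrative assumptions rather than the authors' formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_nodes(prev_pos, prev_emb, curr_pos, curr_emb,
                w_spatial=1.0, w_semantic=1.0, max_cost=1.5):
    """Associate previous-frame nodes with current-frame detections.

    The cost terms, weights, and threshold are assumptions for illustration,
    not the paper's temporal matching cost.
    Returns (matches, unmatched_prev, unmatched_curr).
    """
    prev_pos = np.asarray(prev_pos, dtype=float)
    curr_pos = np.asarray(curr_pos, dtype=float)
    prev_emb = np.asarray(prev_emb, dtype=float)
    curr_emb = np.asarray(curr_emb, dtype=float)

    # Spatial term: Euclidean distance between 3D centroids.
    spatial = np.linalg.norm(prev_pos[:, None, :] - curr_pos[None, :, :], axis=2)

    # Semantic term: cosine distance between feature embeddings.
    pn = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    cn = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    semantic = 1.0 - pn @ cn.T

    cost = w_spatial * spatial + w_semantic * semantic

    # Hungarian assignment: minimum-total-cost one-to-one matching.
    rows, cols = linear_sum_assignment(cost)

    # Gate out implausible pairs so new objects are not forced onto old nodes.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if cost[r, c] <= max_cost]
    matched_prev = {r for r, _ in matches}
    matched_curr = {c for _, c in matches}
    unmatched_prev = [i for i in range(len(prev_pos)) if i not in matched_prev]
    unmatched_curr = [j for j in range(len(curr_pos)) if j not in matched_curr]
    return matches, unmatched_prev, unmatched_curr


# Two tracked nodes, three detections in the new frame (toy values).
matches, lost, new = match_nodes(
    prev_pos=[[0, 0, 0], [1, 0, 0]],
    prev_emb=[[1, 0], [0, 1]],
    curr_pos=[[1.05, 0, 0], [0.02, 0, 0], [5, 5, 5]],
    curr_emb=[[0, 1], [1, 0], [1, 1]],
)
print(matches, lost, new)  # [(0, 1), (1, 0)] [] [2]
```

In this toy case the third detection is far from both tracked nodes, so it stays unmatched and would be added as a new node; matched pairs would extend an existing node's history.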

Key results

Achieves 74% node accuracy on the Replica benchmark, surpassing ConceptGraph
Enables LVLM planners to reach a 70.5% success rate in latency-robustness experiments
Introduces a lightweight latency tag and temporal matching cost to align delayed commands with historical scene states
Proposes a task-oriented subgraph filtering strategy that reduces planner input redundancy while preserving open-vocabulary flexibility
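
The latency-tag idea in the results above can be sketched as a timestamped history lookup. This is a hedged illustration, not the paper's data structure: the `SceneHistory` class, the `resolve_command_state` helper, and the interpretation of the tag as a single delay in seconds subtracted from the command's arrival time are assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class SceneHistory:
    """Time-ordered scene-graph snapshots (hypothetical structure)."""
    timestamps: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)

    def add(self, t, snapshot):
        # Snapshots are assumed to arrive in non-decreasing time order.
        if self.timestamps and t < self.timestamps[-1]:
            raise ValueError("snapshots must be added in time order")
        self.timestamps.append(t)
        self.snapshots.append(snapshot)

    def state_at(self, t):
        """Return the latest snapshot recorded at or before time t."""
        i = bisect_right(self.timestamps, t)
        if i == 0:
            raise LookupError("no snapshot at or before the requested time")
        return self.snapshots[i - 1]


def resolve_command_state(history, receive_time, latency):
    """Return the scene the operator was looking at when issuing a command.

    `latency` is an assumed estimate of the delay between the operator's
    view of the scene and the command's arrival at the robot.
    """
    return history.state_at(receive_time - latency)


history = SceneHistory()
history.add(0.0, {"cup": "on table"})
history.add(1.0, {"cup": "on shelf"})
history.add(2.0, {"cup": "on floor"})

# A command arrives at t=2.3 s with an estimated 1.0 s delay, so the planner
# should reason about the scene as it was at t=1.3 s.
print(resolve_command_state(history, receive_time=2.3, latency=1.0))
# {'cup': 'on shelf'}
```

Without the lookup, the planner would interpret the command against the current state ("on floor") rather than the state the operator actually saw.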

Why it matters

Critical for safe, reliable teleoperation in high-risk, remote environments like deep-sea exploration or nuclear response where network delays are unavoidable.

Abstract

Teleoperation via natural language reduces operator workload and enhances safety in high-risk or remote settings. However, in dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remotely perceived states and operator intent, leading to command misunderstanding and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG leverages LVLMs to construct open-vocabulary 3D object representations and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. A latency tag is embedded to enable LVLM planners to retrospectively query past scene states, thereby resolving local–remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and enhances planning robustness against transmission latency without requiring fine-tuning. Experiments show that our method achieves 74% node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieves a planning success rate of 70.5%. We refer readers to the project page for the code and results.
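
As an illustration of what task-oriented subgraph filtering could look like, the sketch below keeps only nodes whose labels overlap with the task text, plus their immediate neighbours. The abstract does not describe the actual relevance criterion, so the token-overlap test, the `hops` parameter, the graph encoding, and the name `filter_subgraph` are assumptions standing in for whatever scoring the authors use.

```python
def filter_subgraph(nodes, edges, task, hops=1):
    """Keep task-relevant nodes and their neighbours.

    nodes: {node_id: label}; edges: [(src_id, relation, dst_id)].
    Relevance here is plain token overlap, an assumed stand-in for the
    paper's criterion.
    """
    task_tokens = set(task.lower().split())

    # Seed nodes: labels sharing at least one token with the task text.
    keep = {nid for nid, label in nodes.items()
            if task_tokens & set(label.lower().split())}

    # Expand by `hops` so the spatial context of seed nodes survives.
    for _ in range(hops):
        frontier = set()
        for src, _, dst in edges:
            if src in keep:
                frontier.add(dst)
            if dst in keep:
                frontier.add(src)
        keep |= frontier

    sub_nodes = {nid: label for nid, label in nodes.items() if nid in keep}
    sub_edges = [(s, r, d) for s, r, d in edges if s in keep and d in keep]
    return sub_nodes, sub_edges


nodes = {0: "red mug", 1: "table", 2: "sofa", 3: "lamp"}
edges = [(0, "on", 1), (2, "next to", 3)]
print(filter_subgraph(nodes, edges, "pick up the mug"))
# ({0: 'red mug', 1: 'table'}, [(0, 'on', 1)])
```

The sofa and lamp are dropped because neither matches the task nor touches a matching node, which shrinks the text a planner would need to read.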

Index terms

Semantic Scene Understanding, Telerobotics and Teleoperation, RGB-D Perception