ICRA 2026

Searching in Space and Time: Unified Memory-Action Loops for Open-World Object Retrieval

Taijing Chen, Sateesh Kumar, Junhong Xu, Georgios Pavlakos, Joydeep Biswas, Roberto Martín-Martín

AI summary

STAR unifies spatial and temporal search in a single decision loop, enabling robots to retrieve open-world objects using both long-term memory and embodied actions.

open-world object retrieval, spatiotemporal reasoning, embodied AI, long-term memory, vision-language models, active search

Problem

Service robots struggle to retrieve arbitrary objects in dynamic environments when requests combine open-vocabulary attributes, spatial context, and temporal references. Existing methods handle space or time in isolation, or handle both only with a fixed, closed vocabulary.

Approach

STAR integrates a non-parametric long-term memory with a working memory inside a unified action space, allowing a vision-language model to alternately query past observations or execute spatial actions to gather evidence until the target is found.
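The loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`Observation`, `QueryMemory`, `GoTo`, `Done`, `search`, `toy_policy`) is hypothetical, the long-term memory is reduced to a plain list filtered by exact label match, and a rule-based stub stands in for the vision-language model. The point is only the shape of the idea: one policy chooses, at each step, between an action that searches in time and an action that searches in space.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Observation:
    """One stored or freshly captured observation (hypothetical schema)."""
    time: int
    place: str
    labels: frozenset
    fresh: bool = False  # True if captured during the current search


@dataclass(frozen=True)
class QueryMemory:
    """Temporal action: recall past observations whose labels contain `text`."""
    text: str


@dataclass(frozen=True)
class GoTo:
    """Spatial action: move to `place` and observe what is there now."""
    place: str


@dataclass(frozen=True)
class Done:
    """Terminal action: report where the target is (None means give up)."""
    place: Optional[str]


Action = Union[QueryMemory, GoTo, Done]


def search(request: str, long_term: list, world: dict,
           policy: Callable[[str, list], Action], max_steps: int = 10):
    """Run one decision loop over a unified memory/spatial action space."""
    working, trace = [], []  # working memory and the actions taken so far
    now = max((o.time for o in long_term), default=0)
    for _ in range(max_steps):
        action = policy(request, working)
        trace.append(action)
        if isinstance(action, Done):
            return action.place, trace
        if isinstance(action, QueryMemory):
            # Search in time: copy matching past observations into working memory.
            working.extend(o for o in long_term if action.text in o.labels)
        elif isinstance(action, GoTo):
            # Search in space: gather new evidence at the chosen place.
            now += 1
            labels = world.get(action.place, frozenset())
            working.append(Observation(now, action.place, labels, fresh=True))
    return None, trace


def toy_policy(request: str, working: list) -> Action:
    """Rule-based stand-in for the vision-language model (illustration only)."""
    fresh = [o for o in working if o.fresh]
    for o in fresh:
        if request in o.labels:
            return Done(o.place)
    if not working:
        return QueryMemory(request)
    visited = {o.place for o in fresh}
    recalled = [o for o in working if not o.fresh and o.place not in visited]
    if recalled:
        # Visit the most recently remembered location first.
        return GoTo(max(recalled, key=lambda o: o.time).place)
    return Done(None)


# Toy scenario: the mug was last seen on the desk, but has since moved back.
long_term = [
    Observation(1, "kitchen", frozenset({"red mug"})),
    Observation(2, "desk", frozenset({"red mug"})),
]
world = {"desk": frozenset({"laptop"}), "kitchen": frozenset({"red mug"})}
place, trace = search("red mug", long_term, world, toy_policy)
```

In this toy run the stub first queries memory, visits the most recent remembered location, finds the memory stale, and then falls back to the older location. Replacing `toy_policy` with a model-backed policy would leave the loop itself unchanged, which is the appeal of putting both kinds of action in one action space.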

Key results

Outperforms scene-graph and memory-only baselines across all tasks
Successfully transfers to a physical Tiago robot in a mock apartment
Shows pronounced gains on tasks requiring temporal reasoning
Introduces STARBench, a 360-task benchmark for dynamic household search

Why it matters

It provides a scalable framework for assistive robots to handle complex, real-world retrieval requests that depend on both where and when objects were last seen.

Abstract

Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes (“the red mug”), spatial context (“the mug on the table”), or past states (“the mug that was here yesterday”). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments in STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem. For more information: https://amrl.cs.utexas.edu/STAR.

Index terms

Mobile Manipulation, AI-Enabled Robotics, Autonomous Agents