ICRA 2026

VISTA: Open-Vocabulary, Task-Relevant Robot Exploration with Online Semantic Gaussian Splatting

Keiko Nagami, Timothy Chen, Javier Yu, Ola Shorinwa, Maximilian Adang, Carlyn Dougherty, Eric Cristofalo, Mac Schwager

AI summary

VISTA enables robots to efficiently search unmapped environments by planning trajectories that simultaneously maximize geometric map quality and open-vocabulary task relevance.

Active exploration, Semantic Gaussian Splatting, Open-vocabulary navigation, 3D mapping, Informative planning, Robot autonomy

Problem

Robots struggle to efficiently search for specific objects in unstructured, unmapped environments while building high-fidelity 3D maps in real time.

Approach

VISTA integrates online semantic 3D Gaussian Splatting with a novel viewpoint-semantic coverage metric to plan receding-horizon trajectories that prioritize both geometric view diversity and task-relevant semantic information.
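As a rough illustration of how a coverage score can combine view diversity with task relevance and still be updated recursively, consider the minimal Python sketch below. It is not the authors' implementation: the azimuth binning, the `CoverageMap` class, the range cutoff, and all parameter values are hypothetical choices made for this example, and the per-point relevance weights are taken as given.

```python
import numpy as np

N_BINS = 8        # assumed number of azimuth bins per map point
MAX_RANGE = 5.0   # assumed sensing range in meters


def direction_bin(point, cam_pos, n_bins=N_BINS):
    """Index of the horizontal direction from which cam_pos observes point."""
    d = np.asarray(cam_pos, dtype=float) - np.asarray(point, dtype=float)
    angle = np.arctan2(d[1], d[0]) % (2.0 * np.pi)
    return int(angle / (2.0 * np.pi) * n_bins) % n_bins


class CoverageMap:
    """Tracks which viewing directions have already covered each map point."""

    def __init__(self, points, relevance):
        self.points = np.asarray(points, dtype=float)
        self.relevance = np.asarray(relevance, dtype=float)  # task weight per point
        self.seen = np.zeros((len(self.points), N_BINS), dtype=bool)

    def _visible(self, cam_pos):
        # Simplified visibility: range check only, no occlusion or field of view.
        cam = np.asarray(cam_pos, dtype=float)
        for i, p in enumerate(self.points):
            if np.linalg.norm(p - cam) <= MAX_RANGE:
                yield i, direction_bin(p, cam)

    def gain(self, cam_pos):
        """Relevance-weighted count of (point, direction) pairs not yet covered."""
        return float(sum(self.relevance[i]
                         for i, b in self._visible(cam_pos)
                         if not self.seen[i, b]))

    def update(self, cam_pos):
        """Recursive update: mark the pairs covered by a newly executed view."""
        for i, b in self._visible(cam_pos):
            self.seen[i, b] = True


cov = CoverageMap(points=[[0, 0, 0], [1, 0, 0]], relevance=[1.0, 0.25])
first = cov.gain([0.5, -3.0, 0.0])      # both points are new from this side
cov.update([0.5, -3.0, 0.0])
repeat = cov.gain([0.5, -3.0, 0.0])     # same view again adds nothing
opposite = cov.gain([0.5, 3.0, 0.0])    # a different direction is rewarded
```

In a receding-horizon loop, a planner along these lines would score each candidate trajectory by summing `gain` over its viewpoints, execute the first segment of the best candidate, call `update`, and replan.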

Key results

On static datasets, the coverage metric outperforms FisherRF and Bayes’ Rays in both reconstruction quality and computation speed
In quadrotor hardware experiments, achieves 6x higher success rates than baseline methods in challenging maps, while matching baseline performance in less challenging maps
Introduces a scalable, recursively updatable information-gain metric combining view diversity and semantic relevance
Demonstrates platform-agnostic deployment on a quadrotor drone and a Spot quadruped robot

Why it matters

Provides a scalable, real-time framework for autonomous object search, advancing robotics applications in search-and-rescue and dynamic inspection.

Abstract

We present VISTA (Viewpoint-based Image selection with Semantic Task Awareness), an active exploration method for robots to plan informative trajectories that improve 3D map quality in areas most relevant for task completion. Given an open-vocabulary search instruction (e.g., “find a person”), VISTA enables a robot to explore its environment to search for the object of interest, while simultaneously building a real-time semantic 3D Gaussian Splatting reconstruction of the scene. The robot navigates its environment by planning receding-horizon trajectories that prioritize semantic similarity to the query and exploration of unseen regions of the environment. To evaluate trajectories, VISTA introduces a novel, efficient viewpoint-semantic coverage metric that quantifies both the geometric view diversity and task relevance in the 3D scene. On static datasets, our coverage metric outperforms state-of-the-art baselines, FisherRF and Bayes’ Rays, in computation speed and reconstruction quality. In quadrotor hardware experiments, VISTA achieves 6x higher success rates in challenging maps, compared to baseline methods, while matching baseline performance in less challenging maps. Lastly, we show that VISTA is platform-agnostic by deploying it on a quadrotor drone and a Spot quadruped robot. Code and videos can be found on our project page: https://stanfordmsl.github.io/VISTA/.
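To make "semantic similarity to the query" concrete, the following Python sketch scores map points by cosine similarity between per-point feature vectors and a query embedding. It is an assumption-laden illustration, not the method from the paper: the function name `query_relevance`, the clipping to [0, 1], and the toy 2D vectors are hypothetical, and a real system would obtain the embeddings from a vision-language encoder that this summary does not specify.

```python
import numpy as np


def query_relevance(point_features, query_embedding):
    """Cosine similarity of each point feature to the query, clipped to [0, 1]."""
    f = np.asarray(point_features, dtype=float)
    q = np.asarray(query_embedding, dtype=float)
    f_unit = f / np.linalg.norm(f, axis=1, keepdims=True)  # normalize each row
    q_unit = q / np.linalg.norm(q)
    # Negative similarity is treated as "not relevant" (an assumption).
    return np.clip(f_unit @ q_unit, 0.0, 1.0)


# Toy stand-ins for learned embeddings: aligned, orthogonal, and opposite.
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]
scores = query_relevance(features, query)
```

Scores of this kind could serve as the per-point task weights when ranking candidate viewpoints, so that views of query-like regions count for more than views of unrelated geometry.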

Index terms

Semantic Scene Understanding, Task and Motion Planning, Mapping