IROS 2024

Incrementally Building Room-Scale Language-Embedded Gaussian Splats (LEGS) with a Mobile Robot

Justin Yu, Kush Hari, Kishore Srinivas, Karim El-Refai, Adam Rashid, Chung Min Kim, Justin Kerr, Richard Cheng, Muhammad Zubair Irshad, Ashwin Balakrishna, Thomas Kollar, Ken Goldberg


Abstract

Building semantic 3D maps is valuable for searching for objects of interest in offices, warehouses, stores, and homes. We present a mapping system that incrementally builds a Language-Embedded Gaussian Splat (LEGS): a detailed 3D scene representation that encodes both appearance and semantics in a unified representation. LEGS is trained online as a robot traverses its environment to enable localization of open-vocabulary object queries. We evaluate LEGS on 4 room-scale scenes where we query for objects in the scene to assess how well LEGS captures semantic meaning. We compare LEGS to LERF [1] and find that while both systems have comparable object query success rates, LEGS trains over 3.5x faster than LERF. Results suggest that a multi-camera setup and incremental bundle adjustment can boost visual reconstruction quality in constrained robot trajectories, and suggest LEGS can localize open-vocabulary and long-tail object queries with up to 66% accuracy. See project website at: berkeleyautomation.github.io/LEGS

I. INTRODUCTION

Consider open-vocabulary robot requests such as “Where are gluten-free crackers?” or “Get a stain remover spray”: the robot must parse such queries, localize relevant objects, and navigate to them. A large body of recent work uses large vision-language models by distilling their outputs into 3D representations like point clouds or NeRFs [2]. These semantic representations have been applied to both manipulation [3], [4], [5] and large-scale scene understanding [6], [7], [8], showing the promise of using large models zero-shot for open-vocabulary task specification. One key challenge for scaling these methods to large environments is the underlying 3D representation, which should be flexible to a variety of scales, able to update with new observations, and fast.
Although NeRFs are commonly used as the 3D representation for distilling 2D semantic features [9], [10], [1], scaling NeRFs to large scenes can be cumbersome because they typically rely on a fixed spatial resolution [11], [12], [13], are difficult to modify, and are slower to render. A popular alternative is point clouds [14], [7], [8], [6], which work seamlessly with many SLAM algorithms. However, a given point is assigned a single color and semantic feature by fusing CLIP in the point cloud with a contrastively supervised field, whereas a multi-scale model of the world can simultaneously reason about objects and their parts, similar to how LERF-TOGO [5] leverages multi-scale semantics in LERF [1].

∗ Equal contribution
1 The AUTOLab at UC Berkeley (automation.berkeley.edu).
2 Toyota Research Institute, Los Altos, CA.

Fig. 1: Language-Embedded Gaussian Splat in TRI Grocery Store Testbed [15]. LEGS relies entirely on pretrained VLMs and does not require any inventory data or finetuning.

3D Gaussian Splatting (3DGS) [16] models the 3D scene using a large set of 3D Gaussians. Recent works [17], [18] successfully assign semantic features to every Gaussian in the scene. However, existing techniques combining semantic features and 3DGS scene reconstruction require offline computation of keyframe transforms and 3D Gaussian initialization points.

In this paper, we focus on linking language understanding to Gaussian Splats in large-scale scenes, while incrementally training on a stream of RGBD images of the scene from a mobile robot. This incremental training method offers substantial benefits, notably enabling the robot to autonomously determine its position within the environment and subsequently use the map data for enhanced operational efficiency.

LEGS combines geometry and appearance information from 3DGS with semantic knowledge from CLIP by grounding language embeddings into the 3DGS, similar to the method described in [17].
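To make the idea of querying a language-embedded splat concrete, the following is a minimal sketch of open-vocabulary localization, not the LEGS implementation. It assumes that a language embedding is already available at each Gaussian center and that the text query has been embedded in the same space (for example, by a CLIP text encoder). The function name `localize_query`, the array names, the 512-dimensional embeddings, and the random stand-in data are all hypothetical, and plain cosine similarity is used here in place of whatever relevancy score the system actually computes.

```python
import numpy as np


def localize_query(means, embeds, query):
    """Return the index, 3D position, and all scores for the best-matching Gaussian.

    means:  (N, 3) Gaussian centers (hypothetical stand-in for a trained splat)
    embeds: (N, D) language embedding associated with each Gaussian (assumed given)
    query:  (D,)   embedding of the text query in the same space (assumed given)
    """
    # Normalize so that the dot product equals cosine similarity.
    embeds_n = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = embeds_n @ query_n
    best = int(np.argmax(scores))
    return best, means[best], scores


# Synthetic data standing in for a real scene; nothing here comes from the paper.
rng = np.random.default_rng(0)
means = rng.uniform(-5.0, 5.0, size=(1000, 3))
embeds = rng.normal(size=(1000, 512))

# Construct a query that is a slightly perturbed copy of Gaussian 42's embedding,
# so that Gaussian 42 is the intended match by construction.
query = embeds[42] + 0.01 * rng.normal(size=512)

idx, position, scores = localize_query(means, embeds, query)
print(idx, position)
```

In a real system the per-Gaussian embeddings would come from the trained language field and the query from a text encoder; the argmax over similarity scores is only the simplest possible way to turn those scores into a 3D location.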
LEGS incrementally registers images and simultaneously optimizes both 3D Gaussians and dense language fields. This allows robots to build maps that contain rich representations of their surroundings that can be queried with natural language.

This paper makes 3 contributions:

• An online multi-camera 3DGS reconstruction system for large-scale scenes. The system takes as input three video streams from a mobile robot, and incrementally builds the 3D scene.
• Language-Embedded Gaussian Splatting (LEGS), a hybrid 3D semantic representation that uses explicit 3D Gaussians for geometry and implicit scale-conditioned

2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 14-18, 2024, Abu Dhabi, UAE. 979-8-3503-7769-9/24/$31.00 ©2024 IEEE

Index terms

Semantic Scene Understanding, Inventory Management, AI-Enabled Robotics