ICRA 2026

Explicit Memory through Online 3D Gaussian Splatting Improves Class-Agnostic Video Segmentation

Anthony Opipari, Aravindhan Krishnan, Shreekant Gayaka, Min Sun, Cheng-Hao Kuo, Arnab Sen, Odest Chadwicke Jenkins

AI summary

Augmenting class-agnostic video segmentation models with an explicit 3D Gaussian splatting memory significantly improves prediction accuracy and temporal consistency.

3D Gaussian Splatting, Video Segmentation, Class-Agnostic Segmentation, Explicit Memory, Robotic Perception, Semantic Mapping

Problem

Class-agnostic video segmentation models struggle with temporal inconsistency and error accumulation in dynamic environments. Existing methods either keep no object-level memory (e.g. FastSAM) or rely on implicit memory in the form of recurrent neural network features (e.g. SAM2), so they lack a dense spatial history to stabilize predictions over time.

Approach

The authors incrementally build a 3D Gaussian splatting memory of past object segments and use it to condition current predictions via segment matching or re-prompting strategies.
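
The segment-matching idea can be illustrated with a minimal sketch. This summary does not describe the paper's actual fusion procedure, so everything here is an assumption made for illustration: segments are treated as sets of pixels, the memory is assumed to have already been rendered into the current camera view as one mask per persistent ID, matching is greedy by intersection-over-union (IoU), and the function name `match_segments` and the threshold value are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # a segment as the set of (row, col) pixels it covers


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_segments(
    current: List[Mask],
    memory: Dict[int, Mask],
    iou_threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[int]:
    """Assign each current-frame segment a persistent ID.

    `memory` maps persistent IDs to masks assumed to be rendered from the
    3D memory into the current view. Matching is greedy by descending IoU;
    segments with no match at or above the threshold receive a fresh ID.
    """
    candidates = sorted(
        (
            (iou(seg, mem_mask), seg_idx, mem_id)
            for seg_idx, seg in enumerate(current)
            for mem_id, mem_mask in memory.items()
        ),
        reverse=True,
    )
    ids = [-1] * len(current)
    used_ids = set()
    for score, seg_idx, mem_id in candidates:
        if score < iou_threshold:
            break  # remaining candidates overlap too little to match
        if ids[seg_idx] == -1 and mem_id not in used_ids:
            ids[seg_idx] = mem_id
            used_ids.add(mem_id)
    # Unmatched segments are treated as newly observed objects.
    next_id = max(memory, default=-1) + 1
    for seg_idx in range(len(ids)):
        if ids[seg_idx] == -1:
            ids[seg_idx] = next_id
            next_id += 1
    return ids
```

In this sketch, a segment that overlaps a remembered object keeps that object's ID across frames, which is the kind of temporal consistency an explicit memory is meant to provide. A segment with no sufficient overlap is registered as a new object.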

Key results

FastSAM-Splat extends a memoryless image model with 3DGS memory
SAM2-Splat introduces a 3DGS-based re-prompting strategy to correct track errors
Explicit 3D memory improves accuracy and temporal consistency over baselines
Ablation studies validate fusion design and hyperparameter settings

Why it matters

This method enables robots to maintain stable, open-world semantic maps by overcoming the temporal instability of foundation segmentation models.

Abstract

Remembering where object segments were predicted in the past is useful for improving the accuracy and consistency of class-agnostic video segmentation algorithms. Existing video segmentation algorithms typically use either no object-level memory (e.g. FastSAM) or they use implicit memories in the form of recurrent neural network features (e.g. SAM2). In this paper, we augment both types of segmentation models using an explicit 3D memory and show that the resulting models have more accurate and consistent predictions. For this, we develop an online 3D Gaussian Splatting (3DGS) technique to store predicted object-level segments generated throughout the duration of a video. Based on this 3DGS representation, a set of fusion techniques are developed, named FastSAM-Splat and SAM2-Splat, that use the explicit 3DGS memory to improve their respective foundation models’ predictions. Ablation experiments are used to validate the proposed techniques’ design and hyperparameter settings. Results from both real-world and simulated benchmarking experiments show that models which use explicit 3D memories result in more accurate and consistent predictions than those which use no memory or only implicit neural network memories. Project Page: https://topipari.com/projects/FastSAM-Splat

Index terms

Object Detection, Segmentation and Categorization; RGB-D Perception