ICRA 2026

Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking

Bastian Pätzold, Jan Nogga, Sven Behnke

AI summary

Integrating off-the-shelf vision-language models with open-vocabulary detectors and video segmentation enables real-time, open-vocabulary instance segmentation and tracking without task-specific training.

open-vocabulary detection, instance segmentation, visual tracking, vision-language models, robotics perception, foundation models

Problem

VLMs lack reliable grounding, fast inference, and machine-readable outputs for robotics, while existing open-vocabulary detectors struggle with complex attributes and require manual prompt engineering.

Approach

The pipeline uses a VLM to generate structured JSON descriptions of visible objects, grounds them with an open-vocabulary detector, and passes bounding boxes to a video segmentation model for real-time mask generation and tracking.
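The data flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the JSON schema (`label`, `attributes`), the `ObjectDescription` and `Track` types, and the `fake_vlm` and `fake_detector` stand-ins are all hypothetical, chosen only to show how structured descriptions might be parsed and paired with detector boxes before being handed to a video segmentation model.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional

Box = tuple  # assumed layout: (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ObjectDescription:
    """One object instance as described by the VLM (hypothetical schema)."""
    label: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Track:
    """A description paired with the box that would seed the video segmenter."""
    description: ObjectDescription
    box: Box


def parse_descriptions(vlm_json: str) -> list[ObjectDescription]:
    """Parse the VLM's structured JSON output into typed records."""
    items = json.loads(vlm_json)
    return [
        ObjectDescription(label=item["label"], attributes=item.get("attributes", {}))
        for item in items
    ]


def initialize_tracks(
    image,
    describe: Callable[[object], str],
    detect: Callable[[object, str], Optional[Box]],
) -> list[Track]:
    """Describe the image, ground each description, keep only grounded ones."""
    tracks = []
    for description in parse_descriptions(describe(image)):
        box = detect(image, description.label)
        if box is not None:  # drop descriptions the detector could not ground
            tracks.append(Track(description, box))
    return tracks


# Stand-ins for the real models (assumptions, for illustration only).
def fake_vlm(image) -> str:
    return json.dumps([
        {"label": "red mug", "attributes": {"graspable": True}},
        {"label": "unicorn", "attributes": {}},
    ])


def fake_detector(image, prompt: str) -> Optional[Box]:
    known_boxes = {"red mug": (10, 20, 110, 140)}
    return known_boxes.get(prompt)


tracks = initialize_tracks(image=None, describe=fake_vlm, detect=fake_detector)
```

In this sketch, the boxes stored in `tracks` are what would be passed on to the video segmentation model. The slow description and detection steps run only when tracks are initialized or refreshed, which reflects the division of labor the summary describes.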

Key results

Unified pipeline for robust object identification, description, grounding, and tracking using off-the-shelf models
Instance-aware assignment scheme to curate detector outputs and reduce duplicates
Validation protocol for grounded descriptions compatible with standard detection benchmarks
Real-world evaluation on a mobile manipulator and custom dataset with non-standard objects
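
The summary does not specify how the instance-aware assignment listed above works. As a rough illustration of the general idea only, the sketch below uses a simple greedy rule that is an assumption, not the paper's scheme: each described instance receives at most one box, and a box is rejected if it overlaps an already assigned box too strongly. The function names, the candidate tuple layout, and the `iou_threshold` default are all hypothetical.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_instances(candidates, iou_threshold: float = 0.5) -> dict:
    """Greedy one-to-one assignment of detector boxes to described instances.

    candidates: iterable of (description_id, box, score) tuples.
    Returns a dict mapping description_id to its assigned box.
    """
    assigned = {}
    # Visit candidates from most to least confident.
    for desc_id, box, _score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if desc_id in assigned:
            continue  # this instance already has a box
        if any(iou(box, kept) >= iou_threshold for kept in assigned.values()):
            continue  # box duplicates one that is already assigned
        assigned[desc_id] = box
    return assigned


candidates = [
    (0, (0, 0, 10, 10), 0.9),
    (1, (1, 1, 10, 10), 0.8),    # overlaps the box chosen for instance 0
    (1, (20, 20, 30, 30), 0.6),
    (0, (0, 0, 9, 9), 0.5),      # instance 0 is already assigned
]
assignment = assign_instances(candidates)
```

Here instance 1 falls back to its second-best box because its top-scoring candidate nearly coincides with the box already given to instance 0.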

Why it matters

Enables robots to perceive and interact with arbitrary, unseen objects in dynamic environments using only general-purpose foundation models, eliminating the need for manual prompt engineering or task-specific training.

Abstract

Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking leverages their strengths while mitigating these drawbacks. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing segmentation masks and tracking. Once initialized, this model directly extracts segmentation masks, processing image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments.

Index terms

Object Detection, Segmentation and Categorization; Semantic Scene Understanding; Visual Tracking