Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking
Bastian Pätzold, Jan Nogga, Sven Behnke
AI summary
Problem
VLMs lack reliable grounding, fast inference, and machine-readable outputs for robotics, while existing open-vocabulary detectors struggle with complex attributes and require manual prompt engineering.
Approach
The pipeline uses a VLM to generate structured JSON descriptions of visible objects, grounds them with an open-vocabulary detector, and passes bounding boxes to a video segmentation model for real-time mask generation and tracking.
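To make the first step concrete, here is a minimal sketch of how a structured VLM response could be parsed and turned into text prompts for an open-vocabulary detector. The JSON schema (`objects`, `label`, `attributes`) and the names `ObjectDescription`, `parse_descriptions`, and `detector_prompts` are assumptions made for illustration; the summary above does not specify the actual output format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ObjectDescription:
    """One visible object instance as reported by the VLM (hypothetical schema)."""
    label: str
    attributes: dict = field(default_factory=dict)


def parse_descriptions(vlm_output: str) -> list[ObjectDescription]:
    """Parse the VLM's JSON text into one description per object instance."""
    data = json.loads(vlm_output)
    return [
        ObjectDescription(label=obj["label"], attributes=obj.get("attributes", {}))
        for obj in data["objects"]
    ]


def detector_prompts(descriptions: list[ObjectDescription]) -> list[str]:
    """Return the unique labels, in first-seen order, to use as detector prompts."""
    return list(dict.fromkeys(d.label for d in descriptions))


# Hypothetical VLM response: two mugs and one banana.
example_output = """
{"objects": [
  {"label": "mug", "attributes": {"color": "red", "graspable": true}},
  {"label": "banana", "attributes": {"ripe": true}},
  {"label": "mug", "attributes": {"color": "blue", "graspable": true}}
]}
"""

descriptions = parse_descriptions(example_output)
prompts = detector_prompts(descriptions)
```

Keeping one entry per instance, rather than one per class, preserves how many objects of each label the VLM reported, which a later curation step can use.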
Key results
- Unified pipeline for robust object identification, description, grounding, and tracking using off-the-shelf models
- Instance-aware assignment scheme to curate detector outputs and reduce duplicates
- Validation protocol for grounded descriptions compatible with standard detection benchmarks
- Real-world evaluation on a mobile manipulator and custom dataset with non-standard objects
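The instance-aware assignment scheme is not detailed in this summary. As a rough illustration of the general idea only, the sketch below keeps at most as many detections per label as the VLM reported and drops boxes that heavily overlap an already kept box. The greedy strategy, the IoU threshold, the dictionary layout of a detection, and the names `iou` and `curate` are all assumptions, not the authors' method.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def curate(detections, expected_counts, iou_threshold=0.5):
    """Greedily keep the highest-scoring detections (illustrative assumption).

    A detection is skipped if its label already has as many boxes as the
    VLM reported, or if it overlaps any kept box above the IoU threshold.
    """
    kept, counts = [], {}
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        label = det["label"]
        if counts.get(label, 0) >= expected_counts.get(label, 0):
            continue  # the VLM reported no further instances of this label
        if any(iou(k["box"], det["box"]) > iou_threshold for k in kept):
            continue  # near-duplicate of a box that was already kept
        kept.append(det)
        counts[label] = counts.get(label, 0) + 1
    return kept


# Hypothetical detector output: four "mug" boxes, but the VLM reported two mugs.
detections = [
    {"label": "mug", "box": (0, 0, 10, 10), "score": 0.9},
    {"label": "mug", "box": (1, 1, 10, 10), "score": 0.8},   # overlaps the first
    {"label": "mug", "box": (20, 20, 30, 30), "score": 0.7},
    {"label": "mug", "box": (40, 40, 50, 50), "score": 0.6},
]
curated = curate(detections, expected_counts={"mug": 2})
```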
Why it matters
Enables robots to perceive and interact with arbitrary, unseen objects in dynamic environments using only general-purpose foundation models, eliminating the need for manual prompt engineering or task-specific training.
Abstract
Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking leverages their strengths while mitigating these drawbacks. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing segmentation masks and tracking. Once initialized, this model directly extracts segmentation masks, processing image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments.
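The control flow described in the abstract can be sketched as a small loop: the slow describe-and-detect stage runs only at initialization or when an update is requested, while every frame goes through the fast segmentation stage. The class `PerceptionLoop` and its four callables are hypothetical stand-ins for the VLM, the open-vocabulary detector, and the video segmentation model; no real model API is implied.

```python
class PerceptionLoop:
    """Illustrative control flow only; all four callables are placeholders."""

    def __init__(self, describe, detect, init_tracks, propagate):
        self.describe = describe        # frame -> structured descriptions (VLM)
        self.detect = detect            # (frame, descriptions) -> bounding boxes
        self.init_tracks = init_tracks  # (frame, boxes) -> None, seeds the tracker
        self.propagate = propagate      # frame -> masks for the current tracks
        self.descriptions = []
        self.initialized = False

    def refresh(self, frame):
        """Run the slow path: describe, ground, and (re)initialize the tracks."""
        self.descriptions = self.describe(frame)
        boxes = self.detect(frame, self.descriptions)
        self.init_tracks(frame, boxes)
        self.initialized = True

    def step(self, frame, update=False):
        """Process one frame; take the slow path only when needed."""
        if update or not self.initialized:
            self.refresh(frame)
        return self.propagate(frame)
```

In use, `step(frame)` would be called for every frame of the stream, with `update=True` passed only when the scene is expected to have changed enough to warrant new descriptions and detections.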