Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation
Asim Unmesh, Ramesh Kaki, Rahul Jain, Mayank Patel, Karthik Ramani
AI summary
Problem
Existing temporal action segmentation methods are constrained to closed vocabularies and fixed label sets, limiting their ability to generalize to unseen actions or diverse domains due to the infeasibility of collecting comprehensive annotated datasets.
Approach
OVTAS matches video frames to candidate action labels using frame-action embedding similarity and enforces temporal consistency via an optimal transport-based decoder, requiring no task-specific training.
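To make the idea of an optimal transport-based decoder concrete, the following is a minimal sketch of generic entropic optimal transport (Sinkhorn iterations) applied to a frame-by-action similarity matrix. It is an illustration under stated assumptions, not the authors' decoder: the function name `sinkhorn_assign`, the uniform marginals, the cost `1 - similarity`, and the values of `eps` and `n_iters` are all hypothetical choices, and the sketch includes no explicit temporal prior.

```python
import numpy as np

def sinkhorn_assign(sim, eps=0.1, n_iters=200):
    """Entropic optimal transport between T frames and K actions.

    Assumption: uniform marginals over frames and over actions, and
    cost = 1 - similarity. These are illustrative choices only.
    """
    n_frames, n_actions = sim.shape
    kernel = np.exp(-(1.0 - sim) / eps)          # Gibbs kernel of the cost
    row_marginal = np.full(n_frames, 1.0 / n_frames)
    col_marginal = np.full(n_actions, 1.0 / n_actions)
    u = np.ones(n_frames)
    v = np.ones(n_actions)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)          # rescale rows
        v = col_marginal / (kernel.T @ u)        # rescale columns
    return u[:, None] * kernel * v[None, :]      # transport plan, shape (T, K)

# Toy similarity matrix: 6 frames, 3 actions, two frames per action.
true_labels = np.array([0, 0, 1, 1, 2, 2])
sim = np.full((6, 3), 0.1)
sim[np.arange(6), true_labels] = 0.9

plan = sinkhorn_assign(sim)
labels = plan.argmax(axis=1)                     # per-frame action index
```

Each row of `plan` spreads one frame's mass over the candidate actions; taking the row-wise argmax yields a hard per-frame label. The uniform column marginal encourages every candidate action to receive some frames, which is a modeling assumption that may not suit videos where some candidate actions never occur.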
Key results
- Introduces OVTAS, a training-free two-stage pipeline for open-vocabulary zero-shot action segmentation.
- Systematically benchmarks 14 VLMs across three datasets, revealing SigLIP's dominance and counterintuitive scaling trends.
- Achieves strong segmentation performance on Breakfast, 50 Salads, and GTEA without task-specific supervision.
- Releases code and pre-extracted VLM embeddings to lower computational barriers for future research.
Why it matters
OVTAS enables scalable, zero-shot action understanding for new domains and activities, and provides a practical baseline and resource release for researchers in robotics, video analysis, and human-computer interaction.
Abstract
Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we study the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision–Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: (i) Frame–Action Embedding Similarity (FAES) matches video frames to candidate action labels, and (ii) Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding. We release code and embeddings at our project page.
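The segmentation-by-classification design can be illustrated with a small NumPy sketch. It assumes frame and label embeddings have already been extracted by some VLM (the toy 2-D vectors below are made up), computes a cosine-similarity matrix in the spirit of FAES, and then applies a plain moving average over time as a simple stand-in for a temporal-consistency step. The moving average is not the SMTS method, and the function names and the `window` parameter are hypothetical.

```python
import numpy as np

def frame_action_similarity(frame_emb, text_emb):
    """Cosine similarity between every frame and every candidate label.

    frame_emb: (T, D) frame embeddings; text_emb: (K, D) label embeddings.
    Returns a (T, K) similarity matrix.
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return f @ t.T

def smooth_over_time(sim, window=3):
    """Moving average along the time axis (odd window, edge padding).

    Assumption: a simple placeholder for temporal consistency,
    not the paper's SMTS stage.
    """
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(sim, ((pad, pad), (0, 0)), mode="edge")
    cols = [np.convolve(padded[:, k], kernel, mode="valid")
            for k in range(sim.shape[1])]
    return np.stack(cols, axis=1)

# Toy data: two candidate actions, eight frames, one noisy frame at index 2.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
action_0, action_1 = [0.9, 0.1], [0.1, 0.9]
frame_emb = np.array([action_0, action_0, action_1, action_0,
                      action_0, action_1, action_1, action_1])

sim = frame_action_similarity(frame_emb, text_emb)
raw_labels = sim.argmax(axis=1)                       # per-frame classification
smooth_labels = smooth_over_time(sim).argmax(axis=1)  # after temporal smoothing
```

In this toy case the raw per-frame labels contain a one-frame spurious segment at index 2, which the smoothing step absorbs into the surrounding segment. Because the candidate labels enter only as text embeddings, swapping in a different label set requires no retraining, which is what makes the setup open-vocabulary.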