ICRA 2026

Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation

Asim Unmesh, Ramesh Kaki, Rahul Jain, Mayank Patel, Karthik Ramani

AI summary

A training-free, two-stage pipeline built on Vision-Language Models achieves strong open-vocabulary zero-shot temporal action segmentation.

Open-Vocabulary Action Segmentation · Zero-Shot Learning · Vision-Language Models · Optimal Transport · Training-Free Pipeline · Temporal Action Segmentation

Problem

Existing temporal action segmentation methods are restricted to closed vocabularies and fixed label sets, so they cannot generalize to unseen actions or new domains. Collecting annotated datasets that cover every activity and every way of breaking it down is infeasible.

Approach

OVTAS matches video frames to candidate action labels using frame-action embedding similarity and enforces temporal consistency via an optimal transport-based decoder, requiring no task-specific training.
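The frame-action matching step can be illustrated with a minimal sketch. It assumes frame and label embeddings have already been extracted by some VLM; the random vectors, the array shapes, and the function name `frame_action_similarity` are illustrative stand-ins, not the released code.

```python
import numpy as np

def frame_action_similarity(frame_emb, text_emb):
    """Cosine similarity between every frame and every candidate label."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return f @ t.T  # shape: (num_frames, num_actions)

# Toy stand-ins for VLM embeddings (assumption: 3 labels, 9 frames, 16 dims).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 16))
labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Each frame embedding is its label embedding plus a little noise.
frame_emb = text_emb[labels_true] + 0.05 * rng.normal(size=(9, 16))

sim = frame_action_similarity(frame_emb, text_emb)
pred = sim.argmax(axis=1)  # per-frame label, no temporal consistency yet
```

Taking the per-frame argmax of `sim` treats every frame independently, which is why a second stage that enforces temporal consistency is needed on real video.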

Key results

Introduces OVTAS, a training-free two-stage pipeline for open-vocabulary zero-shot action segmentation.
Systematically benchmarks 14 VLMs across three datasets, revealing SigLIP's dominance and counterintuitive scaling trends.
Achieves strong segmentation performance on Breakfast, 50 Salads, and GTEA without task-specific supervision.
Releases code and pre-extracted VLM embeddings to lower computational barriers for future research.

Why it matters

It enables scalable, zero-shot action understanding for new domains and activities, providing a practical baseline and resource release for robotics, video analysis, and human-computer interaction researchers.

Abstract

Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision–Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: (i) Frame–Action Embedding Similarity (FAES) matches video frames to candidate action labels, and (ii) Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding. We release code and embeddings at our project page.
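To make the idea of decoding a frame-by-action similarity matrix with temporal consistency concrete, here is a hedged sketch. It is not the paper's SMTS decoder: it swaps in a plain moving-average smoothing followed by entropic optimal transport (Sinkhorn iterations) with uniform frame and action marginals. The window size, `eps`, the uniform action marginal, and the toy similarity values are all assumptions made for illustration.

```python
import numpy as np

def smooth_over_time(sim, window=5):
    """Moving-average each action's similarity curve along the time axis."""
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(sim, ((pad, pad), (0, 0)), mode="edge")
    cols = [np.convolve(padded[:, k], kernel, mode="valid")
            for k in range(sim.shape[1])]
    return np.stack(cols, axis=1)

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform frame and action marginals."""
    T, K = cost.shape
    a = np.full(T, 1.0 / T)   # each frame carries equal mass
    b = np.full(K, 1.0 / K)   # assumption: actions are equally frequent
    gibbs = np.exp(-cost / eps)
    u, v = np.ones(T), np.ones(K)
    for _ in range(n_iters):
        u = a / (gibbs @ v)
        v = b / (gibbs.T @ u)
    return u[:, None] * gibbs * v[None, :]

# Toy similarity matrix: 30 frames, 3 actions, 10 frames per action.
true_labels = np.repeat(np.arange(3), 10)
sim = np.full((30, 3), 0.2)
sim[np.arange(30), true_labels] = 0.8
sim[5] = [0.3, 0.2, 0.8]      # one spurious frame that favors action 2

raw = sim.argmax(axis=1)      # frame-wise decision keeps the glitch
plan = sinkhorn(1.0 - smooth_over_time(sim))
decoded = plan.argmax(axis=1) # glitch is absorbed by its neighbors
```

In this toy case the frame-wise argmax mislabels frame 5, while the smoothed transport plan assigns it to the same action as the surrounding frames. A balanced action marginal is a strong prior and would need to be relaxed for videos where action durations differ widely.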

Index terms

Deep Learning for Visual Perception; Recognition