ICRA 2026

Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence

Wenxin Li, Kunyu Peng, Di Wen, Ruiping LIU, Mengfei Duan, Kai Luo, Kailun Yang

AI summary

Key figure (auto-extracted from paper)

Different noise-robust learning strategies show distinct robustness profiles governed by a foreground-background trade-off, necessitating task-specific solutions like the proposed parallel mask head mechanism.

Action-based video segmentation, label noise, robust learning, embodied perception, benchmark, mask annotation

Problem

Action-based video object segmentation relies on costly, inconsistent annotations prone to multimodal noise, yet its robustness to text prompt errors and imprecise mask boundaries remains largely unexplored.

Approach

The authors introduce ActiSeg-NL, a benchmark injecting controlled text and mask annotation noise into the VISOR dataset, and adapt six noise-robust learning strategies alongside a new parallel mask head mechanism to evaluate and mitigate these disturbances.
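As a rough illustration of what controlled mask annotation noise can look like, the sketch below perturbs an object boundary by dilating or eroding a binary mask. It is a minimal example under stated assumptions, not the benchmark's actual procedure: the function names `perturb_mask` and `inject_mask_noise`, the 4-connected neighbourhood, and the `rate` and `steps` parameters are all hypothetical.

```python
import random


def perturb_mask(mask, mode, steps=1):
    """Dilate or erode a binary mask (list of rows of 0/1) by `steps` pixels.

    Hypothetical helper: morphological dilation/erosion is one simple way to
    mimic an imprecise boundary; the paper's exact perturbation may differ.
    """
    h, w = len(mask), len(mask[0])
    for _ in range(steps):
        out = [row[:] for row in mask]  # copy so the input is never mutated
        for y in range(h):
            for x in range(w):
                # 4-connected neighbours that fall inside the image
                nbrs = [mask[ny][nx]
                        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= ny < h and 0 <= nx < w]
                if mode == "dilate" and any(nbrs):
                    out[y][x] = 1  # grow the object outwards
                elif mode == "erode" and not all(nbrs):
                    out[y][x] = 0  # shrink the object inwards
        mask = out
    return mask


def inject_mask_noise(mask, rate, rng, steps=1):
    """With probability `rate` (assumed noise rate), return a perturbed mask."""
    if rng.random() >= rate:
        return mask
    return perturb_mask(mask, rng.choice(["dilate", "erode"]), steps)


# A 3x3 object centred in a 5x5 frame (area 9).
square = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
          for y in range(5)]
print(sum(map(sum, perturb_mask(square, "dilate"))))  # 21
print(sum(map(sum, perturb_mask(square, "erode"))))   # 1
```

Passing a seeded `random.Random` instance as `rng` keeps the corruption reproducible, which matters when several learning strategies are compared on the same noisy labels.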

Key results

Introduction of ActiSeg-NL benchmark for text, mask, and mixed label noise
Evaluation of six noise-robust strategies revealing distinct robustness profiles and foreground-background trade-offs
Identification of characteristic failure modes like boundary leakage and identity substitution
Proposal of the Parallel Mask Head Mechanism to mitigate mask annotation noise

Why it matters

The work establishes a clear sensitivity profile of action-based segmentation to imperfect annotations, guiding the development of robust perception systems for embodied AI and human-robot interaction.

Abstract

Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. These results establish a clear sensitivity profile of action-based video object segmentation to imperfect annotations and set a benchmark for studying noise-robust learning in embodied perception.
The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
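The two textual noise types named in the abstract, category flips and within-category noun substitutions, can be sketched as follows. This is only an illustration under assumptions: the toy `TAXONOMY`, the prompt wording, and the functions `corrupt_noun` and `inject_text_noise` are hypothetical and are not taken from the paper or its repository.

```python
import random

# Hypothetical toy taxonomy; the benchmark's real vocabulary is not given here.
TAXONOMY = {
    "cutlery": ["knife", "fork", "spoon"],
    "container": ["bowl", "pan", "cup"],
}


def corrupt_noun(noun, kind, rng):
    """Return a wrong noun for `noun`.

    kind="within": another noun from the same category (within-category substitution).
    kind="flip":   a noun from a different category (category flip).
    """
    home = next(c for c, nouns in TAXONOMY.items() if noun in nouns)
    if kind == "within":
        pool = [n for n in TAXONOMY[home] if n != noun]
    elif kind == "flip":
        pool = [n for c, nouns in TAXONOMY.items() if c != home for n in nouns]
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return rng.choice(pool)


def inject_text_noise(prompt, noun, kind, rate, rng):
    """With probability `rate` (assumed noise rate), swap `noun` in the prompt."""
    if rng.random() >= rate:
        return prompt
    return prompt.replace(noun, corrupt_noun(noun, kind, rng), 1)


rng = random.Random(0)  # seeded so the corrupted prompts are reproducible
prompt = "knife used to cut onion"  # illustrative prompt format, an assumption
print(inject_text_noise(prompt, "knife", "within", 1.0, rng))
print(inject_text_noise(prompt, "knife", "flip", 1.0, rng))
```

Keeping the two kinds separate makes it possible to measure them independently: a within-category substitution leaves the prompt plausible but ambiguous, while a category flip points the model at a different kind of object altogether.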

Index terms

Deep Learning for Visual Perception; Object Detection, Segmentation and Categorization; Semantic Scene Understanding