ICRA 2026

Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

Sadman Sakib Enan, Junaed Sattar

AI summary

DAR-Net achieves 73.33% accuracy in classifying underwater diver activities by leveraging pixel-level semantic supervision, outperforming state-of-the-art models and addressing critical data scarcity.

Underwater robotics, diver activity recognition, human-robot collaboration, transformer networks, semantic segmentation, UDA dataset

Problem

Autonomous underwater vehicles struggle to recognize diver activities in low-visibility environments due to a severe lack of large-scale datasets, hindering safe and effective human-robot collaboration.

Approach

The authors propose DAR-Net, a transformer-based framework that jointly optimizes activity classification with pixel-level semantic segmentation to focus on relevant divers and robots, alongside the first Underwater Diver Activity (UDA) dataset of 2,640 annotated images.
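The joint optimization described above can be illustrated with a minimal sketch. It assumes the total loss is a weighted sum of an activity-classification cross-entropy and a mean per-pixel segmentation cross-entropy. The function names, the `seg_weight` parameter, and the three segmentation classes (background, diver, robot) are hypothetical and are not taken from the paper, which may use different loss terms or weighting.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    return -math.log(softmax(logits)[target])

def joint_loss(activity_logits, activity_label,
               pixel_logits, pixel_labels, seg_weight=0.5):
    """Weighted sum of a classification loss and a mean per-pixel loss.

    seg_weight is a hypothetical balancing factor (an assumption, not a
    value reported in the paper).
    """
    cls = cross_entropy(activity_logits, activity_label)
    seg = sum(
        cross_entropy(logits, label)
        for logits, label in zip(pixel_logits, pixel_labels)
    ) / len(pixel_labels)
    return cls + seg_weight * seg, cls, seg

# Toy example: six activity classes, four pixels, three assumed
# segmentation classes (0 = background, 1 = diver, 2 = robot).
activity_logits = [0.0] * 6
pixel_logits = [[0.0, 0.0, 0.0] for _ in range(4)]
total, cls, seg = joint_loss(activity_logits, 2, pixel_logits, [0, 1, 1, 2])
print(f"total={total:.4f} cls={cls:.4f} seg={seg:.4f}")
```

In a sketch like this, the segmentation term penalizes per-pixel errors on divers and robots, so gradients from both terms shape the shared features; how DAR-Net actually balances the two objectives is not stated in this summary.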

Key results

73.33% classification accuracy, outperforming state-of-the-art baselines
Release of the UDA dataset with 2,640 pixel-level annotated images across six activity categories
Semantic supervision significantly improves model attention on relevant scene elements
Robust performance across precision, recall, and F1-score metrics on held-out test data
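For readers unfamiliar with the metrics listed above, the following sketch shows how accuracy and per-class precision, recall, and F1-score can be computed from predicted and true labels. The labels are toy values, not results from the paper, and the macro averaging shown is an assumption; the summary does not say which averaging scheme the authors report.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def per_class_metrics(y_true, y_pred, num_classes):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    metrics = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

def macro_average(metrics):
    """Unweighted mean of each metric across classes (assumed scheme)."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))

# Toy labels for three classes; the UDA dataset has six activity categories.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
per_class = per_class_metrics(y_true, y_pred, num_classes=3)
print(accuracy(y_true, y_pred), macro_average(per_class))
```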

Why it matters

Provides the foundational dataset and recognition capability necessary for advancing safe, real-time collaboration between human divers and autonomous underwater vehicles.

Abstract

Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver’s activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human–robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

Index terms

Marine Robotics, Human-Robot Collaboration, Human-Robot Teaming