ICRA 2026

Causal Transformer-Based Online Action Recognition for High-Level Control of a Unitree Go1 Robot

Chaitanya Bandi, Kristof Kitz, Ulrike Thomas

AI summary

A novel causal transformer model achieves 98.4% accuracy on a new human-robot interaction dataset, enabling reliable real-time skeleton-based control of a quadruped robot.

Causal Transformer, Online Action Recognition, Skeleton-Based Control, Human-Robot Interaction, Quadruped Robot, Spatial-Attention Tokenization

Problem

Current skeleton-based action recognition methods are largely optimized for offline benchmarks and lack the real-time, causal processing required for safe, high-level control of quadruped robots during human-robot interaction.

Approach

The authors introduce a causal transformer architecture that uses Spatial-Attention Tokenization to group joint features into meaningful soft tokens and Multi-Resolution Causal Temporal Mixing to combine causal convolutions with self-attention for future-free, real-time temporal modeling.
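The idea of grouping joint features into soft tokens can be illustrated with generic attention pooling. The following is a minimal sketch, not the authors' SAT implementation: the function name `soft_tokenize`, the learned-query formulation, and the sizes (25 joints, 16 feature channels, 5 tokens) are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_tokenize(joint_feats, queries):
    """Pool per-joint features into K soft tokens (hypothetical formulation).

    joint_feats: (J, D) features for the J joints of one frame.
    queries:     (K, D) one query vector per token (learned in a real model,
                 random here).
    Returns tokens (K, D) and the soft assignment weights (K, J).
    """
    d = joint_feats.shape[-1]
    scores = queries @ joint_feats.T / np.sqrt(d)  # (K, J) similarity
    weights = softmax(scores, axis=-1)             # each token: distribution over joints
    tokens = weights @ joint_feats                 # weighted average of joint features
    return tokens, weights

# Illustrative sizes only (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
J, D, K = 25, 16, 5
joint_feats = rng.normal(size=(J, D))
queries = rng.normal(size=(K, D))
tokens, weights = soft_tokenize(joint_feats, queries)
```

Because every row of `weights` is a probability distribution over joints, each token is a soft mixture of joints rather than a hard, hand-defined body-part group.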

Key results

98.4% accuracy on the newly introduced GoHAR-12 dataset
Superior performance in distinguishing highly similar gestures compared to baseline causal models
Competitive results on public benchmarks NTU-RGB+D 60/120 and NW-UCLA
Successful real-time, closed-loop control demonstrations on the Unitree Go1 robot

Why it matters

Provides a robust, real-time perception pipeline that bridges the gap between academic action recognition research and practical, safety-critical quadruped robot control.

Abstract

We present a new causal transformer system that combines Spatial-Attention Tokenization (SAT) with Multi-Resolution Causal Temporal Mixing (MRCTM) to perform online skeleton-based action recognition during human–robot interaction. SAT generates soft tokens from human joint groups. MRCTM applies causal convolutions and self-attention to capture both fine-grained motion patterns and long-range temporal relationships. We introduce the GoHAR-12 dataset for evaluation; it contains 12 gesture and posture classes recorded in human-robot interaction (HRI) settings that translate directly to high-level commands for the Unitree Go1 quadruped. The proposed model reaches 98.4% accuracy on GoHAR-12, shows superior performance in distinguishing between actions with similar motion, and maintains strong results on public benchmarks such as NTU-RGB+D and NW-UCLA. We demonstrate that the causal transformer enables reliable real-time skeleton-based control of the Unitree Go1 robot.
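The "future-free" property of combining causal convolutions with self-attention can be sketched as follows. This is an illustrative approximation under stated assumptions, not the MRCTM block from the paper: "multi-resolution" is interpreted here as several kernel lengths, the kernels are fixed averages rather than learned weights, and the names `causal_conv1d`, `causal_self_attention`, and `causal_mix` are hypothetical.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal temporal convolution: output[t] uses only x[t-k+1 .. t].

    x: (T, D) sequence of frame features; kernel: (k,) shared across channels.
    """
    k = len(kernel)
    # Left-pad with zeros so no future frame is ever touched.
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return np.stack([(kernel[:, None] * padded[t:t + k]).sum(axis=0)
                     for t in range(x.shape[0])])

def causal_self_attention(x):
    """Single-head self-attention where frame t attends only to frames <= t."""
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # mask out future frames
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ x

def causal_mix(x, kernels):
    """Average several causal convolutions (short-range) and add causal attention (long-range)."""
    local = sum(causal_conv1d(x, k) for k in kernels) / len(kernels)
    return local + causal_self_attention(x)

# Illustrative input: 12 frames, 4 channels; two kernel lengths (assumed values).
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4))
kernels = [np.array([0.5, 0.5]), np.ones(4) / 4]
y = causal_mix(x, kernels)

# Perturbing frames 8..11 must leave outputs for frames 0..7 unchanged.
x_future_changed = x.copy()
x_future_changed[8:] += 10.0
y_future_changed = causal_mix(x_future_changed, kernels)
```

Comparing `y[:8]` with `y_future_changed[:8]` shows that outputs for earlier frames do not change when later frames change, which is the property an online recognizer needs to emit a decision at each incoming frame.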

Index terms

Human Detection and Tracking; Gesture, Posture and Facial Expressions; Deep Learning Methods