ICRA 2026

Text-Conditioned Beat Gesture Generation for a Social Robot Via a Conditional Variational Autoencoder

Alejandro Climent Peñalver, Enrique Fernandez-Rodicio, Álvaro Castro-González

AI summary

A lightweight CVAE model successfully generates smooth, text-conditioned beat gestures in real time on a resource-constrained social robot while strictly respecting kinematic limits.

Beat gesture generation, Conditional VAE, Social robotics, Real-time synthesis, Kinematic retargeting, Human-robot interaction

Problem

Social robots struggle to produce dynamic, open-domain gestures due to reliance on fixed templates and the tight computational and kinematic constraints of embedded platforms.

Approach

The system uses a BERT-conditioned CVAE to generate 2D upper-body poses from speech transcripts, which are deterministically mapped to the robot's limited joint space and executed in lockstep with audio via a state machine.
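As a rough illustration of what a deterministic 2D-to-joint mapping with limit enforcement might look like, the sketch below converts three 2D keypoints of one arm into two joint angles and clamps them. It is not the authors' algorithm: the joint names, the limit values, the image-coordinate convention (y grows downward), and the helpers `segment_angle`, `wrap_angle`, and `retarget_arm` are all assumptions made for this example.

```python
import math

# Hypothetical joint limits in radians; real values depend on the robot.
JOINT_LIMITS = {"shoulder_pitch": (-1.5, 1.5), "elbow_flex": (0.0, 2.0)}


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def wrap_angle(angle):
    """Wrap an angle to the range [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def segment_angle(start, end):
    """Angle of the 2D segment start->end, measured from the downward vertical.

    Assumes image coordinates, where y grows downward, so an arm hanging
    straight down yields an angle of 0.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    return math.atan2(dx, dy)


def retarget_arm(shoulder, elbow, wrist, limits=JOINT_LIMITS):
    """Map three 2D keypoints to two clamped joint angles (illustrative only)."""
    upper = segment_angle(shoulder, elbow)
    fore = segment_angle(elbow, wrist)
    raw = {
        "shoulder_pitch": upper,
        # Elbow flexion is taken as the forearm angle relative to the upper arm.
        "elbow_flex": wrap_angle(fore - upper),
    }
    # Enforce the (assumed) kinematic limits joint by joint.
    return {name: clamp(angle, *limits[name]) for name, angle in raw.items()}
```

Because the mapping is a pure function of the keypoints, the same pose always yields the same joint targets, and any angle outside the assumed limits is saturated rather than passed on to the robot.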

Key results

Smooth, text-conditioned beat gestures with solid fidelity and temporal diversity
Real-time execution on embedded hardware
Deterministic 2D-to-joint retargeting with kinematic enforcement
Seamless integration into standard HRI synchronization stacks

Why it matters

Provides a practical, low-latency solution for expressive nonverbal communication in resource-limited social robots, advancing natural human-robot interaction.

Abstract

Conversation can benefit from small rhythmic gestures that track prosody, reinforce structure, and help keep attention. However, many robots used in human–robot interaction still rely on fixed templates or clip libraries that scale poorly to open-domain interactions; moreover, embedded platforms impose tight limits on motion range, speeds, and timing. Consequently, gesture generation methods must be lightweight, stable, and easy to integrate. To address this need, this work presents a lightweight gesture-generation model that generates beat gestures in real time from the transcription of the robot’s speech. First, a Conditional Variational Autoencoder (CVAE) conditioned on sentence-level BERT embeddings is trained on 2D pose–text pairs to produce upper-body pose sequences. Next, a geometry-based retargeting algorithm deterministically maps those poses to the robot’s joints while enforcing kinematic limits. Finally, the joint sequence is converted into a pseudo-state machine and triggered in lockstep with the utterance. The results show that the system achieves smooth, text-conditioned beat gestures with solid fidelity and temporal diversity, and demonstrates real-time performance when integrated on a social robot.
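The final step, turning a joint sequence into timed states that advance with the utterance, can be pictured with the minimal sketch below. It is an assumption-laden illustration, not the paper's implementation: it spreads the frames evenly over a known utterance duration, and the names `build_states`, `active_state`, and the `"start"`/`"targets"` keys are hypothetical.

```python
def build_states(joint_frames, utterance_seconds):
    """Give each joint frame a start time, spreading frames evenly over the utterance."""
    if not joint_frames:
        return []
    step = utterance_seconds / len(joint_frames)
    return [
        {"start": index * step, "targets": frame}
        for index, frame in enumerate(joint_frames)
    ]


def active_state(states, elapsed_seconds):
    """Return the latest state whose start time has been reached.

    Before the first start time the first state is returned; after the last
    start time the last state is held.
    """
    current = states[0]
    for state in states:
        if state["start"] <= elapsed_seconds:
            current = state
        else:
            break
    return current


# Example: four frames for a single hypothetical joint over a 2-second utterance.
frames = [{"arm": 0.0}, {"arm": 0.5}, {"arm": 1.0}, {"arm": 0.2}]
states = build_states(frames, utterance_seconds=2.0)
```

A controller loop could call `active_state(states, elapsed)` with the time elapsed since audio playback began, so that gesture targets stay tied to the speech clock rather than to their own timer.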

Index terms

Human and Humanoid Motion Analysis and Synthesis; Natural Dialog for HRI; AI-Based Methods