Text-Conditioned Beat Gesture Generation for a Social Robot Via a Conditional Variational Autoencoder
Alejandro Climent Peñalver, Enrique Fernandez-Rodicio, Álvaro Castro-González
AI summary
Problem
Social robots struggle to produce dynamic, open-domain gestures due to reliance on fixed templates and the tight computational and kinematic constraints of embedded platforms.
Approach
The system uses a BERT-conditioned CVAE to generate 2D upper-body poses from speech transcripts, which are deterministically mapped to the robot's limited joint space and executed in lockstep with audio via a state machine.
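For orientation, the standard CVAE training objective (the conditional evidence lower bound) can be written as below. This is the textbook form, not necessarily the exact loss used in this work. The symbols are assumptions introduced here: x is an upper-body pose sequence, c is the BERT embedding of the transcript, and z is the latent variable.

```latex
\mathcal{L}(\theta, \phi; x, c)
  = \mathbb{E}_{q_\phi(z \mid x, c)}\bigl[\log p_\theta(x \mid z, c)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x, c) \,\|\, p(z \mid c)\bigr)
```

At generation time, a model of this kind typically samples z from the prior and decodes it together with c, which is one way text-conditioned output can still vary between runs.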
Key results
- High-fidelity, text-conditioned beat gesture synthesis
- Real-time execution on embedded hardware
- Deterministic 2D-to-joint retargeting with kinematic enforcement
- Seamless integration into standard HRI synchronization stacks
Why it matters
Provides a practical, low-latency solution for expressive nonverbal communication in resource-limited social robots, advancing natural human-robot interaction.
Abstract
Conversation can benefit from small rhythmic gestures that track prosody, reinforce structure, and help hold attention. However, many robots used in human–robot interaction still rely on fixed templates or clip libraries that scale poorly to open-domain interactions; moreover, embedded platforms impose tight limits on motion range, speed, and timing. Consequently, gesture generation methods must be lightweight, stable, and easy to integrate. To address this need, this work presents a lightweight model that generates beat gestures in real time from the transcription of the robot’s speech. First, a Conditional Variational Autoencoder (CVAE) conditioned on sentence-level BERT embeddings is trained on 2D pose–text pairs to produce upper-body pose sequences. Next, a geometry-based retargeting algorithm deterministically maps those poses to the robot’s joints while enforcing kinematic limits. Finally, the joint sequence is converted into a pseudo-state machine and triggered in lockstep with the utterance. The results show that the system produces smooth, text-conditioned beat gestures with solid fidelity and temporal diversity, and that it runs in real time when integrated on a social robot.
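The retargeting step can be illustrated with a minimal sketch, assuming a single planar arm described by three 2D keypoints in image coordinates (y grows downward). The function names, the two-joint layout, and the limit values in `JOINT_LIMITS` are hypothetical and chosen only for illustration; the robot's actual joints and limits are not given in this abstract.

```python
import math

# Hypothetical joint limits in radians (assumed values, not the robot's real ones).
JOINT_LIMITS = {"shoulder": (-1.0, 1.0), "elbow": (0.0, 1.5)}


def segment_angle(p_from, p_to):
    """Angle of the segment p_from -> p_to, measured from the downward
    vertical axis of the image (0 rad means the segment points straight down)."""
    dx = p_to[0] - p_from[0]
    dy = p_to[1] - p_from[1]
    return math.atan2(dx, dy)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def retarget_arm(shoulder, elbow, wrist):
    """Map three 2D keypoints to two joint angles and enforce the limits.

    The mapping is purely geometric, so the same pose always yields
    the same joint command.
    """
    upper = segment_angle(shoulder, elbow)
    fore = segment_angle(elbow, wrist)
    # Relative forearm angle, wrapped to [-pi, pi].
    diff = fore - upper
    elbow_flex = math.atan2(math.sin(diff), math.cos(diff))
    return {
        "shoulder": clamp(upper, *JOINT_LIMITS["shoulder"]),
        "elbow": clamp(elbow_flex, *JOINT_LIMITS["elbow"]),
    }
```

For example, `retarget_arm((0.0, 0.0), (0.0, 1.0), (1.0, 1.0))` describes a forearm bent at a right angle; the raw elbow angle of about 1.57 rad exceeds the assumed upper limit and is clamped to 1.5 rad. Clamping is the simplest way to enforce limits; a real system might also limit joint velocities between consecutive frames.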