Teaching to Individual Needs: Bidirectional Teacher-Student Learning for Wheeled-Legged Locomotion
Guangsheng Li, Charles Wu, XinHua Zheng, Shiyu Zhu, Shenglan Liu
AI summary
Problem
When applied to wheeled-legged robots, the conventional Teacher-Student reinforcement learning paradigm suffers from two problems: multimodal confusion, which causes the student to average out diverse action modes, and low imitability, where the teacher generates actions the student cannot reliably reproduce.
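The averaging effect can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a one-dimensional action and hypothetical teacher samples clustered around two modes (-1.0 and +1.0) for one and the same student observation. A student trained with a squared-error loss converges toward the sample mean, which lies in neither mode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D teacher actions recorded for the same student observation:
# half of the samples sit near -1.0, the other half near +1.0.
teacher_actions = np.concatenate([
    rng.normal(-1.0, 0.05, size=500),
    rng.normal(+1.0, 0.05, size=500),
])

# The constant prediction that minimises mean squared error is the sample mean.
mse_optimal_action = teacher_actions.mean()

# Distance from that prediction to the closest action the teacher actually took.
gap_to_nearest_teacher_action = np.abs(teacher_actions - mse_optimal_action).min()

print(f"MSE-optimal action: {mse_optimal_action:+.3f}")
print(f"Gap to nearest teacher action: {gap_to_nearest_teacher_action:.3f}")
```

The MSE-optimal action lands near zero, far from every action the teacher demonstrated, which is the kind of averaged behaviour the summary describes.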
Approach
The authors propose a bidirectional Teacher-Student framework that uses a mixture density network to explicitly model and select the student's dominant action modes, alongside an imitation-aware reward that guides the teacher to generate actions the student can reliably reproduce.
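A minimal sketch of highest-weight component selection follows. It assumes the mixture density network emits unnormalised weights (logits) and one mean vector per component, and that "outputting the highest-weight component" means returning that component's mean. The function name, shapes, and example values are hypothetical and are not taken from the paper.

```python
import numpy as np

def highest_weight_component(logits, means):
    """Return the mean of the mixture component with the largest weight.

    logits: (K,) unnormalised mixture weights (hypothetical network output)
    means:  (K, action_dim) component means
    """
    logits = np.asarray(logits, dtype=float)
    means = np.asarray(means, dtype=float)
    # Numerically stable softmax to turn logits into mixture weights.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Pick the single dominant component instead of blending all of them.
    k = int(np.argmax(weights))
    return means[k], weights

# Hypothetical three-component mixture over a 2-D action.
logits = [0.2, 1.5, -0.3]
means = [[-1.0, 0.0], [1.0, 0.5], [0.0, 0.0]]

action, weights = highest_weight_component(logits, means)
blended = weights @ np.asarray(means)  # weighted average, shown only for contrast

print("selected action:", action)
print("blended action: ", blended)
```

Selecting one component keeps the output inside a mode the mixture actually models, whereas the weighted average can fall between modes.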
Key results
- Explicit modeling of multimodal action distributions via HWC-MDN to prevent behavioral averaging
- Imitation-Aware Reward using Mahalanobis distance to guide teacher action reproducibility
- Significantly improved training efficiency and traversability in simulation
- Real-world MagicDog-W navigation of 45 cm obstacles and 45° slopes using only proprioception
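The summary states that the Imitation-Aware Reward uses a Mahalanobis distance but does not give its exact form. The sketch below is one plausible reading, under loudly stated assumptions: the student's action distribution is a diagonal-covariance Gaussian, and the reward is an exponential of the negative distance. The function names, the exponential shaping, and the `scale` parameter are hypothetical.

```python
import numpy as np

def mahalanobis_distance(teacher_action, student_mean, student_std):
    """Mahalanobis distance under a diagonal-covariance Gaussian (assumption)."""
    a = np.asarray(teacher_action, dtype=float)
    mu = np.asarray(student_mean, dtype=float)
    sigma = np.asarray(student_std, dtype=float)
    return float(np.sqrt(np.sum(((a - mu) / sigma) ** 2)))

def imitation_reward(teacher_action, student_mean, student_std, scale=1.0):
    """Hypothetical shaping: 1.0 at zero distance, decaying toward 0."""
    d = mahalanobis_distance(teacher_action, student_mean, student_std)
    return float(np.exp(-scale * d))

# Hypothetical 2-D example: the student is more certain about the first dimension.
student_mean = [0.0, 0.0]
student_std = [0.5, 1.0]

near = imitation_reward([0.1, 0.1], student_mean, student_std)
far = imitation_reward([1.0, 0.0], student_mean, student_std)
print(f"reward near student mode: {near:.3f}, far from it: {far:.3f}")
```

Dividing by the student's standard deviation means a deviation costs more in dimensions where the student is confident, so the teacher is penalised most for actions the student is unlikely to produce.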
Why it matters
It enables robust sim-to-real locomotion for complex wheeled-legged robots in unstructured environments, advancing practical deployment for search-and-rescue and inspection tasks.
Abstract
Reinforcement Learning (RL) enables robust and adaptive locomotion in legged and wheeled-legged robots. A common approach is the Teacher-Student (TS) paradigm, in which a teacher policy with privileged information supervises a proprioceptive student. While the TS paradigm has proven effective on legged robots, we encounter two critical issues when applying it to wheeled-legged robots. The first is multimodal confusion: teacher actions become multimodal when conditioned on the student's proprioceptive observations, so the student outputs an average of the action modes. The second is low imitability of teacher actions, because the teacher does not account for whether the student can reproduce them. To address these issues, we propose Teaching to Individual Needs (TIN), a bidirectional TS framework. To mitigate multimodal confusion within the student policy, we design a Highest-Weight Component Mixture Density Network (HWC-MDN). With HWC-MDN, the TIN student explicitly models multimodal action distributions and outputs the highest-weight component. To improve imitability, we propose an Imitation-Aware Reward (IAR) that encourages the teacher to generate actions that the student can reproduce more reliably. Simulation experiments show that TIN significantly improves both training efficiency and traversability. Real-world tests show that TIN enables the wheeled-legged robot MagicDog-W to traverse 45 cm obstacles and ascend 45° slopes.
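The abstract does not state how the mixture density network is trained. Mixture density networks are commonly fitted by minimising the negative log-likelihood of the target under the predicted mixture, and the sketch below shows that standard objective for a diagonal Gaussian mixture. It is an assumption, not the paper's confirmed loss, and the function name and shapes are hypothetical.

```python
import numpy as np

def mdn_nll(logits, means, stds, target):
    """Negative log-likelihood of `target` under a diagonal Gaussian mixture.

    logits: (K,), means: (K, D), stds: (K, D), target: (D,)
    """
    logits = np.asarray(logits, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    target = np.asarray(target, dtype=float)

    # Log-softmax of the mixture weights.
    log_w = logits - logits.max()
    log_w = log_w - np.log(np.exp(log_w).sum())

    # Per-component log density of a diagonal Gaussian.
    z = (target - means) / stds
    log_p = -0.5 * np.sum(z ** 2 + 2.0 * np.log(stds) + np.log(2.0 * np.pi), axis=1)

    # Log-sum-exp over components, computed stably.
    joint = log_w + log_p
    m = joint.max()
    return float(-(m + np.log(np.exp(joint - m).sum())))

# Hypothetical bimodal 1-D mixture with equally weighted modes at -1 and +1.
logits = [0.0, 0.0]
means = [[-1.0], [1.0]]
stds = [[0.1], [0.1]]

nll_at_mode = mdn_nll(logits, means, stds, [1.0])
nll_at_midpoint = mdn_nll(logits, means, stds, [0.0])
print(f"NLL at a mode: {nll_at_mode:.3f}, at the midpoint: {nll_at_midpoint:.3f}")
```

Under this objective a target at the midpoint between the modes is very unlikely, so the model is not rewarded for placing probability mass on averaged actions.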