Swimming under Constraints: A Safe Reinforcement Learning Framework for Quadrupedal Bio-Inspired Propulsion
Xinyu Cui, Fei Han, Hang Xu, Yongcheng Zeng, Luoyang Sun, RuiZhi Zhang, Jian Zhao, Haifeng Zhang, Weikun Li, Hao Chen, Jun Wang, Dixia Fan
AI summary
Problem
Bio-inspired aquatic propulsion systems generate high thrust but suffer from destabilizing lift fluctuations and pitch oscillations amplified by complex fluid-structure interactions. Naive reinforcement learning fails to safely balance thrust maximization with stability constraints during on-hardware training.
Approach
The authors formulate gait learning as a constrained optimization problem and propose ACPPO-PID, a safe RL algorithm with three components: a PID-regulated Lagrange multiplier that dynamically enforces stability constraints, conditional asymmetric clipping that accelerates learning, and cycle-wise geometric aggregation that stabilizes policy updates.
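For orientation, a constrained formulation of this kind is commonly written in the constrained-MDP form below. This is a generic sketch, not the paper's exact objective: the symbols $J_R$ (expected thrust return), $J_C$ (expected cost from destabilizing fluctuations), the limit $d$, and the gains $K_P$, $K_I$, $K_D$ are assumed names introduced here for illustration.

```latex
\begin{gather}
\max_{\pi}\; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d, \\
\mathcal{L}(\pi, \lambda) = J_R(\pi) - \lambda \bigl( J_C(\pi) - d \bigr), \qquad \lambda \ge 0, \\
e_k = J_C(\pi_k) - d, \\
\lambda_{k+1} = \Bigl[ K_P\, e_k + K_I \sum_{i \le k} e_i + K_D \bigl( e_k - e_{k-1} \bigr) \Bigr]_{+}
\end{gather}
```

Here $[\cdot]_{+}$ denotes projection onto the non-negative reals, so the penalty vanishes whenever the PID output is negative. Setting $K_P = K_D = 0$ recovers the usual integral-only Lagrange multiplier update.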
Key results
- Formulated quadrupedal swimming as a constrained thrust optimization problem
- Developed ACPPO-PID safe RL algorithm with PID-regulated constraints and asymmetric clipping
- Achieved higher thrust efficiency, stronger lift suppression, and faster convergence than state-of-the-art baselines in towing-tank experiments
- Enabled stable free-swimming via diagonal-phase policy transfer to a quadrupedal robot
Why it matters
This work advances robust underwater robotics by providing a practical, constraint-aware safe RL framework whose policies, trained on hardware in a towing tank, transfer to free swimming, enabling efficient and stable bio-inspired locomotion in complex fluid environments.
Abstract
Bio-inspired aquatic propulsion offers high thrust and maneuverability but is prone to destabilizing forces such as lift fluctuations, which are further amplified by six-degree-of-freedom (6-DoF) fluid coupling. We formulate quadrupedal swimming as a constrained optimization problem that maximizes forward thrust while minimizing destabilizing fluctuations. Our proposed framework, Accelerated Constrained Proximal Policy Optimization with a PID-regulated Lagrange multiplier (ACPPO-PID), uses that multiplier to enforce constraints, accelerates learning via conditional asymmetric clipping, and stabilizes updates through cycle-wise geometric aggregation. Initialized with imitation learning and refined through on-hardware towing-tank experiments, ACPPO-PID produces control policies that transfer effectively to quadrupedal free-swimming trials. Results demonstrate improved thrust efficiency, reduced destabilizing forces, and faster convergence compared with state-of-the-art baselines, underscoring the importance of constraint-aware safe RL for robust and generalizable bio-inspired locomotion in complex fluid environments.
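To make the idea of a PID-regulated Lagrange multiplier concrete, the following is a minimal Python sketch of how such a controller might be written. It is not the authors' implementation: the class name `PIDLagrangeMultiplier`, the helper `penalized_advantage`, the gain values, the cost limit, and the two clamps (non-negative integral, derivative only on rising cost) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDLagrangeMultiplier:
    """Generic PID controller for a Lagrange multiplier (illustrative only)."""

    kp: float = 0.1           # proportional gain (assumed value)
    ki: float = 0.01          # integral gain (assumed value)
    kd: float = 0.05          # derivative gain (assumed value)
    cost_limit: float = 1.0   # constraint threshold d (assumed value)
    integral: float = 0.0     # running sum of constraint violations
    prev_cost: Optional[float] = None  # cost seen at the previous update

    def update(self, measured_cost: float) -> float:
        """Return the new multiplier given the latest measured cost."""
        error = measured_cost - self.cost_limit
        # Assumed anti-windup choice: the integral never drops below zero.
        self.integral = max(0.0, self.integral + error)
        # Assumed choice: the derivative term reacts only to rising cost,
        # and is zero on the first call when no previous cost exists.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, measured_cost - self.prev_cost)
        self.prev_cost = measured_cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Project onto the non-negative reals: a multiplier cannot be negative.
        return max(0.0, raw)


def penalized_advantage(reward_adv: float, cost_adv: float, lam: float) -> float:
    """Combine reward and cost advantages into one policy-gradient signal.

    Dividing by (1 + lam) is a common rescaling that keeps the magnitude
    of the signal bounded as the multiplier grows.
    """
    return (reward_adv - lam * cost_adv) / (1.0 + lam)


if __name__ == "__main__":
    controller = PIDLagrangeMultiplier()
    for cost in (2.0, 3.0, 1.5, 0.8):
        lam = controller.update(cost)
        print(f"cost={cost:.2f}  lambda={lam:.3f}")
```

In a constrained PPO loop, a multiplier like this would be refreshed once per update cycle from the measured constraint cost (for example, a lift-fluctuation statistic) and then passed to `penalized_advantage` before the clipped policy loss is computed. The proportional and derivative terms are what let the penalty respond to a violation faster than an integral-only update would.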