SRPO: Self-Reflection Policy Optimization for Stable and Robust Autonomous Driving
Dejin Wang, Seyede Fatemeh Ghoreishi
AI summary
Problem
Conventional reinforcement learning methods for autonomous driving often exhibit unstable convergence, high sensitivity to reward scaling, and brittle behavior under distribution shifts or adversarial conditions.
Approach
SRPO benchmarks each training iteration against a slowly updated historical policy reference to compute a relative improvement score, then redistributes this signal across trajectory steps using a rank-based, scale-invariant credit assignment mechanism.
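The rank-based redistribution step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `rank_weights` and `redistribute`, the choice to rank steps by their raw per-step reward, and the linear rank-to-weight mapping are all assumptions made for illustration. The point is only that weights computed from ranks do not change when every reward is multiplied by a positive constant.

```python
def rank_weights(step_rewards):
    """Map per-step rewards to weights that depend only on their ranks.

    Hypothetical helper: the lowest reward gets rank 1, the highest gets
    rank n, and weights are ranks normalized to sum to 1.
    """
    n = len(step_rewards)
    if n == 0:
        return []
    # Sort step indices by reward; ties keep their original step order
    # (an arbitrary tie-breaking choice made for this sketch).
    order = sorted(range(n), key=lambda i: step_rewards[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    total = n * (n + 1) / 2  # sum of 1..n
    return [r / total for r in ranks]


def redistribute(signal, step_rewards):
    """Spread a scalar trajectory-level signal over steps by rank weight."""
    return [signal * w for w in rank_weights(step_rewards)]


rewards = [0.1, 2.0, -1.0]
weights = rank_weights(rewards)
scaled_weights = rank_weights([100.0 * r for r in rewards])
shaped = redistribute(1.5, rewards)
```

Here `weights` and `scaled_weights` are identical because scaling by a positive constant preserves the ordering of the rewards, and the entries of `shaped` sum to the original signal of 1.5.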
Key results
- Self-reflective RL framework benchmarking policies against historical references
- Rank-based step-level credit assignment for scale-invariant reward shaping
- Theoretical guarantees for policy optimality and convergence preservation
- Empirical gains in training stability, sample efficiency, and robustness in Highway-env and CARLA
Why it matters
SRPO provides a theoretically grounded, plug-and-play RL enhancement that makes autonomous driving policies more reliable and more robust to real-world uncertainty, without requiring complex adversarial setups or reward tuning.
Abstract
Autonomous driving demands reinforcement learning (RL) agents that are not only performant, but also stable, sample-efficient, and robust to uncertainty. However, conventional policy optimization methods often suffer from unstable convergence, sensitivity to reward scaling, and limited generalization in safety-critical or out-of-distribution scenarios. We propose Self-Reflection Policy Optimization (SRPO), a principled, model-free framework that introduces policy-level self-evaluation by benchmarking each policy iteration against its own historical performance. This self-reflection yields a reward-shaping signal based on relative improvement, which is redistributed across trajectory steps using a rank-based credit assignment mechanism. This design emphasizes informative steps, eliminates dependence on absolute reward magnitudes, and improves stability in practice. We theoretically show that a bounds-based variant of SRPO preserves policy optimality and convergence. Empirically, we evaluate SRPO on both Highway-env and the high-fidelity CARLA simulator under adversarial perturbations and out-of-distribution driving conditions. SRPO consistently improves training efficiency, robustness, and policy performance compared to baseline techniques. These results position SRPO as a promising and theoretically grounded approach to more reliable decision-making for autonomous driving. The source code is available at: https://github.com/dejin-wang/SRPO_anonymous_code.
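The idea of benchmarking each iteration against historical performance can be sketched as follows. This is a simplified illustration rather than the method in the paper: the class `HistoricalReference`, the use of an exponential moving average of episode returns as the "history", and the mixing rate `tau` are all assumptions. The abstract does not specify how the reference is maintained, and the paper's reference may be a policy rather than a scalar return.

```python
class HistoricalReference:
    """Hypothetical slowly updated reference of past performance.

    Tracks an exponential moving average (EMA) of returns and reports how
    much the current iteration improves on it. A small `tau` means the
    reference changes slowly.
    """

    def __init__(self, tau=0.05):
        self.tau = tau
        self.value = None  # no history before the first iteration

    def improvement(self, current_return):
        """Return the improvement over the reference, then update it."""
        if self.value is None:
            # First iteration: nothing to compare against yet.
            self.value = current_return
            return 0.0
        score = current_return - self.value
        # EMA update: move the reference a fraction `tau` toward the new return.
        self.value = (1 - self.tau) * self.value + self.tau * current_return
        return score


ref = HistoricalReference(tau=0.5)
first = ref.improvement(10.0)   # no history yet
second = ref.improvement(14.0)  # compared against a reference of 10.0
third = ref.improvement(12.0)   # reference has moved to 12.0
```

In this sketch the score is a plain difference, so it still scales with the reward; in the approach described above, independence from absolute reward magnitudes comes from the rank-based redistribution applied afterwards.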