ICRA 2026

SRPO: Self-Reflection Policy Optimization for Stable and Robust Autonomous Driving

Dejin Wang, Seyede Fatemeh Ghoreishi

AI summary

SRPO stabilizes reinforcement learning for autonomous driving by using a policy's own historical performance to generate scale-invariant reward signals, significantly improving robustness and sample efficiency.

Reinforcement learning, autonomous driving, reward shaping, credit assignment, policy optimization, robustness

Problem

Conventional reinforcement learning methods for autonomous driving often exhibit unstable convergence, high sensitivity to reward scaling, and brittle behavior under distribution shifts or adversarial conditions.

Approach

SRPO benchmarks each training iteration against a slowly updated historical policy reference to compute a relative improvement score, then redistributes this signal across trajectory steps using a rank-based, scale-invariant credit assignment mechanism.
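The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the exact formulas, so the exponential moving average used as the historical reference, the normalized and clipped improvement score, the linear rank weights, and all names (`HistoricalReference`, `rank_weights`, `shaping_terms`, `momentum`, `eps`) are assumptions chosen for illustration. How a negative improvement should be spread over steps is also not specified in the summary; the sketch simply reuses the same weights.

```python
import numpy as np


class HistoricalReference:
    """Slowly updated scalar reference of past episode returns.

    Assumption: an exponential moving average stands in for the
    paper's "slowly updated historical policy reference".
    """

    def __init__(self, momentum=0.95):
        self.momentum = momentum
        self.value = None

    def update(self, episode_return):
        g = float(episode_return)
        if self.value is None:
            self.value = g  # first episode initializes the reference
        else:
            self.value = self.momentum * self.value + (1.0 - self.momentum) * g
        return self.value


def rank_weights(step_rewards):
    """Per-step weights that depend only on the ordering of the rewards."""
    r = np.asarray(step_rewards, dtype=float)
    # argsort of argsort yields 0-based ranks; ties are broken by position.
    ranks = np.argsort(np.argsort(r, kind="stable"), kind="stable") + 1
    return ranks / ranks.sum()


def shaping_terms(step_rewards, reference_return, eps=1e-8):
    """Spread a relative-improvement score over the steps of one trajectory.

    Assumption: improvement is the return gap normalized by the magnitude
    of the reference and clipped to [-1, 1], so multiplying every reward
    (and the reference) by a positive constant leaves the result unchanged.
    """
    r = np.asarray(step_rewards, dtype=float)
    gap = r.sum() - reference_return
    improvement = np.clip(gap / max(abs(reference_return), eps), -1.0, 1.0)
    return improvement * rank_weights(r)


if __name__ == "__main__":
    rewards = [0.1, 0.5, -0.2, 1.0]
    print(shaping_terms(rewards, reference_return=1.0))
    # Same trajectory with every reward and the reference scaled by 10.
    print(shaping_terms([10 * x for x in rewards], reference_return=10.0))
```

Both calls in the usage example produce the same shaping terms, which is the scale-invariance property the approach relies on: only the ordering of step rewards and the relative gap to the reference enter the signal.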

Key results

Self-reflective RL framework benchmarking policies against historical references
Rank-based step-level credit assignment for scale-invariant reward shaping
Theoretical guarantees for policy optimality and convergence preservation
Empirical gains in training stability, sample efficiency, and robustness in Highway-env and CARLA

Why it matters

Provides a theoretically grounded, plug-and-play RL enhancement that makes autonomous driving policies more reliable and robust to real-world uncertainties without requiring complex adversarial setups or reward tuning.

Abstract

Autonomous driving demands reinforcement learning (RL) agents that are not only performant, but also stable, sample-efficient, and robust to uncertainty. However, conventional policy optimization methods often suffer from unstable convergence, sensitivity to reward scaling, and limited generalization in safety-critical or out-of-distribution scenarios. We propose Self-Reflection Policy Optimization (SRPO), a principled, model-free framework that introduces policy-level self-evaluation by benchmarking each policy iteration against its own historical performance. This self-reflection yields a reward-shaping signal based on relative improvement, which is redistributed across trajectory steps using a rank-based credit assignment mechanism. This design emphasizes informative steps, eliminates dependence on absolute reward magnitudes, and improves stability in practice. We theoretically show that a bounds-based variant of SRPO preserves policy optimality and convergence. Empirically, we evaluate SRPO on both Highway-env and the high-fidelity CARLA simulator under adversarial perturbations and out-of-distribution driving conditions. SRPO consistently improves training efficiency, robustness, and policy performance compared to the baseline techniques. These results position SRPO as a promising and theoretically grounded approach to more reliable decision-making for autonomous driving. The source code is available at: https://github.com/dejin-wang/SRPO_anonymous_code.

Index terms

Autonomous Vehicle Navigation, Planning under Uncertainty