Towards Accurate and Robust Dynamics and Reward Modeling for Model-Based Offline Inverse Reinforcement Learning
Gengyu Zhang, Yan Yan
Abstract
This paper improves model-based offline inverse reinforcement learning (IRL) by refining the conservative Markov decision process (MDP) framework, which traditionally uses uncertainty penalties to deter exploitation of uncertain regions. Existing methods rely on neural network ensembles to model the MDP dynamics and quantify uncertainty through heuristics over the ensemble predictions. This approach has several limitations. First, it presumes Gaussian-distributed state transitions, which oversimplifies the environment. Second, ensemble modeling often yields high variance, indicating potential overfitting and poor generalization. Third, heuristic uncertainty quantification captures environmental complexity only partially and so offers an incomplete basis for informed decisions. Fourth, maintaining multiple models demands substantial computational resources. To address these shortcomings, we propose modeling the dynamics with score-based diffusion generative models. This choice broadens the class of representable transition distributions beyond Gaussians. It improves the accuracy of transition modeling and grounds uncertainty quantification in the theory of diffusion models, enabling more precise and dependable reward regularization. We further incorporate a transition stability regularizer (TSR) into reward estimation. The TSR embeds stability into the reward learning process, reducing the influence of transition variability and promoting more consistent policy optimization. Our empirical studies on diverse MuJoCo robotic control tasks show that the diffusion-based method yields more accurate transition estimates and outperforms conventional ensemble approaches in policy performance. Adding the TSR further improves both reward learning and policy learning in offline IRL.
Code: https://github.com/GabrielZH/doc-irl.
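For readers unfamiliar with the ensemble baseline that the abstract contrasts against, the following is a minimal sketch of a heuristic uncertainty penalty in a conservative MDP. It is not taken from the paper or its repository: the function names, the penalty coefficient `penalty_coef`, and the choice of heuristic (maximum L2 deviation of any ensemble member from the ensemble mean) are all illustrative assumptions.

```python
import numpy as np


def ensemble_disagreement(next_state_preds):
    """Heuristic uncertainty from an ensemble (illustrative assumption).

    next_state_preds: array of shape (K, B, D) holding the next-state
    predictions of K models for B state-action pairs with D state dims.
    Returns an array of shape (B,).
    """
    mean_pred = next_state_preds.mean(axis=0)                            # (B, D)
    deviations = np.linalg.norm(next_state_preds - mean_pred, axis=-1)   # (K, B)
    return deviations.max(axis=0)                                        # (B,)


def penalized_reward(reward, next_state_preds, penalty_coef=1.0):
    """Conservative reward: subtract a scaled uncertainty estimate."""
    return reward - penalty_coef * ensemble_disagreement(next_state_preds)


# Toy data: 5 hypothetical ensemble members, 4 state-action pairs, 3 state dims.
rng = np.random.default_rng(0)
agree = np.tile(rng.normal(size=(1, 4, 3)), (5, 1, 1))  # all members identical
disagree = rng.normal(size=(5, 4, 3))                   # members differ
reward = np.ones(4)

r_agree = penalized_reward(reward, agree)        # no disagreement, no penalty
r_disagree = penalized_reward(reward, disagree)  # disagreement lowers the reward
```

When all members agree, the penalty vanishes and the reward is unchanged; when they disagree, the reward is reduced, which discourages the policy from visiting those state-action pairs. The abstract's proposal replaces this kind of heuristic with an uncertainty estimate derived from a diffusion model, which this sketch does not attempt to reproduce.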