X-Neuron: Interpreting, Locating and Editing of Neurons in Reinforcement Learning Policy
Yuhong Ge, Xun Zhao, Jiangmiao Pang, Mingguo Zhao, Dahua Lin
Abstract
Despite the impressive performance of Reinforcement Learning (RL), the black-box neural network backbone hinders users from trusting and deploying trained agents in real-world applications where safety is crucial. To make agents more trustworthy and controllable, we propose to enhance the interpretability of a given RL-trained policy and make it human-controllable without retraining. We accomplish this goal with a 3-step pipeline: 1) we interpret neurons by analyzing their causal effect on kinematic attributes; 2) to help agents unlock novel skills and enable humans to assist agents in accomplishing tasks, we locate the X-neuron, the optimal neuron capable of evoking the desired behavior; and 3) we edit its activation values to achieve precise control. We evaluate our method on various RL tasks ranging from autonomous driving to robot locomotion, and the results show that our approach outperforms previous work on almost all evaluation metrics. By enhancing interpretability and introducing human control, agents can improve safety and performance, even in unseen environments and novel tasks. For locomotion robots trained simply to walk forward, our method unlocks diverse controllable behaviors ranging from jumping to backflips.
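To make the interpret, locate, edit pipeline concrete, the following is a minimal illustrative sketch and not the implementation used in this work. It assumes a toy one-hidden-layer tanh policy with random weights, uses the first action dimension as a stand-in for a kinematic attribute, and estimates a neuron's causal effect by clamping its activation to a high and a low value. All function names (`make_policy`, `forward`, `neuron_effects`, `locate_neuron`) are hypothetical.

```python
import math
import random


def make_policy(obs_dim=4, hidden_dim=6, act_dim=2, seed=0):
    """Toy one-hidden-layer policy with random weights (assumed stand-in)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(obs_dim)]
    w2 = [[rng.gauss(0, 1) for _ in range(act_dim)] for _ in range(hidden_dim)]
    return w1, w2


def forward(policy, obs, edit=None):
    """Compute the action for one observation.

    `edit` is an optional (neuron_index, value) pair that overwrites
    one hidden activation before the output layer is applied.
    """
    w1, w2 = policy
    hidden = [
        math.tanh(sum(o * w1[i][j] for i, o in enumerate(obs)))
        for j in range(len(w2))
    ]
    if edit is not None:
        idx, value = edit
        hidden[idx] = value
    return [
        sum(h * w2[j][k] for j, h in enumerate(hidden))
        for k in range(len(w2[0]))
    ]


def neuron_effects(policy, observations, attribute, low=-1.0, high=1.0):
    """Step 1 (interpret): mean change in `attribute` when each hidden
    neuron is clamped to `high` versus `low` (assumed effect estimate)."""
    effects = []
    for j in range(len(policy[1])):
        diffs = [
            attribute(forward(policy, o, (j, high)))
            - attribute(forward(policy, o, (j, low)))
            for o in observations
        ]
        effects.append(sum(diffs) / len(diffs))
    return effects


def locate_neuron(effects):
    """Step 2 (locate): index of the neuron with the largest absolute effect."""
    return max(range(len(effects)), key=lambda j: abs(effects[j]))


# Usage on random observations; the attribute here is simply action[0].
policy = make_policy()
obs_rng = random.Random(1)
observations = [[obs_rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
effects = neuron_effects(policy, observations, lambda action: action[0])
target = locate_neuron(effects)

# Step 3 (edit): overwrite the located neuron's activation at inference time.
baseline = forward(policy, observations[0])
edited = forward(policy, observations[0], edit=(target, 1.0))
```

In a real agent the attribute would be measured from environment rollouts (for example, forward velocity) rather than read directly from the action, so the effect of a neuron would generally not reduce to a single output weight as it does in this linear-readout toy.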