Jisuanji kexue (Dec 2021)

Proximal Policy Optimization Based on Self-directed Action Selection

  • SHEN Yi, LIU Quan

DOI
https://doi.org/10.11896/jsjkx.201000163
Journal volume & issue
Vol. 48, no. 12
pp. 297 – 303

Abstract


Algorithms that monotonically improve the policy are a current research hotspot in reinforcement learning and have achieved good performance in both discrete and continuous control tasks. Proximal policy optimization (PPO) is a classic monotonic policy improvement algorithm, but as an on-policy method it suffers from low sample utilization. To address this problem, an algorithm named proximal policy optimization based on self-directed action selection (SDAS-PPO) is proposed. SDAS-PPO not only reuses sample experience according to importance sampling weights, but also adds a synchronously updated experience pool that stores its own excellent sample experience and uses a self-directed network learned from this pool to guide action selection. SDAS-PPO greatly improves sample utilization and ensures that the agent learns quickly and effectively while training the network model. To verify its effectiveness, SDAS-PPO is compared with the TRPO, PPO, and PPO-AMBER algorithms on continuous control tasks in the MuJoCo simulation platform. Experimental results show that the proposed method performs better in most environments.
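The abstract builds on PPO's clipped surrogate objective and describes reusing stored "excellent" transitions through importance sampling weights. The sketch below (not the authors' implementation) shows the standard PPO clipped loss and a hypothetical auxiliary-buffer loss, `auxiliary_replay_loss`, illustrating how off-policy samples from such an experience pool could be reweighted with the same clipped importance ratio; all function and parameter names are illustrative assumptions.

```python
# Minimal sketch of the PPO clipped surrogate loss, plus a hypothetical
# reuse of buffered transitions via importance-sampling weights.
import torch


def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate objective, negated for minimization."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # importance weight r(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()


def auxiliary_replay_loss(log_prob_new, log_prob_behaviour, advantage, clip_eps=0.2):
    """Hypothetical illustration: evaluate the same clipped surrogate on
    transitions drawn from an auxiliary pool of stored experience, weighting
    them by the ratio between the current policy and the behaviour policy
    that generated them."""
    return ppo_clip_loss(log_prob_new, log_prob_behaviour, advantage, clip_eps)
```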

Keywords