Jisuanji kexue (Jun 2022)

Off-policy Maximum Entropy Deep Reinforcement Learning Algorithm Based on Randomly Weighted Triple Q-Learning

  • FAN Jing-yu, LIU Quan

DOI
https://doi.org/10.11896/jsjkx.210300081
Journal volume & issue
Vol. 49, no. 6
pp. 335 – 341

Abstract


Reinforcement learning is an important branch of machine learning. With the development of deep learning, deep reinforcement learning has gradually become the focus of reinforcement learning research. Model-free off-policy deep reinforcement learning algorithms for continuous control attract wide attention because of their strong practicality. Like Q-learning, actor-critic algorithms suffer from the problem of overestimation. Clipped double Q-learning mitigates the effect of overestimation in actor-critic algorithms to a certain extent, but it also introduces underestimation into the learning process. To further address both overestimation and underestimation in actor-critic algorithms, a new learning method, randomly weighted triple Q-learning, is proposed. Combining this method with the soft actor-critic algorithm yields a new soft actor-critic algorithm based on randomly weighted triple Q-learning (SAC-RWTQ). The algorithm not only keeps the Q estimate close to the true Q value, but also increases the randomness of the Q estimate through random weighting, thereby alleviating overestimation and underestimation of action values during learning. Experimental results show that, compared with the SAC algorithm and other currently popular deep reinforcement learning algorithms such as DDPG, PPO and TD3, the SAC-RWTQ algorithm performs better on several MuJoCo tasks on the gym simulation platform.
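To make the idea of the critic target concrete, the sketch below shows one plausible reading of a randomly weighted triple Q-learning backup in a soft actor-critic setting: the clipped double-Q estimate (which tends to underestimate) is blended with a third critic's estimate (which may overestimate) using a randomly drawn weight, so the target falls between the two biases. The function name rwtq_target, the uniform weight distribution, and the exact blending formula are illustrative assumptions, not the paper's stated method.

```python
import numpy as np

def rwtq_target(q1, q2, q3, reward, alpha, log_prob, gamma=0.99, rng=None):
    """Illustrative soft Bellman target for randomly weighted triple Q-learning.

    q1, q2, q3 : next-state action-value estimates from three target critics
    reward     : immediate reward
    alpha      : entropy temperature of the soft actor-critic objective
    log_prob   : log-probability of the sampled next action under the policy
    Assumption: the target blends the clipped double-Q estimate with the
    third critic using a random weight; the paper's exact scheme may differ.
    """
    rng = rng or np.random.default_rng()
    clipped = min(q1, q2)              # clipped double Q-learning estimate (biased low)
    beta = rng.uniform(0.0, 1.0)       # random weight redrawn at each update
    q_next = beta * clipped + (1.0 - beta) * q3
    # Entropy-regularized (soft) one-step backup
    return reward + gamma * (q_next - alpha * log_prob)

# Toy usage with made-up numbers
print(rwtq_target(q1=1.2, q2=1.0, q3=1.5, reward=0.3, alpha=0.2, log_prob=-1.1))
```

Because the weight is resampled every update, the target value varies randomly between the pessimistic clipped estimate and the more optimistic third estimate, which is how the random weighting adds noise while keeping the estimate bounded by the existing critics.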

Keywords