Self-Adaptive Priority Correction for Prioritized Experience Replay

Hongjie Zhang; Cheng Qu; Jindou Zhang; Jing Li

doi:10.3390/app10196925

Applied Sciences (Oct 2020)

Affiliations

Hongjie Zhang: School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
Cheng Qu: School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
Jindou Zhang: School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
Jing Li: School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China

DOI: https://doi.org/10.3390/app10196925
Journal volume & issue: Vol. 10, no. 19, p. 6925

Abstract

Deep Reinforcement Learning (DRL) is a promising approach for general artificial intelligence. However, most DRL methods suffer from data inefficiency. To alleviate this problem, DeepMind proposed Prioritized Experience Replay (PER). Though PER improves data utilization, the priorities of most samples in its Experience Memory (EM) are out of date, because only the priorities of a small part of the data are updated each time the Q-network parameters are updated. Consequently, the difference between the stored and real priority distributions gradually increases, which introduces bias into the gradients of Deep Q-Learning (DQL) and pushes the DQL update in a non-ideal direction. In this work, we propose a novel self-adaptive priority correction algorithm named Importance-PER (Imp-PER) to fix the update deviation. Specifically, we predict the sum of the real Temporal-Difference errors (TD-errors) of all data in EM. Data are then corrected by an importance weight, which is estimated from the predicted sum and the real TD-error calculated by the latest agent. To control the unbounded importance weight, we use truncated importance sampling with a self-adaptive truncation threshold. Experiments on various Atari 2600 games with Double Deep Q-Network and on MuJoCo with Deep Deterministic Policy Gradient demonstrate that Imp-PER improves data utilization and final policy quality on discrete-state and continuous-state tasks without increasing the computational cost.
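
The correction step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name, its parameters, the use of the absolute TD-error as the real priority, and the fixed truncation threshold are all assumptions made for illustration. The abstract does not say how the sum of real TD-errors is predicted or how the threshold adapts, so both are passed in as plain numbers here.

```python
from typing import List, Sequence


def truncated_importance_weights(
    stored_priorities: Sequence[float],
    fresh_td_errors: Sequence[float],
    stored_sum: float,
    predicted_real_sum: float,
    threshold: float,
) -> List[float]:
    """Hypothetical sketch of a truncated importance-weight correction.

    stored_priorities:  possibly stale priorities of the sampled batch
    fresh_td_errors:    TD-errors of the same batch from the latest agent
    stored_sum:         sum of stored priorities over the whole memory
    predicted_real_sum: predicted sum of real priorities over the memory
                        (assumed to be supplied by some external predictor)
    threshold:          truncation cap (fixed here; the paper adapts it)
    """
    if stored_sum <= 0 or predicted_real_sum <= 0:
        raise ValueError("priority sums must be positive")
    if len(stored_priorities) != len(fresh_td_errors):
        raise ValueError("batch inputs must have the same length")

    weights = []
    for p_stored, td in zip(stored_priorities, fresh_td_errors):
        if p_stored <= 0:
            raise ValueError("stored priorities must be positive")
        # Probability with which the sample was actually drawn.
        stored_prob = p_stored / stored_sum
        # Probability it would have under up-to-date priorities (assumed |TD|).
        real_prob = abs(td) / predicted_real_sum
        # Importance ratio, clipped from above to bound its variance.
        weights.append(min(real_prob / stored_prob, threshold))
    return weights


# Illustrative numbers only; they do not come from the paper.
batch_weights = truncated_importance_weights(
    stored_priorities=[2.0, 1.0, 1.0],
    fresh_td_errors=[0.5, -1.0, 4.0],
    stored_sum=10.0,
    predicted_real_sum=8.0,
    threshold=2.0,
)
```

In this example the third sample has a stale stored priority that is far too low, so its raw ratio of roughly 5 is clipped to the threshold of 2.0. Under these assumptions, each weight would multiply the corresponding sample's loss before the gradient step.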

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
