Applied Sciences (Jul 2023)
Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Comparative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games
Abstract
Invalid action masking is a practical technique in deep reinforcement learning that prevents agents from taking invalid actions. Existing approaches rely on action masking during both policy training and policy execution. This study focuses on developing reinforcement learning algorithms that incorporate action masking during training but can be deployed without action masking during policy execution. The study begins with a theoretical analysis that elucidates the distinction between the naive policy gradient and the invalid action policy gradient. Based on this analysis, we demonstrate that the naive policy gradient is a valid gradient and is equivalent to the proposed composite objective algorithm, which optimizes the masked policy and the original policy in parallel. Moreover, we propose an off-policy algorithm for invalid action masking that uses the masked policy for sampling while optimizing the original policy. To compare the effectiveness of these algorithms, we conduct experiments in Gym-μRTS, a simplified real-time strategy (RTS) game simulator. Based on the empirical findings, we recommend the off-policy algorithm for most tasks and the composite objective algorithm for more complex tasks.
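For context, invalid action masking is commonly implemented by replacing the logits of invalid actions with a large negative constant before the softmax, so the masked policy assigns them near-zero probability. The sketch below is an illustrative PyTorch example of this standard construction, not the authors' code; the function name `masked_policy` and the constant `-1e8` are assumptions chosen for clarity.

```python
# Minimal sketch of invalid action masking, assuming a discrete action space.
# Logits of invalid actions are replaced with a large negative constant before
# the softmax, so the masked policy gives them ~zero probability and they
# contribute (almost) no gradient to the policy-gradient loss.
import torch
from torch.distributions import Categorical

def masked_policy(logits: torch.Tensor, action_mask: torch.Tensor) -> Categorical:
    """logits: (batch, n_actions); action_mask: boolean, True = valid action."""
    masked_logits = torch.where(action_mask, logits, torch.full_like(logits, -1e8))
    return Categorical(logits=masked_logits)

# Example: 4 actions, with actions 1 and 3 invalid in the current state.
logits = torch.randn(1, 4, requires_grad=True)
mask = torch.tensor([[True, False, True, False]])
dist = masked_policy(logits, mask)
action = dist.sample()             # only valid actions can be sampled
log_prob = dist.log_prob(action)   # enters the policy-gradient objective
```

Under this construction, sampling with the masked policy while differentiating the original (unmasked) logits is what distinguishes the off-policy variant described in the abstract from training and executing with the mask throughout.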
Keywords