IEEE Access (Jan 2024)

Computer Go Research Based on Variable Scale Training and PUB-PMCTS

  • Jinhan Huang
  • Zhixing Huang
  • Shengcai Cen
  • Wurui Shi
  • Xiaoxiao Huang
  • Xueyun Chen

DOI: https://doi.org/10.1109/ACCESS.2023.3244320
Journal volume & issue: Vol. 12, pp. 67246–67255

Abstract

The mainstream Go AI algorithms represented by AlphaZero and KataGo suffer from low-quality samples in the early training period and from low exploration efficiency in traditional Monte Carlo Tree Search (MCTS). To address these shortcomings, variable scale training is proposed: a variable-scale board is introduced whose boundary condition is formed by randomly placing stones around the board's periphery, and a small-scale network is pre-trained on it to recommend a local move policy and ownership. This network is used to improve the backbone network's move policy and state value, enhancing the quality of game samples in the early stages of training. To improve search efficiency and convergence speed, we propose Parallel Monte Carlo Tree Search with Potential Upper Bound (PUB-PMCTS): multiple unevaluated searches are executed sequentially and the resulting leaf nodes are then evaluated in parallel; in addition, the variance of a node's action values is used to forecast the node's potential upper limit. We further add a self-attention mechanism to the network to extract global context information from the features, and a maximum entropy loss to increase the model's exploration ability. With these improvements, the bot TransGo is designed. Experimental results show that in a 13×13 Go environment, TransGo performs more stably and plays at a higher level in the early training period than the other algorithms. After four days of training each of TransGo, KataGo, and AlphaZero, TransGo improved by 102 Elo over KataGo and by more than 1000 Elo over AlphaZero.
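To make the PUB-PMCTS idea concrete, below is a minimal Python sketch of variance-augmented node selection plus batched leaf evaluation. The abstract does not give the exact formula, so the combination rule, the constants C_PUCT and BETA, and the hooks descend, evaluate_batch, and backup are all assumptions for illustration, not the paper's implementation.

```python
import math

C_PUCT = 1.5   # PUCT exploration constant (assumed)
BETA = 0.5     # weight of the variance-based potential bonus (assumed)

class Node:
    def __init__(self, prior):
        self.prior = prior        # policy prior P(s, a)
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # sum of backed-up values
        self.value_sq_sum = 0.0   # sum of squared values, for the variance
        self.children = {}        # action -> Node

    def q(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

    def variance(self):
        # Empirical variance of the values backed up through this node.
        if self.visit_count < 2:
            return 0.0
        mean = self.q()
        return max(self.value_sq_sum / self.visit_count - mean * mean, 0.0)

def pub_score(parent, child):
    """PUCT score plus a variance bonus: a child whose action values vary
    widely is treated as having a higher potential upper limit."""
    u = C_PUCT * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    potential = BETA * math.sqrt(child.variance())
    return child.q() + u + potential

def select_child(parent):
    return max(parent.children.items(), key=lambda kv: pub_score(parent, kv[1]))

def run_pub_pmcts(root, batch_size, descend, evaluate_batch, backup):
    """Run several unevaluated descents sequentially, then evaluate all
    collected leaves in one parallel (batched) network call."""
    leaves = [descend(root, select_child) for _ in range(batch_size)]
    for leaf, value in zip(leaves, evaluate_batch(leaves)):
        backup(leaf, value)
```

Batching the leaf evaluations is what lets one network forward pass serve many search threads at once, which is where the claimed efficiency gain of the parallel search would come from.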
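The maximum entropy loss can likewise be sketched as a standard entropy bonus on the policy head. The weight ENT_COEF and the way the bonus combines with the AlphaZero-style cross-entropy term are assumptions here; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

ENT_COEF = 0.01  # entropy bonus weight (assumed)

def policy_loss_with_entropy(policy_logits, target_pi):
    """Cross-entropy against the MCTS visit distribution, minus an
    entropy bonus that keeps the policy flatter and more exploratory."""
    log_p = F.log_softmax(policy_logits, dim=-1)
    ce = -(target_pi * log_p).sum(dim=-1).mean()         # standard policy loss
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()  # policy entropy
    return ce - ENT_COEF * entropy                       # subtracting maximizes entropy
```

Subtracting the entropy term penalizes overly peaked move distributions, which matches the abstract's stated goal of growing the model's exploration ability during self-play.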

Keywords