IEEE Access (Jan 2024)
M0RV Model: Advancing the MuZero Algorithm Through Strategic Data Optimization Reuse and Value Function Refinement
Abstract
This paper introduces M0RV, a model that improves the MuZero algorithm through data reuse and loss function optimization. It proposes feeding training trajectories generated by Monte Carlo Tree Search (MCTS) back into the training process after filtering them through an evaluation function, and, on this basis, employs the Advantage-Value method to optimize the neural network loss function, thereby refining the overall training process. A comparative analysis is conducted among the baseline MuZero algorithm, its A0GB-enhanced variant M0GB, and the further refined M0RV algorithm across Atari games and complex board games. Notably, M0RV outperforms its predecessors on Lunar Lander and Breakout, as well as on the board game Hex, under identical training-step budgets and a unified reward benchmark. The empirical findings demonstrate that, compared with MuZero, the M0RV model substantially improves training efficiency, fulfilling the objective of optimizing the training methodology.
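The abstract's two ingredients, evaluation-filtered trajectory reuse and an advantage-based value loss, can be summarized schematically. The sketch below is illustrative only and not the paper's implementation: the names `filter_trajectories`, `eval_fn`, `threshold`, and `advantage_value_targets` are hypothetical, and the loss shown is a generic advantage-style value regression assumed for exposition.

```python
import numpy as np

def filter_trajectories(trajectories, eval_fn, threshold):
    # Retain only self-play trajectories whose evaluation score clears the
    # threshold, so that higher-quality MCTS games are reused in later
    # training batches (the data-reuse step described in the abstract).
    return [traj for traj in trajectories if eval_fn(traj) >= threshold]

def advantage_value_targets(returns, predicted_values):
    # Advantage = observed return minus the network's current value estimate;
    # the value head is regressed toward the observed return, and the
    # advantages can weight the policy update (the Advantage-Value idea).
    advantages = returns - predicted_values
    value_loss = float(np.mean((predicted_values - returns) ** 2))
    return advantages, value_loss
```

Under these assumptions, training would alternate between generating trajectories with MCTS, keeping only those that pass `filter_trajectories`, and updating the network with the advantage-weighted loss.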
Keywords