Dynamic Grouping within Minimax Optimal Strategy for Stochastic Multi-Armed Bandits in Reinforcement Learning Recommendation

Jiamei Feng; Junlong Zhu; Xuhui Zhao; Zhihang Ji

doi:10.3390/app14083441

Applied Sciences (Apr 2024)

Affiliations

Jiamei Feng: College of Information Engineering, Henan University of Science and Technology, Luoyang 471000, China
Junlong Zhu: College of Information Engineering, Henan University of Science and Technology, Luoyang 471000, China
Xuhui Zhao: College of Information Engineering, Henan University of Science and Technology, Luoyang 471000, China
Zhihang Ji: College of Information Engineering, Henan University of Science and Technology, Luoyang 471000, China

DOI: https://doi.org/10.3390/app14083441
Journal volume & issue: Vol. 14, no. 8, p. 3441

Abstract

The multi-armed bandit (MAB) problem is a classic exploration-exploitation problem. As a classical MAB problem, the stochastic multi-armed bandit (SMAB) is the basis of reinforcement learning recommendation. However, most existing SMAB and MAB algorithms have two limitations: (1) they do not make full use of feedback from the environment or agent, such as the number of arms and the rewards contained in user feedback; (2) they overlook the use of different action selections, which can affect the algorithm's exploration and exploitation. These limitations motivate us to propose a novel algorithm, dynamic grouping within the minimax optimal strategy in the stochastic case (DG-MOSS), for reinforcement learning recommendation in small and medium-sized data scenarios. DG-MOSS does not require additional contextual data and can be used to recommend various types of data. Specifically, we designed a new exploration calculation method based on dynamic grouping, which automatically uses the feedback information in the selection process and adopts different action selections. To train the algorithm thoroughly, we designed an adaptive episode length that effectively improves training efficiency. We also analyzed and proved an upper bound on the regret of DG-MOSS. Our experimental results on datasets of different scales, densities, and fields show that DG-MOSS can yield greater rewards than nine baselines when the recommender is sufficiently trained, and demonstrate that it has better robustness.
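For orientation, the sketch below shows the standard MOSS index (minimax optimal strategy in the stochastic case) that DG-MOSS builds on. It is not the DG-MOSS algorithm: the dynamic grouping, the alternative action selections, and the adaptive episode length are not specified in the abstract and are omitted here. The function names, the Bernoulli reward model, and the example arm probabilities are assumptions made purely for illustration.

```python
import math
import random


def moss_index(mean, pulls, horizon, n_arms):
    """Standard MOSS index: empirical mean plus an exploration bonus.

    The bonus shrinks as an arm is pulled more often and is clipped to
    zero once pulls >= horizon / n_arms.
    """
    bonus = math.sqrt(max(math.log(horizon / (n_arms * pulls)), 0.0) / pulls)
    return mean + bonus


def run_moss(arm_probs, horizon, seed=0):
    """Run plain MOSS on hypothetical Bernoulli arms (illustrative only)."""
    rng = random.Random(seed)
    k = len(arm_probs)
    pulls = [0] * k      # number of times each arm was selected
    sums = [0.0] * k     # cumulative reward per arm
    total = 0.0
    for t in range(horizon):
        if t < k:
            arm = t      # initialization: pull every arm once
        else:
            arm = max(
                range(k),
                key=lambda i: moss_index(sums[i] / pulls[i], pulls[i], horizon, k),
            )
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        total += reward
    return pulls, total


if __name__ == "__main__":
    pulls, total = run_moss([0.1, 0.9], horizon=2000)
    print("pulls per arm:", pulls, "total reward:", total)
```

In a recommendation setting, each arm would correspond to a candidate item and the reward to user feedback; a grouping-based variant such as DG-MOSS would change how the exploration bonus is computed.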

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
