Stats (Jan 2023)

An <i>ϵ</i>-Greedy Multiarmed Bandit Approach to Markov Decision Processes

  • Isa Muqattash,
  • Jiaqiao Hu

DOI
https://doi.org/10.3390/stats6010006
Journal volume & issue
Vol. 6, no. 1
pp. 99 – 112

Abstract

We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of the ϵ-greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimate of the “reward-to-go” value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we bound the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations against the SysAdmin benchmark problem with 2<sup>10</sup> states. The results show that REGA and RASA achieved similar performance. Moreover, REGA and RASA empirically outperformed an implementation of the algorithm that uses the “original” ϵ-greedy algorithm that commonly appears in the literature.
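The ϵ-greedy bandit primitive the abstract refers to can be sketched as follows. This is only an illustrative, minimal version of the standard ϵ-greedy rule (explore uniformly with probability ϵ, otherwise exploit the current best estimate), not the paper's REGA algorithm, which applies such a rule recursively across the stages of the MDP; the function names and parameters here are hypothetical.

```python
import random

def epsilon_greedy_pick(q_values, epsilon, rng):
    """Pick an arm: explore uniformly with prob. epsilon, else take the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_bandit(reward_fn, n_actions, n_samples, epsilon, seed=0):
    """Estimate per-action mean rewards by epsilon-greedy sampling.

    reward_fn(a, rng) returns a stochastic reward for action a.
    Returns the running-mean estimates and per-action sample counts.
    """
    rng = random.Random(seed)
    q = [0.0] * n_actions       # running mean reward per action
    counts = [0] * n_actions    # how often each action was sampled
    for _ in range(n_samples):
        a = epsilon_greedy_pick(q, epsilon, rng)
        r = reward_fn(a, rng)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]  # incremental mean update
    return q, counts
```

In a stage-wise scheme like the one the abstract describes, the `reward_fn` for one stage would itself invoke the bandit recursion for the next stage, so the arm estimates approximate the reward-to-go rather than a one-step reward.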

Keywords