Actively learning costly reward functions for reinforcement learning

André Eberhard; Houssam Metni; Georg Fahland; Alexander Stroh; Pascal Friederich

doi:10.1088/2632-2153/ad33e0

Machine Learning: Science and Technology (Jan 2024)

Affiliations

André Eberhard: Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany
Houssam Metni: Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany; Université de Strasbourg, 4 rue Blaise Pascal, 67081 Strasbourg, France
Georg Fahland: Institute of Fluid Mechanics, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany
Alexander Stroh: Institute of Fluid Mechanics, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany
Pascal Friederich: Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany; Institute of Nanotechnology, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany

DOI: https://doi.org/10.1088/2632-2153/ad33e0
Journal volume & issue: Vol. 5, no. 1, p. 015055

Abstract

Transfer of recent advances in deep reinforcement learning to real-world applications is hindered by high data demands and thus low efficiency and scalability. Through independent improvements of components such as replay buffers or more stable learning algorithms, and through massively distributed systems, training time could be reduced from several days to several hours for standard benchmark tasks. However, while rewards in simulated environments are well-defined and easy to compute, reward evaluation becomes the bottleneck in many real-world environments, e.g. in molecular optimization tasks, where computationally demanding simulations or even experiments are required to evaluate states and to quantify rewards. When ground-truth evaluations become orders of magnitude more expensive than in research scenarios, direct transfer of recent advances would require massive scale just for evaluating rewards rather than for training the models. We propose to alleviate this problem by replacing costly ground-truth rewards with rewards modeled by neural networks, counteracting non-stationarity of state and reward distributions during training with an active learning component. We demonstrate that, using our proposed method, it is possible to train agents in complex real-world environments orders of magnitude faster than would be possible when using ground-truth rewards. By enabling the application of RL methods to new domains, we show that we can find interesting and non-trivial solutions to real-world optimization problems in chemistry, materials science and engineering. We demonstrate speed-up factors of 50–3000 when applying our approach to challenges of molecular design and airfoil optimization.
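
The idea of a learned reward with an active learning component can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`costly_reward`, `SurrogateReward`, `novelty_threshold`, `run`) is hypothetical. For brevity it swaps in a nearest-neighbour lookup for the neural network reward model, a distance-based novelty rule for the active learning criterion, and a toy hill-climbing loop for the RL agent; the abstract does not specify any of these details.

```python
import random


def costly_reward(state):
    """Hypothetical stand-in for an expensive simulation or experiment."""
    return -(state - 0.3) ** 2


class SurrogateReward:
    """Cheap reward model with an active-learning query rule.

    Assumption: a nearest-neighbour lookup replaces the neural network
    described in the abstract, purely to keep the sketch short.
    """

    def __init__(self, oracle, novelty_threshold=0.05):
        self.oracle = oracle
        self.novelty_threshold = novelty_threshold
        self.states = []    # states labelled with ground-truth rewards
        self.rewards = []   # the corresponding ground-truth rewards
        self.oracle_calls = 0

    def _nearest(self, state):
        # Distance to, and reward of, the closest labelled state.
        best = min(range(len(self.states)),
                   key=lambda i: abs(self.states[i] - state))
        return abs(self.states[best] - state), self.rewards[best]

    def __call__(self, state):
        if self.states:
            distance, predicted = self._nearest(state)
            if distance <= self.novelty_threshold:
                return predicted  # cheap path: trust the surrogate
        # Active-learning step: the state is novel, so query the costly
        # ground truth and add the new label to the training set.
        reward = self.oracle(state)
        self.oracle_calls += 1
        self.states.append(state)
        self.rewards.append(reward)
        return reward


def run(num_steps=2000, seed=0):
    rng = random.Random(seed)
    reward_fn = SurrogateReward(costly_reward)
    state = 0.9
    for _ in range(num_steps):
        # Toy stand-in for an agent: random local search on [0, 1] that
        # keeps a proposal when its (surrogate) reward is not lower.
        proposal = min(1.0, max(0.0, state + rng.gauss(0.0, 0.05)))
        if reward_fn(proposal) >= reward_fn(state):
            state = proposal
    return state, reward_fn.oracle_calls
```

In this toy setting the loop asks for thousands of rewards, but the costly function is only called when the search reaches a region that has not been labelled yet. As the visited states drift during optimization, new labels are collected where they are needed, which is the role the abstract assigns to the active learning component.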

Published in Machine Learning: Science and Technology

ISSN: 2632-2153 (Online)
Publisher: IOP Publishing
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://iopscience.iop.org/journal/2632-2153
