Applied Sciences (Nov 2024)
Online Attentive Kernel-Based Off-Policy Temporal Difference Learning
Abstract
Temporal difference (TD) learning is a powerful framework for value function approximation in reinforcement learning. However, standard TD methods often struggle with feature representation and off-policy learning challenges. In this paper, we propose a novel framework, online attentive kernel-based off-policy TD learning, and, by combining it with well-known algorithms, derive three algorithms: OAKGTD2, OAKTDC, and OAKETD. The framework uses two-timescale optimization. On the slow timescale, a sparse representation of state features is learned with an online attentive kernel-based method. On the fast timescale, auxiliary variables are used to update the value function parameters in the off-policy setting. We theoretically prove the convergence of all three algorithms. Through experiments in several standard reinforcement learning environments, we demonstrate the effectiveness of the proposed algorithms and compare their performance with existing methods. In terms of cumulative reward, the proposed algorithms achieve an average improvement of 15% over on-policy algorithms and an average improvement of 25% over common off-policy algorithms.
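To make the fast-timescale step concrete, the sketch below shows a standard GTD2-style off-policy update with an auxiliary variable and an importance-sampling ratio; it is a minimal illustration of the role such auxiliary variables play, not the paper's OAKGTD2 algorithm, and the feature vectors stand in for the sparse attentive kernel representation learned on the slow timescale.

```python
# Minimal sketch (assumed, not the authors' implementation): one GTD2-style
# off-policy update. phi_s / phi_next are placeholder feature vectors that
# would come from the learned kernel-based representation.
import numpy as np

def gtd2_step(theta, w, phi_s, phi_next, reward, rho,
              gamma=0.99, alpha=0.01, beta=0.1):
    """One off-policy GTD2 update of value parameters theta with auxiliary w."""
    # TD error under the current linear value estimate
    delta = reward + gamma * phi_next @ theta - phi_s @ theta
    # Update value-function parameters using the auxiliary variable
    theta = theta + alpha * rho * (phi_s - gamma * phi_next) * (phi_s @ w)
    # Auxiliary variable tracks a projection of the expected TD error
    w = w + beta * rho * (delta - phi_s @ w) * phi_s
    return theta, w
```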
Keywords