CAAI Transactions on Intelligence Technology (Apr 2024)

Deep reinforcement learning using least‐squares truncated temporal‐difference

  • Junkai Ren,
  • Yixing Lan,
  • Xin Xu,
  • Yichuan Zhang,
  • Qiang Fang,
  • Yujun Zeng

DOI
https://doi.org/10.1049/cit2.12202
Journal volume & issue
Vol. 9, no. 2
pp. 425 – 439

Abstract

Policy evaluation (PE) is a critical sub-problem in reinforcement learning; it estimates the value function of a given policy and can be used for policy improvement. However, current PE methods still suffer from limitations such as low sample efficiency and local convergence, especially on complex tasks. In this study, a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning (LST2D) is proposed. In LST2D, an adaptive truncation mechanism is designed that effectively combines the fast convergence property of Least-Squares Temporal Difference learning (LSTD) with the asymptotic convergence property of Temporal Difference learning (TD). Two feature pre-training methods are then utilised to improve the approximation ability of LST2D. Furthermore, an Actor-Critic algorithm based on LST2D and pre-trained feature representations (ACLPF) is proposed, in which LST2D is integrated into the critic network to improve learning and prediction efficiency. Comprehensive simulation studies were conducted on four robotic tasks, and the results illustrate the effectiveness of LST2D. The proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, which demonstrates that LST2D can be applied to online learning control problems by incorporating it into the actor-critic architecture.
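To make the idea of combining LSTD's fast early convergence with TD's asymptotic refinement concrete, the sketch below shows one plausible (assumed) truncation scheme for linear value estimation: accumulate LSTD statistics and solve in batch for the first samples, then switch to incremental TD(0) updates. The abstract does not specify LST2D's actual update rules or truncation criterion, so the class name, the fixed switch point `truncate_after`, and all hyperparameters here are illustrative assumptions, not the authors' method.

```python
# Minimal sketch, NOT the LST2D algorithm from the paper: the truncation point and
# update rules are assumptions used only to illustrate LSTD-then-TD value estimation.
import numpy as np

class LSTDThenTD:
    """Linear value estimation: LSTD batch solves early on, TD(0) steps afterwards."""

    def __init__(self, n_features, gamma=0.99, alpha=0.01, truncate_after=500):
        self.gamma = gamma                       # discount factor
        self.alpha = alpha                       # TD(0) step size
        self.truncate_after = truncate_after     # assumed switch point (samples)
        self.w = np.zeros(n_features)            # value weights: V(s) ~= w . phi(s)
        self.A = np.eye(n_features) * 1e-3       # regularised LSTD matrix
        self.b = np.zeros(n_features)            # LSTD vector
        self.t = 0                               # number of samples seen

    def update(self, phi_s, reward, phi_next, done):
        # Terminal states contribute no bootstrapped value.
        target_phi = np.zeros_like(phi_next) if done else phi_next
        if self.t < self.truncate_after:
            # LSTD accumulation: A += phi (phi - gamma * phi')^T, b += phi * r,
            # then solve A w = b for a sample-efficient batch estimate.
            self.A += np.outer(phi_s, phi_s - self.gamma * target_phi)
            self.b += phi_s * reward
            self.w = np.linalg.solve(self.A, self.b)
        else:
            # Incremental TD(0) step for asymptotic refinement.
            td_error = reward + self.gamma * target_phi @ self.w - phi_s @ self.w
            self.w += self.alpha * td_error * phi_s
        self.t += 1

    def value(self, phi_s):
        return float(phi_s @ self.w)
```

In an actor-critic setting of the kind the abstract describes, such an estimator would play the role of the critic over (pre-trained) features phi(s), while the actor is updated from the resulting value estimates; the integration details are those of ACLPF in the paper, not of this sketch.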

Keywords