IEEE Access (Jan 2024)
Combined Constraint on Behavior Cloning and Discriminator in Offline Reinforcement Learning
Abstract
Reinforcement learning (RL) has attracted considerable attention because it can learn optimal behavioral policies automatically. However, because RL acquires a policy through repeated interaction with the environment, it is difficult to apply to realistic tasks. This has motivated extensive research on offline RL (batch RL), which learns from accumulated experience prepared in advance and does not need to interact with the environment. Applying common RL methods directly in the offline setting does not work because of a problem called distributional shift, and methods to suppress it have been actively studied. In this study, we propose a new offline RL algorithm that adds constraints from the discriminators used in Generative Adversarial Networks to the offline RL method TD3+BC. We compare the proposed method with existing methods on a benchmark for 3D robot control simulation. TD3+BC tightens its constraint to suppress distributional shift, but this makes learning difficult when the dataset is of poor quality. The proposed approach addresses this issue by retaining the mechanism that mitigates distributional shift while introducing new constraints, so that learning does not depend solely on the dataset’s quality. This strategy aims to improve accuracy even when the dataset is of poor quality.
Keywords