Human-Centric Intelligent Systems (Feb 2025)
Offline Safe Reinforcement Learning for Sepsis Treatment: Tackling Variable-Length Episodes with Sparse Rewards
Abstract
In critical care medicine, data-driven methods that assist physician decision-making must produce accurate recommendations with controllable safety risks. Most recent reinforcement learning models developed for clinical research use fixed-length and very short time series data. Unfortunately, such methods generalize poorly to variable-length data, which can be very long; in such cases, a single terminal reward signal is extremely sparse. Meanwhile, safety is often overlooked, leading many models to make excessively extreme recommendations. In this paper, we study how to recommend effective and safe treatments for critically ill septic patients. We develop an offline reinforcement learning model based on CQL (Conservative Q-Learning), which underestimates the expected rewards of treatments rarely seen in the data and thus achieves a higher safety standard. We further enhance the model with intermediate rewards derived from the APACHE II scoring system, which effectively handles variable-length episodes with sparse rewards. Through extensive experiments on the MIMIC-III database, we demonstrate improved performance and robustness in terms of safety. Our code for data extraction, preprocessing, and modeling can be found at https://github.com/OOPSDINOSAUR/RL_safety_model .
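For context, the conservative behavior described above corresponds to the standard CQL(H) training objective; the sketch below shows that generic formulation, not necessarily the exact variant used in this paper:

\[
\min_{Q}\;\; \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\Big[\log \sum_{a} \exp Q(s,a) \;-\; \mathbb{E}_{a \sim \hat{\pi}_{\beta}(\cdot \mid s)}\big[Q(s,a)\big]\Big] \;+\; \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]
\]

The first term pushes down Q-values for actions outside the behavior distribution \(\hat{\pi}_{\beta}\) while pushing up Q-values for actions observed in the dataset \(\mathcal{D}\), which is why rarely seen treatments receive underestimated value estimates and recommendations stay close to clinically observed practice.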
Keywords