Transitive Sequencing Medical Records for Mining Predictive and Interpretable Temporal Representations
Hossein Estiri,
Zachary H. Strasser,
Jeffery G. Klann,
Thomas H. McCoy, Jr.,
Kavishwar B. Wagholikar,
Sebastien Vasey,
Victor M. Castro,
MaryKate E. Murphy,
Shawn N. Murphy
Affiliations
Hossein Estiri
Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA; Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA; Harvard Medical School, Boston, MA 02115, USA; Corresponding author
Zachary H. Strasser
Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA; Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA; Harvard Medical School, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
Jeffery G. Klann
Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA; Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA; Harvard Medical School, Boston, MA 02115, USA
Thomas H. McCoy, Jr.
Harvard Medical School, Boston, MA 02115, USA; Center for Quantitative Health, Massachusetts General Hospital, Boston, MA 02114, USA
Kavishwar B. Wagholikar
Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA; Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA; Harvard Medical School, Boston, MA 02115, USA
Sebastien Vasey
Department of Mathematics, Harvard University, Cambridge, MA 02138, USA
Victor M. Castro
Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
MaryKate E. Murphy
Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
Shawn N. Murphy
Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA; Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA; Harvard Medical School, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, USA
Summary: Electronic health records (EHRs) contain important temporal information about the progression of disease and treatment outcomes. This paper proposes a transitive sequencing approach for constructing temporal representations from EHR observations for downstream machine learning. Using clinical data from a cohort of patients with congestive heart failure, we mined temporal representations by transitive sequencing of EHR medication and diagnosis records for classification and prediction tasks. We compared the classification and prediction performances of the transitive sequential representations (bag-of-sequences approach) with the conventional approach of using aggregated vectors of EHR data (aggregated vector representation) across different classifiers. We found that the transitive sequential representations are better phenotype “differentiators” and predictors than the “atemporal” EHR records. Our results also demonstrated that data representations obtained from transitive sequencing of EHR observations can present novel insights about the progression of the disease that are difficult to discern when clinical data are treated independently of the patient's history.
The Bigger Picture: Over the past decade, billions of dollars have been spent to institute meaningful use of electronic health record (EHR) systems. For a multitude of reasons, however, EHR data are still complex and have ample quality issues, which make it difficult to leverage these data to address pressing health issues, especially during pandemics such as COVID-19, when rapid responses are needed. In this paper, we propose a transitive sequential pattern mining algorithm for exploiting the temporal information in the EHRs that are distorted by layers of administrative and healthcare system processes.
Perhaps more importantly, we propose a machine learning (ML) pipeline that is capable of engineering predictive features without the need for expert involvement to model diseases and health outcomes. Together, the temporal sequences and the ML pipeline can be rapidly deployed to develop computational models for identifying and validating novel disease markers and advancing medical knowledge discovery.
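To make the contrast drawn in the Summary concrete, the following minimal Python sketch builds a bag-of-sequences representation for a single patient and sets it beside an aggregated count vector. It is an illustration only, not the authors' implementation: the function name `transitive_sequences`, the record codes, the use of each code's first recorded date, and the strict "earlier than" rule are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def transitive_sequences(records):
    """Return the set of 'A->B' pairs where code A is first seen before code B.

    `records` is a list of (iso_date, code) tuples for one patient.
    Assumption: only the first recorded date of each code is used, and
    codes first seen on the same day do not form a sequence.
    """
    first_seen = {}
    for date, code in records:
        if code not in first_seen or date < first_seen[code]:
            first_seen[code] = date

    # Sort distinct codes by first date; ISO date strings sort chronologically.
    ordered = sorted(first_seen.items(), key=lambda item: item[1])

    sequences = set()
    # Pair every earlier code with every later code, not only adjacent ones;
    # this is the "transitive" part of the sketch.
    for (code_a, date_a), (code_b, date_b) in combinations(ordered, 2):
        if date_a < date_b:
            sequences.add(f"{code_a}->{code_b}")
    return sequences


# Hypothetical medication (rx) and diagnosis (dx) records for one patient.
patient = [
    ("2019-01-05", "dx:hypertension"),
    ("2019-03-10", "rx:furosemide"),
    ("2019-06-01", "dx:heart_failure"),
    ("2019-07-15", "rx:furosemide"),
]

bag_of_sequences = transitive_sequences(patient)
aggregated_vector = Counter(code for _, code in patient)
```

In this sketch the aggregated vector records only how often each code appears, whereas the bag of sequences also records that hypertension preceded both the furosemide prescription and the heart failure diagnosis. Either representation could then be turned into feature columns for a classifier.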