Identifying multi-resolution clusters of diseases in ten million patients with multimorbidity in primary care in England

Thomas Beaney; Jonathan Clarke; David Salman; Thomas Woodcock; Azeem Majeed; Paul Aylin; Mauricio Barahona

doi:10.1038/s43856-024-00529-4

Communications Medicine (May 2024)

Affiliations

Thomas Beaney: Department of Primary Care and Public Health, Imperial College London
Jonathan Clarke: Department of Mathematics, Imperial College London
David Salman: Department of Primary Care and Public Health, Imperial College London
Thomas Woodcock: Department of Primary Care and Public Health, Imperial College London
Azeem Majeed: Department of Primary Care and Public Health, Imperial College London
Paul Aylin: Department of Primary Care and Public Health, Imperial College London
Mauricio Barahona: Department of Mathematics, Imperial College London

DOI: https://doi.org/10.1038/s43856-024-00529-4
Journal volume & issue: Vol. 4, no. 1, pp. 1–15

Abstract

Background: Identifying clusters of diseases may aid understanding of shared aetiology, management of co-morbidities, and the discovery of new disease associations. Our study aims to identify disease clusters using a large set of long-term conditions and to compare methods that use the co-occurrence of diseases with methods that use the sequence of disease development in a person over time.

Methods: We use electronic health records from over ten million people with multimorbidity registered to primary care in England. First, we extract data-driven representations of 212 diseases from patient records employing (i) co-occurrence-based methods and (ii) sequence-based natural language processing methods. Second, we apply the graph-based Markov Multiscale Community Detection (MMCD) to identify clusters based on disease similarity at multiple resolutions. We evaluate the representations and clusters using a clinically curated set of 253 known disease association pairs, and qualitatively assess the interpretability of the clusters.

Results: Both co-occurrence and sequence-based algorithms generate interpretable disease representations, with the best performance from the skip-gram algorithm. MMCD outperforms k-means and hierarchical clustering in explaining known disease associations. We find that diseases display an almost-hierarchical structure across resolutions, from closely to more loosely similar co-occurrence patterns, and identify interpretable clusters corresponding to both established and novel patterns.

Conclusions: Our method provides a tool for clustering diseases at different levels of resolution from co-occurrence patterns in high-dimensional electronic health records, which could be used to facilitate discovery of associations between diseases in the future.
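To make the idea of a co-occurrence-based disease representation concrete, the following is a minimal sketch on toy data. It is not the authors' pipeline: the patient records and disease names are invented, and the use of positive pointwise mutual information (PPMI) with cosine similarity is an assumption chosen for illustration, since the abstract does not name the specific co-occurrence method. The skip-gram representations and the MMCD clustering step are not reproduced here.

```python
import numpy as np

# Hypothetical toy records: each patient is a set of long-term conditions.
patients = [
    {"diabetes", "hypertension", "ckd"},
    {"diabetes", "hypertension"},
    {"asthma", "eczema"},
    {"asthma", "eczema", "hay_fever"},
    {"hypertension", "ckd"},
    {"eczema", "hay_fever"},
]

diseases = sorted(set().union(*patients))
index = {name: i for i, name in enumerate(diseases)}


def cooccurrence_matrix(records, index):
    """Count how many patients have each pair of distinct diseases."""
    n = len(index)
    counts = np.zeros((n, n))
    for record in records:
        ids = [index[d] for d in record]
        for i in ids:
            for j in ids:
                if i != j:
                    counts[i, j] += 1
    return counts


def ppmi(counts):
    """Positive pointwise mutual information of a co-occurrence matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row * col / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # pairs never seen together get zero weight
    return np.maximum(pmi, 0.0)


def cosine_similarity(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = vectors / norms
    return unit @ unit.T


counts = cooccurrence_matrix(patients, index)
similarity = cosine_similarity(ppmi(counts))
```

Each row of the PPMI matrix serves as a vector representation of one disease, and `similarity` is the kind of disease-by-disease matrix that a graph-based clustering method could take as input. In this toy example, asthma and hay fever are similar because both co-occur with eczema, while asthma and diabetes share no co-occurring conditions.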

Published in Communications Medicine

ISSN: 2730-664X (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine
Website: https://www.nature.com/commsmed/
