Communications Medicine (May 2024)

Identifying multi-resolution clusters of diseases in ten million patients with multimorbidity in primary care in England

  • Thomas Beaney,
  • Jonathan Clarke,
  • David Salman,
  • Thomas Woodcock,
  • Azeem Majeed,
  • Paul Aylin,
  • Mauricio Barahona

DOI
https://doi.org/10.1038/s43856-024-00529-4
Journal volume & issue
Vol. 4, no. 1
pp. 1 – 15

Abstract

Background: Identifying clusters of diseases may aid understanding of shared aetiology, management of co-morbidities, and the discovery of new disease associations. Our study aims to identify disease clusters using a large set of long-term conditions, and to compare methods that use the co-occurrence of diseases with methods that use the sequence of disease development in a person over time.

Methods: We use electronic health records from over ten million people with multimorbidity registered to primary care in England. First, we extract data-driven representations of 212 diseases from patient records employing (i) co-occurrence-based methods and (ii) sequence-based natural language processing methods. Second, we apply graph-based Markov Multiscale Community Detection (MMCD) to identify clusters based on disease similarity at multiple resolutions. We evaluate the representations and clusters using a clinically curated set of 253 known disease association pairs, and qualitatively assess the interpretability of the clusters.

Results: Both co-occurrence and sequence-based algorithms generate interpretable disease representations, with the best performance from the skip-gram algorithm. MMCD outperforms k-means and hierarchical clustering in explaining known disease associations. We find that diseases display an almost-hierarchical structure across resolutions, ranging from closely to more loosely similar co-occurrence patterns, and we identify interpretable clusters corresponding to both established and novel patterns.

Conclusions: Our method provides a tool for clustering diseases at different levels of resolution from co-occurrence patterns in high-dimensional electronic health records, which could be used to facilitate discovery of associations between diseases in the future.
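
As an illustration only, the idea of a co-occurrence-based disease representation mentioned in the Methods can be sketched in a few lines of Python. This is not the authors' pipeline: the abstract does not specify which co-occurrence methods were used, and the patient records, disease names, and the helper functions `cooccurrence_vectors` and `cosine` below are all hypothetical. The sketch assumes each disease is represented by its row of pairwise co-occurrence counts, and that similarity between diseases is measured by cosine similarity.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Toy, made-up records (assumption): each patient is a set of long-term conditions.
patients = [
    {"diabetes", "hypertension", "ckd"},
    {"diabetes", "hypertension"},
    {"asthma", "eczema"},
    {"asthma", "eczema", "hay_fever"},
    {"hypertension", "ckd"},
]


def cooccurrence_vectors(records):
    """Represent each disease by its counts of co-occurrence with every other disease."""
    diseases = sorted(set().union(*records))
    counts = Counter()
    for record in records:
        # Count every unordered pair of diseases seen in the same patient, symmetrically.
        for a, b in combinations(sorted(record), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    vectors = {d: [counts[(d, other)] for other in diseases] for d in diseases}
    return diseases, vectors


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


diseases, vectors = cooccurrence_vectors(patients)

# Diseases that share co-occurrence partners score higher than unrelated ones.
sim_close = cosine(vectors["diabetes"], vectors["ckd"])
sim_far = cosine(vectors["diabetes"], vectors["asthma"])
print(f"diabetes vs ckd:    {sim_close:.2f}")
print(f"diabetes vs asthma: {sim_far:.2f}")
```

In this toy data, diabetes and chronic kidney disease both co-occur with hypertension and so receive a high similarity, while diabetes and asthma share no partners. A similarity matrix of this kind is the sort of input that a graph-based clustering step could then operate on; the clustering itself is not shown here.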