npj Biofilms and Microbiomes (Oct 2024)

DGCNN approach links metagenome-derived taxon and functional information providing insight into global soil organic carbon

  • Laura-Jayne Gardiner,
  • Matthew Marshall,
  • Katharina Reusch,
  • Chris Dearden,
  • Mark Birmingham,
  • Anna Paola Carrieri,
  • Edward O. Pyzer-Knapp,
  • Ritesh Krishna,
  • Andrew L. Neal

DOI
https://doi.org/10.1038/s41522-024-00583-9
Journal volume & issue
Vol. 10, no. 1
pp. 1 – 15

Abstract

Metagenomics can provide insight into the microbial taxa present in a sample and, through gene identification, the functional potential of the community. However, taxonomic and functional information are typically considered separately in downstream analyses. We develop interpretable machine learning (ML) approaches for modelling metagenomic data, combining the biological representation of species with their associated genetically encoded functions within models. We apply our methods to investigate soil organic carbon (SOC) stocks. First, we combine a diverse global set of soil microbiome samples with environmental data, improving the predictive performance of classic ML and providing new insights into the role of soil microbiomes in global carbon cycling. Our network analysis of predictive taxa identified by classical ML models provides context for their ecological significance, extending the focus beyond just the most predictive taxa to ‘hidden’ features within the model that might be considered less predictive using standard methods for explainability. We next develop unique graph representations for individual microbiomes, linking microbial taxa to their associated functions directly, enabling predictions of SOC via deep graph convolutional neural networks (DGCNNs). Interpretation of the DGCNNs distinguished between the importance of functions of key individual species, providing genome sequence differences, e.g., gene loss/acquisition, that associate with SOC. These approaches identify several members of the Verrucomicrobiaceae family and a range of genetically encoded functions, e.g., related to carbohydrate metabolism, as important for SOC stocks and effective global SOC predictors. These relatively understudied but widespread organisms could play an important role in SOC dynamics globally.
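
To make the idea of a per-sample graph that links taxa to their functions more concrete, the following is a minimal sketch in plain Python. It is not the authors' construction: the abstract does not specify node features, edge definitions, or whether function nodes are shared between taxa. The function name `build_sample_graph`, the taxon and function labels, the abundance values, and the choice to share one function node across all taxa that encode it are all assumptions made for illustration.

```python
def build_sample_graph(abundances, annotations):
    """Build an undirected taxon-function graph for one microbiome sample.

    abundances:  dict mapping taxon label -> relative abundance in the sample
    annotations: dict mapping taxon label -> iterable of function labels
    Returns (nodes, edges, features), where nodes[i] is a (kind, label) pair,
    edges is a sorted list of (taxon_id, function_id) pairs, and features[i]
    is [is_taxon, is_function, abundance] for node i.

    Assumption: a function node is shared by every taxon that encodes it.
    """
    nodes, features, index = [], [], {}
    edges = set()

    def add_node(kind, label, abundance=0.0):
        key = (kind, label)
        if key not in index:
            index[key] = len(nodes)
            nodes.append(key)
            features.append([
                1.0 if kind == "taxon" else 0.0,
                1.0 if kind == "function" else 0.0,
                abundance,
            ])
        return index[key]

    for taxon in sorted(abundances):
        abundance = abundances[taxon]
        if abundance <= 0:
            continue  # taxa absent from this sample contribute no nodes
        taxon_id = add_node("taxon", taxon, abundance)
        for function in sorted(set(annotations.get(taxon, ()))):
            function_id = add_node("function", function)
            edges.add((taxon_id, function_id))

    return nodes, sorted(edges), features


# Hypothetical sample: labels and abundances are invented for illustration.
abundances = {"taxon_A": 0.6, "taxon_B": 0.4, "taxon_C": 0.0}
annotations = {
    "taxon_A": ["carbohydrate_metabolism", "nitrogen_fixation"],
    "taxon_B": ["carbohydrate_metabolism"],
    "taxon_C": ["sulfur_oxidation"],
}
nodes, edges, features = build_sample_graph(abundances, annotations)
```

A graph of this shape (node list, edge list, per-node feature vectors) is the kind of input a graph neural network library would typically consume, with one graph per sample and the sample's SOC value as the prediction target. Using taxon-specific function nodes instead of shared ones would be an equally plausible design and would keep each species' functions separate in the graph.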