Machine learning classification by fitting amplicon sequences to existing OTUs

Courtney R. Armour; Kelly L. Sovacool; William L. Close; Begüm D. Topçuoğlu; Jenna Wiens; Patrick D. Schloss

doi:10.1128/msphere.00336-23

mSphere (Oct 2023)

Affiliations

Courtney R. Armour: Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
Kelly L. Sovacool: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
William L. Close: Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
Begüm D. Topçuoğlu: Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
Jenna Wiens: Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
Patrick D. Schloss: Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA

DOI: https://doi.org/10.1128/msphere.00336-23
Journal volume & issue: Vol. 8, no. 5

Abstract

ABSTRACT The ability to use 16S rRNA gene sequence data to train machine learning classification models offers the opportunity to diagnose patients based on the composition of their microbiome. In some applications, the taxonomic resolution that provides the best models may require the use of de novo operational taxonomic units (OTUs) whose composition changes when new data are added. We previously developed a new reference-based approach, OptiFit, that fits new sequence data to existing de novo OTUs without changing the composition of the original OTUs. While OptiFit produces OTUs that are as high quality as de novo OTUs, it is unclear whether this method for fitting new sequence data into existing OTUs will impact the performance of classification models relative to models trained and tested only using de novo OTUs. We used OptiFit to cluster sequences into existing OTUs and evaluated model performance in classifying a dataset containing samples from patients with and without colonic screen-relevant neoplasia (SRN). We compared the performance of this model to standard methods including de novo and database-reference-based clustering. We found that using OptiFit performed as well as or better than these standard methods in classifying SRNs. OptiFit can streamline the process of classifying new samples by avoiding the need to retrain models using reclustered sequences.

IMPORTANCE There is great potential for using microbiome data to aid in diagnosis. A challenge with de novo operational taxonomic unit (OTU)-based classification models is that 16S rRNA gene sequences are often assigned to OTUs based on similarity to other sequences in the dataset. If data are generated from new patients, the old and new sequences must be reclustered to OTUs and the classification model retrained. Yet there is a desire to have a single, validated model that can be widely deployed. To overcome this obstacle, we applied the OptiFit clustering algorithm to fit new sequence data to existing OTUs, allowing for reuse of the model. A random forest model implemented using OptiFit performed as well as the traditional reassign-and-retrain approach. This result shows that it is possible to train and apply machine learning models based on OTU relative abundance data that do not require retraining or the use of a reference database.
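
The reason fitting to existing OTUs allows a trained model to be reused is that a new sample can be expressed in the same feature columns the model was trained on. The following is a minimal Python sketch of that idea only; it is not OptiFit and not the authors' pipeline. It assumes the fitting step has already assigned each new sequence to an existing OTU label, and the function name `relative_abundance_vector`, the OTU labels, the `"unfitted"` key, and the choice to drop sequences that fit no existing OTU before normalizing are all hypothetical choices made for illustration.

```python
def relative_abundance_vector(otu_ids, fitted_counts):
    """Return relative abundances ordered like the training OTU columns.

    otu_ids: OTU labels in the column order used to train the model.
    fitted_counts: mapping of OTU label -> sequence count for one new sample,
        as produced by some upstream fitting step (assumed, not shown here).
    """
    # Keep only counts for OTUs the model knows about; anything else
    # (for example sequences that fit no existing OTU) is dropped.
    # Dropping them before normalizing is an assumption of this sketch.
    kept = [fitted_counts.get(otu, 0) for otu in otu_ids]
    total = sum(kept)
    if total == 0:
        # No sequences fit any known OTU: return an all-zero feature vector.
        return [0.0] * len(otu_ids)
    return [count / total for count in kept]


# Hypothetical example: three training OTUs and one new sample.
training_otus = ["Otu001", "Otu002", "Otu003"]
new_sample = {"Otu001": 30, "Otu003": 10, "unfitted": 5}
features = relative_abundance_vector(training_otus, new_sample)
print(features)  # [0.75, 0.0, 0.25]
```

Because the resulting vector has one entry per training OTU, in training order, it could be passed to an already trained classifier without reclustering the original data or retraining the model.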

Published in mSphere

ISSN: 2379-5042 (Online)
Publisher: American Society for Microbiology
Country of publisher: United States
LCC subjects: Science: Microbiology
Website: https://journals.asm.org/journal/msphere
