Performance comparisons between clustering models for reconstructing NGS results from technical replicates

Yue Zhai; Claire Bardel; Maxime Vallée; Jean Iwaz; Pascal Roy

doi:10.3389/fgene.2023.1148147

Frontiers in Genetics (Mar 2023)

Affiliations

Yue Zhai: Université Lyon 1, Lyon, France
Yue Zhai: Université de Lyon, Lyon, France
Yue Zhai: Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
Claire Bardel: Université Lyon 1, Lyon, France
Claire Bardel: Université de Lyon, Lyon, France
Claire Bardel: Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
Claire Bardel: Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France
Claire Bardel: Service de Génétique, Hospices Civils de Lyon, Bron, France
Maxime Vallée: Cellule Bioinformatique de La Plateforme de Séquençage Haut Débit NGS-HCL, Hospices Civils de Lyon, Bron, France
Jean Iwaz: Université Lyon 1, Lyon, France
Jean Iwaz: Université de Lyon, Lyon, France
Jean Iwaz: Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
Jean Iwaz: Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France
Pascal Roy: Université Lyon 1, Lyon, France
Pascal Roy: Université de Lyon, Lyon, France
Pascal Roy: Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
Pascal Roy: Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France

DOI: https://doi.org/10.3389/fgene.2023.1148147
Journal volume & issue: Vol. 14

Abstract

To improve the performance of individual DNA sequencing results, researchers often use replicates from the same individual and various statistical clustering models to reconstruct a high-performance callset. Here, three technical replicates of genome NA12878 were considered and five model types were compared (consensus, latent class, Gaussian mixture, Kamila–adapted k-means, and random forest) on four performance indicators: sensitivity, precision, accuracy, and F1-score. Compared with using no combination model, i) the consensus model improved precision by 0.1%; ii) the latent class model improved precision by 1% (from 97% to 98%) without compromising sensitivity (98.9%); iii) the Gaussian mixture model and random forest provided callsets with higher precision (both >99%) but lower sensitivity; iv) Kamila increased precision (>99%) while keeping a high sensitivity (98.8%), and showed the best overall performance. According to the precision and F1-score indicators, the compared non-supervised clustering models that combine multiple callsets are able to improve sequencing performance vs. previously used supervised models. Among the models compared, the Gaussian mixture model and Kamila offered non-negligible precision and F1-score improvements. These models may thus be recommended for callset reconstruction (from either biological or technical replicates) for diagnostic or precision medicine purposes.
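As an illustration of the four indicators named in the abstract, the following minimal Python sketch combines three replicate callsets and scores the result against a truth set. It rests on assumptions not taken from the article: the consensus rule is taken to be a simple 2-of-3 majority vote (the article's consensus model may be defined differently), variants are represented as plain string identifiers, and the function names `consensus_calls` and `performance`, as well as the toy data, are hypothetical.

```python
from collections import Counter


def consensus_calls(replicates, min_votes=2):
    """Keep variants called in at least `min_votes` replicates.

    Assumption: a plain majority vote; this is only one possible
    way to define a consensus model.
    """
    votes = Counter(v for calls in replicates for v in set(calls))
    return {v for v, n in votes.items() if n >= min_votes}


def performance(called, truth, candidates):
    """Sensitivity, precision, accuracy and F1-score of a callset.

    `candidates` is the set of all sites evaluated, which is needed
    to count true negatives for the accuracy.
    """
    tp = len(called & truth)               # true variants that were called
    fp = len(called - truth)               # calls that are not true variants
    fn = len(truth - called)               # true variants that were missed
    tn = len(candidates - called - truth)  # sites correctly left uncalled
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Toy data (hypothetical identifiers, not real NA12878 calls).
candidates = {"v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"}
truth = {"v1", "v2", "v3", "v4"}
rep1 = {"v1", "v2", "v3", "v5"}
rep2 = {"v1", "v2", "v4", "v6"}
rep3 = {"v1", "v3", "v4", "v5"}

combined = consensus_calls([rep1, rep2, rep3])
single = performance(rep1, truth, candidates)
merged = performance(combined, truth, candidates)
print("replicate 1:", single)
print("consensus:  ", merged)
```

In this toy case the vote recovers the true variant that replicate 1 missed, so sensitivity rises, while a false call shared by two replicates survives the vote. That second effect is one reason a model-based combination might be preferred over simple voting.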

Published in Frontiers in Genetics

ISSN: 1664-8021 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Biology (General): Genetics
Website: http://journal.frontiersin.org/journal/genetics
