PredictION: a predictive model to establish the performance of Oxford sequencing reads of SARS-CoV-2
David E. Valencia-Valencia,
Diana Lopez-Alvarez,
Nelson Rivera-Franco,
Andres Castillo,
Johan S. Piña,
Carlos A. Pardo,
Beatriz Parra
Affiliations
David E. Valencia-Valencia
Laboratorio de Técnicas y Análisis Ómicos—TAOLab/CiBioFi, Facultad de Ciencias Naturales y Exactas, Universidad del Valle, Cali, Valle del Cauca, Colombia
Diana Lopez-Alvarez
Laboratorio de Técnicas y Análisis Ómicos—TAOLab/CiBioFi, Facultad de Ciencias Naturales y Exactas, Universidad del Valle, Cali, Valle del Cauca, Colombia
Nelson Rivera-Franco
Laboratorio de Técnicas y Análisis Ómicos—TAOLab/CiBioFi, Facultad de Ciencias Naturales y Exactas, Universidad del Valle, Cali, Valle del Cauca, Colombia
Andres Castillo
Laboratorio de Técnicas y Análisis Ómicos—TAOLab/CiBioFi, Facultad de Ciencias Naturales y Exactas, Universidad del Valle, Cali, Valle del Cauca, Colombia
Johan S. Piña
Department of Data Science, People Contact, Manizales, Caldas, Colombia
Carlos A. Pardo
Department of Neurology, Pathology, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
Beatriz Parra
Grupo VIREM—Virus Emergentes y Enfermedad, Escuela de Ciencias Básicas, Facultad de Salud, Universidad del Valle, Cali, Valle del Cauca, Colombia
Limited research resources in developing countries make it necessary to consider wet-lab strategies that reuse molecular biology reagents to reduce costs. In this study, we used linear regression to model coverage depth as a function of the number of MinION reads sequenced, in order to define the optimum number of reads needed to obtain >200X coverage depth with a good lineage-clade assignment of SARS-CoV-2 genomes. The research aimed to create and implement a model based on machine learning algorithms that predicts different variables (e.g., coverage depth) from the number of MinION reads produced by Nanopore sequencing, so as to maximize the yield of high-quality SARS-CoV-2 genomes, determine the best sequencing runtime, and allow the flow cell to be reused in a new run with the remaining available nanopores. The best accuracy was −0.98 according to the R squared performance metric of the models. A demo version is available at https://genomicdashboard.herokuapp.com/.
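The regression idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-predictor ordinary least squares fit of coverage depth on read count, and the function names (`fit_line`, `r_squared`, `reads_for_depth`) and all data values are hypothetical placeholders chosen only to make the example run.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line on (x, y)."""
    my = mean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def reads_for_depth(target_depth, slope, intercept):
    """Invert the fitted line: reads needed to reach a target coverage depth."""
    return (target_depth - intercept) / slope


# Synthetic placeholder values, NOT data from the study.
reads = [2000, 4000, 6000, 8000, 10000]   # MinION reads per sample
depth = [52, 98, 151, 205, 248]           # mean coverage depth (X)

slope, intercept = fit_line(reads, depth)
r2 = r_squared(reads, depth, slope, intercept)
needed = reads_for_depth(200, slope, intercept)

print(f"depth = {slope:.5f} * reads + {intercept:.2f}")
print(f"R squared = {r2:.4f}")
print(f"reads for 200X = {needed:.0f}")
```

Inverting the fitted line gives an estimate of the read count at which a run could be stopped for a given depth target. In practice such an estimate would need validation on held-out samples before being used to decide when to stop a run and save the flow cell.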