Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets

Milko Krachunov; Maria Nisheva; Dimitar Vassilev

doi:10.3390/computers6040029

Computers (Nov 2017)

Affiliations

Milko Krachunov: Faculty of Mathematics and Informatics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria
Maria Nisheva: Faculty of Mathematics and Informatics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria
Dimitar Vassilev: Faculty of Mathematics and Informatics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria

DOI: https://doi.org/10.3390/computers6040029
Journal volume & issue: Vol. 6, no. 4, p. 29

Abstract

In metagenomics datasets, datasets of complex polyploid genomes, and other high-variation genomics datasets, analysis, error detection, and variant calling are difficult because sequencing errors are hard to distinguish from biological variation. Accepting a base candidate because it occurs with high frequency is no longer a reliable measure, owing to natural variation and the presence of rare bases. The paper discusses an approach that applies machine learning models to classify bases as either sequencing errors or rare variants. Potential error candidates are first preselected with a weighted frequency measure, which uses inter-sequence pairwise similarity to focus on unexpected variations. Different similarity measures are used to account for different types of datasets. Four machine learning models are implemented and tested.
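
The preselection idea can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual formula, so everything here is an assumption made for illustration: the function names, the use of simple per-position identity as the pairwise similarity, the requirement that reads are already aligned to equal length, and the 0.1 threshold. The machine learning classification stage is not shown.

```python
from typing import List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length reads agree.

    Assumed stand-in for the paper's similarity measures, which the
    abstract does not specify.
    """
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_frequency(reads: List[str], read_idx: int, pos: int) -> float:
    """Similarity-weighted share of the other reads that carry the same
    base as reads[read_idx] at column pos.

    Reads that resemble the target read count for more, so a base that
    disagrees with its closest relatives scores low even if it is common
    among dissimilar reads.
    """
    target = reads[read_idx]
    total = 0.0
    agree = 0.0
    for j, other in enumerate(reads):
        if j == read_idx:
            continue  # a read does not vote for itself
        w = identity(target, other)
        total += w
        if other[pos] == target[pos]:
            agree += w
    return agree / total if total > 0 else 0.0


def error_candidates(reads: List[str], threshold: float = 0.1) -> List[Tuple[int, int]]:
    """Return (read index, position) pairs whose weighted frequency falls
    below the (hypothetical) threshold."""
    return [
        (i, p)
        for i, read in enumerate(reads)
        for p in range(len(read))
        if weighted_frequency(reads, i, p) < threshold
    ]
```

In this sketch, a base unsupported by similar reads is only flagged as a candidate. Deciding whether it is a sequencing error or a rare variant would be left to the classifier the abstract describes.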

Published in Computers

ISSN: 2073-431X (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.mdpi.com/journal/computers
