PanKA: Leveraging population pangenome to predict antibiotic resistance
Van Hoan Do,
Van Sang Nguyen,
Son Hoang Nguyen,
Duc Quang Le,
Tam Thi Nguyen,
Canh Hao Nguyen,
Tho Huu Ho,
Nam S. Vo,
Trang Nguyen,
Hoang Anh Nguyen,
Minh Duc Cao
Affiliations
Van Hoan Do
Center for Applied Mathematics and Informatics, Le Quy Don Technical University, Hanoi, Vietnam; Corresponding author
Van Sang Nguyen
Center for Biomedical Informatics, Vingroup Big Data Institute, Hanoi, Vietnam
Son Hoang Nguyen
AMROMICS JSC, Vinh, Nghe An, Vietnam
Duc Quang Le
Faculty of IT, Hanoi University of Civil Engineering, Hanoi, Vietnam
Tam Thi Nguyen
Oxford University Clinical Research Unit, Hanoi, Vietnam
Canh Hao Nguyen
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
Tho Huu Ho
Department of Medical Microbiology, The 103 Military Hospital, Vietnam Military Medical University, Hanoi, Vietnam; Department of Genomics & Cytogenetics, Institute of Biomedicine & Pharmacy, Vietnam Military Medical University, Hanoi, Vietnam
Nam S. Vo
Center for Biomedical Informatics, Vingroup Big Data Institute, Hanoi, Vietnam
Trang Nguyen
AMROMICS JSC, Vinh, Nghe An, Vietnam
Hoang Anh Nguyen
AMROMICS JSC, Vinh, Nghe An, Vietnam
Minh Duc Cao
AMROMICS JSC, Vinh, Nghe An, Vietnam; Corresponding author
Summary: Machine learning has the potential to be a powerful tool in the fight against antimicrobial resistance (AMR), a critical global health issue. It can identify resistance mechanisms from DNA sequence data without prior knowledge. The first step in building a machine learning model is feature extraction from sequencing data. Traditional methods such as single nucleotide polymorphism (SNP) calling and k-mer counting yield numerous, often redundant features, which complicates prediction and analysis. In this paper, we propose PanKA, a method that uses the pangenome to extract a concise set of relevant features for predicting AMR. PanKA not only enables fast model training and prediction but also improves accuracy. Applied to Escherichia coli and Klebsiella pneumoniae, our model is more accurate than conventional and state-of-the-art methods in predicting AMR.
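To make the contrast between feature types concrete, the following minimal Python sketch compares whole-genome k-mer counting with a gene presence/absence matrix, one simple kind of pangenome-derived feature. It is an illustration only and not the PanKA pipeline: the function names, the toy sequences, and the gene labels (`geneX`, `geneY`) are hypothetical, and a real analysis would obtain gene clusters from a pangenome inference tool rather than from a hand-written dictionary.

```python
def kmer_counts(seq, k):
    """Count every overlapping k-mer in a single sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def kmer_feature_matrix(genomes, k):
    """Build one count vector per sample over the union of observed k-mers."""
    per_sample = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    vocab = sorted(set().union(*per_sample.values()))
    matrix = {
        name: [counts.get(kmer, 0) for kmer in vocab]
        for name, counts in per_sample.items()
    }
    return vocab, matrix


def gene_presence_matrix(gene_content):
    """Build one 0/1 vector per sample over the union of gene clusters."""
    genes = sorted(set().union(*gene_content.values()))
    matrix = {
        name: [1 if gene in present else 0 for gene in genes]
        for name, present in gene_content.items()
    }
    return genes, matrix


# Toy inputs (hypothetical): two short "genomes" that differ at one base,
# and the gene clusters assumed to be present in each isolate.
genomes = {
    "isolate_A": "ATGGCGTACGTTAGC",
    "isolate_B": "ATGGCGTACGATAGC",
}
gene_content = {
    "isolate_A": {"geneX", "geneY"},
    "isolate_B": {"geneX"},
}

vocab, kmer_matrix = kmer_feature_matrix(genomes, k=4)
genes, gene_matrix = gene_presence_matrix(gene_content)

print("k-mer features:", len(vocab))
print("gene presence/absence features:", len(genes))
```

Even on these toy inputs, a single base difference introduces several new k-mer columns, while the gene-level representation keeps one column per gene cluster. This is the kind of redundancy that the summary attributes to k-mer counting.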