NASCUP: Nucleic Acid Sequence Classification by Universal Probability

Sunyoung Kwon; Gyuwan Kim; Byunghan Lee; Jongsik Chun; Sungroh Yoon; Young-Han Kim

doi:10.1109/access.2021.3127957

IEEE Access (Jan 2021)

Affiliations

Sunyoung Kwon: School of Biomedical Convergence Engineering, Pusan National University, Yangsan, South Korea
Gyuwan Kim: Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA, USA
Byunghan Lee: Department of Electronic and IT Media Engineering, Seoul National University of Science and Technology, Seoul, South Korea
Jongsik Chun: School of Biological Sciences, Seoul National University, Seoul, South Korea
Sungroh Yoon: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
Young-Han Kim: Department of Electrical and Computer Engineering, University of California, San Diego, San Diego, CA, USA

DOI: https://doi.org/10.1109/access.2021.3127957
Journal volume & issue: Vol. 9, pp. 162779–162791

Abstract

Nucleic acid sequence classification is a fundamental task in bioinformatics. As the volume of unlabeled nucleotide sequences grows, fast and accurate classification at large scale has become crucial. In this work, we developed NASCUP, a new classification method that captures the statistical structure of nucleotide sequences using compact context-tree models and universal probability from information theory. A comprehensive experimental study involving nine public databases for functional non-coding RNA, microbial taxonomy, and coding/non-coding RNA classification demonstrates the advantages of NASCUP over widely used alternatives in efficiency, accuracy, and scalability across all datasets considered. NASCUP consistently achieved BLAST-like classification accuracy on several large-scale databases with runtime reduced by orders of magnitude, and was also applied to other bioinformatics tasks such as outlier detection and synthetic sequence generation.
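
The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea of classifying a sequence by the probability each class model assigns to it. It is not the NASCUP implementation: as a simplifying assumption it replaces the context-tree model with a single fixed-order context (depth 2) and uses the Krichevsky-Trofimov (add-1/2) estimator as the sequential probability. The names train_counts, log_prob, and classify, and the toy training sequences, are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_counts(sequences, depth=2):
    """Count next-symbol occurrences for every length-`depth` context."""
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    for seq in sequences:
        for i in range(depth, len(seq)):
            ctx, sym = seq[i - depth:i], seq[i]
            # Skip positions that involve ambiguous bases such as N.
            if sym in ALPHABET and all(c in ALPHABET for c in ctx):
                counts[ctx][ALPHABET.index(sym)] += 1
    return counts


def log_prob(seq, counts, depth=2):
    """Log2 probability of `seq` given class counts, with KT smoothing.

    Each symbol gets probability (n_sym + 1/2) / (n_ctx + |alphabet| / 2),
    where the counts come from the training data of one class.
    """
    total = 0.0
    for i in range(depth, len(seq)):
        ctx, sym = seq[i - depth:i], seq[i]
        if sym not in ALPHABET:
            continue
        c = counts.get(ctx, [0] * len(ALPHABET))  # unseen context: uniform
        n_sym = c[ALPHABET.index(sym)]
        total += math.log2((n_sym + 0.5) / (sum(c) + 0.5 * len(ALPHABET)))
    return total


def classify(seq, models, depth=2):
    """Return the label whose model assigns `seq` the highest probability."""
    return max(models, key=lambda label: log_prob(seq, models[label], depth))


# Toy example with two made-up classes.
models = {
    "at_rich": train_counts(["ATATATATATAT", "AATTAATTAATT"]),
    "gc_rich": train_counts(["GCGCGCGCGCGC", "GGCCGGCCGGCC"]),
}
label = classify("ATATATAT", models)
```

In this sketch each class is summarized by a table of context counts, so scoring a query is a single pass over its symbols per class. A context-tree approach would instead mix over many context depths rather than fixing one.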

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
