Revealing the Presence of a Symbolic Sequence Representing Multiple Nucleotides Based on K-Means Clustering of Oligonucleotides

Byoungsang Lee; So Yeon Ahn; Charles Park; James J. Moon; Jung Heon Lee; Dan Luo; Soong Ho Um; Seung Won Shin

doi:10.3390/molecules24020348

Molecules (Jan 2019)

Affiliations

Byoungsang Lee: School of Advanced Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
So Yeon Ahn: School of Chemical Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
Charles Park: Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
James J. Moon: Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
Jung Heon Lee: School of Advanced Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
Dan Luo: Department of Biological and Environmental Engineering, Cornell University, Ithaca, NY 14850, USA
Soong Ho Um: School of Chemical Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
Seung Won Shin: School of Chemical Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea

DOI: https://doi.org/10.3390/molecules24020348
Journal volume & issue: Vol. 24, no. 2, p. 348

Abstract

In biological systems, a few sequence differences diversify the hybridization profiles of nucleotides and enable the quantitative, cooperative control of cellular metabolism. In this respect, the information required for a better understanding may lie not in each individual nucleotide sequence but in representative information shared among them. Existing methodologies for nucleotide sequence design have been optimized to track the function of a genetic molecule and to predict its interactions with others. However, there has been no attempt to extract new sequence information that represents their inheritance function. Here, we sought to reveal conceptually the presence of a representative sequence within groups of nucleotides. The combined application of the K-means clustering algorithm and the social network analysis theorem enabled the effective calculation of the representative sequence. First, a “common sequence” is constructed that has the highest hybridization property with the analog sequences. Next, the sequence complementary to the common sequence is designated the “representative sequence”. Based on this, we obtained a representative sequence from multiple analog sequences that are 8–10 bases long. Their hybridization was tested empirically, which confirmed that the common sequence had the highest hybridization tendency and that the representative sequence aligned better with the analogs than a mere complementary sequence.
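
The relationship between the analogs, the common sequence, and the representative sequence can be illustrated with a minimal sketch. It is not the authors' K-means and network-analysis procedure: as a stand-in, it assumes equal-length analogs, takes a per-position majority consensus as the representative-like sequence, and assumes "complementary" means the Watson-Crick reverse complement. The function names and the example analogs are hypothetical.

```python
from collections import Counter

# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the strand that would pair with `seq` (read 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def majority_consensus(seqs):
    """Per-position majority base across equal-length sequences.

    Assumption: this simple consensus stands in for the clustering-based
    calculation described in the abstract. Ties go to the base that comes
    first alphabetically.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    consensus = []
    for column in zip(*seqs):
        counts = Counter(column)
        # sorted() fixes the iteration order, so max() breaks ties alphabetically.
        consensus.append(max(sorted(counts), key=counts.get))
    return "".join(consensus)


# Hypothetical 8-base analogs differing at single positions.
analogs = ["AAGCTTGC", "AAGCATGC", "ATGCTTGC"]

representative_like = majority_consensus(analogs)        # resembles the analogs
common_like = reverse_complement(representative_like)    # would pair with them

print(representative_like, common_like)
```

In this toy setting the consensus plays the role of the representative sequence and its reverse complement plays the role of the common sequence; actual hybridization behavior depends on thermodynamics that this sketch does not model.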

Published in Molecules

ISSN: 1420-3049 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Chemistry: Organic chemistry
Website: http://www.mdpi.com/journal/molecules
