Search for SINE repeats in the rice genome using correlation-based position weight matrices

Yulia M. Suvorova; Anastasia M. Kamionskaya; Eugene V. Korotkov

doi:10.1186/s12859-021-03977-0

BMC Bioinformatics (Feb 2021)

Affiliations

Yulia M. Suvorova: Research Center of Biotechnology of the Russian Academy of Sciences
Anastasia M. Kamionskaya: Research Center of Biotechnology of the Russian Academy of Sciences
Eugene V. Korotkov: Research Center of Biotechnology of the Russian Academy of Sciences

DOI: https://doi.org/10.1186/s12859-021-03977-0
Journal volume & issue: Vol. 22, no. 1, pp. 1–18

Abstract

Background: Transposable elements (TEs) constitute a significant part of eukaryotic genomes. Short interspersed nuclear elements (SINEs) are non-autonomous TEs that are widely represented in mammalian genomes and are also found in plants. After insertion at a new position in the genome, TEs quickly accumulate mutations, which complicates their identification and annotation by current bioinformatics methods. In this study, we searched for highly divergent SINE copies in the genome of rice (Oryza sativa subsp. japonica) using the Highly Divergent Repeat Search Method (HDRSM).

Results: The HDRSM considers correlations of neighboring symbols to construct a position weight matrix (PWM) for a SINE family, which is then used to search for new copies. To evaluate the accuracy of the method and compare it with the RepeatMasker program, we generated a set of SINE copies containing nucleotide substitutions and indels and inserted them into an artificial chromosome for analysis. The HDRSM performed better both in the number of inserted repeats identified and in the accuracy of their boundaries. A search for copies of 39 SINE families in the rice genome produced 14,030 hits; of these, 5,704 were not detected by RepeatMasker.

Conclusions: The HDRSM can find divergent SINE copies, correctly determine their boundaries, and offer a high level of statistical significance. We also found that RepeatMasker finds relatively short copies of the SINE families with a higher level of similarity, while the HDRSM finds more diverged copies. To obtain a comprehensive profile of SINE distribution in the genome, combined application of the HDRSM and RepeatMasker is recommended.
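The idea of a PWM built from neighboring-symbol pairs can be illustrated with a small sketch. This is not the authors' HDRSM implementation: it is a toy example under several assumptions, namely gap-free aligned training copies, a uniform background over the 16 dinucleotides, a fixed pseudocount, and no handling of indels. The function names `build_pair_pwm`, `score_window` and `best_hit` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
PAIRS = [a + b for a in ALPHABET for b in ALPHABET]  # 16 dinucleotides


def build_pair_pwm(aligned, pseudocount=1.0):
    """Return one log-odds column per adjacent-base pair position.

    Assumption: `aligned` holds gap-free copies of equal length, and the
    background is uniform (1/16 per pair). Both are simplifications.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must have equal length")
    background = 1.0 / len(PAIRS)
    total = len(aligned) + pseudocount * len(PAIRS)
    pwm = []
    for i in range(length - 1):
        # Count which pair occupies positions (i, i + 1) in each copy.
        counts = Counter(s[i:i + 2] for s in aligned)
        column = {
            p: math.log(((counts[p] + pseudocount) / total) / background)
            for p in PAIRS
        }
        pwm.append(column)
    return pwm


def score_window(pwm, window):
    """Sum the pair weights along a window of length len(pwm) + 1."""
    return sum(col.get(window[i:i + 2], 0.0) for i, col in enumerate(pwm))


def best_hit(pwm, sequence):
    """Slide the matrix along `sequence`; return (score, start) of the best window."""
    width = len(pwm) + 1
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

For example, training on a few short copies such as `["ACGTAC", "ACGTAC", "ACGAAC", "ACGTAC"]` and calling `best_hit` on a longer sequence that contains `ACGTAC` returns the start of that embedded copy. Because each column scores a pair of adjacent bases rather than a single base, a window is rewarded only when neighboring symbols co-occur as they do in the training copies.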

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
