Computational and Structural Biotechnology Journal (Jan 2022)

Regions with two amino acids in protein sequences: A step forward from homorepeats into the low complexity landscape

  • Pablo Mier
  • Miguel A. Andrade-Navarro

Journal volume & issue
Vol. 20
pp. 5516 – 5523

Abstract

Low complexity regions (LCRs) differ in amino acid composition from the background provided by the corresponding proteomes. The simplest LCRs are homorepeats (or polyX), regions composed mostly of a single amino acid type. Homorepeats have been characterized extensively, and their taxonomic, functional and structural features depend on the amino acid type and sequence context. The natural next step in the study of LCRs is to consider regions composed of two types of amino acids, which we call polyXY. We classify polyXY into three categories based on the arrangement of the two amino acid types ‘X’ and ‘Y’: direpeats (e.g. ‘XYXYXY’), joined (e.g. ‘XXXYYY’) and shuffled (e.g. ‘XYYXXY’). We developed a script to search for polyXY and used it to locate them in a comprehensive set of 20,340 reference proteomes. These results are available in a dedicated web server called XYs, where users can also submit their own protein datasets to detect polyXY. We studied the distribution of polyXY types by amino acid pair XY and category, and show that polyXY in Eukaryota are mainly located within intrinsically disordered regions. Our study provides a first step towards the characterization of polyXY as protein motifs.
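
The three polyXY categories can be illustrated with a minimal Python sketch. This is not the authors' search script, whose detection rules are not given in the abstract. The function name `classify_polyxy` and the decision rules are assumptions made for illustration only, and the sketch assumes a pure region containing exactly two amino acid types, with no tolerance for other residues.

```python
def classify_polyxy(region: str) -> str:
    """Classify a pure two-residue region as direpeat, joined or shuffled.

    Hypothetical helper: the rules below are illustrative assumptions,
    not the criteria used by the XYs web server.
    """
    if len(set(region)) != 2:
        raise ValueError("region must contain exactly two amino acid types")

    # Count positions where a residue differs from its left neighbour.
    changes = sum(a != b for a, b in zip(region, region[1:]))

    # Direpeat: strict alternation, so every adjacent pair differs.
    # Checked first, so a two-residue region such as 'XY' counts as a direpeat.
    if changes == len(region) - 1:
        return "direpeat"
    # Joined: one block of X followed by one block of Y (a single change point).
    if changes == 1:
        return "joined"
    # Anything else mixes the two residues without a regular arrangement.
    return "shuffled"


if __name__ == "__main__":
    for example in ("XYXYXY", "XXXYYY", "XYYXXY"):
        print(example, classify_polyxy(example))
```

Applied to the three examples in the abstract, the sketch labels ‘XYXYXY’ as a direpeat, ‘XXXYYY’ as joined and ‘XYYXXY’ as shuffled. A practical detector would also need a minimum region length and some tolerance for interrupting residues, neither of which is specified here.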

Keywords