Tehnički Vjesnik (Jan 2020)

Distributed Representation of Protein Sequence Based on Multi-Alignment Results

  • Siqi Wang,
  • Liu He,
  • Shi Cheng,
  • Xiaohu Shi

DOI
https://doi.org/10.17559/TV-20200417091724
Journal volume & issue
Vol. 27, no. 4
pp. 1237 – 1243

Abstract

Protein sequence representation is a key problem in protein studies, especially for sequence-based models. In this paper, we propose a distributed representation model of protein sequences that incorporates evolutionary information by using multi-alignment results. First, we construct a non-redundant protein dataset and perform multi-alignment for each protein. A k-mer amino acid "biology corpus" is then extracted from the alignment results, which are enriched in evolutionary information. Using this corpus, distributed embedding vectors for k-mer amino acids are trained with the word2vec method. We compared the amino acid pair distances derived from our 1-mer amino acid embedding vectors with those derived from BLOSUM62; the Pearson correlation coefficient between the two is 0.937, indicating a strong correlation. We then applied the resulting amino acid embeddings to protein secondary structure recognition and solubility prediction. In both experiments, the alignment-based amino acid distributed representation outperforms the representation derived directly from protein sequences. Moreover, compared with recent existing algorithms, our method achieves better or comparable results while using only our amino acid distributed vectors as features.
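
As a rough illustration of the corpus-building and correlation steps described in the abstract, the following is a minimal sketch and not the authors' implementation. The function names `kmer_sentences` and `pearson`, the choice to drop gap characters, and the shifted non-overlapping k-mer splitting are all assumptions, since the abstract does not specify these details.

```python
from math import sqrt


def kmer_sentences(aligned_rows, k=3, gap="-"):
    """Turn aligned sequences into k-mer 'sentences' for embedding training.

    Assumptions (not stated in the abstract): gap characters are dropped,
    and each row yields up to k sentences, one per starting offset, each
    made of non-overlapping k-mers.
    """
    sentences = []
    for row in aligned_rows:
        residues = row.replace(gap, "")
        for offset in range(k):
            words = [
                residues[i:i + k]
                for i in range(offset, len(residues) - k + 1, k)
            ]
            if words:
                sentences.append(words)
    return sentences


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy alignment rows (made up for illustration only).
corpus = kmer_sentences(["MK-VLA", "MKAV-A"], k=3)
```

The resulting list of token lists could be passed to any word2vec implementation. The same `pearson` helper could then compare pairwise distances between the learned 1-mer vectors with the corresponding BLOSUM62-derived distances, which is the kind of comparison behind the reported coefficient.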

Keywords