Representation learning applications in biological sequence analysis

Hitoshi Iuchi; Taro Matsutani; Keisuke Yamada; Natsuki Iwano; Shunsuke Sumi; Shion Hosoda; Shitao Zhao; Tsukasa Fukunaga; Michiaki Hamada

Computational and Structural Biotechnology Journal (Jan 2021)

Affiliations

Hitoshi Iuchi: Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan; Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan; Corresponding authors at: Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan (Hitoshi Iuchi); Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, Tokyo 169-8555, Japan (Michiaki Hamada).
Taro Matsutani: Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan; Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Keisuke Yamada: School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Natsuki Iwano: Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Shunsuke Sumi: Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan; Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
Shion Hosoda: Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan; Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Shitao Zhao: Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Tsukasa Fukunaga: Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan; Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
Michiaki Hamada: Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan; Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan; School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan; Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan

Journal volume & issue: Vol. 19, pp. 3198–3208

Abstract

Despite remarkable advances in high-throughput sequencing, analyzing the large volumes of rapidly generated biological (DNA/RNA/protein) sequence data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increasing attention. In this approach, biological sequences are regarded as sentences, and the single nucleotides, amino acids, or k-mers in these sequences are regarded as words. Embedding, the conversion of these words into vectors, is an essential step in NLP. Representation learning is an approach for learning this transformation, and it can be applied to biological sequences. The vectorized sequences can then be used for function and structure estimation, or as input to other probabilistic models. Given the importance and growing use of representation learning in biological research, this study reviews the existing knowledge on representation learning for biological sequence analysis.
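The sentence/word analogy described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the review itself: the function names (`kmer_tokenize`, `build_vocab`, `init_embeddings`, `embed_sequence`), the choice of k = 3, the toy sequences, and the mean-pooling step are all assumptions made for illustration, and the vectors are random placeholders standing in for embeddings that a representation learning model would actually learn.

```python
import random


def kmer_tokenize(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mers, which play the role of words."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(tokenized_sequences):
    """Assign an integer id to each distinct k-mer, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_sequences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def init_embeddings(vocab, dim=4, seed=0):
    """Create random placeholder vectors (hypothetical; a real model learns these)."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for token in vocab}


def embed_sequence(tokens, embeddings):
    """Mean-pool the k-mer vectors into one fixed-length vector per sequence."""
    if not tokens:
        raise ValueError("sequence is shorter than k, so it has no k-mers")
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        for j, value in enumerate(embeddings[token]):
            total[j] += value
    return [value / len(tokens) for value in total]


# Toy DNA "sentences" (made up for this example).
sequences = ["ATGCGT", "ATGAAA"]
tokenized = [kmer_tokenize(s, k=3) for s in sequences]
vocab = build_vocab(tokenized)
embeddings = init_embeddings(vocab, dim=4)
vectors = [embed_sequence(t, embeddings) for t in tokenized]

print(tokenized[0])  # ['ATG', 'TGC', 'GCG', 'CGT']
print(len(vocab))    # 7
```

Each resulting vector has a fixed length regardless of sequence length, which is what allows vectorized sequences to be passed to a downstream predictor. Mean pooling is only one simple way to obtain this; it is used here for brevity, not because the review prescribes it.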

Published in Computational and Structural Biotechnology Journal

ISSN: 2001-0370 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Technology: Chemical technology: Biotechnology
Website: https://www.journals.elsevier.com/computational-and-structural-biotechnology-journal
