Jisuanji kexue yu tansuo (Sep 2021)

Research Status and Prospect of Transformer in Speech Recognition

  • ZHANG Xiaoxu, MA Zhiqiang, LIU Zhiqiang, ZHU Fangyuan, WANG Chunyu

DOI
https://doi.org/10.3778/j.issn.1673-9418.2103020
Journal volume & issue
Vol. 15, no. 9
pp. 1578–1594

Abstract

As a new deep learning framework, Transformer has attracted increasing attention from researchers and has become a current research hotspot. Inspired by the way humans focus only on what is important, the self-attention mechanism in the Transformer model learns the important information in an input sequence. For speech recognition tasks, the goal is to transcribe an input speech sequence into the corresponding text. The traditional practice was to combine an acoustic model, a pronunciation dictionary, and a language model into a speech recognition system, whereas Transformer can integrate these components into a single neural network to form an end-to-end speech recognition system, which avoids problems of the traditional pipeline such as forced alignment and multi-module training. It is therefore worthwhile to discuss the problems Transformer faces in speech recognition tasks. This paper first introduces the structure of the Transformer model. It then analyzes the problems confronting speech recognition with respect to the input speech sequence, the deep model architecture, and model inference, and it outlines and summarizes methods for overcoming the obstacles in these three aspects. Finally, future applications and directions of Transformer in speech recognition are summarized and discussed.
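
For readers unfamiliar with the mechanism the abstract refers to, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation of the Transformer. The shapes, variable names, and random toy data are illustrative assumptions, not taken from the surveyed paper.

```python
import numpy as np

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """Self-attention over a sequence x of shape (T, d_model).

    Each output position is a weighted sum of all value vectors,
    with weights softmax(Q K^T / sqrt(d_k)); large weights mark
    the positions the model treats as important.
    """
    q = x @ w_q  # queries, shape (T, d_k)
    k = x @ w_k  # keys,    shape (T, d_k)
    v = x @ w_v  # values,  shape (T, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (T, T) pairwise similarities
    # Row-wise softmax, shifted by the row max for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (T, d_v) context vectors

# Toy example: 5 acoustic frames with 8-dimensional features.
rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4
x = rng.standard_normal((T, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
print(scaled_dot_product_self_attention(x, w_q, w_k, w_v).shape)  # (5, 4)
```

Because every frame attends to every other frame in one step, this operation replaces the recurrence of earlier acoustic models, which is what makes the end-to-end Transformer systems discussed in the paper possible.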

Keywords