EURASIP Journal on Audio, Speech, and Music Processing (Oct 2019)

A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

  • Chu-Xiong Qin
  • Wen-Lin Zhang
  • Dan Qu

DOI: https://doi.org/10.1186/s13636-019-0161-0
Journal volume & issue: Vol. 2019, no. 1, pp. 1–12

Abstract

A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently drawn increasing interest and achieved impressive performance. This hybrid end-to-end architecture adds an extra CTC loss to the attention-based model, imposing additional constraints on the alignments. To further improve such end-to-end models, we propose improvements to both the feature extraction and the attention mechanism. First, we introduce a joint model trained with high-level features derived from nonnegative matrix factorization (NMF). Then, we put forward a hybrid attention mechanism that incorporates multi-head attention and computes attention scores over multi-level outputs. Experiments on TIMIT indicate that our best model achieves state-of-the-art performance. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% higher in absolute terms than that of the best reference method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
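
The joint training objective the abstract describes, an attention-based loss with an extra CTC loss, is commonly written as an interpolation L = λ·L_CTC + (1−λ)·L_att. Below is a minimal PyTorch sketch of that interpolation under stated assumptions: the tensor shapes, the random toy data, and the weight ctc_weight are illustrative, not values taken from the paper.

import torch
import torch.nn as nn

# Interpolation weight (lambda) between the CTC and attention losses;
# this value is a common choice, not one reported by the paper.
ctc_weight = 0.3

# Toy dimensions for illustration only.
batch, enc_len, dec_len, vocab = 4, 50, 12, 30
blank_id = 0

# Encoder log-probabilities consumed by CTC: (time, batch, vocab).
enc_log_probs = torch.randn(enc_len, batch, vocab).log_softmax(dim=-1)
# Decoder logits from the attention branch: (batch, label_len, vocab).
dec_logits = torch.randn(batch, dec_len, vocab)

# Random label sequences shared by both branches (blank id excluded).
targets = torch.randint(1, vocab, (batch, dec_len))
input_lengths = torch.full((batch,), enc_len, dtype=torch.long)
target_lengths = torch.full((batch,), dec_len, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=blank_id)(
    enc_log_probs, targets, input_lengths, target_lengths)
att_loss = nn.CrossEntropyLoss()(
    dec_logits.reshape(-1, vocab), targets.reshape(-1))

# Joint objective: L = lambda * L_CTC + (1 - lambda) * L_attention.
loss = ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
print(f"joint loss: {loss.item():.4f}")

In practice the same interpolation weight is often reused at decoding time to combine CTC and attention scores during beam search; the sketch above covers only the training loss.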

Keywords