Journal of King Saud University: Computer and Information Sciences (Mar 2023)

Attention-based latent features for jointly trained end-to-end automatic speech recognition with modified speech enhancement

  • Da-Hee Yang
  • Joon-Hyuk Chang

Journal volume & issue
Vol. 35, no. 3
pp. 202–210

Abstract


In this paper, we propose a joint training framework that efficiently combines time-domain speech enhancement (SE) with an end-to-end (E2E) automatic speech recognition (ASR) system by utilizing attention-based latent features. Training the E2E ASR model on these latent features means that various time-domain SE models can be applied to noise-robust ASR; to our knowledge, our modified framework is the first such approach. By adopting a time-domain SE model, we implement a fully E2E pipeline from SE to ASR that requires neither domain knowledge nor short-time Fourier transform (STFT) consistency constraints. The core of our framework is thus to use the latent features of the time-domain SE model as the input features for ASR. Furthermore, we apply an attention mechanism to the time-domain SE model so that it selectively concentrates on certain latent features, yielding features more relevant to the recognition task. Detailed experiments are conducted on a hybrid CTC/attention architecture for E2E ASR, and we demonstrate the superiority of our approach over baseline ASR systems trained with Mel filter bank coefficients as input. Compared to the baseline ASR model trained only on clean data, the proposed joint training method achieves relative error reductions of 63.6% and 86.8% on the TIMIT and WSJ "matched" test sets, respectively.
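To make the described architecture concrete, below is a minimal PyTorch sketch of the general idea, not the authors' implementation: a Conv-TasNet-style time-domain SE encoder whose self-attention-weighted latent features are fed to an ASR encoder, trained with a joint objective combining a CTC term and a time-domain SE term. All names and values here (SELatentEncoder, JointSEASR, the 0.5 loss weight, and the LSTM/CTC stub standing in for the full hybrid CTC/attention model) are illustrative assumptions.

```python
# Minimal sketch (assumptions): Conv-TasNet-style SE encoder, multi-head
# self-attention over latent features, and an LSTM + CTC stub in place of
# the paper's full hybrid CTC/attention ASR model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SELatentEncoder(nn.Module):
    """Time-domain SE front end that exposes attention-weighted latent
    features in addition to the enhanced waveform (hypothetical design)."""

    def __init__(self, latent_dim=256, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, latent_dim, kernel, stride=stride)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel, stride=stride)
        # Self-attention lets the model concentrate on certain latent frames.
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)

    def forward(self, noisy_wave):                             # (B, samples)
        z = torch.relu(self.encoder(noisy_wave.unsqueeze(1)))  # (B, D, T)
        z = z.transpose(1, 2)                                  # (B, T, D)
        latent, _ = self.attn(z, z, z)                         # attention-weighted
        enhanced = self.decoder(latent.transpose(1, 2)).squeeze(1)
        return latent, enhanced  # latent -> ASR input, enhanced -> SE loss


class JointSEASR(nn.Module):
    """ASR consumes the SE latent features instead of Mel filter bank
    features; both models are optimized with a single joint loss."""

    def __init__(self, latent_dim=256, vocab_size=32):
        super().__init__()
        self.se = SELatentEncoder(latent_dim)
        self.asr_encoder = nn.LSTM(latent_dim, 256, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(256, vocab_size)  # CTC branch only, for brevity

    def forward(self, noisy_wave):
        latent, enhanced = self.se(noisy_wave)
        h, _ = self.asr_encoder(latent)
        return F.log_softmax(self.ctc_head(h), dim=-1), enhanced


# Joint training step on toy tensors (shapes only, not real speech).
model = JointSEASR()
noisy, clean = torch.randn(2, 16000), torch.randn(2, 16000)
targets = torch.randint(1, 32, (2, 10))              # dummy token sequences
log_probs, enhanced = model(noisy)

T = log_probs.shape[1]
ctc = F.ctc_loss(log_probs.transpose(0, 1), targets,
                 torch.full((2,), T, dtype=torch.long),
                 torch.full((2,), 10, dtype=torch.long))
se = F.mse_loss(enhanced, clean[:, :enhanced.shape[-1]])
loss = ctc + 0.5 * se  # 0.5 is an illustrative SE weight, not from the paper
loss.backward()        # gradients flow into SE via the shared latent path
```

In this sketch the SE encoder and attention parameters sit on both the SE and ASR gradient paths, which is what allows joint training to shape the latent features for recognition as well as enhancement.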

Keywords