Advanced recurrent network-based hybrid acoustic models for low resource speech recognition

Jian Kang; Wei-Qiang Zhang; Wei-Wei Liu; Jia Liu; Michael T. Johnson

doi:10.1186/s13636-018-0128-6

EURASIP Journal on Audio, Speech, and Music Processing (Jul 2018)

Affiliations

Jian Kang: Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University
Wei-Qiang Zhang: Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University
Wei-Wei Liu: 62315 Unit, Chinese People’s Liberation Army
Jia Liu: Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University
Michael T. Johnson: Electrical and Computer Engineering, College of Engineering, University of Kentucky

DOI: https://doi.org/10.1186/s13636-018-0128-6
Journal volume & issue: Vol. 2018, no. 1
pp. 1 – 15

Abstract

Recurrent neural networks (RNNs) can model temporal dependencies, but the problem of exploding or vanishing gradients has limited their application. In recent years, long short-term memory RNNs (LSTM RNNs) have been proposed to solve this problem and have achieved excellent results. Bidirectional LSTM (BLSTM), which uses both preceding and following context, has shown particularly good performance. However, BLSTM approaches are computationally expensive, even when implemented efficiently on GPU-based high-performance computers. In addition, because the output of LSTM units is bounded, gradients can still vanish across multiple layers, and the large size of LSTM networks makes them susceptible to overfitting. In this work, we combine a local bidirectional architecture, a newer recurrent unit (the gated recurrent unit, GRU), and residual architectures to address these problems. Experiments are conducted on the benchmark datasets released under the IARPA Babel Program. The proposed models achieve 3 to 10% relative improvements over their corresponding DNN or LSTM baselines across seven language collections. In addition, the new models speed up learning by a factor of more than 1.6 compared to conventional BLSTM models. By using these approaches, we achieve good results in the IARPA Babel Program.
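
As an illustration of two of the ingredients named in the abstract, the following is a minimal sketch of a textbook GRU step wrapped in a residual (skip) connection. It is not the authors' implementation: the gate convention, the helper names (`gru_step`, `residual_gru_layer`, `init_params`), and the assumption that the input and hidden dimensions are equal so the two can be added are all choices made here for illustration only. The local bidirectional component is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Plain-Python matrix-vector product; W is a list of rows.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gate(W, U, b, x, h, act):
    # act(W x + U h + b), applied element-wise.
    return [act(a + c + d) for a, c, d in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, p):
    """One step of a standard GRU (textbook form, assumed here)."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h_prev, sigmoid)  # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h_prev, sigmoid)  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = gate(p["Wh"], p["Uh"], p["bh"], x, reset_h, math.tanh)
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]


def residual_gru_layer(xs, p):
    """Run a GRU over a sequence and add the layer input to its output.

    Assumes input and hidden dimensions match (an illustrative choice).
    """
    h = [0.0] * len(p["bz"])
    ys = []
    for x in xs:
        h = gru_step(x, h, p)
        ys.append([xi + hi for xi, hi in zip(x, h)])  # skip connection
    return ys


def init_params(dim, scale=0.1, seed=0):
    # Hypothetical helper: small uniform weights, zero biases.
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]

    p = {}
    for g in ("z", "r", "h"):
        p["W" + g], p["U" + g], p["b" + g] = mat(), mat(), [0.0] * dim
    return p


if __name__ == "__main__":
    params = init_params(dim=4)
    frames = [[0.5, -0.2, 0.1, 0.0] for _ in range(3)]
    for y in residual_gru_layer(frames, params):
        print([round(v, 4) for v in y])
```

The GRU state is a convex combination of bounded values and so stays within (-1, 1), whereas the skip connection passes the layer input through unchanged. That identity path is what lets gradients flow through a deep stack without being squashed at every layer, which is the motivation the abstract gives for residual architectures.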

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
