Combined Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients Using Autoencoder for Speaker Recognition

Young-Long Chen; Neng-Chung Wang; Jing-Fong Ciou; Rui-Qi Lin

doi:10.3390/app13127008

Applied Sciences (Jun 2023)

Affiliations

Young-Long Chen: Department of Computer Science and Information Engineering, National Taichung University of Science and Technology, Taichung 404336, Taiwan
Neng-Chung Wang: Department of Computer Science and Information Engineering, National United University, Miaoli 360302, Taiwan
Jing-Fong Ciou: Department of Computer Science and Information Engineering, National Taichung University of Science and Technology, Taichung 404336, Taiwan
Rui-Qi Lin: Department of Computer Science and Information Engineering, National Taichung University of Science and Technology, Taichung 404336, Taiwan

DOI: https://doi.org/10.3390/app13127008
Journal volume & issue: Vol. 13, no. 12, p. 7008

Abstract

Neural network technology has recently made remarkable progress in speech recognition, including word classification, emotion recognition, and identity recognition. This paper introduces three novel speaker recognition methods aimed at improving accuracy. The first, long short-term memory with mel-frequency cepstral coefficients for triplet loss (LSTM-MFCC-TL), uses MFCCs as input features for an LSTM model and trains it with triplet loss and cluster training. The second, bidirectional long short-term memory with mel-frequency cepstral coefficients for triplet loss (BLSTM-MFCC-TL), replaces the LSTM with a bidirectional LSTM to improve recognition accuracy. The third, bidirectional long short-term memory with mel-frequency cepstral coefficients and autoencoder features for triplet loss (BLSTM-MFCCAE-TL), uses an autoencoder to extract additional AE features, which are concatenated with the MFCCs and fed into the BLSTM model. In the experiments, the BLSTM model outperformed the LSTM model, and the method that adds AE features achieved the best learning performance. The proposed methods also ran faster than the reference GMM-HMM model. The results indicate that encoding speakers with a pre-trained autoencoder to obtain AE features can significantly improve the learning performance of speaker recognition while also reducing computation time relative to traditional methods.
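
The two mechanisms the abstract names, concatenating AE features with MFCCs and training with triplet loss, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Euclidean distance, and the margin of 0.2 are assumptions made for illustration, and the embeddings passed to the loss stand in for whatever the BLSTM would output.

```python
import math


def concat_features(mfcc_frames, ae_frames):
    """Join per-frame MFCC and AE vectors into one input vector per frame.

    Hypothetical helper: assumes both sequences are aligned frame by frame.
    """
    if len(mfcc_frames) != len(ae_frames):
        raise ValueError("MFCC and AE sequences must have the same number of frames")
    return [list(m) + list(a) for m, a in zip(mfcc_frames, ae_frames)]


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive are embeddings of the same speaker, negative is an
    embedding of a different speaker. The margin value is an assumption.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Two frames with 2 MFCC values and 1 AE value each (toy numbers).
    frames = concat_features([[1.0, 2.0], [3.0, 4.0]], [[9.0], [8.0]])
    print(frames)

    # The loss is zero once the negative is farther than the positive by the margin.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

The loss only penalizes triplets in which a different speaker's embedding is not at least the margin farther from the anchor than the same speaker's embedding, which is what pushes same-speaker utterances together during training.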

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
