IEEE Access (Jan 2020)
Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network
Abstract
Speaker identification refers to the process of recognizing a person from their voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. Accurate identification depends on extracting discriminative and salient features from speaker utterances, and researchers have recently proposed various features for this purpose. Most studies on speaker identification have used short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), because they capture the repetitive nature of speech signals efficiently. Various studies have shown that MFCC features are effective in correctly identifying speakers. However, the performance of these features degrades on complex speech datasets, where they fail to capture speaker characteristics accurately. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the strengths of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with the DNN outperformed the baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, the DNN obtained better classification results than five machine learning algorithms recently used in speaker recognition. This study also evaluated one-level and two-level classification methods for speaker identification, and the experimental results showed that two-level classification performed better than one-level classification. The proposed features and classification model for identifying a speaker can be widely applied to different types of speaker datasets.
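The feature-fusion idea described above can be illustrated with a minimal sketch. The abstract does not state which time-domain features are used or how they are combined with MFCC, so everything here is an assumption made for illustration: the helper names (`frame_signal`, `time_domain_features`, `fuse_features`), the frame and hop sizes, the three time-domain descriptors, and the plain column-wise concatenation. The MFCC matrix is a random placeholder standing in for coefficients computed by a real front end.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def time_domain_features(frames):
    """Per-frame time-domain descriptors (illustrative choice, not the paper's list)."""
    energy = np.mean(frames ** 2, axis=1)                      # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    peak = np.max(np.abs(frames), axis=1)                      # peak amplitude
    return np.column_stack([energy, zcr, peak])

def fuse_features(mfcc, time_feats):
    """Concatenate per-frame MFCC and time-domain features column-wise (assumed fusion)."""
    if mfcc.shape[0] != time_feats.shape[0]:
        raise ValueError("MFCC and time-domain features must have the same frame count")
    return np.hstack([mfcc, time_feats])

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)                 # stand-in for 1 s of 16 kHz audio
frames = frame_signal(signal)
mfcc = rng.standard_normal((frames.shape[0], 13))   # placeholder for real MFCCs
time_feats = time_domain_features(frames)
fused = fuse_features(mfcc, time_feats)             # one fused vector per frame
```

Under these assumptions, each row of `fused` is a per-frame feature vector that could be passed to a classifier such as a DNN; the actual MFCCT construction in the paper may differ.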
Keywords