The Journal of Engineering (Dec 2021)

How many Mel‐frequency cepstral coefficients to be utilized in speech recognition? A study with the Bengali language

  • Md. Rakibul Hasan,
  • Md. Mahbub Hasan,
  • Md Zakir Hossain

DOI
https://doi.org/10.1049/tje2.12082
Journal volume & issue
Vol. 2021, no. 12
pp. 817 – 827

Abstract

Speech‐related research has a wide range of applications, and most studies in the field employ Mel‐frequency cepstral coefficients (MFCCs) as acoustic features. However, the optimum number of MFCCs remains an open research question. In this study, MFCC‐based speech classification was performed for both vowels and words in the Bengali language. A deep neural network (DNN) trained with the Adam optimizer served as the classification model. Performance was measured with five metrics, namely confusion matrix, classification accuracy, area under the receiver operating characteristic curve (AUC‐ROC), F1 score, and Cohen's Kappa, under four‐fold cross‐validation at different numbers of MFCCs. All performance metrics gave their best scores for 24/25 MFCCs; hence it is suggested that the optimum number of MFCCs should be 25, although many existing studies use only 13 MFCCs. Furthermore, it is verified that increasing the number of MFCCs yields better classification metrics with a lower computational burden than adding hidden layers. Lastly, the optimum number of MFCCs obtained from this study was used in an improved DNN model, which achieved 99% and 90% accuracy for vowel and word classification, respectively; the vowel classification score outperformed state‐of‐the‐art results.
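Three of the metrics named above (classification accuracy, F1 score, and Cohen's Kappa) can all be derived from a confusion matrix. The following is a minimal sketch of that relationship, not the authors' implementation: the function names are hypothetical, the F1 score is macro-averaged (an assumption, since the abstract does not state the averaging scheme), and the example matrix is made up for illustration rather than taken from the study.

```python
# Rows are true classes, columns are predicted classes.

def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def macro_f1(cm):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def cohen_kappa(cm):
    """Agreement between truth and prediction, corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = accuracy(cm)
    # Chance agreement: product of row and column marginals for each class.
    p_expected = sum(
        sum(cm[k]) * sum(cm[i][k] for i in range(n)) for k in range(n)
    ) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical two-class confusion matrix, for illustration only.
example_cm = [[8, 2],
              [1, 9]]

print(f"accuracy = {accuracy(example_cm):.3f}")     # accuracy = 0.850
print(f"macro F1 = {macro_f1(example_cm):.3f}")     # macro F1 = 0.850
print(f"kappa    = {cohen_kappa(example_cm):.3f}")  # kappa    = 0.700
```

In a setting like the one described, such metrics would be computed once per cross-validation fold and per MFCC count, then averaged across folds before comparing feature sizes.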