IEEE Access (Jan 2024)

Mel-Scale Frequency Extraction and Classification of Dialect-Speech Signals With 1D CNN Based Classifier for Gender and Region Recognition

  • Hsiang-Yueh Lai,
  • Chia-Chieh Hu,
  • Chia-Hung Wen,
  • Jian-Xing Wu,
  • Neng-Sheng Pai,
  • Cheng-Yu Yeh,
  • Chia-Hung Lin

DOI: https://doi.org/10.1109/ACCESS.2024.3430296
Journal volume & issue: Vol. 12, pp. 102962–102976

Abstract


Humans communicate and interact through natural languages, such as American English (AE), Taiwanese, Italian, and numerous variants of Spanish. Through automatic speech analysis and recognition technologies, human-machine interaction systems (HMISs) can support language learning in query systems, smart devices, and healthcare applications, underscoring the need to enhance user interaction across different sectors. Because people differ in basic attributes (e.g., gender, age group, and spoken dialect), an HMIS must be able to identify a speaker's gender, age group, and regional dialect from their speech signals. To achieve automatic speech recognition, we analyzed and distinguished feature patterns using a feature extraction method and identified gender and region using a convolutional neural network (CNN)-based classifier. Mel-frequency cepstral coefficients were used to extract Mel-scale frequencies (MSF) from dialect-sentence speech signals for conversion into specific feature patterns. Subsequently, a one-dimensional CNN-based classifier was used to identify these feature patterns by gender and regional dialect. The proposed speech classifier was rigorously trained, tested, and validated using dialect-sentence speech corpora from AE, Italian (IT), and Spanish (SP) acoustic-phonetic continuous speech databases. The experimental results indicate that the proposed model with MSF features can perform accurate gender and region recognition. The classifier was evaluated in terms of precision (%), recall (%), F1 score, and accuracy (%).
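The mel-scale extraction step the abstract describes can be illustrated with a minimal sketch: converting frequencies between Hz and the mel scale and building a triangular mel filterbank, the front-end stage of MFCC computation. The formulas below are the standard HTK-style mel conversion, and the parameter values (26 filters, 512-point FFT, 16 kHz sampling) are illustrative assumptions, not the paper's reported configuration.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (HTK formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: mel scale back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Build triangular filters spaced evenly on the mel scale.

    Returns an (n_filters, n_fft // 2 + 1) matrix that maps a power
    spectrum to mel-band energies; taking the log of those energies
    followed by a DCT yields the MFCC feature patterns fed to a
    1D CNN classifier.
    """
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Map each filter edge to its nearest FFT bin index.
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):        # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank
```

A sanity check on the conversion: 1000 Hz maps to roughly 1000 mel by construction of the formula, and the two conversions are exact inverses.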

Keywords