IEEE Access (Jan 2024)

Triangular Region Cut-Mix Augmentation Algorithm-Based Speech Emotion Recognition System With Transfer Learning Approach

  • V. Preethi,
  • V. Elizabeth Jesi

DOI: https://doi.org/10.1109/ACCESS.2024.3428336
Journal volume & issue: Vol. 12, pp. 98436–98449

Abstract


Recently, spectrogram energy patterns that capture emotional information have demonstrated strong performance in the voice-image-based emotion detection task. The proposed augmentation technique, Triangular Region Cut-Mix, is a novel approach to emotion recognition: it exploits voice-image (spectrogram) information and improves classification accuracy by cutting and mixing triangular regions rather than rectangular boxes, thereby minimizing information loss. By incorporating a triangular area when cutting or mixing the input images, the approach preserves their information while generating additional training examples, which mitigates the shortage of data that limits the accuracy of speech emotion recognition. To further expand the training data and improve the precision of voice emotion recognition, a vanilla gradient technique is employed; the attribution it produces indicates how significant each pixel is to the human visual system, and transfer learning yields additional performance gains. Previous studies have not produced a model that performs well on voice-image emotion identification while using triangular region augmentation without sacrificing information. Building a proficient model for automatic emotion recognition is also challenging in the absence of annotated data. We use raw, labeled audio from Kaggle's RAVDESS dataset, first converting each recording into a spectrogram that serves as an image representation of the audio, then applying image classification algorithms to predict the emotion, and employing triangular region augmentation to expand the labeled training data. We assess and evaluate two methodologies: 1) transfer learning without augmentation and 2) transfer learning with triangular region augmentation, both built on a VGG16 model pre-trained for image classification. Our model achieves an accuracy of 84.2% in detecting emotions from speech spectrogram images. Experimental results demonstrate that the proposed system achieves a 5.6% increase in accuracy over the baseline model without augmentation.
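As a concrete illustration of the triangular cut-and-mix idea described above, the sketch below pastes a random triangular region of one spectrogram image into another and mixes the labels in proportion to the pasted area. This is a minimal sketch under assumed conventions (random triangle vertices, CutMix-style label interpolation, NumPy arrays with one-hot labels), not the authors' published implementation.

import numpy as np

def triangular_cutmix(spec_a, spec_b, label_a, label_b, rng=None):
    """Paste a random triangular region of spec_b into spec_a and
    mix the (one-hot) labels by the pasted area -- illustrative sketch."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = spec_a.shape[:2]

    # Sample three random vertices for the triangular patch.
    ys = rng.integers(0, h, size=3)
    xs = rng.integers(0, w, size=3)

    # Rasterise the triangle into a boolean mask: a pixel is inside
    # if it lies on the same side of all three edges (cross-product sign test).
    yy, xx = np.mgrid[0:h, 0:w]
    def edge(i, j):
        return (xs[j] - xs[i]) * (yy - ys[i]) - (ys[j] - ys[i]) * (xx - xs[i])
    d0, d1, d2 = edge(0, 1), edge(1, 2), edge(2, 0)
    mask = ((d0 >= 0) & (d1 >= 0) & (d2 >= 0)) | \
           ((d0 <= 0) & (d1 <= 0) & (d2 <= 0))

    # Copy the triangular region of spec_b into spec_a.
    mixed = spec_a.copy()
    mixed[mask] = spec_b[mask]

    # Mix labels in proportion to the retained area, as in CutMix.
    lam = 1.0 - mask.mean()
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

Sampling a free-form triangle rather than an axis-aligned box is the key difference from standard CutMix; how the paper constrains the triangle's size or position (for example, using an attribution map) is not reproduced here.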
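The audio-to-spectrogram-to-VGG16 pipeline summarized in the abstract could look roughly like the following sketch. The sampling rate, mel-spectrogram settings, 224x224 image size, frozen-backbone classifier head, and eight emotion classes are illustrative assumptions (librosa and TensorFlow/Keras are used here for convenience); the paper's exact configuration may differ.

import numpy as np
import librosa
import tensorflow as tf

def audio_to_spectrogram_image(path, n_mels=128, size=(224, 224)):
    """Convert a RAVDESS wav file into a 3-channel log-mel spectrogram
    image sized for VGG16 (settings are illustrative)."""
    y, sr = librosa.load(path, sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Scale to [0, 255], resize, and replicate to three channels.
    img = (log_mel - log_mel.min()) / (np.ptp(log_mel) + 1e-8) * 255.0
    img = tf.image.resize(img[..., None], size).numpy()
    return np.repeat(img, 3, axis=-1)

def build_transfer_model(num_emotions=8):
    """VGG16 pre-trained on ImageNet with a new emotion classification head."""
    base = tf.keras.applications.VGG16(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3))
    base.trainable = False  # freeze the convolutional feature extractor
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_emotions, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model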
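The vanilla gradient technique mentioned in the abstract is commonly implemented as the gradient of the predicted class score with respect to the input image; a minimal sketch is given below. How the resulting attribution is used inside the augmentation pipeline is not specified here, so only the attribution step itself is shown, and the function name is hypothetical.

import tensorflow as tf

def vanilla_gradient_saliency(model, image):
    """Vanilla gradient pixel attribution: gradient of the top class
    score with respect to the input spectrogram image (sketch)."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x)                      # shape (1, num_emotions)
        class_idx = int(tf.argmax(preds[0]))  # predicted emotion
        score = preds[0, class_idx]
    grads = tape.gradient(score, x)           # same shape as x
    # Max absolute gradient over channels gives a per-pixel saliency map.
    return tf.reduce_max(tf.abs(grads), axis=-1)[0].numpy()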

Keywords