Journal of Information and Telecommunication (Jul 2023)

Speech emotion recognition using overlapping sliding window and Shapley additive explainable deep neural network

  • Nhat Truong Pham,
  • Sy Dzung Nguyen,
  • Vu Song Thuy Nguyen,
  • Bich Ngoc Hong Pham,
  • Duc Ngoc Minh Dang

DOI
https://doi.org/10.1080/24751839.2023.2187278
Journal volume & issue
Vol. 7, no. 3
pp. 317 – 335

Abstract


Speech emotion recognition (SER) has several applications, such as e-learning, human-computer interaction, customer service, and healthcare systems. Although researchers have investigated many techniques to improve the accuracy of SER, challenges remain in feature extraction, classifier design, and computational cost. To address these problems, we propose a new set of 1D features extracted with an overlapping sliding window (OSW) technique for SER. In addition, a deep neural network-based classifier called the deep Pattern Recognition Network (PRN) is designed to categorize emotional states from the new set of 1D features. We evaluate the proposed method on the Emo-DB and AESSD datasets, which contain several different emotional states. The experimental results show that the proposed method achieves accuracies of 98.5% and 87.1% on the Emo-DB and AESSD datasets, respectively, which is comparable to or better than state-of-the-art approaches that use 1D features on the same datasets. Furthermore, SHAP (SHapley Additive exPlanations) analysis is employed to interpret the prediction model and to assist system developers in selecting the optimal features to integrate into the desired system.
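
To illustrate the overlapping sliding window idea mentioned in the abstract, the sketch below segments a 1D speech signal into overlapping frames and summarizes each frame with simple statistics. This is a minimal illustration, not the authors' implementation: the window length, hop size, and the per-window statistics are assumptions chosen for demonstration only.

    # Minimal sketch of overlapping sliding window (OSW) segmentation for
    # 1D speech features. Window/hop sizes and the per-window statistics
    # are illustrative assumptions, not the parameters used in the paper.
    import numpy as np

    def osw_segments(signal: np.ndarray, win_len: int, hop_len: int) -> np.ndarray:
        """Split a 1D signal into overlapping windows of length win_len,
        advancing by hop_len samples (hop_len < win_len gives overlap)."""
        n_windows = 1 + max(0, (len(signal) - win_len) // hop_len)
        return np.stack([signal[i * hop_len : i * hop_len + win_len]
                         for i in range(n_windows)])

    def window_features(windows: np.ndarray) -> np.ndarray:
        """Summarize each window with simple 1D statistics (hypothetical
        stand-ins for the paper's feature set)."""
        return np.column_stack([
            windows.mean(axis=1),   # mean amplitude per window
            windows.std(axis=1),    # energy spread per window
            np.abs(np.diff(np.sign(windows), axis=1)).sum(axis=1) / 2,  # zero-crossing count
        ])

    if __name__ == "__main__":
        sr = 16_000                           # assumed sampling rate
        t = np.linspace(0, 1, sr, endpoint=False)
        x = np.sin(2 * np.pi * 220 * t)       # toy "utterance"
        wins = osw_segments(x, win_len=400, hop_len=160)   # 25 ms windows, 10 ms hop
        feats = window_features(wins)
        print(wins.shape, feats.shape)        # (n_windows, 400), (n_windows, 3)

The resulting per-window feature matrix is the kind of 1D input that could then be fed to a classifier such as the PRN described in the paper; the specific features and network architecture used by the authors are detailed in the full text.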

Keywords