Task-specific speech enhancement and data augmentation for improved multimodal emotion recognition under noisy conditions

Shruti Kshirsagar; Anurag Pendyala; Tiago H. Falk

doi:10.3389/fcomp.2023.1039261

Frontiers in Computer Science (Mar 2023)

Affiliations

Shruti Kshirsagar: Institut National de la Recherche Scientifique, University of Quebec, Montréal, QC, Canada
Anurag Pendyala: International Institute of Information Technology, Bangalore, India
Tiago H. Falk: Institut National de la Recherche Scientifique, University of Quebec, Montréal, QC, Canada

Journal volume & issue: Vol. 5

Abstract

Automatic emotion recognition (AER) systems are burgeoning, and systems based on audio, video, text, or physiological signals have emerged. Multimodal systems, in turn, have been shown to improve overall AER accuracy and to provide some robustness against artifacts and missing data. Collecting multiple signal modalities, however, can be intrusive, time consuming, and expensive. Recent advances in deep learning based speech-to-text and natural language processing systems have enabled the development of reliable multimodal systems based on speech and text while requiring only the collection of audio data. Audio data, however, is extremely sensitive to environmental disturbances, such as additive noise, and thus faces challenges when deployed “in the wild.” To overcome this issue, speech enhancement algorithms have been deployed at the input signal level to improve testing accuracy in noisy conditions. Speech enhancement algorithms come in different flavors and can be optimized for different tasks (e.g., for human perception vs. machine performance). Data augmentation, in turn, has been deployed at the model level during training to improve accuracy in noisy testing conditions. In this paper, we explore the combination of task-specific speech enhancement and data augmentation as a strategy to improve overall multimodal emotion recognition in noisy conditions. We show that AER accuracy under noisy conditions can be improved to levels close to those seen in clean conditions. When compared against a system without speech enhancement or data augmentation, an increase in AER accuracy of 40% was seen in a cross-corpus test, thus showing promising results for “in the wild” AER.
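
As a rough illustration of the additive-noise data augmentation mentioned in the abstract, the sketch below mixes a noise recording into a clean speech signal at a chosen signal-to-noise ratio (SNR). It is a generic example and not the authors' pipeline: the function name `add_noise_at_snr`, the synthetic signals, and the 10 dB setting are all assumptions made for illustration.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech + noise, with the noise scaled to the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or trim the noise so it covers the whole speech signal.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


# Synthetic stand-ins for a clean utterance and a noise recording (assumed).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)
noise = rng.normal(size=4000)

noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
achieved_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

In a training setup, a function like this would typically be applied to clean training utterances with a range of noise types and SNR values, so that the model sees conditions closer to those expected at test time.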

Published in Frontiers in Computer Science

ISSN: 2624-9898 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/computer-science#
