Exploring Prosodic Features Modelling for Secondary Emotions Needed for Empathetic Speech Synthesis

Jesin James; Balamurali B.T.; Catherine Watson; Hansjörg Mixdorff

doi:10.3390/s23062999

Sensors (Mar 2023)


Affiliations

Jesin James: Department of Electrical, Computer, and Software Engineering, The University of Auckland, Auckland 1010, New Zealand
Balamurali B.T.: Science, Maths and Technology, Singapore University of Technology and Design, Singapore 487372, Singapore
Catherine Watson: Department of Electrical, Computer, and Software Engineering, The University of Auckland, Auckland 1010, New Zealand
Hansjörg Mixdorff: Computer Science and Media, Berliner Hochschule für Technik, 13353 Berlin, Germany

DOI: https://doi.org/10.3390/s23062999
Journal volume & issue: Vol. 23, no. 6, p. 2999

Abstract


This paper presents a low-resource emotional speech synthesis system for empathetic speech, based on modelling prosodic features. Secondary emotions, identified as necessary for empathetic speech, are modelled and synthesised in this investigation. Because secondary emotions are subtle, they are harder to model than primary emotions, and they have not been studied extensively; this study is one of the few to model them in speech. Current speech synthesis research uses large databases and deep learning techniques to develop emotion models. There are many secondary emotions, so developing a large database for each of them is expensive. This research therefore presents a proof of concept that extracts handcrafted features and models them with a low-resource machine learning approach, creating synthetic speech with secondary emotions. A quantitative-model-based transformation shapes the fundamental frequency contour of the emotional speech, while speech rate and mean intensity are modelled via rule-based approaches. Using these models, an emotional text-to-speech synthesis system is developed to synthesise five secondary emotions: anxious, apologetic, confident, enthusiastic and worried. A perception test evaluating the synthesised emotional speech is also reported: in a forced-response test, participants identified the correct emotion with a hit rate greater than 65%.
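To make the reported hit rate concrete, the sketch below shows one plausible way to compute a per-emotion hit rate from forced-response trials: the share of trials in which the listener's chosen label matches the intended emotion. The function name `hit_rates` and the trial data are hypothetical and invented for illustration; they are not taken from the study, whose exact scoring procedure is not described in the abstract.

```python
from collections import Counter


def hit_rates(trials):
    """Return the per-emotion hit rate for (intended, chosen) response pairs.

    The hit rate for an emotion is the fraction of its trials in which the
    listener chose the intended label.
    """
    totals = Counter()
    hits = Counter()
    for intended, chosen in trials:
        totals[intended] += 1
        if intended == chosen:
            hits[intended] += 1
    return {emotion: hits[emotion] / totals[emotion] for emotion in totals}


# Hypothetical responses, invented purely for illustration (not study data).
trials = [
    ("anxious", "anxious"),
    ("anxious", "worried"),
    ("anxious", "anxious"),
    ("anxious", "anxious"),
    ("confident", "confident"),
    ("confident", "enthusiastic"),
]

rates = hit_rates(trials)
for emotion, rate in rates.items():
    print(f"{emotion}: {rate:.0%}")
```

With five response options, as in a forced choice among the five emotions listed above, a listener guessing uniformly at random would score about 20%, which is the natural baseline for reading a hit rate of this kind.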

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
