Emotion Recognition from Speech Using the Bag-of-Visual Words on Audio Segment Spectrograms

Evaggelos Spyrou; Rozalia Nikopoulou; Ioannis Vernikos; Phivos Mylonas

doi:10.3390/technologies7010020

Technologies (Feb 2019)

Affiliations

Evaggelos Spyrou: Institute of Informatics and Telecommunications, National Centre for Scientific Research “Demokritos”, 15341 Athens, Greece
Rozalia Nikopoulou: Department of Informatics, Ionian University, 49132 Corfu, Greece
Ioannis Vernikos: Department of Computer Science, University of Thessaly, 38221 Lamia, Greece
Phivos Mylonas: Department of Informatics, Ionian University, 49132 Corfu, Greece

DOI: https://doi.org/10.3390/technologies7010020
Journal volume & issue: Vol. 7, no. 1, p. 20

Abstract

Monitoring and understanding a human’s emotional state plays a key role in current and forthcoming computational technologies. At the same time, this monitoring and analysis should be as unobtrusive as possible, since digital technology is now smoothly integrated into everyday activities. In this context, and within the domain of assessing humans’ affective state during their educational training, the most popular approach is to use sensory equipment that allows them to be observed without any kind of direct contact. Thus, in this work, we focus on human emotion recognition from audio stimuli (i.e., human speech) using a novel approach based on a computer vision inspired methodology, namely the bag-of-visual words method, applied to spectrograms of audio segments. Each spectrogram is treated as the visual representation of its audio segment and may be analyzed with well-known traditional computer vision techniques, such as construction of a visual vocabulary, extraction of speeded-up robust features (SURF), quantization into a set of visual words, and image histogram construction. As a last step, support vector machine (SVM) classifiers are trained on the resulting representations. Finally, to further generalize the proposed approach, we utilize publicly available datasets in several human languages to perform cross-language experiments on both actor-created and real-life datasets.
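
The vocabulary, quantization, and histogram steps named in the abstract can be illustrated with a minimal sketch in plain Python. It rests on several assumptions that are not taken from the paper: local descriptors are presumed to be already extracted from each spectrogram (SURF in the paper, random toy vectors here), the vocabulary is built with a small k-means (the abstract does not name a clustering method), and all function names are hypothetical. Spectrogram computation and SVM training are omitted.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centers):
    """Index of the center (visual word) closest to vec."""
    return min(range(len(centers)), key=lambda i: sq_dist(vec, centers[i]))


def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Toy k-means: returns k cluster centers used as visual words.

    Assumption: k-means is a common choice for bag-of-visual-words
    vocabularies; the abstract does not state which method is used.
    """
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bovw_histogram(descriptors, vocabulary):
    """Quantize descriptors to visual words; return a normalized histogram."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def toy_descriptors(rng, n, dim=8, shift=0.0):
    """Stand-in for real SURF descriptors (hypothetical toy data)."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(42)
# Four toy "segments", each with 30 local descriptors.
segments = [toy_descriptors(rng, 30, shift=s) for s in (0.0, 0.0, 3.0, 3.0)]
all_desc = [d for seg in segments for d in seg]

vocab = build_vocabulary(all_desc, k=4)
histograms = [bovw_histogram(seg, vocab) for seg in segments]
```

Each entry of `histograms` is a fixed-length vector, one per audio segment, which is the kind of input a classifier such as an SVM would then be trained on.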

Published in Technologies

ISSN: 2227-7080 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology
Website: http://www.mdpi.com/journal/technologies
