JVNV: A Corpus of Japanese Emotional Speech With Verbal Content and Nonverbal Expressions

Detai Xin; Junfeng Jiang; Shinnosuke Takamichi; Yuki Saito; Akiko Aizawa; Hiroshi Saruwatari

doi:10.1109/ACCESS.2024.3360885

IEEE Access (Jan 2024)

Affiliations

Detai Xin: Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo, Tokyo, Japan
Junfeng Jiang: Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo, Tokyo, Japan
Shinnosuke Takamichi: Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo, Tokyo, Japan
Yuki Saito: Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo, Tokyo, Japan
Akiko Aizawa: National Institute of Informatics, Chiyoda, Tokyo, Japan
Hiroshi Saruwatari: Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo, Tokyo, Japan

DOI: https://doi.org/10.1109/ACCESS.2024.3360885
Journal volume & issue: Vol. 12, pp. 19752–19764

Abstract

We present JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs), which are essential for expressing emotion in spoken language. We propose an automatic script generation method that produces emotional scripts by providing ChatGPT, through prompt engineering, with seed words that carry sentiment polarity and with NV phrases. From the generated candidates, we select 514 scripts with balanced phoneme coverage, guided by emotion confidence scores and language fluency scores. Experimental results show that JVNV has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis, using discrete codes to represent NVs. The results show that a performance gap remains between synthesizing read-aloud speech and synthesizing emotional speech, and that adding NVs makes the task even harder. This poses new challenges and makes JVNV a valuable resource for future work. To the best of our knowledge, JVNV is the first speech corpus whose scripts are generated automatically using large language models.
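
The script selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a simple greedy strategy that first filters candidates by score thresholds and then repeatedly picks the script adding the most not-yet-covered phonemes. It maximizes the number of distinct phonemes covered rather than balancing their frequencies. The `Candidate` class, the `select_scripts` function, the threshold values, and the romanized phoneme labels are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A generated script with hypothetical, precomputed scores."""
    text: str
    phonemes: tuple[str, ...]  # phoneme sequence of the script (assumed given)
    emotion_conf: float        # emotion confidence score in [0, 1] (assumed)
    fluency: float             # language fluency score in [0, 1] (assumed)


def select_scripts(candidates, n_select, min_emotion=0.5, min_fluency=0.5):
    """Greedily pick up to n_select scripts that widen phoneme coverage.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    # Step 1: drop candidates whose scores fall below the thresholds.
    pool = [
        c for c in candidates
        if c.emotion_conf >= min_emotion and c.fluency >= min_fluency
    ]
    covered: set[str] = set()
    selected: list[Candidate] = []
    # Step 2: repeatedly take the candidate with the largest coverage gain,
    # breaking ties by the higher emotion confidence score.
    while pool and len(selected) < n_select:
        best = max(
            pool,
            key=lambda c: (len(set(c.phonemes) - covered), c.emotion_conf),
        )
        selected.append(best)
        covered |= set(best.phonemes)
        pool.remove(best)
    return selected, covered


candidates = [
    Candidate("A", ("a", "k", "i"), 0.90, 0.9),
    Candidate("B", ("a", "k"), 0.95, 0.9),
    Candidate("C", ("s", "u"), 0.80, 0.8),
    Candidate("D", ("z", "e", "o", "n"), 0.20, 0.9),  # fails the emotion filter
]
selected, covered = select_scripts(candidates, n_select=2)
print([c.text for c in selected], sorted(covered))
```

In this toy run, script D is excluded despite its broad phoneme set because its emotion confidence is below the assumed threshold, which shows how the two kinds of scores interact with coverage.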

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
