IEEE Access (Jan 2024)
Joint Speech-Text Embeddings for Multitask Speech Processing
Abstract
Devices that use speech as the communication medium between human and computer have been emerging for the past few years. The technologies behind this interface are Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), two distinct fields of speech signal processing that have independently made great strides in recent years. This paper proposes an architecture that takes advantage of the two modalities shared by ASR and TTS, speech and text, while training three tasks simultaneously: speaker recognition in addition to the underlying ASR and TTS tasks. This architecture not only reduces the memory footprint required to run all tasks, but also achieves performance comparable to single-task models. The model is trained and evaluated on the CSTR VCTK Corpus. Results show 97.64% accuracy on the speaker recognition task; word and character error rates of 18.18% and 7.95% on the ASR task; and a mel cepstral distortion of 4.31 with predicted MOS of 2.98 and 3.28 on the TTS task. Although voice conversion is not one of the training tasks, the architecture is capable of performing it, achieving a mel cepstral distortion of 5.22 and predicted MOS of 2.98 and 2.73.
Keywords