Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

Haruki Yamashita; Takuma Okamoto; Ryoichi Takashima; Yamato Ohtani; Tetsuya Takiguchi; Tomoki Toda; Hisashi Kawai

doi:10.1109/ACCESS.2024.3366707

IEEE Access (Jan 2024)

Affiliations

Haruki Yamashita: Graduate School of System Informatics, Kobe University, Kobe, Japan
Takuma Okamoto: National Institute of Information and Communications Technology, Kyoto, Japan
Ryoichi Takashima: Graduate School of System Informatics, Kobe University, Kobe, Japan
Yamato Ohtani: National Institute of Information and Communications Technology, Kyoto, Japan
Tetsuya Takiguchi: Graduate School of System Informatics, Kobe University, Kobe, Japan
Tomoki Toda: National Institute of Information and Communications Technology, Kyoto, Japan
Hisashi Kawai: National Institute of Information and Communications Technology, Kyoto, Japan

DOI: https://doi.org/10.1109/ACCESS.2024.3366707
Journal volume & issue: Vol. 12
pp. 31409 – 31421

Abstract

Although end-to-end (E2E) text-to-speech (TTS) models with a HiFi-GAN-based neural vocoder (e.g., VITS and JETS) can achieve human-like speech quality with fast inference, their inference speed on a CPU can still be improved for practical implementations because the HiFi-GAN-based neural vocoder unit is a bottleneck. Additionally, HiFi-GAN is widely used not only for TTS but also for many other speech and audio applications. To accelerate HiFi-GAN while maintaining synthesis quality, Multi-stream (MS)-HiFi-GAN, iSTFTNet, and MS-iSTFT-HiFi-GAN have been proposed. Although inverse short-time Fourier transform (iSTFT)-based fast upsampling is introduced in iSTFTNet and MS-iSTFT-HiFi-GAN, we find, for the first time, that the predicted intermediate features input to the iSTFT layer are completely different from the original STFT spectra, owing to the redundancy of the overlap-add operation in iSTFT. To further improve synthesis quality and inference speed, we propose FC-HiFi-GAN and MS-FC-HiFi-GAN, which replace the iSTFT layer with trainable fully-connected (FC) layer-based fast upsampling that requires no overlap-add operation. Experimental results under unseen-speaker synthesis and E2E TTS conditions show that, compared with iSTFTNet and MS-iSTFT-HiFi-GAN, the proposed methods slightly accelerate inference and significantly improve synthesis quality in JETS-based E2E TTS. Therefore, the iSTFT layer in HiFi-GAN-based neural vocoders can be replaced by the proposed trainable FC layer-based upsampling without overlap-add.
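
The idea of FC layer-based upsampling without overlap-add can be illustrated with a minimal NumPy sketch. The abstract gives no architectural details, so everything here is an assumption made for illustration: the function name `fc_upsample`, the tensor shapes, and the choice of one shared linear layer that maps each frame's feature vector to `hop` consecutive waveform samples. It is not the authors' implementation.

```python
import numpy as np

def fc_upsample(features, weight, bias):
    """Hypothetical FC upsampling: one linear layer shared across frames.

    features: (frames, channels) intermediate features, one row per frame
    weight:   (channels, hop) trainable matrix (random here, for illustration)
    bias:     (hop,) trainable bias
    Returns a 1-D signal of length frames * hop.
    """
    # Each frame is mapped to its own block of `hop` samples.
    segments = features @ weight + bias      # shape: (frames, hop)
    # Blocks are simply concatenated; frames never overlap,
    # so no overlap-add step is needed.
    return segments.reshape(-1)

rng = np.random.default_rng(0)
frames, channels, hop = 5, 8, 4              # illustrative sizes, not from the paper
features = rng.standard_normal((frames, channels))
weight = rng.standard_normal((channels, hop))
bias = np.zeros(hop)

signal = fc_upsample(features, weight, bias)
print(signal.shape)  # (20,)
```

In this sketch every output sample depends on exactly one frame, whereas an iSTFT layer sums overlapping windowed frames, so several different sets of intermediate features can produce the same waveform. That is one way to read the redundancy of overlap-add mentioned in the abstract.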

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
