IEEE Access (Jan 2021)

PeriodNet: A Non-Autoregressive Raw Waveform Generative Model With a Structure Separating Periodic and Aperiodic Components

  • Yukiya Hono,
  • Shinji Takaki,
  • Kei Hashimoto,
  • Keiichiro Oura,
  • Yoshihiko Nankaku,
  • Keiichi Tokuda

DOI
https://doi.org/10.1109/ACCESS.2021.3118033
Journal volume & issue
Vol. 9
pp. 137599 – 137612

Abstract


This paper presents PeriodNet, a non-autoregressive (non-AR) waveform generative model with a new structure for modeling the periodic and aperiodic components of speech waveforms. Non-AR raw waveform generative models have enabled fast generation of high-quality waveforms. However, the variation in waveforms that these models can reconstruct is limited by the training data. In addition, typical non-AR models reconstruct a speech waveform from a single Gaussian input, even though speech is a mixture of periodic and aperiodic signals. These limitations can significantly affect the waveform generation process in applications such as singing voice synthesis, which must reproduce accurate pitch as well as natural sounds with less periodicity, including husky and breathy sounds. To tackle these problems, PeriodNet models a speech waveform with a parallel or series structure: two sub-generators, connected in parallel or in series, take an explicit periodic signal (a sine wave) and an aperiodic signal (Gaussian noise) as inputs. Because PeriodNet models the periodic and aperiodic components according to whether these input signals are autocorrelated, it requires no external periodic/aperiodic decomposition during training. Experimental results show that the proposed structure improves the naturalness of generated waveforms, and that speech waveforms with a pitch outside the training data range can be generated more naturally.
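The key idea in the abstract is that the two sub-generators receive inputs that differ in their autocorrelation: a sine excitation (strongly autocorrelated, carrying the pitch) and Gaussian noise (uncorrelated). Below is a minimal NumPy sketch of the *parallel* arrangement under stated assumptions: the sub-generators are placeholder pass-through functions rather than the neural networks used in the paper, the F0 contour is assumed to be given per sample, and the function name `periodnet_parallel` is hypothetical, chosen only for illustration.

```python
import numpy as np

def sine_excitation(f0, sr=24000):
    """Periodic input: a sine wave whose instantaneous frequency follows
    the per-sample F0 contour (phase is the cumulative sum of 2*pi*f0/sr)."""
    phase = 2 * np.pi * np.cumsum(f0) / sr
    return np.sin(phase)

def periodnet_parallel(f0, sr=24000, seed=0):
    """Toy sketch of the parallel PeriodNet structure: one sub-generator
    receives the sine excitation, the other Gaussian noise, and their
    outputs are summed. The sub-generators here are simple placeholders;
    in the paper they are learned neural waveform generators."""
    rng = np.random.default_rng(seed)
    periodic_in = sine_excitation(f0, sr)        # autocorrelated input
    aperiodic_in = rng.standard_normal(len(f0))  # uncorrelated input
    periodic_out = periodic_in                   # placeholder sub-generator
    aperiodic_out = 0.1 * aperiodic_in           # placeholder sub-generator
    return periodic_out + aperiodic_out          # parallel sum
```

Because the periodic branch alone carries the pitch, the summed output remains strongly autocorrelated at the pitch period, which illustrates why no external periodic/aperiodic decomposition of the training data is needed: the decomposition is implicit in the choice of inputs.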

Keywords