IEEE Access (Jan 2023)

Perturbation AUTOVC: Voice Conversion From Perturbation and Autoencoder Loss

  • Hwa-Young Park,
  • Young Han Lee,
  • Chanjun Chun

DOI
https://doi.org/10.1109/ACCESS.2023.3341434
Journal volume & issue
Vol. 11
pp. 140174 – 140185

Abstract

AUTOVC is a voice-conversion method that performs self-reconstruction using an autoencoder structure for zero-shot voice conversion. AUTOVC has the advantage of being simple and easy to train because it uses only the autoencoder loss. However, it performs voice conversion by disentangling speaker information from linguistic information through adjustment of the bottleneck dimension; this requires meticulous fine-tuning of the bottleneck dimension and involves a tradeoff between speech quality and speaker similarity. To address these issues, neural analysis and synthesis (NANSY), a fully self-supervised learning system that uses perturbations to extract speech features, was proposed. NANSY avoids the bottleneck-dimension adjustment problem by utilizing perturbation and exhibits high reconstruction performance. In this study, we propose perturbation AUTOVC, a voice-conversion method that combines the structure of AUTOVC with the perturbation of NANSY. The proposed method applies perturbations to speech signals, as in NANSY, to avoid the bottleneck-dimension problem of existing voice-conversion methods. Perturbation removes the speaker-dependent information in the speech, leaving only the linguistic information, which is passed through a content encoder and modeled as a content embedding containing only linguistic information. To obtain speaker information, we use x-vectors from a pretrained speaker recognition model, which are widely used. The linguistic and speaker information extracted from the encoders, concatenated with additional energy information, is used as input to the decoder to perform self-reconstruction. Like AUTOVC, the proposed method is simple and easy to train using only the autoencoder loss. For evaluation, we measured three objective metrics, character error rate (%), cosine similarity, and short-time objective intelligibility, and one subjective metric, the mean opinion score. The experimental results demonstrate that the proposed method outperforms other voice-conversion techniques and is robust in zero-shot conversion.
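As a rough illustration of the pipeline the abstract describes (perturbed speech into a content encoder, concatenation with an x-vector speaker embedding and frame energy, and a decoder trained by self-reconstruction with only an autoencoder loss), the following is a minimal PyTorch sketch. The module architectures, dimensions, the GRU layers, and the L1 reconstruction loss are illustrative assumptions; the abstract does not specify the authors' actual implementation.

```python
# Minimal sketch of the training step described in the abstract, assuming
# PyTorch. ContentEncoder/Decoder structure, tensor shapes, and the choice
# of L1 loss are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbationAutoVC(nn.Module):
    def __init__(self, n_mels=80, spk_dim=512, content_dim=64):
        super().__init__()
        # The content encoder sees only the perturbed speech, so its output
        # is expected to carry linguistic information with speaker traits
        # removed by the perturbation (as in NANSY).
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        # The decoder consumes content embedding + x-vector + energy.
        self.decoder = nn.GRU(content_dim + spk_dim + 1, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, perturbed_mel, x_vector, energy):
        # perturbed_mel: (B, T, n_mels), x_vector: (B, spk_dim), energy: (B, T)
        content, _ = self.content_encoder(perturbed_mel)      # (B, T, content_dim)
        spk = x_vector.unsqueeze(1).expand(-1, content.size(1), -1)
        dec_in = torch.cat([content, spk, energy.unsqueeze(-1)], dim=-1)
        hidden, _ = self.decoder(dec_in)
        return self.out(hidden)                               # reconstructed mel

def train_step(model, optimizer, mel, perturbed_mel, x_vector, energy):
    # Self-reconstruction: only the autoencoder (reconstruction) loss is
    # used, mirroring AUTOVC's training objective.
    recon = model(perturbed_mel, x_vector, energy)
    loss = F.l1_loss(recon, mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference, zero-shot conversion would keep the content embedding from the source utterance but replace the source speaker's x-vector with the target speaker's, consistent with the zero-shot setting described in the abstract.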

Keywords