IEEE Access (Jan 2022)

Zero-Shot Unseen Speaker Anonymization via Voice Conversion

  • Hyung-Pil Chang,
  • In-Chul Yoo,
  • Changhyeon Jeong,
  • Dongsuk Yook

DOI
https://doi.org/10.1109/ACCESS.2022.3227963
Journal volume & issue
Vol. 10
pp. 130190 – 130199

Abstract

Speech-based interfaces provide a convenient way to control various smart devices. For these interfaces to work reliably, a considerable amount of speech data covering diverse noise conditions and speaker characteristics must be collected to train the associated speech-processing models. Gathering spoken commands from actual device users can improve a device’s performance by adapting it to the individual acoustic characteristics of its particular user’s speech. However, directly acquiring spoken commands could threaten users’ privacy, as the speech data contain sensitive speaker-specific information. Speaker anonymization algorithms can be applied to suppress such sensitive information while preserving the linguistic content of a user’s speech. Previous speaker anonymization algorithms could handle only the voices of speakers who contributed to the training datasets. As speaker anonymization algorithms are typically applied to new speakers who are absent from the training datasets (commonly referred to as “unseen speakers”), a method of handling such speakers should be developed. In this paper, we propose a novel method that can effectively suppress the individual characteristics in an unseen speaker’s voice while retaining the linguistic content of the speech. It adopts zero-shot voice conversion methods for unseen speaker anonymization. Since the proposed method utilizes the speaker identity vectors commonly used in many-to-many voice conversion algorithms and does not modify the conversion algorithm itself, it can be easily combined with many other voice conversion algorithms. The proposed method is evaluated using the VCC2018 and VCTK corpora, with speaker identification rate and speech recognition rate used for quantitative analysis. The experimental results show that, after speaker anonymization by the proposed method, the average speaker identification accuracy decreased by 92.3 percentage points and the average speech recognition accuracy decreased by 17.7 percentage points.

Keywords