Reimagining speech: a scoping review of deep learning-based methods for non-parallel voice conversion

Anders R. Bargum; Stefania Serafin; Cumhur Erkut

doi:10.3389/frsip.2024.1339159

Frontiers in Signal Processing (Aug 2024)

Affiliations

Anders R. Bargum: Multi-Sensory Experience Laboratory, Department of Architecture, Design and Media Technology, Aalborg University, Copenhagen, Denmark
Anders R. Bargum: Heka, Khora VR, Copenhagen, Denmark
Stefania Serafin: Multi-Sensory Experience Laboratory, Department of Architecture, Design and Media Technology, Aalborg University, Copenhagen, Denmark
Cumhur Erkut: Multi-Sensory Experience Laboratory, Department of Architecture, Design and Media Technology, Aalborg University, Copenhagen, Denmark

DOI: https://doi.org/10.3389/frsip.2024.1339159
Journal volume & issue: Vol. 4

Abstract

Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios is gaining popularity. Although many works in the field share a common global pipeline, the underlying structures, methods, and neural sub-blocks vary considerably across research efforts. It can therefore be difficult to understand why particular methods are chosen when training voice conversion models, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 628 publications from more than 38 venues between 2017 and 2023, followed by an in-depth review of a final database of 130 eligible studies. Based on the review, we summarise the most frequently used approaches to voice conversion based on deep learning and highlight common pitfalls. We condense the knowledge gathered to identify the main challenges, supply solutions grounded in the analysis, and provide recommendations for future research directions.

Published in Frontiers in Signal Processing

ISSN: 2673-8198 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://www.frontiersin.org/journals/signal-processing
