Assessment of Self-Supervised Denoising Methods for Esophageal Speech Enhancement

Madiha Amarjouf; El Hassan Ibn Elhaj; Mouhcine Chami; Kadria Ezzine; Joseph Di Martino

doi:10.3390/app14156682

Applied Sciences (Jul 2024)

Affiliations

Madiha Amarjouf: Research Laboratory in Telecommunications Systems: Networks and Services (STRS), Research Team: Multimedia, Signal and Communications Systems (MUSICS), National Institute of Posts and Telecommunications (INPT), Av. Allal Al Fassi, Rabat 10112, Morocco
El Hassan Ibn Elhaj: Research Laboratory in Telecommunications Systems: Networks and Services (STRS), Research Team: Multimedia, Signal and Communications Systems (MUSICS), National Institute of Posts and Telecommunications (INPT), Av. Allal Al Fassi, Rabat 10112, Morocco
Mouhcine Chami: Research Laboratory in Telecommunications Systems: Networks and Services (STRS), Research Team: Secure and Mixed Architecture for Reliable Technologies and Systems (SMARTS), National Institute of Posts and Telecommunications (INPT), Av. Allal Al Fassi, Rabat 10112, Morocco
Kadria Ezzine: LORIA-Laboratoire Lorrain de Recherche en Informatique et ses Applications, B.P. 239, 54506 Vandœuvre-lès-Nancy, France
Joseph Di Martino: LORIA-Laboratoire Lorrain de Recherche en Informatique et ses Applications, B.P. 239, 54506 Vandœuvre-lès-Nancy, France

DOI: https://doi.org/10.3390/app14156682
Journal volume & issue: Vol. 14, no. 15, p. 6682

Abstract

Esophageal speech (ES) is a pathological voice that is often difficult to understand. Moreover, recordings of a patient’s voice from before a laryngectomy are rarely available, which complicates the enhancement of this kind of voice. For this reason, most supervised methods for enhancing ES rely on voice conversion toward healthy target speakers, which may not preserve the speaker’s identity. Unsupervised methods for ES, by contrast, are mostly based on traditional filters, which alone cannot remove this kind of noise and are known to produce musical artifacts. To address these issues, a self-supervised method based on the Only-Noisy-Training (ONT) model was applied, which denoises a signal without requiring a clean target. Four experiments were conducted for assessment using Deep Complex UNET (DCUNET) and Deep Complex UNET with Complex Two-Stage Transformer Module (DCUNET-cTSTM), both of which are based on the ONT approach. In addition, for comparison purposes and to calculate the evaluation metrics, the pre-trained VoiceFixer model was used to restore the clean wave files of esophageal speech. Although ONT-based methods work better with noisy wave files, the results show that ES can be denoised without clean targets, and hence the speaker’s identity is retained.
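
As a rough illustration of how a denoiser can be trained without a clean target, the sketch below splits a single noisy waveform into two shorter noisy waveforms that can serve as an input/target pair. It is an assumption about the general idea behind only-noisy training, not the authors' implementation; the function name `subsample_pair` and the window size `k` are hypothetical.

```python
import numpy as np

def subsample_pair(noisy, k=2, rng=None):
    """Split one noisy waveform into two shorter noisy waveforms.

    Hypothetical helper: each non-overlapping window of k samples
    contributes one sample to the 'input' signal and a different
    sample to the 'target' signal, so no clean recording is needed.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(noisy)
    n_windows = len(noisy) // k
    # Drop the tail that does not fill a whole window.
    windows = noisy[: n_windows * k].reshape(n_windows, k)
    # Random position inside each window for the input signal.
    idx_in = rng.integers(0, k, size=n_windows)
    # A non-zero offset guarantees a different position for the target.
    offset = rng.integers(1, k, size=n_windows)
    idx_tgt = (idx_in + offset) % k
    rows = np.arange(n_windows)
    return windows[rows, idx_in], windows[rows, idx_tgt]

# Usage: a synthetic noisy signal stands in for an ES recording.
signal = np.sin(np.linspace(0, 8 * np.pi, 16000))
noisy = signal + 0.1 * np.random.default_rng(0).standard_normal(16000)
net_input, net_target = subsample_pair(noisy, rng=np.random.default_rng(1))
```

A network such as DCUNET would then be trained to map `net_input` to `net_target`; the loss function and any regularization used in the paper are not reproduced here.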

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
