IEEE Access (Jan 2024)

Significance of Variational Mode Decomposition for Epoch Based Prosody Modification of Speech With Clipping Distortions

  • M. Rama Rajeswari,
  • D. Govind,
  • Suryakanth V. Gangashetty,
  • Akhilesh Kumar Dubey

DOI
https://doi.org/10.1109/ACCESS.2024.3425394
Journal volume & issue
Vol. 12
pp. 98928 – 98944

Abstract

Clipping is a non-linear distortion commonly introduced by microphone saturation during speech recording. The present work focuses on the effect of clipping on the task of prosody modification. Since the $F_{0}$ contour and duration are the important prosodic parameters, the present work studies the effect of clipping on the manipulation of the $F_{0}$ and duration of a given speech signal. Epoch based prosody modification is a popular method for generating waveforms with good perceptual quality by scaling the $F_{0}$ contour and duration of the given speech by fixed scaling factors. Therefore, the present work studies the effect of waveform clipping on the perceptual quality of prosody modified speech. Perceptual quality in epoch based prosody modification can be compromised in two ways: deviations in the estimation of epochs (which are used as the analysis pitch marks) and the method used for generating the waveform. This paper examines the effect of clipping on these two stages of epoch based prosody modification and, in turn, on the perceptual quality of the generated speech. Zero frequency filtering (ZFF), a simple and popular method, is chosen as the epoch estimation algorithm for the epoch based prosody modification presented in the paper. A comparative analysis of epoch estimation performance, carried out by introducing various amplitude clipping levels, confirms that epoch identification rates remain unchanged irrespective of the level of clipping distortion present. However, due to saturation in the waveform samples, the waveform generation stage of the prosody modification was observed to be affected to a degree proportional to the clipping distortion present in the signal. A variational mode decomposition (VMD) based signal approximation of the prosody modified speech is proposed to reduce the non-linear effect due to clipping.
At the gross level, the re-estimated speech signal obtained from the VMD modes is observed to improve the perceptual quality of the pitch and duration modified speech. The improved perceptual quality of the VMD based re-estimation of prosody modified speech was confirmed by subjective and NIST-STNR based objective assessments. Further, VMD based refinement is proposed as an alternative to local mean subtraction for trend removal in conventional ZFF of speech for accurate epoch estimation. A comparative performance analysis carried out on the CMU Arctic database confirmed an improvement in identification accuracy for the epochs estimated using VMD based trend removal in the ZFF algorithm.
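
As an illustration of what "introducing various amplitude clipping levels" can mean in practice, the following minimal Python sketch hard-clips a waveform at a fraction of its peak amplitude and reports how many samples end up saturated. It is not taken from the paper: the abstract does not say how clipping levels are defined, so the function names `clip_waveform` and `saturated_fraction`, the relative-to-peak definition of `level`, and the 100 Hz test tone are all assumptions made for this example.

```python
import numpy as np


def clip_waveform(signal, level):
    """Hard-clip `signal` at `level` times its peak absolute amplitude.

    `level` is an assumed convention (fraction of the peak, in (0, 1]);
    the paper may define its clipping levels differently.
    """
    if not 0.0 < level <= 1.0:
        raise ValueError("level must be in (0, 1]")
    threshold = level * np.max(np.abs(signal))
    return np.clip(signal, -threshold, threshold)


def saturated_fraction(clipped):
    """Fraction of samples sitting at the peak absolute value of `clipped`."""
    peak = np.max(np.abs(clipped))
    return float(np.mean(np.isclose(np.abs(clipped), peak)))


# Hypothetical test signal: one second of a 100 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 100 * t)

# Lower levels mean more severe clipping and more saturated samples.
fractions = {}
for level in (1.0, 0.8, 0.5, 0.2):
    clipped = clip_waveform(tone, level)
    fractions[level] = saturated_fraction(clipped)
    print(f"level={level:.1f}  saturated fraction={fractions[level]:.3f}")
```

The saturated fraction grows as the level drops, which is the sense in which the waveform generation stage sees progressively more flattened samples. The instants of the pitch periods are left in place by this kind of clipping, since only sample amplitudes above the threshold are altered.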

Keywords