IEEE Access (Jan 2021)

Creating Song From Lip and Tongue Videos With a Convolutional Vocoder

  • Jianyu Zhang,
  • Pierre Roussel,
  • Bruce Denby

DOI
https://doi.org/10.1109/ACCESS.2021.3050843
Journal volume & issue
Vol. 9
pp. 13076–13082

Abstract

A convolutional neural network and deep autoencoder are used to predict Line Spectral Frequencies, F0, and a voiced/unvoiced flag in singing data, using as input only ultrasound images of the tongue and visual images of the lips. A novel convolutional vocoder that transforms the learned parameters into an audio signal is also presented. Spectral distortion of the predicted Line Spectral Frequencies is reduced compared to an earlier study using handcrafted features and multilayer perceptrons on the same data set, while the F0 and voiced/unvoiced flag predictions are found to be highly correlated with their ground truth values. The convolutional vocoder is compared with standard vocoders. The results are of interest for the study of singing articulation as well as for silent speech interface research. Sample predicted audio files are available online. Source code: https://github.com/TjuJianyu/SSI_DL.
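To make the predicted parameters concrete, the sketch below shows the classic source-filter pipeline that Line Spectral Frequencies, F0, and a voiced/unvoiced flag conventionally feed into: LSFs are converted to LPC coefficients, and an excitation signal (impulse train when voiced, noise when unvoiced) is passed through the resulting all-pole filter. This is a generic textbook vocoder, not the paper's convolutional vocoder; the function names and frame parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lsf_to_lpc(lsf):
    """Convert Line Spectral Frequencies (radians, strictly ascending in
    (0, pi), even order p) to LPC coefficients of
    A(z) = 1 + a1*z^-1 + ... + ap*z^-p."""
    lsf = np.asarray(lsf, dtype=float)
    # Odd-indexed LSFs are roots of the symmetric polynomial P(z),
    # even-indexed LSFs are roots of the antisymmetric polynomial Q(z);
    # each root on the unit circle contributes a factor
    # (1 - 2*cos(w)*z^-1 + z^-2).
    P, Q = np.array([1.0]), np.array([1.0])
    for w in lsf[0::2]:
        P = np.convolve(P, [1.0, -2.0 * np.cos(w), 1.0])
    for w in lsf[1::2]:
        Q = np.convolve(Q, [1.0, -2.0 * np.cos(w), 1.0])
    P = np.convolve(P, [1.0, 1.0])    # trivial root of P(z) at z = -1
    Q = np.convolve(Q, [1.0, -1.0])   # trivial root of Q(z) at z = +1
    # A(z) = (P(z) + Q(z)) / 2; the highest-order coefficient cancels.
    return 0.5 * (P + Q)[:-1]

def synthesize_frame(lsf, f0, voiced, n_samples=200, fs=16000):
    """One frame of source-filter synthesis: impulse train at F0 (voiced)
    or white noise (unvoiced), filtered by the all-pole filter 1/A(z)."""
    if voiced:
        excitation = np.zeros(n_samples)
        period = max(1, int(round(fs / f0)))
        excitation[::period] = 1.0    # pitch-period impulse train
    else:
        excitation = 0.1 * np.random.randn(n_samples)
    return lfilter([1.0], lsf_to_lpc(lsf), excitation)
```

Because strictly ascending LSFs in (0, pi) guarantee a minimum-phase A(z), the synthesis filter above is always stable, which is the usual reason LSFs (rather than raw LPC coefficients) are chosen as a neural network's prediction target.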

Keywords