Journal of Intelligent Systems (Jun 2024)

CCLCap-AE-AVSS: Cycle consistency loss based capsule autoencoders for audio–visual speech synthesis

  • Subhayu Ghosh,
  • Nanda Dulal Jana,
  • Tapas Si,
  • Saurav Mallik,
  • Mohd Asif Shah

DOI
https://doi.org/10.1515/jisys-2023-0171
Journal volume & issue
Vol. 33, no. 1
pp. 3893 – 6

Abstract


Audio–visual speech synthesis (AVSS) is a rapidly growing field within the paradigm of audio–visual learning, involving the conversion of one person’s speech into the audio–visual stream of another while preserving the speech content. AVSS comprises two primary components: voice conversion (VC), which alters the vocal characteristics of the source speaker to those of the target speaker, followed by audio–visual synthesis, which creates the audio–visual presentation of the converted VC output for the target speaker. Despite the progress in deep learning (DL) technologies, DL models for AVSS have received limited attention in the existing literature. Therefore, this article presents a novel approach for AVSS utilizing capsule network (Caps-Net)-based autoencoders with the incorporation of cycle consistency loss. Caps-Net addresses the translation-invariance limitations of convolutional neural network approaches, enabling more effective feature capture. Additionally, the inclusion of cycle consistency loss ensures the retention of content information from the source speaker. The proposed approach is referred to as cycle consistency loss-based capsule autoencoders for audio–visual speech synthesis (CCLCap-AE-AVSS). The proposed CCLCap-AE-AVSS is trained and tested on the VoxCeleb2 and LRS3-TED datasets. Subjective and objective assessments of the generated samples demonstrate the superior performance of the proposed work compared to current state-of-the-art models.
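To illustrate the role of the cycle consistency loss described in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation: the capsule layers are replaced by plain linear encoder/decoder stacks, and all module and function names (Encoder, Decoder, cycle_consistency_loss) are hypothetical placeholders. It only shows how an L1 reconstruction penalty over a full source-to-target-to-source conversion cycle encourages preservation of the source speech content.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy mel-frame encoder standing in for the capsule encoder."""
    def __init__(self, n_mels=80, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Toy decoder mapping latent codes back to mel frames."""
    def __init__(self, latent_dim=128, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_mels),
        )

    def forward(self, z):
        return self.net(z)

# Two mapping autoencoders: source->target and target->source (hypothetical setup).
enc_s2t, dec_s2t = Encoder(), Decoder()
enc_t2s, dec_t2s = Encoder(), Decoder()
l1 = nn.L1Loss()

def cycle_consistency_loss(src_mel, tgt_mel):
    """L1 error after a full conversion cycle in each direction,
    penalizing loss of content information during voice conversion."""
    fake_tgt = dec_s2t(enc_s2t(src_mel))   # source converted toward the target voice
    rec_src = dec_t2s(enc_t2s(fake_tgt))   # converted back; should match the source
    fake_src = dec_t2s(enc_t2s(tgt_mel))
    rec_tgt = dec_s2t(enc_s2t(fake_src))
    return l1(rec_src, src_mel) + l1(rec_tgt, tgt_mel)

# Usage: a batch of 16 mel frames with 80 bins each.
src = torch.randn(16, 80)
tgt = torch.randn(16, 80)
loss = cycle_consistency_loss(src, tgt)
loss.backward()
```

In practice this term would be weighted and combined with the other training objectives of the autoencoders; the sketch isolates only the cycle consistency component.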
