IEEE Access (Jan 2022)

Pose-Aware Speech Driven Facial Landmark Animation Pipeline for Automated Dubbing

  • Dan Bigioi,
  • Hugh Jordan,
  • Rishabh Jain,
  • Rachel McDonnell,
  • Peter Corcoran

DOI: https://doi.org/10.1109/ACCESS.2022.3231137
Journal volume & issue: Vol. 10, pp. 133357–133369

Abstract


A novel neural pipeline for generating pose-aware 3D animated facial landmarks synchronized to a target speech signal is proposed for the task of automatic dubbing. The goal is to automatically synchronize a target actor's lips and facial motion to an unseen speech sequence while maintaining the quality of the original performance. Given a 3D facial keypoint sequence extracted from any reference video and a target audio clip, the neural pipeline learns to generate head-pose-aware, identity-aware landmarks and outputs accurate 3D lip motion directly at inference time. These generated landmarks can be used to render a photo-realistic video via an additional image-to-image conversion stage. In this paper, a novel data augmentation technique is introduced that increases the size of the training dataset from N audio/visual pairs up to N×N unique pairs for the task of automatic dubbing. The trained inference pipeline employs an LSTM-based network that takes Mel coefficients from an unseen speech sequence as input, combined with head pose and identity parameters extracted from a reference video, to generate a new set of pose-aware 3D landmarks synchronized with the unseen speech.
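To make the described inference pipeline concrete, the sketch below (PyTorch) shows one plausible shape of such an LSTM-based network; it is not the authors' implementation, and all dimensions (Mel bands, pose and identity vector sizes, hidden width, landmark count) are hypothetical placeholders. Per-frame Mel coefficients are concatenated with head-pose and identity parameters, and the LSTM regresses a 3D landmark set per audio frame. Note also that the N×N figure follows from cross-pairing: combining each of the N audio tracks with the visual (pose/identity) track of each of the N reference videos yields N×N unique audio/visual training pairs.

```python
# A minimal sketch of a pose- and identity-conditioned LSTM landmark
# generator, in the spirit of the pipeline the abstract describes.
# All dimensions below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class PoseAwareLandmarkLSTM(nn.Module):
    def __init__(self, n_mels=80, pose_dim=6, id_dim=16,
                 hidden_dim=256, n_landmarks=68):
        super().__init__()
        in_dim = n_mels + pose_dim + id_dim
        self.lstm = nn.LSTM(in_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        # Each of the n_landmarks keypoints gets (x, y, z) coordinates.
        self.head = nn.Linear(hidden_dim, n_landmarks * 3)

    def forward(self, mel, pose, identity):
        # mel:      (batch, T, n_mels)   per-frame Mel coefficients
        # pose:     (batch, T, pose_dim) head pose from the reference video
        # identity: (batch, id_dim)      per-speaker identity parameters
        # Broadcast the static identity vector across all T frames.
        identity = identity.unsqueeze(1).expand(-1, mel.size(1), -1)
        x = torch.cat([mel, pose, identity], dim=-1)
        h, _ = self.lstm(x)
        out = self.head(h)
        # Reshape to (batch, T, n_landmarks, 3) for downstream rendering.
        return out.view(out.size(0), out.size(1), -1, 3)


# Usage example with random stand-in tensors:
model = PoseAwareLandmarkLSTM()
mel = torch.randn(2, 100, 80)       # 100 audio frames of 80 Mel bands
pose = torch.randn(2, 100, 6)       # per-frame head-pose parameters
identity = torch.randn(2, 16)       # one identity vector per sequence
landmarks = model(mel, pose, identity)  # -> (2, 100, 68, 3)
```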

Keywords