MusicFace: Music-driven expressive singing face synthesis

Pengfei Liu; Wenjin Deng; Hengda Li; Jintai Wang; Yinglin Zheng; Yiwei Ding; Xiaohu Guo; Ming Zeng

doi:10.1007/s41095-023-0343-7

Computational Visual Media (Nov 2023)

Affiliations

Pengfei Liu: School of Informatics, Xiamen University
Wenjin Deng: School of Informatics, Xiamen University
Hengda Li: School of Informatics, Xiamen University
Jintai Wang: School of Informatics, Xiamen University
Yinglin Zheng: School of Informatics, Xiamen University
Yiwei Ding: School of Informatics, Xiamen University
Xiaohu Guo: Department of Computer Science, The University of Texas at Dallas
Ming Zeng: School of Informatics, Xiamen University

Journal volume & issue: Vol. 10, no. 1, pp. 119–136

Abstract

It remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task with natural motions for the lips, facial expression, head pose, and eyes. Because the human voice and the backing music are mixed together in common music audio signals, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Because the correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states is implicit and complicated, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation in terms of speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling them separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing vivid singing faces, qualitatively and quantitatively better than the prior state-of-the-art.

Published in Computational Visual Media

ISSN: 2096-0433 (Print); 2096-0662 (Online)
Publisher: SpringerOpen
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.springer.com/41095
