大数据 (Sep 2024)
Survey of audio-driven talking face generation technology
Abstract
At the intersection of modern computer vision and natural language processing, digital talking face generation has become an increasingly important research topic. The technology focuses on generating realistic talking face images from a given text or audio sequence. In recent years, deep learning methods such as convolutional neural networks, generative adversarial networks, and neural radiance fields have been applied to digital talking face generation and have shown significant research and application value. These methods have not only attracted widespread attention from the academic community but have also been applied in industry to solve specific problems in image processing and computer vision. Although some progress has been made, the practical application of these technologies still faces many challenges. This survey comprehensively reviews and evaluates how deep learning methods are implemented for digital talking face generation, in order to identify the advantages and disadvantages of existing methods, explore common problems that need to be solved, and highlight open issues that still require further research. In addition, currently available datasets are listed, evaluated, and compared from a statistical perspective so that researchers can more easily choose datasets that meet their needs.