Audio2AB: Audio-driven collaborative generation of virtual character animation

Lichao Niu; Wenjun Xie; Dong Wang; Zhongrui Cao; Xiaoping Liu

Virtual Reality & Intelligent Hardware (Feb 2024)

Affiliations

Lichao Niu: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
Wenjun Xie: School of Software, Hefei University of Technology, Hefei 230009, China; Anhui Province Key Laboratory of Industry Safety and Emergency Technology, Hefei University of Technology, Hefei 230601, China; Corresponding author.
Dong Wang: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
Zhongrui Cao: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
Xiaoping Liu: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China

Journal volume & issue: Vol. 6, no. 1
pp. 56 – 70

Abstract

Background: Considerable research has been conducted in the areas of audio-driven virtual character gestures and facial animation with some degree of success. However, few methods exist for generating full-body animations, and the portability of virtual character gestures and facial animations has not received sufficient attention. Methods: Therefore, we propose a deep-learning-based audio-to-animation-and-blendshape (Audio2AB) network that generates gesture animations and ARKit’s 52 facial expression parameter blendshape weights based on audio, audio-corresponding text, emotion labels, and semantic relevance labels to generate parametric data for full-body animations. This parameterization method can be used to drive full-body animations of virtual characters and improve their portability. In the experiment, we first downsampled the gesture and facial data to achieve the same temporal resolution for the input, output, and facial data. The Audio2AB network then encoded the audio, audio-corresponding text, emotion labels, and semantic relevance labels, and then fused the text, emotion labels, and semantic relevance labels into the audio to obtain better audio features. Finally, we established links between the body, gestures, and facial decoders and generated the corresponding animation sequences through our proposed GAN-GF loss function. Results: By using audio, audio-corresponding text, and emotional and semantic relevance labels as input, the trained Audio2AB network could generate gesture animation data containing blendshape weights. Therefore, different 3D virtual character animations could be created through parameterization. Conclusions: The experimental results showed that the proposed method could generate significant gestures and facial animations.
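
The downsampling step mentioned in the abstract (bringing gesture and facial data to the same temporal resolution) can be illustrated with a minimal sketch. The abstract does not state the frame rates or the resampling method, so everything specific here is an assumption for illustration: the function name `resample`, linear interpolation, the source rates of 120 fps and 60 fps, the target rate of 15 fps, and the single-value gesture frames. Only the count of 52 blendshape weights per facial frame comes from the abstract.

```python
def resample(frames, src_fps, dst_fps):
    """Linearly interpolate a list of per-frame feature vectors to dst_fps.

    Hypothetical helper: the paper's actual resampling method is not
    described in the abstract.
    """
    if not frames:
        return []
    last = len(frames) - 1
    duration = last / src_fps                      # clip length in seconds
    n_out = int(duration * dst_fps + 1e-9) + 1     # output frames, endpoints included
    out = []
    for i in range(n_out):
        t = i * src_fps / dst_fps                  # position in source-frame units
        lo = min(int(t), last)
        hi = min(lo + 1, last)
        w = t - lo                                 # interpolation weight toward hi
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Assumed toy data: one second of gesture values at 120 fps and
# 52 blendshape weights per frame at 60 fps.
gesture_120 = [[float(i)] for i in range(121)]
face_60 = [[i / 60.0] * 52 for i in range(61)]

# Bring both streams to a common (assumed) rate of 15 fps.
gesture_15 = resample(gesture_120, 120, 15)
face_15 = resample(face_60, 60, 15)
```

After resampling, both sequences have the same number of frames, so a gesture frame and a facial frame at the same index refer to the same instant.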

Published in Virtual Reality & Intelligent Hardware

ISSN: 2096-5796 (Print); 2666-1209 (Online)
Publisher: KeAi Communications Co., Ltd.
Country of publisher: China
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware
Website: https://www.keaipublishing.com/en/journals/virtual-reality-and-intelligent-hardware/
