CAAI Transactions on Intelligence Technology (Oct 2024)

Improving diversity of speech‐driven gesture generation with memory networks as dynamic dictionaries

  • Zeyu Zhao,
  • Nan Gao,
  • Zhi Zeng,
  • Guixuan Zhang,
  • Jie Liu,
  • Shuwu Zhang

DOI
https://doi.org/10.1049/cit2.12321
Journal volume & issue
Vol. 9, no. 5
pp. 1275 – 1289

Abstract

Generating co‐speech gestures for interactive digital humans remains challenging because of the non‐deterministic nature of the problem. The authors observe that gestures generated from speech audio or text by existing neural methods often contain less movement shift than expected and therefore appear slow or dull. Thus, a new generative model that couples memory networks, used as dynamic dictionaries, with speech‐driven gesture generation is proposed to improve diversity. More specifically, the dictionary network dynamically stores connections between text and pose features as a list of key‐value pairs, which serves as a memory for the pose generation network to look up; the pose generation network then merges the matching pose features with the input audio features to generate the final pose sequences. To make the improvement accurately measurable, a new objective evaluation metric for gesture diversity that removes the influence of low‐quality motions is also proposed and tested. Quantitative and qualitative experiments demonstrate that the proposed architecture succeeds in generating gestures with improved diversity.
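The dictionary described in the abstract can be understood as a key‐value memory: text features serve as keys, pose features as values, and the generator performs a soft lookup before merging the result with audio features. The following is a minimal illustrative sketch of that idea only; the class and function names, feature dimensions, and the attention-based read and concatenation-based merge are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

class DynamicDictionary:
    """Toy key-value memory: text-feature keys mapped to pose-feature values."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim))    # stored text features
        self.values = np.empty((0, dim))  # stored pose features

    def write(self, text_feat, pose_feat):
        # Dynamically append a new text -> pose association.
        self.keys = np.vstack([self.keys, text_feat])
        self.values = np.vstack([self.values, pose_feat])

    def read(self, query):
        # Soft lookup: attention weights over keys, weighted sum of values.
        scores = self.keys @ query          # (num_entries,)
        weights = softmax(scores)
        return weights @ self.values        # (dim,)

def merge(pose_feat, audio_feat):
    # One simple way to fuse looked-up pose features with audio features
    # before a pose decoder: plain concatenation.
    return np.concatenate([pose_feat, audio_feat])
```

A lookup with a single stored entry returns that entry's pose feature exactly, since the attention weights collapse to 1; with many entries it returns a similarity-weighted blend, which is one plausible reading of "merges the matching pose features".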

Keywords