IEEE Access (Jan 2024)
Gaze Generation for Avatars Using GANs
Abstract
The movement of our eyes during conversation plays a crucial role in communication. Through a mixture of deliberate and subconscious control of our gaze, we nonverbally manage turn-taking and convey information about our state of mind and even neurological disorders. For animated avatars or robots, it is therefore essential to exhibit realistic eye movement in conversation in order to withstand an observer's scrutiny and not fall into the Uncanny Valley; otherwise, the behavior is perceived as unnatural and possibly unsettling, leading to rejection of the avatar as a whole. Although avatars have many promising application areas and great attention has been given to the automatic animation of mouth movements and facial expressions, the animation of the eyes is often left to simplistic, rule-based models or ignored altogether. In this work, we aim to alleviate this limitation by leveraging Generative Adversarial Networks (GANs), a potent machine-learning approach, to synthesize eye movement. By focusing on the restricted scenario of face-to-monitor interaction, we can concentrate on the eyes and ignore additional factors such as gestures, body movement, and the spatial positioning of conversation partners. Using a recently published dataset of eye movements during conversation, we train two GANs and compare their performance against three statistical models with hand-crafted rules. We subject all five models to statistical analysis, comparing them to the ground-truth data. Among the four models that synthesize plausible eye movement, the GANs produce the best results (the fifth model, despite scoring highest, is excluded because it generates implausible movements). Additionally, we perform a user study in which 73 participants compared the models pairwise against one another, yielding a total of 1314 pairwise comparisons. The study shows that the GANs achieve acceptance rates of 55.3% and 43.7%, outperforming the baseline model's rate of 34.0%. Although the best model, which relies on a set of hand-crafted rules, reaches 67.0% and thus beats our GANs, we argue that such a rule-based approach will not remain feasible once information such as emotion or speech is added to the input.
Keywords