IEEE Access (Jan 2024)

STT-Net: Simplified Temporal Transformer for Emotion Recognition

  • Mustaqeem Khan,
  • Abdulmotaleb El Saddik,
  • Mohamed Deriche,
  • Wail Gueaieb

DOI
https://doi.org/10.1109/ACCESS.2024.3413136
Journal volume & issue
Vol. 12
pp. 86220–86231

Abstract

Emotion recognition is a crucial topic in computer vision, aiming to efficiently recognize human emotions from facial expressions. Recently, transformers have been recognized as a robust architecture, and many vision-transformer models for emotion recognition have been proposed. The major drawback of such models is the high computational cost of the attention mechanism when computing space-time attention. To address this burden, we study temporal feature shifting for frame-wise deep learning models. In this work, we propose a novel temporal shifting approach for a frame-wise transformer-based model that replaces multi-head self-attention (MSA) with multi-head self/cross-attention (MSCA) to model temporal interactions between tokens at no additional cost. The proposed MSCA encodes the contextual connections within and between channels and across time, improving the recognition rate and reducing latency for real-world applications. We extensively evaluated our system on the CK+ (Extended Cohn-Kanade) and FER2013+ (Facial Expression Recognition 2013 Plus) benchmark datasets, using geometric-transform-based augmentation to address class imbalance in the data. According to the results, the proposed MSCA either outperforms or closely matches state-of-the-art (SOTA) techniques. In addition, we conducted an ablation study on the challenging FER2013+ dataset to demonstrate the significance and potential of our model for complex emotion recognition tasks.
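
The core idea in the abstract, shifting a fraction of each frame's channels one step forward or backward in time so that a frame-wise attention block sees temporal context without computing full space-time attention, can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation; the class names, the 1/4 shift fraction, the embedding size, and the head count are hypothetical choices for exposition, not the authors' released code.

import torch
import torch.nn as nn

class TemporalShift(nn.Module):
    # Shift a fraction of channels one frame forward/backward in time,
    # mixing temporal context into frame-wise features at zero extra FLOPs.
    def __init__(self, shift_frac: float = 0.25):
        super().__init__()
        self.shift_frac = shift_frac

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, channels)
        b, t, n, c = x.shape
        k = int(c * self.shift_frac)
        out = torch.zeros_like(x)
        out[:, 1:, :, :k] = x[:, :-1, :, :k]            # shift forward in time
        out[:, :-1, :, k:2 * k] = x[:, 1:, :, k:2 * k]  # shift backward in time
        out[:, :, :, 2 * k:] = x[:, :, :, 2 * k:]       # keep remaining channels
        return out

class MSCA(nn.Module):
    # Multi-head self/cross-attention: queries come from the current frame,
    # keys/values from temporally shifted features, so attention spans time
    # while keeping the per-frame cost of standard MSA.
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.shift = TemporalShift()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape
        ctx = self.shift(x)                 # temporally shifted context tokens
        q = x.reshape(b * t, n, c)          # frame-wise queries
        kv = ctx.reshape(b * t, n, c)       # shifted keys/values
        out, _ = self.attn(q, kv, kv)       # cross-attend to neighboring frames
        return out.reshape(b, t, n, c)

if __name__ == "__main__":
    feats = torch.randn(2, 8, 49, 256)      # (batch, frames, tokens, dim)
    print(MSCA()(feats).shape)              # torch.Size([2, 8, 49, 256])

Because the shift is a pure memory rearrangement, this kind of block retains the per-frame attention complexity of standard MSA while letting each frame's queries attend to features borrowed from its temporal neighbors.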

Keywords