IEEE Access (Jan 2023)

Micro Expression Recognition Using Convolution Patch in Vision Transformer

  • Sakshi Indolia,
  • Swati Nigam,
  • Rajiv Singh,
  • Vivek Kumar Singh,
  • Manoj Kumar Singh

DOI
https://doi.org/10.1109/ACCESS.2023.3314797
Journal volume & issue
Vol. 11
pp. 100495–100507

Abstract

Humans possess an intrinsic ability to hide their true emotions. Micro-expressions are subtle, involuntary changes in facial muscles that are brief and therefore difficult to detect. To address this challenge, several machine learning and deep learning models have been proposed in recent years. The convolutional neural network (CNN) is a deep learning method that has been widely adopted in vision-related tasks owing to its remarkable performance. However, CNNs are prone to overfitting because of their large number of trainable parameters, and they cannot capture global information about an input image. Furthermore, identifying the regions that matter for micro-expression classification is itself a challenging task. The self-attention mechanism addresses these issues by focusing on key areas, and transformers built on it, known as vision transformers, are widely explored in vision-related applications. However, existing vision transformers divide an input image into a fixed number of patches, so the local correlation of image pixels is lost; moreover, the self-attention mechanism on which they rely effectively captures global dependencies but does not exploit the local spatial relationships in an image. In this work, we propose a vision transformer based on convolutional patches to overcome this problem. The proposed algorithm generates $c$ feature maps from input images using $c$ convolution filters. These feature maps are then fed to a transformer model as fixed-size image patches to perform classification. The proposed architecture thus leverages the advantages of both convolutional layers and the transformer, capturing spatial information and global dependencies, respectively, which leads to improved performance. The proposed model is evaluated on three benchmark datasets, CASME-I, CASME-II, and SAMM, and compared with state-of-the-art machine and deep learning models; it achieves classification accuracies of 95.97%, 98.59%, and 100%, respectively.
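
A compact sketch helps make the described pipeline concrete. The PyTorch code below (an assumed framework; the abstract does not name one) follows one reading of the architecture: a convolution layer with $c$ filters yields $c$ feature maps, each flattened feature map is embedded as one fixed-size "patch" token, and a standard transformer encoder with a classification token produces the prediction. The layer sizes, the 7×7 stride-4 convolution, and the five-class head are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class ConvPatchViT(nn.Module):
    """Minimal sketch of a convolutional-patch vision transformer.

    One convolution layer with c filters produces c feature maps; each
    flattened map is treated as one token for the transformer encoder.
    All hyperparameters below are illustrative assumptions, not the
    paper's reported settings.
    """

    def __init__(self, img_size=128, c=64, num_classes=5,
                 d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        # c filters -> c feature maps; stride 4 downsamples each map.
        self.conv = nn.Conv2d(1, c, kernel_size=7, stride=4, padding=3)
        patch_dim = (img_size // 4) ** 2            # flattened map size
        self.proj = nn.Linear(patch_dim, d_model)   # embed each "patch"
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, c + 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                        # x: (B, 1, H, W)
        maps = self.conv(x)                      # (B, c, H/4, W/4)
        tokens = self.proj(maps.flatten(2))      # (B, c, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed
        z = self.encoder(z)                      # global self-attention
        return self.head(z[:, 0])                # classify via [CLS] token

model = ConvPatchViT()
logits = model(torch.randn(2, 1, 128, 128))      # two grayscale face crops
print(logits.shape)                              # torch.Size([2, 5])
```

In this sketch the convolution supplies the local spatial inductive bias the abstract attributes to convolutional patches, while self-attention across the $c$ tokens models the global dependencies.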

Keywords