ViTMa: A Novel Hybrid Vision Transformer and Mamba for Kinship Recognition in Indonesian Facial Micro-Expressions

Ike Fibriani; Eko Mulyanto Yuniarno; Ronny Mardiyanto; Mauridhi Hery Purnomo

doi:10.1109/ACCESS.2024.3487180

IEEE Access (Jan 2024)

Affiliations

Ike Fibriani: Department of Electrical Engineering, Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
Eko Mulyanto Yuniarno: Department of Electrical Engineering, Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
Ronny Mardiyanto: Department of Electrical Engineering, Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
Mauridhi Hery Purnomo: Department of Electrical Engineering, Sepuluh Nopember Institute of Technology, Surabaya, Indonesia

DOI: https://doi.org/10.1109/ACCESS.2024.3487180
Journal volume & issue: Vol. 12
pp. 164002 – 164017

Abstract

Kinship recognition based on facial micro-expressions is a challenging problem that aims to determine whether multiple individuals belong to the same family. Previous approaches have been limited by model capacity and insufficient training data, resulting in low-level features and shallow model learning. Such manually designed features cannot capture information effectively, leading to suboptimal accuracy. In this paper, we propose a kinship recognition method that exploits facial micro-expressions using a hybrid Vision Transformer and Mamba (ViTMa) model with modified Deep Feature Fusion, which combines different backbone architectures and feature fusion strategies. The ViTMa model is pre-trained on a large dataset and adapted to Indonesian facial images. The Siamese architecture processes two input images, extracts features from each, combines them through feature fusion, and passes the fused result to a classification network. Experiments on the FIW-Local Indonesia dataset demonstrate the effectiveness of this method: the best model, which uses the B16 backbone with quadratic features and multiplicative fusion, achieves an average accuracy of 85.18% across all kinship categories, outperforming previous approaches. B16, despite being the smallest backbone, performs better than the larger backbones L16 (67.99% average accuracy), B32 (72.98%), and L32 (71.69%). Thus, the ViTMa model with the proposed B16 quadratic features and multiplicative fusion strategy achieves the best performance and outperforms previous studies.
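
The fusion step of the Siamese pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not define "quadratic features" or "multiplicative fusion", so the definitions below are assumptions made only for illustration: quadratic features are taken to be the element-wise squared difference and difference of squares of the two embeddings, and multiplicative fusion is taken to be their element-wise product. The function names (`quadratic_features`, `multiplicative_fusion`, `fuse_pair`) are hypothetical, and plain Python lists stand in for the backbone embeddings.

```python
from typing import List

Vector = List[float]


def _check_same_length(a: Vector, b: Vector) -> None:
    # Both branches of a Siamese network share one backbone, so the two
    # embeddings are expected to have identical dimensionality.
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")


def multiplicative_fusion(a: Vector, b: Vector) -> Vector:
    """Assumed definition: element-wise product of the two embeddings."""
    _check_same_length(a, b)
    return [x * y for x, y in zip(a, b)]


def quadratic_features(a: Vector, b: Vector) -> Vector:
    """Assumed definition: squared difference followed by difference of squares."""
    _check_same_length(a, b)
    squared_diff = [(x - y) ** 2 for x, y in zip(a, b)]
    diff_of_squares = [x * x - y * y for x, y in zip(a, b)]
    return squared_diff + diff_of_squares


def fuse_pair(a: Vector, b: Vector) -> Vector:
    """Concatenate the assumed quadratic and multiplicative terms.

    The result would be the input to a classification network, which is
    not modelled here.
    """
    return quadratic_features(a, b) + multiplicative_fusion(a, b)


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" for the two input faces.
    face_a = [1.0, 2.0]
    face_b = [3.0, 4.0]
    print(fuse_pair(face_a, face_b))
```

Under these assumptions the fused vector is three times the length of a single embedding, and the multiplicative term is symmetric in the two inputs while the difference-of-squares term changes sign when they are swapped.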

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
