IEEE Access (Jan 2024)
ViTMa: A Novel Hybrid Vision Transformer and Mamba for Kinship Recognition in Indonesian Facial Micro-Expressions
Abstract
Kinship recognition from facial micro-expressions is an interesting and challenging problem that aims to determine whether multiple individuals belong to the same family. Previous approaches have been limited by model capacity and insufficient training data, relying on low-level, hand-crafted features and shallow models. Such manual features cannot capture kinship information effectively, leading to suboptimal accuracy. In this paper, we propose a kinship recognition method that exploits facial micro-expressions using a hybrid Vision Transformer and Mamba (ViTMa) model with modified Deep Feature Fusion, which combines different backbone architectures and feature fusion strategies. The ViTMa model is pre-trained on a large dataset and adapted to Indonesian facial images. A Siamese architecture processes two input images, extracts their features, combines them through feature fusion, and passes the result to a classification network. Experiments on the FIW-Local Indonesia dataset demonstrate the effectiveness of this method: the best configuration, using the B16 backbone with quadratic features and multiplicative fusion, achieves an average accuracy of 85.18% across all kinship categories, outperforming previous approaches. We also find that B16, despite being the smallest backbone, performs best compared to larger backbones such as L16 (67.99% average accuracy), B32 (72.98%), and L32 (71.69%). Thus, the ViTMa model with the proposed B16 quadratic-feature and multiplicative fusion strategy achieves the best performance, surpassing previous studies.
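The Siamese fusion step described above can be sketched minimally as follows. Note that this is an illustrative assumption, not the paper's exact implementation: the abstract does not define "quadratic features" or "multiplicative fusion" precisely, so here multiplicative fusion is taken as the element-wise product of the two branch embeddings, and quadratic features as their squared difference, concatenated before classification.

```python
import numpy as np

def fuse_features(f1: np.ndarray, f2: np.ndarray) -> np.ndarray:
    """Fuse two face embeddings from the Siamese branches.

    Assumed definitions (hypothetical, for illustration only):
    - multiplicative fusion: element-wise product f1 * f2
    - quadratic features:    squared difference (f1 - f2) ** 2
    The two are concatenated and fed to the classification network.
    """
    multiplicative = f1 * f2        # assumed multiplicative fusion
    quadratic = (f1 - f2) ** 2      # assumed quadratic features
    return np.concatenate([multiplicative, quadratic])

# Toy example with 4-dimensional embeddings; a real ViT backbone such as
# B16 would emit a much higher-dimensional vector (e.g. 768-d).
f1 = np.array([1.0, 2.0, 3.0, 4.0])
f2 = np.array([4.0, 3.0, 2.0, 1.0])
fused = fuse_features(f1, f2)
print(fused.shape)  # (8,)
```

Both fusion terms are symmetric in the two inputs, which is desirable for kinship verification since the pair (A, B) and the pair (B, A) should receive the same prediction.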
Keywords