Natural Language Processing Journal (Dec 2024)
V-LTCS: Backbone exploration for Multimodal Misogynous Meme detection
Abstract
Memes have become a fundamental part of online communication and humour, reflecting and shaping the culture of today's digital age. This amplified meme culture, however, inadvertently endorses and propagates casual misogyny. This study proposes V-LTCS (Vision-Language Transformer Combination Search), a framework that evaluates all possible combinations of the most fitting Text (viz. BERT, ALBERT, and XLM-R) and Vision (viz. Swin, ConvNeXt, and ViT) Transformer models to determine a backbone architecture for identifying memes that contain misogynistic content. Every feasible Vision-Language Transformer combination drawn from these Text and Vision models is evaluated on two datasets (one smaller and one larger) using standard metrics (viz. Accuracy, Precision, Recall, and F1-Score). The BERT-ViT combination performed best on both datasets, validating its suitability as a backbone architecture for subsequent efforts to recognize Multimodal Misogynous Memes.
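To make the combination search concrete, the following is a minimal sketch of how the nine Text-Vision backbone pairings could be enumerated and instantiated. It is illustrative only: the Hugging Face checkpoint names, the use of the transformers AutoModel API, and late fusion by concatenating pooled embeddings are all assumptions, not details taken from the paper, whose exact fusion and training procedure are described in its methodology.

```python
import itertools

import torch
from torch import nn
from transformers import AutoModel

# Candidate backbones named in the abstract. The specific Hub checkpoints
# below are assumptions; the paper does not name exact checkpoint variants.
TEXT_BACKBONES = ["bert-base-uncased", "albert-base-v2", "xlm-roberta-base"]
VISION_BACKBONES = [
    "microsoft/swin-tiny-patch4-window7-224",
    "facebook/convnext-tiny-224",
    "google/vit-base-patch16-224",
]

class VLCombination(nn.Module):
    """One text encoder plus one vision encoder with a classifier head.

    Late fusion by concatenating pooled embeddings is assumed here for
    illustration; it is not necessarily the fusion used by V-LTCS.
    """

    def __init__(self, text_name: str, vision_name: str, num_labels: int = 2):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_name)
        self.vision_encoder = AutoModel.from_pretrained(vision_name)
        # LazyLinear infers the fused width on the first forward pass,
        # since pooled-feature sizes differ across the candidate backbones.
        self.classifier = nn.LazyLinear(num_labels)

    def forward(self, text_inputs: dict, image_inputs: dict) -> torch.Tensor:
        text_feat = self.text_encoder(**text_inputs).pooler_output
        image_feat = self.vision_encoder(**image_inputs).pooler_output
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

# Enumerate every text-vision pairing (3 x 3 = 9 combinations); each would
# be fine-tuned and scored on both datasets with Accuracy, Precision,
# Recall, and F1-Score, and the best pairing (BERT-ViT) retained.
for text_name, vision_name in itertools.product(TEXT_BACKBONES, VISION_BACKBONES):
    model = VLCombination(text_name, vision_name)
    print(f"candidate backbone: {text_name} + {vision_name}")
```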