AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model

Raheem Sarwar; Le An Ha; Pin Shen Teh; Fahad Sabah; Raheel Nawaz; Ibrahim A. Hameed; Muhammad Umair Hassan

doi:10.1109/ACCESS.2024.3358199

IEEE Access (Jan 2024)

Affiliations

Raheem Sarwar: Department of Operations, Technology, Events and Hospitality Management, Manchester Metropolitan University, Manchester, U.K.
Le An Ha: Research Group in Computational Linguistics, RIILP, University of Wolverhampton, Wolverhampton, U.K.
Pin Shen Teh: Department of Operations, Technology, Events and Hospitality Management, Manchester Metropolitan University, Manchester, U.K.
Fahad Sabah: Faculty of Information Technology, Beijing University of Technology, Beijing, China
Raheel Nawaz: Executive Office, Staffordshire University, Stoke-on-Trent, U.K.
Ibrahim A. Hameed: Department of ICT and Natural Sciences, Norwegian University of Science and Technology, Ålesund, Norway
Muhammad Umair Hassan: Department of ICT and Natural Sciences, Norwegian University of Science and Technology, Ålesund, Norway

DOI: https://doi.org/10.1109/ACCESS.2024.3358199
Journal volume & issue: Vol. 12, pp. 15399–15409

Abstract

In this investigation, we propose AGI-P, a solution for the author gender identification task. This task has real-world applications across several fields, such as marketing and advertising, forensic linguistics, sociology, recommendation systems, language processing, historical analysis, education, and language learning. We created a new dataset to evaluate the proposed method. The dataset consists of 1944 samples in total and is balanced in terms of gender using a random sampling method. We use accuracy as the evaluation measure and compare AGI-P against state-of-the-art machine learning classifiers and fine-tuned pre-trained multilingual language models such as DistilBERT, mBERT, XLM-RoBERTa, and Multilingual DeBERTa. We also propose a customized fine-tuning strategy that improves the accuracy of pre-trained language models on the author gender identification task. Our extensive experiments show that AGI-P outperforms the well-known machine learning classifiers and the fine-tuned pre-trained multilingual language models, reaching an accuracy of 92.03%. Moreover, pre-trained multilingual language models fine-tuned with the proposed customized strategy outperform those fine-tuned with an out-of-the-box strategy. The codebase and corpus are available on our GitHub page: https://github.com/mumairhassan/AGI-P
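
The abstract mentions two mechanical steps that a short sketch can make concrete: balancing a corpus by gender through random sampling, and scoring predictions with accuracy. The Python below is only an illustration under stated assumptions, not the authors' code. It assumes that "random sampling" means random undersampling of the larger class, and the names `balance_by_label` and `accuracy` are hypothetical helpers introduced here.

```python
import random
from collections import defaultdict


def balance_by_label(samples, seed=0):
    """Randomly undersample every class to the size of the smallest class.

    `samples` is a list of (text, label) pairs. Undersampling is an
    assumption made for this sketch; the abstract only says the dataset
    was balanced "using a random sampling method".
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    # Size of the smallest class decides how many items each class keeps.
    smallest = min(len(group) for group in by_label.values())

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

On a balanced two-class dataset, always predicting one class yields 50% accuracy, which is why accuracy is a meaningful measure once the classes have been balanced.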

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
