IEEE Access (Jan 2023)

Textual Pre-Trained Models for Gender Identification Across Community Question-Answering Members

  • Pablo Schwarzenberg,
  • Alejandro Figueroa

DOI
https://doi.org/10.1109/ACCESS.2023.3235735
Journal volume & issue
Vol. 11
pp. 3983–3995

Abstract


Promoting engagement and participation is vital for online social networks such as community Question-Answering (cQA) sites. One way of increasing the contribution of their members is by connecting their content with the right target audience. To achieve this goal, demographic analysis is pivotal in deciphering the interests of each community fellow. Indeed, demographic factors such as gender are fundamental in reducing the gender disparity across distinct topics. This work assesses the classification rate of assorted state-of-the-art transformer-based models (e.g., BERT and FNET) on the task of gender identification across cQA fellows. For this purpose, it benefited from a massive text-oriented corpus encompassing 548,375 member profiles, including their respective full-questions, answers, and self-descriptions. This made it possible to conduct large-scale experiments considering distinct combinations of encoders and sources. Contrary to our initial intuition, on average, self-descriptions were detrimental due to their sparseness. In effect, the best transformer models (i.e., DeBERTa and MobileBERT) achieved an AUC of 0.92 by taking full-questions and answers into account. Our qualitative results reveal that fine-tuning on user-generated content is adversely affected by pre-training on clean corpora, and that this effect can be mitigated by correcting the case of words.
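The abstract's closing remark on "correcting the case of words" refers to truecasing: user-generated text is often lowercased or erratically capitalized, which clashes with the cleanly cased corpora the encoders were pre-trained on. The paper does not spell out its procedure here, so the following is only a minimal illustrative sketch of one common frequency-based approach, with a hypothetical mini-corpus standing in for clean pre-training text:

```python
from collections import Counter

def build_case_lexicon(corpus):
    """Map each lowercased token to its most frequent surface form."""
    counts = {}
    for sentence in corpus:
        for tok in sentence.split():
            counts.setdefault(tok.lower(), Counter())[tok] += 1
    return {low: c.most_common(1)[0][0] for low, c in counts.items()}

def truecase(text, lexicon):
    """Restore the most common casing observed in the reference corpus;
    tokens never seen in the corpus are left untouched."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in text.split())

# Hypothetical cleanly cased reference sentences (illustrative only).
corpus = ["I visited Paris last May", "Paris is lovely in May"]
lex = build_case_lexicon(corpus)
print(truecase("i visited paris in may", lex))  # → "I visited Paris in May"
```

Production truecasers typically add context-sensitive disambiguation (e.g., sentence-initial position, named-entity cues) rather than relying on unigram frequency alone.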

Keywords