PeerJ Computer Science (Oct 2024)
Comparative analysis of automatic gender detection from names: evaluating the stability and performance of ChatGPT versus Namsor and Gender-API
Abstract
Gender classification from names is crucial for investigating a myriad of gender-related research questions. Traditionally, this task has been automated by gender detection tools (GDTs), which now face new industry players in the form of conversational bots like ChatGPT. This paper statistically tests the stability and performance of ChatGPT 3.5 Turbo and ChatGPT 4o for gender detection. It also compares two of the most widely used GDTs (Namsor and Gender-API) against ChatGPT using a dataset of 5,779 records compiled from previous studies, under the most challenging variant of the task: gender inference from the full name alone, without any additional information. Results show that ChatGPT is statistically very stable, with low standard deviations and tight confidence intervals for the same input, while exhibiting only small differences in performance when the prompt changes. ChatGPT slightly outperforms the other tools, with an overall accuracy above 96%, although the difference with both GDTs is around 3%. When the probability returned by the GDTs is factored in, the differences narrow and become comparable in terms of inter-coder reliability and coded error. ChatGPT stands out for its low number of non-classifications (0% in most tests), which, in combination with the other metrics analyzed, makes it a solid alternative for gender inference. This paper contributes to the current literature on gender classification from names by testing the stability and performance of the most widely used state-of-the-art AI tool, suggesting that ChatGPT's generative language model provides a robust alternative to traditional gender application programming interfaces (APIs), although GDTs (especially Namsor) should still be considered for research-oriented purposes.
Keywords