Discover Artificial Intelligence (Jul 2024)

ExaAUAC: Arabic Twitter user age prediction corpus based on language and metadata features

  • Reyhaneh Sadeghi,
  • Ahmad Akbari,
  • Mohammad Mehdi Jaziriyan

DOI
https://doi.org/10.1007/s44163-024-00145-0
Journal volume & issue
Vol. 4, no. 1
pp. 1 – 12

Abstract

Twitter is a rich resource for analyzing social media content, and extracting the age groups of its users can benefit recommender systems, marketing, and advertising. Age detection is one aspect of inferring users' demographic information. This study presents a large-scale corpus of Arabic Twitter users comprising 181k user profiles across six age groups: under 18, 18–24, 25–34, 35–49, 50–64, and 65 or older. The corpus was built using four methods: (1) collecting publicly available birthday announcement tweets through the Twitter Search application programming interface, (2) data augmentation, (3) fetching verified accounts, and (4) manual annotation. To find the age detection model with the highest accuracy and efficiency on this corpus, several experimental settings were compared, including the number of tweets, regression versus classification, the use of user and tweet metadata, and an LSTM+CNN model versus BERT. The proposed methodology relies on language and metadata features; the final model is a BERT model fine-tuned on 70k users and evaluated on 8200 manually annotated users. Compared with the LSTM+CNN model and a similar BERT-based model, our best model yields an improvement of up to 9% in F1-score and 5% in accuracy, respectively. It achieves a macro-averaged F1-score of 44 on the six age groups and an F1-score of 58 on three age groups (under 25, 25–34, and 35 or older). The corpus is available at www.github.com/exaco/ExaAUAC.
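The age grouping and the macro-averaged F1 metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the label strings, and the boundary handling (an 18-year-old falls in 18–24, a 65-year-old in the oldest group) are assumptions made for illustration, as is the mapping from the six groups to the three coarser ones.

```python
from bisect import bisect_right

# Lower bounds of every group after the first (assumed inclusive).
SIX_BOUNDS = [18, 25, 35, 50, 65]
SIX_LABELS = ["<18", "18-24", "25-34", "35-49", "50-64", "65+"]

# Assumed collapse of the six groups into the three coarser groups.
SIX_TO_THREE = {
    "<18": "<25", "18-24": "<25",
    "25-34": "25-34",
    "35-49": "35+", "50-64": "35+", "65+": "35+",
}


def age_to_group(age: int) -> str:
    """Map an age in years to one of the six group labels."""
    return SIX_LABELS[bisect_right(SIX_BOUNDS, age)]


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy ages and predictions, invented purely for demonstration.
    true_groups = [age_to_group(a) for a in [16, 21, 30, 33]]
    pred_groups = ["<18", "25-34", "25-34", "25-34"]
    print(macro_f1(true_groups, pred_groups, ["<18", "18-24", "25-34"]))
```

Because macro averaging weights every class equally, a rarely populated age group affects the score as much as a common one, which is one reason it is often reported alongside accuracy on imbalanced label distributions.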

Keywords