IEEE Access (Jan 2023)
Extracting Mental Health Indicators From English and Spanish Social Media: A Machine Learning Approach
Abstract
This study examines the communications of English- and Spanish-speaking Twitter users through traditional and deep learning algorithms to automatically recognize whether they live with one of nine mental health conditions. We created two datasets in English and Spanish. The “diagnosed” set comprises the timeline of 1,500 users who explicitly reported in one or more of their posts having been diagnosed with one of the following: ADHD, Anxiety, Autism, Bipolar, Depression, Eating disorders, OCD, PTSD, and Schizophrenia. The “control” set comprises the timeline of 1,700 randomly selected users who had not disclosed a diagnosis. We extracted a variety of text features from the collected data, such as n-grams, q-grams, Part-of-speech (POS) tags, topic modeling, Linguistic Inquiry and Word Count (LIWC), and word embeddings, and trained traditional machine-learning and deep learning classifiers for two tasks: binary classification, to distinguish between diagnosed and non-diagnosed users, and multiclass classification, to identify the specific diagnosis. Overall, XGBoost and convolutional neural network (CNN) performed the best in the two classification tasks. Moreover, lexical attributes based on n-grams and q-grams are the ones that performed well in both datasets. Using our collected datasets, for binary classification, we achieved an AUC of 0.835 on the Spanish Twitter dataset using n-grams of words from one to three (UBT) and 0.846 on the English Twitter dataset with a 5-gram characters (C5) model. In multiclass classification, we obtained an AUC of 0.712 and 0.697 in the Spanish and English Twitter datasets, respectively.
Keywords