Extracting Mental Health Indicators From English and Spanish Social Media: A Machine Learning Approach

Miryam Elizabeth Villa-Perez; Luis A. Trejo; Maisha Binte Moin; Eleni Stroulia

doi:10.1109/ACCESS.2023.3332289

IEEE Access (Jan 2023)

Affiliations

Miryam Elizabeth Villa-Perez: School of Engineering and Sciences, Tecnologico de Monterrey, Atizapán de Zaragoza, Mexico
Luis A. Trejo: School of Engineering and Sciences, Tecnologico de Monterrey, Atizapán de Zaragoza, Mexico
Maisha Binte Moin: Department of Computing Science, University of Alberta, Edmonton, Canada
Eleni Stroulia: Department of Computing Science, University of Alberta, Edmonton, Canada

DOI: https://doi.org/10.1109/ACCESS.2023.3332289
Journal volume & issue: Vol. 11, pp. 128135–128152

Abstract

This study analyzes the communications of English- and Spanish-speaking Twitter users with traditional and deep learning algorithms to automatically recognize whether they live with one of nine mental health conditions. We created two datasets, one in English and one in Spanish. The “diagnosed” set comprises the timelines of 1,500 users who explicitly reported, in one or more of their posts, having been diagnosed with one of the following: ADHD, Anxiety, Autism, Bipolar, Depression, Eating disorders, OCD, PTSD, and Schizophrenia. The “control” set comprises the timelines of 1,700 randomly selected users who had not disclosed a diagnosis. We extracted a variety of text features from the collected data, including n-grams, q-grams, part-of-speech (POS) tags, topic modeling, Linguistic Inquiry and Word Count (LIWC), and word embeddings, and trained traditional machine learning and deep learning classifiers for two tasks: binary classification, to distinguish between diagnosed and non-diagnosed users, and multiclass classification, to identify the specific diagnosis. Overall, XGBoost and a convolutional neural network (CNN) performed best in the two classification tasks. Moreover, lexical features based on n-grams and q-grams performed well on both datasets. On our collected datasets, binary classification achieved an AUC of 0.835 on the Spanish Twitter dataset using word n-grams of length one to three (UBT) and 0.846 on the English Twitter dataset using a character 5-gram (C5) model. In multiclass classification, we obtained AUCs of 0.712 and 0.697 on the Spanish and English Twitter datasets, respectively.
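
To make the two best-performing lexical representations and the reported metric concrete, the following is a minimal sketch, not the authors' implementation. It counts word n-grams of length one to three (the UBT setting) and character 5-grams (the C5 setting), and computes AUC as the fraction of positive/negative pairs ranked correctly. The function names, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the preprocessing actually used.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Count word n-grams for every n in [n_min, n_max] (UBT when 1..3).

    Assumption: lowercase + whitespace tokenization, chosen for simplicity.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def char_ngrams(text, n=5):
    """Count overlapping character n-grams (C5 when n == 5)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def roc_auc(labels, scores):
    """AUC as the probability that a positive example outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical post and classifier scores, for illustration only.
    print(word_ngrams("I feel fine today"))
    print(char_ngrams("I feel fine today"))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In practice, counts like these would be turned into per-user feature vectors and passed to a classifier such as XGBoost; the pairwise AUC above is quadratic in the number of examples and is meant only to show what the metric measures.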

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
