Data in Brief (Oct 2020)
An annotated data set for identifying women reporting adverse pregnancy outcomes on Twitter
Abstract
Despite the prevalence in the United States of miscarriage [1], stillbirth [2], and infant mortality associated with preterm birth and low birthweight [3], their causes remain largely unknown [4–6]. To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome (“outcome” tweets) from those that merely mention the outcome (“non-outcome” tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). The tweets annotated as “outcome” include 1318 women reporting miscarriage, 94 stillbirth, 591 preterm birth or premature labor, 171 low birthweight, 453 neonatal intensive care, and 356 fetal/infant loss in general. These “outcome” tweets can be used to explore patient experiences and perceptions of adverse pregnancy outcomes, and can direct researchers to the users’ broader timelines—tweets posted by a user over time—for observational studies. Our past work demonstrates the analysis of timelines for selecting a study population [8] and conducting a case-control study [9] of users reporting that their child has a birth defect. For larger-scale studies, the full annotated corpus can be used to train supervised machine learning algorithms to automatically identify additional users reporting adverse pregnancy outcomes on Twitter. 
We used the annotated corpus to train feature-engineered and deep learning-based classifiers presented in “A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes” [10].
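The inter-annotator agreement reported above is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. As an illustration only, the following minimal sketch computes kappa for binary "outcome" / "non-outcome" labels. The function name `cohens_kappa` and the toy label lists are hypothetical and are not taken from the data set or from the authors' code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (1 = "outcome", 0 = "non-outcome").
# These are NOT drawn from the annotated corpus.
annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.58
```

In this toy example the annotators agree on 8 of 10 items (observed agreement 0.80), while the marginal frequencies give a chance agreement of 0.52, so kappa is (0.80 - 0.52) / (1 - 0.52), or about 0.58. The value of 0.90 reported for the corpus indicates much closer agreement than in this illustration.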