Data in Brief (Jun 2023)

BillionCOV: An enriched billion-scale collection of COVID-19 tweets for efficient hydration

  • Rabindra Lamsal,
  • Maria Rodriguez Read,
  • Shanika Karunasekera

Journal volume & issue
Vol. 48
p. 109229

Abstract

The COVID-19 pandemic has introduced new norms such as social distancing, face masks, quarantine, lockdowns, travel restrictions, work/study from home, and business closures. The pandemic’s seriousness has made people vocal on social media, especially on microblogs such as Twitter. Since the early days of the outbreak, researchers have been collecting and sharing large-scale datasets of COVID-19 tweets. However, the existing datasets carry issues related to proportion and redundancy. We report that more than 500 million tweet identifiers point to deleted or protected tweets. To address these issues, this paper introduces an enriched global billion-scale English-language COVID-19 tweets dataset, BillionCOV, which contains 1.4 billion tweets originating from 240 countries and territories between October 2019 and April 2022. Importantly, BillionCOV lets researchers filter tweet identifiers for efficient hydration. We anticipate that a dataset of this scale, with global scope and extended temporal coverage, will aid in obtaining a thorough understanding of the pandemic’s conversational dynamics.

Keywords