Multimodal Technologies and Interaction (Aug 2019)

Data-Driven Lexical Normalization for Medical Social Media

  • Anne Dirkson,
  • Suzan Verberne,
  • Abeed Sarker,
  • Wessel Kraaij

DOI
https://doi.org/10.3390/mti3030060
Journal volume & issue
Vol. 3, no. 3
p. 60

Abstract

In the medical domain, user-generated social media text is increasingly used as a valuable complementary knowledge source to scientific medical literature. The extraction of this knowledge is complicated by colloquial language use and misspellings. However, lexical normalization of such data has not been addressed effectively. This paper presents a data-driven lexical normalization pipeline with a novel spelling correction module for medical social media. Our method significantly outperforms state-of-the-art spelling correction methods and can detect mistakes with an F1 of 0.63 despite extreme imbalance in the data. We also present the first corpus for spelling mistake detection and correction in a medical patient forum.

Keywords