BMC Bioinformatics (Jun 2018)

Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction mention extraction

  • Shashank Gupta,
  • Sachin Pawar,
  • Nitin Ramrakhiyani,
  • Girish Keshav Palshikar,
  • Vasudeva Varma

DOI
https://doi.org/10.1186/s12859-018-2192-4
Journal volume & issue
Vol. 19, no. S8
pp. 1 – 7

Abstract

Read online

Abstract Background Social media is a useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from Twitter. Medical information extraction from social media is challenging, mainly due to short and highly informal nature of text, as compared to more technical and formal medical reports. Methods Current methods in ADR mention extraction rely on supervised learning methods, which suffer from labeled data scarcity problem. The state-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN) which is Long-Short-Term-Memory network (LSTM). Deep neural networks, due to their large number of free parameters rely heavily on large annotated corpora for learning the end task. But in the real-world, it is hard to get large labeled data, mainly due to the heavy cost associated with the manual annotation. Results To this end, we propose a novel semi-supervised learning based RNN model, which can leverage unlabeled data also present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction. Conclusion In this study, we tackle the problem of labeled data scarcity for Adverse Drug Reaction mention extraction from social media and propose a novel semi-supervised learning based method which can leverage large unlabeled corpus available in abundance on the web. Through empirical study, we demonstrate that our proposed method outperforms fully supervised learning based baseline which relies on large manually annotated corpus for a good performance.

Keywords