IEEE Access (Jan 2020)

Investigating of Disease Name Normalization Using Neural Network and Pre-Training

  • Yinxia Lou,
  • Tao Qian,
  • Fei Li,
  • Junxiang Zhou,
  • Donghong Ji,
  • Ming Cheng

DOI
https://doi.org/10.1109/ACCESS.2020.2992130
Journal volume & issue
Vol. 8
pp. 85729 – 85739

Abstract

Read online

Normalizing disease names is a crucial task for biomedical and healthcare domains. Previous work explored various approaches, including rules, machine learning and deep learning, which focused on only one approach or one model. In this study, we systematically investigated the performances of various neural models and the effects of different features. Our investigation was performed on two benchmark datasets, namely the NCBI disease corpus and the BioCreative V Chemical Disease Relation (BC5CDR) corpus. The convolutional neural network (CNN) performed the best (F1 90.11%) in the NCBI disease corpus and the attention neural network (Attention) performed the best (F1 90.78%) in the BC5CDR corpus. Compared with the state-of-the-art system, DNorm, our models improved the F1s by 1.74% and 0.86% respectively. In terms of features, character information could improve the F1 by about 0.5-1.0% while sentence information worsened the F1 by about 3-4%. Moreover, we proposed a novel approach for pre-training models, which improved the F1 by up to 9%. The CNN and Attention models are comparable in the task of disease name normalization while the recurrent neural network performs much worse. In addition, character information and pre-training techniques are helpful for this task while sentence information hurts the performance. Our proposed models and pre-training approach can be easily adapted to the normalization task for any other type of entities. Our source code is available at: https://github.com/yx100/EntityNorm.

Keywords