Mathematical Biosciences and Engineering (Apr 2021)
A method based on multi-standard active learning to recognize entities in electronic medical record
Abstract
Deep neural networks (DNNs) have achieved good results in Named Entity Recognition (NER), but most DNN methods depend on large amounts of annotated data. Electronic Medical Records (EMRs) are text data from a specialized professional field, and annotating them requires experts with strong medical knowledge and considerable labeling time. To address the specialized medical domain, the large data volume, and the annotation difficulty of EMRs, we propose a new method based on multi-standard active learning to recognize entities in EMRs. Our approach uses three criteria (the number of labeled data, the cost of sentence annotation, and the balance of data sampling) to choose the active learning strategy. We put forward an uncertainty calculation and a sentence annotation measurement rule that are better suited to neural network models for NER. We also use incremental training to speed up the iterative training in the active learning process. Finally, named entity recognition experiments on breast clinical EMRs show that the method achieves the same NER accuracy given samples of the same quality. Compared with the traditional supervised learning method of randomly selecting labeled data, the method proposed in this paper reduces the amount of data that needs to be labeled by 66.67%. In addition, we propose an improved TF-IDF method based on Word2Vec that takes word frequency into account when vectorizing the text.
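The abstract does not give the formulas behind the uncertainty calculation or the annotation cost rule, so the following is only a minimal sketch of pool-based uncertainty sampling under stated assumptions, not the method proposed in the paper. It assumes a generic length-normalized least-confidence score and a cost of one unit per token; the names `sentence_uncertainty`, `select_batch`, and `budget_tokens` are hypothetical.

```python
import math


def sentence_uncertainty(token_probs):
    """Length-normalized uncertainty of one sentence (assumed scoring rule).

    token_probs: for each token, the model's probability of its predicted tag.
    """
    if not token_probs:
        return 0.0
    # Average negative log-probability, so that long sentences are not
    # favored merely for containing more tokens.
    return -sum(math.log(max(p, 1e-12)) for p in token_probs) / len(token_probs)


def select_batch(pool, budget_tokens):
    """Greedily pick the most uncertain sentences within a token budget.

    pool: list of (sentence_id, token_probs) pairs from the unlabeled set.
    budget_tokens: assumed annotation budget, one cost unit per token.
    """
    ranked = sorted(pool, key=lambda item: sentence_uncertainty(item[1]), reverse=True)
    chosen, spent = [], 0
    for sentence_id, probs in ranked:
        cost = len(probs)
        if spent + cost <= budget_tokens:
            chosen.append(sentence_id)
            spent += cost
    return chosen
```

In an active learning loop, the selected sentences would be sent to annotators, added to the labeled set, and the model would then be updated before the next round of selection.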
Keywords