A method based on multi-standard active learning to recognize entities in electronic medical record

Qiao Pan; Chen Huang; Dehua Chen

doi:10.3934/mbe.2021054

Mathematical Biosciences and Engineering (Apr 2021)

Affiliations

Qiao Pan: School of Computer Science and Technology, Donghua University, Shanghai 201620, China
Chen Huang: School of Computer Science and Technology, Donghua University, Shanghai 201620, China
Dehua Chen: School of Computer Science and Technology, Donghua University, Shanghai 201620, China

DOI: https://doi.org/10.3934/mbe.2021054
Journal volume & issue: Vol. 18, no. 2
pp. 1000 – 1021

Abstract

Deep neural networks (DNNs) have achieved good results in Named Entity Recognition (NER), but most DNN methods depend on large amounts of annotated data. Electronic Medical Records (EMRs) are text data from a specialized professional field, and annotating them requires experts with strong medical knowledge and a considerable amount of labeling time. To address the specialized medical domain, large data volume, and annotation difficulty of EMRs, we propose a new method based on multi-standard active learning to recognize entities in EMRs. Our approach uses three criteria to choose the active learning strategy: the number of labeled data, the cost of sentence annotation, and the balance of data sampling. We put forward an uncertainty calculation and a sentence annotation cost measure that are better suited to neural network models for NER. We also use incremental training to speed up the iterative training during active learning. Finally, NER experiments on breast clinical EMRs show that the method can achieve the same NER accuracy provided the selected samples are of the same quality. Compared with traditional supervised learning on randomly selected labeled data, the proposed method reduces the amount of data that needs to be labeled by 66.67%. In addition, an improved TF-IDF method based on Word2Vec is proposed to vectorize the text while taking word frequency into account.
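
The abstract does not state the paper's uncertainty formula or its rule for measuring annotation cost, so the following is only an illustrative sketch of how uncertainty and annotation cost might be combined when choosing sentences to label. It assumes mean token entropy as the uncertainty measure and one cost unit per token; the names `select_for_annotation` and `token_budget`, and the sample probabilities, are hypothetical and not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(token_probs):
    """Mean token entropy, so long sentences are not favoured only for their length."""
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def select_for_annotation(pool, token_budget):
    """Greedily pick the most uncertain sentences that fit a token budget.

    pool maps a sentence id to its per-token label distributions.
    """
    ranked = sorted(
        pool, key=lambda sid: sentence_uncertainty(pool[sid]), reverse=True
    )
    chosen, used = [], 0
    for sid in ranked:
        cost = len(pool[sid])  # assumed cost model: one unit per token
        if used + cost <= token_budget:
            chosen.append(sid)
            used += cost
    return chosen


# Hypothetical model outputs over three labels (for example O, B-ENT, I-ENT).
pool = {
    "s1": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],
    "s2": [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33], [0.50, 0.25, 0.25]],
    "s3": [[0.60, 0.30, 0.10], [0.90, 0.05, 0.05]],
}
print(select_for_annotation(pool, token_budget=5))
```

In an active learning loop, the selected sentences would be sent to annotators, added to the labeled set, and the model retrained before the next round; the sampling-balance criterion mentioned in the abstract is not modelled here.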

Published in Mathematical Biosciences and Engineering

ISSN: 1551-0018 (Online)
Publisher: AIMS Press
Country of publisher: United States
LCC subjects: Technology: Chemical technology: Biotechnology; Science: Mathematics
Website: https://www.aimspress.com/journal/MBE
