Nongye tushu qingbao xuebao (Jan 2023)

Selection of Papers on the Origins of COVID-19 and Entity Annotation Based on Full Texts

  • XU Shuo, ZHANG Mengmeng, LIU Liyuan, WANG Congcong, SUN Rui, LI Yilin, XU Jinnan, AN Xin

DOI
https://doi.org/10.13998/j.cnki.issn1002-1248.22-0662
Journal volume & issue
Vol. 35, no. 1
pp. 87 – 98

Abstract

[Purpose/Significance] Since the outbreak of COVID-19, the number of related studies has grown rapidly both domestically and internationally. Reviewing this literature provides data resources for research on the emergence and transmission mechanisms of SARS-CoV-2. However, existing COVID-19 datasets are undifferentiated collections of literature that are not classified by subfield, and coarse-grained information such as titles and authors cannot support an in-depth understanding of research progress. This paper therefore constructed a dataset for one COVID-19 subfield and a fine-grained entity dataset. [Method/Process] First, this paper proposed a literature screening method based on active learning, which obtains more informative labeled samples at lower labeling cost, so that the classifier generalizes better. Three base classifiers were considered, Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF), together with four query strategies: uncertainty sampling, expected error reduction, committee-based query, and random sampling. Taking the origin of SARS-CoV-2 as an example subfield, articles on the origin of the virus were located efficiently and accurately in the literature. Second, this paper designed an annotation scheme covering 18 entity types, including entities common across the biological domain, such as genes, proteins, and compounds, as well as entities specific to SARS-CoV-2 research, such as coronaviruses and wild animals. Entities were annotated with the visual annotation tool BRAT. The annotation team consisted of one administrator and six annotators, and annotation was carried out in two rounds. In addition, a multi-kappa (multi-κ) agreement index was used to calculate the consistency score of the annotation results.
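To make the screening procedure described above more concrete, the following is a minimal sketch of pool-based active learning with uncertainty sampling. It is not the authors' implementation: the small gradient-descent logistic regression stands in for the SVM, LR, and RF base classifiers, the two-dimensional synthetic data stand in for article features, and every function name (`predict_proba`, `train_logreg`, `most_uncertain`, `active_learning`, `demo`) is a hypothetical choice made for illustration.

```python
import math
import random


def predict_proba(w, x):
    """Probability that x is relevant; w holds feature weights plus a trailing bias."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Tiny logistic regression fitted by batch gradient descent (illustrative stand-in)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err  # bias gradient
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def most_uncertain(w, X, pool):
    """Uncertainty sampling: pick the pool item whose probability is closest to 0.5."""
    return min(pool, key=lambda i: abs(predict_proba(w, X[i]) - 0.5))


def active_learning(X, oracle, seed, n_queries):
    """Retrain, query the most uncertain item, ask the oracle for its label, repeat."""
    labels = {i: oracle(i) for i in seed}
    pool = [i for i in range(len(X)) if i not in labels]
    for _ in range(n_queries):
        idx = list(labels)
        w = train_logreg([X[i] for i in idx], [labels[i] for i in idx])
        q = most_uncertain(w, X, pool)
        labels[q] = oracle(q)  # in practice, a human annotator supplies this label
        pool.remove(q)
    idx = list(labels)
    return train_logreg([X[i] for i in idx], [labels[i] for i in idx]), labels


def demo():
    # Synthetic "articles": an item counts as relevant when its two features sum above zero.
    rng = random.Random(0)
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
    truth = [1 if a + b > 0 else 0 for a, b in X]
    seed = [truth.index(1), truth.index(0)]  # one labeled example per class
    w, labels = active_learning(X, lambda i: truth[i], seed, n_queries=20)
    acc = sum((predict_proba(w, x) > 0.5) == bool(t) for x, t in zip(X, truth)) / len(X)
    print(f"labeled {len(labels)} of {len(X)} items, accuracy {acc:.2f}")


if __name__ == "__main__":
    demo()
```

The random-sampling baseline mentioned in the abstract would correspond to replacing `most_uncertain` with a random draw from the pool; the other two query strategies need more machinery and are not sketched here.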
[Results/Conclusions] The active learning experiments show that the uncertainty sampling query strategy performs best. With uncertainty sampling, SVM, LR, and RF correctly screened 425, 465, and 489 articles, respectively. After overlapping articles were removed, the resulting dataset on the origin of SARS-CoV-2 contained a total of 885 articles. Second, based on the proposed annotation scheme, the six annotators annotated 99 papers. From these fine-grained annotations, this paper constructed an entity dataset containing 39,118 entities, which is the largest and most comprehensive entity corpus in the field of COVID-19.
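The abstract does not give the formula behind the consistency score of the annotation results. As a hedged illustration only, the sketch below computes Fleiss' kappa, a widely used multi-rater agreement statistic; whether the paper's index matches this exact formulation is an assumption, and the function name `fleiss_kappa` and the toy counts are hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of annotators
    who assigned item i to category j. Every row must sum to the same number of
    annotators, and at least two categories must actually be used."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Share of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]

    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Toy table: 4 text spans, 3 annotators, 2 entity types (counts per type).
    table = [[3, 0], [0, 3], [2, 1], [3, 0]]
    print(f"kappa = {fleiss_kappa(table):.3f}")
```

A value of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance.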

Keywords