Scientific Data (Jul 2024)

AgCNER, the First Large-Scale Chinese Named Entity Recognition Dataset for Agricultural Diseases and Pests

  • Xiaochuang Yao,
  • Xia Hao,
  • Ruilin Liu,
  • Lin Li,
  • Xuchao Guo

DOI
https://doi.org/10.1038/s41597-024-03578-5
Journal volume & issue
Vol. 11, no. 1
pp. 1 – 14

Abstract

Read online

Abstract Named entity recognition is a fundamental subtask for knowledge graph construction and question-answering in the agricultural diseases and pests field. Although several works have been done, the scarcity of the Chinese annotated dataset has restricted the development of agricultural diseases and pests named entity recognition(ADP-NER). To address the issues, a large-scale corpus for the Chinese ADP-NER task named AgCNER was first annotated. It mainly contains 13 categories, 206,992 entities, and 66,553 samples with 3,909,293 characters. Compared with other datasets, AgCNER maintains the best performance in terms of the number of categories, entities, samples, and characters. Moreover, this is the first publicly available corpus for the agricultural field. In addition, the agricultural language model AgBERT is also fine-tuned and released. Finally, the comprehensive experimental results showed that BiLSTM-CRF achieved F1-score of 93.58%, which would be further improved to 94.14% using BERT. The analysis from multiple aspects has verified the rationality of AgCNER and the effectiveness of AgBERT. The annotated corpus and fine-tuned language model are publicly available at https://doi.org/XXX and https://github.com/guojson/AgCNER.git .