Named Entity Recognition Datasets: A Classification Framework

Ying Zhang; Gang Xiao

doi:10.1007/s44196-024-00456-1

International Journal of Computational Intelligence Systems (Mar 2024)

Affiliations

Ying Zhang: Institute of Systems Engineering, Academy of Military Sciences (AMS)
Gang Xiao: Institute of Systems Engineering, Academy of Military Sciences (AMS)

DOI: https://doi.org/10.1007/s44196-024-00456-1
Journal volume & issue: Vol. 17, no. 1
pp. 1–17

Abstract

Named entity recognition is a fundamental task that underpins many other tasks and applications in natural language processing. In the Internet age, a large proportion of the information handled by computer applications is stored in structured and unstructured forms and used for language and text processing. Before neural networks were widely adopted for natural language processing, research on named entity recognition usually focused on leveraging lexical and syntactic knowledge to improve the performance of models and methods. To advance the field, researchers have for many years created named entity recognition datasets through conferences, projects, and competitions, each driven by its own research goals, and have trained increasingly accurate entity recognition models on them. The datasets themselves, however, have received little study. In particular, although many datasets have appeared since the task was introduced, there is no clear framework that summarizes the development of these seemingly independent datasets. A closer look at the context in which each dataset was developed, and at the features it contains, reveals that the datasets share common features to varying degrees. In this paper, we review the development of named entity recognition datasets over the years and describe them along five dimensions: the language of the dataset, the research domain, the entity types, the entity granularity, and the entity annotation. Finally, we propose an idea for the creation of subsequent named entity recognition datasets.
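As a rough illustration of the five descriptive dimensions named in the abstract, a dataset can be recorded as a small structured profile and grouped along any one dimension. The sketch below is not taken from the paper: the class name `NERDatasetProfile`, the helper `group_by`, the placeholder dataset names, and the value labels ("coarse", "fine", "flat", "nested") are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class NERDatasetProfile:
    """Hypothetical record describing one NER dataset along five dimensions."""
    name: str
    language: str                  # language of the dataset
    domain: str                    # research domain
    entity_types: Tuple[str, ...]  # entity types covered
    granularity: str               # assumed labels: "coarse" or "fine"
    annotation: str                # assumed labels: "flat" or "nested"


def group_by(profiles: Iterable[NERDatasetProfile], dimension: str) -> Dict[str, List[str]]:
    """Group dataset names by the value of one single-valued dimension."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for profile in profiles:
        groups[getattr(profile, dimension)].append(profile.name)
    return dict(groups)


# Placeholder profiles (invented for illustration, not real datasets).
profiles = [
    NERDatasetProfile("example-news-en", "English", "news",
                      ("PER", "LOC", "ORG"), "coarse", "flat"),
    NERDatasetProfile("example-bio-en", "English", "biomedical",
                      ("GENE", "DISEASE"), "fine", "nested"),
    NERDatasetProfile("example-news-zh", "Chinese", "news",
                      ("PER", "LOC", "ORG"), "coarse", "flat"),
]

by_domain = group_by(profiles, "domain")
by_language = group_by(profiles, "language")
```

Grouping on a single dimension at a time is one simple way to see which otherwise independent datasets share a feature; the paper's own categories for each dimension may differ from the labels assumed here.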

Published in International Journal of Computational Intelligence Systems

ISSN: 1875-6891 (Print); 1875-6883 (Online)
Publisher: Springer
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.springer.com/journal/44196
