IEEE Access (Jan 2025)

OWNER — Toward Unsupervised Open-World Named Entity Recognition

  • Pierre-Yves Genest,
  • Pierre-Edouard Portier,
  • Elod Egyed-Zsigmond,
  • Martino Lovisetto

DOI
https://doi.org/10.1109/access.2025.3552122
Journal volume & issue
Vol. 13
pp. 50077 – 50105

Abstract

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP), traditionally addressed through supervised learning, which requires extensive annotated corpora. This requirement poses challenges, particularly in specialized domains with limited labeled data. In response, the field has shifted towards lower-resource approaches, such as few-shot and zero-shot learning, which reduce the dependency on annotated data. However, even zero-shot models require prior knowledge of entity types, limiting their applicability in exploratory scenarios. In this context, we introduce OWNER, our unsupervised and open-world NER model, designed to operate without annotated documents or predefined entity types. OWNER leverages Encoder-only Language Models like BERT to infer and organize entities into dynamic entity types through a two-step process: mention detection and entity typing. Mention detection employs a BIO sequence labeling approach to locate entities, while entity typing uses BERT-based embeddings, refined through contrastive learning, for clustering and naming entity types. This method allows OWNER to automatically identify and structure unknown entity types, offering advantages for exploratory dataset analysis and knowledge graph construction. Our experimental evaluation on 13 domain-specific datasets demonstrates that OWNER surpasses existing LLM-based open-world NER models and remains competitive with more supervised and closed-world zero-shot models. OWNER’s architecture provides a lightweight, easily deployable solution that advances the state of the art in unsupervised and open-world NER. The source code of OWNER is publicly available at https://github.com/alteca/OWNER, facilitating future research in this domain.
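
The mention detection step described in the abstract relies on BIO sequence labeling, where each token is tagged as the Beginning of a mention, Inside one, or Outside any mention. The following is a minimal sketch of how such a tag sequence could be decoded into mention spans. It is illustrative only and is not taken from the OWNER code base; the function name `bio_to_spans`, the untyped tag set ("B", "I", "O"), and the example sentence are assumptions.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end) token spans, end exclusive.

    Hypothetical helper for illustration; assumes untyped tags "B", "I", "O".
    """
    spans = []
    start = None  # index where the currently open mention began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new mention starts; close the previous one if it is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Tolerate an "I" with no preceding "B" by opening a mention here.
            if start is None:
                start = i
        else:
            # "O" closes any open mention.
            if start is not None:
                spans.append((start, i))
                start = None
    # Close a mention that runs to the end of the sequence.
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Example sentence and tags are made up for this sketch.
tokens = ["Marie", "Curie", "worked", "in", "Paris", "."]
tags = ["B", "I", "O", "O", "B", "O"]
mentions = [" ".join(tokens[s:e]) for s, e in bio_to_spans(tags)]
print(mentions)  # ['Marie Curie', 'Paris']
```

In a pipeline like the one the abstract outlines, spans of this kind would then be embedded and clustered to form entity types; that second step is not shown here.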

Keywords