Data in Brief (Oct 2024)

FloraNER: A new dataset for species and morphological terms named entity recognition in French botanical text

  • Ayoub Nainia,
  • Régine Vignes-Lebbe,
  • Eric Chenin,
  • Maya Sahraoui,
  • Hajar Mousannif,
  • Jihad Zahir

Journal volume & issue
Vol. 56
p. 110824

Abstract

FloraNER is a distantly supervised named entity recognition (NER) dataset. It is built from French botanical literature extracted from the OCR-preprocessed flora of New Caledonia, provided by the National Museum of Natural History in France (MNHN), and distantly annotated with a French botanical corpus created by merging botanical lexicons available online. FloraNER comprises separate sub-datasets for recognizing plant species names, coarse-grained botanical morphological terms, and fine-grained botanical morphological terms. The resulting datasets are in CSV format and contain the textual data, the identified named entities, and their annotations. They cover one named entity type, “Species” (Espèce in French), for species name identification; two named entity types, “Organ” and “Descriptor”, for coarse-grained morphological term identification; and eight named entity types for fine-grained morphological term identification: Organ, Descriptor, Form, Color, Development, Structure, Surface, Position, Disposition, and Measure. The dataset can be used to train and evaluate named entity recognition models that extract information from French botanical literature.

Keywords