IEEE Access (Jan 2023)

Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets

  • Yousef Younes,
  • Ansgar Scherp

DOI
https://doi.org/10.1109/ACCESS.2023.3309148
Journal volume & issue
Vol. 11
pp. 92775 – 92787

Abstract

Dataset mention extraction is a difficult problem due to the unstructured nature of text, the sparsity of dataset mentions, and the various ways the same dataset can be mentioned. Extracting unknown dataset mentions, i.e., mentions that are not part of the model's training data, is even harder. We address this challenge in two ways. First, we consider a two-step approach in which a binary classifier first identifies positive contexts, i.e., sentences that contain a dataset mention. We consider multiple transformer-based models and strong baselines for this task. Subsequently, the dataset is extracted from the positive context. Second, we consider a one-step approach that directly aims to detect and extract a possible dataset mention. For the extraction of datasets, we consider transformer models in named entity recognition (NER) mode. We contrast NER with the transformers' capabilities for question answering (QA). We use the Coleridge Initiative "Show US the Data" dataset consisting of 14.3k scientific papers with about 35k mentions of datasets. We found that using transformers in QA mode is a better choice than NER for extracting unknown datasets. The rationale is that detecting new datasets is an out-of-vocabulary task, i.e., the dataset name has never been seen during training. Comparing the two-step and one-step approaches, we found contrasting strengths. A two-step dataset extraction using an MLP for filtering and RoBERTa in QA mode extracts more dataset mentions than a one-step system, but at the cost of a lower F1-score of 62.7%. A one-step extraction with DeBERTa in QA mode achieves the highest F1-score of 92.88% at the cost of missing dataset mentions. We recommend the one-step approach when accuracy is more important, and the two-step approach when there is a postprocessing mechanism for the extracted dataset mentions, e.g., a manual check. The source code is available at https://github.com/yousef-younes/dataset_mention_extraction.
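To make the one-step, QA-mode extraction concrete, the following is a minimal sketch assuming the Hugging Face transformers library. The model checkpoint and the question wording are illustrative assumptions, not the authors' exact configuration; the actual training and inference code is in the linked repository.

```python
# Illustrative one-step dataset extraction via extractive QA.
# Assumptions: a publicly available SQuAD-style DeBERTa checkpoint and a
# hand-written question; the paper's own fine-tuned models may differ.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/deberta-v3-large-squad2",  # assumed QA checkpoint
)

# A candidate context sentence from a scientific paper.
context = (
    "We evaluate our approach on the Coleridge Initiative Show US the Data "
    "corpus and report results on unseen dataset names."
)

# The QA model extracts a span from the context as the answer,
# so unseen (out-of-vocabulary) dataset names can still be returned.
result = qa(
    question="Which dataset is mentioned in this text?",
    context=context,
)
print(result["answer"], result["score"])
```

In the two-step variant described above, a binary classifier (e.g., an MLP over sentence embeddings) would first decide whether the context is positive, and only positive contexts would be passed to a QA model of this kind.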

Keywords