Multi-Keyword Classification: A Case Study in Finnish Social Sciences Data Archive

Erjon Skenderi; Jukka Huhtamäki; Kostas Stefanidis

doi:10.3390/info12120491

Information (Nov 2021)

Affiliations

Erjon Skenderi: Faculty of Management and Business, Tampere University, 33100 Tampere, Finland
Jukka Huhtamäki: Faculty of Management and Business, Tampere University, 33100 Tampere, Finland
Kostas Stefanidis: Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland

DOI: https://doi.org/10.3390/info12120491
Journal volume & issue: Vol. 12, no. 12, p. 491

Abstract

In this paper, we consider the task of assigning relevant labels to studies in the social science domain. Manual labelling is expensive and prone to human error, and various multi-label text classification approaches based on machine learning have been proposed to address this problem. We introduce a dataset obtained from the Finnish Social Science Archive that comprises the metadata of 2968 research studies. The metadata of each study includes attributes such as the “abstract” and the “set of labels”. We used Bag of Words (BoW), TF-IDF term weighting and pretrained word embeddings obtained from FastText and BERT models to generate text representations of each study’s abstract field. Our selection of multi-label classification methods includes a Naive approach, Multi-label k Nearest Neighbours (ML-kNN), Multi-Label Random Forest (ML-RF), X-BERT and Parabel. Each method was combined with the text representation techniques, and the performance of each combination was evaluated on our dataset. We measured the classification accuracy of the combinations using Precision, Recall and F1 metrics. In addition, we used Normalized Discounted Cumulative Gain to measure the label ranking performance of the combinations. The results showed that the ML-RF model achieved a higher classification accuracy with the TF-IDF features and, based on the ranking score, the Parabel model outperformed the other methods.
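
The evaluation metrics named in the abstract can be illustrated for a single study with a minimal sketch. It assumes binary label relevance and per-study scoring; the abstract does not state how the authors averaged scores across studies or which cut-off they used for ranking, so the function names, the cut-off k and the example labels below are hypothetical.

```python
import math


def precision_recall_f1(true_labels, predicted_labels):
    """Precision, Recall and F1 for one study's predicted label set."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    hits = len(true_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def ndcg_at_k(true_labels, ranked_labels, k):
    """Normalized Discounted Cumulative Gain at cut-off k, binary relevance."""
    true_set = set(true_labels)
    # Discounted gain of each relevant label found in the top-k ranking.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, label in enumerate(ranked_labels[:k])
        if label in true_set
    )
    # Ideal ranking: all relevant labels placed at the top positions.
    ideal_hits = min(k, len(true_set))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical labels for one study (not taken from the dataset).
true_labels = ["education", "health", "youth"]
ranked_labels = ["health", "employment", "youth", "housing"]

print(precision_recall_f1(true_labels, ranked_labels[:3]))
print(ndcg_at_k(true_labels, ranked_labels, k=3))
```

Precision, Recall and F1 compare the predicted label set with the true one and ignore order, whereas nDCG rewards placing correct labels near the top of the ranking, which is why the abstract reports it separately as a ranking score.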

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
