Bag of Words and Embedding Text Representation Methods for Medical Article Classification

Cichosz Paweł

doi:10.34768/amcs-2023-0043

International Journal of Applied Mathematics and Computer Science (Dec 2023)

Affiliations

Cichosz Paweł: Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

DOI: https://doi.org/10.34768/amcs-2023-0043
Journal volume & issue: Vol. 33, no. 4
pp. 603 – 621

Abstract

Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly and with low computational demands from data that are not necessarily large and are usually imbalanced, are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefits of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.
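As a rough illustration of the bag of words plus naive Bayes combination mentioned in the abstract, the following minimal sketch uses only the Python standard library. It is not the author's code or experimental setup: the function names, the tokenizer, the smoothing constant, and the four toy "abstracts" with their labels are all hypothetical and chosen only to keep the example small.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """Map each distinct training token to a column index."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(text, vocab):
    """Return a term-count vector; out-of-vocabulary tokens are dropped."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with additive smoothing alpha."""
    n_terms = len(vectors[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        term_totals = [sum(col) for col in zip(*rows)]
        denom = sum(term_totals) + alpha * n_terms
        model[c] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_likelihood": [math.log((t + alpha) / denom) for t in term_totals],
        }
    return model


def predict(model, vector):
    """Return the class with the highest log-posterior score."""
    def score(c):
        params = model[c]
        return params["log_prior"] + sum(
            x * ll for x, ll in zip(vector, params["log_likelihood"])
        )
    return max(model, key=score)


# Hypothetical toy data standing in for labeled article abstracts.
train_texts = [
    "randomized trial of statin therapy in adults",
    "statin therapy reduces cholesterol in a randomized trial",
    "survey of soil bacteria in farmland",
    "soil erosion and farmland management survey",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

vocab = build_vocabulary(train_texts)
X = [bag_of_words(t, vocab) for t in train_texts]
model = train_naive_bayes(X, train_labels)

# "dosage" never occurs in the training texts, so it is silently ignored.
query = bag_of_words("a randomized trial of statin dosage", vocab)
prediction = predict(model, query)
```

In this sketch the vector length equals the vocabulary size and unseen words contribute nothing, which mirrors the two bag of words limitations the abstract contrasts with fastText embeddings: high dimensionality and no handling of out-of-vocabulary words.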

Published in International Journal of Applied Mathematics and Computer Science

ISSN: 2083-8492 (Online)
Publisher: Sciendo
Country of publisher: Poland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.amcs.uz.zgora.pl/
