Proceedings of the XXth Conference of Open Innovations Association FRUCT (May 2023)

Comparison of Unigram, HMM, CRF and Brill's Part-of-Speech Taggers Available in NLTK Library

  • Michal Kvet,
  • Miroslav Potočár

DOI
https://doi.org/10.23919/FRUCT58615.2023.10143061
Journal volume & issue
Vol. 33, no. 1
pp. 226 – 235

Abstract

Part-of-speech tagging is, for many researchers, the first task they encounter in natural language processing (NLP), and it is carried out with part-of-speech taggers. We describe in detail how the unigram, hidden Markov model (HMM), conditional random field (CRF) and Brill taggers work, and then compare these models. We use the implementations available in the Natural Language Toolkit (NLTK) library and do not address the selection of the best parameters. Our aim is to find out which tagger produces the best results with its default settings, in other words, which one works best in "take it as it is" mode. To determine this, we conduct an experiment in which we track several metrics, including prediction time, accuracy on unknown words and the number of correctly labeled sentences. The results show that the CRF tagger achieves the highest accuracy of all the taggers in the experiment. It also tags previously unseen words more accurately than any of the other taggers compared.
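
To make the metrics named in the abstract concrete, the following is a minimal sketch of how overall accuracy, accuracy on unknown words and the number of fully correct sentences might be computed. It is not the authors' evaluation code and does not use NLTK: the lookup-table unigram tagger, the function names (`train_unigram`, `tag`, `evaluate`), the `"NN"` fallback tag and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default_tag="NN"):
    """Tag known words from the lookup table; unknown words get default_tag
    (an assumed fallback, not a default taken from the paper)."""
    return [(w, model.get(w, default_tag)) for w in words]


def evaluate(model, test_sents, default_tag="NN"):
    """Return token accuracy, unknown-word accuracy and the number of
    sentences in which every token was tagged correctly."""
    total = correct = unk_total = unk_correct = sents_correct = 0
    for sent in test_sents:
        words = [w for w, _ in sent]
        predicted = tag(model, words, default_tag)
        all_ok = True
        for (word, gold), (_, pred) in zip(sent, predicted):
            hit = gold == pred
            total += 1
            correct += hit
            if word not in model:  # word never seen during training
                unk_total += 1
                unk_correct += hit
            all_ok = all_ok and hit
        sents_correct += all_ok
    return {
        "accuracy": correct / total if total else 0.0,
        "unknown_word_accuracy": unk_correct / unk_total if unk_total else 0.0,
        "correct_sentences": sents_correct,
    }


# Toy data, invented for this sketch.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
test = [
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("bird", "NN"), ("sings", "VBZ")],
]

model = train_unigram(train)
metrics = evaluate(model, test)
print(metrics)
```

In this toy run, "bird" and "sings" are unknown words, and only the first test sentence is tagged without error. Prediction time, the other metric mentioned in the abstract, could be measured by timing the calls to `tag` with `time.perf_counter`.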

Keywords