Topic specificity: A descriptive metric for algorithm selection and finding the right number of topics

Emil Rijcken; Kalliopi Zervanou; Pablo Mosteiro; Floortje Scheepers; Marco Spruit; Uzay Kaymak

Natural Language Processing Journal (Sep 2024)

Affiliations

Emil Rijcken: Jheronimus Academy of Data Science, Eindhoven University of Technology, ’s Hertogenbosch, The Netherlands; Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands; Corresponding author at: Jheronimus Academy of Data Science, Eindhoven University of Technology, ’s Hertogenbosch, The Netherlands.
Kalliopi Zervanou: Jheronimus Academy of Data Science, Eindhoven University of Technology, ’s Hertogenbosch, The Netherlands
Pablo Mosteiro: Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands; Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands
Floortje Scheepers: Psychiatry, University Medical Center Utrecht, Utrecht, The Netherlands
Marco Spruit: Leiden Institute of Advanced Computer Science, Leiden, The Netherlands; Public Health & Primary Care, Leiden University Medical Center, Leiden, The Netherlands
Uzay Kaymak: Jheronimus Academy of Data Science, Eindhoven University of Technology, ’s Hertogenbosch, The Netherlands

Journal volume & issue: Vol. 8, p. 100082

Abstract

Topic modeling is a prevalent task for discovering the latent structure of a corpus: it identifies a set of topics that represent the underlying themes of the documents. Despite its popularity, issues with its evaluation metric, the coherence score, result in two common challenges: algorithm selection and determining the number of topics. To address these two issues, we propose the topic specificity metric, which captures the relative frequency of topic words in the corpus and uses it as a proxy for the specificity of a word. First, we formulate the metric. Second, we demonstrate that algorithms train topics at different specificity levels. This insight addresses algorithm selection, as it allows users to distinguish between algorithms and select one with the desired specificity level. Lastly, we show a strictly positive monotonic correlation between topic specificity and the number of topics for LDA, FLSA-W, NMF and LSI. This correlation addresses the selection of the number of topics, as it allows users to adjust the number of topics until they reach their desired specificity level. Moreover, our descriptive metric provides a new perspective for characterizing topic models, allowing them to be better understood.
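
The abstract does not give the formula for topic specificity, so the following is only a toy illustration of the underlying idea, not the authors' metric. It assumes a tokenized corpus and compares two hypothetical topics by the average relative corpus frequency of their words; reading a lower average as "more specific" is an assumption of this sketch. The function names and the example data are invented for illustration.

```python
from collections import Counter


def relative_frequencies(docs):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def mean_topic_word_frequency(topic_words, rel_freq):
    """Average relative corpus frequency of a topic's words.

    Toy proxy only: words absent from the corpus count as frequency 0.
    """
    return sum(rel_freq.get(w, 0.0) for w in topic_words) / len(topic_words)


# Hypothetical tokenized corpus (10 tokens in total).
docs = [
    ["patient", "reports", "sleep", "problems"],
    ["patient", "reports", "anxiety"],
    ["patient", "sleep", "medication"],
]
rel_freq = relative_frequencies(docs)

# Two hypothetical topics: one built from frequent words, one from rare words.
general_topic = ["patient", "reports"]
specific_topic = ["anxiety", "medication"]

general_score = mean_topic_word_frequency(general_topic, rel_freq)
specific_score = mean_topic_word_frequency(specific_topic, rel_freq)

print(f"general topic:  {general_score:.2f}")
print(f"specific topic: {specific_score:.2f}")
```

In this toy corpus the topic made of frequent words scores higher than the topic made of rare words, which is the kind of contrast a frequency-based specificity measure is meant to expose. The published metric should be taken from the paper itself.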

Published in Natural Language Processing Journal

ISSN: 2949-7191 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing
Website: https://www.sciencedirect.com/journal/natural-language-processing-journal
