LDAPrototype: a model selection algorithm to improve reliability of latent Dirichlet allocation

Jonas Rieger; Carsten Jentsch; Jörg Rahnenführer

doi:10.7717/peerj-cs.2279

PeerJ Computer Science (Sep 2024)

Affiliations

Jonas Rieger: Department of Statistics, TU Dortmund University, Dortmund, Germany
Carsten Jentsch: Department of Statistics, TU Dortmund University, Dortmund, Germany
Jörg Rahnenführer: Department of Statistics, TU Dortmund University, Dortmund, Germany

DOI: https://doi.org/10.7717/peerj-cs.2279
Journal volume & issue: Vol. 10, p. e2279

Abstract

Latent Dirichlet allocation (LDA) is a popular method for analyzing large text corpora, but it suffers from instability due to its reliance on random initialization. This results in different outcomes for replicated runs, hindering reproducibility. To address this, we introduce LDAPrototype, a new approach for selecting the most representative LDA run from multiple replications on the same dataset. LDAPrototype enhances the reliability of LDA conclusions by ensuring greater similarity between replications compared to traditional LDA runs or models chosen based on perplexity or NPMI. A key feature of LDAPrototype is its use of a novel model similarity measure called S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning). It is based on topic similarities, for which we compare the usage of measures like the thresholded Jaccard coefficient, cosine similarity, Jensen-Shannon divergence, and rank-biased overlap. The effectiveness of LDAPrototype is demonstrated through its application to six real datasets, including newspaper articles and tweets. The results show improved reproducibility and reliability in topic modeling outcomes. LDAPrototype’s approach is noteworthy for its practical applicability, comprehensibility, ease of implementation, and computational efficiency. Furthermore, the algorithm’s concept can be generalized to other topic modeling procedures that characterize topics through word distributions, making it a versatile tool in text data analysis.
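
The selection idea described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: each LDA run is reduced to a list of topics, each topic is reduced to a set of its top words (a simplification of the thresholded Jaccard coefficient), "most representative" is read as "highest mean similarity to all other runs", and the model similarity is a simple best-match average used as a stand-in, not S-CLOP itself. The names `jaccard`, `run_similarity`, and `select_prototype` are hypothetical and do not come from any published package.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two word sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def run_similarity(run_a, run_b):
    """Stand-in model similarity (NOT S-CLOP).

    For every topic in one run, take the Jaccard value of its best
    match in the other run, average over topics, and symmetrize.
    """
    def directed(src, dst):
        return sum(max(jaccard(t, u) for u in dst) for t in src) / len(src)

    return 0.5 * (directed(run_a, run_b) + directed(run_b, run_a))


def select_prototype(runs):
    """Return (index, mean similarities) of the most central run.

    Assumption: the prototype is the run whose mean pairwise
    similarity to all other runs is largest.
    """
    n = len(runs)
    if n < 2:
        raise ValueError("need at least two runs to compare")
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = run_similarity(runs[i], runs[j])
        totals[i] += s
        totals[j] += s
    means = [t / (n - 1) for t in totals]
    best = max(range(n), key=means.__getitem__)
    return best, means


# Toy data: three replicated "runs", two topics each, top-3 words per topic.
runs = [
    [{"tax", "budget", "vote"}, {"goal", "match", "team"}],
    [{"tax", "budget", "law"}, {"goal", "match", "coach"}],
    [{"tax", "vote", "goal"}, {"team", "coach", "law"}],
]

best, means = select_prototype(runs)
print(best, [round(m, 3) for m in means])
```

On this toy input the first run (index 0) is selected, because its topics overlap most with those of the other two runs. In practice the word sets would come from fitted LDA models, and the stand-in similarity would be replaced by the measure the paper defines.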

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
