Italian Journal of Animal Science (Dec 2024)
Identifying the optimal number of topics in text mining: a case study on reindeer pastoralism literature
Abstract
Text mining and topic analysis algorithms, which group textual content efficiently, are becoming increasingly useful for summarising the main information contained in large data corpora from complex scientific fields. Using the literature on reindeer pastoralism as a case study, this methodological investigation addressed the problem of identifying the number of topics that provides the best in-depth interpretation of a large data corpus. A total of 2,875 documents on reindeer pastoralism, extracted from Scopus®, were used. Four simulations with 8, 10, 12, and 20 topics were carried out to define the optimal number of topics that best explained the issues related to reindeer husbandry. The results showed that a reasonable trade-off between the number of articles and the number of topics, based on the reduction of the variance explained within the group, leads to an optimal choice in the search for the most meaningful simulation. Adopting too many topics excessively fragments the data corpus into small aggregations of documents and encourages the emergence of topics without any technical or practical meaning, produced solely by the unsupervised iterative process.
Highlights
Text mining for insight into vast and complex scientific fields: a case study on reindeer pastoralism.
Optimising topic identification to strike a balance between the size of the article corpus and the number of topics and achieve the most insightful results.
Too many topics can lead to fragmentation and irrelevant results, while too few may oversimplify the complexity of the dataset.
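The abstract does not spell out the exact selection rule, so the following is only a minimal sketch of one way the trade-off between topic count and within-group variance could be expressed. It assumes a within-group variance value is already available for each candidate number of topics. The function name `pick_topic_count`, the threshold `min_gain_per_topic`, and the numbers in `toy_variance` are all hypothetical and are not taken from the study.

```python
def pick_topic_count(within_variance, min_gain_per_topic=0.02):
    """Pick a topic count by diminishing returns (illustrative rule only).

    within_variance: dict mapping candidate topic count -> within-group variance.
    min_gain_per_topic: smallest acceptable variance reduction per added topic,
        expressed as a fraction of the variance at the smallest candidate.
    """
    candidates = sorted(within_variance)
    baseline = within_variance[candidates[0]]
    best = candidates[0]
    for prev, cur in zip(candidates, candidates[1:]):
        # Fraction of the baseline variance removed by moving from prev to cur.
        gain = (within_variance[prev] - within_variance[cur]) / baseline
        # Normalise by the number of topics added in this step.
        gain_per_topic = gain / (cur - prev)
        if gain_per_topic < min_gain_per_topic:
            break  # further fragmentation no longer pays off
        best = cur
    return best


# Invented numbers for the four candidate counts named in the abstract.
# They are NOT results from the study and say nothing about its outcome.
toy_variance = {8: 100.0, 10: 90.0, 12: 84.0, 20: 80.0}
chosen = pick_topic_count(toy_variance, min_gain_per_topic=0.02)
print(chosen)
```

A stricter threshold stops earlier and a looser one allows more topics, which mirrors the warning that too many topics fragment the corpus while too few oversimplify it.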
Keywords