Iranian Journal of Information Processing & Management (Jul 2023)
Comparison of the performance of approaches in discovering and extracting e-book topics
Abstract
Keyword extraction is one of the most important issues in text processing and analysis and provides a high-level and accurate summary of the text. Therefore, choosing the right method to extract keywords from the text is important. The aim of the present study was to compare the performance of three approaches in discovering and extracting the subject keywords of e-books using text mining and machine learning techniques. In this regard, three experimental approaches have been introduced and compared including the successive implementation of the clustering process, improving the quality of clusters in terms of semantics and enriching the stop words of a specific field, use of specialized keyword template, finally, the use of important parts of the text in discovering and extracting key words and important topics of the text. The statistical population includes 1000 e-book titles from the subject fields of library and information science based on the congress classification system. Bibliographic information of e-books was obtained from the Congress Library database, then the original text was prepared. The extraction of topic keywords and clustering of training data was performed using the non-negative matrix factorization algorithm with three experimental approaches. The quality and performance of the subject clusters resulting from the implementation of three approaches in the automatic classification of experimental data were compared using a support vector machine. The findings showed that the Hamming loss (0.020) and in other words the error rate in the correct classification of experimental texts in the third approach is far less than the other two approaches. Also, the F1 score (0.82), which is the average of the two criteria of precision (0.87) and recall (0.78) and is a reflection of the correct performance of the classification process in topic labeling of texts, is better in the third approach than the other two approaches. The results showed that the quality and semantic coherence of the subject clusters obtained from the third approach, i.e. the use of important parts of the text in discovering and extracting the subject, was better compared to other two approaches. In this approach, by focusing on the main parts of the data, which represent the main content and theme of the text, more meaningful topic clusters were obtained. In addition, the keywords obtained from the topic cluster of the third approach can be used in unspecified and unknown collections in order to extract the unknown thematic content of the whole collection. The results of third approach also was better in terms of accuracy and readability (0.79) and the rate of classification error (0.020) of texts, in comparison of other two approaches.
Keywords