Informática y Sistemas (May 2022)
Topic identification from news blog in Spanish language
Abstract
A large volume of news now exists in digital format and needs to be classified or labeled automatically according to its content. Latent Dirichlet Allocation (LDA) is an unsupervised technique that automatically creates topics based on the words in documents. The present work applies LDA to analyze and extract topics from digital news written in Spanish. A total of 198 digital news articles were collected from a university news blog. The data were pre-processed and represented in vector space, and values of k (the number of topics) were selected based on a coherence metric. A TF-IDF matrix combined with unigrams and bigrams produced topics with a variety of terms, related to university activities such as study programs, research, and projects for innovation and social responsibility. Furthermore, in a manual validation process, the terms in the topics corresponded with the hashtags written by communication professionals.
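As an illustration of the representation step only, the following minimal sketch builds a TF-IDF matrix over unigrams and bigrams. It is not the authors' implementation: the whitespace tokenization, the weighting tf × log(N/df), the helper names `ngrams` and `tfidf`, and the three toy documents are all assumptions made for this example. Fitting LDA and selecting k by coherence are not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max, joined by spaces (hypothetical helper)."""
    out = []
    for n in range(1, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf(docs, n_max=2):
    """Build a sparse TF-IDF matrix (one dict per document) over unigrams and bigrams.

    Assumed weighting: relative term frequency times log(N / document frequency).
    """
    # Assumption: lowercase + whitespace split stands in for real pre-processing.
    term_lists = [ngrams(doc.lower().split(), n_max) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for terms in term_lists for term in set(terms))
    matrix = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        matrix.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return matrix


# Toy documents invented for this example; not taken from the 198-article corpus.
docs = [
    "proyecto de ciencia universitaria",
    "proyecto de responsabilidad social",
    "programa de estudio universitario",
]
matrix = tfidf(docs)
```

Under this weighting, a term that occurs in every document (here `de`) receives weight zero, while bigrams such as `proyecto de` are kept as separate features alongside the unigrams.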
Keywords