Informática y Sistemas (May 2022)

Topic identification from news blog in Spanish language

  • Lizbeth Pacheco-Guevara,
  • Ruth Reátegui,
  • Priscila Valdiviezo-Díaz

DOI
https://doi.org/10.33936/isrtic.v6i1.4514
Journal volume & issue
Vol. 6, no. 1
pp. 22 – 34

Abstract

A large amount of news now exists in digital format and needs to be classified or labeled automatically according to its content. Latent Dirichlet Allocation (LDA) is an unsupervised technique that automatically creates topics from the words in documents. This work applies LDA to analyze and extract topics from digital news in Spanish. A total of 198 news items were collected from a university news blog. The data were pre-processed and represented in vector space, and values of k (the number of topics) were selected using a coherence metric. A TF-IDF matrix combined with unigrams and bigrams produced topics with a variety of terms, related to university activities such as study programs, research, innovation projects, and social responsibility. Furthermore, manual validation showed that the terms in the topics correspond to the hashtags written by the communication professionals.
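The vector-space representation mentioned in the abstract (a TF-IDF matrix over unigrams and bigrams) can be illustrated with a minimal sketch. It is not the authors' implementation: the three sample documents are invented, whitespace tokenization stands in for the unspecified pre-processing, the smoothed IDF formula is one common variant, and the helper names `ngrams` and `tfidf_matrix` are hypothetical. The LDA step and the coherence-based choice of k are not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams up to n_max (unigrams and bigrams by default)."""
    out = []
    for n in range(1, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf_matrix(docs, n_max=2):
    """Build a sparse TF-IDF matrix as a list of {term: weight} dicts."""
    # Assumption: lowercasing + whitespace split replaces real pre-processing.
    term_lists = [ngrams(doc.lower().split(), n_max) for doc in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))

    # Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    rows = []
    for terms in term_lists:
        tf = Counter(terms)
        rows.append({t: count * idf[t] for t, count in tf.items()})
    return rows, idf


# Invented sample documents, not taken from the paper's corpus.
docs = [
    "proyecto de investigación universitaria",
    "proyecto de innovación social",
    "oferta académica universitaria",
]
rows, idf = tfidf_matrix(docs)
```

Terms that occur in fewer documents receive a higher weight, so here the bigram "proyecto de" (two documents) is weighted below "investigación universitaria" (one document). A matrix of this kind is what a topic model such as LDA would take as input.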

Keywords