Journal of Intelligent Systems (Mar 2022)

Extractive summarization of Malayalam documents using latent Dirichlet allocation: An experience

  • Kondath Manju,
  • Suseelan David Peter,
  • Idicula Sumam Mary

DOI
https://doi.org/10.1515/jisys-2022-0027
Journal volume & issue
Vol. 31, no. 1
pp. 393 – 406

Abstract

Automatic text summarization (ATS) extracts information from a source text and presents it to the user in a condensed form while preserving its primary content. Many text summarization approaches have been investigated in the literature for highly resourced languages. For under-resourced languages like Malayalam, however, ATS remains a complicated and challenging task, because both a standard corpus and adequate language processing tools are lacking. In the absence of a standard corpus, we have developed a dataset consisting of Malayalam news articles. This article proposes an extractive, topic modeling-based, multi-document text summarization approach for Malayalam news documents. We first cluster the contents based on latent topics identified using the latent Dirichlet allocation (LDA) topic modeling technique. Then, adopting the vector space model, we generate the topic vector and the sentence vectors of the given document. Sentences are ranked according to the relevant status value computed between the document's topic vector and each sentence vector. The resulting summary is then optimized for non-redundancy. Evaluation results on Malayalam news articles show that the summaries generated by the proposed method are closer to human-generated summaries than those produced by existing text summarization methods.
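
The ranking and redundancy steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the topic vector is taken as already produced by an LDA model (the weights below are invented for illustration), the relevant status value is approximated by cosine similarity, redundancy is handled by a simple greedy similarity threshold, and English sentences stand in for Malayalam news text. The names `cosine`, `sentence_vector`, `summarize`, `max_sentences`, and `redundancy_threshold` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_vector(sentence):
    """Term-frequency vector of a whitespace-tokenized sentence."""
    return dict(Counter(sentence.lower().split()))


def summarize(sentences, topic_vector, max_sentences=2, redundancy_threshold=0.5):
    """Rank sentences against the topic vector, then drop near-duplicates."""
    vectors = [sentence_vector(s) for s in sentences]
    # Assumption: cosine similarity stands in for the relevant status value.
    scores = [cosine(topic_vector, v) for v in vectors]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    chosen = []
    for i in ranked:
        if len(chosen) == max_sentences:
            break
        if scores[i] == 0.0:
            continue  # sentence shares no term with the topic
        # Greedy redundancy filter: skip sentences too similar to ones already kept.
        if all(cosine(vectors[i], vectors[j]) < redundancy_threshold for j in chosen):
            chosen.append(i)
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


# Hypothetical topic-word weights, as if taken from one LDA topic.
topic_vector = {"flood": 0.5, "kerala": 0.3, "rescue": 0.2}
sentences = [
    "Flood waters rose across Kerala on Monday",
    "Flood waters rose across Kerala on Monday night",
    "Rescue teams reached the flood hit villages",
    "The cricket match was postponed",
]
summary = summarize(sentences, topic_vector)
print(summary)
```

In this toy input the second sentence nearly duplicates the first, so the greedy filter skips it and keeps the third sentence instead, while the off-topic fourth sentence scores zero and is never selected.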

Keywords