Multidocument Arabic Text Summarization Based on Clustering and Word2Vec to Reduce Redundancy

Samer Abdulateef; Naseer Ahmed Khan; Bolin Chen; Xuequn Shang

doi:10.3390/info11020059

Information (Jan 2020)

Affiliations

Samer Abdulateef: School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China
Naseer Ahmed Khan: School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China
Bolin Chen: School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China
Xuequn Shang: School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China

DOI: https://doi.org/10.3390/info11020059
Journal volume & issue: Vol. 11, no. 2, p. 59

Abstract

Arabic is one of the most semantically and syntactically complex languages in the world. Text summarization is a key challenge in text mining, and we propose an unsupervised score-based method that combines the vector space model, continuous bag of words (CBOW), clustering, and a statistically based method. The main problems in multidocument text summarization are noisy data, redundancy, diminished readability, and sentence incoherency. In this study, we adopt a preprocessing strategy to solve the noise problem and use the word2vec model for two purposes: first, to map words to fixed-length vectors and, second, to obtain the semantic relationships between vectors based on their dimensions. Similarly, we use the k-means algorithm for two purposes: (1) selecting the distinctive documents and tokenizing these documents into sentences, and (2) using another iteration of the k-means algorithm to select the key sentences based on the similarity metric, which overcomes the redundancy problem and generates the initial summary. Lastly, we use weighted principal component analysis (W-PCA) to map the sentences’ encoded weights based on a list of features. This selects the highest set of weights, which corresponds to the important sentences, to address the incoherency and readability problems. We adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the evaluation measure to examine our proposed technique and compare it with state-of-the-art methods. Finally, an experiment on the Essex Arabic Summaries Corpus (EASC) using the ROUGE-1 and ROUGE-2 metrics showed promising results in comparison with existing methods.
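
The sentence-selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional `embeddings` table is a hand-made stand-in for trained word2vec (CBOW) vectors, sentence vectors are assumed to be plain averages of word vectors, cosine similarity is assumed as the similarity metric, and the function names are hypothetical. The document-level clustering pass and the W-PCA ranking stage are omitted.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity (assumed here as the similarity metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, centroids

def select_key_sentences(sentences, embeddings, k):
    """Keep one sentence per cluster: the one most similar to its centroid."""
    kept, vectors = [], []
    for s in sentences:
        word_vecs = [embeddings[w] for w in s.split() if w in embeddings]
        if word_vecs:  # skip sentences with no known words
            kept.append(s)
            vectors.append(mean_vector(word_vecs))
    labels, centroids = kmeans(vectors, k)
    chosen = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            chosen.append(max(members, key=lambda i: cosine(vectors[i], centroids[j])))
    # Return the selected sentences in their original order.
    return [kept[i] for i in sorted(chosen)]

# Toy data (hypothetical): two topics, each stated twice with overlapping wording.
embeddings = {
    "rain": [1.0, 0.0], "storm": [0.9, 0.1], "flood": [0.8, 0.2],
    "market": [0.0, 1.0], "stocks": [0.1, 0.9], "prices": [0.2, 0.8],
}
sentences = [
    "rain storm flood",
    "storm flood",
    "market stocks prices",
    "stocks prices",
]
summary = select_key_sentences(sentences, embeddings, k=2)
print(summary)
```

Because near-duplicate sentences fall into the same cluster and only one representative per cluster is kept, the output contains one sentence per topic, which is the redundancy-reduction idea the abstract describes.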

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
