Extractive Arabic Text Summarization Using Modified PageRank Algorithm

Reda Elbarougy; Gamal Behery; Akram El Khatib

Egyptian Informatics Journal (Jul 2020)

Affiliations

Reda Elbarougy: Department of Computer Science, Faculty of Computer and Information Sciences, Damietta University, New Damietta, Egypt
Gamal Behery: Department of Computer Science, Faculty of Computer and Information Sciences, Damietta University, New Damietta, Egypt
Akram El Khatib: Department of Mathematics-Computer Science, Faculty of Science, Damietta University, New Damietta, Egypt; Corresponding author.

Journal volume & issue: Vol. 21, no. 2
pp. 73 – 81

Abstract

This paper proposes an approach for Arabic text summarization. Text summarization is a natural language processing application that reduces the amount of original text and retrieves only its important information. The Arabic language has a complex morphological structure, which makes it very difficult to extract nouns for use as a feature in the summarization process. Therefore, the Al-Khalil morphological analyzer is used to solve the problem of noun extraction. The proposed approach is a graph-based system that represents the document as a graph whose vertices are the sentences. A Modified PageRank algorithm is applied in which the initial score of each node is the number of nouns in the corresponding sentence. More nouns in a sentence mean more information, so the noun count is used as the initial rank of the sentence. Edges between sentences are weighted by the cosine similarity between the sentences, so that the final summary contains sentences that carry more information and are well connected with each other. The process of text summarization consists of three major stages: pre-processing; feature extraction and graph construction; and finally applying the Modified PageRank algorithm and extracting the summary. The Modified PageRank algorithm is run with different numbers of iterations to find the number that returns the best summary results. The extracted summary depends on the compression ratio, and redundancy is removed based on the overlap between sentences. The EASC corpus is used as a standard to evaluate the performance of this approach. The LexRank and TextRank algorithms were run under the same conditions, and the proposed approach provides better results than these other Arabic text summarization techniques. The proposed approach performs efficiently with 10,000 iterations.
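The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the update rule, so a TextRank-style weighted update with a damping factor of 0.85 is assumed, and the noun counts are simply used as the starting score vector. The function names, the 0.8 overlap threshold, and the use of raw term counts for cosine similarity are all hypothetical. Sentences are assumed to be already tokenized and pre-processed, with noun counts supplied by a morphological analyzer.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two token lists, using raw term counts (assumption)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def modified_pagerank(sentences, noun_counts, damping=0.85, iterations=100):
    """Rank sentences on a similarity graph, starting from noun counts.

    The weighted update rule and damping value are assumed, not taken
    from the paper; only the noun-count initialization and the
    cosine-similarity edges come from the abstract.
    """
    n = len(sentences)
    # Edge weight between two different sentences is their cosine similarity.
    w = [[cosine_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight per sentence
    scores = [float(c) for c in noun_counts]  # initial rank = noun count
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, noun_counts, ratio=0.4, overlap_threshold=0.8,
              iterations=100):
    """Return indices of selected sentences, in document order.

    `ratio` plays the role of the compression ratio; `overlap_threshold`
    is a hypothetical cutoff for treating two sentences as redundant.
    """
    scores = modified_pagerank(sentences, noun_counts, iterations=iterations)
    budget = max(1, int(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) >= budget:
            break
        # Skip a sentence that overlaps too much with one already selected.
        if all(cosine_similarity(sentences[i], sentences[j]) < overlap_threshold
               for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

Calling `summarize(tokenized_sentences, noun_counts, ratio=0.4)` returns the positions of the sentences to keep. The `iterations` argument corresponds to the number of iterations varied in the paper's experiments; the default of 100 here is arbitrary.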

Published in Egyptian Informatics Journal

ISSN: 1110-8665 (Print)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.sciencedirect.com/journal/egyptian-informatics-journal