Arabic paraphrased parallel synthetic dataset

Noora Al-shameri; Hend Al-Khalifa

Data in Brief (Dec 2024)

Affiliations

Noora Al-shameri (corresponding author): Information Technology Department, King Saud University, Riyadh, Saudi Arabia
Hend Al-Khalifa: Information Technology Department, King Saud University, Riyadh, Saudi Arabia

Journal volume & issue: Vol. 57, p. 111004

Abstract

The Arabic paraphrased parallel dataset supports NLP and other language-related applications by drawing on data from diverse sources and expanding it through data augmentation. Such a dataset can benefit machine translation, text summarization, and sentiment analysis, and it supports better understanding and manipulation of Arabic text. It can also serve as a tool for improving educational materials, optimizing search engines, and supporting content creation across various fields. Its use in semantic analysis aids in understanding context and meaning, which makes it valuable for domain-specific applications. The main aim of building this dataset is to generate paraphrased sentences through synthetic augmentation using back-translation, addressing the shortage of research and datasets focused on paraphrase generation in Arabic. The process involves collecting sentences from various sources, then preprocessing them and evaluating the result to ensure reliability and usefulness. This systematic approach aims to produce a robust Arabic paraphrased dataset that can be used in various NLP tasks and can support further work in Arabic language processing.

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
