Data in Brief (Dec 2024)
Arabic paraphrased parallel synthetic dataset
Abstract
The Arabic paraphrased parallel dataset supports NLP and other language-related applications by drawing data from diverse sources and expanding it through data augmentation. It can benefit machine translation, text summarization, and sentiment analysis by improving how systems understand and manipulate Arabic text. It also serves as a resource for improving educational materials, optimizing search engines, and supporting content creation across various fields. Its role in semantic analysis, which aids in understanding context and meaning, makes it valuable for domain-specific applications. The main aim of building this dataset is to generate paraphrased sentences through synthetic augmentation using back translation, addressing the shortage of research and datasets on paraphrase generation in Arabic. The process involves collecting sentences from various sources, then preprocessing and evaluating them to ensure reliability and usefulness. This systematic approach aims to produce a robust Arabic paraphrased dataset that can be used in various NLP tasks and foster further innovation in Arabic language processing.
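As an illustration of the back translation idea described in the abstract, the following is a minimal sketch and not the authors' pipeline. It assumes a round trip through a single pivot language. The names `normalize`, `back_translate`, `to_pivot`, and `from_pivot` are hypothetical. The preprocessing shown (removing diacritics and tatweel, collapsing whitespace) and the filtering of unchanged or duplicate sentences are assumptions made for the example. The translators are passed in as plain functions, so any machine translation system could be substituted.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Arabic tatweel (U+0640) and short-vowel diacritics (U+064B..U+0652).
_MARKS = re.compile("[\u0640\u064B-\u0652]")


def normalize(sentence: str) -> str:
    """Assumed preprocessing: drop diacritics and tatweel, collapse whitespace."""
    return " ".join(_MARKS.sub("", sentence).split())


def back_translate(
    sentences: Iterable[str],
    to_pivot: Callable[[str], str],    # hypothetical Arabic -> pivot translator
    from_pivot: Callable[[str], str],  # hypothetical pivot -> Arabic translator
) -> List[Tuple[str, str]]:
    """Return (original, paraphrase) pairs produced by a pivot round trip."""
    pairs: List[Tuple[str, str]] = []
    seen = set()
    for raw in sentences:
        source = normalize(raw)
        # Skip empty lines and sentences already processed.
        if not source or source in seen:
            continue
        seen.add(source)
        paraphrase = normalize(from_pivot(to_pivot(source)))
        # Keep only round trips that actually changed the wording.
        if paraphrase and paraphrase != source:
            pairs.append((source, paraphrase))
    return pairs


if __name__ == "__main__":
    # Toy lookup tables stand in for real translation models.
    ar_en = {"الطقس جميل اليوم": "The weather is nice today", "مرحبا": "Hello"}
    en_ar = {"The weather is nice today": "الجو لطيف اليوم", "Hello": "مرحبا"}
    for original, paraphrase in back_translate(
        ["الطقس جميل اليوم", "مرحبا"], ar_en.__getitem__, en_ar.__getitem__
    ):
        print(original, "\t", paraphrase)
```

In this toy run only the first sentence yields a pair, because the second comes back from the round trip unchanged and is filtered out. A real pipeline would also need an evaluation step to check that each paraphrase preserves the meaning of its source, which this sketch does not attempt.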