Data in Brief (Oct 2024)
AHD: Arabic healthcare dataset
Abstract
With the soaring demand for healthcare systems, chatbots are gaining popularity and research attention, and language-centric research on healthcare is growing steadily. Despite significant advances in Arabic Natural Language Processing (NLP), challenges remain in natural language classification and generation, chiefly because suitable Arabic datasets for training models are lacking. To address this, the authors introduce the Arabic Healthcare Dataset (AHD), a large collection of textual data. The dataset consists of over 808k questions and answers across 90 categories and is offered to the research community for Arabic computational linguistics. The authors anticipate that this dataset will aid a variety of NLP tasks on Arabic textual data, especially text classification and generation. The data are presented in raw form. AHD consists of a main dataset scraped from the medical website Altibbi. AHD is public and freely available at http://data.mendeley.com/datasets/mgj29ndgrk/5.