Data in Brief (Feb 2023)
PRo-Pat: Probabilistic Root–Pattern Bi-gram data language model for Arabic based morphological analysis and distribution
Abstract
Based on 29,192,662 HTML files obtained from ClueWeb, a bi-gram data language model for Arabic was constructed. The dataset covers the standard types of bi-gram analysis, but with a focus on the root¹-pattern² paradigm in Arabic. Root-pattern distributions in the form of P(root|pattern), P(pattern|root) and P(pattern|pattern) are additionally estimated. Applying Maximum Likelihood Estimation (MLE) at the root-pattern level, as a higher level of abstraction, has been widely neglected in the Arabic research community despite its advantage in reducing ambiguities in Arabic morphological analysis and its impact on the cognitive aspects of Arabic word perception [1]. In the preprocessing phase, the HTML files were converted to 974 unfiltered raw text files with a total size of about 180 GB. These files were morphologically analyzed to extract and count the frequencies of patterns, roots, particles, and stems, and in particular of root-pattern occurrences. Based on the resulting corpus of around 18,482,719 raw words, a language data model was constructed containing 9,311,246 bi-grams of morphologically analyzed word forms, including around 3.49 million bi-directional P(root|pattern) bi-grams and around 1.153 million P(pattern|pattern) bi-grams in the form of conditional probabilities, covering a subset of around 8,086 roots with 20,413 possible pattern forms.
Because this data model considers the root–pattern phenomenon in Arabic, the created data are useful for researchers working on cognitive aspects of Arabic such as visual word cognition, morpho-phonetic perception, morphological analysis, spell-checking, and resolving ambiguities in morphological parsing.

¹ An Arabic root depicts the basic morpheme of an Arabic word at a higher level of abstraction and represents the basic word meaning. A root morpheme consists predominantly of three consonants (radicals), identifies the highest semantic abstraction, and is unchangeable. We use the variables C1, C2 and C3 to represent these radicals.
² An Arabic pattern is a templatic shape that orders consonants (root radicals) and vowels; it depicts the morpho-phonetic form of a word and carries potential phonetic, syntactic, and semantic data. For example: the pattern maC1C2uC3 [3].
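To make the estimated quantities concrete, the following is a minimal sketch of how MLE conditional probabilities such as P(root|pattern), P(pattern|root) and P(pattern|pattern) can be computed from counts. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy roots, and every pattern other than maC1C2uC3 are hypothetical, and the input is assumed to be an already analyzed sequence of (root, pattern) pairs. Smoothing and sentence-boundary handling are left out because the abstract does not describe them.

```python
from collections import Counter, defaultdict


def mle_conditionals(pairs):
    """Estimate P(root|pattern) and P(pattern|root) by relative frequency.

    `pairs` is an iterable of (root, pattern) tuples, one per analyzed word.
    Returns two nested dicts: p_root_given_pattern[pattern][root] and
    p_pattern_given_root[root][pattern].
    """
    joint = Counter(pairs)
    root_totals = Counter()
    pattern_totals = Counter()
    for (root, pattern), n in joint.items():
        root_totals[root] += n
        pattern_totals[pattern] += n

    p_root_given_pattern = defaultdict(dict)
    p_pattern_given_root = defaultdict(dict)
    for (root, pattern), n in joint.items():
        # MLE: joint count divided by the count of the conditioning event.
        p_root_given_pattern[pattern][root] = n / pattern_totals[pattern]
        p_pattern_given_root[root][pattern] = n / root_totals[root]
    return dict(p_root_given_pattern), dict(p_pattern_given_root)


def mle_pattern_bigrams(patterns):
    """Estimate P(current pattern | previous pattern) over a pattern sequence."""
    patterns = list(patterns)
    bigrams = Counter(zip(patterns, patterns[1:]))
    history = Counter(patterns[:-1])  # every position that has a successor
    return {(prev, cur): n / history[prev] for (prev, cur), n in bigrams.items()}


# Toy input, invented for illustration only (not taken from the dataset).
analyses = [
    ("ktb", "maC1C2uC3"),
    ("ktb", "C1aC2aC3a"),
    ("drs", "maC1C2uC3"),
    ("ktb", "maC1C2uC3"),
]
p_root_given_pattern, p_pattern_given_root = mle_conditionals(analyses)
p_pattern_bigram = mle_pattern_bigrams(pattern for _, pattern in analyses)
```

In this toy input, the root "ktb" accounts for two of the three occurrences of maC1C2uC3, so its estimated P(root|pattern) is 2/3. Storing both conditioning directions mirrors the bi-directional root-pattern distributions mentioned in the abstract.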