IEEE Access (Jan 2023)
Preserving Privacy in Arabic Judgments: AI-Powered Anonymization for Enhanced Legal Data Privacy
Abstract
Jurisprudence involves studying, interpreting, and applying the law in order to understand its impact on society. Judges review cases each year to verify that the law is applied correctly, and accessing files from other courts for this purpose raises privacy concerns. Although the legal field has attracted interest from the research community, work on masking personal data remains in its early stages, particularly for Arabic, a language with limited resources. To address this gap, we develop a two-component system for generating anonymous Arabic judgments. The first component, a personal data extractor model, uses Named Entity Recognition (NER) to identify key personal entities such as names, addresses, dates of birth, case numbers, and national identity codes. We train this model on a purpose-built Arabic legal corpus. The second component is a Python module that masks the personal entities extracted by the first component. Together, these components enable the generation of anonymous judgments. Our model achieves an F1-score of 96.14% when detecting entities in the created Arabic legal corpus. In addition, experiments on the ANERCorp corpus with training/testing splits of 70%-30% and 90%-10% yield F1-scores of 93.78% and 95.77%, respectively. These results indicate that the proposed system has promising potential for generating anonymous Arabic judgments. Furthermore, the Arabic legal corpus we built is a valuable resource for researchers aiming to improve domain-specific NER models for Arabic text.
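The masking step performed by the second component can be illustrated with a minimal sketch. It is not the authors' module: the function names `mask_entities` and `span_of`, the `(start, end, label)` span format, the placeholder labels, and the English sample sentence are all assumptions made for illustration. The sketch assumes the NER component has already returned non-overlapping character spans.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), character offsets, end exclusive.
Entity = Tuple[int, int, str]


def mask_entities(text: str, entities: List[Entity]) -> str:
    """Replace each entity span with a placeholder such as [NAME].

    Assumes the spans do not overlap.
    """
    masked = text
    # Replace from right to left so that earlier offsets stay valid.
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


def span_of(text: str, surface: str, label: str) -> Entity:
    """Helper for the example only: locate the first occurrence of a surface form."""
    start = text.index(surface)
    return (start, start + len(surface), label)


# Invented sample sentence and labels, standing in for the output of an NER model.
sample = "Case 2021/145: the defendant Ahmed Ali resides in Rabat."
entities = [
    span_of(sample, "2021/145", "CASE_NUMBER"),
    span_of(sample, "Ahmed Ali", "NAME"),
    span_of(sample, "Rabat", "ADDRESS"),
]

print(mask_entities(sample, entities))
# Case [CASE_NUMBER]: the defendant [NAME] resides in [ADDRESS].
```

Because the replacement works on character offsets, the same approach applies to Arabic text, provided the NER model reports offsets into the same string that is passed to the masking function.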
Keywords