Journal of King Saud University: Computer and Information Sciences (May 2022)

Characteristics of Malay translated hadith corpus

  • Siti Syakirah Sazali,
  • Nurazzah Abdul Rahman,
  • Zainab Abu Bakar

Journal volume & issue
Vol. 34, no. 5
pp. 2151 – 2160

Abstract

Annotated corpora can greatly assist the natural language processing field. For example, computers can understand more of a document's context, and indexing and clustering in information retrieval can be done precisely, with little or no word ambiguity. However, only a few annotated corpora exist for the Malay language, and they are not publicly shared. In this paper, we analyse and annotate Malay translated hadith documents in terms of part-of-speech tagging and named entities. The work has three phases: manual filtering and cleaning, analysing the corpus, and creating the benchmark. As a result, an analysis and a benchmark of the Malay translated hadith corpus were produced in terms of part-of-speech and named-entity tags that follow the Zipf’s law distribution.

Keywords