Iranian Journal of Information Processing & Management (Mar 2023)

Development of a Persian Academic Word List Based on an Academic Corpus

  • Morteza Rezaei Sharifabadi,
  • Amirsaeid Moloodi,
  • Alireza Ahmadi,
  • Alireza Khormaei

DOI
https://doi.org/10.22034/jipm.2023.698611
Journal volume & issue
Vol. 38, no. 3
pp. 901 – 926

Abstract

Academic words occur with high frequency in texts from a wide range of scientific fields, and they are much more frequent in academic texts than in general texts. Academic word lists can facilitate the learning and teaching of scientific language. In this research, we developed a frequency list of Persian academic words. The list contains 307 lemmas that occur with high frequency in academic texts. Building a balanced corpus of Persian academic texts was a prerequisite for developing such a list. For this purpose, we collected texts published in Persian scientific journals and built a balanced corpus of more than 51 million words. The corpus consists of academic papers in four general categories: basic sciences and engineering; humanities, arts, and architecture; medicine and veterinary medicine; and agriculture and natural resources. A lemma had to meet four criteria to be included in the word list. (1) Frequency: the lemma should have a relative frequency of at least 30 per million words. (2) Ratio: the relative frequency of the lemma in the academic corpus should be at least twice its relative frequency in a 10-million-word general corpus. (3) Dispersion: Juilland's D for the lemma across the four sections should be at least 0.5. (4) Range: the observed frequency of the lemma should not be less than a third of its expected frequency in any of the four sections of the corpus. We evaluated the word list by measuring its coverage of the training and test subsets of our corpus. The word list covers 16.69 percent of the training subset and 16.13 percent of the test subset.
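The four inclusion criteria can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `juilland_d` and `is_academic_lemma` are hypothetical, and the abstract does not say which variant of Juilland's D was used. The sketch assumes the common form D = 1 - V / sqrt(n - 1), where V is the coefficient of variation (population standard deviation divided by the mean) of the lemma's per-section relative frequencies and n is the number of sections. It also assumes that the expected frequency in a section is the lemma's total frequency scaled by that section's share of the corpus.

```python
from math import sqrt
from statistics import mean, pstdev


def juilland_d(rel_freqs):
    """Juilland's D over per-section relative frequencies (assumed variant)."""
    n = len(rel_freqs)
    mu = mean(rel_freqs)
    if mu == 0:
        return 0.0
    v = pstdev(rel_freqs) / mu  # coefficient of variation
    return 1 - v / sqrt(n - 1)


def is_academic_lemma(section_counts, section_sizes, general_count, general_size,
                      min_per_million=30.0, min_ratio=2.0,
                      min_dispersion=0.5, min_range=1 / 3):
    """Apply the four criteria from the abstract to one lemma.

    section_counts / section_sizes: lemma count and token count per section.
    general_count / general_size: the same for the general reference corpus.
    """
    total_count = sum(section_counts)
    total_size = sum(section_sizes)
    academic_pmw = total_count / total_size * 1e6
    general_pmw = general_count / general_size * 1e6

    # 1. Frequency: at least 30 occurrences per million words.
    if academic_pmw < min_per_million:
        return False
    # 2. Ratio: at least twice as frequent as in the general corpus.
    if academic_pmw < min_ratio * general_pmw:
        return False
    # 3. Dispersion: Juilland's D across the sections of at least 0.5.
    rel_freqs = [c / s * 1e6 for c, s in zip(section_counts, section_sizes)]
    if juilland_d(rel_freqs) < min_dispersion:
        return False
    # 4. Range: observed count at least a third of expected in every section.
    for count, size in zip(section_counts, section_sizes):
        expected = total_count * size / total_size
        if count < min_range * expected:
            return False
    return True
```

With four equally sized sections, a lemma spread evenly across them gets D = 1, while a lemma confined to a single section gets D close to 0 and is rejected by the dispersion criterion regardless of its overall frequency.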

Keywords