کتابداری و اطلاع‌رسانی [Library and Information Science] (Jul 2018)
Extracting information from language corpus: introducing the corpus of scientific articles of Ferdowsi University of Mashhad
Abstract
Purpose: Some of the most important applications of language corpora are natural language processing, dictionary compilation, tracking language change, and extracting information from texts. This article describes and introduces a purpose-built corpus of scientific articles.
Methodology: First, corpus software was designed and developed. The software supports several text formats, including doc, docx, rtf, txt, and pdf. Corpus parameters can also be set in advance, for example the minimum number of tokens a file must contain for its text to be included in the corpus. Next, the scientific articles of faculty members of Ferdowsi University of Mashhad were collected. The corpus contains 7,154,202 words in 1,100 articles. All articles were then split into their component sentences and stored in separate files, word roots were extracted, and parts of speech were annotated. In addition to direct extraction of information, simple, easy-to-use software was developed so that non-expert users can extract statistical information.
Findings: An existing standard corpus, PerDT, which includes a significant number of sentences annotated with syntactic and lexical information, was used to evaluate the correctness of the word-rooting and part-of-speech labeling tools. In addition, in a case study of precautionary statements (part of a research project that has not yet been published), the outcome of the present research, i.e. the corpus of scientific research papers, was tested and confirmed with 96 percent accuracy.
Conclusion: Based on the results, the developed corpus has a high capacity for data extraction in various kinds of research. Using this corpus, a data-driven description of language use by different language groups becomes possible. In the near future, the corpus will be made available on the website of the Central Library of Ferdowsi University of Mashhad for use by all researchers.
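The methodology above (a minimum-token threshold for admitting a text, splitting articles into sentences, and checking annotation tools against a gold-standard corpus) can be illustrated with a minimal Python sketch. It is not the software described in the article: the function names, the threshold value, the regex-based tokenizer and sentence splitter, and the tag labels are all assumptions made for illustration, and real Persian text would need a dedicated tokenizer and stemmer.

```python
import re

# Hypothetical default: minimum number of tokens a text needs
# before it is admitted to the corpus (the article does not give a value).
MIN_TOKENS = 5

# Naive sentence boundary: whitespace that follows ., !, ? or the
# Arabic/Persian question mark. This is an assumption, not the article's rule.
SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def tokenize(text):
    """Split text into word tokens (Unicode-aware, so Persian words match too)."""
    return re.findall(r"\w+", text)


def split_sentences(text):
    """Break a text into its component sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def build_corpus(documents, min_tokens=MIN_TOKENS):
    """Keep only texts with at least `min_tokens` tokens, stored as sentence lists."""
    corpus = {}
    for name, text in documents.items():
        if len(tokenize(text)) >= min_tokens:
            corpus[name] = split_sentences(text)
    return corpus


def tagging_accuracy(predicted, gold):
    """Fraction of predicted labels that match the gold-standard labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    docs = {
        "a.txt": "This is one sentence. This is another one!",
        "b.txt": "Too short.",
    }
    print(build_corpus(docs))
    print(tagging_accuracy(["N", "V", "N", "ADJ"], ["N", "V", "N", "N"]))
```

In this sketch the short text is dropped by the token threshold, and the accuracy function compares tool output with reference labels token by token, which is one simple way an evaluation against a resource such as PerDT could be scored.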
Keywords