Półrocznik Językoznawczy Tertium (Jan 2019)

Self-compiled Corpora in Linguistic Research (On the Example of an Internet Corpus)

  • Marcin Zabawa

DOI
https://doi.org/10.7592/10.7592/Tertium2019.4.1.Zabawa
Journal volume & issue
Vol. 4, no. 1
pp. 211 – 232

Abstract

Read online

The aim of the present paper, which is of a theoretical character, is to discuss the problems related to the process of the compilation of one’s own linguistic corpus. A linguist who wants to study e.g. neologisms must base his or her analysis on a certain source. Formerly, the language of the press was frequently used as such source; now, however, linguistic corpora and the Internet are utilized more frequently. The author of the paper points out that both the National Corpus of Polish (NKJP) and the Internet as a whole are not the best choices (and are definitely not sufficient) when a linguist intends to study e.g. the newest vocabulary items in Polish. The use of the spoken language as the main source is even more problematic. The best solution, albeit the most difficult and time-consuming at the same time, is the compilation of one’s own linguistic corpus. The paper discusses the inadequacy of regarding the press or the Internet as a whole as the best sources and then proceeds to discuss various theoretical aspects connected with the compilation of one’s own corpus (such as the choice of the type of texts, corpus size, the use of computer tools intended to aid in corpus compilation, etc.).

Keywords