Self-compiled Corpora in Linguistic Research (On the Example of an Internet Corpus)

Marcin Zabawa

doi:10.7592/10.7592/Tertium2019.4.1.Zabawa

Półrocznik Językoznawczy Tertium (Jan 2019)

Affiliations

Marcin Zabawa: ORCiD; Uniwersytet Śląski w Katowicach

DOI: https://doi.org/10.7592/10.7592/Tertium2019.4.1.Zabawa
Journal volume & issue: Vol. 4, no. 1
pp. 211 – 232

Abstract

This theoretical paper discusses the problems involved in compiling one’s own linguistic corpus. A linguist who wants to study, e.g., neologisms must base his or her analysis on a certain source. Formerly, the language of the press was frequently used as such a source; now, however, linguistic corpora and the Internet are used more often. The author points out that neither the National Corpus of Polish (NKJP) nor the Internet as a whole is the best choice (and neither is sufficient on its own) when a linguist intends to study, e.g., the newest vocabulary items in Polish. Using spoken language as the main source is even more problematic. The best solution, albeit the most difficult and time-consuming one, is to compile one’s own linguistic corpus. The paper first explains why the press and the Internet as a whole are inadequate as primary sources and then discusses various theoretical aspects of compiling one’s own corpus (such as the choice of text types, corpus size, and the use of computer tools intended to aid in corpus compilation).

Published in Półrocznik Językoznawczy Tertium

ISSN: 2543-7844 (Online)
Publisher: Cracow Tertium Society for the Promotion of Language Studies
Country of publisher: Poland
LCC subjects: Language and Literature: Philology. Linguistics
Website: https://journal.tertium.edu.pl/index.php/JaK
