Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Sascha Wolfer; Alexander Koplenig; Marc Kupietz; Carolin Müller-Spitzer

doi:10.3390/data8110170

Data (Nov 2023)

Affiliations

Sascha Wolfer: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Alexander Koplenig: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Marc Kupietz: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany
Carolin Müller-Spitzer: Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany

DOI: https://doi.org/10.3390/data8110170
Journal volume & issue: Vol. 8, no. 11, p. 170

Abstract

We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
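The case study mentioned in the abstract (vocabulary size and hapax legomena as more folds are added) can be illustrated with a minimal Python sketch. It is not the authors' script and does not read the actual DeReKoGram files: the function name `vocabulary_growth` and the three toy folds are assumptions made purely for illustration, with each fold represented as an in-memory word-frequency table.

```python
from collections import Counter


def vocabulary_growth(folds):
    """Return (folds_included, vocabulary_size, hapax_count) after adding each fold.

    `folds` is an iterable of Counter objects mapping a word form to its
    frequency in that fold (a hypothetical stand-in for the real 1-gram lists).
    """
    cumulative = Counter()
    growth = []
    for n_folds, fold in enumerate(folds, start=1):
        # Merge the frequencies of the next fold into the running totals.
        cumulative.update(fold)
        vocabulary_size = len(cumulative)
        # A hapax legomenon occurs exactly once in the data seen so far.
        hapax_count = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((n_folds, vocabulary_size, hapax_count))
    return growth


# Toy data (assumption): three tiny "folds" instead of the 16 real ones.
toy_folds = [
    Counter({"der": 5, "Hund": 1, "läuft": 1}),
    Counter({"der": 4, "Hund": 2, "Katze": 1}),
    Counter({"die": 3, "Katze": 1, "schläft": 1}),
]

for n_folds, vocab, hapax in vocabulary_growth(toy_folds):
    print(f"folds={n_folds} vocabulary={vocab} hapax={hapax}")
```

With real data, the same loop would be run once per cleaning stage to cross-combine fold count with cleaning level, as the abstract describes.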

Published in Data

ISSN: 2306-5729 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Bibliography. Library science. Information resources
Website: http://www.mdpi.com/journal/data
