RIDE (Sep 2017)

«La Repubblica» Corpus

  • Rebecca Sierig

DOI
https://doi.org/10.18716/ride.a.6.9
Journal volume & issue
Vol. 6

Abstract

This paper reviews a large resource of contemporary Italian newspaper language, the «La Repubblica» corpus. The corpus contains articles that appeared in the Italian daily newspaper La Repubblica from 1985 to 2000 and comprises more than 380 million tokens. In addition to being tokenized, it is PoS-tagged, enriched with TEI-conformant structural mark-up, and categorized by topic and genre. The first part of this paper addresses the data and their preparation; the second part deals with access to the corpus. When the review was written, there were two ways of accessing the corpus: the ‘old’ interface hosted directly by the Institute of Translational Studies at the University of Bologna (SSLMIT), and the ‘new’ one hosted on a NoSketch Engine installation. The two are compared in order to point out the changes.

Keywords