A Reproducible IT-Blog Corpus

Adrien Barbaresi; Jens Pohlmann

doi:10.5334/johd.35

Journal of Open Humanities Data (Jul 2021)

Affiliations

Adrien Barbaresi: Center for Digital Lexicography of German, BBAW, Berlin
Jens Pohlmann: Centre for Media, Communication & Information Research (ZeMKI), University of Bremen, Bremen, DE; Center for Spatial and Textual Analysis (CESTA), Stanford University, Stanford

Journal volume & issue: Vol. 7

Abstract

The dataset comprises text and metadata extracted from several hundred IT-blogs and websites, along with a method to reproduce the data by updating its contents and downloading it to the user’s local machine. The targets have been hand-picked to represent the discourse on blogs and websites from Germany and the United States of America that address questions at the intersection of technology and society. The texts have been retrieved using web crawling techniques. The resulting corpus is accessible through a search platform and is also reproducible with freely accessible descriptors and software.

Published in Journal of Open Humanities Data

ISSN: 2059-481X (Online)
Publisher: Ubiquity Press
Country of publisher: United Kingdom
LCC subjects: General Works: History of scholarship and learning. The humanities; Language and Literature
Website: https://openhumanitiesdata.metajnl.com/
