Four Datasets Derived from an Archive of Personal Homepages (1995–2009)

Data. 2017;2(2):19. DOI: 10.3390/data2020019



Journal Title: Data

ISSN: 2306-5729 (Online)

Publisher: MDPI AG

LCC Subject Category: Bibliography. Library science. Information resources

Country of publisher: Switzerland

Language of fulltext: English

Full-text formats available: PDF, HTML, ePUB, XML



Sean C. Rife (Department of Psychology, Murray State University, Murray, KY 42071, USA)


Blind peer review


Time From Submission to Publication: 10 weeks


Abstract

While data from social media are easily accessible, understanding how individuals expressed themselves on the Internet in its initial years of public availability (the mid-to-late 1990s) has proved difficult. In this data deposit, I describe how archival data from Geocities homepages were retrieved, processed to remove non-text data, and then further refined to create separate datasets, each of which provides unique insights into modes of personal expression on the early Internet. The present paper describes four datasets, all derived from a larger collection of personal websites: (1) a large corpus of raw text data from Geocities personal homepages; (2) a linguistic analysis of basic psychological properties of the same Geocities pages, using an open-source implementation of the Linguistic Inquiry and Word Count (LIWC); (3) a dataset of links between homepages, suitable for network analysis; and (4) a manifest dataset summarizing the size and last update date of each file in the collection. Data from over 378,000 Geocities pages are included. In addition to providing a detailed description of how these datasets were created, I describe how they might be utilized in future research.
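
As a rough illustration of two of the processing steps the abstract mentions (removing non-text data and collecting links between pages), the following Python sketch uses only the standard library. It is not the author's pipeline: the class name `PageExtractor`, the function `extract_page`, the decision to skip `script` and `style` content, and the sample page with its URL are all assumptions made for this example.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects visible text and outgoing link targets from one HTML page.

    Hypothetical helper for illustration; not part of the published datasets.
    """

    # Assumption: script and style bodies count as "non-text data".
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []   # visible text fragments, in document order
        self.links = []        # href values of <a> tags, in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(html):
    """Return (plain_text, links) for a single HTML document string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Made-up sample page; the URL is a placeholder, not a record from the data.
sample = (
    "<html><head><title>My Page</title><style>b{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p>"
    '<a href="http://www.geocities.com/example/">friend</a>'
    "<script>var x=1;</script></body></html>"
)
text, links = extract_page(sample)
print(text)
print(links)
```

The text output of such a step could feed a corpus or word-count analysis, while the collected `href` values could form the edge list of a link network. Real archived pages would likely need extra handling (character encodings, malformed markup, relative URLs) that this sketch omits.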