The Programming Historian (Mar 2017)

Basic Text Processing in R

  • Taylor Arnold
  • Lauren Tilton

Abstract


A substantial amount of historical data is now available in the form of raw, digitized text. Common examples include letters, newspaper articles, personal notes, diary entries, legal documents and transcribed speeches. While some stand-alone software applications provide tools for analyzing text data, a programming language offers increased flexibility to analyze a corpus of text documents. In this tutorial we guide users through the basics of text analysis within the R programming language. The approach we take uses only a tokenizer that parses text into elements such as words, phrases and sentences.

By the end of the lesson users will be able to:

  • employ exploratory analyses to check for errors and detect high-level patterns;
  • apply basic stylometric methods over time and across authors;
  • approach document summarization to provide a high-level description of the elements in a corpus.

All of these will be demonstrated on a dataset from the text of United States Presidential State of the Union Addresses.[1]

We assume that users have only a very basic understanding of the R programming language. The ‘R Basics with Tabular Data’ lesson by Taryn Dewar[2] is an excellent guide that covers all of the R knowledge assumed here, such as installing and starting R, installing and loading packages, importing data and working with basic R data. Users can download R for their operating system from The Comprehensive R Archive Network (CRAN). Though not required, we also recommend that new users download RStudio, an open-source development environment for writing and executing R programs.

All of the code in this lesson was tested in R version 3.3.2, though we expect it to run properly on any future version of the software.
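To give a concrete sense of the tokenizer-based approach described above, here is a minimal sketch in R. It assumes the tokenizers package is available (installable with install.packages("tokenizers")); the example string is our own, not from the State of the Union corpus.

```r
# A minimal tokenization sketch, assuming the 'tokenizers' package
# is installed: install.packages("tokenizers")
library(tokenizers)

text <- "A substantial amount of historical data is now available as raw text. Tokenizers split it into units."

# Split the string into individual words; by default the result is
# lowercased and punctuation is stripped. Returns a list, one element
# per input string.
words <- tokenize_words(text)
head(words[[1]])

# Split the same string into sentences instead.
sentences <- tokenize_sentences(text)
length(sentences[[1]])
```

Word-level tokens like these are the only preprocessing the lesson's exploratory and stylometric analyses require; no part-of-speech tagging or parsing is needed.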