The Programming Historian (Aug 2017)
Fetching and Parsing Data from the Web with OpenRefine
Abstract
OpenRefine is a powerful tool for exploring, cleaning, and transforming data. An earlier Programming Historian lesson, “Cleaning Data with OpenRefine”, introduced the basic functionality of Refine to efficiently discover and correct inconsistency in a data set. Building on those essential data wrangling skills, this lesson focuses on Refine’s ability to fetch URLs and parse web content. Examples introduce some of the advanced features to transform and enhance a data set including:

- fetch URLs using Refine
- construct URL queries to retrieve information from a simple web API
- parse HTML and JSON responses to extract relevant data
- use array functions to manipulate string values
- use Jython to extend Refine’s functionality

It will be helpful to have basic familiarity with OpenRefine, HTML, and programming concepts such as variables and loops to complete this lesson.