The Programming Historian (Aug 2013)

Cleaning Data with OpenRefine

  • Seth van Hooland
  • Ruben Verborgh
  • Max De Wilde

Abstract

Duplicate records, empty values, and inconsistent formats are problems we should be prepared to deal with when using historical data sets. This lesson teaches you how to discover inconsistencies in data contained in a spreadsheet or database. As we increasingly share, aggregate, and reuse data on the web, historians will need to respond to the data quality issues that inevitably arise. Using a program called OpenRefine, you will be able to identify systematic errors such as blank cells, duplicates, and spelling inconsistencies. OpenRefine allows you not only to diagnose the accuracy of your data quickly, but also to act on certain errors in an automated manner.

Keywords