Code4Lib Journal (Jul 2016)

Metadata Analytics, Visualization, and Optimization: Experiments in statistical analysis of the Digital Public Library of America (DPLA)

  • Corey A. Harper

Journal volume & issue
no. 33

Abstract

This paper presents the concepts of metadata assessment and "quantification" and describes preliminary research results from applying these concepts to metadata from the Digital Public Library of America (DPLA). The introductory sections provide a technical outline of data pre-processing and propose visualization techniques that can help us understand metadata characteristics in a given context. Example visualizations are shown and discussed, leading up to the use of "metadata fingerprints" (D3 Star Plots) to summarize metadata characteristics across multiple fields for arbitrary groupings of resources. Fingerprints are shown comparing metadata characteristics for different DPLA "Hubs" and also for used versus unused resources, as determined by Google Analytics "pageview" counts. The closing sections introduce the concept of metadata optimization and explore the use of machine learning techniques to optimize metadata in the context of large-scale metadata aggregators like DPLA. Various statistical models are used to predict whether a particular DPLA item is used based only on its metadata. The article concludes with a discussion of the broad potential for machine learning and data science in libraries, academic institutions, and cultural heritage.