Generalized word shift graphs: a method for visualizing and explaining pairwise comparisons between texts

Ryan J. Gallagher; Morgan R. Frank; Lewis Mitchell; Aaron J. Schwartz; Andrew J. Reagan; Christopher M. Danforth; Peter Sheridan Dodds

doi:10.1140/epjds/s13688-021-00260-3

EPJ Data Science (Jan 2021)

Affiliations

Ryan J. Gallagher: Network Science Institute, Northeastern University
Morgan R. Frank: Department of Informatics and Networked Systems, University of Pittsburgh
Lewis Mitchell: School of Mathematical Sciences, The University of Adelaide
Aaron J. Schwartz: Computational Story Lab, Vermont Complex Systems Center, & Vermont Advanced Computing Core, The University of Vermont
Andrew J. Reagan: MassMutual Data Science
Christopher M. Danforth: Computational Story Lab, Vermont Complex Systems Center, & Vermont Advanced Computing Core, The University of Vermont
Peter Sheridan Dodds: Computational Story Lab, Vermont Complex Systems Center, & Vermont Advanced Computing Core, The University of Vermont

DOI: https://doi.org/10.1140/epjds/s13688-021-00260-3
Journal volume & issue: Vol. 10, no. 1
pp. 1 – 29

Abstract

A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts’ rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback–Leibler and Jensen–Shannon divergences. Through a diverse set of case studies ranging from presidential speeches to tweets posted in urban green spaces, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.
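To make the weighted-average idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the simplest case: each word has one fixed score (a dictionary score) and a text's measure is the frequency-weighted average of those scores. Under that assumption, the difference between two texts splits into one additive term per word, which is the quantity a word shift graph would rank and plot. The function names, the toy texts, and the score values are all hypothetical.

```python
from collections import Counter


def relative_freqs(tokens):
    """Map each word to its share of the token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def weighted_average(tokens, scores):
    """Frequency-weighted average score over the words that have a score."""
    p = relative_freqs([t for t in tokens if t in scores])
    return sum(scores[word] * p[word] for word in p)


def word_contributions(tokens_1, tokens_2, scores):
    """Per-word terms score(w) * (p2(w) - p1(w)).

    Summed over all words, these terms equal the difference between the
    two texts' weighted averages (text 2 minus text 1).
    """
    p1 = relative_freqs([t for t in tokens_1 if t in scores])
    p2 = relative_freqs([t for t in tokens_2 if t in scores])
    vocab = set(p1) | set(p2)
    return {w: scores[w] * (p2.get(w, 0.0) - p1.get(w, 0.0)) for w in vocab}


# Hypothetical scores and toy texts, for illustration only.
scores = {"happy": 8.0, "sad": 2.0, "rain": 4.0}
text_1 = "happy happy rain sad".split()
text_2 = "sad sad rain happy".split()

contrib = word_contributions(text_1, text_2, scores)

# Rank words by the size of their contribution, largest first.
ranked = sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True)
for word, delta in ranked:
    print(f"{word:>6s} {delta:+.2f}")
```

Because the per-word terms add up to the overall difference, sorting them by absolute value shows which words drive the change between the two texts rather than reporting only a single summary number.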

Published in EPJ Data Science

ISSN: 2193-1127 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://www.epjdatascience.com/
