EPJ Data Science (Jul 2022)

Evaluating the construct validity of text embeddings with application to survey questions

  • Qixiang Fang,
  • Dong Nguyen,
  • Daniel L. Oberski

DOI
https://doi.org/10.1140/epjds/s13688-022-00353-7
Journal volume & issue
Vol. 11, no. 1
pp. 1 – 31

Abstract

Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are high-quality representations of the information that needs to be encoded. We view this quality evaluation problem from a measurement validity perspective and propose using the classic construct validity framework to evaluate the quality of text embeddings. First, we describe how this framework can be adapted to the opaque and high-dimensional nature of text embeddings. Second, we apply our adapted framework to an example in which we compare the validity of survey question representations across text embedding models.

Keywords