Precise unbiased estimation in randomized experiments using auxiliary observational data

Gagnon-Bartsch Johann A.; Sales Adam C.; Wu Edward; Botelho Anthony F.; Erickson John A.; Miratrix Luke W.; Heffernan Neil T.

doi:10.1515/jci-2022-0011

Journal of Causal Inference (Aug 2023)

Affiliations

Gagnon-Bartsch Johann A.: Department of Statistics, University of Michigan, Ann Arbor, Michigan, United States
Sales Adam C.: Department of Mathematical Sciences, Worcester Polytechnic Institute, Worcester, Massachusetts, United States
Wu Edward: Biocomplexity Institute, Social and Decision Analytics Division, University of Virginia, Charlottesville, Virginia, United States
Botelho Anthony F.: College of Education, University of Florida, Gainesville, Florida, United States
Erickson John A.: Analytics and Information Systems, Western Kentucky University, Bowling Green, KY 42101, United States
Miratrix Luke W.: Graduate School of Education, Harvard University, Cambridge, Massachusetts, United States
Heffernan Neil T.: Department of Computer Science, Worcester Polytechnic Institute, Worcester, Massachusetts, United States

Journal volume & issue: Vol. 11, no. 1
pp. 286 – 327

Abstract

Randomized controlled trials (RCTs) admit unconfounded design-based inference – randomization largely justifies the assumptions underlying statistical effect estimates – but often have limited sample sizes. However, researchers may have access to big observational data on covariates and outcomes from RCT nonparticipants. For example, data from A/B tests conducted within an educational technology platform exist alongside historical observational data drawn from student logs. We outline a design-based approach to using such observational data for variance reduction in RCTs. First, we use the observational data to train a machine learning algorithm predicting potential outcomes using covariates and then use that algorithm to generate predictions for RCT participants. Then, we use those predictions, perhaps alongside other covariates, to adjust causal effect estimates with a flexible, design-based covariate-adjustment routine. In this way, there is no danger of biases from the observational data leaking into the experimental estimates, which are guaranteed to be exactly unbiased regardless of whether the machine learning models are “correct” in any sense or whether the observational samples closely resemble RCT samples. We demonstrate the method in analyzing 33 randomized A/B tests and show that it decreases standard errors relative to other estimators, sometimes substantially.
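
As a rough illustration of the two-step idea described in the abstract, the Python sketch below trains a model on auxiliary observational data, predicts outcomes for RCT participants, and uses those predictions to adjust the effect estimate. It is not the authors' implementation. Ordinary least squares stands in for the machine learning algorithm, a plain difference in means of residuals stands in for the flexible covariate-adjustment routine, and the simulated data, parameter values, and all function names are hypothetical. Because the predictions depend only on covariates and on data from outside the experiment, subtracting them from the outcomes leaves the difference-in-means estimator unbiased under complete randomization.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept (a stand-in for any ML model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict outcomes from covariates with the fitted coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def diff_in_means(y, z):
    """Unadjusted estimator: treated mean minus control mean."""
    return y[z == 1].mean() - y[z == 0].mean()


def residualized_diff_in_means(y, z, y_hat):
    """Difference in means of residuals y - y_hat.

    y_hat must not depend on the treatment assignment z; here it comes
    from a model trained only on data outside the experiment.
    """
    return diff_in_means(y - y_hat, z)


def simulate(seed=0, n_obs=5000, n_rct=60, tau=1.0, n_reps=2000):
    """Compare both estimators over repeated randomizations (assumed setup)."""
    rng = np.random.default_rng(seed)
    beta = np.array([2.0, -1.0, 0.5])  # hypothetical outcome model

    # Step 1: train on auxiliary observational data (RCT nonparticipants).
    X_obs = rng.normal(size=(n_obs, 3))
    y_obs = X_obs @ beta + rng.normal(size=n_obs)
    coef = fit_linear(X_obs, y_obs)

    # Fixed RCT sample with both potential outcomes (constant effect tau).
    X = rng.normal(size=(n_rct, 3))
    y0 = X @ beta + rng.normal(size=n_rct)
    y1 = y0 + tau

    # Step 2: predictions for RCT participants, computed before assignment.
    y_hat = predict_linear(coef, X)

    # Re-randomize treatment many times; only the assignment z is random.
    z_base = np.array([1] * (n_rct // 2) + [0] * (n_rct - n_rct // 2))
    unadjusted, adjusted = [], []
    for _ in range(n_reps):
        z = rng.permutation(z_base)
        y = np.where(z == 1, y1, y0)  # observed outcomes
        unadjusted.append(diff_in_means(y, z))
        adjusted.append(residualized_diff_in_means(y, z, y_hat))
    return np.array(unadjusted), np.array(adjusted)


if __name__ == "__main__":
    unadjusted, adjusted = simulate()
    print("unadjusted: mean", unadjusted.mean(), "sd", unadjusted.std())
    print("adjusted:   mean", adjusted.mean(), "sd", adjusted.std())
```

In this simulated setup both estimators should center on the true effect across randomizations, while the adjusted one should vary less because the predictions absorb much of the outcome variation. How large the gain is in practice depends on how well the auxiliary model predicts outcomes in the RCT sample.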

Published in Journal of Causal Inference

ISSN: 2193-3677 (Print); 2193-3685 (Online)
Publisher: De Gruyter
Country of publisher: Poland
LCC subjects: Science: Mathematics: Probabilities. Mathematical statistics
Website: https://www.degruyter.com/view/journals/jci/jci-overview.xml
