Why experiments still matter

Experiments, Relational Organizing
June 8, 2023

Summary

In 2022, Deck conducted a large-scale paid relational field experiment in Pennsylvania with Relentless. We pursued this test because, to our knowledge, there was no experiment establishing a causal relationship between paid relational organizing and voter turnout. Instead, the effect of paid relational programs had been measured using observational causal inference. 

Our experiment showed that paid relational organizing likely increases voter turnout. However, we also found that previous findings derived from observational causal inference might have been severely biased upwards, overstating the turnout impacts of relational outreach. We break down how much bias there might have been by using our experiment as a benchmark.

What’s the problem with observational causal inference?

While observational causal inference[1] is sometimes the best we can do when programs don’t integrate experiments, we should always aim to run an experiment and resort to other tools only when an experiment is impossible. Experiments are built around the fundamental problem of causal inference: we can never observe both how a person acts when treated and how that same person would have acted without the treatment. Randomization solves this by creating treatment and control groups that are alike on average, so we don’t need to observe both outcomes for the same person.

Non-experimental methods are vulnerable to endogeneity.[2] We are typically most concerned with omitted variable bias (i.e., excluding a confounding variable that jointly explains the outcome and why someone was treated). For paid relational organizing, for example, we might worry that organizers are more likely to contact friends who are already more likely to vote. If we simply estimated the relationship between relational contact and voting, we might incorrectly infer that relational contact increased turnout when, in reality, the organizers’ friends were more likely to vote in the first place.
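
As a stylized illustration of that concern (synthetic data, not the program’s), a quick simulation shows how this kind of confounding can produce a positive “effect” of contact even when contact does nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical baseline propensity to vote for each person in an organizer's network.
propensity = rng.uniform(0, 1, n)

# Organizers contact likely voters more often; contact has no causal effect on turnout here.
contacted = rng.binomial(1, 0.1 + 0.4 * propensity)
turnout = rng.binomial(1, propensity)

naive = turnout[contacted == 1].mean() - turnout[contacted == 0].mean()
print(f"Naive contacted-vs-not difference: {naive:.3f}")  # clearly positive despite a true effect of zero
```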

This problem is especially hard to manage when trying to identify the effect of contact on paid relational organizing targets because of the program’s unique goals. For example, Relentless wanted to reach people whom campaigns and organizations normally don’t contact and people who may not have the time to engage in politics. That is a worthwhile goal, but it complicates identification with observational methods: the people targeted are very different from the rest of the population, which makes finding suitable controls and comparable units much harder.

In the specific case of relational outreach, most prior research has suggested that relational outreach outperforms traditional forms of voter contact. However, those studies relied on observational causal inference techniques that are vulnerable to endogeneity.[3]

What did we do?

This is why Deck randomized paid relational contact in PA. In brief, Deck randomly assigned voters in paid relational organizers’ networks to control, cold outreach, and relational outreach conditions. Afterward, we matched voters back to the voter file to estimate the effect of relational outreach. Read our write-up for a more detailed discussion of the program.

Then, to mimic the approach of observational causal inference studies, we added a random sample of 100,000 voters from the PA voter file as an observational comparison group. Our dataset therefore has four groups: control, cold outreach, relational outreach, and the observational comparison group.

What models are we comparing?

We are going to compare four different ways of analyzing the data under different constraints. The first two leverage the experimental data, while the last two simulate how we would use observational methods if we didn’t have an experiment. We’ll use the following covariates: age, race, party, gender, and midterm turnout score.[4] (A brief code sketch of all four estimators follows this list.)

  1. Difference-in-means without controls: We will compare average voter turnout in the treatment conditions against the control condition. This is the classic method for analyzing experimental results to recover the average treatment effect (ATE).[5]
  2. Ordinary least squares (OLS) with controls: We will again compare voter turnout across the treatment and control conditions, this time controlling for the covariates used in the following models to add precision.
  3. Inverse propensity weighting (IPW) for the ATE:[6] We will first estimate each voter’s propensity of being in the treatment conditions rather than the observational comparison group. Then, we will reweight voters so that underrepresented voters are up-weighted and overrepresented voters are down-weighted relative to the combined sample. This is one of the methods used in observational studies.
  4. IPW for the average treatment effect on the treated (ATT):[7] We will estimate each voter’s likelihood of being in the treatment group by regressing treatment status on our covariates. Then, we will down-weight overrepresented voters and up-weight underrepresented voters relative to the treatment group. This is another method used in observational causal inference studies.
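
To make the mechanics concrete, here is a minimal sketch of all four estimators in Python. The data, column names, and covariates are illustrative stand-ins rather than our actual files, and the weighting step shown is plain IPW, whereas the estimator we used is doubly robust (see footnote 6).

```python
# Minimal sketch of the four estimators on synthetic stand-in data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "group": rng.choice(["control", "cold", "relational", "comparison"], size=n),
    "age": rng.integers(18, 90, size=n),
    "turnout_score": rng.uniform(0, 1, size=n),  # stand-in for the midterm turnout score
})
df["turnout"] = rng.binomial(1, 0.3 + 0.4 * df["turnout_score"])
covs = ["age", "turnout_score"]  # the real models also use race, party, and gender

# 1. Difference-in-means: relational arm vs. experimental control.
dim = (df.loc[df.group == "relational", "turnout"].mean()
       - df.loc[df.group == "control", "turnout"].mean())

# 2. OLS with controls, restricted to the experimental arms.
exp = df[df.group.isin(["control", "relational"])].copy()
exp["treated"] = (exp.group == "relational").astype(int)
ols = sm.OLS(exp["turnout"], sm.add_constant(exp[["treated"] + covs])).fit()

# 3-4. IPW: model the propensity of being a relational target rather than
# a member of the observational comparison group.
obs = df[df.group.isin(["relational", "comparison"])].copy()
obs["treated"] = (obs.group == "relational").astype(int)
e = sm.Logit(obs["treated"], sm.add_constant(obs[covs])).fit(disp=0).predict()

w_ate = np.where(obs["treated"] == 1, 1 / e, 1 / (1 - e))  # reweight both groups toward the full sample
w_att = np.where(obs["treated"] == 1, 1.0, e / (1 - e))    # reweight the comparison group toward the treated

def weighted_diff(y, t, w):
    """Weighted difference in mean turnout between treated and comparison voters."""
    return (np.average(y[t == 1], weights=w[t == 1])
            - np.average(y[t == 0], weights=w[t == 0]))

y, t = obs["turnout"].to_numpy(), obs["treated"].to_numpy()
print(dim, ols.params["treated"], weighted_diff(y, t, w_ate), weighted_diff(y, t, w_att))
```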

In theory, if observational methods are unbiased and work perfectly in practice, they should give us the same estimate as our difference-in-means estimator. We should also expect our OLS estimator with controls to closely match the basic difference-in-means estimate without controls.

What did we find?

To preview, we find that all of our observational causal inference methods are highly biased compared to our experimental benchmark.

Below, we present the treatment effects of relational outreach under our different estimators. Our experimental difference-in-means and OLS estimates both show that relational outreach increased voter turnout by 1.2 percentage points. Our inverse propensity-weighted estimators are severely biased in comparison. When targeting the ATE, IPW suggests relational outreach increased voter turnout by 2.8 percentage points, substantially larger than our experimental benchmark. The results are worse when targeting the ATT, the standard estimand for these estimators: in that case, IPW suggests relational outreach increased voter turnout by 3.9 percentage points.

Estimator                 | Estimand | Estimate | Standard error | Bias
Difference-in-means (RCT) | ATE      | 0.012    | 0.010          | 0.000
OLS (RCT)                 | ATE      | 0.012    | 0.009          | 0.000
IPW                       | ATE      | 0.028    | 0.003          | 0.016
IPW                       | ATT      | 0.039    | 0.003          | 0.027
Non-experimental estimates are severely biased compared to experimental estimates of relational outreach
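The bias column is each estimate minus the experimental difference-in-means benchmark; for the IPW ATT row, for example, 0.039 - 0.012 = 0.027.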

We now do a similar exercise with cold outreach. Our difference-in-means estimate without controls shows that cold outreach increased voter turnout by 3.1 percentage points, while our OLS estimate with controls shows an increase of 3.0 percentage points. As before, our IPW estimates are severely biased. When targeting the ATE, IPW suggests cold outreach increased voter turnout by 3.6 percentage points. When targeting the ATT, IPW suggests cold outreach increased voter turnout by 5.7 percentage points.

Estimator                 | Estimand | Estimate | Standard error | Bias
Difference-in-means (RCT) | ATE      | 0.031    | 0.014          | 0.000
OLS (RCT)                 | ATE      | 0.030    | 0.012          | 0.001
IPW                       | ATE      | 0.036    | 0.013          | 0.005
IPW                       | ATT      | 0.057    | 0.008          | 0.026
Non-experimental estimates are severely biased compared to experimental estimates of cold outreach

In sum, our difference-in-means and OLS estimators behave exactly as we expected: adding controls introduces little bias while buying a bit more precision through smaller standard errors. IPW, on the other hand, produces estimates that are 18% to 225% larger than the experimental benchmark. That gap can be very problematic for campaigns and organizations trying to allocate resources to the most effective tactics.
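
As a worked example of the upper end of that range, take the relational ATT row of the first table: (0.039 - 0.012) / 0.012 = 2.25, so the IPW estimate is about 225% larger than the experimental benchmark.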

How should we interpret estimates using non-experimental research designs?

At Deck, we treat estimates from non-experimental research designs as suggestive of an effect and remain wary of the point estimate itself. This does not mean we should never use non-experimental techniques to evaluate programs; sometimes they are all we have when an experiment was not conducted or when we are studying something that is hard to randomize. Rather, we think non-experimental designs should push organizations to find ways to randomize their future programs so they can estimate the effects of their interventions without bias.

Footnotes

  1. Observational causal inference is any method for estimating a treatment’s causal effect when the treatment was not randomly assigned or the researcher did not control the randomization process.
  2. Formally, endogeneity arises when the treatment is correlated with the error term. Intuitively, it means the estimate is biased because we have not properly modeled the causal relationship between the treatment and the outcome of interest.
  3. See these program evaluations for an example of how these methods have been used.
  4. We do not use mobilizer controls because those data would not be available for our random sample of voters (e.g., there is no mobilizer tie to our random voters). We wanted to ensure the estimates were as comparable as possible so we used covariates available for all voters.
  5. The ATE is the average effect of an intervention in the population of interest.
  6. Our estimator is “doubly-robust” because we only need either the regression or the propensity score model to be properly specified to recover consistent estimates. The key benefit is we get “two bites at the apple,” in that we get two chances to get it right and we only need one to be right. See this for a full discussion.
  7. The ATT is the average effect among the treated population. This is different from the ATE because the treated population could be different from the overall population. In an ideal experiment, the ATE and ATT equal each other because we have randomization and perfect treatment uptake.
