data(star)
star_sr <- subset(star, aide == 0) # small vs. regular classes
coef(summary(lm(totalscore ~ small, star_sr)))
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 918.04289 1.667157 550.663750 0.000000e+00
#> small 13.89899 2.446592 5.680962 1.441473e-08
20 Treatment Effects & Difference-in-Differences
Reading. SW 13.1, 13.3–13.4, HGL 7.5–7.6
From the very first scatterplots to the omitted-variable bias of model specification, one warning has recurred all term: correlation is not causation. A regression slope measures association. It is a causal effect only if the regressor is exogenous.
The cleanest answer comes from the potential-outcomes framework. We use it to define a treatment effect and the average treatment effect (ATE), to see that selection bias is nothing more than omitted-variable bias in disguise, to understand why randomized experiments are the gold standard (with Project STAR as the example), and to learn difference-in-differences for the common case where randomizing is impossible (with Card and Krueger’s minimum-wage study as the example). We end where the course began: a final word on correlation versus causation.
20.1 The potential-outcomes framework
Let \(d_i = 1\) if individual \(i\) is treated and \(d_i = 0\) if not. The potential-outcomes framework asks us to imagine both futures for each person: \[
y_{1i} = \text{outcome if treated}, \qquad y_{0i} = \text{outcome if not treated}.
\] The causal effect for individual \(i\) is the difference between these two worlds, \(y_{1i} - y_{0i}\).
The catch is that the world only ever lets us see one of the two outcomes for any given person. A person who takes the treatment reveals \(y_{1i}\); their \(y_{0i}\) is never observed.
What we observe is \[
y_i = y_{0i} + (y_{1i} - y_{0i})\,d_i ,
\] which equals \(y_{1i}\) when \(d_i = 1\) and \(y_{0i}\) when \(d_i = 0\). The other potential outcome is the unobserved counterfactual.
The framework’s real gift is honesty. Effects genuinely differ across people: a drug helps some patients and harms others, a small class boosts some children more than others. Since we can never recover any single individual’s effect, we must settle for an average effect rather than an individual one.
The average treatment effect is the mean of the individual effects across the whole population: \[ \tau_{\text{ATE}} = \E(y_{1i} - y_{0i}). \]
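Because individual effects are hidden but group averages can be estimated, the ATE is the natural target. The most natural estimator, the difference estimator, simply compares the average outcome of the treated group with that of the control group: \(\hat\tau = \bar y_{\text{treated}} - \bar y_{\text{control}}\).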
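A small simulation can make the bookkeeping concrete. The sketch below is illustrative only: the sample size, means, and spreads are made-up assumptions, and treatment is assigned by a coin flip, which is the benign case. It builds the observed outcome from the two potential outcomes and compares the difference estimator with the true ATE, which we can compute here only because a simulation lets us see both \(y_{0i}\) and \(y_{1i}\).

```r
set.seed(103)                             # arbitrary seed, for reproducibility only
n  <- 10000
y0 <- rnorm(n, mean = 50, sd = 10)        # outcome if not treated
y1 <- y0 + rnorm(n, mean = 5, sd = 3)     # outcome if treated; the effect differs by person
d  <- rbinom(n, size = 1, prob = 0.5)     # treatment indicator (coin flip, an assumption)

# Observed outcome: y1 for the treated, y0 for everyone else
y <- y0 + (y1 - y0) * d

ate_true <- mean(y1 - y0)                       # needs both potential outcomes
tau_hat  <- mean(y[d == 1]) - mean(y[d == 0])   # uses observed data only

c(ate_true = ate_true, tau_hat = tau_hat)
```

With real data only `y` and `d` are available, so `ate_true` cannot be computed; the question for the rest of the chapter is when `tau_hat` can stand in for it.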
20.2 Selection bias
To see what the difference estimator really captures, split the gap in group averages into two pieces: \[
\underbrace{\E(y\given d{=}1) - \E(y\given d{=}0)}_{\text{what } \hat\tau \text{ estimates}}
= \underbrace{\E(y_{1}-y_{0}\given d{=}1)}_{\text{effect on the treated (ATT)}}
+ \underbrace{\bigl[\E(y_{0}\given d{=}1) - \E(y_{0}\given d{=}0)\bigr]}_{\text{selection bias}}.
\] The first piece is the average treatment effect on the treated (ATT); the second is selection bias.
Selection bias is the difference in the two groups’ untreated potential outcomes, \[
\E(y_{0}\given d{=}1) - \E(y_{0}\given d{=}0).
\] It measures how non-comparable the treated and control groups were to begin with.
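The decomposition can be checked numerically. In this sketch (all numbers are invented for illustration) people with a low untreated outcome are more likely to opt into treatment, and the treatment effect is a constant 5 for everyone. Because the simulation reveals both potential outcomes, we can compute each piece and confirm that the naive gap equals the ATT plus the selection bias.

```r
set.seed(103)
n  <- 10000
y0 <- rnorm(n, mean = 50, sd = 10)     # untreated potential outcome
y1 <- y0 + 5                           # constant effect of 5, to keep the sketch simple
# Self-selection (assumed form): low-y0 people are more likely to take the treatment
d  <- rbinom(n, size = 1, prob = plogis((50 - y0) / 10))
y  <- y0 + (y1 - y0) * d               # observed outcome

naive    <- mean(y[d == 1]) - mean(y[d == 0])     # difference estimator
att      <- mean(y1[d == 1] - y0[d == 1])         # effect on the treated
sel_bias <- mean(y0[d == 1]) - mean(y0[d == 0])   # selection bias

c(naive = naive, att = att, sel_bias = sel_bias, att_plus_bias = att + sel_bias)
```

The identity holds exactly in the sample, not just in expectation: `naive` and `att + sel_bias` agree up to floating-point error, and here the negative selection bias drags the naive estimate well below the true effect.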
A classic illustration shows just how badly the difference estimator can mislead.
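In a health survey, people who had recently been hospitalized rated their own health worse (an average of \(3.21\)) than people who had not been hospitalized (\(3.93\)). Taken literally, the naive difference \(3.21 - 3.93 = -0.72\) says that going to the hospital damages your health. But of course hospitals do not cause illness. People who go to the hospital were less healthy to begin with, so \(\E(y_{0}\given d{=}1) < \E(y_{0}\given d{=}0)\): the selection bias is negative, and the naive comparison mixes it with the true effect of hospital care.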
Why call it selection bias and not something new? Because it is exactly the omitted-variable bias of earlier chapters. In the regression \(y_i = \alpha + \tau d_i + e_i\), the error \(e\) absorbs everything we left out. If those omitted factors are correlated with \(d\), the estimate of \(\tau\) is biased.
A treatment-dummy coefficient is causal only if treatment is uncorrelated with the omitted factors.
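To see the link with omitted-variable bias in numbers, the sketch below (simulated data, assumed parameter values) regresses the observed outcome on the treatment dummy. The slope is the same number as the difference in group means, and its gap from the true effect is exactly the sample covariance between \(d\) and the omitted \(y_0\), divided by the variance of \(d\).

```r
set.seed(103)
n   <- 10000
y0  <- rnorm(n, mean = 50, sd = 10)
tau <- 5                                                   # true effect, same for everyone
d   <- rbinom(n, size = 1, prob = plogis((50 - y0) / 10))  # low-y0 people opt in (assumed)
y   <- y0 + tau * d

slope      <- unname(coef(lm(y ~ d))["d"])
diff_means <- mean(y[d == 1]) - mean(y[d == 0])

# In y = alpha + tau * d + e, the error is the omitted part of y0
e <- y0 - mean(y0)

c(slope       = slope,
  diff_means  = diff_means,
  ovb_formula = tau + cov(d, y0) / var(d),   # true effect plus the bias term
  cor_d_e     = cor(d, e))
```

The first three entries coincide, and the nonzero `cor_d_e` is what makes the regressor endogenous.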
20.3 Randomized experiments
If we randomly assign treatment, then \(d_i\) becomes statistically independent of the potential outcomes. The treated and control groups become comparable in expectation, so \(\E(y_{0}\given d{=}1) = \E(y_{0}\given d{=}0)\) and the selection-bias term vanishes.
The randomized controlled experiment (RCT) is the gold standard of causal inference.
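A side-by-side simulation shows what randomization buys. This is a sketch under made-up assumptions (a constant effect of 5 and a particular self-selection rule): the same population is treated once by self-selection and once by coin flip, and we compare the selection bias and the difference estimator in each case.

```r
set.seed(103)
n  <- 10000
y0 <- rnorm(n, mean = 50, sd = 10)
y1 <- y0 + 5                                   # true effect is 5 for everyone

selection_bias <- function(d) mean(y0[d == 1]) - mean(y0[d == 0])
diff_estimator <- function(d) {
  y <- y0 + (y1 - y0) * d                      # observed outcome under assignment d
  mean(y[d == 1]) - mean(y[d == 0])
}

d_self   <- rbinom(n, 1, plogis((50 - y0) / 10))   # choice depends on y0
d_random <- rbinom(n, 1, 0.5)                      # coin flip, independent of y0

rbind(
  self_selected = c(bias = selection_bias(d_self),   estimate = diff_estimator(d_self)),
  randomized    = c(bias = selection_bias(d_random), estimate = diff_estimator(d_random))
)
```

Under the coin flip the bias is close to zero and the estimate lands near 5; it is not exactly zero in any one sample, only in expectation.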
A celebrated economic example is Project STAR, a genuine RCT run in Tennessee from 1985 to 1989. Kindergartners were randomly assigned, within their own schools, to either a small class (13
The coefficient on small is about 13.9 points, and it is highly significant. Because assignment was random, \(\hat\beta_2\) is a credible estimate of the average causal effect of being in a small class.
A telltale sign of good randomization is what happens when we add controls. Throw in teacher experience, for instance, and the estimate barely budges, moving from \(13.9\) to about \(14.0\):
coef(summary(lm(totalscore ~ small + tchexper, star_sr)))["small", ]
#> Estimate Std. Error t value Pr(>|t|)
#> 1.398327e+01 2.437332e+00 5.737121e+00 1.039372e-08
Under successful randomization, controls are uncorrelated with the treatment, so adding them should leave the treatment coefficient essentially unchanged. A second check is to regress small on student traits with a linear probability model (the dummy-variable regression idea applied to a \(0/1\) outcome). If assignment was truly random, no trait should predict it.
lpm <- lm(small ~ boy + white_asian + freelunch, star_sr)
mean(fitted(lpm)) # average predicted probability of small class
#> [1] 0.4643334
The fitted probabilities cluster near \(\hat p \approx 0.46\), the share of students in this sample who were assigned to a small class.
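The same balance check can be rehearsed on simulated data, where we know assignment ignores the traits. Everything below is hypothetical (the trait names mirror the STAR variables, but the values are random draws, not the STAR data). It also shows why the average fitted value matches the treated share: with an intercept, OLS fitted values always average to the sample mean of the outcome.

```r
set.seed(103)
n <- 5000
# Hypothetical student traits (simulated, not the STAR data)
boy       <- rbinom(n, 1, 0.5)
freelunch <- rbinom(n, 1, 0.45)
small     <- rbinom(n, 1, 0.46)     # assignment ignores the traits by construction

lpm <- lm(small ~ boy + freelunch)
coef(summary(lpm))                  # trait coefficients should be close to zero

# Mean fitted probability equals the share assigned to small classes
c(mean_fitted = mean(fitted(lpm)), share_small = mean(small))

# Joint F test that no trait predicts assignment
anova(lm(small ~ 1), lpm)
```

A large p-value on the joint test is what we hope to see in a real experiment; a small one would suggest that assignment was related to the traits.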
library(dplyr)    # mutate(), group_by(), summarise()
library(ggplot2)
# `ucla` is assumed to be a list of colors defined elsewhere in the book's setup code
star_means <- star_sr |>
  mutate(class = factor(small, levels = c(0, 1),
                        labels = c("Regular", "Small"))) |>
  group_by(class) |>
  summarise(score = mean(totalscore), .groups = "drop")
ggplot(star_means, aes(class, score)) +
  geom_col(fill = ucla$blue, color = ucla$darkblue, width = 0.6) +
  geom_text(aes(label = round(score, 0)), vjust = -0.5,
            color = ucla$darkblue, size = 3.6) +
  coord_cartesian(ylim = c(880, 940)) +
  labs(x = "Class type", y = "Mean total score")
20.4 Difference-in-differences
Most economic data are observational, not experimental.
Difference-in-differences (DiD) exploits exactly this situation, using before/after data on a treatment group and a comparison group. The idea is to measure the treatment group’s change over time, then subtract off the comparison group’s change over the same period: \[
\hat\delta = \bigl(\bar y^{\text{treat}}_{\text{after}} - \bar y^{\text{treat}}_{\text{before}}\bigr)
- \bigl(\bar y^{\text{ctrl}}_{\text{after}} - \bar y^{\text{ctrl}}_{\text{before}}\bigr).
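\] The control group’s change captures the common trend; subtracting it leaves the part of the treatment group’s change that is attributable to the treatment.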
Figure 20.2 makes the logic visual. The treatment group rises from \(B\) to \(C\), while the control group drifts from \(A\) to \(E\). If the treatment group would have followed the same trend as the control group absent the policy, it would have ended at the counterfactual point \(D\) (the dashed line \(BD\), parallel to \(AE\)). The DiD estimate \(\hat\delta\) is the vertical gap between where the treatment group actually landed (\(C\)) and where it would have landed without treatment (\(D\)).
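The arithmetic behind the figure takes only a few lines. The sketch below uses the coordinates from the plotting code (they are illustrative values, not data) to build the counterfactual point \(D\) and the DiD estimate.

```r
# After-period and before-period heights of the points in the figure
p <- c(A = 1.5, E = 2.2,    # control group: before, after
       B = 2.5, C = 4.5)    # treatment group: before, after

common_trend <- p[["E"]] - p[["A"]]             # control group's change
point_D      <- p[["B"]] + common_trend         # where treatment would have ended up
delta_hat    <- (p[["C"]] - p[["B"]]) - (p[["E"]] - p[["A"]])

c(D = point_D, delta_hat = delta_hat, gap_C_minus_D = p[["C"]] - point_D)
```

Both routes give the same answer: the difference of the two changes and the vertical gap between \(C\) and \(D\) are each 1.3.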
lines_df <- data.frame(
time = c(1, 2, 1, 2, 1, 2),
y = c(1.5, 2.2, 2.5, 4.5, 2.5, 3.2),
group = c("Control", "Control", "Treatment", "Treatment",
"Counterfactual", "Counterfactual")
)
labs_df <- data.frame(
time = c(1, 2, 1, 2, 2),
y = c(1.5, 2.2, 2.5, 4.5, 3.2),
lab = c("A", "E", "B", "C", "D")
)
ggplot(lines_df, aes(time, y, group = group)) +
geom_line(aes(color = group, linetype = group), linewidth = 1) +
geom_point(data = subset(lines_df, group != "Counterfactual"),
aes(color = group), size = 2) +
geom_text(data = labs_df, aes(time, y, label = lab), inherit.aes = FALSE,
nudge_x = ifelse(labs_df$time == 1, -0.07, 0.07),
color = ucla$darkblue, size = 3.6) +
annotate("segment", x = 2.12, xend = 2.12, y = 3.2, yend = 4.5,
arrow = arrow(ends = "both", length = unit(0.15, "cm")),
color = ucla$darkblue) +
annotate("text", x = 2.2, y = 3.85, label = "delta",
parse = TRUE, color = ucla$darkblue, size = 4) +
scale_color_manual(values = c(Control = ucla$blue,
Treatment = ucla$red,
Counterfactual = ucla$red)) +
scale_linetype_manual(values = c(Control = "solid",
Treatment = "solid",
Counterfactual = "dashed")) +
scale_x_continuous(breaks = c(1, 2), labels = c("Before", "After"),
limits = c(0.85, 2.35)) +
labs(x = NULL, y = "y", color = NULL, linetype = NULL)
DiD as one regression
The whole estimator collapses into a single regression once we introduce a treatment dummy and a time dummy. Then \(\hat\delta\) is simply the coefficient on their interaction: \[
y_{it} = \beta_1 + \beta_2\,\text{TREAT}_i + \beta_3\,\text{AFTER}_t
+ \delta\,(\text{TREAT}_i \times \text{AFTER}_t) + e_{it} .
\] Reading off the pieces: \(\beta_2\) is the fixed gap between the two groups, \(\beta_3\) is the common time trend shared by both, and \(\delta\) is the treatment effect, the extra change in the treatment group after the policy.
DiD rests on one key assumption: parallel trends. Absent the treatment, the two groups would have moved together.
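The equivalence between the interaction regression and the four cell means can be verified on simulated data. This is a sketch with invented parameter values: parallel trends hold by construction (both groups share the same time effect), and the treated group receives an extra \(\delta = 2\) in the after period.

```r
set.seed(103)
n_cell <- 500
df <- expand.grid(id = seq_len(n_cell), treat = 0:1, after = 0:1)
# beta1 = 10, beta2 = 3 (group gap), beta3 = 1 (common trend), delta = 2
df$y <- 10 + 3 * df$treat + 1 * df$after + 2 * df$treat * df$after +
  rnorm(nrow(df), sd = 2)

fit <- lm(y ~ treat * after, data = df)    # expands to treat + after + treat:after
delta_reg <- unname(coef(fit)["treat:after"])

# 2 x 2 table of cell means, rows = treat, columns = after
m <- tapply(df$y, list(treat = df$treat, after = df$after), mean)
delta_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

c(delta_reg = delta_reg, delta_means = delta_means)
```

The two numbers agree exactly because the regression has one parameter per cell. Whether either one recovers the true \(\delta\) depends on parallel trends, which the simulation assumes and real data cannot guarantee.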
A panel-data route to the same answer. When we have panel data
Card and Krueger: the minimum wage
The most famous DiD study in economics is Card and Krueger’s analysis of the minimum wage. In April 1992 New Jersey raised its minimum wage from $4.25 to $5.05 an hour, while neighboring Pennsylvania held steady at $4.25. Card and Krueger surveyed full-time-equivalent (FTE) employment at fast-food restaurants in both states, before and after the increase. New Jersey is the treatment group; Pennsylvania, just across the border, is the natural control.
Computing the four cell means and the difference of differences:
data(njmin3) # nj = NJ (treatment), d = after, fte = employment
njmin3 |>
group_by(state = ifelse(nj == 1, "NJ", "PA"),
period = ifelse(d == 1, "after", "before")) |>
summarise(fte = mean(fte, na.rm = TRUE), .groups = "drop")
#> # A tibble: 4 x 3
#> state period fte
#> <chr> <chr> <dbl>
#> 1 NJ after 21.0
#> 2 NJ before 20.4
#> 3 PA after 21.2
#> 4 PA before 23.3
Plugging the means into the DiD formula, \[
\hat\delta = (21.03 - 20.44)_{\text{NJ}} - (21.17 - 23.33)_{\text{PA}} = +2.75 .
\] Equivalently, run the one-line interaction regression, where d_nj is the \(\text{TREAT}\times\text{AFTER}\) term:
coef(summary(lm(fte ~ nj + d + d_nj, njmin3)))
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.331169 1.071870 21.766795 1.163534e-82
#> nj -2.891761 1.193524 -2.422877 1.562199e-02
#> d -2.165584 1.515853 -1.428625 1.535074e-01
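#> d_nj 2.753606 1.688409 1.630888 1.033126e-01
The interaction coefficient is about +2.75 FTE. Employment in New Jersey did not fall after the minimum-wage hike; if anything it rose relative to Pennsylvania, although with a p-value of about 0.10 the estimate is not statistically significant at the 5% level.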
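As a check on how the regression and the table of means fit together, the sketch below rebuilds the four cell means from the coefficients printed above and recomputes the DiD estimate. The rough 95% interval uses the normal critical value 1.96, which is an approximation chosen here for simplicity.

```r
# Coefficients and standard error copied from the regression output
b  <- c(intercept = 23.331169, nj = -2.891761, d = -2.165584, d_nj = 2.753606)
se <- 1.688409                      # standard error of d_nj

cells <- c(
  PA_before = b[["intercept"]],
  PA_after  = b[["intercept"]] + b[["d"]],
  NJ_before = b[["intercept"]] + b[["nj"]],
  NJ_after  = b[["intercept"]] + b[["nj"]] + b[["d"]] + b[["d_nj"]]
)
round(cells, 2)                     # matches the four cell means in the text

# Difference of differences from the rebuilt cells equals the interaction coefficient
did <- (cells[["NJ_after"]] - cells[["NJ_before"]]) -
       (cells[["PA_after"]] - cells[["PA_before"]])

c(did = did, t = did / se, lower = did - 1.96 * se, upper = did + 1.96 * se)
```

The interval straddles zero, which is the same message as the p-value: the data rule out a large employment loss more convincingly than they establish a gain.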
ck <- njmin3 |>
group_by(state = ifelse(nj == 1, "NJ (treated)", "PA (control)"),
period = ifelse(d == 1, "After", "Before")) |>
summarise(fte = mean(fte, na.rm = TRUE), .groups = "drop") |>
mutate(period = factor(period, levels = c("Before", "After")))
pa_change <- with(subset(ck, state == "PA (control)"),
fte[period == "After"] - fte[period == "Before"])
nj_before <- with(subset(ck, state == "NJ (treated)"), fte[period == "Before"])
cf <- data.frame(period = factor(c("Before", "After"),
levels = c("Before", "After")),
fte = c(nj_before, nj_before + pa_change),
state = "NJ counterfactual")
ggplot(ck, aes(period, fte, group = state, color = state)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
geom_line(data = cf, aes(period, fte, group = state),
color = ucla$red, linetype = "dashed", linewidth = 1) +
geom_point(data = cf, aes(period, fte), color = ucla$red, size = 2) +
scale_color_manual(values = c("NJ (treated)" = ucla$red,
"PA (control)" = ucla$blue)) +
labs(x = NULL, y = "Mean FTE employment", color = NULL)
20.5 Correlation vs. causation, revisited
We can now answer the question that has shadowed the entire course.
A coefficient is causal only when the regressor is (as-if) randomly assigned.
There are several routes to as-good-as-random variation. An RCT engineers randomization directly; a natural experiment analyzed with difference-in-differences borrows “as-if” randomness; and controls and proxies aim to make treatment as-good-as-random given observables.
And we should always beware spurious correlation. Maine’s divorce rate and U.S. per-capita margarine consumption move together with a correlation of \(0.99\), yet neither causes the other and the relationship means nothing. A high correlation is not evidence of causation.
Because most economic data are observational, a credible causal claim demands a credible source of variation: a randomized experiment, a clean natural experiment, or a convincing argument that controlling for observables makes treatment as-good-as-random. It does not come from statistical significance alone.
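Spurious correlation is easy to manufacture. In the sketch below, two series are generated independently and have nothing to do with each other; each simply drifts downward over time. The series, their slopes, and the noise levels are all invented for illustration.

```r
set.seed(103)
years <- 1:30
# Independent noise for two unrelated series
noise_a  <- rnorm(30, sd = 0.3)
noise_b  <- rnorm(30, sd = 0.3)
series_a <- 10 - 0.2 * years + noise_a    # one declining rate
series_b <-  8 - 0.2 * years + noise_b    # another, unrelated declining rate

c(cor_levels = cor(series_a, series_b),   # driven entirely by the shared time trend
  cor_noise  = cor(noise_a, noise_b))     # the parts unrelated to time
```

The levels are almost perfectly correlated because both trend the same way, while the components that are not driven by time show no such relationship.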
20.6 Recap
The potential-outcomes framework gives each individual two outcomes, \(y_{1i}\) and \(y_{0i}\), of which we only ever see one.
| Strategy | How it gets causal | Example |
|---|---|---|
| RCT | engineers randomization | Project STAR: small class \(+13.9\) pts |
| DiD / natural experiment | borrows “as-if” randomness | Card and Krueger: minimum wage, \(\hat\delta \approx +2.75\) FTE |
| Controls & proxies | as-good-as-random given observables | regression with confounders |
Randomization removes selection bias, which is why the RCT is the gold standard; when we cannot randomize, difference-in-differences recovers \(\delta\) as the coefficient on \(\text{TREAT}\times\text{AFTER}\), under the parallel-trends assumption. The through-line is simple: a coefficient is causal exactly when its regressor is exogenous.
That is also the arc of the whole course. We built up probability and the CLT, then simple regression
Next time: beyond ECON 103