\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

20  Treatment Effects & Difference-in-Differences

Reading. SW 13.1, 13.3–13.4, HGL 7.5–7.6

From the very first scatterplots to the omitted-variable bias of model specification, one warning has recurred all term: correlation is not causation. A regression slope measures association. It is a causal effect only if the regressor is exogenous, meaning uncorrelated with everything in the error term (the SR2/MR2 assumption). This final chapter closes the loop and asks the question we have circled all along: when is a coefficient a causal effect?

The cleanest answer comes from the potential-outcomes framework. We use it to define a treatment effect and the average treatment effect (ATE), to see that selection bias is nothing more than omitted-variable bias in disguise, to understand why randomized experiments are the gold standard (with Project STAR as the example), and to learn difference-in-differences for the common case where randomizing is impossible (with Card and Krueger’s minimum-wage study as the example). We end where the course began: a final word on correlation versus causation.

20.1 The potential-outcomes framework

Let \(d_i = 1\) if individual \(i\) is treated and \(d_i = 0\) if not. The potential-outcomes framework asks us to imagine both futures for each person: \[ y_{1i} = \text{outcome if treated}, \qquad y_{0i} = \text{outcome if not treated}. \] The causal effect for individual \(i\) is the difference between these two worlds, \(y_{1i} - y_{0i}\): how much the treatment changes that one person’s outcome.

The catch is that the world only ever lets us see one of the two outcomes for any given person. A person who takes the treatment reveals \(y_{1i}\); their \(y_{0i}\), what would have happened without treatment, is gone. A person who does not take it reveals \(y_{0i}\), and their \(y_{1i}\) is gone. The unseen outcome is the counterfactual.

The fundamental problem of causal inference

We only ever observe one outcome per person. What we see is \[ y_i = y_{0i} + (y_{1i} - y_{0i})\,d_i , \] which equals \(y_{1i}\) when \(d_i = 1\) and \(y_{0i}\) when \(d_i = 0\). The other potential outcome, the counterfactual, is forever missing. Because of this, the individual effect \(y_{1i} - y_{0i}\) is unknowable.
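The observation rule can be made concrete with a small sketch in R. The five people and their potential outcomes below are made-up numbers for illustration, not data from the chapter; in real data the columns `y0` and `y1` would never both be available.

```r
# Hypothetical potential outcomes for five people (illustrative numbers only)
y0 <- c(10, 12, 9, 15, 11)   # outcome without treatment
y1 <- c(13, 12, 8, 19, 14)   # outcome with treatment
d  <- c(1, 0, 1, 0, 1)       # who actually takes the treatment

# Observed outcome: y_i = y0_i + (y1_i - y0_i) * d_i
y_obs <- y0 + (y1 - y0) * d

# The counterfactual is the potential outcome we do NOT get to see
y_cf <- ifelse(d == 1, y0, y1)

data.frame(d, y0, y1, y_obs, y_cf, effect = y1 - y0)
```

Only `d` and `y_obs` correspond to what a researcher observes; the `effect` column, which here is positive for some people, zero for one, and negative for another, is exactly what the fundamental problem hides.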

The framework’s real gift is honesty. Effects genuinely differ across people: a drug helps some patients and harms others, a small class boosts some children more than others. Since we can never recover any single individual’s effect, we must settle for an average effect rather than an individual one.

Average treatment effect (ATE)

The average treatment effect is the mean of the individual effects across the whole population: \[ \tau_{\text{ATE}} = \E(y_{1i} - y_{0i}). \]

Because individual effects are hidden but averages are not, the ATE is the natural target. The most natural estimator simply compares the average outcome of the treated group with that of the control group, and this comparison is exactly a regression on a treatment dummy: \[ y_i = \alpha + \tau\,d_i + e_i \qquad\Longrightarrow\qquad \hat\tau = \bar y_{\text{treated}} - \bar y_{\text{control}} . \] This is the difference estimator. It is the same dummy-variable regression we met in the chapter on dummy variables, now carrying a causal name. The real question is whether \(\hat\tau\) actually estimates \(\tau_{\text{ATE}}\). Sometimes it does; often it does not, and the gap between them is selection bias.
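A quick check that the difference in group means and the dummy-regression slope are the same number. The six observations are invented for illustration.

```r
# Toy observed data (illustrative numbers, not from the chapter)
d <- c(1, 0, 1, 0, 1, 0)
y <- c(13, 12, 8, 15, 14, 9)

# Difference estimator: treated mean minus control mean
tau_hat_means <- mean(y[d == 1]) - mean(y[d == 0])

# Same quantity from the dummy regression y = alpha + tau * d + e
fit <- lm(y ~ d)
tau_hat_ols <- unname(coef(fit)["d"])

c(means = tau_hat_means, ols = tau_hat_ols)
all.equal(tau_hat_means, tau_hat_ols)   # agreement up to floating-point error
```

The intercept of the same regression is the control-group mean, so the fitted model simply reproduces the two group averages.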

20.2 Selection bias

To see what the difference estimator really captures, split the gap in group averages into two pieces: \[ \underbrace{\E(y\given d{=}1) - \E(y\given d{=}0)}_{\text{what } \hat\tau \text{ estimates}} = \underbrace{\E(y_{1}-y_{0}\given d{=}1)}_{\text{effect on the treated (ATT)}} + \underbrace{\bigl[\E(y_{0}\given d{=}1) - \E(y_{0}\given d{=}0)\bigr]}_{\text{selection bias}}. \] The first piece is the average treatment effect on the treated (ATT), a genuine causal quantity. The second piece is the troublemaker.
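The split follows in two steps from the observation rule: treated people reveal \(y_1\) and untreated people reveal \(y_0\), and we then add and subtract the unobserved mean \(\E(y_0\given d{=}1)\): \[ \begin{aligned} \E(y\given d{=}1) - \E(y\given d{=}0) &= \E(y_{1}\given d{=}1) - \E(y_{0}\given d{=}0) \\ &= \E(y_{1}\given d{=}1) - \E(y_{0}\given d{=}1) + \E(y_{0}\given d{=}1) - \E(y_{0}\given d{=}0) \\ &= \E(y_{1}-y_{0}\given d{=}1) + \bigl[\E(y_{0}\given d{=}1) - \E(y_{0}\given d{=}0)\bigr]. \end{aligned} \]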

Selection bias

Selection bias is the difference in the two groups’ untreated potential outcomes, \[ \E(y_{0}\given d{=}1) - \E(y_{0}\given d{=}0). \] It measures how non-comparable the treated and control groups were to begin with, before any treatment was applied.

A classic illustration shows just how badly the difference estimator can mislead.

Do hospitals make you sicker?

In a health survey, people who had recently been hospitalized rated their own health worse (an average of \(3.21\)) than people who had not been hospitalized (\(3.93\)). Taken literally, the naive difference \(3.21 - 3.93 = -0.72\) says that going to the hospital damages your health. But of course hospitals do not cause illness; sick people select into hospitals. The treated and control groups had wildly different untreated health to start with, so almost all of that \(-0.72\) is selection bias, not a treatment effect.
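The same logic can be reproduced in a toy example where, unlike in the survey, both potential outcomes are known. The six health scores are made up, and the treatment is assumed to help everyone by exactly \(0.5\).

```r
# Illustrative potential outcomes: treatment helps everyone by +0.5
y0 <- c(2, 2.5, 3, 4, 4.5, 5)     # health without treatment
y1 <- y0 + 0.5                    # health with treatment
d  <- as.numeric(y0 < 3.5)        # the least healthy select into treatment
y  <- y0 + (y1 - y0) * d          # what a survey would record

naive     <- mean(y[d == 1]) - mean(y[d == 0])      # difference estimator
att       <- mean((y1 - y0)[d == 1])                # effect on the treated
selection <- mean(y0[d == 1]) - mean(y0[d == 0])    # selection bias

c(naive = naive, att = att, selection = selection)  # naive = att + selection
```

Here the naive comparison is \(-1.5\) even though the treatment helps everyone: a true effect of \(+0.5\) is swamped by a selection bias of \(-2\).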

Why call it selection bias and not something new? Because it is exactly the omitted-variable bias of earlier chapters. In the regression \(y_i = \alpha + \tau d_i + e_i\), the error \(e\) absorbs everything we left out, including each person’s pre-existing health. If sicker people choose treatment, then \(d_i\) is correlated with \(e_i\): \[ \E(e_i\given d_i) \neq 0 \quad\Longrightarrow\quad \hat\tau \text{ is biased.} \] This is precisely the endogeneity / omitted-variable problem from earlier in the course (see the chapters on OLS properties and model specification): a confounder (here, health) is correlated with the “regressor” (here, treatment) and also drives the outcome.
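A minimal sketch of the omitted-variable reading, again with invented numbers. The variable `h` stands for baseline health, which in practice is unobserved; it is included here only to show that the short regression's bias equals the effect of `h` on the outcome times the slope of `h` on treatment.

```r
# Illustrative numbers only: h is baseline health; the sicker half is treated
h <- c(2, 2.5, 3, 4, 4.5, 5)
d <- c(1, 1, 1, 0, 0, 0)
y <- h + 0.5 * d                 # true treatment effect is +0.5 for everyone

short <- coef(lm(y ~ d))         # omits h, so h sits in the error term
long  <- coef(lm(y ~ d + h))     # includes the confounder

# Omitted-variable bias = (effect of h on y) * (slope of h on d)
gap_in_h <- unname(coef(lm(h ~ d))["d"])
bias     <- unname(long["h"]) * gap_in_h

c(short = unname(short["d"]), long = unname(long["d"]), bias = bias)
```

The short-regression coefficient equals the long-regression coefficient plus the bias term, which is the selection bias under another name.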

The whole game

A treatment-dummy coefficient is causal only if treatment is uncorrelated with the omitted factors, that is, only if \(d_i\) is (as-if) randomly assigned. So the central question becomes: how do we get treatment to behave as though it were assigned at random? Sometimes we can engineer it directly.

20.3 Randomized experiments

If we randomly assign treatment, then \(d_i\) becomes statistically independent of the potential outcomes. The treated and control groups become comparable in expectation: they have the same average untreated outcome, \(\E(y_0\given d{=}1) = \E(y_0\given d{=}0)\), so selection bias vanishes, and the same average effect, so the ATT equals the ATE. The difference estimator is then unbiased for the causal effect: \[ \E(\hat\tau) = \E(y\given d{=}1) - \E(y\given d{=}0) = \tau_{\text{ATE}} . \]
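A small simulation sketch contrasts self-selection with random assignment. Everything here is assumed for illustration: the sample size, the seed, a standard-normal baseline `h`, and a constant true effect of 2.

```r
set.seed(103)                      # arbitrary seed, for reproducibility
n  <- 10000
h  <- rnorm(n)                     # unobserved baseline health
y0 <- h                            # untreated potential outcome
y1 <- y0 + 2                       # treated potential outcome; true ATE = 2

# (a) Self-selection: people with low h are more likely to take treatment
d_sel  <- as.numeric(h + rnorm(n) < 0)
# (b) Random assignment: a fair coin flip, unrelated to h
d_rand <- rbinom(n, size = 1, prob = 0.5)

diff_in_means <- function(d) {
  y <- y0 + (y1 - y0) * d          # observed outcome under assignment d
  mean(y[d == 1]) - mean(y[d == 0])
}

tau_sel  <- diff_in_means(d_sel)   # pulled well below 2 by selection bias
tau_rand <- diff_in_means(d_rand)  # close to the true ATE of 2
c(self_selected = tau_sel, randomized = tau_rand)
```

The potential outcomes are identical in both scenarios; only the assignment rule changes, and that alone decides whether the difference estimator lands near the truth.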

The gold standard

The randomized controlled experiment (RCT) is the gold standard of causal inference: the benchmark against which every observational study is judged. RCTs are common in medicine (a new drug versus a placebo) but rarer in economics, where they can be costly, unethical, or simply infeasible. When randomization is available, the simple difference estimator (an ordinary dummy regression) recovers the causal effect with no fancy econometrics required.

A celebrated economic example is Project STAR, a genuine RCT run in Tennessee from 1985 to 1989. Kindergartners were randomly assigned, within their own schools, to either a small class (13–17 students) or a regular class (22–25 students). We compare total test scores across the two class types with a single dummy regression, \[ \text{TOTALSCORE} = \beta_1 + \beta_2\,\text{SMALL} + e . \]

data(star)
star_sr <- subset(star, aide == 0)        # small vs. regular classes
coef(summary(lm(totalscore ~ small, star_sr)))
#>              Estimate Std. Error    t value     Pr(>|t|)
#> (Intercept) 918.04289   1.667157 550.663750 0.000000e+00
#> small        13.89899   2.446592   5.680962 1.441473e-08

The coefficient on small is about 13.9 points, and it is highly significant. Because assignment was random, \(\hat\beta_2\) genuinely is the causal effect of being in a small class. It is not contaminated by which families happened to choose small classes, since families did not get to choose.

A telltale sign of good randomization is what happens when we add controls. Throw in teacher experience, for instance, and the estimate barely budges, moving from \(13.9\) to about \(14.0\):

coef(summary(lm(totalscore ~ small + tchexper, star_sr)))["small", ]
#>     Estimate   Std. Error      t value     Pr(>|t|) 
#> 1.398327e+01 2.437332e+00 5.737121e+00 1.039372e-08

Under successful randomization, controls are uncorrelated with the treatment, so adding them does not change the treatment coefficient; it only shrinks its standard error a little. We can check the randomization directly: regress small on student traits with a linear probability model (the dummy-variable regression idea applied to a \(0/1\) outcome). If assignment was truly random, no trait should predict it.

lpm <- lm(small ~ boy + white_asian + freelunch, star_sr)
mean(fitted(lpm))   # average predicted probability of small class
#> [1] 0.4643334

The mean fitted probability is \(\hat p \approx 0.46\), close to a coin flip; by construction it equals the share of students in the sample who were assigned to small classes. What matters for the check is that the predictors carry little explanatory weight. Assignment really was (as good as) random. Figure 20.1 shows the bottom line: average total scores are visibly higher in the small classes, and randomization licenses us to read that gap as causal.

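library(dplyr)     # mutate(), group_by(), summarise()
library(ggplot2)   # plotting; `ucla` is assumed to be a color list defined elsewhere
star_means <- star_sr |>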
  mutate(class = factor(small, levels = c(0, 1),
                        labels = c("Regular", "Small"))) |>
  group_by(class) |>
  summarise(score = mean(totalscore), .groups = "drop")
ggplot(star_means, aes(class, score)) +
  geom_col(fill = ucla$blue, color = ucla$darkblue, width = 0.6) +
  geom_text(aes(label = round(score, 0)), vjust = -0.5,
            color = ucla$darkblue, size = 3.6) +
  coord_cartesian(ylim = c(880, 940)) +
  labs(x = "Class type", y = "Mean total score")
Figure 20.1: Project STAR: average total test score by class type. Random assignment makes the gap a causal effect of small classes.
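The two randomization checks used in this section (no trait predicts assignment, and controls barely move the treatment coefficient) can be sketched on simulated data. The variable names, coefficients, and noise level below are assumptions chosen for illustration and are not estimates from the STAR data.

```r
set.seed(1985)                               # arbitrary seed
n     <- 5000
trait <- rbinom(n, 1, 0.5)                   # hypothetical 0/1 student trait
small <- rbinom(n, 1, 0.5)                   # random assignment, ignores trait
score <- 900 + 14 * small + 20 * trait + rnorm(n, sd = 30)   # made-up outcome

# Balance check: linear probability model of assignment on the trait
balance <- lm(small ~ trait)
b_trait <- unname(coef(balance)["trait"])    # should be near zero

# Treatment coefficient without and with the control
b_short <- unname(coef(lm(score ~ small))["small"])
b_long  <- unname(coef(lm(score ~ small + trait))["small"])

c(trait_predicts_small = b_trait, short = b_short, long = b_long)
```

Because `small` is generated independently of `trait`, the balance slope is close to zero and the two treatment coefficients nearly coincide, which is the pattern the STAR regressions display.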

20.4 Difference-in-differences

Most economic data are observational, not experimental: we cannot randomly assign tax rates, minimum wages, or schooling. But sometimes the world runs an experiment for us: a policy change strikes one group and spares another, creating a natural (quasi-)experiment in which treatment is “as if” random.

Difference-in-differences (DiD) exploits exactly this situation, using before/after data on a treatment group and a comparison group. The idea is to measure the treatment group’s change over time, then subtract off the comparison group’s change over the same period: \[ \hat\delta = \bigl(\bar y^{\text{treat}}_{\text{after}} - \bar y^{\text{treat}}_{\text{before}}\bigr) - \bigl(\bar y^{\text{ctrl}}_{\text{after}} - \bar y^{\text{ctrl}}_{\text{before}}\bigr). \] The control group’s change captures the common trend: everything that would have moved both groups even without the policy. Netting it out leaves only the treatment’s own effect.

Figure 20.2 makes the logic visual. The treatment group rises from \(B\) to \(C\), while the control group drifts from \(A\) to \(E\). If the treatment group would have followed the same trend as the control group absent the policy, it would have ended at the counterfactual point \(D\) (the dashed line \(BD\), parallel to \(AE\)). The DiD estimate \(\hat\delta\) is the vertical gap between where the treatment group actually landed (\(C\)) and where it would have landed without treatment (\(D\)).

lines_df <- data.frame(
  time  = c(1, 2, 1, 2, 1, 2),
  y     = c(1.5, 2.2, 2.5, 4.5, 2.5, 3.2),
  group = c("Control", "Control", "Treatment", "Treatment",
            "Counterfactual", "Counterfactual")
)
labs_df <- data.frame(
  time  = c(1, 2, 1, 2, 2),
  y     = c(1.5, 2.2, 2.5, 4.5, 3.2),
  lab   = c("A", "E", "B", "C", "D")
)
ggplot(lines_df, aes(time, y, group = group)) +
  geom_line(aes(color = group, linetype = group), linewidth = 1) +
  geom_point(data = subset(lines_df, group != "Counterfactual"),
             aes(color = group), size = 2) +
  geom_text(data = labs_df, aes(time, y, label = lab), inherit.aes = FALSE,
            nudge_x = ifelse(labs_df$time == 1, -0.07, 0.07),
            color = ucla$darkblue, size = 3.6) +
  annotate("segment", x = 2.12, xend = 2.12, y = 3.2, yend = 4.5,
           arrow = arrow(ends = "both", length = unit(0.15, "cm")),
           color = ucla$darkblue) +
  annotate("text", x = 2.2, y = 3.85, label = "delta",
           parse = TRUE, color = ucla$darkblue, size = 4) +
  scale_color_manual(values = c(Control = ucla$blue,
                                Treatment = ucla$red,
                                Counterfactual = ucla$red)) +
  scale_linetype_manual(values = c(Control = "solid",
                                   Treatment = "solid",
                                   Counterfactual = "dashed")) +
  scale_x_continuous(breaks = c(1, 2), labels = c("Before", "After"),
                     limits = c(0.85, 2.35)) +
  labs(x = NULL, y = "y", color = NULL, linetype = NULL)
Figure 20.2: Difference-in-differences. The dashed line BD is the counterfactual: where the treatment group would have gone under the parallel-trends assumption. \(\hat\delta\) is the gap CD.
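The arithmetic behind Figure 20.2 can be written out directly. The four values are the stylized coordinates of points \(A\), \(E\), \(B\), and \(C\) used to draw the figure; the variable names are chosen here for readability.

```r
# Group means read off Figure 20.2 (stylized values, not real data)
ctrl_before  <- 1.5   # point A
ctrl_after   <- 2.2   # point E
treat_before <- 2.5   # point B
treat_after  <- 4.5   # point C

common_trend   <- ctrl_after - ctrl_before       # change with no treatment
counterfactual <- treat_before + common_trend    # point D under parallel trends
delta_hat      <- (treat_after - treat_before) - common_trend

c(D = counterfactual, delta_hat = delta_hat)
```

The estimate can equally be read as `treat_after - counterfactual`, the vertical gap \(CD\) in the figure.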

DiD as one regression

The whole estimator collapses into a single regression once we introduce a treatment dummy and a time dummy. Then \(\hat\delta\) is simply the estimated coefficient on their interaction: \[ y_{it} = \beta_1 + \beta_2\,\text{TREAT}_i + \beta_3\,\text{AFTER}_t + \delta\,(\text{TREAT}_i \times \text{AFTER}_t) + e_{it} . \] Reading off the pieces: \(\beta_2\) is the fixed gap between the two groups, \(\beta_3\) is the common time trend shared by both, and \(\delta\), the interaction coefficient, is the treatment effect. This is exactly the slope-dummy / interaction idea from the chapter on dummy variables and interactions, now put to causal work.
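To see why the interaction coefficient is the difference of differences, write out the four group means implied by the regression, assuming \(\E(e_{it}\given \text{TREAT}_i, \text{AFTER}_t) = 0\): \[ \begin{aligned} \E(y\given \text{TREAT}{=}0,\ \text{AFTER}{=}0) &= \beta_1, & \E(y\given \text{TREAT}{=}0,\ \text{AFTER}{=}1) &= \beta_1 + \beta_3, \\ \E(y\given \text{TREAT}{=}1,\ \text{AFTER}{=}0) &= \beta_1 + \beta_2, & \E(y\given \text{TREAT}{=}1,\ \text{AFTER}{=}1) &= \beta_1 + \beta_2 + \beta_3 + \delta . \end{aligned} \] Taking the treatment group’s change and subtracting the control group’s change gives \[ \bigl[(\beta_1 + \beta_2 + \beta_3 + \delta) - (\beta_1 + \beta_2)\bigr] - \bigl[(\beta_1 + \beta_3) - \beta_1\bigr] = (\beta_3 + \delta) - \beta_3 = \delta . \]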

The parallel-trends assumption

DiD rests on one key assumption: parallel trends. Absent the treatment, the two groups would have moved together, so the control group’s change is a valid stand-in for what would have happened to the treatment group. This is the counterfactual line \(BD\) in Figure 20.2. It is an assumption, not something the data can prove, and the whole causal claim hinges on its credibility.

A panel-data route to the same answer. When we have panel data (the same units observed both before and after), we can instead take first differences. Differencing each unit against itself sweeps out every fixed unit characteristic \(c_i\), no matter whether we measured it, and delivers the same \(\hat\delta\) by another path.
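A minimal sketch of the first-difference route on a made-up balanced panel of four units observed twice. The data frame and its column names are hypothetical.

```r
# Hypothetical balanced panel: 4 units, 2 periods (illustrative numbers)
panel <- data.frame(
  id       = c(1, 2, 3, 4),
  treat    = c(1, 1, 0, 0),
  y_before = c(20, 24, 22, 26),
  y_after  = c(23, 26, 21, 25)
)

# Differencing each unit against itself removes its fixed characteristic c_i
panel$dy <- panel$y_after - panel$y_before

# Regress the change on the treatment dummy: the slope is the DiD estimate
fd <- lm(dy ~ treat, data = panel)
delta_fd <- unname(coef(fd)["treat"])

# Same number from the four cell means
delta_means <- with(panel,
  (mean(y_after[treat == 1]) - mean(y_before[treat == 1])) -
  (mean(y_after[treat == 0]) - mean(y_before[treat == 0])))

c(first_difference = delta_fd, cell_means = delta_means)
```

The intercept of the differenced regression is the control group's average change, which plays the role of \(\beta_3\) in the interaction regression.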

Card and Krueger: the minimum wage

The most famous DiD study in economics is Card and Krueger’s analysis of the minimum wage. In April 1992 New Jersey raised its minimum wage from $4.25 to $5.05 an hour, while neighboring Pennsylvania held steady at $4.25. Card and Krueger surveyed full-time-equivalent (FTE) employment at fast-food restaurants in both states, before and after the increase. New Jersey is the treatment group; Pennsylvania, just across the border, is the natural control.

Computing the four cell means and the difference of differences:

library(dplyr)   # group_by(), summarise()
data(njmin3)     # nj = NJ (treatment), d = after, fte = employment
njmin3 |>
  group_by(state = ifelse(nj == 1, "NJ", "PA"),
           period = ifelse(d == 1, "after", "before")) |>
  summarise(fte = mean(fte, na.rm = TRUE), .groups = "drop")
#> # A tibble: 4 x 3
#>   state period   fte
#>   <chr> <chr>  <dbl>
#> 1 NJ    after   21.0
#> 2 NJ    before  20.4
#> 3 PA    after   21.2
#> 4 PA    before  23.3

Plugging the means into the DiD formula, \[ \hat\delta = (21.03 - 20.44)_{\text{NJ}} - (21.17 - 23.33)_{\text{PA}} = +2.75 . \] Equivalently, run the one-line interaction regression (d_nj is the \(\text{TREAT}\times\text{AFTER}\) term) and read off its coefficient:

coef(summary(lm(fte ~ nj + d + d_nj, njmin3)))
#>              Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept) 23.331169   1.071870 21.766795 1.163534e-82
#> nj          -2.891761   1.193524 -2.422877 1.562199e-02
#> d           -2.165584   1.515853 -1.428625 1.535074e-01
#> d_nj         2.753606   1.688409  1.630888 1.033126e-01

The interaction coefficient is about +2.75 FTE. Employment in New Jersey did not fall after the minimum-wage hike; if anything it rose slightly, although the estimate is not statistically significant at conventional levels (\(p \approx 0.10\)). This runs contrary to the textbook competitive prediction that a higher wage floor should cut employment. The result is robust: it survives adding controls for chain, ownership, and region, and it holds up under a panel first-difference specification, with \(\hat\delta \approx 2.75\) throughout. Figure 20.3 displays the two trajectories and the counterfactual.

ck <- njmin3 |>
  group_by(state = ifelse(nj == 1, "NJ (treated)", "PA (control)"),
           period = ifelse(d == 1, "After", "Before")) |>
  summarise(fte = mean(fte, na.rm = TRUE), .groups = "drop") |>
  mutate(period = factor(period, levels = c("Before", "After")))

pa_change <- with(subset(ck, state == "PA (control)"),
                  fte[period == "After"] - fte[period == "Before"])
nj_before <- with(subset(ck, state == "NJ (treated)"), fte[period == "Before"])
cf <- data.frame(period = factor(c("Before", "After"),
                                 levels = c("Before", "After")),
                 fte = c(nj_before, nj_before + pa_change),
                 state = "NJ counterfactual")

ggplot(ck, aes(period, fte, group = state, color = state)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  geom_line(data = cf, aes(period, fte, group = state),
            color = ucla$red, linetype = "dashed", linewidth = 1) +
  geom_point(data = cf, aes(period, fte), color = ucla$red, size = 2) +
  scale_color_manual(values = c("NJ (treated)" = ucla$red,
                                "PA (control)" = ucla$blue)) +
  labs(x = NULL, y = "Mean FTE employment", color = NULL)
Figure 20.3: Card–Krueger DiD. NJ (treated) employment holds up while PA (control) falls; the gap between NJ’s actual path and its parallel-trends counterfactual is \(\hat\delta \approx +2.75\) FTE.
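To gauge the uncertainty around the interaction estimate, a rough 95% interval can be built from the estimate and standard error printed in the d_nj row of the regression output above. The sketch uses the normal critical value as an approximation and relies only on those two reported numbers.

```r
# Estimate and standard error copied from the d_nj row of the regression output
est <- 2.753606
se  <- 1.688409

# Approximate 95% interval using the normal critical value (about 1.96)
ci <- est + c(-1, 1) * qnorm(0.975) * se
round(ci, 2)
```

The interval runs from roughly \(-0.6\) to \(6.1\) FTE, so it includes zero: the data do not show an employment loss, but they also cannot pin down a gain precisely.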

20.5 Correlation vs. causation, revisited

We can now answer the question that has shadowed the entire course.

When is a regression coefficient causal?

A coefficient is causal only when the regressor is (as-if) randomly assigned, that is, exogenous, so that it is uncorrelated with everything in the error term. Every technique in the course has been a way to reach that condition, or to check whether it holds.

There are several routes to as-good-as-random variation. An RCT engineers randomization directly; it is the gold standard, exemplified by Project STAR. Difference-in-differences and natural experiments borrow “as-if” randomness from a policy change that hits one group and not another, as in Card–Krueger. Controls and proxies (from model specification) try to make treatment as-good-as-random conditional on the observed variables.

And we should always beware spurious correlation. Maine’s divorce rate and U.S. per-capita margarine consumption move together with a correlation of \(0.99\), yet neither causes the other and the relationship means nothing. A high correlation, or a small \(p\)-value on a slope, is never enough.
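One common source of spurious correlation is a shared time trend. The sketch below builds two unrelated series that both drift downward; the numbers are invented for illustration and are not the Maine or margarine data.

```r
# Two made-up series that share nothing except a downward drift over ten years
year <- 1:10
a <- 5 - 0.10 * year +
  c(0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.02, -0.01, 0.00, 0.01)
b <- 8 - 0.40 * year +
  c(-0.05, 0.04, -0.02, 0.06, -0.03, 0.01, -0.04, 0.05, 0.02, -0.01)

r_raw <- cor(a, b)                       # dominated by the common trend

# Remove each series' own linear trend, then correlate what is left
r_detrended <- cor(resid(lm(a ~ year)), resid(lm(b ~ year)))

c(raw = r_raw, detrended = r_detrended)
```

The raw correlation is very close to one, while the detrended correlation is not even positive: the apparent relationship comes entirely from both series moving with time.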

Because most economic data are observational, a credible causal claim demands a credible source of variation: a randomized experiment, a clean natural experiment, or a convincing argument that controlling for observables makes treatment as-good-as-random. It does not come from statistical significance alone.

20.6 Recap

The potential-outcomes framework gives each individual two outcomes, \(y_{1i}\) and \(y_{0i}\), of which we only ever see one. This fundamental problem of causal inference forces us to target the average treatment effect \(\tau_{\text{ATE}} = \E(y_1 - y_0)\) rather than any individual effect. The difference estimator for the ATE is just a regression on a treatment dummy, but it recovers the ATE only when there is no selection bias, and selection bias is simply omitted-variable bias: the correlation between treatment and the error.

Three routes from association to causation.
| Strategy | How it gets causal | Example |
|---|---|---|
| RCT | engineers randomization | Project STAR: small class \(+13.9\) pts |
| DiD / natural experiment | borrows “as-if” randomness | Card–Krueger: \(\hat\delta \approx +2.75\) |
| Controls & proxies | as-good-as-random given observables | regression with confounders |

Randomization removes selection bias, which is why the RCT is the gold standard; when we cannot randomize, difference-in-differences recovers \(\delta\) as the coefficient on \(\text{TREAT}\times\text{AFTER}\), under the parallel-trends assumption. The through-line is simple: a coefficient is causal exactly when its regressor is exogenous.

That is also the arc of the whole course. We built up probability and the CLT, then simple regression (how to estimate it, why OLS is BLUE, and how to do inference), then fit and functional form, then multiple regression and its inference, and finally specification, dummy variables, and causality. You can now estimate an economic relationship, quantify its uncertainty, test hypotheses about it, and reason about whether it is causal. That is what it means to do econometrics.

Next time: beyond ECON 103. Instrumental variables, panel data, time series, and probit/logit models for limited dependent variables build directly on the foundation laid here. Good luck on the final.