data(edu_inc)
full <- lm(log(faminc) ~ he + we, data = edu_inc)
short <- lm(log(faminc) ~ he, data = edu_inc)
c(full_he = coef(full)["he"], short_he = coef(short)["he"])
#> full_he.he short_he.he
#> 0.04385462 0.06132256
18 Model Specification, Multicollinearity & Model Selection
Reading. SW 6.1, 7.5, 9.2; HGL 4.3, 6.3
We can now estimate, interpret, and test a multiple regression. But every result so far has quietly assumed that we already have the right model. Choosing it is the hard part of applied work, and it turns on one central tension. Omit a variable that belongs in the model and the coefficients you keep become biased; include a variable that does not belong and their standard errors are inflated.
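This chapter does two things. First, it quantifies the trade-off: the omitted-variable bias formula and its direction, the cost of irrelevant variables, and the role of control variables and proxies. Second, it introduces the diagnostic and selection tools (fit criteria, RESET, Jarque–Bera, and the VIF) and the cautions that go with them.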
18.1 Omitted-variable bias, quantified
Suppose the true model contains two regressors, \[ y = \beta_1 + \beta_2 x + \beta_3 z + e , \] but we omit \(z\) and instead fit the short regression \(y = \beta_1 + \beta_2 x + v\). What happens to our estimate \(b_2\) of the slope on \(x\)? It is biased, and the bias has an exact algebraic form.
If the true model includes \(z\) but we omit it, the OLS slope on \(x\) in the short regression satisfies \[ \mathrm{bias}(b_2) = \E(b_2) - \beta_2 = \beta_3\,\frac{\widehat{\Cov}(x,z)}{\widehat{\Var}(x)} . \]
Each piece of this formula has a clean reading. The ratio \(\widehat{\Cov}(x,z)/\widehat{\Var}(x)\) is exactly the slope from regressing the omitted variable \(z\) on the included variable \(x\). The bias is that slope multiplied by \(\beta_3\), the effect of the omitted variable on \(y\).
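For readers who want to see where the formula comes from, here is a short derivation sketch. It assumes \(\E(e \given x, z) = 0\) and treats \(x\) and \(z\) as given; these assumptions are stated here for the sketch rather than taken from the text above. Substituting the true model into the short-regression slope gives

\[
b_2 = \frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2}
= \beta_2 + \beta_3\,\frac{\sum_i (x_i-\bar x)(z_i-\bar z)}{\sum_i (x_i-\bar x)^2}
+ \frac{\sum_i (x_i-\bar x)\,e_i}{\sum_i (x_i-\bar x)^2} .
\]

The last term has conditional mean zero, and the middle ratio equals \(\widehat{\Cov}(x,z)/\widehat{\Var}(x)\) because the common divisor cancels, so

\[
\E(b_2 \given x, z) = \beta_2 + \beta_3\,\frac{\widehat{\Cov}(x,z)}{\widehat{\Var}(x)} .
\]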
The formula also makes precise the two conditions under which omitting \(z\) does no harm, the same two we met in multiple regression. The bias vanishes if and only if either
- \(\beta_3 = 0\), so that \(z\) is genuinely irrelevant and does not belong in the model in the first place; or
- \(\Cov(x,z) = 0\), so that \(z\) is uncorrelated with the regressor we kept and there is no channel through which its omission can contaminate \(b_2\).
If neither holds, \(b_2\) is biased.
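The two escape conditions can be checked on simulated data. The sketch below is illustrative only: the sample size, coefficients, and the helper name `ovb_demo` are all hypothetical choices, not taken from the text. For each scenario it compares the change in the slope on x with the estimated coefficient on z times the auxiliary slope.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 2000      # hypothetical sample size

# Simulate one data set from y = 1 + 0.5 x + beta3 z + e and compare
# the short-regression slope on x with the full-regression slope.
ovb_demo <- function(beta3, rho) {
  x <- rnorm(n)
  z <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # Corr(x, z) is roughly rho
  y <- 1 + 0.5 * x + beta3 * z + rnorm(n)
  full  <- lm(y ~ x + z)
  short <- lm(y ~ x)
  aux   <- lm(z ~ x)                          # omitted variable on included one
  c(short_minus_full = unname(coef(short)["x"] - coef(full)["x"]),
    b3_times_aux     = unname(coef(full)["z"] * coef(aux)["x"]))
}

res <- rbind(
  both_nonzero = ovb_demo(beta3 = 0.8, rho = 0.6),
  beta3_zero   = ovb_demo(beta3 = 0.0, rho = 0.6),
  cov_zero     = ovb_demo(beta3 = 0.8, rho = 0.0)
)
print(round(res, 4))
```

The two columns coincide in every row, because the decomposition holds exactly in any sample. The change in the slope should be clearly positive in the first row and close to zero in the other two, where one of the two conditions holds in the population.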
Signing the bias
Because the bias is a product, its direction is the product of two signs: \[
\mathrm{sign}\bigl(\mathrm{bias}(b_2)\bigr)
= \mathrm{sign}(\beta_3)\times\mathrm{sign}\bigl(\Cov(x,z)\bigr).
\] This is a remarkably useful fact, because it often lets you reason about the direction of bias even when you have no data on \(z\) at all. If you can argue on economic grounds that the omitted variable has a positive effect on \(y\) and is positively correlated with the regressor you kept, you know the kept coefficient is biased upward.
Consider the household income equation \[
\ln(\text{FAMINC}) = \beta_1 + \beta_2\,\text{HEDU}
+ \beta_3\,\text{WEDU} + e ,
\] where HEDU and WEDU are the husband’s and wife’s years of education. Both education effects are positive, and the two spouses’ education levels are positively correlated. So if we omit WEDU, the formula predicts an upward bias in the HEDU coefficient: \(\beta_3 > 0\) and \(\Cov(\text{HEDU},\text{WEDU})>0\) make the product positive. And that is exactly what we see.
We can reproduce this directly on the edu_inc data. Fitting the full model and then dropping we:
The husband’s coefficient rises from about \(0.044\) in the full model to about \(0.061\) once the wife’s education is omitted, just as the slide reports. We can even verify the bias formula numerically: the estimated \(b_3\) times the slope from regressing we on he should equal the change.
b3 <- coef(full)["we"]
aux_slope <- coef(lm(we ~ he, data = edu_inc))["he"]
c(predicted_bias = unname(b3 * aux_slope),
actual_bias = unname(coef(short)["he"] - coef(full)["he"]))
#> predicted_bias actual_bias
#> 0.01746795 0.01746795
The two agree to rounding. Figure 18.1 shows the sign rule as a simple \(2\times 2\) map: the bias is positive in the two cells where \(\beta_3\) and \(\Cov(x,z)\) share a sign, and negative where they differ.
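# Assumes ggplot2 is installed and that the `ucla` color list is defined in the book's setup code.
library(ggplot2)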
grid <- expand.grid(cov = c("Cov(x,z) < 0", "Cov(x,z) > 0"),
b3 = c("beta3 < 0", "beta3 > 0"))
grid$sign <- c("bias > 0", "bias < 0", "bias < 0", "bias > 0")
ggplot(grid, aes(cov, b3, fill = sign)) +
geom_tile(color = "white", linewidth = 1.2) +
geom_text(aes(label = sign), color = "white", fontface = "bold", size = 4) +
scale_fill_manual(values = c("bias > 0" = ucla$red, "bias < 0" = ucla$blue),
guide = "none") +
labs(x = NULL, y = NULL)
18.2 Irrelevant variables & control variables
If omitting a relevant variable is so costly, why not simply include everything? Because the opposite error has a cost too. An irrelevant variable does not bias the other coefficients, but it inflates their variances.
Take the family-income equation and add two artificial regressors that are correlated with HEDU and WEDU but have no genuine effect on income. Their estimated coefficients come out insignificant, which is exactly what we want from an irrelevant variable. But the standard errors on HEDU and WEDU rise, and the precision of the coefficients we actually care about falls.
data(edu_inc)
clean <- lm(log(faminc) ~ he + we, data = edu_inc)
bloat <- lm(log(faminc) ~ he + we + xtra_x5 + xtra_x6, data = edu_inc)
rbind(
clean = summary(clean)$coefficients[c("he", "we"), "Std. Error"],
bloat = summary(bloat)$coefficients[c("he", "we"), "Std. Error"]
)
#> he we
#> clean 0.008722604 0.01158432
#> bloat 0.013668731 0.02494325
The standard errors on both he and we are noticeably larger once the two irrelevant regressors are in the model.
| | bias | variance |
|---|---|---|
| omit a relevant variable | biased | lower |
| include an irrelevant variable | unbiased | inflated |
Omitting buys precision at the cost of bias; including buys unbiasedness at the cost of precision. Neither extreme is automatically right.
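A small Monte Carlo experiment makes the table concrete. This is a sketch under invented assumptions: the data-generating process, the coefficient values, and the names `one_draw` and `summary_tab` are hypothetical. The true slope on x is 0.5, z is relevant, and w is irrelevant but correlated with x.

```r
set.seed(1)    # arbitrary seed, only for reproducibility
n    <- 100    # observations per simulated sample
reps <- 1000   # number of simulated samples

one_draw <- function() {
  x <- rnorm(n)
  z <- x + 0.6 * rnorm(n)                  # relevant, correlated with x
  w <- x + 0.6 * rnorm(n)                  # irrelevant, correlated with x
  y <- 1 + 0.5 * x + 0.5 * z + rnorm(n)    # w has no effect on y
  c(omit_relevant      = unname(coef(lm(y ~ x))["x"]),
    correct            = unname(coef(lm(y ~ x + z))["x"]),
    include_irrelevant = unname(coef(lm(y ~ x + z + w))["x"]))
}

draws <- t(replicate(reps, one_draw()))    # reps x 3 matrix of slope estimates
summary_tab <- rbind(mean = colMeans(draws), sd = apply(draws, 2, sd))
print(round(summary_tab, 3))
```

In the `mean` row, only the model that omits z should be centered away from 0.5. In the `sd` row, the spread should grow from left to right: smallest when z is omitted, largest when w is added.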
Control variables and proxies
The hardest case is a confounder you cannot measure. The classic example is ability in a wage equation: more able workers tend to get more education and to earn more, so leaving ability out biases the estimated return to education upward. A proxy, an observable variable that is correlated with the unmeasured confounder, can stand in for it.
In the wage equation \(\ln(\text{WAGE})\) on EDUC and EXPER, omitting ability biases the return to education upward. Adding SCORE as a proxy for ability is intended to reduce that bias.
For a proxy to do its job it must satisfy a condition called conditional mean independence: once you control for the proxy, the regressor you care about behaves “as if” randomly assigned with respect to the omitted factor. There is a subtlety worth stressing, though.
A proxy is included only to clean up the coefficient you care about. Its own estimated coefficient should not be interpreted as a causal effect.
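The simulation below sketches how a proxy works when conditional mean independence holds by construction. Everything in it is hypothetical: the variable names, the coefficients, and the assumption that unobserved ability is related to education only through the observed score. The true return to education is set to 0.08.

```r
set.seed(7)    # arbitrary seed, only for reproducibility
n <- 5000

score   <- rnorm(n)                            # observed proxy (hypothetical test score)
ability <- 0.8 * score + rnorm(n, sd = 0.5)    # unobserved; related to educ only via score
educ    <- 12 + 1.5 * score + rnorm(n)         # higher-score workers get more schooling
lwage   <- 1 + 0.08 * educ + 0.20 * ability + rnorm(n, sd = 0.3)

no_proxy   <- lm(lwage ~ educ)                 # ability omitted, no proxy
with_proxy <- lm(lwage ~ educ + score)         # proxy stands in for ability

est <- c(true_return = 0.08,
         no_proxy    = unname(coef(no_proxy)["educ"]),
         with_proxy  = unname(coef(with_proxy)["educ"]),
         score_coef  = unname(coef(with_proxy)["score"]))
print(round(est, 3))
```

Without the proxy, the education coefficient should land well above 0.08; with it, close to 0.08. The coefficient on score mixes the effect of ability with the strength of the score-ability link (here 0.20 times 0.8), which is why it has no causal reading of its own.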
18.3 Causal vs. prediction
Which variables belong in a model depends entirely on why you built it. The two great purposes of regression, causal inference and prediction, call for different choices.
Causal inference. The goal is an unbiased effect. Omitted-variable bias is the enemy, so you include every confounder and control you can. A low \(R^2\) is perfectly acceptable; what matters is that “other things” are genuinely held constant.
Prediction. The goal is an accurate \(\hat y\). You want regressors that are highly correlated with \(y\) and a high \(R^2\). Nothing is being “held constant,” so omitted-variable bias simply does not apply.
This distinction matters precisely because the selection tools we are about to meet measure fit, not causal validity.
A high-scoring predictive model can be a terrible causal model. Economic theory about which variables are confounders must take precedence over any statistic when the goal is a causal effect. Fit criteria inform causal modeling; they do not decide it.
18.4 Diagnostic & selection tools
We now turn to the tools that help us compare and stress-test specifications.
Fit criteria that penalize size
The plain \(R^2\) is useless for choosing how many regressors to include, because it never falls when you add a variable.
The adjusted \(R^2\), \(\bar R^2 = 1 - \dfrac{\text{SSE}/(N-K)}{\text{SST}/(N-1)}\), divides the sums of squares by their degrees of freedom. A new variable raises \(\bar R^2\) only if its \(|t|\)-statistic exceeds \(1\).
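Both facts are easy to confirm on simulated data. In this sketch the data-generating process and variable names are hypothetical; x2 is a regressor of doubtful value added to a model that already contains x1.

```r
set.seed(3)    # arbitrary seed, only for reproducibility
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)                              # candidate regressor of doubtful value
y  <- 1 + 0.5 * x1 + 0.1 * x2 + rnorm(n)

small <- summary(lm(y ~ x1))
big   <- summary(lm(y ~ x1 + x2))

out <- c(R2_small    = small$r.squared,
         R2_big      = big$r.squared,
         adjR2_small = small$adj.r.squared,
         adjR2_big   = big$adj.r.squared,
         abs_t_x2    = abs(big$coefficients["x2", "t value"]))
print(round(out, 4))
```

The plain \(R^2\) cannot be lower in the larger model. Whether the adjusted \(R^2\) rises depends only on whether `abs_t_x2` is above 1.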
The AIC (Akaike information criterion) and BIC (Bayesian / Schwarz information criterion) take a different form: you pick the model that minimizes the criterion. Both add a penalty proportional to \(K\), but BIC’s penalty \(K\ln N / N\) is harsher than AIC’s \(2K/N\) whenever \(N \ge 8\), so BIC favors smaller, more parsimonious models than AIC does. Figure 18.2 shows how the two penalties diverge as the sample grows.
N <- seq(4, 200, by = 1)
pen <- rbind(
data.frame(N = N, penalty = 2 / N, crit = "AIC: 2/N"),
data.frame(N = N, penalty = log(N) / N, crit = "BIC: ln(N)/N")
)
ggplot(pen, aes(N, penalty, color = crit)) +
geom_line(linewidth = 1) +
geom_vline(xintercept = 8, linetype = "dashed", color = ucla$gray) +
annotate("text", x = 8, y = 0.45, label = "N = 8", hjust = -0.1,
color = ucla$gray, size = 3.2) +
scale_color_manual(values = c("AIC: 2/N" = ucla$blue,
"BIC: ln(N)/N" = ucla$red), name = NULL) +
labs(x = "sample size N", y = "penalty per parameter")
These criteria are only comparable across models that share the same dependent variable. You cannot use them to choose between a model in \(y\) and a model in \(\ln y\).
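The penalties \(2K/N\) and \(K\ln N/N\) can be turned into a by-hand calculation. The sketch assumes the per-observation form \(\ln(\text{SSE}/N)\) plus the penalty, which matches the penalties quoted above but is an assumption about the exact definition intended here. The simulated data and the helper name `ic_by_hand` are hypothetical. R's built-in AIC() and BIC() are scaled differently (they are based on the log-likelihood), so their values differ, although for models fit to the same y they should rank the models the same way.

```r
set.seed(11)   # arbitrary seed, only for reproducibility
n  <- 120
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.6 * x1 + 0.3 * x2 + rnorm(n)    # x3 is irrelevant by construction

models <- list(m1 = lm(y ~ x1),
               m2 = lm(y ~ x1 + x2),
               m3 = lm(y ~ x1 + x2 + x3))

# Hypothetical helper: per-observation AIC and BIC for an lm fit
ic_by_hand <- function(fit) {
  N   <- nobs(fit)
  K   <- length(coef(fit))                  # number of estimated coefficients
  sse <- sum(resid(fit)^2)
  c(K   = K,
    AIC = log(sse / N) + 2 * K / N,
    BIC = log(sse / N) + K * log(N) / N)
}

ic <- t(sapply(models, ic_by_hand))         # one row per model
print(round(ic, 4))
```

The preferred model under each criterion is the row with the smallest value in that column. Because N is well above 8, the BIC column exceeds the AIC column in every row.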
RESET: is the functional form wrong?
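The RESET test (Regression Specification Error Test) looks for a missing nonlinearity or interaction. Fit the model, save the fitted values \(\hat y\), add \(\hat y^2\) and \(\hat y^3\) to the regression, and use an \(F\)-test of \(H_0\): the coefficients on the added powers are all zero.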
Why does this work? Because \(\hat y^2\) and \(\hat y^3\) are themselves polynomials in the original regressors. If the true relationship has a curve or an interaction that the linear model misses, these powers will improve the fit and their coefficients will come out nonzero. Rejecting \(H_0\) therefore signals that the model is misspecified.
Failing to reject does not certify that the model is correct. It only means RESET did not catch anything. A clean RESET is reassuring but never conclusive.
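RESET can be carried out by hand in a few lines. The sketch below uses simulated data in which the true relationship is quadratic, so the linear model is misspecified by construction; the data-generating process and object names are hypothetical.

```r
set.seed(5)    # arbitrary seed, only for reproducibility
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + 0.15 * x^2 + rnorm(n)    # true relationship is quadratic

linear <- lm(y ~ x)                          # misspecified: omits the curvature
yhat   <- fitted(linear)
augmented <- lm(y ~ x + I(yhat^2) + I(yhat^3))

# F-test of H0: the coefficients on yhat^2 and yhat^3 are both zero
reset_test <- anova(linear, augmented)
print(reset_test)
reset_p <- reset_test[["Pr(>F)"]][2]
```

Here `reset_p` should be very small, so the test rejects the linear specification. The lmtest package provides a packaged version, resettest(), for routine use.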
Jarque–Bera: are the errors normal?
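Exact small-sample \(t\) and \(F\) inference relies on assumption SR6/MR6, that the errors are normally distributed. The Jarque–Bera test checks this through the skewness \(S\) and kurtosis \(K\) of the residuals (here \(K\) denotes kurtosis, not the number of coefficients): \[ \mathrm{JB} = \frac{N}{6}\left(S^2 + \frac{(K-3)^2}{4}\right) . \] Under the null hypothesis of normality, JB is approximately \(\chi^2\) with 2 degrees of freedom in large samples.
Fitting food_exp ~ income on the food data and testing the residuals gives skewness \(S = -0.10\), kurtosis \(K = 2.99\), and so \(\mathrm{JB} = 0.06\) with \(p = 0.97\). We comfortably fail to reject the null hypothesis of normal errors.
We can compute the Jarque–Bera statistic by hand: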
data(food)
e <- resid(lm(food_exp ~ income, data = food))
N <- length(e)
S <- mean((e - mean(e))^3) / mean((e - mean(e))^2)^(3/2)
K <- mean((e - mean(e))^4) / mean((e - mean(e))^2)^2
JB <- N / 6 * (S^2 + (K - 3)^2 / 4)
c(skewness = S, kurtosis = K, JB = JB, p_value = 1 - pchisq(JB, df = 2))
#> skewness kurtosis JB p_value
#> -0.09731877 2.98903377 0.06334005 0.96882622
The residual histogram in Figure 18.3 confirms the verdict visually: the residuals are roughly symmetric and bell-shaped, hugging the overlaid normal density.
res_df <- data.frame(e = e)
ggplot(res_df, aes(e)) +
geom_histogram(aes(y = after_stat(density)), bins = 12,
fill = ucla$blue, color = ucla$darkblue) +
stat_function(fun = dnorm, args = list(mean = mean(e), sd = sd(e)),
color = ucla$red, linewidth = 1) +
labs(x = "residual", y = "density")
If the errors are not normal, all is not lost. In large samples the central limit theorem makes \(t\)- and \(F\)-inference approximately valid anyway, regardless of the error distribution.
Collinearity: the variance inflation factor
Recall from multiple-regression variance and collinearity that near-collinear regressors blow up standard errors. The variance inflation factor (VIF) quantifies the damage one regressor at a time. The variance of \(b_2\) decomposes as \[ \Var(b_2 \given \mathbf{X}) = \frac{\sigma^2}{\sum_i (x_{i2}-\bar x_2)^2}\cdot \underbrace{\frac{1}{1 - R^2_{2\bullet}}}_{\text{VIF}}, \] where \(R^2_{2\bullet}\) is the \(R^2\) from regressing \(x_2\) on all the other regressors. The closer the other regressors come to explaining \(x_2\), the closer \(R^2_{2\bullet}\) is to \(1\) and the larger the VIF.
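The interpretation is direct. With no collinearity, \(R^2_{2\bullet} = 0\) and \(\text{VIF} = 1\); as \(R^2_{2\bullet}\) approaches \(1\), the VIF grows without bound. A common rule of thumb treats \(\text{VIF} > 10\), equivalently \(R^2_{2\bullet} > 0.9\), as a sign of harmful collinearity.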
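The decomposition can be verified numerically. This sketch uses simulated data with two deliberately correlated regressors; the data-generating process and object names are hypothetical. It computes the VIF for x2 from the auxiliary regression and rebuilds the estimated variance of its coefficient from the two factors in the formula above.

```r
set.seed(9)    # arbitrary seed, only for reproducibility
n  <- 300
x3 <- rnorm(n)
x2 <- 0.9 * x3 + rnorm(n, sd = 0.4)     # x2 is strongly related to x3
y  <- 1 + 0.5 * x2 + 0.5 * x3 + rnorm(n)

fit <- lm(y ~ x2 + x3)
aux <- lm(x2 ~ x3)                       # regress x2 on the other regressor

r2_aux <- summary(aux)$r.squared
vif_x2 <- 1 / (1 - r2_aux)

# Estimated variance of b2 rebuilt as sigma^2 / sum((x2 - mean)^2) * VIF
sigma2_hat     <- summary(fit)$sigma^2
var_b2_formula <- sigma2_hat / sum((x2 - mean(x2))^2) * vif_x2
var_b2_lm      <- vcov(fit)["x2", "x2"]  # what lm() reports directly

print(round(c(R2_aux = r2_aux, VIF = vif_x2,
              var_formula = var_b2_formula, var_lm = var_b2_lm), 4))
```

The last two numbers agree, which confirms that the VIF is exactly the factor by which collinearity scales up the variance of the coefficient on x2.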
r2 <- seq(0, 0.97, length.out = 200)
vif <- data.frame(r2 = r2, vif = 1 / (1 - r2))
ggplot(vif, aes(r2, vif)) +
geom_line(color = ucla$blue, linewidth = 1) +
geom_hline(yintercept = 10, linetype = "dashed", color = ucla$gray) +
geom_vline(xintercept = 0.9, linetype = "dashed", color = ucla$gray) +
annotate("text", x = 0.25, y = 11.5, label = "VIF = 10 threshold",
color = ucla$gray, size = 3.2) +
scale_x_continuous(breaks = seq(0, 1, 0.2)) +
labs(x = expression(R[2~bullet]^2), y = "VIF")
18.5 A word of caution on model selection
These tools inform; they do not decide. No statistic can substitute for judgment, and the most common abuse of econometrics is letting a fit criterion do the thinking. A few principles keep the work honest.
Start from economic theory and the model’s purpose, not from whatever happens to maximize a statistic. Use several signals in combination rather than trusting any one: the signs and magnitudes of coefficients, \(t\)- and \(F\)-tests, RESET, residual plots, robustness across alternative specifications, AIC/BIC, and a hold-out sample when the goal is prediction. And above all, do not data-mine.
Running dozens of models and reporting only the one that came out “significant” invalidates the very inference you are reporting.
A good specification is one that is defensible on theory and robust across reasonable alternatives.
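A quick simulation shows how easily a search produces a spurious finding. In this sketch, y is pure noise and each of 20 candidate regressors is tried in its own simple regression at the 5% level; the sample size, the number of candidates, and the helper name `any_significant` are hypothetical choices.

```r
set.seed(2024)   # arbitrary seed, only for reproducibility
n            <- 100
n_candidates <- 20
reps         <- 1000

# One "study": y is pure noise, so every candidate regressor is irrelevant.
any_significant <- function() {
  y <- rnorm(n)
  X <- matrix(rnorm(n * n_candidates), nrow = n)
  r <- cor(X, y)                              # correlation of each candidate with y
  t_stats <- r * sqrt((n - 2) / (1 - r^2))    # t-statistic of each simple-regression slope
  any(abs(t_stats) > qt(0.975, df = n - 2))
}

share <- mean(replicate(reps, any_significant()))
print(c(simulated = share, theoretical = 1 - 0.95^n_candidates))
```

With 20 independent tries, the chance of at least one "significant" result is 1 - 0.95^20, roughly 64%, even though nothing is there. A reported \(p\)-value of 0.05 from the winning model badly understates that risk.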
18.6 Recap
The whole chapter pivots on a single trade-off. Omit a relevant variable and your coefficients are biased, by the exact amount \[
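\mathrm{bias}(b_2) = \beta_3\,\frac{\Cov(x,z)}{\Var(x)} .
\] Include an irrelevant variable and the coefficients stay unbiased, but their standard errors are inflated.
Which way to lean depends on purpose: for causal inference, kill omitted-variable bias at all costs; for prediction, maximize fit and forget about “holding constant.” The selection toolkit (adjusted \(R^2\), AIC and BIC, RESET, Jarque–Bera, and the VIF) informs that choice; it does not make it.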
Next time: many regressors are categorical.