\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

18  Model Specification, Multicollinearity & Model Selection

Reading. SW 6.1, 7.5, 9.2; HGL 4.3, 6.3

We can now estimate, interpret, and test a multiple regression. But every result so far has quietly assumed that we already have the right model. Choosing it is the hard part of applied work, and it turns on one central tension. Omit a variable that belongs in the model and the coefficients you keep become biased: this is the omitted-variable problem of multiple regression, now sharpened into a formula. Include a variable that does not belong and, if it is correlated with the regressors you care about, you inflate the variances of their coefficients, losing precision. Good specification is the art of navigating between these two errors.

This chapter does two things. First it quantifies the trade-off: the omitted-variable bias formula and its direction, the cost of irrelevant variables, and the role of control variables and proxies, all framed by the crucial distinction between building a model for causal inference and building one for prediction. Second it assembles a toolkit of diagnostics for model selection: adjusted \(R^2\), the AIC and BIC information criteria, the RESET test for functional-form errors, the Jarque–Bera test for residual normality, and the variance inflation factor for collinearity.

18.1 Omitted-variable bias, quantified

Suppose the true model contains two regressors, \[ y = \beta_1 + \beta_2 x + \beta_3 z + e , \] but we omit \(z\) and instead fit the short regression \(y = \beta_1 + \beta_2 x + v\). What happens to our estimate \(b_2\) of the slope on \(x\)? It is biased, and the bias has an exact algebraic form.

The omitted-variable bias formula

If the true model includes \(z\) but we omit it, the OLS slope on \(x\) in the short regression satisfies \[ \mathrm{bias}(b_2) = \E(b_2) - \beta_2 = \beta_3\,\frac{\widehat{\Cov}(x,z)}{\widehat{\Var}(x)} . \]

Each piece of this formula has a clean reading. The ratio \(\widehat{\Cov}(x,z)/\widehat{\Var}(x)\) is exactly the slope from regressing the omitted variable \(z\) on the included variable \(x\): it measures how the thing we left out tracks the thing we kept. Multiplying by \(\beta_3\), the true effect of \(z\) on \(y\), tells us how much of \(z\)’s influence gets misattributed to \(x\).
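One way to see where the formula comes from is the following sketch. It assumes the usual exogeneity condition \(\E(e \given x, z) = 0\) and treats the sample values of \(x\) and \(z\) as given. The short-regression slope is the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\); substituting the true model for \(y\) gives \[ b_2 = \frac{\widehat{\Cov}(x,y)}{\widehat{\Var}(x)} = \frac{\widehat{\Cov}(x,\;\beta_1 + \beta_2 x + \beta_3 z + e)}{\widehat{\Var}(x)} = \beta_2 + \beta_3\,\frac{\widehat{\Cov}(x,z)}{\widehat{\Var}(x)} + \frac{\widehat{\Cov}(x,e)}{\widehat{\Var}(x)} . \] Under the stated assumption the last term has expectation zero, so \(\E(b_2) - \beta_2 = \beta_3\,\widehat{\Cov}(x,z)/\widehat{\Var}(x)\), which is the bias formula above.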

The formula also makes precise the two conditions under which omitting \(z\) does no harm, the same two we met in multiple regression. The bias vanishes if and only if either

  • \(\beta_3 = 0\), so that \(z\) is genuinely irrelevant and does not belong in the model in the first place; or
  • \(\Cov(x,z) = 0\), so that \(z\) is uncorrelated with the regressor we kept and there is no channel through which its omission can contaminate \(b_2\).

If neither holds, \(b_2\) is biased.
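The two no-harm conditions can be checked with a small Monte Carlo experiment. The sketch below is illustrative only: the function name `sim_bias`, the parameter values, and the sample sizes are assumptions chosen for the example, not taken from the text. It simulates the short regression many times and averages the error in \(b_2\) for three cases: both conditions fail, \(\beta_3 = 0\), and \(\Cov(x,z) = 0\).

```r
# Monte Carlo check of the omitted-variable bias formula (illustrative values).
set.seed(1)

# Average bias of the short-regression slope when z is omitted.
# beta3: true coefficient on z; rho: correlation between x and z.
sim_bias <- function(beta3, rho, n = 200, reps = 2000, beta2 = 1) {
  b2 <- replicate(reps, {
    x <- rnorm(n)
    z <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # Var(z) = 1, Cov(x, z) = rho
    y <- 2 + beta2 * x + beta3 * z + rnorm(n)  # true model includes z
    cov(x, y) / var(x)                         # OLS slope of y on x alone
  })
  mean(b2) - beta2
}

bias_both   <- sim_bias(beta3 = 0.5, rho = 0.6)  # formula predicts 0.5 * 0.6 = 0.3
bias_irrel  <- sim_bias(beta3 = 0,   rho = 0.6)  # beta3 = 0: formula predicts 0
bias_uncorr <- sim_bias(beta3 = 0.5, rho = 0)    # Cov(x, z) = 0: formula predicts 0
round(c(both = bias_both, irrelevant = bias_irrel, uncorrelated = bias_uncorr), 3)
```

Because \(x\) and \(z\) both have unit variance here, the slope of \(z\) on \(x\) equals the correlation `rho`, so the predicted bias is simply `beta3 * rho`.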

Signing the bias

Because the bias is a product, its direction is the product of two signs: \[ \mathrm{sign}\bigl(\mathrm{bias}(b_2)\bigr) = \mathrm{sign}(\beta_3)\times\mathrm{sign}\bigl(\Cov(x,z)\bigr). \] This is a remarkably useful fact, because it often lets you reason about the direction of bias even when you have no data on \(z\) at all. If you can argue on economic grounds that the omitted variable has a positive effect on \(y\) and is positively correlated with the regressor you kept, you know the kept coefficient is biased upward. This is a routine and powerful move in applied work.

Family income (HGL edu_inc)

Consider the household income equation \[ \ln(\text{FAMINC}) = \beta_1 + \beta_2\,\text{HEDU} + \beta_3\,\text{WEDU} + e , \] where HEDU and WEDU are the husband’s and wife’s years of education. Both education effects are positive, and the two spouses’ education levels are positively correlated. So if we omit WEDU, the formula predicts an upward bias in the HEDU coefficient: \(\beta_3 > 0\) and \(\Cov(\text{HEDU},\text{WEDU})>0\) make the product positive. And that is exactly what we see: the husband’s coefficient jumps from \(0.044\) to \(0.061\) once WEDU is dropped. The wife’s education effect gets misattributed to the husband.

We can reproduce this directly on the edu_inc data by fitting the full model and then dropping the wife’s education variable, we:

data(edu_inc)
full  <- lm(log(faminc) ~ he + we, data = edu_inc)
short <- lm(log(faminc) ~ he,       data = edu_inc)
c(full_he = coef(full)["he"], short_he = coef(short)["he"])
#>  full_he.he short_he.he 
#>  0.04385462  0.06132256

The husband’s coefficient rises from about \(0.044\) in the full model to about \(0.061\) once the wife’s education is omitted, matching the figures quoted above. We can even verify the bias formula numerically: the estimate of \(\beta_3\) times the slope from regressing we on he should equal the change.

b3        <- coef(full)["we"]
aux_slope <- coef(lm(we ~ he, data = edu_inc))["he"]
c(predicted_bias = unname(b3 * aux_slope),
  actual_bias    = unname(coef(short)["he"] - coef(full)["he"]))
#> predicted_bias    actual_bias 
#>     0.01746795     0.01746795

The two agree, as they must: within a sample this decomposition is an algebraic identity of OLS. Figure 18.1 shows the sign rule as a simple \(2\times 2\) map: the bias is positive in the two cells where \(\beta_3\) and \(\Cov(x,z)\) share a sign, and negative where they differ.

grid <- expand.grid(cov = c("Cov(x,z) < 0", "Cov(x,z) > 0"),
                    b3  = c("beta3 < 0", "beta3 > 0"))
grid$sign <- c("bias > 0", "bias < 0", "bias < 0", "bias > 0")
ggplot(grid, aes(cov, b3, fill = sign)) +
  geom_tile(color = "white", linewidth = 1.2) +
  geom_text(aes(label = sign), color = "white", fontface = "bold", size = 4) +
  scale_fill_manual(values = c("bias > 0" = ucla$red, "bias < 0" = ucla$blue),
                    guide = "none") +
  labs(x = NULL, y = NULL)
Figure 18.1: The direction of omitted-variable bias is the product of two signs: the omitted variable’s effect \(\beta_3\) and its covariance with the included regressor.

18.2 Irrelevant variables & control variables

If omitting a relevant variable is so costly, why not simply include everything? Because the opposite error has a cost too. An irrelevant variable, one whose true coefficient is zero, does not bias your estimates, but if it is correlated with your regressors it inflates their variances. You buy safety against bias at the price of precision.

Family income, continued

Take the family-income equation and add two artificial regressors that are correlated with HEDU and WEDU but have no genuine effect on income. Their estimated coefficients come out insignificant, which is exactly what we want from an irrelevant variable. But the standard errors on HEDU and WEDU rise, and the precision of the coefficients we actually care about falls.

data(edu_inc)
clean <- lm(log(faminc) ~ he + we,                   data = edu_inc)
bloat <- lm(log(faminc) ~ he + we + xtra_x5 + xtra_x6, data = edu_inc)
rbind(
  clean = summary(clean)$coefficients[c("he", "we"), "Std. Error"],
  bloat = summary(bloat)$coefficients[c("he", "we"), "Std. Error"]
)
#>                he         we
#> clean 0.008722604 0.01158432
#> bloat 0.013668731 0.02494325

The standard errors on both he and we are noticeably larger once the two irrelevant regressors are in the model: the price of including variables that do not belong. This is the heart of the bias–variance trade-off.

The bias–variance trade-off

                                  bias        variance
omit a relevant variable          biased      lower
include an irrelevant variable    unbiased    inflated

Omitting buys precision at the cost of bias; including buys unbiasedness at the cost of precision. Neither extreme is automatically right.
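The "unbiased but inflated" row can be illustrated with simulated data. The sketch below is an assumption-laden toy example (the variable names `x`, `w`, `y` and all numbers are made up for illustration): `w` has a true coefficient of zero but is strongly correlated with `x`, and we compare the standard error on `x` with and without it.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
w <- 0.9 * x + sqrt(1 - 0.9^2) * rnorm(n)  # correlated with x, true coefficient 0
y <- 1 + 0.5 * x + rnorm(n)                # w plays no role in generating y

clean <- lm(y ~ x)      # correct specification
bloat <- lm(y ~ x + w)  # adds the irrelevant regressor

# Standard error on x under each specification
se_x <- c(clean = summary(clean)$coefficients["x", "Std. Error"],
          bloat = summary(bloat)$coefficients["x", "Std. Error"])
se_x
unname(se_x["bloat"] / se_x["clean"])  # roughly 1 / sqrt(1 - cor(x, w)^2)
```

The inflation comes almost entirely from the correlation between `x` and `w`; an irrelevant regressor that is uncorrelated with `x` would leave the standard error nearly unchanged.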

Control variables and proxies

The hardest case is a confounder you cannot measure. The classic example is ability in a wage equation: more able workers tend to get more education and to earn more, so leaving ability out biases the estimated return to education upward. But ability is not in any dataset. The fix is a control variable or proxy: an observable variable that stands in for the unmeasured confounder.

A proxy for ability (HGL, Koop–Tobias)

In the regression of \(\ln(\text{WAGE})\) on EDUC and EXPER, omitting ability biases the return to education upward. Adding SCORE, an aptitude-test score that proxies for ability, pulls the estimated return down from about \(7.3\%\) to about \(5.9\%\). The proxy soaks up the part of the education–wage association that was really ability, shrinking the overstated education effect toward its true value.

For a proxy to do its job it must satisfy a condition called conditional mean independence: once you control for the proxy, the regressor you care about behaves “as if” randomly assigned with respect to the omitted factor. There is a subtlety worth stressing, though.

The proxy's own coefficient is not causal

A proxy is included only to clean up the coefficient you care about. Its own estimated coefficient should not be interpreted as a causal effect: SCORE’s coefficient is not “the return to aptitude.” The proxy is scaffolding, present so that the education coefficient can be read causally, not a finding in itself.
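A simulation makes both points concrete. The sketch below is a stylized example, not the Koop–Tobias data: the variable names and every coefficient are assumptions. Ability is unobserved, `score` is an observed proxy, and education is related to ability only through `score`, so conditional mean independence holds by construction. Note that `score` has no direct effect on wages in the data-generating process.

```r
set.seed(42)
n       <- 5000
score   <- rnorm(n)                          # observed proxy
ability <- 0.8 * score + rnorm(n, sd = 0.6)  # unobserved; its noise is unrelated to educ
educ    <- 12 + 1.5 * score + rnorm(n)       # linked to ability only through score
lwage   <- 1 + 0.06 * educ + 0.10 * ability + rnorm(n, sd = 0.3)

no_proxy   <- lm(lwage ~ educ)          # ability omitted
with_proxy <- lm(lwage ~ educ + score)  # proxy stands in for ability

round(c(true_return   = 0.06,
        no_proxy      = unname(coef(no_proxy)["educ"]),
        with_proxy    = unname(coef(with_proxy)["educ"]),
        score_coef    = unname(coef(with_proxy)["score"])), 3)
```

With the proxy included, the education coefficient should land near the true 0.06, while the coefficient on `score` is clearly nonzero even though `score` has no direct effect on wages here: it is picking up ability's effect, which is why it should not be read causally.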

18.3 Causal vs. prediction

Which variables belong in a model depends entirely on why you built it. The two great purposes of regression, estimating a causal effect and forecasting an outcome, follow different rulebooks, and confusing them is a common and costly mistake.

Two purposes, two rulebooks

Causal inference. The goal is an unbiased effect. Omitted-variable bias is the enemy, so you include every confounder and control you can. A low \(R^2\) is perfectly acceptable; what matters is that “other things” are genuinely held constant.

Prediction. The goal is an accurate \(\hat y\). You want regressors that are highly correlated with \(y\) and a high \(R^2\). Nothing is being “held constant,” so omitted-variable bias simply does not apply: a good predictor need not be causal at all.

This distinction matters precisely because the selection tools we are about to meet (adjusted \(R^2\), AIC, BIC, hold-out RMSE) all chase predictive fit. They are genuinely useful for building forecasting models. But a model that scores beautifully on these criteria can still be badly biased for a causal question, because it may have dropped a confounder that added too little predictive fit to justify its penalty, or included a downstream variable that improved the fit.

Never let a fit criterion overrule theory

A high-scoring predictive model can be a terrible causal model. Economic theory about which variables are confounders must take precedence over any statistic when the goal is a causal effect. Fit criteria inform causal modeling; they do not decide it.

18.4 Diagnostic & selection tools

We now turn to the tools that help us compare and stress-test specifications.

Fit criteria that penalize size

The plain \(R^2\) is useless for choosing how many regressors to include, because it never falls when you add a variable: even pure noise can only push it up. The remedy is a criterion that rewards a smaller sum of squared errors (SSE) but penalizes the number of parameters \(K\). Three such criteria are standard: \[ \bar R^2 = 1 - \frac{\mathrm{SSE}/(N-K)}{\mathrm{SST}/(N-1)}, \qquad \mathrm{AIC} = \ln\!\frac{\mathrm{SSE}}{N} + \frac{2K}{N}, \qquad \mathrm{BIC} = \ln\!\frac{\mathrm{SSE}}{N} + \frac{K\ln N}{N}. \]

The adjusted \(R^2\), \(\bar R^2\), divides the sums of squares by their degrees of freedom. A new variable raises \(\bar R^2\) only if its \(|t|\)-statistic exceeds \(1\), a fairly weak penalty. The price of the adjustment is that \(\bar R^2\) loses the clean “fraction of variation explained” interpretation that \(R^2\) has.

The AIC (Akaike information criterion) and BIC (Bayesian / Schwarz information criterion) take a different form: you pick the model that minimizes the criterion. Both add a penalty proportional to \(K\), but BIC’s penalty \(K\ln N / N\) is harsher than AIC’s \(2K/N\) whenever \(N \ge 8\) (since \(\ln N > 2\) once \(N > e^2 \approx 7.4\)), so BIC favors smaller, more parsimonious models than AIC does. Figure 18.2 shows how the two penalties diverge as the sample grows.

N <- seq(4, 200, by = 1)
pen <- rbind(
  data.frame(N = N, penalty = 2 / N,        crit = "AIC: 2/N"),
  data.frame(N = N, penalty = log(N) / N,   crit = "BIC: ln(N)/N")
)
ggplot(pen, aes(N, penalty, color = crit)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = 8, linetype = "dashed", color = ucla$gray) +
  annotate("text", x = 8, y = 0.45, label = "N = 8", hjust = -0.1,
           color = ucla$gray, size = 3.2) +
  scale_color_manual(values = c("AIC: 2/N" = ucla$blue,
                                "BIC: ln(N)/N" = ucla$red), name = NULL) +
  labs(x = "sample size N", y = "penalty per parameter")
Figure 18.2: Per-parameter penalty of AIC (\(2/N\)) versus BIC (\(\ln N / N\)). For \(N \geq 8\), BIC penalizes each extra parameter more heavily, so it favors smaller models.
Same dependent variable only

These criteria are only comparable across models that share the same dependent variable. You cannot use them to choose between a model in \(y\) and a model in \(\ln y\): the SSEs are measured on different scales, so the comparison is meaningless.
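The three criteria are easy to compute by hand from a fitted model. The sketch below uses R's built-in mtcars data purely as a stand-in (it is not one of the chapter's datasets), and the helper name `fit_criteria` is hypothetical. It follows the SSE-based formulas given earlier in this section, with \(K\) counting the intercept.

```r
# Adjusted R^2, AIC and BIC in the SSE-based form used in this chapter.
fit_criteria <- function(model) {
  sse <- sum(resid(model)^2)
  y   <- model.response(model.frame(model))
  sst <- sum((y - mean(y))^2)
  n   <- length(y)
  k   <- length(coef(model))  # number of parameters, including the intercept
  c(adj_r2 = 1 - (sse / (n - k)) / (sst / (n - 1)),
    aic    = log(sse / n) + 2 * k / n,
    bic    = log(sse / n) + k * log(n) / n)
}

# Two candidate models with the same dependent variable
small <- lm(mpg ~ wt,             data = mtcars)
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)
crit  <- rbind(small = fit_criteria(small), large = fit_criteria(large))
crit
```

Prefer the model with the higher `adj_r2` and the lower `aic` or `bic`. R's built-in `AIC()` and `BIC()` are on a log-likelihood scale, so their values differ from these, but for models fitted to the same \(y\) and the same \(N\) they should rank the candidates the same way.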

RESET: is the functional form wrong?

The RESET test (Regression Specification Error Test) looks for a missing nonlinearity or interaction, a sign that the functional form is wrong or a variable is omitted. The trick is elegant. After fitting the model and obtaining the fitted values \(\hat y\), augment the regression with powers of those fitted values: \[ y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \gamma_1 \hat y^2 + \gamma_2 \hat y^3 + e , \] and then test \(H_0: \gamma_1 = \gamma_2 = 0\) with an \(F\)-test.

Why does this work? Because \(\hat y^2\) and \(\hat y^3\) are themselves polynomials in the original regressors. If the true relationship has a curve or an interaction that the linear model misses, these powers will improve the fit and their coefficients will come out nonzero. Rejecting \(H_0\) therefore signals that the model is misspecified, and you should go looking for a missing term or a transformation.

RESET is asymmetric

Failing to reject does not certify that the model is correct. It only means RESET did not catch anything. A clean RESET is reassuring but never conclusive.
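RESET takes only a few lines to run by hand. The sketch below uses simulated data, with made-up variable names and coefficients, in which the true relationship is curved in `x2` but the fitted model is linear. It follows the recipe described above: fit, take powers of the fitted values, refit, and \(F\)-test the added terms.

```r
set.seed(3)
n  <- 300
x2 <- runif(n, 0, 4)
x3 <- rnorm(n)
y  <- 1 + 0.5 * x2 + 0.8 * x3 + 0.6 * x2^2 + rnorm(n)  # truth is curved in x2

base <- lm(y ~ x2 + x3)   # misspecified: omits the x2^2 term
yhat <- fitted(base)
aug  <- lm(y ~ x2 + x3 + I(yhat^2) + I(yhat^3))

reset   <- anova(base, aug)        # F-test of H0: gamma1 = gamma2 = 0
reset_p <- reset[["Pr(>F)"]][2]    # p-value of the RESET test
reset
```

A small `reset_p` signals misspecification, as expected here. The `resettest()` function in the lmtest package offers a packaged version of the same idea, if that package is available.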

Jarque–Bera: are the errors normal?

Exact small-sample \(t\) and \(F\) inference relies on assumption SR6/MR6, that the errors are normally distributed. The Jarque–Bera test checks this on the residuals, combining their skewness \(S\) and kurtosis \(K\) (here \(K\) denotes kurtosis, not the number of parameters) into a single statistic: \[ \mathrm{JB} = \frac{N}{6}\!\left(S^2 + \frac{(K-3)^2}{4}\right) \sim \chi^2_{(2)} \quad\text{under normality, in large samples.} \] A normal distribution has skewness \(S = 0\) and kurtosis \(K = 3\), which drives JB toward \(0\). At the 5% level we reject normality if JB exceeds the critical value \(\chi^2_{(0.95,2)} = 5.99\).

Food expenditure residuals

Fitting food_exp ~ income on the food data and testing the residuals gives skewness \(S = -0.10\), kurtosis \(K = 2.99\), and so \(\mathrm{JB} = 0.06\) with \(p = 0.97\). We comfortably fail to reject: normality of the errors is entirely plausible here.

We can compute the Jarque–Bera statistic directly from the food-data residuals:

data(food)
e <- resid(lm(food_exp ~ income, data = food))
N <- length(e)
S <- mean((e - mean(e))^3) / mean((e - mean(e))^2)^(3/2)
K <- mean((e - mean(e))^4) / mean((e - mean(e))^2)^2
JB <- N / 6 * (S^2 + (K - 3)^2 / 4)
c(skewness = S, kurtosis = K, JB = JB, p_value = 1 - pchisq(JB, df = 2))
#>    skewness    kurtosis          JB     p_value 
#> -0.09731877  2.98903377  0.06334005  0.96882622

The residual histogram in Figure 18.3 confirms the verdict visually: the residuals are roughly symmetric and bell-shaped, hugging the overlaid normal density.

res_df <- data.frame(e = e)
ggplot(res_df, aes(e)) +
  geom_histogram(aes(y = after_stat(density)), bins = 12,
                 fill = ucla$blue, color = ucla$darkblue) +
  stat_function(fun = dnorm, args = list(mean = mean(e), sd = sd(e)),
                color = ucla$red, linewidth = 1) +
  labs(x = "residual", y = "density")
Figure 18.3: Residuals from the food-expenditure regression, with a normal density overlaid. The Jarque–Bera test does not reject normality (JB = 0.06, p = 0.97).

If the errors are not normal, all is not lost. In large samples the central limit theorem makes \(t\)- and \(F\)-inference approximately valid anyway, regardless of the error distribution. This is the same large-sample logic we developed for confidence intervals and the CLT.

Collinearity: the variance inflation factor

Recall from multiple-regression variance and collinearity that near-collinear regressors blow up standard errors. The variance inflation factor (VIF) quantifies the damage one regressor at a time. The variance of \(b_2\) decomposes as \[ \Var(b_2 \given \mathbf{X}) = \frac{\sigma^2}{\sum_i (x_{i2}-\bar x_2)^2}\cdot \underbrace{\frac{1}{1 - R^2_{2\bullet}}}_{\text{VIF}}, \] where \(R^2_{2\bullet}\) is the \(R^2\) from regressing \(x_2\) on all the other regressors. The closer the other regressors come to explaining \(x_2\), the closer \(R^2_{2\bullet}\) is to \(1\) and the larger the VIF.

The interpretation is direct. With no collinearity, \(R^2_{2\bullet} = 0\) and \(\text{VIF} = 1\): no inflation. If \(R^2_{2\bullet} = 0.9\), the VIF is \(10\), so the variance of \(b_2\) is ten times what it would be with orthogonal regressors. A common rule of thumb flags any \(\text{VIF} > 10\) (equivalently \(R^2_{2\bullet} > 0.9\)) as a worrying degree of collinearity. Figure 18.4 shows how sharply the VIF climbs as \(R^2_{2\bullet}\) approaches one.

r2  <- seq(0, 0.97, length.out = 200)
vif <- data.frame(r2 = r2, vif = 1 / (1 - r2))
ggplot(vif, aes(r2, vif)) +
  geom_line(color = ucla$blue, linewidth = 1) +
  geom_hline(yintercept = 10, linetype = "dashed", color = ucla$gray) +
  geom_vline(xintercept = 0.9, linetype = "dashed", color = ucla$gray) +
  annotate("text", x = 0.25, y = 11.5, label = "VIF = 10 threshold",
           color = ucla$gray, size = 3.2) +
  scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  labs(x = expression(R[2~bullet]^2), y = "VIF")
Figure 18.4: The variance inflation factor \(1/(1 - R^2_{2\bullet})\) explodes as the other regressors come to explain \(x_2\). The rule-of-thumb threshold VIF = 10 corresponds to \(R^2_{2\bullet} = 0.9\).
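The VIF for a single regressor can be computed from one auxiliary regression. The sketch below again uses the built-in mtcars data as a stand-in (an assumption for illustration, not a chapter dataset) and checks the variance decomposition numerically, with the unknown \(\sigma^2\) replaced by its estimate \(\hat\sigma^2\).

```r
m <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Auxiliary regression: wt on all the other regressors
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# Rebuild the estimated variance of the wt coefficient from the decomposition
sigma2 <- summary(m)$sigma^2                       # estimate of the error variance
sst_wt <- sum((mtcars$wt - mean(mtcars$wt))^2)     # total variation in wt
c(vif              = vif_wt,
  var_from_formula = sigma2 / sst_wt * vif_wt,
  var_from_lm      = vcov(m)["wt", "wt"])
```

The last two entries should coincide, confirming that the VIF is exactly the factor by which collinearity scales up the coefficient's variance. Packaged alternatives such as `vif()` in the car package report the same quantity for every regressor at once.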

18.5 A word of caution on model selection

These tools inform; they do not decide. No statistic can substitute for judgment, and the most common abuse of econometrics is letting a fit criterion do the thinking. A few principles keep the work honest.

Start from economic theory and the model’s purpose, not from whatever happens to maximize a statistic. Use several signals in combination rather than trusting any one: the signs and magnitudes of coefficients, \(t\)- and \(F\)-tests, RESET, residual plots, robustness across alternative specifications, AIC/BIC, and a hold-out sample when the goal is prediction. And above all, do not data-mine.

Do not data-mine

Running dozens of models and reporting only the one that came out “significant” invalidates the very inference you are reporting: the \(p\)-values no longer mean what they claim. If you searched over specifications, disclose the search.
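A quick simulation shows how badly a specification search distorts \(p\)-values. In the sketch below, whose sizes and names are illustrative assumptions, the outcome is pure noise and we try 20 unrelated candidate regressors one at a time, recording whether at least one simple regression comes out "significant" at the 5% level.

```r
set.seed(4)
n_obs    <- 100    # observations per dataset
n_models <- 20     # candidate regressors tried, one simple regression each
reps     <- 1000   # simulated datasets

any_sig <- replicate(reps, {
  y <- rnorm(n_obs)                                       # outcome unrelated to anything
  X <- matrix(rnorm(n_obs * n_models), n_obs, n_models)   # pure-noise regressors
  r <- cor(X, y)                                          # correlation of each column with y
  t <- r * sqrt((n_obs - 2) / (1 - r^2))                  # slope t-statistic in y ~ x_j
  p <- 2 * pt(-abs(t), df = n_obs - 2)                    # two-sided p-values
  any(p < 0.05)                                           # did the search "find" something?
})
share_any <- mean(any_sig)
c(simulated = share_any, independent_tests = 1 - 0.95^n_models)
```

Each individual test still has a 5% false-positive rate, but the chance that the best of 20 looks significant is far higher, which is why an undisclosed search makes a reported \(p\)-value misleading.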

The honest standard

A good specification is one that is defensible on theory and robust across reasonable alternatives, not merely the one with the prettiest \(R^2\).

18.6 Recap

The whole chapter pivots on a single trade-off. Omit a relevant variable and your coefficients are biased, by the exact amount \[ \mathrm{bias}(b_2) = \beta_3\,\frac{\widehat{\Cov}(x,z)}{\widehat{\Var}(x)} , \] as in the family-income example, where omitting the wife’s education moved the husband’s education coefficient from \(0.044\) to \(0.061\). Include an irrelevant variable and you avoid bias but inflate the variances. When a confounder is unmeasurable, a proxy or control (SCORE for ability, which pulled the education return from \(7.3\%\) to \(5.9\%\)) can rescue the coefficient you care about, though the proxy’s own coefficient is not causal.

Which way to lean depends on purpose: for causal inference, kill omitted-variable bias at all costs; for prediction, maximize fit and forget about “holding constant.” The selection toolkit helps you compare models: adjusted \(R^2\), AIC, and BIC (all penalizing model size, BIC the most harshly), RESET (add \(\hat y^2\) and \(\hat y^3\) to detect bad functional form), Jarque–Bera (residual normality; food JB \(= 0.06\)), and the VIF \(= 1/(1 - R^2_{2\bullet})\) for collinearity. But theory comes first, and data-mining invalidates inference.

Next time: many regressors are categorical (sex, region, treatment status). We encode them as \(0/1\) indicator (dummy) variables: intercept shifts, slope dummies, and the linear probability model for when \(y\) itself is binary.