\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

14  Interpreting MR: Variance & Collinearity

Reading. SW 6.4, 6.6–6.7; HGL 5.3, 6.5

In the previous chapter we estimated Big Andy’s sales “plane,” \[ \widehat{\text{SALES}} = 118.91 - 7.908\,\text{PRICE} + 1.863\,\text{ADVERT}, \] treating each slope as a point estimate. But just as in simple regression, these slopes are random variables: re-run the burger chain’s experiment on a fresh set of cities and the numbers would come out a little differently. So the natural next question is one of precision: how much would each slope wobble across samples?

This chapter answers that question for multiple regression and, in doing so, introduces a complication that has no analog in the one-regressor world. We meet the variance–covariance matrix and see where standard errors come from; we identify the four drivers of precision, one of which is genuinely new; we study collinearity, which is what happens when regressors move together; and we close with perfect collinearity and the dummy-variable trap, the two ways that overlapping regressors can break OLS outright.

14.1 The variance–covariance matrix

In simple regression there were only two estimators, \(b_1\) and \(b_2\), and we tracked each one’s variance separately. With \(K\) coefficients we have to track more: not just each estimator’s variance, but every covariance between pairs of estimators. The tidy way to hold all of this is a single object, the variance–covariance matrix, with variances down the diagonal and covariances off it.

Variance–covariance matrix

For three coefficients the estimated variance–covariance matrix is \[ \widehat{\Cov}(b_1,b_2,b_3) = \begin{bmatrix} \widehat{\Var}(b_1) & \widehat{\Cov}(b_1,b_2) & \widehat{\Cov}(b_1,b_3)\\ \widehat{\Cov}(b_1,b_2) & \widehat{\Var}(b_2) & \widehat{\Cov}(b_2,b_3)\\ \widehat{\Cov}(b_1,b_3) & \widehat{\Cov}(b_2,b_3) & \widehat{\Var}(b_3) \end{bmatrix}. \] The diagonal holds the variances; the off-diagonal entries hold the covariances. The matrix is symmetric, because \(\widehat{\Cov}(b_j,b_k) = \widehat{\Cov}(b_k,b_j)\).

This matrix is built exactly the way the variance was built in simple regression: by replacing the unknown error variance \(\sigma^2\) with its estimate \(\hat\sigma^2 = \mathrm{SSE}/(N-K)\) inside the variance formulas. Two parts of it serve two different purposes. The diagonal is what gives us standard errors: take a square root of each variance and you have the standard error of that coefficient. The off-diagonal covariances are needed whenever we care about a linear combination of coefficients rather than one coefficient in isolation; that is the subject of the next chapter, where Big Andy weighs a joint price-and-advertising strategy.

Big Andy’s variance–covariance matrix

For Andy’s two-regressor model the error-variance estimate is \(\hat\sigma^2 = 23.87\), and software returns the matrix

\[ \widehat{\Cov}(b_1,b_2,b_3) = \begin{bmatrix} 40.343 & -6.795 & -0.748\\ -6.795 & 1.201 & -0.020\\ -0.748 & -0.020 & 0.467 \end{bmatrix}. \]

The standard errors are simply the square roots of the diagonal entries: \[ \mathrm{se}(b_1)=\sqrt{40.343}=6.35,\quad \mathrm{se}(b_2)=\sqrt{1.201}=1.10,\quad \mathrm{se}(b_3)=\sqrt{0.467}=0.68 . \]

These are exactly the standard errors printed beneath the coefficients in the previous chapter. As a rough first reading of precision: in roughly 95 percent of resampled sets of cities we would expect \(b_2\), the price slope, to land within about \(\pm 2\,(1.10) = \pm 2.2\) of the true \(\beta_2\).

In R, the entire matrix comes from vcov() applied to the fitted model, and the standard errors are the square roots of its diagonal.

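# andy is assumed to come from a POE companion data package (e.g. PoEdata); load that package first
data(andy)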
mod <- lm(sales ~ price + advert, data = andy)
vcov(mod)              # the variance--covariance matrix
#>             (Intercept)       price      advert
#> (Intercept)  40.3432990 -6.79506412 -0.74842060
#> price        -6.7950641  1.20120070 -0.01974215
#> advert       -0.7484206 -0.01974215  0.46675606
sqrt(diag(vcov(mod)))  # standard errors = sqrt of the diagonal
#> (Intercept)       price      advert 
#>   6.3516375   1.0959930   0.6831955
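
To see what vcov() is doing under the hood, here is a minimal sketch on simulated data (not Andy’s; the variable names, sample size, and coefficients are made up for illustration). It assumes the standard matrix form of the OLS variance formulas, \(\hat\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\), which this chapter does not write out, and checks that building the matrix by hand reproduces what vcov() returns.

```r
# Simulated stand-in data (NOT Big Andy's): all numbers here are arbitrary assumptions
set.seed(1)
N <- 75
price  <- runif(N, 4, 7)
advert <- runif(N, 0.5, 3)
sales  <- 120 - 8 * price + 2 * advert + rnorm(N, sd = 5)

fit <- lm(sales ~ price + advert)

X <- model.matrix(fit)                 # N x K design matrix; first column is all ones
K <- ncol(X)                           # number of coefficients (here 3)
sse <- sum(resid(fit)^2)               # sum of squared residuals
sigma2_hat <- sse / (N - K)            # estimated error variance, SSE / (N - K)
V <- sigma2_hat * solve(crossprod(X))  # sigma2_hat * (X'X)^{-1}

all.equal(V, vcov(fit), check.attributes = FALSE)  # hand-built matrix vs. vcov()
sqrt(diag(V))                                      # standard errors from the diagonal
```

The two matrices should agree up to floating-point rounding, and the square roots of the diagonal should match the standard errors in summary(fit).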

14.2 What drives precision

To understand what makes a slope precise, we dissect its variance. For a model with two regressors, the variance of the price coefficient \(b_2\) can be written out in full:

The variance of a slope in two-regressor MR

\[ \Var(b_2 \given \mathbf{X}) = \frac{\sigma^2}{(1 - r_{23}^2)\,\sum_{i}(x_{i2}-\bar x_2)^2}, \] where \(r_{23}\) is the sample correlation between the two regressors \(x_2\) and \(x_3\).

Compare this to the simple-regression variance, \(\sigma^2 / \sum_i (x_{i2} - \bar x_2)^2\). Three of the four levers are exactly the same as before. The only new ingredient is the factor \((1 - r_{23}^2)\) sitting in the denominator, and it is worth reading carefully, because it is the whole story of multiple regression precision.

Four drivers of precision
  1. Error variance \(\sigma^2\) (the numerator). A noisier model means a larger variance for every slope. Same as before.
  2. Sample size \(N\). More observations enlarge the sum \(\sum_i (x_{i2}-\bar x_2)^2\), which shrinks the variance. Same as before.
  3. Variation in \(x_2\), measured by \(\sum_i (x_{i2}-\bar x_2)^2\). More spread in price gives a smaller variance: you learn a slope best where the regressor moves a lot. Same as before.
  4. Correlation between the regressors \(r_{23}\) (new!). The factor \((1 - r_{23}^2)\) shrinks toward \(0\) as \(|r_{23}| \to 1\), so the variance explodes.

The first three drivers are familiar from the simple-regression variance chapter. The fourth is the price of admission to multiple regression.
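The variance formula can be checked numerically. The sketch below uses simulated data (the sample size, coefficients, and the 0.6 loading that ties \(x_3\) to \(x_2\) are arbitrary assumptions) and plugs \(\hat\sigma^2\) in for \(\sigma^2\), so the formula should reproduce the estimated variance that vcov() reports for the coefficient on x2.

```r
# Simulated data; every number below is an illustrative assumption
set.seed(2)
N  <- 200
x2 <- rnorm(N)
x3 <- 0.6 * x2 + rnorm(N)             # x3 is correlated with x2 by construction
y  <- 1 + 2 * x2 - 1 * x3 + rnorm(N)

fit <- lm(y ~ x2 + x3)

sigma2_hat <- sum(resid(fit)^2) / (N - 3)  # SSE / (N - K) with K = 3
r23 <- cor(x2, x3)                         # sample correlation between the regressors
sxx <- sum((x2 - mean(x2))^2)              # variation in x2

var_formula <- sigma2_hat / ((1 - r23^2) * sxx)  # the two-regressor formula
var_vcov    <- vcov(fit)["x2", "x2"]             # what software reports

c(formula = var_formula, vcov = var_vcov)        # the two entries should match
```

Dropping the \((1 - r_{23}^2)\) term from var_formula gives the value the simple-regression denominator would imply, which makes the size of the collinearity penalty in this sample easy to read off.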

The new tension

Items 1–3 carry over unchanged from simple regression. Item 4 is unique to having more than one regressor: when two regressors carry overlapping information, it is hard to pin down either one’s separate effect. The more they move together, the less the data can say about each in isolation.

Figure 14.1 makes the new term concrete. As \(|r_{23}|\) climbs toward \(1\), the multiplier \(1/(1 - r_{23}^2)\) (the factor by which collinearity inflates the variance relative to uncorrelated regressors) grows without bound.

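R code for Figure 14.1:
library(ggplot2)
# `ucla` is the book's color palette, defined elsewhere; the stand-in colors below are an assumption
if (!exists("ucla")) ucla <- list(blue = "steelblue", darkblue = "navy", red = "firebrick")
rs <- seq(0, 0.97, length.out = 300)   # stop short of 1, where the factor diverges
vif <- data.frame(r = rs, factor = 1 / (1 - rs^2))
ggplot(vif, aes(r, factor)) +
  geom_line(color = ucla$blue, linewidth = 1) +
  scale_x_continuous(breaks = seq(0, 1, 0.25)) +
  scale_y_continuous(limits = c(0, 20)) +
  labs(x = expression(abs(r[23])),
       y = expression(1 / (1 - r[23]^2)))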
Figure 14.1: The variance-inflation factor \(1/(1 - r_{23}^2)\) explodes as the regressors’ correlation approaches \(\pm 1\).
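A few points on the curve, computed directly, may help calibrate intuition. The correlation values below are chosen only for illustration; the second column is the variance multiplier \(1/(1 - r_{23}^2)\), and the third is its square root, which is the corresponding multiplier on the standard error.

```r
r <- c(0, 0.5, 0.9, 0.95, 0.99)   # illustrative correlations between x2 and x3
inflation <- 1 / (1 - r^2)        # multiplier on Var(b2) relative to r = 0

data.frame(r              = r,
           var_multiplier = round(inflation, 2),
           se_multiplier  = round(sqrt(inflation), 2))
```

At \(|r_{23}| = 0.9\) the variance is already about five times what it would be with uncorrelated regressors, and the standard error is a bit more than twice as large; at \(0.99\) the variance multiplier is about fifty.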

14.3 Collinearity

The fourth driver has a name. Collinearity, also called multicollinearity, is the situation where regressors are highly correlated with one another. The variance formula tells us its consequence immediately: when \(|r_{23}|\) is near \(1\), the factor \((1 - r_{23}^2)\) becomes tiny, and the standard errors become huge.

The intuition is worth dwelling on. If \(x_2\) and \(x_3\) almost always move together, then the data contain very little independent variation in \(x_2\); that is, very little movement in \(x_2\) that is not also movement in \(x_3\). But independent variation in \(x_2\) is exactly what OLS uses to identify \(x_2\)’s own effect, holding \(x_3\) fixed. You cannot cleanly separate two effects that, in the data, never separate.

Test scores, English learners, and immigrants

Suppose you are modeling school test scores and the regression already contains “percent English learners.” You then add “percent immigrants.” In most districts these two variables track each other closely, so they supply little independent variation. The coefficient on either one becomes imprecise (its standard error balloons) even though the pair clearly matters for scores.

Figure 14.2 shows what collinear regressors look like: a scatter of \((x_2, x_3)\) pairs that nearly fall on a straight line, \(x_3 \approx x_2\). There is barely any spread off the line, and that missing spread is precisely the independent variation OLS needs.

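R code for Figure 14.2:
library(ggplot2)
# `ucla` is the book's color palette, defined elsewhere; the stand-in colors below are an assumption
if (!exists("ucla")) ucla <- list(blue = "steelblue", darkblue = "navy", red = "firebrick")
# Hand-picked points that hug a straight line, to mimic near-collinear regressors
coll <- data.frame(
  x2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  x3 = c(1.3, 1.9, 3.2, 3.8, 5.3, 5.7, 7.2, 7.6, 9.1)
)
ggplot(coll, aes(x2, x3)) +
  geom_abline(slope = 0.95, intercept = 0.3,
              color = ucla$red, linewidth = 1) +
  geom_point(color = ucla$darkblue, size = 2) +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 10)) +
  labs(x = expression(x[2]), y = expression(x[3]))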
Figure 14.2: Collinear regressors: \(x_3\) is nearly a linear function of \(x_2\), so there is little independent variation in either.

Recognizing and living with collinearity

Collinearity rarely announces itself directly; you usually diagnose it from a cluster of symptoms in the regression output. The standard errors are large and the confidence intervals correspondingly wide. Coefficients come out statistically insignificant even when the group of variables clearly belongs in the model. The overall fit can be deceptive: a high \(R^2\) for the regression as a whole sits alongside very few significant individual \(t\)-statistics. And the estimates are fragile: they swing wildly when you add or drop a single variable.

Symptoms of near collinearity
  • Large standard errors and wide confidence intervals.
  • Coefficients insignificant even when the group of regressors clearly matters.
  • High overall \(R^2\) but few significant individual \(t\)-statistics.
  • Estimates that swing wildly when a regressor is added or dropped.

What can you do about it? The cleanest remedy is more or better data carrying independent variation: in principle, Andy could run an experiment that varies price and advertising more independently across his cities, breaking the correlation by design. A second option is to drop a redundant regressor, but this must be done carefully: dropping a variable that truly belongs in the model reintroduces omitted-variable bias (see model specification). The third option is often the honest one: simply accept it. Imperfect collinearity is not an error in your analysis; it is a limit on what these particular data can tell you.

What to do about collinearity
  • Get more or better data with independent variation in the regressors.
  • Drop a redundant regressor, but carefully, weighing the risk of bias.
  • Or accept it: imperfect collinearity is not a mistake, just a limit of the data.

Key point. Collinearity does not bias OLS: the estimator stays BLUE. It only makes the (still-unbiased) estimates imprecise. The point estimates are right on average; you just cannot trust any single one very far.
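A small Monte Carlo experiment illustrates the key point. This is a sketch under assumed settings (true \(\beta_2 = 2\), \(N = 50\), 1000 replications, and regressor correlations of 0 and 0.95 are all arbitrary choices, and simulate_b2 is a helper written for this illustration): it re-draws the sample many times and records \(b_2\) each time.

```r
set.seed(3)

# Hypothetical helper: draw `reps` samples and return the estimated slope on x2 from each
simulate_b2 <- function(rho, reps = 1000, N = 50) {
  replicate(reps, {
    x2 <- rnorm(N)
    x3 <- rho * x2 + sqrt(1 - rho^2) * rnorm(N)  # population corr(x2, x3) = rho
    y  <- 1 + 2 * x2 + 3 * x3 + rnorm(N)         # true beta2 = 2
    coef(lm(y ~ x2 + x3))[["x2"]]
  })
}

b2_low  <- simulate_b2(rho = 0)      # uncorrelated regressors
b2_high <- simulate_b2(rho = 0.95)   # strongly collinear regressors

c(mean_low = mean(b2_low), mean_high = mean(b2_high))  # centers of the two sampling distributions
c(sd_low   = sd(b2_low),   sd_high   = sd(b2_high))    # spreads of the two sampling distributions
```

Both averages should sit close to the true value of 2, while the spread of \(b_2\) under strong collinearity should be roughly three times the spread with uncorrelated regressors, in line with \(\sqrt{1/(1 - 0.95^2)} \approx 3.2\).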

14.4 Perfect collinearity & the dummy-variable trap

Everything above concerned near collinearity, where regressors are strongly but not exactly related. There is an extreme case where the relationship is exact, and it is qualitatively different: OLS does not merely become imprecise, it breaks down entirely.

If one regressor is an exact linear function of the others (in the two-regressor model, \(r_{23}^2 = 1\)), the factor \((1 - r_{23}^2)\) in the denominator becomes zero, and the variance formula divides by zero. This is a violation of assumption MR5 (no exact linear relationship among the regressors), and OLS cannot be computed at all: there is no unique solution.

A couple of examples make clear how easily this happens:

  • Percent versus fraction of English learners. If one column records the percentage and another the fraction, then \(\text{Pct} = 100 \times \text{Frac}\). The two are perfectly redundant: each is the other rescaled.
  • Percent English speakers versus percent learners. If everyone is either a speaker or a learner, then \(\text{PctES} = 100 - \text{PctEL}\). Together with the constant term, this is again an exact linear relationship.

It’s a logical error, not a data problem

With perfect collinearity, OLS is being asked an impossible question (“what is the effect of Pct holding Frac constant?”) when Pct and Frac always move together and so can never be held apart. The fix is not statistical but logical: respecify the model by dropping the redundant regressor. Software will warn you, or quietly drop one variable for you.
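Here is how the percent-versus-fraction example plays out in R, as a sketch on simulated data (the variable names, sample size, and coefficients are invented for illustration). The design matrix loses a rank, and lm() takes the “quietly drop one” route by reporting NA for the redundant regressor.

```r
# Simulated school data; all numbers are illustrative assumptions
set.seed(4)
N     <- 30
frac  <- runif(N, 0, 0.5)                 # fraction of English learners
pct   <- 100 * frac                       # the same information, rescaled
score <- 700 - 50 * frac + rnorm(N, sd = 5)

X <- cbind(1, pct, frac)                  # intercept column plus both regressors
qr(X)$rank                                # numerical rank; below ncol(X) = 3 when columns are redundant

fit <- lm(score ~ pct + frac)
coef(fit)                                 # lm() reports NA for the regressor it could not estimate
```

Respecifying the model as lm(score ~ pct) or lm(score ~ frac) removes the redundancy; the two versions describe the same relationship in different units.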

The dummy-variable trap

The most common way to stumble into perfect collinearity by accident is the dummy-variable trap. Suppose we partition cities into mutually exclusive and exhaustive categories (say Rural, Suburban, and Urban) and encode each as a \(0/1\) indicator variable. Because every city belongs to exactly one category, the three indicators always sum to one for every observation: \[ \texttt{Rural}_i + \texttt{Suburban}_i + \texttt{Urban}_i = 1 \quad\text{for every } i. \] But the constant \(1\) is precisely the regressor that multiplies the intercept. So including all three category dummies (in general, all \(G\)) together with an intercept creates an exact linear relationship: perfect collinearity, and OLS fails.

The dummy-variable trap

With \(G\) mutually exclusive categories, including all \(G\) dummies and an intercept guarantees perfect collinearity, because the dummies sum to the intercept’s column of ones. OLS cannot be estimated.

The fix: drop one

Include only \(G - 1\) dummies and keep the intercept, leaving one category as the base (or reference) group. Each included coefficient then measures the difference from the base category. Equivalently, you may keep all \(G\) dummies and drop the intercept instead.
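The trap and both fixes can be seen in a short sketch. The data are simulated (60 cities split evenly across the three categories, with made-up group effects), so only the structure carries over to a real application.

```r
# Simulated cities: 20 in each category; all numbers are illustrative assumptions
set.seed(5)
area     <- rep(c("Rural", "Suburban", "Urban"), times = 20)
rural    <- as.numeric(area == "Rural")
suburban <- as.numeric(area == "Suburban")
urban    <- as.numeric(area == "Urban")
y <- 10 + 2 * suburban + 5 * urban + rnorm(60)

# The trap: an intercept plus all G = 3 dummies
X_trap <- cbind(1, rural, suburban, urban)
all(rural + suburban + urban == 1)   # the dummies sum to the column of ones
qr(X_trap)$rank                      # rank falls short of the 4 columns

# Fix 1: G - 1 dummies plus an intercept (Rural is the base group)
fit_base <- lm(y ~ suburban + urban)
coef(fit_base)    # intercept = Rural mean; slopes = differences from Rural

# Fix 2: all G dummies and no intercept
fit_noint <- lm(y ~ 0 + rural + suburban + urban)
coef(fit_noint)   # each coefficient is that group's mean of y
```

The two fixes fit the data identically and differ only in how the coefficients are read: differences from the base group in the first, group means in the second. When a categorical variable is passed to lm() as a factor, R applies the first fix automatically by omitting one level.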

Looking ahead. Indicator variables (intercept shifts, slope dummies, interpreting coefficients relative to a reference group) get a full treatment in the dummy variables chapter. Here they appear only as the trap to avoid.

14.5 Recap

The slopes in a multiple regression are random variables, and this chapter built the machinery for measuring their precision.

The variance–covariance matrix collects every estimator’s variance (on the diagonal) and every pair’s covariance (off the diagonal), all built by plugging \(\hat\sigma^2 = \mathrm{SSE}/(N-K)\) into the variance formulas. Standard errors are the square roots of the diagonal; for Big Andy, \(\mathrm{se}(b_2) = 1.10\) and \(\mathrm{se}(b_3) = 0.68\). The variance of a slope decomposes as \[ \Var(b_2 \given \mathbf{X}) = \frac{\sigma^2}{(1 - r_{23}^2)\,\sum_i (x_{i2}-\bar x_2)^2}, \] whose four drivers of precision are a small error variance \(\sigma^2\), a large sample size \(N\), plenty of spread in \(x_2\), and (the new one) a low correlation \(|r_{23}|\) between regressors.

The four drivers of slope precision in two-regressor MR.
Driver of precision | Effect on precision of \(b_2\) | New to MR?
Error variance \(\sigma^2\) | larger \(\Rightarrow\) less precise | no
Sample size \(N\) | larger \(\Rightarrow\) more precise | no
Spread in \(x_2\) | larger \(\Rightarrow\) more precise | no
Correlation \(\lvert r_{23}\rvert\) | larger \(\Rightarrow\) less precise | yes

Collinearity is the fourth driver gone wrong. Near collinearity inflates standard errors, widens confidence intervals, and renders coefficients insignificant, but OLS remains unbiased, so it is a limit of the data, not a bias. The remedies are more independent variation, careful dropping of a redundant regressor, or simple acceptance. Perfect collinearity (\(r_{23}^2 = 1\)) is a different beast: it violates MR5 and leaves OLS undefined, with the dummy-variable trap (all \(G\) dummies plus an intercept) as the classic accidental cause. The fix is always to respecify, most often by dropping one dummy (or the intercept).

Next time: with standard errors in hand, we turn to inference, that is, hypothesis testing in multiple regression: \(t\)-tests on single coefficients (with \(\text{df} = N - K\)), confidence intervals, and tests of linear combinations of several coefficients, such as Andy’s joint price-and-advertising strategy.