data(andy)
mod <- lm(sales ~ price + advert, data = andy)
vcov(mod) # the variance--covariance matrix
#>             (Intercept)       price      advert
#> (Intercept)  40.3432990 -6.79506412 -0.74842060
#> price        -6.7950641  1.20120070 -0.01974215
#> advert       -0.7484206 -0.01974215  0.46675606
sqrt(diag(vcov(mod))) # standard errors = sqrt of the diagonal
#> (Intercept)       price      advert
#>   6.3516375   1.0959930   0.6831955
14 Interpreting MR: Variance & Collinearity
Reading. SW 6.4, 6.6–6.7; HGL 5.3, 6.5
In the previous chapter we estimated Big Andy’s sales “plane,” \[
\widehat{\text{SALES}} = 118.91 - 7.908\,\text{PRICE} + 1.863\,\text{ADVERT},
\] treating each slope as a point estimate. But just as in simple regression, these slopes are random variables: re-run the burger chain’s experiment on a fresh set of cities and the numbers would come out a little differently. So the natural next question is one of precision: how close to the true values are these estimates likely to be?
This chapter answers that question for multiple regression and, in doing so, introduces a complication that has no analog in the one-regressor world. We meet the variance–covariance matrix, identify what drives the precision of a slope, and confront collinearity.
14.1 The variance–covariance matrix
In simple regression there were only two estimators, \(b_1\) and \(b_2\), and we tracked each one’s variance separately. With \(K\) coefficients we have to track more: not just each estimator’s variance, but every covariance between pairs of estimators. The tidy way to hold all of this is a single object, the variance–covariance matrix.
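For three coefficients the estimated variance–covariance matrix is \[ \widehat{\Cov}(b_1,b_2,b_3) = \begin{bmatrix} \widehat{\Var}(b_1) & \widehat{\Cov}(b_1,b_2) & \widehat{\Cov}(b_1,b_3)\\ \widehat{\Cov}(b_1,b_2) & \widehat{\Var}(b_2) & \widehat{\Cov}(b_2,b_3)\\ \widehat{\Cov}(b_1,b_3) & \widehat{\Cov}(b_2,b_3) & \widehat{\Var}(b_3) \end{bmatrix}. \]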
This matrix is built exactly the way the variance was built in simple regression: by replacing the unknown error variance \(\sigma^2\) with its estimate \(\hat\sigma^2 = \mathrm{SSE}/(N-K)\) inside the variance formulas. Two parts of it serve two different purposes. The diagonal is what gives us standard errors; the off-diagonal covariances are needed whenever we study a combination of coefficients rather than one coefficient at a time.
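As a minimal sketch of that construction, the R code below rebuilds the matrix by hand on simulated data and compares it with `vcov()`. It assumes the standard matrix form \(\hat\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\), which this chapter does not write out, and the variables `x2`, `x3`, and `y` are hypothetical stand-ins, not the Big Andy data.

```r
# Simulated data (hypothetical; not the Big Andy data set)
set.seed(1)
N  <- 75
x2 <- runif(N, 4, 7)
x3 <- runif(N, 0, 3)
y  <- 100 - 8 * x2 + 2 * x3 + rnorm(N, sd = 5)

fit <- lm(y ~ x2 + x3)

X          <- model.matrix(fit)        # N x K regressor matrix, including the column of ones
K          <- ncol(X)                  # number of coefficients
sse        <- sum(resid(fit)^2)        # sum of squared residuals
sigma2_hat <- sse / (N - K)            # estimate of the error variance
V_by_hand  <- sigma2_hat * solve(t(X) %*% X)

# Compare the hand-built matrix with the one R reports
all.equal(V_by_hand, vcov(fit), check.attributes = FALSE)
```

The comparison ignores row and column labels so that only the numbers are checked.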
Big Andy’s variance–covariance matrix
For Andy’s two-regressor model the error-variance estimate is \(\hat\sigma^2 = 23.87\), and software returns the matrix
\[ \widehat{\Cov}(b_1,b_2,b_3) = \begin{bmatrix} 40.343 & -6.795 & -0.748\\ -6.795 & 1.201 & -0.020\\ -0.748 & -0.020 & 0.467 \end{bmatrix}. \]
The standard errors are simply the square roots of the diagonal entries: \[ \mathrm{se}(b_1)=\sqrt{40.343}=6.35,\quad \mathrm{se}(b_2)=\sqrt{1.201}=1.10,\quad \mathrm{se}(b_3)=\sqrt{0.467}=0.68 . \]
These are exactly the standard errors printed beneath the coefficients in the previous chapter. As a rough first reading of precision: across resampled sets of cities we would expect \(b_2\), the price slope, to land within about \(\pm 2\,(1.10) = \pm 2.2\) of the true \(\beta_2\).
In R, the entire matrix comes from vcov() applied to the fitted model, and the standard errors are the square roots of its diagonal.
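If the `andy` data set is not at hand, the same arithmetic can be checked from the printed matrix alone. The sketch below types in the rounded entries shown above; the labels `b1`, `b2`, and `b3` are chosen here for illustration.

```r
# Entries copied from the printed matrix in the text (rounded to three decimals)
V <- matrix(c(40.343, -6.795, -0.748,
              -6.795,  1.201, -0.020,
              -0.748, -0.020,  0.467),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("b1", "b2", "b3"), c("b1", "b2", "b3")))

se <- sqrt(diag(V))   # standard errors are the square roots of the diagonal
round(se, 2)

2 * se[["b2"]]        # rough +/- 2 standard-error band for the price slope
```

Because the inputs are rounded, the results agree with the `vcov()` output only to about two decimals.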
14.2 What drives precision
To understand what makes a slope precise, we dissect its variance. For a model with two regressors, the variance of the price coefficient \(b_2\) can be written out in full:
\[ \Var(b_2 \given \mathbf{X}) = \frac{\sigma^2}{(1 - r_{23}^2)\,\sum_{i}(x_{i2}-\bar x_2)^2}, \] where \(r_{23}\) is the sample correlation between the two regressors \(x_2\) and \(x_3\).
Compare this to the simple-regression variance, \(\sigma^2 / \sum_i (x_{i2} - \bar x_2)^2\). Three of the four levers are exactly the same as before. The only new ingredient is the factor \((1 - r_{23}^2)\) sitting in the denominator.
- Error variance \(\sigma^2\) (the numerator). A noisier model means a larger variance for every slope. Same as before.
- Sample size \(N\). More observations enlarge the sum \(\sum_i (x_{i2}-\bar x_2)^2\), which shrinks the variance. Same as before.
- Variation in \(x_2\), measured by \(\sum_i (x_{i2}-\bar x_2)^2\). More spread in price gives a smaller variance: you learn a slope best where the regressor moves a lot. Same as before.
- Correlation between the regressors \(r_{23}\) (new!). The factor \((1 - r_{23}^2)\) shrinks toward \(0\) as \(|r_{23}| \to 1\), so the variance explodes.
The first three drivers are familiar from the simple-regression variance chapter. The fourth is the price of admission to multiple regression.
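The two-regressor variance formula can be checked numerically. The sketch below uses simulated regressors and an assumed error variance of 4 (both hypothetical), and compares the formula with the corresponding diagonal entry of \(\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\), the standard matrix expression, which is an assumption here because the chapter does not state it.

```r
set.seed(2)
N      <- 200
sigma2 <- 4                                  # assumed (known) error variance
x2 <- rnorm(N)
x3 <- 0.8 * x2 + 0.6 * rnorm(N)              # correlated with x2 by construction

# Formula from the text: sigma^2 / ((1 - r23^2) * sum of squared deviations of x2)
r23         <- cor(x2, x3)
var_formula <- sigma2 / ((1 - r23^2) * sum((x2 - mean(x2))^2))

# Matrix route: the (2, 2) entry of sigma^2 * (X'X)^{-1}
X          <- cbind(1, x2, x3)
var_matrix <- (sigma2 * solve(t(X) %*% X))[2, 2]

c(formula = var_formula, matrix = var_matrix)   # the two should agree
```

Changing the weight on `x2` in the line that builds `x3` is a quick way to watch the fourth lever at work.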
Figure 14.1 makes the new term concrete. As \(|r_{23}|\) climbs toward \(1\), the multiplier \(1/(1 - r_{23}^2)\) grows without bound.
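library(ggplot2)  # provides ggplot(), geom_line(), and the scale_*() helpers
# `ucla` is the book's color palette, assumed to be defined in an earlier setup chunk
rs <- seq(0, 0.97, length.out = 300)                 # grid of |r23| values, stopping short of 1
vif <- data.frame(r = rs, factor = 1 / (1 - rs^2))   # variance multiplier at each |r23|
ggplot(vif, aes(r, factor)) +
  geom_line(color = ucla$blue, linewidth = 1) +
  scale_x_continuous(breaks = seq(0, 1, 0.25)) +
  scale_y_continuous(limits = c(0, 20)) +
  labs(x = expression(abs(r[23])),
       y = expression(1 / (1 - r[23]^2)))
14.3 Collinearity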
The fourth driver has a name. Collinearity is the situation in which the regressors are strongly correlated with one another.
The intuition is worth dwelling on. If \(x_2\) and \(x_3\) almost always move together, then the data contain very little independent variation in \(x_2\).
Suppose you are modeling school test scores and the regression already contains “percent English learners.” You then add “percent immigrants.” In most districts these two variables track each other closely, so they supply little independent variation. The coefficient on either one becomes imprecise.
Figure 14.2 shows what collinear regressors look like: a scatter of \((x_2, x_3)\) pairs that nearly fall on a straight line, \(x_3 \approx x_2\). There is barely any spread off the line, and that missing spread is precisely the independent variation OLS needs.
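library(ggplot2)  # provides ggplot(), geom_abline(), and geom_point()
# `ucla` is the book's color palette, assumed to be defined in an earlier setup chunk
# Nine illustrative (x2, x3) pairs that lie close to a straight line
coll <- data.frame(
  x2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  x3 = c(1.3, 1.9, 3.2, 3.8, 5.3, 5.7, 7.2, 7.6, 9.1)
)
ggplot(coll, aes(x2, x3)) +
  geom_abline(slope = 0.95, intercept = 0.3,
              color = ucla$red, linewidth = 1) +
  geom_point(color = ucla$darkblue, size = 2) +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 10)) +
  labs(x = expression(x[2]), y = expression(x[3]))
Recognizing and living with collinearity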
Collinearity rarely announces itself directly. Instead it shows up through symptoms such as:
- Large standard errors and wide confidence intervals.
- Coefficients insignificant even when the group of regressors clearly matters.
- High overall \(R^2\) but few significant individual \(t\)-statistics.
- Estimates that swing wildly when a regressor is added or dropped.
What can you do about it? The cleanest remedy is more or better data carrying independent variation.
- Get more or better data with independent variation in the regressors.
- Drop a redundant regressor, but carefully, weighing the risk of bias.
- Or accept it: imperfect collinearity is not a mistake, just a limit of the data.
Key point. Collinearity does not bias OLS; it makes the estimates imprecise.
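A small Monte Carlo experiment illustrates the key point. This is a sketch under stated assumptions: the true coefficients, the sample size, and the two correlation levels (0.10 and 0.95) are hypothetical choices, and `draw_b2()` is a helper defined here for illustration.

```r
set.seed(3)
N    <- 50                     # observations per simulated sample
reps <- 2000                   # number of simulated samples
beta <- c(1, 2, 3)             # true intercept and slopes (hypothetical)

# Draw one sample in which corr(x2, x3) is roughly rho; return the estimate of beta_2
draw_b2 <- function(rho) {
  x2 <- rnorm(N)
  x3 <- rho * x2 + sqrt(1 - rho^2) * rnorm(N)
  y  <- beta[1] + beta[2] * x2 + beta[3] * x3 + rnorm(N)
  coef(lm(y ~ x2 + x3))[["x2"]]
}

b2_low  <- replicate(reps, draw_b2(0.10))   # weakly correlated regressors
b2_high <- replicate(reps, draw_b2(0.95))   # strongly collinear regressors

# Average and spread of the estimates across samples
rbind(low  = c(mean = mean(b2_low),  sd = sd(b2_low)),
      high = c(mean = mean(b2_high), sd = sd(b2_high)))
```

In both rows the average estimate should sit near the true slope of 2, while the spread should be noticeably larger in the high-correlation row.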
14.4 Perfect collinearity & the dummy-variable trap
Everything above concerned near collinearity, where regressors are strongly but not exactly related. There is an extreme case where the relationship is exact, and it is qualitatively different: OLS does not merely become imprecise, it breaks down entirely.
If one regressor is an exact linear function of the others, then \(r_{23}^2 = 1\), the factor \((1 - r_{23}^2)\) in the denominator becomes zero, and the variance formula divides by zero. This is a violation of assumption MR5 (no exact linear relationship among the regressors), and OLS cannot be computed at all.
A couple of examples make clear how easily this happens:
- Percent versus fraction of English learners. If one column records the percentage and another the fraction, then \(\text{Pct} = 100 \times \text{Frac}\). The two are perfectly redundant: each is the other rescaled.
- Percent English speakers versus percent learners. If everyone is either a speaker or a learner, then \(\text{PctES} = 100 - \text{PctEL}\). Together with the constant term, this is again an exact linear relationship.
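With perfect collinearity, OLS is being asked an impossible question: what is the effect of changing one regressor while holding fixed another that moves in lockstep with it?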
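The percent-versus-fraction case is easy to reproduce. The sketch below uses simulated data (`frac`, `pct`, and `score` are hypothetical names and values). Note that R's `lm()` does not stop with an error in this situation: it drops the redundant column and reports `NA` for its coefficient, which is how the breakdown shows up in practice.

```r
set.seed(4)
N     <- 40
frac  <- runif(N, 0, 0.5)                    # fraction of English learners (simulated)
pct   <- 100 * frac                          # the same information, in percent
score <- 700 - 60 * frac + rnorm(N, sd = 5)  # simulated test scores

# The regressor matrix has three columns but only two independent ones
X <- cbind(1, pct, frac)
c(columns = ncol(X), rank = qr(X)$rank)

# lm() keeps pct, drops frac, and reports NA for the dropped coefficient
fit <- lm(score ~ pct + frac)
coef(fit)
```

Which of the two columns is dropped depends on the order in which they appear in the formula.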
The dummy-variable trap
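The most common way to stumble into perfect collinearity by accident is the dummy-variable trap. Suppose we partition cities into mutually exclusive and exhaustive categories (Rural, Suburban, and Urban) and create a dummy variable for each. Every city belongs to exactly one category, so \(\text{Rural}_i + \text{Suburban}_i + \text{Urban}_i = 1\) for every city.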
With \(G\) mutually exclusive categories, including all \(G\) dummies and an intercept guarantees perfect collinearity, because the dummies sum to the intercept’s column of ones. OLS cannot be estimated.
Include only \(G - 1\) dummies and keep the intercept, leaving one category as the base (or reference) group. Each included coefficient then measures the difference from the base category. Equivalently, you may keep all \(G\) dummies and drop the intercept instead.
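The trap and both remedies can be sketched on simulated data. Everything below is hypothetical: the `region` labels follow the Rural, Suburban, Urban example, and the `sales` values are invented for illustration.

```r
set.seed(5)
N      <- 60
region <- factor(sample(c("Rural", "Suburban", "Urban"), N, replace = TRUE))
sales  <- 70 + 5 * (region == "Suburban") + 9 * (region == "Urban") + rnorm(N)

# One 0/1 dummy per category
rural    <- as.numeric(region == "Rural")
suburban <- as.numeric(region == "Suburban")
urban    <- as.numeric(region == "Urban")

# The trap: all G = 3 dummies plus an intercept give 4 columns but rank 3
X_trap <- cbind(1, rural, suburban, urban)
c(columns = ncol(X_trap), rank = qr(X_trap)$rank)

# Remedy 1: G - 1 dummies with an intercept (Rural is the base group)
fit_drop <- lm(sales ~ suburban + urban)
coef(fit_drop)

# Remedy 2: all G dummies and no intercept (each coefficient is a group mean)
fit_noint <- lm(sales ~ 0 + rural + suburban + urban)
coef(fit_noint)
```

The two remedies describe the same fit: the intercept under remedy 1 equals the `rural` coefficient under remedy 2, and each remaining coefficient under remedy 1 is the difference from that base.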
Looking ahead. Indicator variables return in more detail in a later chapter.
14.5 Recap
The slopes in a multiple regression are random variables, and this chapter built the machinery for measuring their precision.
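The variance–covariance matrix collects every estimator’s variance and every pairwise covariance; the standard errors are the square roots of its diagonal. Four things drive the precision of a slope: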
| Driver of precision | Effect on \(\Var(b_2)\) | New to MR? |
|---|---|---|
| Error variance \(\sigma^2\) | larger \(\Rightarrow\) less precise | no |
| Sample size \(N\) | larger \(\Rightarrow\) more precise | no |
| Spread in \(x_2\) | larger \(\Rightarrow\) more precise | no |
| Correlation \(\lvert r_{23}\rvert\) | larger \(\Rightarrow\) less precise | yes |
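Collinearity is the fourth driver gone wrong. Near collinearity inflates standard errors, widens confidence intervals, and renders coefficients insignificant, but it does not bias OLS. Perfect collinearity, as in the dummy-variable trap, prevents OLS from being computed at all.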
Next time: with standard errors in hand, we turn to inference.