\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

13  The Multiple Regression Model

Reading. Hill, Griffiths & Lim (5th ed.), 5.1–5.2; Stock & Watson (4th ed.), 6.1–6.3.

The simple regression model has an Achilles heel. Its causal reading rests on the strict-exogeneity assumption SR2, \[ \E(e \given x) = 0 , \] but the error \(e\) holds everything else about the outcome. If any omitted factor is correlated with \(x\), then SR2 fails and the slope estimator \(b_2\) is biased: this is exactly the ability-in-the-wage-equation problem from OLS properties.

The cure is to stop hiding confounders inside \(e\) and instead put them in the regression. With more than one regressor we can finally hold other factors constant: ceteris paribus for real. This chapter introduces the multiple regression model and its partial coefficients, lays out the assumptions MR1–MR6 (one of which is genuinely new), and applies ordinary least squares to Big Andy’s Burger Barn.

13.1 Why more than one regressor?

Omitted-variable bias

A left-out variable does not always cause trouble. It biases the OLS slope only under two conditions, both of which must hold.

When does omitting a variable bias OLS? (two conditions)

A left-out variable biases \(b_2\) only if it is both

  1. correlated with the included regressor \(x\), and
  2. a determinant of the outcome \(y\) (so it sits in \(e\)).

Class size and test scores

Regress district test scores on the student–teacher ratio (STR) alone. Districts with larger classes also tend to have more English learners (a correlation of about \(0.19\)), and English learners score lower on average. The share of English learners is therefore correlated with STR and a driver of scores, so omitting it biases the estimated class-size effect. In fact the class-size effect roughly halves once the English-learner share is controlled for.

Both conditions genuinely matter. Consider instead the time of day at which a test is taken: it may well affect scores, but if it is uncorrelated with class size, then leaving it in the error is harmless, because it does not contaminate the class-size slope.
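
The two conditions can be read off the standard large-sample omitted-variable formula. As a sketch, suppose the outcome is really generated by \(y = \beta_1 + \beta_2 x + \beta_3 z + u\), where \(z\) is the left-out variable and \(u\) is unrelated to both \(x\) and \(z\) (the symbols \(z\), \(\beta_3\) and \(u\) are introduced here only for illustration). Regressing \(y\) on \(x\) alone then gives a slope that settles on \[ b_2 \;\xrightarrow{p}\; \beta_2 + \beta_3\,\frac{\Cov(x, z)}{\Var(x)} . \] The bias term is a product, so it vanishes if either factor is zero: \(\beta_3 = 0\) means \(z\) is not a determinant of \(y\), and \(\Cov(x, z) = 0\) means \(z\) is uncorrelated with the included regressor.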

The fix: put the confounder in the regression

Omitted-variable bias is nothing more than SR2 failing. A determinant of \(y\) that happens to be correlated with \(x\) lives inside \(e\), which makes \(\E(e \given x) \neq 0\). The remedy is direct: move that variable out of the error and into the model as its own regressor. Once it is an explicit regressor, OLS can estimate the effect of \(x\) holding that variable constant, and the bias it had been causing disappears.
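
A small simulation can make the remedy concrete. The sketch below uses made-up data, not the test-score data: `z` is a hypothetical confounder correlated with `x`, `w` is a hypothetical left-out factor unrelated to `x`, and the true slope on `x` is set to 2 by assumption.

```r
# Simulated illustration of omitted-variable bias (all numbers are made up).
set.seed(42)
n <- 5000
z <- rnorm(n)                      # confounder
x <- 0.6 * z + rnorm(n)            # regressor correlated with the confounder
w <- rnorm(n)                      # second left-out factor, unrelated to x
y <- 1 + 2 * x + 3 * z + 1.5 * w + rnorm(n)   # true slope on x is 2

short_fit <- lm(y ~ x)             # z and w both hidden in the error
long_fit  <- lm(y ~ x + z)         # z controlled for, w still omitted

b_short <- unname(coef(short_fit)["x"])
b_long  <- unname(coef(long_fit)["x"])
c(short = b_short, long = b_long)
```

Under these assumptions the short regression should overstate the slope, while the long regression should land close to 2 even though `w` is still sitting in the error: only the confounder that is correlated with `x` needs to be moved into the model.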

This is what "control for" means

Adding the share of English learners as a regressor lets us compare districts as if they had the same share of English learners. Multiple regression does, with continuous data, what we wished we could do by hand: hold the other factors fixed.

Caveat for later. You can only control for what you observe. Unobservable confounders, such as the “ability” term from OLS properties, still threaten SR2, and the deeper fix waits until we study treatment effects.

13.2 The model and its partial coefficients

Big Andy’s Burger Barn

Our running example throughout the multiple-regression chapters comes from HGL. A burger chain operates in 75 small cities. In each city it sets a different price and advertising budget, and it observes monthly sales. The question is how revenue responds to each lever, holding the other fixed: \[ \text{SALES} = \beta_1 + \beta_2\,\text{PRICE} + \beta_3\,\text{ADVERT} + e . \] Here SALES and ADVERT are measured in thousands of dollars and PRICE is a dollar price index. The error \(e\) collects everything else that moves sales: competitors, local demographics, the quality of each location.

More generally, the multiple regression model with \(K - 1\) regressors plus an intercept is \[ y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_K x_{iK} + e_i . \] Under strict exogeneity the regression function is \[ \E(y \given \mathbf{X}) = \beta_1 + \beta_2 x_2 + \dots + \beta_K x_K , \] which is now a plane (or, with more than two regressors, a hyperplane) rather than a line.

Partial coefficients: ceteris paribus at last

Each slope is a partial effect: the change in \(\E(y)\) from a one-unit change in that regressor, holding all the others fixed, \[ \beta_k = \frac{\partial\,\E(y \given \mathbf{X})}{\partial x_k} \qquad (\text{other } x\text{'s held constant}). \]

In Big Andy’s model the coefficients read off cleanly:

  • \(\beta_2\) is the effect of PRICE on sales, with ADVERT fixed;
  • \(\beta_3\) is the effect of ADVERT on sales, with PRICE fixed;
  • the intercept \(\beta_1 = \E(y)\) when all the \(x\)’s are zero; it is often not economically meaningful, but we keep it to pin down the plane.

Figure 13.1 sketches the regression function as a plane. The intercept \(\beta_1\) is its height above the origin, and \(\beta_2\) and \(\beta_3\) are the slopes of the plane in the PRICE and ADVERT directions respectively.

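# A small isometric sketch of the regression plane E(SALES | PRICE, ADVERT).
library(ggplot2)  # also re-exports arrow() and unit() from grid
# `ucla` is a colour list defined outside this chunk; the hex values below are
# placeholders (an assumption) so that the sketch runs on its own.
if (!exists("ucla")) {
  ucla <- list(blue = "#2774AE", darkblue = "#003B5C", gray = "#8A8A8A")
}
# Project 3D corners onto 2D with a simple oblique projection.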
proj <- function(price, advert, sales) {
  data.frame(
    x = price + 0.5 * advert,
    y = sales + 0.4 * advert
  )
}
# Four corners of a tilted plane (height falls in price, rises in advert).
corners <- rbind(
  proj(0, 0, 1.6),
  proj(3, 0, 0.7),
  proj(3, 2, 1.7),
  proj(0, 2, 2.6)
)
plane <- data.frame(x = corners$x, y = corners$y)
axes <- data.frame(
  x    = c(0, 0, 0),
  y    = c(0, 0, 0),
  xend = c(3.6, 1.0, 0),
  yend = c(0,   0.8, 3.2),
  lab  = c("PRICE", "ADVERT", "SALES")
)
b1 <- proj(0, 0, 1.6)
ggplot() +
  geom_segment(data = axes, aes(x = x, y = y, xend = xend, yend = yend),
               arrow = arrow(length = unit(0.18, "cm")), color = ucla$gray) +
  geom_polygon(data = plane, aes(x, y), fill = ucla$blue, alpha = 0.30,
               color = ucla$blue, linewidth = 1) +
  geom_point(data = b1, aes(x, y), color = ucla$darkblue, size = 2.4) +
  annotate("text", x = 3.7, y = 0.1, label = "PRICE", color = ucla$gray,
           size = 3.2, hjust = 0) +
  annotate("text", x = 1.1, y = 0.9, label = "ADVERT", color = ucla$gray,
           size = 3.2, hjust = 0) +
  annotate("text", x = 0.05, y = 3.2, label = "E(SALES | .)",
           color = ucla$darkblue, size = 3.2, hjust = 0) +
  annotate("text", x = -0.15, y = 1.6, label = "beta[1]", parse = TRUE,
           color = ucla$darkblue, size = 3.6, hjust = 1) +
  coord_equal() +
  theme_void()
Figure 13.1: The multiple-regression function is a plane. The two slopes are partial effects: \(\beta_2\) in the PRICE direction, \(\beta_3\) in the ADVERT direction.

What does “held constant” precisely mean? The Frisch–Waugh–Lovell theorem gives the formal answer: \(\beta_3\) is the effect of ADVERT after the linear influence of PRICE has been partialled out of both SALES and ADVERT. The partial coefficient is the effect of the part of ADVERT that is unrelated to PRICE.
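
The partialling-out recipe can be checked numerically. The sketch below uses simulated data whose variable names mimic Big Andy's; the numbers are invented for illustration and are not the HGL data. It compares the ADVERT slope from the full regression with the slope from a residual-on-residual regression.

```r
# Frisch-Waugh-Lovell check on simulated data (variable names mimic Big Andy's,
# but the numbers are made up for illustration).
set.seed(1)
n <- 200
price  <- runif(n, 4, 7)
advert <- 0.3 * price + runif(n, 0, 2)     # advert correlated with price
sales  <- 100 - 8 * price + 2 * advert + rnorm(n, sd = 5)

full_fit <- lm(sales ~ price + advert)

# Step 1: partial price (and the constant) out of both sales and advert.
sales_resid  <- resid(lm(sales ~ price))
advert_resid <- resid(lm(advert ~ price))

# Step 2: regress residual on residual; the slope is the partial coefficient.
fwl_fit <- lm(sales_resid ~ advert_resid)

b3_full <- unname(coef(full_fit)["advert"])
b3_fwl  <- unname(coef(fwl_fit)["advert_resid"])
all.equal(b3_full, b3_fwl)
```

The two slopes agree up to floating-point error, which is the sense in which the multiple-regression coefficient uses only the variation in ADVERT that PRICE cannot account for.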

13.3 Assumptions MR1–MR6

The multiple-regression assumptions mirror the simple-regression assumptions SR1–SR6, with one genuine newcomer. Writing \(\mathbf{X}\) for the full collection of regressors, they are:

The multiple-regression assumptions.
MR1 \(y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + e_i\)
MR2 \(\E(e_i \given \mathbf{X}) = 0\) (strict exogeneity, now for all regressors)
MR3 \(\Var(e_i \given \mathbf{X}) = \sigma^2\) (homoskedastic)
MR4 \(\Cov(e_i, e_j \given \mathbf{X}) = 0\) for \(i \neq j\)
MR5 no exact linear relationship among the regressors (new)
MR6 \(e_i \given \mathbf{X} \sim N(0, \sigma^2)\) (optional)

Two points about MR2 deserve emphasis. First, it must now hold for every regressor: the bar for “no confounders” is higher, because each included variable must be uncorrelated with the error. Second, MR2 implies both that \(\E(e_i) = 0\) and that \(\Cov(e_i, x_{jk}) = 0\) for every observation \(j\) and every regressor \(k\).
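
Both implications follow from the law of iterated expectations; a short derivation, using only MR2 and the fact that \(x_{jk}\) is part of \(\mathbf{X}\): \[ \E(e_i) = \E\bigl[\E(e_i \given \mathbf{X})\bigr] = 0, \qquad \Cov(e_i, x_{jk}) = \E(x_{jk}\,e_i) - \E(x_{jk})\,\E(e_i) = \E\bigl[x_{jk}\,\E(e_i \given \mathbf{X})\bigr] - 0 = 0 . \]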

MR5: no exact linear relationship

MR5: no perfect collinearity

No regressor may be written as an exact linear combination of the others (including the constant). If one can, OLS cannot separate their effects: the estimation formulas divide by zero.

The assumption is required because violating it asks an impossible question. Suppose you tried to include both the percentage and the fraction of English learners, where \(\text{Pct} = 100 \times \text{Frac}\). There is no way for OLS to find “the effect of Pct holding Frac constant,” because the two move together perfectly: you can never change one while keeping the other fixed.
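
To see what software does with this impossible question, here is a sketch on simulated data (the variable names and numbers are hypothetical). In R, `lm()` does not stop with an error; it reports `NA` for a coefficient it cannot separate.

```r
# Perfect collinearity on simulated data: pct is an exact multiple of frac.
set.seed(7)
n <- 100
frac  <- runif(n, 0, 0.5)          # share of English learners, as a fraction
pct   <- 100 * frac                # the same variable, as a percentage
score <- 700 - 60 * frac + rnorm(n, sd = 10)

collinear_fit <- lm(score ~ frac + pct)
coef(collinear_fit)                # lm() reports NA for the redundant column

# The design matrix has three columns but only rank two.
X <- model.matrix(~ frac + pct)
c(columns = ncol(X), rank = qr(X)$rank)
```

The rank deficiency is the matrix version of "dividing by zero": three coefficients are requested, but the data carry only two independent directions of variation.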

MR5 also generalizes the simple-regression assumption SR5. The requirement there that “\(x\) must take at least two values” is just the one-regressor special case: a regressor that never varies is an exact multiple of the constant term, so it is perfectly collinear with the intercept.

Perfect vs. near collinearity

MR5 rules out perfect collinearity only. Near-collinear regressors (variables that move together strongly but not exactly) are allowed. As the next chapter shows, however, near collinearity inflates standard errors and makes the slopes hard to pin down.

What the assumptions buy: Gauss–Markov, again

Gauss–Markov for multiple regression

If MR1–MR5 hold, the OLS estimators \(b_1, \dots, b_K\) are the Best Linear Unbiased Estimators (BLUE) of \(\beta_1, \dots, \beta_K\).

Everything from OLS properties carries over without change. The estimators are linear in the data, unbiased (\(\E(b_k \given \mathbf{X}) = \beta_k\) for every \(k\)), and have the smallest variance in the class of linear unbiased estimators.
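
Unbiasedness is a statement about repeated samples, which a Monte Carlo experiment can mimic. The sketch below is illustrative only: the parameter values, the regressor design and the error standard deviation are all made up, and the regressors are held fixed across replications so that the average approximates \(\E(b_k \given \mathbf{X})\).

```r
# Monte Carlo sketch of unbiasedness; the design and parameter values are made up.
set.seed(123)
n    <- 75
beta <- c(10, -2, 1)                       # hypothetical beta_1, beta_2, beta_3
x2   <- runif(n, 4, 7)
x3   <- 0.5 * x2 + runif(n, 0, 2)          # regressors held fixed across samples
X    <- cbind(1, x2, x3)

one_draw <- function() {
  y <- drop(X %*% beta) + rnorm(n, sd = 3) # fresh errors satisfying MR2-MR4
  coef(lm(y ~ x2 + x3))
}

draws <- replicate(2000, one_draw())       # 3 x 2000 matrix of estimates
mc_mean <- rowMeans(draws)
rbind(truth = beta, mc_mean = mc_mean)
```

Any single column of `draws` can miss the truth by a wide margin; it is the average over replications that should sit close to the true coefficients.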

Adding MR6 (normal errors) makes each \(b_k\) exactly normal, which gives the exact \(t\)-based inference we develop in hypothesis testing. Even without MR6 the same inference holds approximately in large samples, thanks to the central limit theorem.

The conceptual machinery, in short, is unchanged from the simple model. Only the bookkeeping grows: more coefficients to estimate, and degrees of freedom \(N - K\) rather than \(N - 2\).

13.4 OLS estimation and Big Andy’s results

Least squares, same principle

OLS chooses \(b_1, \dots, b_K\) to minimize the sum of squared residuals, the same idea as in OLS estimation, just with more terms: \[ \min_{b_1,\dots,b_K}\ \sum_{i=1}^{N} \bigl(y_i - b_1 - b_2 x_{i2} - \dots - b_K x_{iK}\bigr)^2 . \]

Setting the \(K\) partial derivatives to zero yields \(K\) normal equations in \(K\) unknowns, solved in one step. By hand the formulas are messy (they are most naturally written with matrix algebra in advanced courses), so we let software do the arithmetic and concentrate on reading the output. As always, \(b_1, \dots, b_K\) are estimators and therefore random variables; the numbers from one particular sample are estimates.
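
For readers curious about the matrix route, the \(K\) normal equations can be stacked as \((\mathbf{X}'\mathbf{X})\,\mathbf{b} = \mathbf{X}'\mathbf{y}\), where \(\mathbf{X}\) includes a column of ones. The sketch below solves that system directly on simulated data (the data-generating values are invented) and compares the answer with `lm()`. It is meant as an illustration of the algebra; `lm()` itself uses a QR decomposition rather than this formula.

```r
# Normal equations (X'X) b = X'y on simulated data, compared with lm().
set.seed(2024)
n  <- 60
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.5 * x2 - 1.5 * x3 + rnorm(n)

X <- cbind(intercept = 1, x2 = x2, x3 = x3)   # N x K design matrix, K = 3

# Solve the K normal equations for the K unknowns in one step.
b_normal <- drop(solve(crossprod(X), crossprod(X, y)))
b_lm     <- coef(lm(y ~ x2 + x3))

cbind(normal_equations = b_normal, lm = b_lm)
```

The two columns should match to many decimal places, confirming that the software is solving the same minimization problem stated above.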

Big Andy’s: the fitted equation

Running OLS on the 75 cities is a one-line call in R. We fit the model and read off the coefficients, with standard errors in the second column.

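# andy is assumed to come from an already-attached companion data package
# (for example POE5Rdata for HGL 5th ed.); attach it before calling data().
data(andy)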
andy_fit <- lm(sales ~ price + advert, data = andy)
summary(andy_fit)
#> 
#> Call:
#> lm(formula = sales ~ price + advert, data = andy)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13.4825  -3.1434  -0.3456   2.8754  11.3049 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 118.9136     6.3516  18.722  < 2e-16 ***
#> price        -7.9079     1.0960  -7.215 4.42e-10 ***
#> advert        1.8626     0.6832   2.726  0.00804 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.886 on 72 degrees of freedom
#> Multiple R-squared:  0.4483, Adjusted R-squared:  0.4329 
#> F-statistic: 29.25 on 2 and 72 DF,  p-value: 5.041e-10

Writing the result as a fitted equation with standard errors beneath each estimate, \[ \widehat{\text{SALES}} = \underset{(6.35)}{118.91} \;\underset{(1.096)}{-\,7.908}\,\text{PRICE} \;\underset{(0.683)}{+\,1.863}\,\text{ADVERT}, \qquad R^2 = 0.448 . \]

The two slopes tell the economic story.

Price: \(b_2 = -7.908\)

Holding advertising fixed, a $1 increase in the price index lowers mean monthly revenue by $7,908 (a more realistic 10-cent cut raises revenue by about $791). Revenue falls when price rises, which means demand is price-elastic.
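
The 10-cent figure is just the slope scaled by the size of the price change, with the result converted from thousands of dollars: \[ \Delta\widehat{\text{SALES}} = b_2 \times \Delta\text{PRICE} = (-7.908)\times(-0.10) = 0.7908 \ \text{thousand dollars} \approx \$791 . \]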

Advertising: \(b_3 = 1.863\)

Holding price fixed, spending $1,000 more on advertising raises mean revenue by $1,863. Whether that increase is actually profitable (that is, whether \(\beta_3 > 1\)) is a hypothesis test we take up in hypothesis testing.

The intercept. The estimate \(b_1 = 118.914\) puts predicted sales at $118,914 when price and advertising are both zero, which is economically impossible. We keep it only to pin down the height of the plane, not for interpretation.

Error variance, fit, and a prediction

The estimated error variance now divides the sum of squared errors by the degrees of freedom \(N - K = 75 - 3 = 72\): \[ \hat\sigma^2 = \frac{\mathrm{SSE}}{N - K} = \frac{1718.94}{72} = 23.87, \qquad \hat\sigma = \sqrt{23.87} = 4.89 . \] The goodness-of-fit measure is the familiar \(R^2 = 1 - \mathrm{SSE}/\mathrm{SST} = 0.448\): price and advertising together explain 44.8% of the variation in sales. We can pull these quantities straight out of the fitted object.

N   <- nobs(andy_fit)
K   <- length(coef(andy_fit))
SSE <- sum(resid(andy_fit)^2)
c(N = N, K = K, df = N - K,
  sigma2_hat = SSE / (N - K),
  sigma_hat  = sqrt(SSE / (N - K)),
  R2 = summary(andy_fit)$r.squared)
#>          N          K         df sigma2_hat  sigma_hat         R2 
#> 75.0000000  3.0000000 72.0000000 23.8742075  4.8861240  0.4482578

To form a prediction, plug a chosen price and advertising level into the fitted equation. At \(\text{PRICE} = 5.50\) and \(\text{ADVERT} = 1.2\), \[ \widehat{\text{SALES}} = 118.91 - 7.908(5.5) + 1.863(1.2) \approx 77.66 , \] that is, predicted monthly revenue of about $77,656 (the last digits come from the unrounded coefficients).

predict(andy_fit, newdata = data.frame(price = 5.5, advert = 1.2))
#>        1 
#> 77.65551

Where the only change is. Relative to simple regression, the lone arithmetic difference here is that \(\hat\sigma^2\) divides by \(N - K\) (with \(K = 3\)) instead of \(N - 2\): one degree of freedom is spent per estimated coefficient.

A standing caution

The negative price coefficient does not say “cut price to zero.” An estimated model describes the data’s neighborhood; extrapolating to extreme values far outside the observed range invites disaster.

13.5 Recap

We add regressors to escape omitted-variable bias, which strikes only when a confounder is (i) correlated with \(x\) and (ii) a determinant of \(y\). The fix is to include the confounder so that OLS holds it constant. The multiple regression model is \[ y = \beta_1 + \beta_2 x_2 + \dots + \beta_K x_K + e , \] each slope \(\beta_k = \partial\,\E(y)/\partial x_k\) is a partial (ceteris paribus) effect, and the regression function is a plane.

The assumptions MR1–MR4 and MR6 carry over from the simple model; the new one is MR5, no perfect collinearity. Together MR1–MR5 make OLS BLUE. For Big Andy’s Burger Barn, \[ \widehat{\text{SALES}} = 118.9 - 7.91\,\text{PRICE} + 1.86\,\text{ADVERT}, \] demand is price-elastic, \(\hat\sigma^2 = \mathrm{SSE}/(N - K) = 23.87\), and \(R^2 = 0.448\).

Next time: how reliable are these slopes? In variance and collinearity we build the variance–covariance matrix, see what drives the standard errors, and meet the regression headache of collinearity, which arises when regressors move together.