---
title: "The Multiple Regression Model"
---
{{< include _setup.qmd >}}
> **Reading.** Hill, Griffiths & Lim (5th ed.), §5.1–5.2; Stock & Watson (4th ed.), §6.1–6.3.
The simple regression model has an Achilles heel. Its causal reading rests on the
strict-exogeneity assumption SR2,
$$
\E(e \given x) = 0 ,
$$
but the error $e$ holds *everything else* about the outcome. If any omitted factor
is correlated with $x$, then SR2 fails and the slope estimator $b_2$ is **biased** —
this is exactly the ability-in-the-wage-equation problem from [OLS
properties](07-ols-properties.qmd).
The cure is to stop hiding confounders inside $e$ and instead put them *in the
regression*. With more than one regressor we can finally hold other factors
constant — *ceteris paribus* for real. This chapter introduces the **multiple
regression model** and its *partial* coefficients, lays out the assumptions
**MR1–MR6** (one of which is genuinely new), and applies ordinary least squares
to Big Andy's Burger Barn.
## Why more than one regressor? {#sec-why-mr}
### Omitted-variable bias
A left-out variable does not always cause trouble. It biases the OLS slope only
under two conditions, both of which must hold.
::: {.keyidea title="When does omitting a variable bias OLS? (two conditions)"}
A left-out variable biases $b_2$ only if it is **both**
1. correlated with the included regressor $x$, *and*
2. a determinant of the outcome $y$ (so it sits in $e$).
:::
::: {.example title="Class size and test scores"}
Regress district test scores on the student–teacher ratio (STR) alone. Districts
with larger classes also tend to have more *English learners* (a correlation of
about $0.19$), and English learners score lower on average. The share of English
learners is therefore correlated with STR *and* a driver of scores, so omitting it
**biases** the estimated class-size effect. In fact the class-size effect roughly
*halves* once the English-learner share is controlled for.
:::
Both conditions genuinely matter. Consider instead the time of day at which a test
is taken: it may well affect scores, but if it is *uncorrelated* with class size,
then leaving it in the error is harmless — it does not contaminate the class-size
slope.
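The two conditions can be read off a standard large-sample formula. As a sketch, suppose (in notation assumed here for illustration) that the true model is $y = \beta_1 + \beta_2 x + \gamma z + u$ with $\E(u \given x, z) = 0$, but we regress $y$ on $x$ alone, so that $z$ sits in the error. In large samples the OLS slope then settles on

$$
b_2 \;\longrightarrow\; \beta_2 + \gamma\,\frac{\Cov(x, z)}{\Var(x)} .
$$

The second term is the bias. It vanishes if either $\gamma = 0$ (the omitted variable does not determine $y$) or $\Cov(x, z) = 0$ (it is uncorrelated with $x$), and it is nonzero only when both conditions hold.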
### The fix: put the confounder in the regression
Omitted-variable bias is nothing more than SR2 failing. A determinant of $y$ that
happens to be correlated with $x$ lives inside $e$, which makes
$\E(e \given x) \neq 0$. The remedy is direct: move that variable *out* of the
error and *into* the model as its own regressor. Once it is an explicit regressor,
OLS can estimate the effect of $x$ *holding that variable constant*, and the bias
it had been causing disappears.
::: {.keyidea title="This is what \"control for\" means"}
Adding the share of English learners as a regressor lets us compare districts *as
if* they had the same share of English learners. Multiple regression does, with
continuous data, what we wished we could do by hand: hold the other factors fixed.
:::
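A small simulation can make the fix concrete. The sketch below uses made-up data (the names `x`, `z`, `y` and all coefficient values are illustrative assumptions, not the class-size data): `z` both determines `y` and is correlated with `x`, so the short regression is biased, while adding `z` as a regressor recovers a slope close to the true value of 2.

```{r}
# Illustrative simulation (hypothetical data): the true slope on x is 2.
set.seed(1)
n <- 10000
x <- rnorm(n)
z <- 0.5 * x + rnorm(n)            # confounder: correlated with x ...
y <- 1 + 2 * x + 3 * z + rnorm(n)  # ... and a determinant of y

short_fit <- lm(y ~ x)       # z is left in the error
long_fit  <- lm(y ~ x + z)   # z is controlled for

b_short <- unname(coef(short_fit)["x"])  # should be near 2 + 3 * 0.5 = 3.5
b_long  <- unname(coef(long_fit)["x"])   # should be near 2
c(short = b_short, long = b_long)
```

The gap between the two slopes is the omitted-variable bias; it disappears once the confounder is an explicit regressor.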
::: {.callout-note appearance="simple"}
**Caveat for later.** You can only control for what you *observe*. Unobservable
confounders — the "ability" term from [OLS properties](07-ols-properties.qmd) —
still threaten SR2, and the deeper fix waits until we study [treatment
effects](20-treatment-effects.qmd).
:::
## The model and its partial coefficients {#sec-model}
### Big Andy's Burger Barn
Our running example throughout the multiple-regression chapters comes from HGL. A
burger chain operates in 75 small cities. In each city it sets a different
**price** and **advertising** budget, and it observes monthly **sales**. The
question is how revenue responds to each lever — *holding the other fixed*:
$$
\text{SALES} = \beta_1 + \beta_2\,\text{PRICE} + \beta_3\,\text{ADVERT} + e .
$$
Here SALES and ADVERT are measured in thousands of dollars and PRICE is a dollar
price index. The error $e$ collects everything else that moves sales: competitors,
local demographics, the quality of each location.
More generally, the multiple regression model with $K - 1$ regressors plus an
intercept is
$$
y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_K x_{iK} + e_i .
$$
Under strict exogeneity the regression function is
$$
\E(y \given \mathbf{X}) = \beta_1 + \beta_2 x_2 + \dots + \beta_K x_K ,
$$
which is now a **plane** (or, with more than two regressors, a hyperplane) rather
than a line.
### Partial coefficients: *ceteris paribus* at last
Each slope is a **partial effect** — the change in $\E(y)$ from a one-unit change
in *that* regressor, holding all the others fixed:
$$
\beta_k = \frac{\partial\,\E(y \given \mathbf{X})}{\partial x_k}
\qquad (\text{other } x\text{'s held constant}).
$$
In Big Andy's model the coefficients read off cleanly:
- $\beta_2$ is the effect of PRICE on sales, with ADVERT fixed;
- $\beta_3$ is the effect of ADVERT on sales, with PRICE fixed;
- the intercept $\beta_1$ is the value of $\E(y)$ when *all* the $x$'s are zero; it is often not
economically meaningful, but we keep it to pin down the plane.
@fig-plane sketches the regression function as a plane. The intercept $\beta_1$ is
its height above the origin, and $\beta_2$ and $\beta_3$ are the slopes of the
plane in the PRICE and ADVERT directions respectively.
```{r}
#| label: fig-plane
#| fig-cap: "The multiple-regression function is a plane. The two slopes are partial effects: $\\beta_2$ in the PRICE direction, $\\beta_3$ in the ADVERT direction."
#| fig-width: 5
#| fig-height: 3.4
# A small isometric sketch of the regression plane E(SALES | PRICE, ADVERT).
# Project 3D corners onto 2D with a simple oblique projection.
proj <- function(price, advert, sales) {
  data.frame(
    x = price + 0.5 * advert,
    y = sales + 0.4 * advert
  )
}
# Four corners of a tilted plane (height falls in price, rises in advert).
corners <- rbind(
  proj(0, 0, 1.6),
  proj(3, 0, 0.7),
  proj(3, 2, 1.7),
  proj(0, 2, 2.6)
)
plane <- data.frame(x = corners$x, y = corners$y)
axes <- data.frame(
  x = c(0, 0, 0),
  y = c(0, 0, 0),
  xend = c(3.6, 1.0, 0),
  yend = c(0, 0.8, 3.2),
  lab = c("PRICE", "ADVERT", "SALES")
)
b1 <- proj(0, 0, 1.6)
ggplot() +
  geom_segment(data = axes, aes(x = x, y = y, xend = xend, yend = yend),
               arrow = arrow(length = unit(0.18, "cm")), color = ucla$gray) +
  geom_polygon(data = plane, aes(x, y), fill = ucla$blue, alpha = 0.30,
               color = ucla$blue, linewidth = 1) +
  geom_point(data = b1, aes(x, y), color = ucla$darkblue, size = 2.4) +
  annotate("text", x = 3.7, y = 0.1, label = "PRICE", color = ucla$gray,
           size = 3.2, hjust = 0) +
  annotate("text", x = 1.1, y = 0.9, label = "ADVERT", color = ucla$gray,
           size = 3.2, hjust = 0) +
  annotate("text", x = 0.05, y = 3.2, label = "E(SALES | .)",
           color = ucla$darkblue, size = 3.2, hjust = 0) +
  annotate("text", x = -0.15, y = 1.6, label = "beta[1]", parse = TRUE,
           color = ucla$darkblue, size = 3.6, hjust = 1) +
  coord_equal() +
  theme_void()
```
::: {.callout-note appearance="simple"}
**What does "held constant" precisely mean?** The Frisch–Waugh–Lovell theorem
gives the formal answer: $\beta_3$ is the effect of ADVERT *after the linear
influence of PRICE has been partialled out* of both SALES and ADVERT. The partial
coefficient is the effect of the part of ADVERT that is unrelated to PRICE.
:::
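The partialling-out reading can be checked numerically. The following sketch uses simulated data (the data frame `sim` and its coefficient values are hypothetical, chosen only to resemble the Big Andy setup). It compares the ADVERT coefficient from the full regression with the slope obtained by regressing the PRICE-residualized sales on the PRICE-residualized advertising.

```{r}
# Frisch-Waugh-Lovell check on simulated (hypothetical) data.
set.seed(42)
n   <- 200
sim <- data.frame(price = runif(n, 4, 7))
sim$advert <- 0.3 * sim$price + runif(n, 0, 2)   # advert correlated with price
sim$sales  <- 120 - 8 * sim$price + 2 * sim$advert + rnorm(n, sd = 5)

# Full regression: partial effect of advert, holding price fixed.
full_fit <- lm(sales ~ price + advert, data = sim)

# Step 1: remove the linear influence of price from sales and from advert.
sales_res  <- resid(lm(sales ~ price, data = sim))
advert_res <- resid(lm(advert ~ price, data = sim))

# Step 2: regress the residualized outcome on the residualized regressor.
fwl_fit <- lm(sales_res ~ advert_res)

b_full <- unname(coef(full_fit)["advert"])
b_fwl  <- unname(coef(fwl_fit)["advert_res"])
all.equal(b_full, b_fwl)   # the two slopes agree up to rounding error
```

The agreement is an algebraic identity rather than a feature of this particular sample, so it holds for any data set in which both regressions can be computed.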
## Assumptions MR1–MR6 {#sec-assumptions}
The multiple-regression assumptions mirror the simple-regression assumptions
SR1–SR6, with one genuine newcomer. Writing $\mathbf{X}$ for the full collection
of regressors, they are:
| | |
|---|---|
| **MR1** | $y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + e_i$ |
| **MR2** | $\E(e_i \given \mathbf{X}) = 0$ (strict exogeneity — now for *all* regressors) |
| **MR3** | $\Var(e_i \given \mathbf{X}) = \sigma^2$ (homoskedastic) |
| **MR4** | $\Cov(e_i, e_j \given \mathbf{X}) = 0$ for $i \neq j$ |
| **MR5** | no exact linear relationship among the regressors (new) |
| **MR6** | $e_i \given \mathbf{X} \sim N(0, \sigma^2)$ (optional) |
: The multiple-regression assumptions. {.striped}
Two points about MR2 deserve emphasis. First, it must now hold for *every*
regressor: the bar for "no confounders" is higher, because each included variable
must be uncorrelated with the error. Second, MR2 implies both that $\E(e_i) = 0$
and that $\Cov(e_i, x_{jk}) = 0$ for every regressor $k$ and every pair of observations $i, j$.
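Both implications follow from the law of iterated expectations. A short derivation, using only MR2:

$$
\E(e_i) = \E\bigl[\E(e_i \given \mathbf{X})\bigr] = \E(0) = 0 ,
$$

$$
\Cov(e_i, x_{jk}) = \E(e_i x_{jk}) - \E(e_i)\,\E(x_{jk})
  = \E\bigl[x_{jk}\,\E(e_i \given \mathbf{X})\bigr] - 0 = 0 .
$$

The second line uses the fact that $x_{jk}$ is part of $\mathbf{X}$, so it can be pulled outside the inner conditional expectation.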
### MR5: no exact linear relationship
::: {.definition title="MR5 — no perfect collinearity"}
No regressor may be written as an **exact linear combination** of the others
(including the constant). If one can, OLS *cannot* separate their effects — the
estimation formulas divide by zero.
:::
The assumption is required because violating it asks an impossible question.
Suppose you tried to include both the *percentage* and the *fraction* of English
learners, where $\text{Pct} = 100 \times \text{Frac}$. There is no way for OLS to
find "the effect of Pct holding Frac constant," because the two move together
perfectly — you can never change one while keeping the other fixed.
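In R the problem shows up directly in the output. The sketch below uses simulated data (the names `frac`, `pct`, `score` and the coefficient values are illustrative assumptions): because `pct` is exactly `100 * frac`, `lm()` cannot estimate both slopes and reports `NA` for one of them.

```{r}
# Illustrative data: pct is an exact linear function of frac.
set.seed(7)
n     <- 100
frac  <- runif(n, 0, 0.8)                     # fraction of English learners
pct   <- 100 * frac                           # same information in percent
score <- 680 - 50 * frac + rnorm(n, sd = 10)  # hypothetical test scores

collinear_fit <- lm(score ~ frac + pct)
coef(collinear_fit)   # one of the two slopes is NA (not estimable)
```

Dropping either `frac` or `pct` restores a well-defined regression; the two versions differ only in the units of the slope.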
MR5 also generalizes the simple-regression assumption SR5. The requirement there
that "$x$ must take at least two values" is just the one-regressor special case: a
regressor that never varies is an exact multiple of the constant term, so it is
perfectly collinear with the intercept.
::: {.warningbox title="Perfect vs. near collinearity"}
MR5 rules out *perfect* collinearity only. *Near*-collinear regressors — variables
that move together strongly but not exactly — are allowed. As the [next
chapter](14-mr-variance-collinearity.qmd) shows, however, near collinearity
inflates standard errors and makes the slopes hard to pin down.
:::
### What the assumptions buy: Gauss–Markov, again
::: {.property title="Gauss–Markov for multiple regression"}
If **MR1–MR5** hold, the OLS estimators $b_1, \dots, b_K$ are the **Best Linear
Unbiased Estimators** (BLUE) of $\beta_1, \dots, \beta_K$.
:::
Everything from [OLS properties](07-ols-properties.qmd) carries over without
change. The estimators are linear in the data, unbiased
($\E(b_k \given \mathbf{X}) = \beta_k$ for every $k$), and have the smallest
variance in the class of linear unbiased estimators.
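Unbiasedness is a statement about repeated samples, which a Monte Carlo sketch can illustrate. In the simulation below (all names and parameter values are hypothetical), the regressors are held fixed, fresh errors are drawn many times, and the OLS estimates are averaged across draws. The averages should land close to the true coefficients.

```{r}
# Monte Carlo sketch: average the OLS estimates over repeated samples.
set.seed(123)
n    <- 50
beta <- c(1, -2, 0.5)          # hypothetical true (beta1, beta2, beta3)
x2   <- runif(n, 0, 5)         # regressors held fixed across draws
x3   <- 0.4 * x2 + rnorm(n)    # correlated with x2, but not perfectly

one_draw <- function() {
  e <- rnorm(n)                # homoskedastic, uncorrelated, mean-zero errors
  y <- beta[1] + beta[2] * x2 + beta[3] * x3 + e
  coef(lm(y ~ x2 + x3))
}

draws  <- replicate(2000, one_draw())  # 3 x 2000 matrix of estimates
b_mean <- rowMeans(draws)              # Monte Carlo average of each b_k
round(b_mean, 3)
```

Any single draw can miss the truth by a fair margin; it is only the average over draws that sits on the true values.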
Adding **MR6** (normal errors) makes each $b_k$ exactly normal, which gives the
exact $t$-based inference we develop in [hypothesis
testing](15-mr-hypothesis-testing.qmd). Even without MR6 the same inference holds
*approximately* in large samples, thanks to the central limit theorem.
The conceptual machinery, in short, is unchanged from the simple model. Only the
*bookkeeping* grows: more coefficients to estimate, and degrees of freedom $N - K$
rather than $N - 2$.
## OLS estimation and Big Andy's results {#sec-ols}
### Least squares, same principle
OLS chooses $b_1, \dots, b_K$ to minimize the sum of squared residuals — the
identical idea as in [OLS estimation](06-ols-estimation.qmd), just with more
terms:
$$
\min_{b_1,\dots,b_K}\ \sum_{i=1}^{N}
\bigl(y_i - b_1 - b_2 x_{i2} - \dots - b_K x_{iK}\bigr)^2 .
$$
Setting the $K$ partial derivatives to zero yields $K$ *normal equations* in $K$
unknowns, solved in one step. By hand the formulas are messy — they are most
naturally written with matrix algebra in advanced courses — so we let software do
the arithmetic and concentrate on *reading* the output. As always, $b_1, \dots,
b_K$ are random-variable **estimators**; the numbers from one particular sample are
**estimates**.
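For readers curious about the matrix form: stacking the regressors (with a column of ones) into $\mathbf{X}$, the normal equations are $\mathbf{X}'\mathbf{X}\,\mathbf{b} = \mathbf{X}'\mathbf{y}$. The sketch below, on simulated data with hypothetical names and coefficient values, solves them directly and confirms that the result matches `lm()`.

```{r}
# Solve the normal equations X'X b = X'y by hand and compare with lm().
set.seed(2024)
n  <- 75
x2 <- runif(n, 4, 7)
x3 <- runif(n, 0, 3)
y  <- 100 - 8 * x2 + 2 * x3 + rnorm(n, sd = 5)     # hypothetical model

X        <- cbind(1, x2, x3)                       # N x K design matrix
b_normal <- solve(crossprod(X), crossprod(X, y))   # solves X'X b = X'y
b_lm     <- coef(lm(y ~ x2 + x3))

cbind(normal_eq = drop(b_normal), lm = b_lm)
all.equal(unname(drop(b_normal)), unname(b_lm))
```

Internally `lm()` uses a QR decomposition rather than forming $\mathbf{X}'\mathbf{X}$, which is numerically more stable; the two routes give the same answer on well-behaved data like this.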
### Big Andy's: the fitted equation
Running OLS on the 75 cities is a one-line call in R. We fit the model and read
off the coefficients, with standard errors in the second column.
```{r}
#| code-fold: false
data(andy)
andy_fit <- lm(sales ~ price + advert, data = andy)
summary(andy_fit)
```
Writing the result as a fitted equation with standard errors beneath each
estimate,
$$
\widehat{\text{SALES}} = \underset{(6.35)}{118.91}
\;\underset{(1.096)}{-\,7.908}\,\text{PRICE}
\;\underset{(0.683)}{+\,1.863}\,\text{ADVERT},
\qquad R^2 = 0.448 .
$$
The two slopes tell the economic story.
::: {.example title="Price: $b_2 = -7.908$"}
Holding advertising fixed, a \$1 increase in the price index lowers mean monthly
revenue by **\$7,908** (a more realistic 10-cent cut raises revenue by about
\$791). Because revenue *falls* when price rises, quantity sold must drop more
than proportionally, which means demand is **price-elastic**.
:::
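The 10-cent figure is just the slope scaled by the price change. Holding ADVERT fixed, and recalling that SALES is measured in thousands of dollars,

$$
\Delta\widehat{\text{SALES}} = b_2 \times \Delta\text{PRICE} = (-7.908)(-0.10) = 0.7908 ,
$$

that is, about \$791 of additional monthly revenue.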
::: {.example title="Advertising: $b_3 = 1.863$"}
Holding price fixed, spending \$1,000 more on advertising raises mean revenue by
**\$1,863**. Whether that increase is actually *profitable* — that is, whether
$\beta_3 > 1$ — is a hypothesis test we take up in [hypothesis
testing](15-mr-hypothesis-testing.qmd).
:::
::: {.callout-note appearance="simple"}
**The intercept.** The estimate $b_1 = 118.91$ corresponds to predicted sales of
\$118,914 at zero price *and* zero advertising, which is economically impossible.
We keep it only to pin down the height of the plane, not for interpretation.
:::
### Error variance, fit, and a prediction
The estimated error variance is the sum of squared residuals (SSE) divided by the
degrees of freedom $N - K = 75 - 3 = 72$:
$$
\hat\sigma^2 = \frac{\mathrm{SSE}}{N - K} = \frac{1718.94}{72} = 23.87,
\qquad
\hat\sigma = \sqrt{23.87} = 4.89 .
$$
The goodness-of-fit measure is the familiar
$R^2 = 1 - \mathrm{SSE}/\mathrm{SST} = 0.448$: price and advertising together
explain **44.8%** of the variation in sales. We can pull these quantities
straight out of the fitted object.
```{r}
#| code-fold: false
N <- nobs(andy_fit)
K <- length(coef(andy_fit))
SSE <- sum(resid(andy_fit)^2)
c(N = N, K = K, df = N - K,
  sigma2_hat = SSE / (N - K),
  sigma_hat = sqrt(SSE / (N - K)),
  R2 = summary(andy_fit)$r.squared)
```
To form a **prediction**, plug a chosen price and advertising level into the
fitted equation. At $\text{PRICE} = 5.50$ and $\text{ADVERT} = 1.2$,
$$
\widehat{\text{SALES}} = 118.91 - 7.908(5.5) + 1.863(1.2) \approx 77.66 ,
$$
that is, predicted monthly revenue of about \$77,656 when computed from the
unrounded coefficients (the rounded values shown here give 77.65).
```{r}
#| code-fold: false
predict(andy_fit, newdata = data.frame(price = 5.5, advert = 1.2))
```
::: {.callout-note appearance="simple"}
**Where the only change is.** Relative to simple regression, the lone arithmetic
difference here is that $\hat\sigma^2$ divides by $N - K$ (with $K = 3$) instead of
$N - 2$ — one degree of freedom is spent per estimated coefficient.
:::
::: {.warningbox title="A standing caution"}
The negative price coefficient does *not* say "cut price to zero." An estimated
model describes the data's neighborhood; **extrapolating** to extreme values far
outside the observed range invites disaster.
:::
## Recap {#sec-recap}
We add regressors to escape **omitted-variable bias**, which strikes only when a
confounder is (i) correlated with $x$ *and* (ii) a determinant of $y$. The fix is
to include the confounder so that OLS holds it constant. The **multiple regression
model** is
$$
y = \beta_1 + \beta_2 x_2 + \dots + \beta_K x_K + e ,
$$
each slope $\beta_k = \partial\,\E(y)/\partial x_k$ is a **partial** (*ceteris
paribus*) effect, and the regression function is a plane.
The assumptions MR1–MR4 and MR6 carry over from the simple model; the new one is
**MR5**, no perfect collinearity. Together MR1–MR5 make OLS **BLUE**. For Big
Andy's Burger Barn,
$$
\widehat{\text{SALES}} = 118.9 - 7.91\,\text{PRICE} + 1.86\,\text{ADVERT},
$$
demand is price-elastic, $\hat\sigma^2 = \mathrm{SSE}/(N - K) = 23.87$, and
$R^2 = 0.448$.
**Next time:** how reliable are these slopes? In [variance and
collinearity](14-mr-variance-collinearity.qmd) we build the variance–covariance
matrix, see what drives the standard errors, and meet the regression headache of
*collinearity* — when regressors move together.