---
title: "The Simple Linear Regression Model"
---
{{< include _setup.qmd >}}
> **Reading.** Hill, Griffiths & Lim (5th ed.), §2.1–2.2; Stock & Watson (4th ed.), §4.1, 4.4.
The last four chapters built a probability toolkit. The very last idea — from the
bivariate Normal — was that a conditional mean can be a straight line:
$$
\E(Y \given X = x) = \alpha + \beta x,
\qquad
\beta = \frac{\Cov(X,Y)}{\Var(X)} .
$$
Starting now, that line becomes the object of the whole course: the **simple
linear regression model**. This chapter writes down the model
$y = \beta_1 + \beta_2 x + e$ and interprets each of its pieces, carefully
separates the three things that people loosely call "beta"
([parameters]{.term}, estimators, and estimates), and states the assumptions
(SR1–SR6) that make the whole apparatus work.
Recall the very first lecture, where we scatter-plotted weekly food expenditure
against income and eyeballed an upward-sloping cloud of points. Here we write
down the model behind that cloud; in the [next chapter](06-ols-estimation.qmd) we
fit the line.
## From an economic idea to a model {#sec-idea-to-model}
The running example throughout this part of the course comes from Hill,
Griffiths & Lim: **how does a household's weekly *food expenditure* $y$ depend on
its weekly *income* $x$?**
Even among households with the *same* income, food spending varies with tastes,
household size, restaurant habits, and impulse buys. So at each income $x$, the outcome $y$
is not a single number but has a whole **conditional distribution**
$f(y \given x)$. Economic theory does not pin down every household; it speaks to
the **center** of that distribution — the [conditional mean]{.term}
$\E(y \given x)$ — which we expect to rise with income. @fig-cond-dist shows the
picture: at two incomes $x_1$ and $x_2$ there is a spread of possible outcomes,
each spread centered on a point that lies on the population regression line.
```{r}
#| label: fig-cond-dist
#| fig-cap: "At each income there is a conditional distribution of food expenditure, centered on the population regression line $\\E(y \\mid x) = \\beta_1 + \\beta_2 x$."
#| fig-width: 5
#| fig-height: 3.4
# population regression line used for illustration: E(y | x) = 83 + 10 x
line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))
# two vertical conditional densities (bells opening to the right)
bell <- function(x0, y0, scale = 6, span = 22) {
  t <- seq(-2.6, 2.6, length.out = 60)
  data.frame(x = x0 + scale * exp(-(t^2) / 2), y = y0 + span * t)
}
# one bell at each income, centered on the line
b1 <- bell(8, 163)
b2 <- bell(20, 283)
means <- data.frame(x = c(8, 20), y = c(163, 283))
ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_path(data = b1, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_path(data = b2, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_point(data = means, aes(x, y), color = ucla$darkblue, size = 1.8) +
  # labels for the two conditional means and for the line itself
  annotate("text", x = 8, y = 120, label = "mu[y*'|'*x[1]]",
           parse = TRUE, color = ucla$darkblue, size = 3) +
  annotate("text", x = 20, y = 240, label = "mu[y*'|'*x[2]]",
           parse = TRUE, color = ucla$darkblue, size = 3) +
  annotate("text", x = 22, y = 360,
           label = "E(y*'|'*x) == beta[1] + beta[2]*x",
           parse = TRUE, color = ucla$blue, size = 3) +
  scale_x_continuous(breaks = c(8, 20),
                     labels = c(expression(x[1]), expression(x[2]))) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "income x", y = "food exp. y")
```
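
A small simulation can make the picture concrete. The sketch below assumes, purely for illustration, a population line with intercept 83 and slope 10 (the values used to draw the figure) and Normal errors with a standard deviation of 40. None of these numbers are estimates from real data.

```r
set.seed(1)

# Hypothetical population values, chosen only for illustration
beta1 <- 83   # intercept
beta2 <- 10   # slope
sigma <- 40   # spread of food spending around the line (assumed)

# Draw many households at each of two income levels
n_per_level <- 5000
income <- rep(c(8, 20), each = n_per_level)
food   <- beta1 + beta2 * income + rnorm(length(income), mean = 0, sd = sigma)

# Average spending at each income level, next to the height of the line
cond_means  <- tapply(food, income, mean)
line_height <- beta1 + beta2 * c(8, 20)
round(cbind(simulated_mean = cond_means, line = line_height), 1)
```

Individual households scatter widely at each income, but the two simulated averages should land close to the line, which is all the model claims about the center of each conditional distribution.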
### From a rule to a model
Imagine first a made-up *deterministic* rule: a household spends \$80 plus 10
cents of each dollar of income on food,
$$
y = 80 + 0.10\,x .
$$
Under this rule a \$100 rise in income raises spending by exactly \$10. The
number $0.10$ — the **marginal propensity to spend on food** — is the slope, and
it is precisely the "how much" quantity a decision-maker cares about.
But reality is not deterministic. Countless other factors move food spending. We
collect all of them into a single [random error]{.term} $e$, and we replace the
fixed numbers $80$ and $0.10$ by *unknown* parameters $\beta_1$ and $\beta_2$,
because in practice we do not know their values:
$$
y = \beta_1 + \beta_2 x + e .
$$
::: {.keyidea title="Systematic part + random error"}
This is the same "systematic part $+$ random error" template introduced in the
[first chapter](01-introduction.qmd) — now specialized to *one* explanatory
variable, with the two pieces of the systematic part given names, $\beta_1$ and
$\beta_2$.
:::
## The simple linear regression model {#sec-the-model}
We can now state the model that organizes the rest of the course.
::: {.definition title="The simple linear regression model"}
For each observation $i = 1,\dots,N$,
$$
y_i = \beta_1 + \beta_2 x_i + e_i .
$$
:::
Each symbol has a name. On the left, $y_i$ is the [dependent]{.term} variable —
also called the regressand or the "left-hand side" variable. On the right, $x_i$
is the [independent]{.term} or explanatory variable, also called the regressor,
and $e_i$ is the [random error]{.term}, standing in for everything else that
affects $y$. The two unknowns $\beta_1$ and $\beta_2$ are the [intercept]{.term}
and [slope]{.term} parameters; both are fixed, **unknown population parameters**
— there is one true value of each, out in the population, that we are trying to
learn.
::: {.callout-note appearance="simple"}
"Simple" means *one* regressor — not that the model is easy. Everything we do
here generalizes to many regressors when we reach [multiple
regression](13-multiple-regression.qmd).
:::
### The regression function and the systematic/random split
Suppose — as we will formally assume in a moment — that the errors average to
zero at each value of $x$. Then taking the conditional mean of
$y_i = \beta_1 + \beta_2 x_i + e_i$ leaves only the systematic part, giving the
[population regression function]{.term}
$$
\E(y \given x) = \beta_1 + \beta_2 x .
$$
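
Spelled out, the step uses only the model equation and the zero-conditional-mean condition; the parameters and $x_i$ are treated as fixed once we condition on $x_i$:

$$
\E(y_i \given x_i)
  = \E(\beta_1 + \beta_2 x_i + e_i \given x_i)
  = \beta_1 + \beta_2 x_i + \E(e_i \given x_i)
  = \beta_1 + \beta_2 x_i .
$$
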
Every observation therefore splits cleanly into two pieces,
$$
y_i = \underbrace{\E(y_i \given x_i)}_{\text{systematic}}
\;+\; \underbrace{e_i}_{\text{random}} .
$$
The line is the *average* behavior of food expenditure at each income; the error
$e_i$ is the $i$th household's departure from that average — the vertical gap
between its point and the line, as in @fig-error-split.
```{r}
#| label: fig-error-split
#| fig-cap: "Each observation is the regression line (systematic part) plus an error $e_i$, the vertical gap from the point to the line."
#| fig-width: 5
#| fig-height: 3.4
# population regression line used for illustration: E(y | x) = 83 + 10 x
line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))
# illustrative scatter of households around the line
pts <- data.frame(
  x = c(5, 8, 11, 14, 17, 20, 23, 26, 9, 22),
  y = c(140, 150, 165, 255, 235, 300, 300, 360, 210, 360)
)
# the highlighted household and the height of the line at its income
hi <- data.frame(x = 14, y = 255, yline = 83 + 10 * 14)
ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_point(data = pts, aes(x, y), color = ucla$darkblue, size = 1.6) +
  # dashed segment: the error e_i, the vertical gap from the line to the point
  geom_segment(data = hi, aes(x = x, xend = x, y = yline, yend = y),
               linetype = "dashed", color = ucla$red) +
  geom_point(data = hi, aes(x, y), color = ucla$red, size = 2.2) +
  annotate("text", x = 14.8, y = 240, label = "e[i]", parse = TRUE,
           color = ucla$red, size = 3.4) +
  annotate("text", x = 23, y = 250, label = "E(y*'|'*x)", parse = TRUE,
           color = ucla$blue, size = 3) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "income x", y = "food exp. y")
```
### Interpreting the slope
The slope is the [marginal effect]{.term} of $x$ on the *average* of $y$:
$$
\beta_2 = \frac{\Delta\,\E(y \given x)}{\Delta x}
= \frac{d\,\E(y \given x)}{dx} .
$$
Holding "everything else" fixed — that is, $\Delta e = 0$ — a change $\Delta x$
moves average spending by $\beta_2 \, \Delta x$. This is the *ceteris paribus*
interpretation. In the food example, if income rises by \$100 then average food
expenditure rises by $\beta_2 \times \$100$; that single number is exactly what a
decision-maker wants to know.
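
The arithmetic is easy to check. The sketch below reuses the made-up rule from earlier in the chapter (\$80 plus 10 cents per dollar of income) as stand-in parameter values; the starting incomes of \$500 and \$1000 are hypothetical.

```r
# Made-up rule from earlier in the chapter: $80 plus 10 cents per dollar of income
beta1 <- 80
beta2 <- 0.10

# Average food spending at a given weekly income (in dollars)
mean_food <- function(income) beta1 + beta2 * income

# Raise income by $100 from two different starting points
mean_food(600) - mean_food(500)     # change starting from $500
mean_food(1100) - mean_food(1000)   # change starting from $1000
```

Both differences equal $\beta_2 \times 100$: on a straight line the marginal effect is the same at every income, and the intercept drops out of the comparison.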
::: {.keyidea title="The intercept $\\beta_1 = \\E(y \\mid x = 0)$"}
The intercept is the average of $y$ when $x = 0$. Sometimes this is meaningful,
often it is not. In a regression of test scores on class size, $\beta_1$ would be
the predicted score for a class of *zero* students — nonsense. In such cases
$\beta_1$ is best read as just the height that pins the line in place, not as a
quantity to interpret on its own.
:::
## Parameters, estimators, estimates {#sec-three-betas}
Keeping three closely related objects straight is the central conceptual hurdle
of the course. People sloppily call all three "beta," but they are different
kinds of thing.
::: {.definition title="Parameter, estimator, estimate"}
- A **parameter** ($\beta_1, \beta_2$) is a fixed, *unknown* feature of the
population. There is one true value; it is *not* random.
- An **estimator** ($b_1, b_2$) is a *formula* applied to a sample. Because the
sample is random, the estimator is itself a **random variable** — it has a
sampling distribution.
- An **estimate** (e.g. $b_1 = 83.4$) is the *number* the estimator produces in
*one* particular sample. It is just a number — not random.
:::
::: {.keyidea title="The connection to the sample mean"}
The estimator $b_2$ is to the parameter $\beta_2$ exactly as the sample mean
$\bar Y$ is to the population mean $\mu$: a random variable that varies from
sample to sample, with a center and a spread we can study. That is precisely how
we will judge it when we turn to the [properties of
OLS](07-ols-properties.qmd) and the [variance of the
estimators](08-variance-prediction.qmd).
:::
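
One way to see the difference between an estimator and an estimate is to draw many samples from a known population and fit a line to each. The sketch below assumes hypothetical population values (intercept 83, slope 10, error standard deviation 40) and holds the incomes fixed across samples; it uses R's built-in `lm()` to compute the slope, ahead of the formulas in the next chapter.

```r
set.seed(42)

# Hypothetical population, for illustration only
beta1 <- 83; beta2 <- 10; sigma <- 40
N <- 40                                  # households per sample
x <- seq(2, 28, length.out = N)          # incomes, held fixed across samples

# One sample in, one slope estimate out
draw_b2 <- function() {
  y <- beta1 + beta2 * x + rnorm(N, sd = sigma)
  unname(coef(lm(y ~ x))[2])
}

b2_draws <- replicate(2000, draw_b2())   # the estimator applied to 2000 samples

b2_draws[1]      # an estimate: one number from one sample
mean(b2_draws)   # center of the sampling distribution
sd(b2_draws)     # spread of the sampling distribution
```

The parameter `beta2` never changes, each element of `b2_draws` is a fixed number, and the randomness lives in the procedure that maps a fresh sample to a number.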
### Error versus residual
A closely related distinction trips up nearly everyone, because it hinges on the
same parameter-versus-estimate divide. The **random error** $e_i$ is a population
object,
$$
e_i = y_i - (\beta_1 + \beta_2 x_i) = y_i - \E(y_i \given x_i) ,
$$
defined using the *true* parameters $\beta_1, \beta_2$. Because we never know
those parameters, the error is **unobservable**. The **residual** $\hat e_i$ is
the sample analog,
$$
\hat e_i = y_i - (b_1 + b_2 x_i) = y_i - \hat y_i ,
$$
defined using the *estimated* line. The residual is therefore **observable** — we
can compute it as soon as we have fit the line in the next chapter.
::: {.keyidea title="The parallel"}
The error $e_i$ is to $\beta$ as the residual $\hat e_i$ is to $b$. The residual
is our visible *stand-in* for the invisible error, and minimizing the sum of
squared residuals is exactly how [OLS](06-ols-estimation.qmd) chooses the line.
:::
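
In a simulation we can do what is impossible with real data: look at the errors directly and set them beside the residuals. The sketch below assumes hypothetical true parameters (83 and 10) and Normal errors, and uses `lm()` to fit the line.

```r
set.seed(7)

beta1 <- 83; beta2 <- 10        # true parameters (hypothetical; unknown in practice)
N <- 40
x <- runif(N, 2, 28)
e <- rnorm(N, sd = 40)          # errors: visible here only because we simulated them
y <- beta1 + beta2 * x + e

fit  <- lm(y ~ x)
ehat <- resid(fit)              # residuals: computable from the data alone

# Residuals track the errors closely but are not identical to them
head(cbind(error = e, residual = ehat))
cor(e, ehat)
```

The two columns differ because the fitted line differs from the true line; the gap between them shrinks as the estimates get closer to the parameters.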
## The assumptions: SR1–SR6 {#sec-assumptions}
A model is only as trustworthy as the conditions behind it. The simple regression
assumptions [SR1–SR6]{.term} ("SR" for *simple regression*) are the conditions
under which two things hold: the slope $\beta_2$ measures a genuinely **causal**
marginal effect, and the estimators $b_1, b_2$ are well behaved — unbiased, with
a known sampling distribution we can use for inference. Much of the rest of
econometrics is about what to do *when* one of these assumptions fails, so it
pays to know exactly what we are assuming, and which assumption each later
technique is designed to rescue. We meet them one at a time and then collect
them.
### SR1 and SR2: the model and strict exogeneity
::: {.property title="SR1 — the model holds in the population"}
$$
y_i = \beta_1 + \beta_2 x_i + e_i \qquad \text{for all } i = 1,\dots,N .
$$
:::
::: {.property title="SR2 — strict exogeneity (the crucial one)"}
The error has conditional mean zero given the regressor(s):
$$
\E(e_i \given x) = 0 .
$$
:::
SR2 says that knowing $x$ tells you **nothing** about the average error: the
omitted factors balance out to zero at every value of $x$. It is the assumption
that does the heavy lifting, because it delivers two consequences at once,
$$
\E(e_i \given x) = 0
\;\Longrightarrow\;
\E(e_i) = 0
\quad\text{and}\quad
\Cov(e_i, x_i) = 0 ,
$$
and from it follows the regression function
$\E(y_i \given x) = \beta_1 + \beta_2 x_i$ that we used above.
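
Both consequences follow from the law of iterated expectations. As a short sketch of the argument:

$$
\E(e_i) = \E\big[\E(e_i \given x)\big] = \E(0) = 0 ,
\qquad
\Cov(e_i, x_i) = \E(x_i e_i) - \E(x_i)\,\E(e_i)
  = \E\big[x_i\,\E(e_i \given x)\big] - 0 = 0 .
$$
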
The covariance consequence is what separates good cases from bad ones. If
$\Cov(e, x) = 0$, the regressor $x$ is [exogenous]{.term}: regression can recover
$\beta_1, \beta_2$, and $\beta_2$ is the causal marginal effect. If instead
$\Cov(e, x) \neq 0$, then $x$ is [endogenous]{.term}: SR2 fails, the fitted slope
mixes the effect of $x$ with the effect of the omitted factors in $e$, and it is
**not** causal. This is the formal version of the slogan "correlation $\neq$
causation" from the [first chapter](01-introduction.qmd).
::: {.example title="Wages and education (HGL)"}
Consider $\text{WAGE}_i = \beta_1 + \beta_2\,\text{EDUC}_i + e_i$. The error $e$
holds factors like *ability, drive, intelligence* — all plausibly **correlated**
with education. Then $\E(e \given \text{EDUC}) \neq 0$, education is endogenous,
and $b_2$ confounds the true return to schooling with the effect of ability. (We
tackle problems of this kind much later in the course.)
:::
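
A simulated version of this example shows the problem in numbers. Everything in the sketch below is hypothetical: the variable `ability`, the true return of 1.0 per year of education, and the strength of the link between ability and schooling are invented for illustration and are not HGL's data.

```r
set.seed(123)

N <- 5000
ability <- rnorm(N)                       # unobserved by the econometrician
educ    <- 12 + 2 * ability + rnorm(N)    # more able people get more schooling (assumed)
true_return <- 1.0                        # hypothetical causal effect of a year of education
wage    <- 5 + true_return * educ + 3 * ability + rnorm(N)

# Regressing wage on education alone pushes ability into the error term
b2 <- unname(coef(lm(wage ~ educ))[2])
b2                   # well above the true return of 1.0
cor(educ, ability)   # the source of the problem: regressor correlated with the error
```

Collecting more data does not help here: the estimate settles down, but on the wrong number, because education is picking up the effect of ability.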
### SR3 and SR4: spread and dependence of the errors
::: {.property title="SR3 — homoskedasticity"}
The error has *constant* conditional variance,
$$
\Var(e_i \given x) = \sigma^2 .
$$
The spread of $y$ about the line is the same at *every* $x$. If the variance
changes with $x$, the errors are [heteroskedastic]{.term}.
:::
::: {.property title="SR4 — uncorrelated errors"}
$$
\Cov(e_i, e_j \given x) = 0, \qquad i \neq j .
$$
One observation's error carries no information about another's. This typically
fails with clustered or time-series data.
:::
Homoskedasticity is easiest to see in a picture. @fig-homosked redraws the
conditional-distribution diagram with the two bells given the *same* width — that
equal width *is* SR3.
```{r}
#| label: fig-homosked
#| fig-cap: "SR3 (homoskedasticity): the conditional distribution of $y$ has the same spread at every $x$ — the two bells are equally wide."
#| fig-width: 5
#| fig-height: 3.4
# population regression line used for illustration: E(y | x) = 83 + 10 x
line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))
# vertical conditional density; the same span at both incomes is SR3
bell <- function(x0, y0, scale = 6, span = 20) {
  t <- seq(-2.6, 2.6, length.out = 60)
  data.frame(x = x0 + scale * exp(-(t^2) / 2), y = y0 + span * t)
}
b1 <- bell(8, 163)
b2 <- bell(20, 283)
ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_path(data = b1, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_path(data = b2, aes(x, y), color = ucla$red, linewidth = 0.8) +
  scale_x_continuous(breaks = c(8, 20),
                     labels = c(expression(x[1]), expression(x[2]))) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "x", y = "y")
```
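
The same contrast can be checked numerically. The sketch below simulates one set of errors that satisfies SR3 and one that violates it; the heteroskedastic form, a standard deviation of `4 * x`, is an assumption made only for illustration.

```r
set.seed(2024)

N <- 4000
x <- runif(N, 2, 28)

e_homo   <- rnorm(N, sd = 40)        # SR3 holds: same spread at every x
e_hetero <- rnorm(N, sd = 4 * x)     # SR3 fails: spread grows with x (assumed form)

low <- x < 15                        # split the sample into low and high incomes
spread <- rbind(
  homoskedastic   = c(low_x = sd(e_homo[low]),   high_x = sd(e_homo[!low])),
  heteroskedastic = c(low_x = sd(e_hetero[low]), high_x = sd(e_hetero[!low]))
)
round(spread, 1)
```

In the first row the two standard deviations should be about equal; in the second, the high-income group is visibly more spread out.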
### SR5 and SR6: variation in $x$, and (optional) normality
::: {.property title="SR5 — the regressor must vary"}
In the sample, $x_i$ takes **at least two different values**. As the old saw goes,
"it takes two points to determine a line": with no variation in $x$ there is no
slope to estimate.
:::
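
R makes the failure of SR5 easy to see. In the sketch below every simulated household has the same income (the population values 83, 10, and 40 are hypothetical), and `lm()` cannot separate a slope from the intercept.

```r
set.seed(5)

N <- 20
x_const <- rep(10, N)                       # every household has the same income
y <- 83 + 10 * x_const + rnorm(N, sd = 40)  # hypothetical population values

coef(lm(y ~ x_const))   # the slope is reported as NA: it cannot be estimated
var(x_const)            # zero variation in x
```
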
::: {.property title="SR6 — normality of errors (optional)"}
$$
e_i \given x \sim N(0, \sigma^2)
\quad\Longleftrightarrow\quad
y_i \given x \sim N(\beta_1 + \beta_2 x_i,\ \sigma^2) .
$$
:::
SR6 is *not* needed to compute the estimators or for them to be unbiased. Its role is to make
**small-sample** inference exact, as we will see when we build [confidence
intervals](09-confidence-intervals.qmd). It is also plausible: by the Central
Limit Theorem from the [Normal chapter](04-normal-clt.qmd), an error that sums up
many small independent factors tends toward a Normal distribution.
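
A quick simulation illustrates the Central Limit Theorem argument. The sketch below assumes each household's error is the sum of 50 independent factors, each uniform on $[-1, 1]$; both the number of factors and their distribution are invented for illustration.

```r
set.seed(99)

n_factors <- 50       # number of small omitted influences (assumed)
n_house   <- 10000    # number of simulated households

# Each factor is uniform on [-1, 1]: clearly not Normal on its own
factors <- matrix(runif(n_house * n_factors, -1, 1), nrow = n_house)
e <- rowSums(factors)

# Normal benchmark: about 95% of draws lie within 1.96 standard deviations
mean(abs(e - mean(e)) <= 1.96 * sd(e))
```

If the reported share is near 0.95, the summed error behaves much like a Normal variable even though none of its ingredients is Normal.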
### The six at a glance
It helps to see all six in one place.
| Assumption | Statement |
|:--|:--|
| **SR1** | $y_i = \beta_1 + \beta_2 x_i + e_i$ |
| **SR2** | $\E(e_i \given x) = 0$ (strict exogeneity) |
| **SR3** | $\Var(e_i \given x) = \sigma^2$ (homoskedastic) |
| **SR4** | $\Cov(e_i, e_j \given x) = 0,\ i \neq j$ |
| **SR5** | $x_i$ takes $\ge 2$ values |
| **SR6** | $e_i \given x \sim N(0, \sigma^2)$ (optional) |
: The simple regression assumptions SR1–SR6. {.striped}
::: {.callout-note appearance="simple"}
**The same idea in Stock & Watson.** S&W write the model as
$Y_i = \beta_0 + \beta_1 X_i + u_i$ and list three assumptions:
(1) $\E(u_i \given X_i) = 0$, the counterpart of SR2; (2) the pairs $(X_i, Y_i)$
are i.i.d.; and (3) large outliers are unlikely (finite fourth moments). S&W drop
homoskedasticity — they use robust standard errors throughout — and add the
outlier condition. We follow HGL's SR1–SR6.
:::
## Recap {#sec-recap}
The **simple linear regression model** is $y_i = \beta_1 + \beta_2 x_i + e_i$,
with population regression function $\E(y \given x) = \beta_1 + \beta_2 x$. Every
observation is the systematic part plus a random error, and the slope
$\beta_2 = \Delta\,\E(y \given x) / \Delta x$ is the marginal effect of $x$ on
the average of $y$.
Keep the three "betas" distinct: a **parameter** $\beta$ (fixed) is estimated by
an **estimator** $b$ (random), which yields an **estimate** (a number); likewise
the unobserved **error** $e$ has the computable **residual** $\hat e$ as its
sample stand-in.
The assumptions **SR1–SR6** are the conditions under which this all works: SR1
the model; SR2 exogeneity $\E(e \given x) = 0$ (exogenous $\Rightarrow$ causal,
otherwise endogenous); SR3 homoskedasticity; SR4 uncorrelated errors; SR5
variation in $x$; and SR6 (optional) normality.
**Next time:** we have the model and the assumptions, but not the line. In the
next chapter we choose $b_1, b_2$ to [minimize the sum of squared
residuals](06-ols-estimation.qmd), the method of ordinary least squares, and find
that the slope is $b_2 = \Cov(x, y) / \Var(x)$, computed from the sample
covariance and sample variance.