\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

5  The Simple Linear Regression Model

Reading. Hill, Griffiths & Lim (5th ed.), 2.1–2.2; Stock & Watson (4th ed.), 4.1, 4.4.

The last four chapters built a probability toolkit. The very last idea, from the bivariate Normal, was that a conditional mean can be a straight line: \[ \E(Y \given X = x) = \alpha + \beta x, \qquad \beta = \frac{\Cov(X,Y)}{\Var(X)} . \] Starting now, that line becomes the object of the whole course: the simple linear regression model. This chapter writes down the model \(y = \beta_1 + \beta_2 x + e\) and interprets each of its pieces, carefully separates the three things people sloppily all call “beta” (parameters, estimators, and estimates), and states the assumptions (SR1–SR6) that make the whole apparatus work.

Recall the very first lecture, where we scatter-plotted weekly food expenditure against income and eyeballed an upward-sloping cloud of points. Here we write down the model behind that cloud; in the next chapter we fit the line.

5.1 From an economic idea to a model

The running example throughout this part of the course comes from Hill, Griffiths & Lim: how does a household’s weekly food expenditure \(y\) depend on its weekly income \(x\)?

Even among households with the same income, food spending varies: tastes, household size, restaurants, impulse buys. So at each income \(x\), the outcome \(y\) is not a single number but has a whole conditional distribution \(f(y \given x)\). Economic theory does not pin down every household; it speaks to the center of that distribution, the conditional mean \(\E(y \given x)\), which we expect to rise with income. Figure 5.1 shows the picture: at two incomes \(x_1\) and \(x_2\) there is a spread of possible outcomes, each spread centered on a point that lies on the population regression line.

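library(ggplot2)
# `ucla` is assumed to be a named list of colours (blue, red, darkblue) defined in the book's setup code
# population regression line, drawn between x = 2 and x = 28
line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))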

# two vertical conditional densities (bells opening to the right)
bell <- function(x0, y0, scale = 6, span = 22) {
  t <- seq(-2.6, 2.6, length.out = 60)
  data.frame(x = x0 + scale * exp(-(t^2) / 2), y = y0 + span * t)
}
b1 <- bell(8, 163); b2 <- bell(20, 283)
means <- data.frame(x = c(8, 20), y = c(163, 283))

ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_path(data = b1, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_path(data = b2, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_point(data = means, aes(x, y), color = ucla$darkblue, size = 1.8) +
  annotate("text", x = 8,  y = 120, label = "mu[y*'|'*x[1]]",
           parse = TRUE, color = ucla$darkblue, size = 3) +
  annotate("text", x = 20, y = 240, label = "mu[y*'|'*x[2]]",
           parse = TRUE, color = ucla$darkblue, size = 3) +
  annotate("text", x = 22, y = 360,
           label = "E(y*'|'*x) == beta[1] + beta[2]*x",
           parse = TRUE, color = ucla$blue, size = 3) +
  scale_x_continuous(breaks = c(8, 20), labels = c(expression(x[1]), expression(x[2]))) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "income x", y = "food exp. y")
Figure 5.1: At each income there is a conditional distribution of food expenditure, centered on the population regression line \(\E(y \mid x) = \beta_1 + \beta_2 x\).

From a rule to a model

Imagine first a made-up deterministic rule: a household spends $80 plus 10 cents of each dollar of income on food, \[ y = 80 + 0.10\,x . \] Under this rule a $100 rise in income raises spending by exactly $10. The number \(0.10\), the marginal propensity to spend on food, is the slope, and it is precisely the “how much” quantity a decision-maker cares about.

But reality is not deterministic. Countless other factors move food spending. We collect all of them into a single random error \(e\), and we replace the fixed numbers \(80\) and \(0.10\) by unknown parameters \(\beta_1\) and \(\beta_2\), because in practice we do not know their values: \[ y = \beta_1 + \beta_2 x + e . \]

Systematic part + random error

This is the same “systematic part \(+\) random error” template introduced in the first chapter, now specialized to one explanatory variable, with the two pieces of the systematic part given names, \(\beta_1\) and \(\beta_2\).
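A minimal simulation sketch of this template. It borrows the numbers from the made-up rule (80 and 0.10) as the "true" parameters; the Normal error with standard deviation 20, the three income levels, the sample size, and all variable names are illustrative assumptions, not values from the text.

```r
set.seed(1)                      # reproducible draws

beta1 <- 80                      # intercept from the made-up rule
beta2 <- 0.10                    # slope from the made-up rule
N     <- 10000                   # number of simulated households (assumed)

# Weekly income: each household sits at one of three assumed income levels
x <- sample(c(500, 1000, 1500), size = N, replace = TRUE)

# Random error: everything else that moves food spending (assumed Normal, sd 20)
e <- rnorm(N, mean = 0, sd = 20)

# The model: systematic part + random error
y <- beta1 + beta2 * x + e

# Spending varies within each income level, but its average tracks the line
sim_means  <- as.vector(tapply(y, x, mean))
line_means <- beta1 + beta2 * c(500, 1000, 1500)
round(cbind(simulated = sim_means, on_the_line = line_means), 1)
```

Households at the same income still differ in `y`, yet the group averages land close to the systematic part, which is the sense in which the model speaks to the center of each conditional distribution.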

5.2 The simple linear regression model

We can now state the model that organizes the rest of the course.

The simple linear regression model

For each observation \(i = 1,\dots,N\), \[ y_i = \beta_1 + \beta_2 x_i + e_i . \]

Each symbol has a name. On the left, \(y_i\) is the dependent variable, also called the regressand or the “left-hand side” variable. On the right, \(x_i\) is the independent or explanatory variable, also called the regressor, and \(e_i\) is the random error, standing in for everything else that affects \(y\). The two unknowns \(\beta_1\) and \(\beta_2\) are the intercept and slope parameters; both are fixed, unknown population parameters: there is one true value of each, out in the population, that we are trying to learn.

“Simple” means one regressor, not that the model is easy. Everything we do here generalizes to many regressors when we reach multiple regression.

The regression function and the systematic/random split

Suppose (as we will formally assume in a moment) that the errors average to zero at each value of \(x\). Then taking the conditional mean of \(y_i = \beta_1 + \beta_2 x_i + e_i\) leaves only the systematic part, giving the population regression function \[ \E(y \given x) = \beta_1 + \beta_2 x . \] Every observation therefore splits cleanly into two pieces, \[ y_i = \underbrace{\E(y_i \given x_i)}_{\text{systematic}} \;+\; \underbrace{e_i}_{\text{random}} . \] The line is the average behavior of food expenditure at each income; the error \(e_i\) is the \(i\)th household’s departure from that average: the vertical gap between its point and the line, as in Figure 5.2.

line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))
pts <- data.frame(
  x = c(5, 8, 11, 14, 17, 20, 23, 26, 9, 22),
  y = c(140, 150, 165, 255, 235, 300, 300, 360, 210, 360)
)
hi <- data.frame(x = 14, y = 255, yline = 83 + 10 * 14)

ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_point(data = pts, aes(x, y), color = ucla$darkblue, size = 1.6) +
  geom_segment(data = hi, aes(x = x, xend = x, y = yline, yend = y),
               linetype = "dashed", color = ucla$red) +
  geom_point(data = hi, aes(x, y), color = ucla$red, size = 2.2) +
  annotate("text", x = 14.8, y = 240, label = "e[i]", parse = TRUE,
           color = ucla$red, size = 3.4) +
  annotate("text", x = 23, y = 250, label = "E(y*'|'*x)", parse = TRUE,
           color = ucla$blue, size = 3) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "income x", y = "food exp. y")
Figure 5.2: Each observation is the regression line (systematic part) plus an error \(e_i\), the vertical gap from the point to the line.
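
Spelled out, the step from the model to the population regression function uses only the linearity of conditional expectation (given \(x_i\), the terms \(\beta_1\) and \(\beta_2 x_i\) are constants) and the zero-conditional-mean condition on the error:

\[
\begin{aligned}
\E(y_i \given x_i) &= \E(\beta_1 + \beta_2 x_i + e_i \given x_i) \\
&= \beta_1 + \beta_2 x_i + \E(e_i \given x_i) \\
&= \beta_1 + \beta_2 x_i .
\end{aligned}
\]

Subtracting this from \(y_i\) leaves exactly \(e_i\), which is why the systematic/random split is clean.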

Interpreting the slope

The slope is the marginal effect of \(x\) on the average of \(y\): \[ \beta_2 = \frac{\Delta\,\E(y \given x)}{\Delta x} = \frac{d\,\E(y \given x)}{dx} . \] Holding “everything else” fixed (that is, \(\Delta e = 0\)), a change \(\Delta x\) moves average spending by \(\beta_2 \, \Delta x\). This is the ceteris paribus interpretation. In the food example, if income rises by $100 then average food expenditure rises by \(\beta_2 \times \$100\); that single number is exactly what a decision-maker wants to know.
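
A small arithmetic sketch of this interpretation. It reuses the made-up values 80 and 0.10 from the deterministic rule as stand-ins for the unknown parameters; the helper name `pop_mean` and the starting incomes are hypothetical.

```r
beta1 <- 80      # stand-in intercept (from the made-up rule, not an estimate)
beta2 <- 0.10    # stand-in slope: dollars of food spending per dollar of income

# Population regression function E(y | x) = beta1 + beta2 * x
pop_mean <- function(x) beta1 + beta2 * x

# Raise income by $100 from three different starting points
x0      <- c(300, 1000, 2500)
delta_x <- 100
change  <- pop_mean(x0 + delta_x) - pop_mean(x0)

change               # the same at every starting income
beta2 * delta_x      # and equal to beta2 times the change in x
```

Because the regression function is linear, the marginal effect does not depend on where income starts.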

The intercept \(\beta_1 = \E(y \given x = 0)\)

The intercept is the average of \(y\) when \(x = 0\). Sometimes this is meaningful, often it is not. In a regression of test scores on class size, \(\beta_1\) would be the predicted score for a class of zero students, which is nonsense. In such cases \(\beta_1\) is best read as just the height that pins the line in place, not as a quantity to interpret on its own.

5.3 Parameters, estimators, estimates

Keeping three closely related objects straight is the central conceptual hurdle of the course. People sloppily call all three “beta,” but they are different kinds of thing.

Parameter, estimator, estimate
  • A parameter (\(\beta_1, \beta_2\)) is a fixed, unknown feature of the population. There is one true value; it is not random.
  • An estimator (\(b_1, b_2\)) is a formula applied to a sample. Because the sample is random, the estimator is itself a random variable: it has a sampling distribution.
  • An estimate (e.g. \(b_1 = 83.4\)) is the number the estimator produces in one particular sample. It is just a number, not random.

The connection to the sample mean

The estimator \(b_2\) is to the parameter \(\beta_2\) exactly as the sample mean \(\bar Y\) is to the population mean \(\mu\): a random variable that varies from sample to sample, with a center and a spread we can study. That is precisely how we will judge it when we turn to the properties of OLS and the variance of the estimators.
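
One way to see the three objects side by side is a small Monte Carlo sketch. In a simulation, unlike in practice, the parameter is known because we choose it. The values below (intercept 80, slope 0.10, Normal errors with standard deviation 20, samples of 40 households, incomes between 200 and 2000) are assumptions made for illustration, and R's built-in `lm()` stands in for the estimator formula that the next chapter derives.

```r
set.seed(42)

beta1 <- 80; beta2 <- 0.10        # parameters: fixed, known only because we simulate
N     <- 40                       # households per sample (assumed)

# Estimator: a rule that maps any sample to a slope
slope_estimator <- function(x, y) unname(coef(lm(y ~ x))[2])

# Draw one random sample and return the slope estimate it produces
one_sample_slope <- function() {
  x <- runif(N, min = 200, max = 2000)        # assumed income range
  y <- beta1 + beta2 * x + rnorm(N, sd = 20)  # assumed error distribution
  slope_estimator(x, y)
}

estimate  <- one_sample_slope()               # estimate: one number from one sample
estimates <- replicate(2000, one_sample_slope())

estimate          # a single estimate, near beta2 but not equal to it
mean(estimates)   # center of the estimator's sampling distribution
sd(estimates)     # its spread from sample to sample
```

The vector `estimates` plays the same role for \(b_2\) that repeated draws of \(\bar Y\) play for the sample mean: a histogram of it approximates the sampling distribution.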

Error versus residual

A closely related distinction trips up nearly everyone, because it hinges on the same parameter-versus-estimate divide. The random error \(e_i\) is a population object, \[ e_i = y_i - (\beta_1 + \beta_2 x_i) = y_i - \E(y_i \given x_i) , \] defined using the true parameters \(\beta_1, \beta_2\). Because we never know those parameters, the error is unobservable. The residual \(\hat e_i\) is the sample analog, \[ \hat e_i = y_i - (b_1 + b_2 x_i) = y_i - \hat y_i , \] defined using the estimated line. The residual is therefore observable: we can compute it as soon as we have fit the line in the next chapter.

The parallel

The error \(e_i\) is to \(\beta\) as the residual \(\hat e_i\) is to \(b\). The residual is our visible stand-in for the invisible error, and minimizing the sum of squared residuals is exactly how OLS chooses the line.
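
The distinction can be made concrete in a simulation, where (unlike with real data) the true errors are available for comparison. A hedged sketch under assumed values: parameters 80 and 0.10, Normal errors with standard deviation 20, and `lm()` used as a black box for the fitted line.

```r
set.seed(7)

beta1 <- 80; beta2 <- 0.10                 # true parameters (assumed for the simulation)
N <- 40
x <- runif(N, min = 200, max = 2000)       # assumed incomes
e <- rnorm(N, sd = 20)                     # true errors: visible only in a simulation
y <- beta1 + beta2 * x + e

fit   <- lm(y ~ x)                         # fitted line b1 + b2 * x
b     <- coef(fit)
e_hat <- y - (b[1] + b[2] * x)             # residuals, computed from the estimates

# Residuals track the errors but are not identical to them
head(round(cbind(error = e, residual = e_hat), 2))
cor(e, e_hat)

# With an intercept in the fit, the residuals sum to (numerically) zero;
# the true errors have no such constraint
sum(e_hat)
sum(e)
```

The residuals differ from the errors exactly because \(b_1, b_2\) differ from \(\beta_1, \beta_2\) in any one sample.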

5.4 The assumptions: SR1–SR6

A model is only as trustworthy as the conditions behind it. The simple regression assumptions SR1–SR6 (“SR” for simple regression) are the conditions under which two things hold: the slope \(\beta_2\) measures a genuinely causal marginal effect, and the estimators \(b_1, b_2\) are well behaved: unbiased, with a known sampling distribution we can use for inference. Much of the rest of econometrics is about what to do when one of these assumptions fails, so it pays to know exactly what we are assuming, and which assumption each later technique is designed to rescue. We meet them one at a time and then collect them.

SR1 and SR2: the model and strict exogeneity

SR1: the model holds in the population

\[ y_i = \beta_1 + \beta_2 x_i + e_i \qquad \text{for all } i = 1,\dots,N . \]

SR2: strict exogeneity (the crucial one)

The error has conditional mean zero given the regressor(s): \[ \E(e_i \given x) = 0 . \]

SR2 says that knowing \(x\) tells you nothing about the average error: the omitted factors balance out to zero at every value of \(x\). It is the assumption that does the heavy lifting, because it delivers two consequences at once, \[ \E(e_i \given x) = 0 \;\Longrightarrow\; \E(e_i) = 0 \quad\text{and}\quad \Cov(e_i, x_i) = 0 , \] and from it follows the regression function \(\E(y_i \given x) = \beta_1 + \beta_2 x_i\) that we used above.
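
Both consequences follow from the law of iterated expectations, \(\E(Z) = \E[\E(Z \given x)]\). A short derivation for reference:

\[
\begin{aligned}
\E(e_i) &= \E\big[\E(e_i \given x)\big] = \E(0) = 0 , \\
\Cov(e_i, x_i) &= \E(x_i e_i) - \E(x_i)\,\E(e_i) = \E\big[x_i\,\E(e_i \given x)\big] - \E(x_i)\cdot 0 = 0 .
\end{aligned}
\]

The implication runs one way only: a zero covariance does not by itself guarantee a zero conditional mean.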

The covariance consequence is what separates good cases from bad ones. If \(\Cov(e, x) = 0\), the regressor \(x\) is exogenous: regression can recover \(\beta_1, \beta_2\), and \(\beta_2\) is the causal marginal effect. If instead \(\Cov(e, x) \neq 0\), then \(x\) is endogenous, and \(\beta_2\) is not causal. This is the formal version of the slogan “correlation \(\neq\) causation” from the first chapter.

Wages and education (HGL)

Consider \(\text{WAGE}_i = \beta_1 + \beta_2\,\text{EDUC}_i + e_i\). The error \(e\) holds factors like ability, drive, intelligence, all plausibly correlated with education. Then \(\E(e \given \text{EDUC}) \neq 0\), education is endogenous, and \(b_2\) confounds the true return to schooling with the effect of ability. (We tackle problems of this kind much later in the course.)
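
A simulation sketch of this failure. Every number here is invented for illustration (a true return of 1.0 per year of schooling, an unobserved `ability` variable that raises both schooling and wages); none comes from HGL's data.

```r
set.seed(123)

N     <- 5000
beta1 <- 5        # assumed intercept
beta2 <- 1.0      # assumed true return to a year of education

ability <- rnorm(N)                          # unobserved by the researcher
educ    <- 12 + 2 * ability + rnorm(N)       # abler people get more schooling (assumed)
e       <- 3 * ability + rnorm(N)            # ability also sits inside the wage error
wage    <- beta1 + beta2 * educ + e

cov(e, educ)                                 # clearly nonzero: educ is endogenous
b2 <- unname(coef(lm(wage ~ educ))[2])
b2                                           # well above beta2: return plus ability effect

# Contrast: an error unrelated to educ satisfies SR2, and the slope is recovered
wage_exog <- beta1 + beta2 * educ + rnorm(N, sd = 3)
b2_exog   <- unname(coef(lm(wage_exog ~ educ))[2])
b2_exog
```

The only difference between the two regressions is whether the error is related to `educ`; that alone decides whether the fitted slope lands near the assumed true return.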

SR3 and SR4: spread and dependence of the errors

SR3: homoskedasticity

The error has constant conditional variance, \[ \Var(e_i \given x) = \sigma^2 . \] The spread of \(y\) about the line is the same at every \(x\). If the variance changes with \(x\), the errors are heteroskedastic.

SR4: uncorrelated errors

\[ \Cov(e_i, e_j \given x) = 0, \qquad i \neq j . \] One observation’s error carries no information about another’s. This typically fails with clustered or time-series data.

Homoskedasticity is easiest to see in a picture. Figure 5.3 redraws the conditional-distribution diagram with the two bells given the same width; that equal width is SR3.

line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))
bell <- function(x0, y0, scale = 6, span = 20) {
  t <- seq(-2.6, 2.6, length.out = 60)
  data.frame(x = x0 + scale * exp(-(t^2) / 2), y = y0 + span * t)
}
b1 <- bell(8, 163); b2 <- bell(20, 283)

ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_path(data = b1, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_path(data = b2, aes(x, y), color = ucla$red, linewidth = 0.8) +
  scale_x_continuous(breaks = c(8, 20), labels = c(expression(x[1]), expression(x[2]))) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "x", y = "y")
Figure 5.3: SR3 (homoskedasticity): the conditional distribution of \(y\) has the same spread at every \(x\); the two bells are equally wide.
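
A sketch contrasting SR3 with its failure. The error standard deviations (a constant 20 versus one that grows in proportion to income) and the income cutoff used to split the sample are illustrative assumptions.

```r
set.seed(2024)

N <- 20000
x <- runif(N, min = 200, max = 2000)            # assumed incomes

e_homo   <- rnorm(N, sd = 20)                   # SR3 holds: same spread at every x
e_hetero <- rnorm(N, sd = 0.02 * x)             # SR3 fails: spread grows with x

# Compare the spread of the errors for low- and high-income households
low <- x < 1100
sd_by_group <- rbind(
  homoskedastic   = c(low = sd(e_homo[low]),   high = sd(e_homo[!low])),
  heteroskedastic = c(low = sd(e_hetero[low]), high = sd(e_hetero[!low]))
)
round(sd_by_group, 1)
```

In the first row the two standard deviations are nearly equal, matching the equally wide bells in the figure; in the second row the high-income group is visibly more dispersed.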

SR5 and SR6: variation in \(x\), and (optional) normality

SR5: the regressor must vary

In the sample, \(x_i\) takes at least two different values. As the old saw goes, “it takes two points to determine a line”: with no variation in \(x\) there is no slope to estimate.
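
What happens when SR5 fails can be checked directly. In this sketch (all values assumed for illustration) every household has the same income, and R's `lm()` reports the slope as `NA` because it cannot be computed from such a sample.

```r
set.seed(5)

N <- 30
x_const <- rep(1000, N)                        # no variation in x: SR5 fails
y       <- 80 + 0.10 * x_const + rnorm(N, sd = 20)

var(x_const)                                   # zero sample variance in x
fit_const <- lm(y ~ x_const)
coef(fit_const)                                # slope is NA: it cannot be estimated

# With at least two distinct x values the slope can be estimated
x_var   <- rep(c(500, 1500), length.out = N)
y_var   <- 80 + 0.10 * x_var + rnorm(N, sd = 20)
fit_var <- lm(y_var ~ x_var)
coef(fit_var)
```

With a constant regressor the data form a single vertical stack of points, and any line through the stack's mean fits equally well.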

SR6: normality of errors (optional)

\[ e_i \given x \sim N(0, \sigma^2) \quad\Longleftrightarrow\quad y_i \given x \sim N(\beta_1 + \beta_2 x_i,\ \sigma^2) . \]

SR6 is not needed for the estimators to work. Its role is to make small-sample inference exact, as we will see when we build confidence intervals. It is also plausible: by the Central Limit Theorem from the Normal chapter, an error that sums up many small independent factors tends toward a Normal distribution.
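
The Central Limit Theorem argument can be illustrated with a sketch in which each error is the sum of many small independent shocks. The number of shocks (50) and their Uniform(-1, 1) distribution are assumptions chosen only for illustration.

```r
set.seed(99)

N        <- 20000    # simulated errors
n_shocks <- 50       # small independent factors per error (assumed)

# Each row holds one household's shocks; the error is their sum
shocks <- matrix(runif(N * n_shocks, min = -1, max = 1), nrow = N)
e      <- rowSums(shocks)

# Standardize and compare with Normal benchmarks
z <- (e - mean(e)) / sd(e)
share_within_1sd <- mean(abs(z) < 1)      # Normal benchmark: about 0.683
share_within_2sd <- mean(abs(z) < 2)      # Normal benchmark: about 0.954
c(share_within_1sd, share_within_2sd)
```

No individual shock is Normal, yet the shares of the summed error fall close to the Normal benchmarks.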

The six at a glance

It helps to see all six in one place.

The simple regression assumptions SR1–SR6.

Assumption   Statement
SR1          \(y_i = \beta_1 + \beta_2 x_i + e_i\)
SR2          \(\E(e_i \given x) = 0\) (strict exogeneity)
SR3          \(\Var(e_i \given x) = \sigma^2\) (homoskedastic)
SR4          \(\Cov(e_i, e_j \given x) = 0,\ i \neq j\)
SR5          \(x_i\) takes \(\ge 2\) values
SR6          \(e_i \given x \sim N(0, \sigma^2)\) (optional)

The same idea in Stock & Watson. S&W write the model as \(Y_i = \beta_0 + \beta_1 X_i + u_i\) and list three assumptions: (1) \(\E(u_i \given X_i) = 0\), which is exactly SR2; (2) the pairs \((X_i, Y_i)\) are i.i.d.; and (3) large outliers are unlikely (finite fourth moments). S&W drop homoskedasticity (they use robust standard errors throughout) and add the outlier condition. We follow HGL’s SR1–SR6.

5.5 Recap

The simple linear regression model is \(y_i = \beta_1 + \beta_2 x_i + e_i\), with population regression function \(\E(y \given x) = \beta_1 + \beta_2 x\). Every observation is the systematic part plus a random error, and the slope \(\beta_2 = \Delta\,\E(y \given x) / \Delta x\) is the marginal effect of \(x\) on the average of \(y\).

Keep the three “betas” distinct: a parameter \(\beta\) (fixed) is estimated by an estimator \(b\) (random), which yields an estimate (a number); likewise the unobserved error \(e\) has the computable residual \(\hat e\) as its sample stand-in.

The assumptions SR1–SR6 are the conditions under which this all works: SR1 the model; SR2 exogeneity \(\E(e \given x) = 0\) (exogenous \(\Rightarrow\) causal, otherwise endogenous); SR3 homoskedasticity; SR4 uncorrelated errors; SR5 variation in \(x\); and SR6 (optional) normality.

Next time: we have the model and the assumptions, but not the line. In the next chapter we choose \(b_1, b_2\) to minimize the sum of squared residuals (ordinary least squares) and find that the slope is \(b_2 = \Cov(x, y) / \Var(x)\), computed from the sample covariance and sample variance.