5 The Simple Linear Regression Model

Reading. Hill, Griffiths & Lim (5th ed.), sec. 2.1–2.2; Stock & Watson (4th ed.), sec. 4.1, 4.4.

The last four chapters built a probability toolkit. The very last idea, from the bivariate Normal, was that a conditional mean can be a straight line: \[ \E(Y \given X = x) = \alpha + \beta x, \qquad \beta = \frac{\Cov(X,Y)}{\Var(X)} . \] Starting now, that line becomes the object of the whole course: the simple linear regression model. This chapter writes down the model $y = \beta_1 + \beta_2 x + e$ and interprets each of its pieces, carefully separates the three things people sloppily all call “beta” (parameters, estimators, and estimates), and states the assumptions (SR1–SR6) that make the whole apparatus work.

Recall the very first lecture, where we scatter-plotted weekly food expenditure against income and eyeballed an upward-sloping cloud of points. Here we write down the model behind that cloud; in the next chapter we fit the line.

5.1 From an economic idea to a model

The running example throughout this part of the course comes from Hill, Griffiths & Lim: how does a household’s weekly food expenditure $y$ depend on its weekly income $x$?

Even among households with the same income, food spending varies with tastes, household size, restaurant meals, and impulse buys. So at each income $x$, the outcome $y$ is not a single number but has a whole conditional distribution $f(y \given x)$. Economic theory does not pin down every household; it speaks to the center of that distribution, the conditional mean $\E(y \given x)$, which we expect to rise with income. Figure 5.1 shows the picture: at two incomes $x_1$ and $x_2$ there is a spread of possible outcomes, each spread centered on a point that lies on the population regression line.


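# Requires ggplot2; the colour list `ucla` is assumed to come from the book's setup file.
library(ggplot2)
# endpoints of the illustrative population regression line, 83 + 10x
line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))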

# two vertical conditional densities (bells opening to the right)
bell <- function(x0, y0, scale = 6, span = 22) {
  t <- seq(-2.6, 2.6, length.out = 60)
  data.frame(x = x0 + scale * exp(-(t^2) / 2), y = y0 + span * t)
}
b1 <- bell(8, 163); b2 <- bell(20, 283)
means <- data.frame(x = c(8, 20), y = c(163, 283))

ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_path(data = b1, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_path(data = b2, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_point(data = means, aes(x, y), color = ucla$darkblue, size = 1.8) +
  annotate("text", x = 8,  y = 120, label = "mu[y*'|'*x[1]]",
           parse = TRUE, color = ucla$darkblue, size = 3) +
  annotate("text", x = 20, y = 240, label = "mu[y*'|'*x[2]]",
           parse = TRUE, color = ucla$darkblue, size = 3) +
  annotate("text", x = 22, y = 360,
           label = "E(y*'|'*x) == beta[1] + beta[2]*x",
           parse = TRUE, color = ucla$blue, size = 3) +
  scale_x_continuous(breaks = c(8, 20), labels = c(expression(x[1]), expression(x[2]))) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "income x", y = "food exp. y")

Figure 5.1: At each income there is a conditional distribution of food expenditure, centered on the population regression line $\E(y \mid x) = \beta_1 + \beta_2 x$.

From a rule to a model

Imagine first a made-up deterministic rule: a household spends \$80 plus 10 cents of each dollar of income on food, \[ y = 80 + 0.10\,x . \] Under this rule a \$100 rise in income raises spending by exactly \$10. The number $0.10$, the marginal propensity to spend on food, is the slope, and it is precisely the “how much” quantity a decision-maker cares about.

But reality is not deterministic. Countless other factors move food spending. We collect all of them into a single random error $e$, and we replace the fixed numbers $80$ and $0.10$ by unknown parameters $\beta_1$ and $\beta_2$, because in practice we do not know their values: \[ y = \beta_1 + \beta_2 x + e . \]
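
To make the model concrete, here is a minimal sketch in R that generates data from it. The values are illustrative assumptions, not estimates from the HGL data: the intercept 83 and slope 10 are borrowed from the line drawn in this chapter's figures, and the sample size, income range, and error standard deviation of 40 are invented for the demonstration.

```r
set.seed(1)                          # reproducible draws
N     <- 40                          # sample size (assumed)
beta1 <- 83                          # intercept (assumed; the figures' line)
beta2 <- 10                          # slope (assumed; the figures' line)

x <- runif(N, min = 2, max = 28)     # incomes, spread over the figures' x-range
e <- rnorm(N, mean = 0, sd = 40)     # random error: everything else that moves y
y <- beta1 + beta2 * x + e           # the model: systematic part + random error

head(data.frame(x, y, e), 3)         # first three simulated households
```

In a simulation we set $\beta_1$ and $\beta_2$ ourselves and can see $e$; with real data only $x$ and $y$ are observed.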

Systematic part + random error

This is the same “systematic part $+$ random error” template introduced in the first chapter, now specialized to one explanatory variable, with the two pieces of the systematic part given names, $\beta_1$ and $\beta_2$.

5.2 The simple linear regression model

We can now state the model that organizes the rest of the course.

The simple linear regression model

For each observation $i = 1,\dots,N$, \[ y_i = \beta_1 + \beta_2 x_i + e_i . \]

Each symbol has a name. On the left, $y_i$ is the dependent variable, also called the regressand or the “left-hand side” variable. On the right, $x_i$ is the independent or explanatory variable, also called the regressor, and $e_i$ is the random error, standing in for everything else that affects $y$. The two unknowns $\beta_1$ and $\beta_2$ are the intercept and slope parameters; both are fixed, unknown population parameters: there is one true value of each, out in the population, that we are trying to learn.

“Simple” means one regressor, not that the model is easy. Everything we do here generalizes to many regressors when we reach multiple regression.

The regression function and the systematic/random split

Suppose, as we will formally assume in a moment, that the errors average to zero at each value of $x$. Then taking the conditional mean of $y_i = \beta_1 + \beta_2 x_i + e_i$ leaves only the systematic part, giving the population regression function \[ \E(y \given x) = \beta_1 + \beta_2 x . \] Every observation therefore splits cleanly into two pieces, \[ y_i = \underbrace{\E(y_i \given x_i)}_{\text{systematic}} \;+\; \underbrace{e_i}_{\text{random}} . \] The line is the average behavior of food expenditure at each income; the error $e_i$ is the $i$th household’s departure from that average: the vertical gap between its point and the line, as in Figure 5.2.


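# Requires ggplot2; the colour list `ucla` is assumed to come from the book's setup file.
library(ggplot2)
# endpoints of the illustrative population regression line, 83 + 10x
line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))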
pts <- data.frame(
  x = c(5, 8, 11, 14, 17, 20, 23, 26, 9, 22),
  y = c(140, 150, 165, 255, 235, 300, 300, 360, 210, 360)
)
hi <- data.frame(x = 14, y = 255, yline = 83 + 10 * 14)

ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_point(data = pts, aes(x, y), color = ucla$darkblue, size = 1.6) +
  geom_segment(data = hi, aes(x = x, xend = x, y = yline, yend = y),
               linetype = "dashed", color = ucla$red) +
  geom_point(data = hi, aes(x, y), color = ucla$red, size = 2.2) +
  annotate("text", x = 14.8, y = 240, label = "e[i]", parse = TRUE,
           color = ucla$red, size = 3.4) +
  annotate("text", x = 23, y = 250, label = "E(y*'|'*x)", parse = TRUE,
           color = ucla$blue, size = 3) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "income x", y = "food exp. y")

Figure 5.2: Each observation is the regression line (systematic part) plus an error $e_i$, the vertical gap from the point to the line.
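
The step from the model to the population regression function can be written out in full. As a sketch, using only the linearity of conditional expectation and the zero-conditional-mean condition previewed above:

\[ \E(y_i \given x_i) = \E(\beta_1 + \beta_2 x_i + e_i \given x_i) = \beta_1 + \beta_2 x_i + \E(e_i \given x_i) = \beta_1 + \beta_2 x_i . \]

For the highlighted point in Figure 5.2, plotted at $(x_i, y_i) = (14, 255)$ against the illustrative line $83 + 10x$ used in the figure code, the split reads

\[ 255 = \underbrace{83 + 10 \times 14}_{223} \;+\; \underbrace{32}_{e_i} , \]

so that household spends 32 more than the average for its income.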

Interpreting the slope

The slope is the marginal effect of $x$ on the average of $y$: \[ \beta_2 = \frac{\Delta\,\E(y \given x)}{\Delta x} = \frac{d\,\E(y \given x)}{dx} . \] Holding “everything else” fixed (that is, $\Delta e = 0$), a change $\Delta x$ moves average spending by $\beta_2 \, \Delta x$. This is the ceteris paribus interpretation. In the food example, if income rises by \$100 then average food expenditure rises by $\beta_2 \times \$100$; that single number is exactly what a decision-maker wants to know.
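
A quick numerical check of the slope interpretation, using the made-up rule $y = 80 + 0.10\,x$ from earlier in place of the unknown population regression function. The function name `rule` and the two starting incomes are hypothetical choices for illustration.

```r
# Deterministic rule from the text: y = 80 + 0.10 x, with x in dollars
rule <- function(x, beta1 = 80, beta2 = 0.10) beta1 + beta2 * x

# A $100 rise in income, starting from two different (hypothetical) incomes
rise_from_500  <- rule(600)  - rule(500)
rise_from_1000 <- rule(1100) - rule(1000)
c(rise_from_500, rise_from_1000)     # both equal beta2 * 100
```

Because the function is linear, the effect of a \$100 rise is the same at every starting income; that is what lets a single number, $\beta_2$, summarize the marginal effect.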

The intercept $\beta_1 = \E(y \mid x = 0)$

The intercept is the average of $y$ when $x = 0$. Sometimes this is meaningful, often it is not. In a regression of test scores on class size, $\beta_1$ would be the predicted score for a class of zero students, which is nonsense. In such cases $\beta_1$ is best read as just the height that pins the line in place, not as a quantity to interpret on its own.

5.3 Parameters, estimators, estimates

Keeping three closely related objects straight is the central conceptual hurdle of the course. People sloppily call all three “beta,” but they are different kinds of thing.

Parameter, estimator, estimate

A parameter ($\beta_1, \beta_2$) is a fixed, unknown feature of the population. There is one true value; it is not random.
An estimator ($b_1, b_2$) is a formula applied to a sample. Because the sample is random, the estimator is itself a random variable: it has a sampling distribution.
An estimate (e.g. $b_1 = 83.4$) is the number the estimator produces in one particular sample. It is just a number, not random.

The connection to the sample mean

The estimator $b_2$ is to the parameter $\beta_2$ exactly as the sample mean $\bar Y$ is to the population mean $\mu$: a random variable that varies from sample to sample, with a center and a spread we can study. That is precisely how we will judge it when we turn to the properties of OLS and the variance of the estimators.
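
The three objects can be told apart in a small simulation. This is a sketch under assumed values (parameters 83 and 10, error standard deviation 40, samples of 40 households, 2000 repetitions); `one_estimate` is a hypothetical helper, and `lm()` is R's built-in linear-model fitter, used here as a black box because estimation is the subject of the next chapter.

```r
set.seed(42)
beta1 <- 83; beta2 <- 10             # parameters: fixed; known only because we simulate
N     <- 40                          # households per sample (assumed)
reps  <- 2000                        # number of repeated samples (assumed)

# The estimator is a rule: draw a sample, fit the line, report the slope
one_estimate <- function() {
  x <- runif(N, 2, 28)
  y <- beta1 + beta2 * x + rnorm(N, sd = 40)
  unname(coef(lm(y ~ x))[2])         # the estimate b2 from this one sample
}

b2_draws <- replicate(reps, one_estimate())
b2_draws[1:3]                        # three samples, three different estimates
mean(b2_draws)                       # center of the sampling distribution
sd(b2_draws)                         # spread of the sampling distribution
```

Each element of `b2_draws` is an estimate (a number); the collection as a whole approximates the sampling distribution of the estimator; `beta2` never changes.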

Error versus residual

A closely related distinction trips up nearly everyone, because it hinges on the same parameter-versus-estimate divide. The random error $e_i$ is a population object, \[ e_i = y_i - (\beta_1 + \beta_2 x_i) = y_i - \E(y_i \given x_i) , \] defined using the true parameters $\beta_1, \beta_2$. Because we never know those parameters, the error is unobservable. The residual $\hat e_i$ is the sample analog, \[ \hat e_i = y_i - (b_1 + b_2 x_i) = y_i - \hat y_i , \] defined using the estimated line. The residual is therefore observable: we can compute it as soon as we have fit the line in the next chapter.

The parallel

The error $e_i$ is to $\beta$ as the residual $\hat e_i$ is to $b$. The residual is our visible stand-in for the invisible error, and minimizing the sum of squared residuals is exactly how OLS chooses the line.
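
The error/residual distinction is easy to see in simulated data, where the true errors happen to be visible. A minimal sketch with assumed values (parameters 83 and 10, error standard deviation 40); `lm()` is again used as a black box ahead of the next chapter.

```r
set.seed(7)
beta1 <- 83; beta2 <- 10             # true parameters (assumed for the simulation)
N <- 40
x <- runif(N, 2, 28)
e <- rnorm(N, sd = 40)               # true errors: visible only because we simulated
y <- beta1 + beta2 * x + e

fit   <- lm(y ~ x)                   # fitted line
b     <- unname(coef(fit))           # estimates b1, b2
e_hat <- y - (b[1] + b[2] * x)       # residuals: gaps to the *fitted* line

head(cbind(error = e, residual = e_hat), 3)
all.equal(e_hat, unname(resid(fit))) # agrees with R's own residuals
```

The two columns are close but not equal, because the fitted line differs from the true line whenever $b_1 \neq \beta_1$ or $b_2 \neq \beta_2$.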

5.4 The assumptions: SR1–SR6

A model is only as trustworthy as the conditions behind it. The simple regression assumptions SR1–SR6 (“SR” for simple regression) are the conditions under which two things hold: the slope $\beta_2$ measures a genuinely causal marginal effect, and the estimators $b_1, b_2$ are well behaved: unbiased, with a known sampling distribution we can use for inference. Much of the rest of econometrics is about what to do when one of these assumptions fails, so it pays to know exactly what we are assuming, and which assumption each later technique is designed to rescue. We meet them one at a time and then collect them.

SR1 and SR2: the model and strict exogeneity

SR1: the model holds in the population

\[ y_i = \beta_1 + \beta_2 x_i + e_i \qquad \text{for all } i = 1,\dots,N . \]

SR2: strict exogeneity (the crucial one)

The error has conditional mean zero given the regressor(s): \[ \E(e_i \given x) = 0 . \]

SR2 says that knowing $x$ tells you nothing about the average error: the omitted factors balance out to zero at every value of $x$. It is the assumption that does the heavy lifting, because it delivers two consequences at once, \[ \E(e_i \given x) = 0 \;\Longrightarrow\; \E(e_i) = 0 \quad\text{and}\quad \Cov(e_i, x_i) = 0 , \] and from it follows the regression function $\E(y_i \given x) = \beta_1 + \beta_2 x_i$ that we used above.
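
Both consequences follow from the law of iterated expectations. A short derivation, as a sketch:

\[ \E(e_i) = \E\big[\E(e_i \given x)\big] = \E[0] = 0 , \]

\[ \Cov(e_i, x_i) = \E(x_i e_i) - \E(x_i)\,\E(e_i) = \E\big[x_i\,\E(e_i \given x)\big] - \E(x_i)\cdot 0 = 0 . \]

The converse does not hold in general: zero covariance rules out only a linear relationship between $e_i$ and $x_i$, so SR2 is the stronger statement.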

The covariance consequence is what separates good cases from bad ones. If $\Cov(e, x) = 0$, the regressor $x$ is exogenous: regression can recover $\beta_1, \beta_2$, and $\beta_2$ is the causal marginal effect. If instead $\Cov(e, x) \neq 0$, then $x$ is endogenous, and $\beta_2$ is not causal. This is the formal version of the slogan “correlation $\neq$ causation” from the first chapter.

Wages and education (HGL)

Consider $\text{WAGE}_i = \beta_1 + \beta_2\,\text{EDUC}_i + e_i$. The error $e$ holds factors like ability, drive, and intelligence, all plausibly correlated with education. Then $\E(e \given \text{EDUC}) \neq 0$, education is endogenous, and $b_2$ confounds the true return to schooling with the effect of ability. (We tackle problems of this kind much later in the course.)
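
A simulation makes the confounding visible. Everything below is hypothetical: the variable `ability`, the way it feeds into education and wages, and the “true” return to schooling of 1 are invented so that the right answer is known in advance.

```r
set.seed(123)
N <- 5000
ability <- rnorm(N)                       # unobserved; ends up inside the error
educ    <- 12 + 2 * ability + rnorm(N)    # education rises with ability (assumed)

beta1 <- 5; beta2 <- 1                    # assumed true intercept and return to schooling
e     <- 3 * ability + rnorm(N)           # the error contains ability
wage  <- beta1 + beta2 * educ + e

cov_e_educ <- cov(e, educ)                # far from zero: educ is endogenous
b2 <- unname(coef(lm(wage ~ educ))[2])    # fitted slope
c(cov_e_educ = cov_e_educ, b2 = b2, beta2 = beta2)
```

Under these assumed numbers $\Cov(e, \text{EDUC}) = 3 \times 2 = 6$ and $\Var(\text{EDUC}) = 5$, so in large samples the fitted slope settles near $1 + 6/5 = 2.2$ rather than the assumed return of 1: the slope picks up the effect of ability as well.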

SR3 and SR4: spread and dependence of the errors

SR3: homoskedasticity

The error has constant conditional variance, \[ \Var(e_i \given x) = \sigma^2 . \] The spread of $y$ about the line is the same at every $x$. If the variance changes with $x$, the errors are heteroskedastic.

SR4: uncorrelated errors

\[ \Cov(e_i, e_j \given x) = 0, \qquad i \neq j . \] One observation’s error carries no information about another’s. This typically fails with clustered or time-series data.

Homoskedasticity is easiest to see in a picture. Figure 5.3 redraws the conditional-distribution diagram with the two bells given the same width; that equal width is SR3.


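# Requires ggplot2; the colour list `ucla` is assumed to come from the book's setup file.
library(ggplot2)
# endpoints of the illustrative population regression line, 83 + 10x
line_df <- data.frame(x = c(2, 28), y = 83 + 10 * c(2, 28))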
bell <- function(x0, y0, scale = 6, span = 20) {
  t <- seq(-2.6, 2.6, length.out = 60)
  data.frame(x = x0 + scale * exp(-(t^2) / 2), y = y0 + span * t)
}
b1 <- bell(8, 163); b2 <- bell(20, 283)

ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_path(data = b1, aes(x, y), color = ucla$red, linewidth = 0.8) +
  geom_path(data = b2, aes(x, y), color = ucla$red, linewidth = 0.8) +
  scale_x_continuous(breaks = c(8, 20), labels = c(expression(x[1]), expression(x[2]))) +
  scale_y_continuous(limits = c(0, 420)) +
  labs(x = "x", y = "y")

Figure 5.3: SR3 (homoskedasticity): the conditional distribution of $y$ has the same spread at every $x$; the two bells are equally wide.
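
The same contrast can be checked numerically. In this sketch the heteroskedastic errors are given a standard deviation proportional to $x$; that functional form, the cutoff at $x = 15$, and all numbers are assumptions chosen only to make the difference obvious.

```r
set.seed(2)
N <- 4000
x <- runif(N, 2, 28)

e_homo   <- rnorm(N, sd = 40)        # SR3 holds: same spread at every x
e_hetero <- rnorm(N, sd = 4 * x)     # SR3 fails: spread grows with x (assumed form)

low <- x < 15                        # split the sample into low and high x
sd_homo   <- c(low = sd(e_homo[low]),   high = sd(e_homo[!low]))
sd_hetero <- c(low = sd(e_hetero[low]), high = sd(e_hetero[!low]))
rbind(sd_homo, sd_hetero)
```

Under SR3 the two standard deviations in the first row are about equal; in the second row the high-$x$ group is visibly more spread out.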

SR5 and SR6: variation in $x$, and (optional) normality

SR5: the regressor must vary

In the sample, $x_i$ takes at least two different values. As the old saw goes, “it takes two points to determine a line”: with no variation in $x$ there is no slope to estimate.
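
What goes wrong without variation in $x$ can be seen directly. In this sketch every simulated household has the same income (an assumed value of 10), so the denominator $\Var(x)$ in the slope formula from the chapter opening is zero.

```r
set.seed(3)
x_const <- rep(10, 20)                          # every household has the same income
y <- 83 + 10 * x_const + rnorm(20, sd = 40)     # assumed parameters and error spread

var(x_const)                                    # 0: no variation in x
fit_const <- lm(y ~ x_const)
coef(fit_const)                                 # the slope is reported as NA
```

R reports `NA` for the slope because the data cannot distinguish one slope from another: any line through the single column of points fits equally well.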

SR6: normality of errors (optional)

\[ e_i \given x \sim N(0, \sigma^2) \quad\Longleftrightarrow\quad y_i \given x \sim N(\beta_1 + \beta_2 x_i,\ \sigma^2) . \]

SR6 is not needed for the estimators to work. Its role is to make small-sample inference exact, as we will see when we build confidence intervals. It is also plausible: by the Central Limit Theorem from the Normal chapter, an error that sums up many small independent factors tends toward a Normal distribution.
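
The plausibility argument can be illustrated with a small simulation. This is a sketch, not a proof: each error is built as the sum of 50 independent shocks drawn from a uniform distribution on $(-1, 1)$, which is far from Normal. The number of factors and their distribution are assumptions.

```r
set.seed(4)
n_factors <- 50                      # small independent influences per error (assumed)
n_obs     <- 10000                   # number of simulated errors

# one row per observation, one column per factor
factors <- matrix(runif(n_factors * n_obs, -1, 1), nrow = n_obs)
e <- rowSums(factors)                # the error sums up the factors

# Normal benchmark: about 95% of draws lie within 1.96 standard deviations
share_within <- mean(abs(e - mean(e)) < 1.96 * sd(e))
share_within
```

A histogram of `e` (for example `hist(e, breaks = 50)`) shows the familiar bell shape even though no individual factor is Normal.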

The six at a glance

It helps to see all six in one place.

The simple regression assumptions SR1–SR6.

| Assumption | Statement |
|:--|:--|
| SR1 | $y_i = \beta_1 + \beta_2 x_i + e_i$ |
| SR2 | $\E(e_i \given x) = 0$ (strict exogeneity) |
| SR3 | $\Var(e_i \given x) = \sigma^2$ (homoskedastic) |
| SR4 | $\Cov(e_i, e_j \given x) = 0,\ i \neq j$ |
| SR5 | $x_i$ takes $\ge 2$ values |
| SR6 | $e_i \given x \sim N(0, \sigma^2)$ (optional) |

The same idea in Stock & Watson. S&W write the model as $Y_i = \beta_0 + \beta_1 X_i + u_i$ and list three assumptions: (1) $\E(u_i \given X_i) = 0$, which is exactly SR2; (2) the pairs $(X_i, Y_i)$ are i.i.d.; and (3) large outliers are unlikely (finite fourth moments). S&W drop homoskedasticity (they use robust standard errors throughout) and add the outlier condition. We follow HGL’s SR1–SR6.

5.5 Recap

The simple linear regression model is $y_i = \beta_1 + \beta_2 x_i + e_i$, with population regression function $\E(y \given x) = \beta_1 + \beta_2 x$. Every observation is the systematic part plus a random error, and the slope $\beta_2 = \Delta\,\E(y \given x) / \Delta x$ is the marginal effect of $x$ on the average of $y$.

Keep the three “betas” distinct: a parameter $\beta$ (fixed) is estimated by an estimator $b$ (random), which yields an estimate (a number); likewise the unobserved error $e$ has the computable residual $\hat e$ as its sample stand-in.

The assumptions SR1–SR6 are the conditions under which this all works: SR1 the model; SR2 exogeneity $\E(e \given x) = 0$ (exogenous $\Rightarrow$ causal, otherwise endogenous); SR3 homoskedasticity; SR4 uncorrelated errors; SR5 variation in $x$; and SR6 (optional) normality.

Next time: we have the model and the assumptions, but not the line. In the next chapter we choose $b_1, b_2$ to minimize the sum of squared residuals (ordinary least squares) and find that the slope is $b_2 = \Cov(x, y) / \Var(x)$, with the covariance and variance computed from the sample.

