\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

7  Properties of OLS & the Gauss–Markov Theorem

Reading. SW 4.5, 5.5, HGL 2.4–2.6

In the last chapter we fit the line. The least-squares formulas \(b_2 = \widehat{\Cov}(x,y)/\widehat{\Var}(x)\) and \(b_1 = \bar y - b_2\bar x\) turned the food-expenditure data into \[ \widehat{\text{FOOD\_EXP}} = 83.42 + 10.21\,\text{INCOME}. \] The natural next question is: how good is this estimate? HGL’s answer is blunt: that question is unanswerable. We will never observe \(\beta_2\), so we can never say how close \(10.21\) lands to it.

So we change the question. Instead of judging a single estimate, we judge the estimation procedure: the random variables \(b_1\) and \(b_2\) that the procedure produces. This chapter shows that OLS is unbiased (on average across samples it nails \(\beta\)), works out its variance (its precision) and what drives it, and closes with the Gauss–Markov theorem: under the standard assumptions OLS is BLUE, the Best Linear Unbiased Estimator. The remaining unknown, \(\sigma^2\), and the standard errors that depend on it are the subject of the next chapter.

7.1 The estimator as a random variable

Draw another 40 households (the same incomes, but new families) and you get different estimates, because each \(y_i\) is random. HGL report ten hypothetical samples drawn from the same population, and the slopes wander all over the place.

samples <- data.frame(
  sample = c("1", "2", "3", "4", "$\\vdots$", "10", "**avg**"),
  b1     = c("93.64", "91.62", "126.76", "55.98", "$\\vdots$", "128.55", "**96.11**"),
  b2     = c("8.24", "8.90", "6.59", "11.23", "$\\vdots$", "6.99", "**8.70**")
)
knitr::kable(samples, col.names = c("sample", "$b_1$", "$b_2$"), align = "ccc")
Table 7.1: Ten hypothetical samples from one population. The same procedure gives different numbers each time.
| sample | \(b_1\) | \(b_2\) |
|:---:|:---:|:---:|
| 1 | 93.64 | 8.24 |
| 2 | 91.62 | 8.90 |
| 3 | 126.76 | 6.59 |
| 4 | 55.98 | 11.23 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| 10 | 128.55 | 6.99 |
| avg | 96.11 | 8.70 |

Across these samples \(b_2\) ranges from \(6.59\) to \(11.23\): the same procedure applied to different data produces different numbers. This sampling variation is unavoidable: \(b_1\) and \(b_2\) are random variables with a distribution. A hopeful sign is that the average slope across the ten samples, \(8.70\), sits near the truth, hinting at unbiasedness. With a single sample we can never see this spread directly, so we study it theoretically instead.
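The same experiment is easy to imitate on a computer. The sketch below is not HGL's data: the population values (`beta1`, `beta2`, `sigma`) and the income grid are made up for illustration. It holds \(x\) fixed, redraws the errors ten times, and records the slope each time.

```r
# Hypothetical population: every number here is an assumption for illustration
set.seed(1)
N     <- 40
x     <- seq(4, 33, length.out = N)   # incomes held fixed across samples
beta1 <- 80
beta2 <- 10
sigma <- 90

# One sample: keep x, redraw the errors, return the OLS slope
draw_slope <- function() {
  y <- beta1 + beta2 * x + rnorm(N, sd = sigma)
  cov(x, y) / var(x)
}

slopes <- replicate(10, draw_slope())
round(slopes, 2)   # ten different slopes from one population
range(slopes)      # the spread that a single sample never reveals
```

Nothing about the procedure changes between draws; only the errors do, and that alone is enough to move the slope.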

OLS is a linear estimator

To study the distribution of \(b_2\), we first rewrite it in a more revealing form (HGL Appendix 2C). Starting from the deviation form of the slope and using \(\sum (x_i - \bar x)\bar y = 0\), the estimator collapses to a weighted sum of the \(y_i\): \[ b_2 = \sum_{i=1}^N w_i\, y_i, \qquad w_i = \frac{x_i - \bar x}{\sum_{j}(x_j-\bar x)^2}. \]
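For readers who want the intermediate steps, the rewrite can be spelled out as follows, using only the deviation form of the slope and the fact that deviations from the mean sum to zero: \[ \begin{aligned} b_2 &= \frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_j (x_j-\bar x)^2} = \frac{\sum_i (x_i-\bar x)\,y_i \;-\; \bar y\sum_i (x_i-\bar x)}{\sum_j (x_j-\bar x)^2} \\ &= \sum_i \frac{x_i-\bar x}{\sum_j (x_j-\bar x)^2}\; y_i = \sum_i w_i\, y_i . \end{aligned} \]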

The weights \(w_i\) depend only on \(x\). Once we condition on the regressor, they are simply constants. This puts OLS into an important category.

Linear estimator

An estimator is a linear estimator if it is a weighted sum of the \(y_i\), \(\sum_i w_i y_i\), with weights that do not depend on the \(y_i\). OLS is a linear estimator, a fact we will lean on heavily when we get to Gauss–Markov.

Two facts that do all the bookkeeping. The OLS weights satisfy \(\sum_i w_i = 0\) and \(\sum_i w_i x_i = 1\). These two identities appear in every proof below.
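Both identities follow from deviations summing to zero: \(\sum_i (x_i - \bar x) = 0\) gives the first, and \(\sum_i (x_i-\bar x)x_i = \sum_i (x_i-\bar x)^2\) gives the second. A quick numerical check, on simulated data whose parameter values are assumptions for illustration, confirms them and shows that \(\sum_i w_i y_i\) reproduces the slope from `lm()`.

```r
set.seed(2)
x <- runif(40, min = 4, max = 33)          # hypothetical regressor values
y <- 80 + 10 * x + rnorm(40, sd = 90)      # hypothetical outcomes

w <- (x - mean(x)) / sum((x - mean(x))^2)  # OLS weights: functions of x only

sum(w)        # 0 up to floating-point rounding
sum(w * x)    # 1 up to floating-point rounding

b2_weights <- sum(w * y)                       # slope as a weighted sum of y
b2_lm      <- unname(coef(lm(y ~ x))["x"])     # slope reported by lm()
all.equal(b2_weights, b2_lm)
```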

The key decomposition

Now substitute the model \(y_i = \beta_1 + \beta_2 x_i + e_i\) into \(b_2 = \sum w_i y_i\) and apply \(\sum w_i = 0\) and \(\sum w_i x_i = 1\). The intercept and slope terms collapse, leaving the single most important equation of the chapter: \[ b_2 = \beta_2 + \sum_{i=1}^N w_i\, e_i . \]
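Written out term by term, the substitution is: \[ \begin{aligned} b_2 &= \sum_i w_i\,(\beta_1 + \beta_2 x_i + e_i) \\ &= \beta_1 \sum_i w_i \;+\; \beta_2 \sum_i w_i x_i \;+\; \sum_i w_i e_i \\ &= \beta_1 \cdot 0 \;+\; \beta_2 \cdot 1 \;+\; \sum_i w_i e_i \;=\; \beta_2 + \sum_i w_i e_i . \end{aligned} \]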

The workhorse decomposition

\[ b_2 = \underbrace{\beta_2}_{\text{what we want (fixed)}} \;+\; \underbrace{\sum_i w_i e_i}_{\text{estimation error (random)}}. \] Everything random about \(b_2\) lives in the error term \(\sum_i w_i e_i\). Its mean controls bias; its variance controls precision. We take them in turn.

7.2 Unbiasedness

Take the conditional expectation of the decomposition \(b_2 = \beta_2 + \sum w_i e_i\) given \(x\): \[ \begin{aligned} \E(b_2 \given x) &= \beta_2 + \sum_i w_i\,\E(e_i \given x)\\ &= \beta_2 + \sum_i w_i \cdot 0 = \beta_2 . \end{aligned} \] Two ingredients make this work. First, each \(w_i\) is constant given \(x\), so it pulls straight out of the expectation. Second, assumption SR2 says \(\E(e_i \given x) = 0\), which zeroes out the entire error term.

Unbiasedness of OLS

An estimator is unbiased if its expected value equals the parameter it estimates. Under assumptions SR1–SR5, \[ \E(b_2 \given x) = \beta_2 \qquad\text{and}\qquad \E(b_1 \given x) = \beta_1 . \]

What unbiasedness does and does not say

It is worth being precise about the claim, because it is easy to over-read.

Unbiasedness does say that, over all possible samples, the estimates average out to the true \(\beta_2\). The procedure is centered on the target: no systematic over- or under-shooting.

Unbiasedness does not say that your particular estimate, \(10.21\), is close to \(\beta_2\). A single draw can land far from the center of the distribution. Unbiasedness is a property of the estimator (the procedure), never of a single estimate.

Figure 7.1 makes the distinction visual. The bell curve is the sampling distribution of \(b_2\), centered exactly on \(\beta_2\). Any one sample gives a single draw from that curve, and that draw can sit well off-center even though the curve as a whole is correctly centered.

library(ggplot2)
# `ucla` is the colour palette list used by these notes; it is assumed to be
# defined in a setup chunk that is not shown here.
xs  <- seq(4, 14, length.out = 400)
# Illustrative sampling density for b2: normal, centered at 9
dat <- data.frame(x = xs, y = dnorm(xs, mean = 9, sd = 1.2))
ggplot(dat, aes(x, y)) +
  geom_area(fill = ucla$blue, alpha = 0.30) +
  geom_line(color = ucla$blue, linewidth = 1) +
  # Dashed line at the center; annotate() draws it once, not once per data row
  annotate("segment", x = 9, xend = 9, y = 0, yend = dnorm(9, 9, 1.2),
           linetype = "dashed", color = ucla$gray) +
  annotate("text", x = 9, y = dnorm(9, 9, 1.2) + 0.015,
           label = "beta[2]~(center)", parse = TRUE, color = ucla$darkblue,
           size = 3.4) +
  # A single estimate sitting far out in the right tail
  annotate("point", x = 11.23, y = 0.01, color = ucla$red, size = 2.2) +
  annotate("text", x = 11.7, y = 0.045, label = "one estimate",
           color = ucla$red, size = 3.2) +
  scale_y_continuous(limits = c(0, 0.40)) +
  labs(x = expression(value~of~b[2]~across~samples), y = NULL)
Figure 7.1: The sampling distribution of \(b_2\) is centered on \(\beta_2\), but any one estimate can land far from the center.
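Both halves of the claim can be checked by simulation. The sketch below assumes a made-up population (true slope `beta2 = 10`, error standard deviation 90, a fixed grid of incomes) in which \(\E(e_i \given x) = 0\) holds by construction. It draws 5000 samples and looks at the average slope and at the worst single slope.

```r
set.seed(3)
N     <- 40
x     <- seq(4, 33, length.out = N)          # regressor held fixed across samples
beta2 <- 10                                  # hypothetical true slope
w     <- (x - mean(x)) / sum((x - mean(x))^2)

# 5000 samples: only the errors are redrawn, and they have mean zero given x
b2 <- replicate(5000, {
  y <- 80 + beta2 * x + rnorm(N, sd = 90)
  sum(w * y)
})

mean(b2)               # close to beta2: the procedure is centered on the target
max(abs(b2 - beta2))   # yet the worst single estimate misses by a wide margin
```

The first number speaks to the estimator; the second is a reminder that no individual estimate inherits that guarantee.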

When unbiasedness fails: omitted variables

The entire proof leaned on SR2, \(\E(e_i \given x) = 0\). If that assumption breaks, so does unbiasedness: \[ \E(e_i \given x) \neq 0 \;\Longrightarrow\; \E(b_2 \given x) = \beta_2 + \sum_i w_i\,\E(e_i \given x) \neq \beta_2 . \]

The classic way SR2 fails is when a variable that belongs in the model has been left out and lurks in the error term.

Omitting ability from a wage equation

Consider \(\text{WAGE} = \beta_1 + \beta_2\,\text{EDUC} + e\), with a worker’s ability buried inside \(e\). Ability is correlated with education, so \(\E(e \given \text{EDUC}) \neq 0\). As a result \(b_2\) is biased: it confounds the genuine return to schooling with the payoff to ability.

This is omitted-variable bias, the formal face of the slogan “correlation \(\neq\) causation.” We quantify it precisely when we get to model specification.
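A small simulation makes the wage example concrete. Everything in it is hypothetical: the coefficients, the way ability feeds into schooling, and the true return to education of 1.0 are assumptions chosen for illustration, and the second regression simply adds the omitted variable as an extra regressor.

```r
set.seed(4)
n       <- 5000
ability <- rnorm(n)
educ    <- 12 + 2 * ability + rnorm(n)              # ability raises schooling
wage    <- 5 + 1.0 * educ + 3 * ability + rnorm(n)  # true return to educ is 1.0

# Short regression: ability is left in the error term, so SR2 fails
b_short <- unname(coef(lm(wage ~ educ))["educ"])
# Long regression: ability is included, so the error is mean-zero given the regressors
b_long  <- unname(coef(lm(wage ~ educ + ability))["educ"])

c(short = b_short, long = b_long)   # the short slope sits well above 1.0
```

The short regression credits education with part of the payoff to ability; more data would not fix this, because the estimator is centered on the wrong value.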

7.3 Variance and precision

Being unbiased is not enough on its own. We want estimates that are tightly clustered around \(\beta\), not merely centered on it. That tightness is the variance of the estimator. Under assumptions SR1–SR5 (HGL Appendix 2E), the variances and covariance of the OLS estimators are \[ \Var(b_2 \given x) = \frac{\sigma^2}{\sum (x_i-\bar x)^2}, \qquad \Var(b_1 \given x) = \sigma^2\!\left[\frac{\sum x_i^2}{N\sum(x_i-\bar x)^2}\right], \] \[ \Cov(b_1, b_2 \given x) = \sigma^2\!\left[\frac{-\bar x}{\sum(x_i-\bar x)^2}\right]. \]
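The slope variance can be sketched in a few lines from the decomposition \(b_2 = \beta_2 + \sum_i w_i e_i\), treating \(\beta_2\) and the \(w_i\) as constants given \(x\). SR3 (homoskedasticity) supplies \(\Var(e_i \given x) = \sigma^2\), and SR4 (uncorrelated errors) supplies \(\Cov(e_i, e_j \given x) = 0\) for \(i \neq j\): \[ \begin{aligned} \Var(b_2 \given x) &= \Var\!\Big(\sum_i w_i e_i \given x\Big) \\ &= \sum_i w_i^2\, \Var(e_i \given x) \;+\; \sum_{i \neq j} w_i w_j\, \Cov(e_i, e_j \given x) \\ &= \sigma^2 \sum_i w_i^2 \;=\; \sigma^2\, \frac{\sum_i (x_i-\bar x)^2}{\left[\sum_j (x_j-\bar x)^2\right]^2} \;=\; \frac{\sigma^2}{\sum (x_i-\bar x)^2} . \end{aligned} \]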

Smaller variance means more precise

Take two unbiased estimators with the same center. Prefer the one with the smaller variance: it has a higher chance of landing near \(\beta_2\) on any given sample. Most of what follows is about \(\Var(b_2 \given x)\).

What drives the precision of \(b_2\)?

Look hard at \(\displaystyle \Var(b_2 \given x) = \frac{\sigma^2}{\sum(x_i-\bar x)^2}\). Three levers control it.

  1. Error variance \(\sigma^2\) (the numerator). Noisier data about the line means a less precise slope. We cannot control this; it is a feature of the population.
  2. Spread of \(x\), measured by \(\sum(x_i-\bar x)^2\) (the denominator). More variation in income means a more precise slope. A wide lever arm pins the line down firmly.
  3. Sample size \(N\). Each additional observation adds a term to the denominator sum, so a larger \(N\) shrinks the variance: more data, tighter estimates.

The second lever is the easiest to picture. Figure 7.2 contrasts a sample with bunched-up \(x\) values against one with spread-out \(x\) values. When the \(x\)’s are bunched together, very different lines fit the cloud about equally well, so the slope is poorly determined. Spread the same number of points across a wide range of \(x\) and the line is nailed down.

bunched <- data.frame(
  x = c(4, 4.5, 5, 5.5, 5, 4.8),
  y = c(3, 5, 4, 6, 5.5, 3.8),
  panel = "bunched x: imprecise"
)
spread <- data.frame(
  x = c(1, 2.5, 4, 6, 8, 9),
  y = c(2, 3.5, 4, 6, 7.5, 8),
  panel = "spread x: precise"
)
pts  <- rbind(bunched, spread)
fits <- data.frame(
  x = c(0.5, 9.5, 0.5, 9.5, 0.5, 9.5),
  y = c(1 + 0.8 * 0.5, 1 + 0.8 * 9.5,    # line 1, bunched
        3 + 0.3 * 0.5, 3 + 0.3 * 9.5,    # line 2, bunched
        1 + 0.8 * 0.5, 1 + 0.8 * 9.5),   # line 1, spread
  grp   = c("a", "a", "b", "b", "c", "c"),
  panel = c("bunched x: imprecise", "bunched x: imprecise",
            "bunched x: imprecise", "bunched x: imprecise",
            "spread x: precise", "spread x: precise")
)
ggplot() +
  geom_line(data = fits, aes(x, y, group = grp, color = grp),
            linewidth = 1) +
  geom_point(data = pts, aes(x, y), color = ucla$darkblue, size = 1.6) +
  scale_color_manual(values = c(a = ucla$blue, b = ucla$red, c = ucla$blue),
                     guide = "none") +
  facet_wrap(~ panel) +
  scale_x_continuous(limits = c(0, 10), breaks = NULL) +
  scale_y_continuous(limits = c(0, 10), breaks = NULL) +
  labs(x = "x", y = "y")
Figure 7.2: Bunched regressors (left) leave the slope poorly determined; spread-out regressors (right) pin it down.
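The picture can be turned into a number. Using the \(x\) values plotted in the two panels of Figure 7.2 and a hypothetical error variance `sigma2 = 1` that is the same in both, the formula \(\sigma^2/\sum(x_i-\bar x)^2\) gives the slope variance in each case.

```r
# x values taken from the two panels of Figure 7.2
x_bunched <- c(4, 4.5, 5, 5.5, 5, 4.8)
x_spread  <- c(1, 2.5, 4, 6, 8, 9)

# Denominator of Var(b2 | x): total squared deviation of x from its mean
sxx <- function(x) sum((x - mean(x))^2)

sigma2 <- 1   # hypothetical error variance, identical in both panels
var_b2 <- c(bunched = sigma2 / sxx(x_bunched),
            spread  = sigma2 / sxx(x_spread))
var_b2

# How many times larger the slope variance is when x is bunched
ratio <- unname(var_b2["bunched"] / var_b2["spread"])
ratio
```

With the same six observations and the same noise, the bunched design has a slope variance roughly 38 times larger, purely because \(\sum(x_i-\bar x)^2\) is so much smaller.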

From variance to standard error

There is a catch: every variance formula above contains the unknown error variance \(\sigma^2\). To make them operational we replace \(\sigma^2\) with an estimate \(\hat\sigma^2\) (the subject of the next chapter). This gives an estimated variance, and its square root is the standard error: \[ \mathrm{se}(b_2) = \sqrt{\widehat{\Var}(b_2 \given x)} = \sqrt{\frac{\hat\sigma^2}{\sum(x_i-\bar x)^2}} . \]

The standard error is our estimate of the sampling standard deviation of \(b_2\): it answers “how far, typically, would \(b_2\) land from \(\beta_2\) across samples?” For the food-expenditure data, software reports \(\mathrm{se}(b_1) = 43.41\) and \(\mathrm{se}(b_2) = 2.09\). We can read both straight off the standard lm() output:

# Assumes a package that ships the HGL food-expenditure data (for example PoEdata) is attached
data(food)
fit <- lm(food_exp ~ income, data = food)
summary(fit)$coefficients
#>             Estimate Std. Error  t value     Pr(>|t|)
#> (Intercept) 83.41600  43.410163 1.921578 6.218242e-02
#> income      10.20964   2.093264 4.877381 1.945862e-05

The Std. Error column reproduces the numbers quoted above. Standard errors are the raw material of every confidence interval and \(t\)-test to come (see confidence intervals and hypothesis testing). But to actually compute them we first need \(\hat\sigma^2\); that is the first order of business in the next chapter.
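To see that the Std. Error column is nothing more than the formulas above with \(\hat\sigma^2\) plugged in, here is a sketch that rebuilds it by hand. It takes the next chapter's estimator \(\hat\sigma^2 = \mathrm{SSE}/(N-2)\) as given and uses simulated data, with made-up parameter values, so that it runs without the food data set.

```r
set.seed(5)
N <- 40
x <- runif(N, min = 4, max = 33)          # hypothetical regressor
y <- 80 + 10 * x + rnorm(N, sd = 90)      # hypothetical outcomes
fit <- lm(y ~ x)

sigma2_hat <- sum(resid(fit)^2) / (N - 2) # SSE / (N - 2)
sxx        <- sum((x - mean(x))^2)

se_b2 <- sqrt(sigma2_hat / sxx)                     # from Var(b2 | x)
se_b1 <- sqrt(sigma2_hat * sum(x^2) / (N * sxx))    # from Var(b1 | x)

# Side by side with what summary() reports
cbind(by_hand = c(se_b1, se_b2),
      lm      = summary(fit)$coefficients[, "Std. Error"])
```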

The shape of the distribution

The center (\(\beta_2\)) and the spread (\(\Var(b_2 \given x)\)) describe where the sampling distribution sits and how wide it is. What about its shape? The answer depends on whether the errors are normal.

If SR6 holds: exactly normal

If the errors are normally distributed, then \(b_2 = \sum w_i y_i\) is a weighted sum of normal random variables, hence exactly normal: \[ b_2 \given x \sim N\!\left(\beta_2,\ \frac{\sigma^2}{\sum(x_i-\bar x)^2}\right). \]

If SR6 fails: a central limit theorem

Even with non-normal errors, \(b_2\) is approximately normal in large samples. Because \(b_2\) is essentially an average (a weighted sum of the \(y_i\)), a central limit theorem applies.

Either way, exactly under normality or approximately in large samples, \(b_2\) ends up normal. That normal shape is precisely what makes the \(t\)-based inference of the next several chapters possible.
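The large-sample claim can be probed with deliberately non-normal errors. The sketch below is an illustration under assumed settings, not a proof: errors are exponential draws shifted to have mean 0 (variance 1, strongly right-skewed), \(N = 200\), and \(x\) sits on a fixed grid. It standardizes \(b_2 - \beta_2 = \sum_i w_i e_i\) by its true standard deviation and checks how often the result falls within \(\pm 1.96\).

```r
set.seed(6)
N <- 200
x <- seq(4, 33, length.out = N)             # hypothetical fixed regressor
w <- (x - mean(x)) / sum((x - mean(x))^2)
sd_b2 <- sqrt(1 / sum((x - mean(x))^2))     # true sd of b2 when sigma^2 = 1

# Errors: exponential minus 1, so mean 0 and variance 1 but far from normal
z <- replicate(5000, {
  e <- rexp(N) - 1
  sum(w * e) / sd_b2                        # standardized (b2 - beta2)
})

coverage <- mean(abs(z) <= 1.96)
coverage   # a standard normal would give about 0.95
```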

7.4 The Gauss–Markov theorem

We now know two things about OLS: it is linear (\(b_2 = \sum w_i y_i\)) and it is unbiased (\(\E(b_2 \given x) = \beta_2\)). The remaining question is whether it is the best such estimator. The answer is the central theoretical result of the simple regression model.

Gauss–Markov theorem

Given \(x\) and under assumptions SR1–SR5, the OLS estimators \(b_1\) and \(b_2\) have the smallest variance among all linear and unbiased estimators of \(\beta_1\) and \(\beta_2\). OLS is the Best Linear Unbiased Estimator (BLUE).

In words: within the class of estimators that are (i) weighted sums of the \(y_i\) and (ii) correct on average, nothing beats OLS on precision. There is no point hunting for a cleverer linear unbiased rule: you already hold the winner.

Reading the fine print

Gauss–Markov is precise about exactly what it promises, and each qualifier matters.

  1. “Best” is only within a class: linear and unbiased. A nonlinear or biased estimator could, in principle, have a smaller variance.
  2. “Best” means minimum variance within that class.
  3. It requires SR1–SR5. Break any one of them (for instance, heteroskedasticity breaks SR3) and OLS need no longer be best.
  4. It does not need normality (SR6). Gauss–Markov is a pure first- and second-moment result; it says nothing about the shape of the distribution.
  5. It applies to the estimators (the procedure), not to the estimates from any single sample.
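One way to make the theorem tangible is to pit OLS against a specific rival from the same class. The rival below, the slope of the line through the first and last observations, is a hypothetical competitor chosen for illustration and is not discussed in the readings. Any linear estimator \(\sum_i k_i y_i\) is unbiased for \(\beta_2\) when \(\sum_i k_i = 0\) and \(\sum_i k_i x_i = 1\), and under SR3 and SR4 its variance is \(\sigma^2 \sum_i k_i^2\), so the comparison needs no simulation.

```r
x <- seq(4, 33, length.out = 40)   # hypothetical fixed regressor
N <- length(x)

# OLS weights
w_ols <- (x - mean(x)) / sum((x - mean(x))^2)

# Rival linear estimator: (y_N - y_1) / (x_N - x_1), written as weights on y
w_end    <- numeric(N)
w_end[1] <- -1 / (x[N] - x[1])
w_end[N] <-  1 / (x[N] - x[1])

# Both satisfy sum(k) = 0 and sum(k * x) = 1, so both are unbiased
rbind(ols       = c(sum_k = sum(w_ols), sum_kx = sum(w_ols * x)),
      endpoints = c(sum_k = sum(w_end), sum_kx = sum(w_end * x)))

# Variance of each estimator: sigma^2 * sum(k^2)
sigma2 <- 1                        # hypothetical error variance
var_ols <- sigma2 * sum(w_ols^2)
var_end <- sigma2 * sum(w_end^2)
c(ols = var_ols, endpoints = var_end)
```

The endpoint rule throws away the 38 interior observations, and its variance is several times larger as a result. Gauss–Markov says the same ranking holds against every other linear unbiased rule.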

Stock & Watson call the same result the Gauss–Markov theorem too (5.5). Their efficiency statement is the textbook’s reason OLS is the default estimator almost everywhere in applied work.

What each assumption buys you

It helps to keep a scorecard of which property leans on which assumption. Each result in this chapter needs only some of the assumptions, and much of the rest of econometrics is a tour of which assumption is failing and what to do about it.

scorecard <- data.frame(
  prop = c("$b_1, b_2$ exist / computable",
           "Unbiased: $\\E(b\\mid x)=\\beta$",
           "Variance formulas as stated",
           "BLUE (Gauss–Markov)",
           "Exact normal $b$, exact inference"),
  needs = c("SR1, SR5", "SR1, SR2", "SR1–SR4", "SR1–SR5", "+ SR6"),
  why = c("line defined; $\\sum(x_i-\\bar x)^2 \\neq 0$",
          "$\\E(e\\mid x)=0$ kills the bias term",
          "SR3 (homoskedasticity), SR4 (uncorrelated)",
          "minimum variance in class",
          "normal errors $\\Rightarrow$ normal $b$")
)
knitr::kable(scorecard, col.names = c("Property", "Needs", "Why"),
             align = "lll")
Table 7.2: Which OLS property each assumption buys.
| Property | Needs | Why |
|:---|:---|:---|
| \(b_1, b_2\) exist / computable | SR1, SR5 | line defined; \(\sum(x_i-\bar x)^2 \neq 0\) |
| Unbiased: \(\E(b\mid x)=\beta\) | SR1, SR2 | \(\E(e\mid x)=0\) kills the bias term |
| Variance formulas as stated | SR1–SR4 | SR3 (homoskedasticity), SR4 (uncorrelated) |
| BLUE (Gauss–Markov) | SR1–SR5 | minimum variance in class |
| Exact normal \(b\), exact inference | + SR6 | normal errors \(\Rightarrow\) normal \(b\) |

When an assumption fails there is usually a standard remedy: robust standard errors when SR3 (homoskedasticity) breaks, clustering when SR4 (uncorrelated errors) breaks, and instrumental variables when SR2 (exogeneity) breaks. The chapters ahead work through these one at a time.

7.5 Recap

The estimate \(10.21\) is one draw of a random estimator, so we judge the procedure, not the number.

  • The estimator is random. OLS is a linear estimator, \(b_2 = \sum_i w_i y_i\) with \(w_i = (x_i - \bar x)/\sum_j(x_j-\bar x)^2\), and it decomposes as \(b_2 = \beta_2 + \sum_i w_i e_i\); the second term holds all the randomness.
  • Unbiased. \(\E(b_2 \given x) = \beta_2\) (it needs SR2), a statement about the procedure rather than any single estimate. It fails under omitted variables.
  • Precision. \(\Var(b_2 \given x) = \sigma^2 / \sum(x_i-\bar x)^2\) is smaller when \(\sigma^2\) is low, when \(x\) is spread out, and when \(N\) is large. The standard error is \(\mathrm{se}(b_2) = \sqrt{\hat\sigma^2/\sum(x_i-\bar x)^2}\).
  • Gauss–Markov. Under SR1–SR5, OLS is BLUE. Normality (SR6) is optional and is needed only for an exactly normal \(b\).

Next time: the last remaining unknown, \(\sigma^2\). We estimate it with \(\hat\sigma^2 = \mathrm{SSE}/(N-2)\), build the standard error of the regression on top of it, and use the fitted line to make point predictions (see variance estimation and prediction).