\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

9  Confidence Intervals

Reading. SW 5.2, HGL 3.1, 3.6

By now we have everything we need to start doing statistical inference. From the food-expenditure regression we have a slope estimate \(b_2 = 10.21\) with standard error \(\mathrm{se}(b_2) = 2.09\), and from the properties of OLS we know that, conditional on the regressor, \[ b_2 \given x \sim N\!\left(\beta_2,\ \frac{\sigma^2}{\sum(x_i-\bar x)^2}\right). \]

A point estimate by itself says nothing about its reliability. We can report “\(\beta_2 = 10.21\),” but how sure are we? Could the truth plausibly be \(6\)? Could it be \(14\)? This chapter answers that question by reporting a range of plausible values: an interval estimate, better known as a confidence interval. We do this in three steps: turn the normal \(b_2\) into a usable \(t\)-statistic, build the interval \(b_2 \pm t_c\,\mathrm{se}(b_2)\) and interpret it carefully, and finally extend it to linear combinations of the parameters such as the conditional mean \(\E(y\given x_0)=\beta_1 + x_0\beta_2\).

9.1 From the normal to the \(t\)-distribution

Under the simple-regression assumptions SR1–SR6, \(b_2\) is conditionally normal. The natural first move is to standardize it, subtracting its mean and dividing by its standard deviation: \[ Z = \frac{b_2 - \beta_2}{\sqrt{\sigma^2/\sum(x_i-\bar x)^2}} \sim N(0,1). \]

The quantity \(Z\) is pivotal: its \(N(0,1)\) distribution involves no unknown parameters, so we can read probabilities straight off the normal table. For instance, \[ \Prob(-1.96 \le Z \le 1.96) = 0.95 . \]
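The table lookup can also be checked numerically. The short sketch below uses only base R's `pnorm()` and `qnorm()`; the variable names are illustrative.

```r
# Probability mass of N(0, 1) between -1.96 and 1.96
p_central <- pnorm(1.96) - pnorm(-1.96)

# The exact two-sided 95% critical value, of which 1.96 is the usual rounding
z_c <- qnorm(0.975)

c(p_central = p_central, z_c = z_c)
```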

One snag

\(Z\) still contains the unknown \(\sigma^2\) in the denominator, so we cannot actually compute it. We must replace \(\sigma^2\) with its estimate \(\hat\sigma^2\), and that swap changes the distribution.

Swapping \(\sigma^2\) for \(\hat\sigma^2\)

Replace \(\sigma^2\) by \(\hat\sigma^2 = \mathrm{SSE}/(N-2)\). The denominator then becomes exactly the standard error of \(b_2\), and the standardized statistic is no longer normal; it follows Student’s \(t\)-distribution: \[ t = \frac{b_2 - \beta_2}{\sqrt{\hat\sigma^2/\sum(x_i-\bar x)^2}} = \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \sim t_{(N-2)} . \]

The same construction works for the intercept \(b_1\). In general, for \(k = 1, 2\), \[ t = \frac{b_k - \beta_k}{\mathrm{se}(b_k)} \sim t_{(N-2)} . \]

The engine of inference

This single equation is the engine of both confidence intervals (this chapter) and hypothesis tests (the next). Like \(Z\), it is pivotal (no unknown parameters and no dependence on \(x\)), which is exactly what lets us turn it into statements about \(\beta_k\).
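A small Monte Carlo experiment can make the \(t_{(N-2)}\) claim tangible. The sketch below is illustrative only: the true parameter values, the sample size \(N = 10\), and the uniform regressor are all assumptions chosen so that the \(t\) and the normal differ visibly. It simulates many samples with normal errors, computes \((b_2 - \beta_2)/\mathrm{se}(b_2)\) for each, and compares how often the statistic falls inside the \(t\) critical value versus the normal \(1.96\).

```r
set.seed(1)
N     <- 10                          # deliberately small sample (assumption)
beta1 <- 1; beta2 <- 2; sigma <- 3   # hypothetical true values
x     <- runif(N, 0, 10)             # regressor held fixed across replications

# One replication: draw y, fit OLS, return the t-statistic for the slope
one_t <- function() {
  y   <- beta1 + beta2 * x + rnorm(N, sd = sigma)
  est <- summary(lm(y ~ x))$coefficients
  (est["x", "Estimate"] - beta2) / est["x", "Std. Error"]
}
t_sim <- replicate(5000, one_t())

tc      <- qt(0.975, df = N - 2)      # critical value from t_(N-2)
cover_t <- mean(abs(t_sim) <= tc)     # share inside +/- t_c
cover_z <- mean(abs(t_sim) <= 1.96)   # share inside +/- 1.96
c(cover_t = cover_t, cover_z = cover_z)
```

Under these assumptions the first share should sit close to \(0.95\), while the second should fall short of it, which is the practical cost of using normal critical values in a small sample.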

What the \(t\)-distribution looks like

The \(t\)-distribution is bell-shaped, symmetric, and centered at \(0\), just like the standard normal. The difference is in the tails: the \(t\) has fatter tails and more spread, because estimating \(\sigma^2\) injects extra uncertainty into the statistic. Its exact shape is governed by a single number, the degrees of freedom \(\mathrm{df} = N - 2\). As \(\mathrm{df} \to \infty\) the \(t\) converges to the normal, so for large \(N\) the \(95\%\) critical value is \(t_c \approx 1.96\).

This is why Stock & Watson, who lean on large samples, simply use the normal and \(1.96\). With only \(N = 40\) observations in the food data we use the exact \(t\). Figure 9.1 contrasts the two: the \(t\) (here with just \(3\) degrees of freedom, to exaggerate the effect) sits lower in the middle and is fatter in the tails than the \(N(0,1)\).

library(ggplot2)
# Assumes `ucla` is a named list of colours defined in the book's setup code.
xs   <- seq(-4, 4, length.out = 400)
dens <- data.frame(
  x    = rep(xs, 2),
  y    = c(dnorm(xs), dt(xs, df = 3)),
  dist = rep(c("N(0, 1)", "t (df = 3)"), each = length(xs))
)
ggplot(dens, aes(x, y, color = dist, linetype = dist)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("N(0, 1)" = ucla$darkblue,
                                "t (df = 3)" = ucla$red)) +
  scale_linetype_manual(values = c("N(0, 1)" = "solid",
                                   "t (df = 3)" = "dashed")) +
  scale_y_continuous(limits = c(0, 0.45)) +
  labs(x = NULL, y = NULL, color = NULL, linetype = NULL)
Figure 9.1: The \(t\)-distribution (here \(t_{(3)}\)) is bell-shaped and symmetric like the standard normal, but sits lower in the middle and has fatter tails.
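The convergence of the \(t\) to the normal can also be read off the critical values themselves. The sketch below tabulates the two-sided \(95\%\) critical value for a handful of degrees of freedom (the particular values of `df` are arbitrary choices, apart from \(38\), which matches the food data) next to the normal value.

```r
df <- c(3, 10, 38, 100, 1000)   # illustrative degrees of freedom
tc <- qt(0.975, df = df)        # two-sided 95% critical values from t_(df)
zc <- qnorm(0.975)              # the normal benchmark, about 1.96

data.frame(df = df, t_c = round(tc, 3), z_c = round(zc, 3))
```

The critical value shrinks steadily toward the normal benchmark as `df` grows, and the entry for `df = 38` is the \(2.024\) used for the food data.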

9.2 Confidence intervals for a coefficient

To build an interval, start from Statistical Table 2 and pick the critical value \(t_c = t_{(1-\alpha/2,\,N-2)}\) that puts \(\alpha/2\) of the probability in each tail of the \(t_{(N-2)}\) distribution, so that \[ \Prob(-t_c \le t \le t_c) = 1-\alpha . \]

Now substitute \(t = (b_k - \beta_k)/\mathrm{se}(b_k)\) and rearrange the inequality to isolate the unknown \(\beta_k\): \[ \Prob\!\bigl[\,b_k - t_c\,\mathrm{se}(b_k) \le \beta_k \le b_k + t_c\,\mathrm{se}(b_k)\,\bigr] = 1-\alpha . \]

The \(100(1-\alpha)\%\) interval estimator

\[ b_k \pm t_c\,\mathrm{se}(b_k) \] There are three ingredients: the estimate \(b_k\), its standard error \(\mathrm{se}(b_k)\), and a critical value \(t_c\) that bakes in both the confidence level (through \(\alpha\)) and the sample size (through the degrees of freedom).
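The three ingredients map directly onto a small function. The helper below, `ci_coef()`, is a hypothetical name introduced here for illustration; it is a minimal sketch of the formula, not a replacement for R's built-in tools.

```r
# Hypothetical helper (not from the text): t-based interval for one coefficient
ci_coef <- function(b, se, df, level = 0.95) {
  alpha <- 1 - level
  tc    <- qt(1 - alpha / 2, df = df)   # critical value t_(1 - alpha/2, df)
  c(lower = b - tc * se, upper = b + tc * se)
}

# Rounded food-regression values: b2 = 10.21, se(b2) = 2.09, df = 40 - 2
ci_food <- ci_coef(b = 10.21, se = 2.09, df = 38)
ci_food
```

Because the inputs here are rounded to two decimals, the endpoints can differ in the second decimal place from an interval computed at full precision from a fitted model.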

Food data: a 95% interval for \(\beta_2\)

In the food data \(N = 40\), so \(\mathrm{df} = 38\), and for \(\alpha = 0.05\) the critical value is \(t_c = t_{(0.975,\,38)} = 2.024\). With \(b_2 = 10.21\) and \(\mathrm{se}(b_2) = 2.09\), \[ b_2 \pm t_c\,\mathrm{se}(b_2) = 10.21 \pm 2.024(2.09) = [\,5.97,\ 14.45\,]. \] The endpoints are computed from the unrounded estimates; plugging in the rounded values shown gives \([5.98,\ 14.44]\).

In R we need not compute this by hand: we fit the model and call confint(), which returns the same interval.

data(food)
fit <- lm(food_exp ~ income, data = food)
confint(fit, "income", level = 0.95)
#>           2.5 %   97.5 %
#> income 5.972052 14.44723

Reading it in context

We estimate, with 95% confidence, that an extra $100 of weekly income raises a household's expected food spending by between $5.97 and $14.45. The range is wide: a single regressor leaves a lot of uncertainty about \(\beta_2\). A supermarket CEO planning store capacity would stress-test decisions across this whole range, not just the point estimate \(10.21\).

What “95% confidence” really means

The confidence is in the procedure, not in any one interval. Across all possible samples, \(95\%\) of the intervals built this way will contain the true \(\beta_2\). Our particular interval \([5.97, 14.45]\) either contains \(\beta_2\) or it does not, and we will never know which.

A 95% interval is not a 95% probability statement

It is wrong to say “\(\beta_2\) has a 95% probability of being in \([5.97, 14.45]\).” The parameter \(\beta_2\) is a fixed (if unknown) number; it is the interval that is random, because it is built from the random sample. Once the sample is drawn, the interval is fixed too, and the only honest statement is that the procedure that produced it works \(95\%\) of the time.

Figure 9.2 makes this concrete. Imagine drawing many samples and building a \(95\%\) interval from each. The vertical line is the true \(\beta_2\); most intervals straddle it, but a handful (shown in red) miss entirely. Over the long run, about \(1\) in \(20\) misses.

set.seed(103)
beta2 <- 10.21
n_int <- 20
centers <- rnorm(n_int, mean = beta2, sd = 2.09)
half    <- 2.024 * 2.09
ints <- data.frame(
  id    = seq_len(n_int),
  lo    = centers - half,
  hi    = centers + half,
  mid   = centers
)
ints$miss <- ints$lo > beta2 | ints$hi < beta2
ggplot(ints, aes(y = id)) +
  geom_vline(xintercept = beta2, linetype = "dashed", color = ucla$gray) +
  geom_segment(aes(x = lo, xend = hi, yend = id, color = miss),
               linewidth = 1) +
  geom_point(aes(x = mid, color = miss), size = 1.6) +
  scale_color_manual(values = c("FALSE" = ucla$blue, "TRUE" = ucla$red),
                     guide = "none") +
  labs(x = expression(beta[2]), y = "sample")
Figure 9.2: Many 95% intervals from repeated samples. Most cover the true \(\beta_2\) (vertical line); the red ones miss. In the long run about 5% miss.
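The long-run claim can be checked by brute force. The sketch below is a simulation under assumed values, not the food data: it fixes a hypothetical true model, draws many samples of \(N = 40\) with normal errors, builds a \(95\%\) interval for the slope from each with `confint()`, and records how often the interval covers the true slope.

```r
set.seed(42)
N     <- 40
beta1 <- 80; beta2 <- 10; sigma <- 90   # hypothetical true values (assumptions)
x     <- runif(N, 5, 35)                # regressor held fixed across samples

# For each simulated sample: does the 95% interval contain the true slope?
covers <- replicate(2000, {
  y  <- beta1 + beta2 * x + rnorm(N, sd = sigma)
  ci <- confint(lm(y ~ x), "x", level = 0.95)
  ci[1] <= beta2 && beta2 <= ci[2]
})

coverage <- mean(covers)   # share of intervals that cover beta2
coverage
```

The share should land near \(0.95\), with some simulation noise. Unlike the stylized figure, each interval here has its own width, because \(\mathrm{se}(b_2)\) is re-estimated in every sample.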

The width of the interval is information

The half-width of the interval is \[ \text{half-width} = t_c\,\mathrm{se}(b_k) = t_c\sqrt{\frac{\hat\sigma^2}{\sum(x_i-\bar x)^2}} , \] and its size tells us how much the data have taught us. A narrow interval corresponds to a small standard error: the data pin down \(\beta_k\) sharply, so we have learned a lot. A wide interval corresponds to a large standard error and little information about \(\beta_k\). Everything that shrank \(\mathrm{se}(b_2)\) in the chapter on the variance of OLS (a smaller error variance \(\sigma^2\), more spread-out \(x\) values, a larger sample \(N\)) also narrows the interval. Demanding higher confidence, on the other hand, raises \(t_c\) (a \(99\%\) interval uses a bigger critical value than a \(95\%\) one) and so widens the interval: more coverage costs precision.
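To see the trade-off between coverage and precision in numbers, the sketch below holds the standard error at the rounded food-data value \(\mathrm{se}(b_2) = 2.09\) with \(38\) degrees of freedom and varies only the confidence level. The three levels are arbitrary illustrative choices.

```r
se_b2  <- 2.09                          # rounded standard error from the food regression
levels <- c(0.90, 0.95, 0.99)           # illustrative confidence levels
tc     <- qt(1 - (1 - levels) / 2, df = 38)
half   <- tc * se_b2                    # half-width = t_c * se(b2)

data.frame(level = levels, t_c = round(tc, 3), half_width = round(half, 2))
```

Each step up in confidence raises the critical value and therefore the half-width, even though the data and the standard error are unchanged.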

The handy large-sample shortcut

When \(\mathrm{df} = N - 2 > 30\), the critical value \(t_c \approx 2\), so a quick \(95\%\) interval is \[ b_k \pm 2\,\mathrm{se}(b_k). \] This is the rule of thumb behind the phrase “two standard errors.”

9.3 Linear combinations of parameters

Often the quantity we actually care about mixes both parameters: a linear combination \[ \lambda = c_1\beta_1 + c_2\beta_2, \] where \(c_1\) and \(c_2\) are constants we choose. The headline case is the conditional mean of \(y\) at a specific value \(x_0\) of the regressor, \[ \E(y\given x = x_0) = \beta_1 + x_0\,\beta_2 \qquad (c_1 = 1,\ c_2 = x_0). \]

We estimate \(\lambda\) in the natural way, by plugging in the OLS estimates, and, because \(b_1\) and \(b_2\) are BLUE, the estimator \(\hat\lambda = c_1 b_1 + c_2 b_2\) is the best linear unbiased estimator of \(\lambda\). Unbiasedness follows directly from the linearity of expectation: \[ \E(\hat\lambda\given x) = c_1\E(b_1\given x) + c_2\E(b_2\given x) = c_1\beta_1 + c_2\beta_2 = \lambda . \]

The standard error of a linear combination

The point estimate is easy; the standard error needs the variance-of-a-sum rule from the chapter on expectation, variance and covariance, and the covariance term is essential: \[ \Var(\hat\lambda\given x) = c_1^2\,\Var(b_1\given x) + c_2^2\,\Var(b_2\given x) + 2c_1 c_2\,\Cov(b_1,b_2\given x). \] Plugging in the estimated variances and covariance (from the variance chapter) and taking the square root gives the standard error, \[ \mathrm{se}(\hat\lambda) = \sqrt{\widehat{\Var}(\hat\lambda\given x)} . \]

Don't forget the covariance

A common mistake is to add only \(c_1^2\widehat{\Var}(b_1) + c_2^2\widehat{\Var}(b_2)\) and stop. Because \(b_1\) and \(b_2\) are correlated (recall that \(\Cov(b_1, b_2) < 0\) whenever \(\bar x > 0\)), the cross term \(2 c_1 c_2 \widehat{\Cov}(b_1, b_2)\) is genuinely part of the variance and cannot be dropped.

Food data: a CI for expected food expenditure

Suppose we want to “estimate average weekly food spending for households with $2,000 of income,” i.e. \(x_0 = 20\) (income is measured in $100 units). This is the conditional mean \(\E(y\given x_0 = 20) = \beta_1 + 20\beta_2\). The point estimate is \[ \hat\lambda = b_1 + 20\,b_2 = 83.42 + 20(10.21) = 287.61 . \] Using the estimated \(\widehat{\Var}(b_1) = 1884.44\), \(\widehat{\Var}(b_2) = 4.3818\), and \(\widehat{\Cov}(b_1, b_2) = -85.90\), \[ \widehat{\Var}(\hat\lambda) = 1884.44 + 20^2(4.3818) + 2(20)(-85.90) = 201.02, \] \[ \mathrm{se}(\hat\lambda) = \sqrt{201.02} = 14.18 . \] The results \(287.61\) and \(201.02\) carry the full precision of the fitted model; redoing the arithmetic with the rounded inputs shown gives \(287.62\) and \(201.16\). A \(95\%\) interval, with \(t_c = 2.024\), is \[ 287.61 \pm 2.024(14.18) = [\,258.91,\ 316.31\,]. \] With \(95\%\) confidence, the average such household spends between $258.91 and $316.31 on food.

We can reproduce every one of these numbers directly. The variances and covariance come from the estimated coefficient covariance matrix vcov(fit), and the whole calculation is a couple of lines.

b  <- coef(fit)                 # b1, b2
V  <- vcov(fit)                 # estimated variance-covariance matrix
cc <- c(1, 20)                  # c1 = 1, c2 = x0 = 20

lambda_hat <- sum(cc * b)                       # point estimate
var_hat    <- as.numeric(t(cc) %*% V %*% cc)    # c' V c, includes covariance
se_hat     <- sqrt(var_hat)
tc         <- qt(0.975, df = nrow(food) - 2)    # t_(0.975, 38)

c(estimate = lambda_hat, se = se_hat,
  lower = lambda_hat - tc * se_hat,
  upper = lambda_hat + tc * se_hat)
#>  estimate        se     lower     upper 
#> 287.60886  14.17804 258.90692 316.31081
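
To gauge how much the covariance term matters in this example, the sketch below redoes the variance calculation twice, once with and once without the cross term. It uses the rounded estimates quoted in the text, so the results agree with the full-precision output only to about two decimals; the variable names are illustrative.

```r
# Rounded estimates quoted in the text
var_b1  <- 1884.44
var_b2  <- 4.3818
cov_b12 <- -85.90
x0      <- 20            # c1 = 1, c2 = x0

var_full  <- var_b1 + x0^2 * var_b2 + 2 * x0 * cov_b12   # correct formula
var_nocov <- var_b1 + x0^2 * var_b2                      # cross term wrongly dropped

se_with    <- sqrt(var_full)
se_without <- sqrt(var_nocov)
c(se_with_cov = se_with, se_without_cov = se_without)
```

Dropping the negative covariance inflates the standard error several-fold here, which would make the interval for the mean far too wide.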

Figure 9.3 shows the fitted regression line together with this \(95\%\) confidence band for the mean food expenditure across the range of income. The band is narrowest near the average income and flares out toward the extremes, mirroring how the standard error of \(\hat\lambda\) grows as \(x_0\) moves away from \(\bar x\).

grid <- data.frame(income = seq(min(food$income), max(food$income),
                                length.out = 100))
pred <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
band <- cbind(grid, as.data.frame(pred))

ggplot(band, aes(income)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = ucla$blue, alpha = 0.30) +
  geom_line(aes(y = fit), color = ucla$blue, linewidth = 1) +
  geom_point(data = food, aes(income, food_exp),
             color = ucla$gray, alpha = 0.7, size = 1.4) +
  geom_vline(xintercept = 20, linetype = "dashed", color = ucla$gray) +
  labs(x = "income (\\$100s)", y = "food expenditure (\\$)")
Figure 9.3: Fitted line for food expenditure on income, with a 95% confidence band for the mean. The band is tightest near the average income and widens at the extremes.
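
The flaring of the band can be verified point by point. The sketch below works on simulated data (the data-generating values are assumptions, and `fit_sim` and `se_at()` are hypothetical names), computes \(\mathrm{se}(\hat\lambda)\) from the \(c'Vc\) formula at three incomes, and checks that it matches the standard error reported by `predict()` with `se.fit = TRUE`.

```r
set.seed(7)
N   <- 40
dat <- data.frame(income = runif(N, 5, 35))                 # simulated regressor
dat$food_exp <- 80 + 10 * dat$income + rnorm(N, sd = 90)    # hypothetical model
fit_sim <- lm(food_exp ~ income, data = dat)

x0 <- c(10, mean(dat$income), 30)    # below, at, and above the average income

# Standard error of b1 + x0 * b2 from the coefficient covariance matrix
se_at <- function(x) {
  cc <- c(1, x)
  sqrt(as.numeric(t(cc) %*% vcov(fit_sim) %*% cc))
}
se_manual <- sapply(x0, se_at)

# The same quantity as computed by predict()
se_pred <- predict(fit_sim, newdata = data.frame(income = x0),
                   se.fit = TRUE)$se.fit

rbind(x0 = x0, se_manual = se_manual, se_pred = unname(se_pred))
```

The two rows of standard errors should agree, and the middle entry, evaluated at the sample mean of income, should be the smallest of the three.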

Mean versus outcome: two different bands

Notice how tight the interval for the mean is: it extends only \(\$28.70\) on either side of the estimate. That tightness is a clue that the confidence interval for a mean is a fundamentally different object from a forecast of a single household’s spending.

CI for the mean vs. prediction interval for an outcome
  • CI for the mean \(\E(y\given x_0)\): \([258.91,\ 316.31]\). The only sources of error are estimating \(b_1\) and \(b_2\). This is the linear-combination interval we just built.
  • Prediction interval for \(y_0\): much wider, because it must also absorb the new household’s own random shock \(e_0\). It is built from the variance of the forecast error introduced in the variance chapter, and the full mechanics come in the chapter on prediction and goodness of fit.

Both bands share the same center, \(287.61\), but they have very different widths. Whenever you report an interval, ask which one you need: am I estimating an average, or am I forecasting an individual outcome?
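
The contrast between the two intervals is easy to see with `predict()`, which supports both `interval = "confidence"` and `interval = "prediction"`. The sketch below again uses simulated data under assumed parameter values, so the numbers are illustrative and are not the food-data results.

```r
set.seed(7)
N   <- 40
dat <- data.frame(income = runif(N, 5, 35))                 # simulated regressor
dat$food_exp <- 80 + 10 * dat$income + rnorm(N, sd = 90)    # hypothetical model
fit_sim <- lm(food_exp ~ income, data = dat)

new <- data.frame(income = 20)

# Interval for the mean E(y | x0) versus interval for a single new outcome y0
ci_mean <- predict(fit_sim, newdata = new, interval = "confidence", level = 0.95)
pi_new  <- predict(fit_sim, newdata = new, interval = "prediction", level = 0.95)

width_ci <- ci_mean[, "upr"] - ci_mean[, "lwr"]
width_pi <- pi_new[, "upr"]  - pi_new[, "lwr"]
c(center = ci_mean[, "fit"], width_ci = width_ci, width_pi = width_pi)
```

Both intervals are centered on the same fitted value, but the prediction interval is much wider because it also accounts for the new observation's own error term.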

9.4 Recap

We turned a point estimate into a range of plausible values by way of the \(t\)-statistic.

  • The \(t\)-statistic. Standardize \(b_k\) and swap the unknown \(\sigma^2\) for \(\hat\sigma^2\): \[ \frac{b_k - \beta_k}{\mathrm{se}(b_k)} \sim t_{(N-2)} . \] The \(t\) is bell-shaped with fatter tails than the normal, governed by \(\mathrm{df} = N - 2\), and approaches the normal as \(N\) grows.
  • Confidence interval. \(b_k \pm t_c\,\mathrm{se}(b_k)\) with \(t_c = t_{(1-\alpha/2,\,N-2)}\). For the food data, the \(95\%\) interval for \(\beta_2\) is \([5.97,\ 14.45]\). The confidence is in the procedure: \(95\%\) of such intervals cover the true parameter, but we never know whether ours is one of them.
  • Linear combinations. \(\hat\lambda = c_1 b_1 + c_2 b_2\) is the BLUE of \(\lambda = c_1\beta_1 + c_2\beta_2\), with variance \[ \Var(\hat\lambda) = c_1^2\Var(b_1) + c_2^2\Var(b_2) + 2c_1 c_2\Cov(b_1, b_2), \] covariance term included. For mean food spending at $2,000 income, \(\hat\lambda = 287.61\) with \(95\%\) CI \([258.91,\ 316.31]\), a band for the mean that is much tighter than a band for an individual outcome.

Next time: the same \(t = (b - c)/\mathrm{se}(b)\) engine, now aimed at a conjecture about a parameter. This is hypothesis testing, where we ask whether \(\beta_2 = 0\) (or \(> 5.5\)), reject or fail to reject using \(p\)-values, and distinguish statistical from economic significance.