---
title: "Confidence Intervals"
---
{{< include _setup.qmd >}}
> **Reading.** SW §5.2, HGL §3.1, 3.6
By now we have everything we need to start doing **statistical inference**. From
the food-expenditure regression we have a slope estimate $b_2 = 10.21$ with
standard error $\mathrm{se}(b_2) = 2.09$, and from [the properties of
OLS](07-ols-properties.qmd) we know that, conditional on the regressor,
$$
b_2 \given x \sim N\!\left(\beta_2,\ \frac{\sigma^2}{\sum(x_i-\bar x)^2}\right).
$$
A point estimate by itself says nothing about its **reliability**. We can report
"$b_2 = 10.21$", but how sure are we? Could the true $\beta_2$ plausibly be $6$? Could
it be $14$? This chapter answers that question by reporting a *range* of plausible
values, an **interval estimate**, better known as a **confidence interval**. We
do this in three steps: turn the normal $b_2$ into a usable **$t$-statistic**,
build the interval $b_2 \pm t_c\,\mathrm{se}(b_2)$ and interpret it carefully, and
finally extend it to **linear combinations** of the parameters such as the
conditional mean $\E(y\given x_0)=\beta_1 + x_0\beta_2$.
## From the normal to the $t$-distribution {#sec-t-dist}
Under the simple-regression assumptions SR1–SR6, $b_2$ is conditionally normal. The
natural first move is to **standardize** it — subtract its mean and divide by its
standard deviation:
$$
Z = \frac{b_2 - \beta_2}{\sqrt{\sigma^2/\sum(x_i-\bar x)^2}} \sim N(0,1).
$$
The quantity $Z$ is **pivotal**: its $N(0,1)$ distribution involves *no unknown
parameters*, so we can read probabilities straight off the normal table. For
instance,
$$
\Prob(-1.96 \le Z \le 1.96) = 0.95 .
$$
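As a quick sanity check, the $0.95$ can be recovered from the standard normal CDF in base R. The name `p_inside` is introduced here purely for illustration.

```{r}
#| code-fold: false
# P(-1.96 <= Z <= 1.96) for Z ~ N(0, 1), from the normal CDF
p_inside <- pnorm(1.96) - pnorm(-1.96)
round(p_inside, 4)

# The exact two-sided 95% cutoff, which tables round to 1.96
qnorm(0.975)
```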
::: {.keyidea title="One snag"}
$Z$ still contains the *unknown* $\sigma^2$ in the denominator, so we cannot
actually compute it. We must replace $\sigma^2$ with its estimate $\hat\sigma^2$
— and that swap changes the distribution.
:::
### Swapping $\sigma^2$ for $\hat\sigma^2$
Replace $\sigma^2$ by $\hat\sigma^2 = \mathrm{SSE}/(N-2)$. The denominator then
becomes exactly the *standard error* of $b_2$, and the standardized statistic is
no longer normal — it follows **Student's $t$-distribution**:
$$
t = \frac{b_2 - \beta_2}{\sqrt{\hat\sigma^2/\sum(x_i-\bar x)^2}}
= \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \sim t_{(N-2)} .
$$
The same construction works for the intercept $b_1$. In general, for $k = 1, 2$,
$$
t = \frac{b_k - \beta_k}{\mathrm{se}(b_k)} \sim t_{(N-2)} .
$$
::: {.keyidea title="The engine of inference"}
This single equation is the **engine** of *both* confidence intervals (this
chapter) and [hypothesis tests](10-hypothesis-testing.qmd) (the next). Like $Z$,
it is pivotal — no unknown parameters and no dependence on $x$ — which is exactly
what lets us turn it into statements about $\beta_k$.
:::
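One way to see that the swap matters is a small Monte Carlo sketch. Everything in it is an assumption made for illustration (the sample size, the regressor values, the true parameters, the seed, and all variable names); it is not part of the food example. With a deliberately small $N$, it counts how often $|t|$ exceeds the normal cutoff $1.96$ and how often it exceeds the $t_{(N-2)}$ cutoff.

```{r}
#| code-fold: false
set.seed(1)                                   # hypothetical seed
n_small <- 10                                 # deliberately small sample
x_fix   <- seq_len(n_small)                   # fixed regressor values (illustrative)
b1_sim  <- 1; b2_sim <- 0.5; sd_sim <- 2      # hypothetical true parameters
n_draws <- 2000                               # number of simulated samples

t_stats <- replicate(n_draws, {
  y_rep <- b1_sim + b2_sim * x_fix + rnorm(n_small, sd = sd_sim)
  cf    <- summary(lm(y_rep ~ x_fix))$coefficients
  # t = (b2 - beta2) / se(b2), using the estimated standard error
  (cf["x_fix", "Estimate"] - b2_sim) / cf["x_fix", "Std. Error"]
})

# Share of |t| beyond the normal cutoff vs. beyond the t_(N-2) cutoff
c(beyond_normal_cutoff = mean(abs(t_stats) > 1.96),
  beyond_t_cutoff      = mean(abs(t_stats) > qt(0.975, df = n_small - 2)))
```

If the statistic were standard normal, the first share would sit near $5\%$; with fatter tails it should come out noticeably larger, while the second share should be close to $5\%$.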
### What the $t$-distribution looks like
The $t$-distribution is bell-shaped, symmetric, and centered at $0$, just like the
standard normal. The difference is in the tails: the $t$ has **fatter tails** and
more spread, because estimating $\sigma^2$ injects extra uncertainty into the
statistic. Its exact shape is governed by a single number, the **degrees of
freedom** $\mathrm{df} = N - 2$. As $\mathrm{df} \to \infty$ the $t$ **converges
to the normal**, so for large $N$ the $95\%$ critical value $t_c \approx 1.96$.
This is why Stock & Watson, who lean on large samples, simply use the normal and
$1.96$. With only $N = 40$ observations in the food data we use the exact $t$.
@fig-t-vs-normal contrasts the two: the $t$ (here with just $3$ degrees of
freedom, to exaggerate the effect) sits lower in the middle and is fatter in the
tails than the $N(0,1)$.
```{r}
#| label: fig-t-vs-normal
#| fig-cap: "The $t$-distribution (here $t_{(3)}$) is bell-shaped and symmetric like the standard normal, but sits lower in the middle and has fatter tails."
#| fig-width: 5
#| fig-height: 3.4
# Densities of N(0, 1) and t(3) on a common grid, stacked in long format
xs <- seq(-4, 4, length.out = 400)
dens <- data.frame(
  x = rep(xs, 2),
  y = c(dnorm(xs), dt(xs, df = 3)),
  dist = rep(c("N(0, 1)", "t (df = 3)"), each = length(xs))
)
# Colours come from the `ucla` palette, which is defined outside this chunk
ggplot(dens, aes(x, y, color = dist, linetype = dist)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("N(0, 1)" = ucla$darkblue,
                                "t (df = 3)" = ucla$red)) +
  scale_linetype_manual(values = c("N(0, 1)" = "solid",
                                   "t (df = 3)" = "dashed")) +
  scale_y_continuous(limits = c(0, 0.45)) +
  labs(x = NULL, y = NULL, color = NULL, linetype = NULL)
```
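The convergence to the normal can also be read off the critical values themselves. The sketch below tabulates the $97.5$th percentile of the $t$ for a few degrees of freedom; the particular `df_grid` values are chosen only for illustration, with $38$ included because it matches the food data.

```{r}
#| code-fold: false
# 97.5th percentile of t for several degrees of freedom
df_grid <- c(3, 10, 38, 100, 1000)
crit <- data.frame(df = df_grid, t_c = round(qt(0.975, df = df_grid), 3))
crit

# Limiting value from the standard normal
round(qnorm(0.975), 3)
```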
## Confidence intervals for a coefficient {#sec-ci-coef}
To build an interval, start from Statistical Table 2 and pick the **critical
value** $t_c = t_{(1-\alpha/2,\,N-2)}$ that puts $\alpha/2$ of the probability in
each tail of the $t_{(N-2)}$ distribution, so that
$$
\Prob(-t_c \le t \le t_c) = 1-\alpha .
$$
Now substitute $t = (b_k - \beta_k)/\mathrm{se}(b_k)$ and rearrange the inequality
to isolate the unknown $\beta_k$:
$$
\Prob\!\bigl[\,b_k - t_c\,\mathrm{se}(b_k) \le \beta_k \le b_k + t_c\,\mathrm{se}(b_k)\,\bigr] = 1-\alpha .
$$
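For completeness, the rearrangement takes three steps: multiply through by $\mathrm{se}(b_k) > 0$, subtract $b_k$, and then multiply by $-1$, which reverses both inequalities.

$$
\begin{aligned}
-t_c \le \frac{b_k-\beta_k}{\mathrm{se}(b_k)} \le t_c
&\iff -t_c\,\mathrm{se}(b_k) \le b_k - \beta_k \le t_c\,\mathrm{se}(b_k) \\
&\iff -b_k - t_c\,\mathrm{se}(b_k) \le -\beta_k \le -b_k + t_c\,\mathrm{se}(b_k) \\
&\iff b_k - t_c\,\mathrm{se}(b_k) \le \beta_k \le b_k + t_c\,\mathrm{se}(b_k).
\end{aligned}
$$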
::: {.definition title="The 100(1 − α)% interval estimator"}
$$
b_k \pm t_c\,\mathrm{se}(b_k)
$$
There are three ingredients: the estimate $b_k$, its standard error
$\mathrm{se}(b_k)$, and a critical value $t_c$ that bakes in both the confidence
level (through $\alpha$) and the sample size (through the degrees of freedom).
:::
### Food data: a 95% interval for $\beta_2$
In the food data $N = 40$, so $\mathrm{df} = 38$, and for $\alpha = 0.05$ the
critical value is $t_c = t_{(0.975,\,38)} = 2.024$. With $b_2 = 10.21$ and
$\mathrm{se}(b_2) = 2.09$,
$$
b_2 \pm t_c\,\mathrm{se}(b_2)
= 10.21 \pm 2.024(2.09)
= [\,5.97,\ 14.45\,].
$$
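These endpoints come from the unrounded estimates; plugging in the rounded
$10.21$ and $2.09$ gives $[5.98,\ 14.44]$. In R we rarely compute this by hand: we
fit the model and call `confint()`, which returns the interval above.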
```{r}
#| code-fold: false
data(food)
fit <- lm(food_exp ~ income, data = food)
confint(fit, "income", level = 0.95)
```
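As a check on the arithmetic, the interval can also be rebuilt from the rounded summary numbers quoted in the text. This is a sketch with illustrative variable names; because `confint()` works from the unrounded estimates, the two can differ slightly in the second decimal.

```{r}
#| code-fold: false
# Rounded summary numbers quoted in the text
b2_hat <- 10.21
se_b2  <- 2.09
df_res <- 40 - 2                       # N - 2 = 38

t_crit  <- qt(0.975, df = df_res)      # critical value t_(0.975, 38)
ci_hand <- b2_hat + c(-1, 1) * t_crit * se_b2
round(c(t_crit = t_crit, lower = ci_hand[1], upper = ci_hand[2]), 2)
```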
::: {.example title="Reading it in context"}
We estimate, *with 95% confidence*, that from an extra \$100 of weekly income
households spend between **\$5.97 and \$14.45** more on food. The range is wide:
a single regressor leaves a lot of uncertainty about $\beta_2$. A supermarket CEO
planning store capacity would stress-test decisions across this *whole* range, not
just the point estimate $10.21$.
:::
### What "95% confidence" really means
The confidence is in the **procedure**, not in any one interval. Across *all
possible samples*, $95\%$ of the intervals built this way will contain the true
$\beta_2$. *Our* particular interval $[5.97, 14.45]$ either contains $\beta_2$ or
it does not — and we will **never know which**.
::: {.warningbox title="A 95% interval is not a 95% probability statement"}
It is *wrong* to say "$\beta_2$ has a 95% probability of being in
$[5.97, 14.45]$." The parameter $\beta_2$ is a fixed (if unknown) number; it is
the *interval* that is random, because it is built from the random sample. Once
the sample is drawn, the interval is fixed too, and the only honest statement is
that the procedure that produced it works $95\%$ of the time.
:::
@fig-coverage makes this concrete. Imagine drawing many samples and building a
$95\%$ interval from each. The vertical line is the true $\beta_2$; most intervals
straddle it, but a handful (shown in red) miss entirely. Over the long run, about
$1$ in $20$ misses.
```{r}
#| label: fig-coverage
#| fig-cap: "Many 95% intervals from repeated samples. Most cover the true $\\beta_2$ (vertical line); the red ones miss. In the long run about 5% miss."
#| fig-width: 5
#| fig-height: 3.4
set.seed(103)
beta2 <- 10.21   # treat the food-data estimate as the "true" slope for this picture
n_int <- 20      # number of hypothetical repeated samples
# Stylized shortcut: draw each sample's estimate around beta2 and hold the
# half-width fixed at t_c * se(b2), rather than re-estimating se per sample.
centers <- rnorm(n_int, mean = beta2, sd = 2.09)
half <- 2.024 * 2.09
ints <- data.frame(
  id = seq_len(n_int),
  lo = centers - half,
  hi = centers + half,
  mid = centers
)
ints$miss <- ints$lo > beta2 | ints$hi < beta2   # TRUE when the interval misses beta2
ggplot(ints, aes(y = id)) +
  geom_vline(xintercept = beta2, linetype = "dashed", color = ucla$gray) +
  geom_segment(aes(x = lo, xend = hi, yend = id, color = miss),
               linewidth = 1) +
  geom_point(aes(x = mid, color = miss), size = 1.6) +
  scale_color_manual(values = c("FALSE" = ucla$blue, "TRUE" = ucla$red),
                     guide = "none") +
  labs(x = expression(beta[2]), y = "sample")
```
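The long-run claim can be checked more literally by simulating whole samples and re-estimating the interval each time. The sketch below is a hypothetical design, not the food data: the regressor values, the true parameters, the error standard deviation, the seed, and all names are assumptions chosen for illustration.

```{r}
#| code-fold: false
set.seed(2)                                   # hypothetical seed
n_obs   <- 40                                 # same sample size as the food data
x_sim   <- runif(n_obs, min = 5, max = 35)    # illustrative regressor, held fixed
b1_true <- 80; b2_true <- 10; sd_err <- 90    # hypothetical true values
n_rep   <- 2000                               # number of repeated samples

covered <- replicate(n_rep, {
  y_sim <- b1_true + b2_true * x_sim + rnorm(n_obs, sd = sd_err)
  ci    <- confint(lm(y_sim ~ x_sim), "x_sim", level = 0.95)
  ci[1] <= b2_true && b2_true <= ci[2]        # does this interval cover the truth?
})

mean(covered)   # share of intervals that contain b2_true
```

Each individual entry of `covered` is simply `TRUE` or `FALSE`; only the average over many samples should land near $0.95$.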
### The width of the interval is information
The half-width of the interval is
$$
\text{half-width} = t_c\,\mathrm{se}(b_k)
= t_c\sqrt{\frac{\hat\sigma^2}{\sum(x_i-\bar x)^2}} ,
$$
and its size tells us how much the data have taught us. A **narrow** interval
corresponds to a small standard error: the data pin down $\beta_k$ sharply, so we
have learned a lot. A **wide** interval corresponds to a large standard error and
little information about $\beta_k$. Everything that shrank $\mathrm{se}(b_2)$ in
[the chapter on the variance of OLS](08-variance-prediction.qmd) — a smaller error
variance $\sigma^2$, more spread-out $x$ values, a larger sample $N$ — also
*narrows* the interval. Demanding higher confidence, on the other hand, raises
$t_c$ (a $99\%$ interval uses a bigger critical value than a $95\%$ one) and so
*widens* the interval: more coverage costs precision.
::: {.keyidea title="The handy large-sample shortcut"}
When $\mathrm{df} = N - 2 > 30$, the $95\%$ critical value is $t_c \approx 2$, so a quick
$95\%$ interval is
$$
b_k \pm 2\,\mathrm{se}(b_k).
$$
This is the rule of thumb behind the phrase "two standard errors."
:::
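To put numbers on the trade-off between coverage and precision, here is a small sketch using the rounded $\mathrm{se}(b_2) = 2.09$ and $\mathrm{df} = 38$ from the food regression. The variable names and the choice of confidence levels are for illustration only.

```{r}
#| code-fold: false
se_b2_rounded <- 2.09                    # se(b2) as quoted in the text
conf_levels   <- c(0.90, 0.95, 0.99)

# Critical value and half-width at each confidence level, df = 38
t_by_level <- qt(1 - (1 - conf_levels) / 2, df = 38)
data.frame(level      = conf_levels,
           t_c        = round(t_by_level, 3),
           half_width = round(t_by_level * se_b2_rounded, 2))
```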
## Linear combinations of parameters {#sec-lin-comb}
Often the quantity we actually care about mixes *both* parameters — a **linear
combination**
$$
\lambda = c_1\beta_1 + c_2\beta_2,
$$
where $c_1$ and $c_2$ are constants we choose. The headline case is the
conditional mean of $y$ at a specific value $x_0$ of the regressor,
$$
\E(y\given x = x_0) = \beta_1 + x_0\,\beta_2
\qquad (c_1 = 1,\ c_2 = x_0).
$$
We estimate $\lambda$ in the natural way, by plugging in the OLS estimates, and
— because $b_1$ and $b_2$ are BLUE — the estimator
$\hat\lambda = c_1 b_1 + c_2 b_2$ is the **best linear unbiased estimator** of
$\lambda$. Unbiasedness follows directly from the linearity of expectation:
$$
\E(\hat\lambda\given x) = c_1\E(b_1\given x) + c_2\E(b_2\given x)
= c_1\beta_1 + c_2\beta_2 = \lambda .
$$
### The standard error of a linear combination
The point estimate is easy; the standard error needs the **variance-of-a-sum**
rule from [the chapter on expectation, variance and covariance](03-expectation.qmd),
and the covariance term is essential:
$$
\Var(\hat\lambda\given x)
= c_1^2\,\Var(b_1\given x) + c_2^2\,\Var(b_2\given x)
+ 2c_1 c_2\,\Cov(b_1,b_2\given x).
$$
Plugging in the *estimated* variances and covariance (from [the variance
chapter](08-variance-prediction.qmd)) and taking the square root gives the
standard error,
$$
\mathrm{se}(\hat\lambda) = \sqrt{\widehat{\Var}(\hat\lambda\given x)} .
$$
::: {.warningbox title="Don't forget the covariance"}
A common mistake is to add only
$c_1^2\widehat{\Var}(b_1) + c_2^2\widehat{\Var}(b_2)$ and stop. Because $b_1$ and
$b_2$ are *correlated* — recall that $\Cov(b_1, b_2) < 0$ whenever $\bar x > 0$ —
the cross term $2 c_1 c_2 \widehat{\Cov}(b_1, b_2)$ is genuinely part of the
variance and cannot be dropped.
:::
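The same variance can be written compactly in matrix form. This is only a restatement of the rule above, with the constants stacked into a vector and the variances and covariance collected into a $2\times 2$ matrix:

$$
\Var(\hat\lambda\given x)
= \begin{pmatrix} c_1 & c_2 \end{pmatrix}
\begin{pmatrix}
\Var(b_1\given x) & \Cov(b_1,b_2\given x) \\
\Cov(b_1,b_2\given x) & \Var(b_2\given x)
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \end{pmatrix}
= c_1^2\,\Var(b_1\given x) + c_2^2\,\Var(b_2\given x) + 2c_1c_2\,\Cov(b_1,b_2\given x).
$$

The off-diagonal entry appears twice in the product, which is where the factor $2$ on the covariance term comes from.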
### Food data: a CI for expected food expenditure
Suppose we want to "estimate average weekly food spending for households with
\$2,000 of income," i.e. $x_0 = 20$ (income is measured in \$100 units). This is
the conditional mean $\E(y\given x_0 = 20) = \beta_1 + 20\beta_2$. The point
estimate is
$$
\hat\lambda = b_1 + 20\,b_2 = 83.42 + 20(10.21) = 287.61 .
$$
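Inputs are shown rounded while the results carry the unrounded estimates, so
redoing the arithmetic by hand can differ in the last digits (for example,
$83.42 + 20(10.21) = 287.62$). Using the estimated $\widehat{\Var}(b_1) = 1884.44$,
$\widehat{\Var}(b_2) = 4.3818$, and $\widehat{\Cov}(b_1, b_2) = -85.90$,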
$$
\widehat{\Var}(\hat\lambda) = 1884.44 + 20^2(4.3818) + 2(20)(-85.90) = 201.02,
$$
$$
\mathrm{se}(\hat\lambda) = \sqrt{201.02} = 14.18 .
$$
A $95\%$ interval, with $t_c = 2.024$, is
$$
287.61 \pm 2.024(14.18) = [\,258.91,\ 316.31\,].
$$
With $95\%$ confidence, the *average* such household spends between \$258.91 and
\$316.31 on food.
We can reproduce every one of these numbers directly. The variances and covariance
come from the estimated coefficient covariance matrix `vcov(fit)`, and the whole
calculation is a couple of lines.
```{r}
#| code-fold: false
b <- coef(fit) # b1, b2
V <- vcov(fit) # estimated variance-covariance matrix
cc <- c(1, 20) # c1 = 1, c2 = x0 = 20
lambda_hat <- sum(cc * b) # point estimate
var_hat <- as.numeric(t(cc) %*% V %*% cc) # c' V c, includes covariance
se_hat <- sqrt(var_hat)
tc <- qt(0.975, df = nrow(food) - 2) # t_(0.975, 38)
c(estimate = lambda_hat, se = se_hat,
lower = lambda_hat - tc * se_hat,
upper = lambda_hat + tc * se_hat)
```
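To see how much the covariance term matters here, the sketch below repeats the variance calculation with and without it, using the rounded estimates quoted in the text. The variable names are illustrative, and the rounded inputs mean the correct standard error matches $14.18$ only up to rounding.

```{r}
#| code-fold: false
# Rounded estimates quoted in the text
var_b1 <- 1884.44
var_b2 <- 4.3818
cov_12 <- -85.90
x0     <- 20

var_full   <- var_b1 + x0^2 * var_b2 + 2 * x0 * cov_12   # correct formula
var_no_cov <- var_b1 + x0^2 * var_b2                      # covariance wrongly dropped

round(c(se_correct = sqrt(var_full), se_no_cov = sqrt(var_no_cov)), 2)
```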
@fig-mean-ci shows the fitted regression line together with this $95\%$
confidence band for the *mean* food expenditure across the range of income. The
band is narrowest near the average income and flares out toward the extremes,
mirroring how the standard error of $\hat\lambda$ grows as $x_0$ moves away from
$\bar x$.
```{r}
#| label: fig-mean-ci
#| fig-cap: "Fitted line for food expenditure on income, with a 95% confidence band for the mean. The band is tightest near the average income and widens at the extremes."
#| fig-width: 5
#| fig-height: 3.4
# Evaluate the fitted line and its 95% confidence band on a grid of incomes
grid <- data.frame(income = seq(min(food$income), max(food$income),
                                length.out = 100))
pred <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
band <- cbind(grid, as.data.frame(pred))   # columns: income, fit, lwr, upr
ggplot(band, aes(income)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = ucla$blue, alpha = 0.30) +
  geom_line(aes(y = fit), color = ucla$blue, linewidth = 1) +
  geom_point(data = food, aes(income, food_exp),
             color = ucla$gray, alpha = 0.7, size = 1.4) +
  geom_vline(xintercept = 20, linetype = "dashed", color = ucla$gray) +   # x0 = 20
  labs(x = "income (\\$100s)", y = "food expenditure (\\$)")
```
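The shape of the band follows from the variance formula: as a function of $x_0$, the estimated variance $\widehat{\Var}(b_1) + x_0^2\,\widehat{\Var}(b_2) + 2x_0\,\widehat{\Cov}(b_1,b_2)$ is a quadratic, smallest at $x_0 = -\widehat{\Cov}(b_1,b_2)/\widehat{\Var}(b_2)$. Because $\Cov(b_1,b_2\given x) = -\bar x\,\Var(b_2\given x)$, that minimizer is $\bar x$. The sketch below evaluates this with the rounded estimates from the text; the helper `se_mean()` and the grid of incomes are hypothetical names and values introduced for illustration.

```{r}
#| code-fold: false
# Rounded estimates quoted in the text (small rounding error is expected)
v1  <- 1884.44    # estimated Var(b1)
v2  <- 4.3818     # estimated Var(b2)
c12 <- -85.90     # estimated Cov(b1, b2)

# Standard error of the estimated mean b1 + x0 * b2
se_mean <- function(x0) sqrt(v1 + x0^2 * v2 + 2 * x0 * c12)

# Evaluate at a few income levels (in $100s)
x0_grid <- c(5, 10, 20, 30)
round(setNames(se_mean(x0_grid), paste0("x0=", x0_grid)), 2)

# Income level at which the standard error is smallest
-c12 / v2
```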
### Mean versus outcome: two different bands
Notice how *tight* the interval for the mean is: a half-width of only \$28.70. That
tightness is a clue that the confidence interval for a mean is a fundamentally
different object from a forecast of a single household's spending.
::: {.property title="CI for the mean vs. prediction interval for an outcome"}
- **CI for the mean** $\E(y\given x_0)$: $[258.91,\ 316.31]$. The only sources of
error are estimating $b_1$ and $b_2$. This is the linear-combination interval we
just built.
- **Prediction interval for $y_0$**: much *wider*, because it must also absorb the
new household's own random shock $e_0$. It is built from the forecast error of
[the variance chapter](08-variance-prediction.qmd), and the full mechanics come
in [prediction and goodness of fit](11-prediction-fit.qmd).
:::
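The contrast is easy to see with `predict()`, which offers both kinds of interval through its `interval` argument. The sketch below uses simulated data rather than the food data, so the numbers will not match the text; the data-generating values, the seed, and the names `sim`, `sim_fit`, and `new_x` are all assumptions made for illustration.

```{r}
#| code-fold: false
set.seed(3)                                     # hypothetical seed
sim   <- data.frame(x = runif(40, 5, 35))       # illustrative regressor
sim$y <- 80 + 10 * sim$x + rnorm(40, sd = 90)   # hypothetical true model
sim_fit <- lm(y ~ x, data = sim)

new_x   <- data.frame(x = 20)
ci_mean <- predict(sim_fit, newdata = new_x, interval = "confidence", level = 0.95)
pi_out  <- predict(sim_fit, newdata = new_x, interval = "prediction", level = 0.95)

# Same center (column "fit"), different widths (columns "lwr" and "upr")
rbind(mean_ci = ci_mean[1, ], prediction = pi_out[1, ])
```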
Both bands share the same center, $287.61$, but they have very different widths.
Whenever you report an interval, ask which one you need: am I estimating an
**average**, or am I forecasting an **individual outcome**?
## Recap {#sec-recap}
We turned a point estimate into a range of plausible values by way of the
$t$-statistic.
- **The $t$-statistic.** Standardize $b_k$ and swap the unknown $\sigma^2$ for
$\hat\sigma^2$:
$$
\frac{b_k - \beta_k}{\mathrm{se}(b_k)} \sim t_{(N-2)} .
$$
The $t$ is bell-shaped with fatter tails than the normal, governed by
$\mathrm{df} = N - 2$, and approaches the normal as $N$ grows.
- **Confidence interval.** $b_k \pm t_c\,\mathrm{se}(b_k)$ with
$t_c = t_{(1-\alpha/2,\,N-2)}$. For the food data, the $95\%$ interval for
$\beta_2$ is $[5.97,\ 14.45]$. The confidence is in the *procedure*: $95\%$ of
such intervals cover the true parameter, but we never know whether ours is one
of them.
- **Linear combinations.** $\hat\lambda = c_1 b_1 + c_2 b_2$ is the BLUE of
$\lambda = c_1\beta_1 + c_2\beta_2$, with variance
$$
\Var(\hat\lambda) = c_1^2\Var(b_1) + c_2^2\Var(b_2) + 2c_1 c_2\Cov(b_1, b_2),
$$
*covariance term included*. For mean food spending at \$2,000 income,
$\hat\lambda = 287.61$ with $95\%$ CI $[258.91,\ 316.31]$ — a band for the
*mean* that is much tighter than a band for an individual *outcome*.
**Next time:** the same $t = (b - c)/\mathrm{se}(b)$ engine, now aimed at a
*conjecture* about a parameter — [hypothesis testing](10-hypothesis-testing.qmd),
where we ask whether $\beta_2 = 0$ (or $> 5.5$), reject or fail to reject using
$p$-values, and distinguish *statistical* from *economic* significance.