---
title: "Prediction & Goodness of Fit"
---
{{< include _setup.qmd >}}
> **Reading.** SW §4.3, HGL §4.1–4.2
Before the midterm we built the full simple-regression toolkit: we learned to
*estimate* the slope and intercept ($b_1, b_2$), to *assess* the estimator
(it is BLUE), to *quantify* its precision (the standard error), and to *test*
hypotheses about it (the $t$-statistic). Two loose ends remain. In
[the variance-and-prediction chapter](08-variance-prediction.qmd) we made a
**point** prediction $\hat y_0 = b_1 + b_2 x_0$ but never said *how uncertain*
it is; this chapter builds the **prediction interval** around it. And we never
gave a single number for *how well the line fits*; this chapter builds the
sum-of-squares decomposition and **$R^2$**.
::: {.keyidea title="The link between the two questions"}
Both come from the same split of $y$ into an *explained* part (the fitted line)
and an *unexplained* part (the residual). Prediction asks how big the
unexplained part can be for one new observation; $R^2$ asks how small it is on
average across the sample.
:::
## Prediction intervals {#sec-prediction}
To forecast $y$ at a new value $x_0$, we assume the new observation obeys the
same model as the rest of the population,
$$
y_0 = \beta_1 + \beta_2 x_0 + e_0,
$$
and our point predictor is the fitted value
$$
\hat y_0 = b_1 + b_2 x_0 .
$$
We judge this forecast by its **forecast error**, the gap between what actually
happens and what we predicted:
$$
f = y_0 - \hat y_0
= (\beta_1 + \beta_2 x_0 + e_0) - (b_1 + b_2 x_0),
\qquad \E(f \given x) = 0 .
$$
Because the forecast error has mean zero, $\hat y_0$ is unbiased — it is the
*best linear unbiased predictor* (BLUP). But "unbiased" only tells us the
forecast is correctly **centered**. To put a band around it we need the
**variance** of $f$.
### Two sources of forecast error
A forecast misses for two independent reasons, and the variance simply adds them
up (HGL Appendix 4A):
$$
\Var(f \given x) = \sigma^2 \left[\,
\underbrace{1}_{\text{new shock } e_0}
+ \underbrace{\frac{1}{N} + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}}_{\text{estimating the line}}
\,\right].
$$
The "$1$" is the **irreducible** part of the error: household $0$ has its own
fresh disturbance $e_0$ that we can never anticipate. Even if we knew the true
parameters $\beta_1$ and $\beta_2$ *exactly*, this term would remain. The other
two terms are the **line-estimation** error — they are exactly the variance of
the estimated mean $\hat\lambda = b_1 + b_2 x_0$ that we met when
[estimating a conditional mean](09-confidence-intervals.qmd).
Replacing the unknown $\sigma^2$ by its estimate $\hat\sigma^2$ and taking the
square root gives the **standard error of the forecast**,
$$
\mathrm{se}(f) = \sqrt{\widehat{\Var}(f \given x)} .
$$
### Building the interval
With $\mathrm{se}(f)$ in hand, the $100(1-\alpha)\%$ **prediction interval** for
the outcome $y_0$ has the now-familiar "estimate $\pm$ critical value times
standard error" shape:
$$
\boxed{\; \hat y_0 \pm t_c \, \mathrm{se}(f), \qquad t_c = t_{(1 - \alpha/2,\, N-2)} \;}
$$
::: {.example title="Food expenditure at $x_0 = 20$"}
For the `food` data, predicting at an income of $x_0 = 20$ (\$2{,}000 per week)
gives $\hat y_0 = 287.61$. With $\hat\sigma^2 = 8013.29$, $N = 40$, and
$\bar x = 19.60$, the forecast variance is
$$
\widehat{\Var}(f) = 8013.29\left[1 + \tfrac{1}{40} + \tfrac{(20 - 19.60)^2}{1828.79}\right] = 8214.31,
$$
so $\mathrm{se}(f) = \sqrt{8214.31} = 90.63$. Using $t_c = 2.024$,
$$
287.61 \pm 2.024(90.63) = [\,104.13,\ 471.09\,].
$$
:::
A household with \$2{,}000 of weekly income is predicted to spend somewhere
between \$104 and \$471 on food — an **enormously** wide band. The point forecast
of \$287.61, taken by itself, is not very useful.
### Why is the interval so wide?
The width is driven almost entirely by the irreducible "$1$." Compare the two
intervals at $x_0 = 20$ side by side: a confidence interval for the **mean**
$\E(y \given x_0)$, which uses only the line-estimation terms, against the
prediction interval for an individual **outcome** $y_0$, which adds the new
shock.
```{r}
#| label: tbl-pi-vs-ci
#| tbl-cap: "Mean CI vs. prediction interval at $x_0 = 20$. Adding the irreducible \"$1$\" inflates the variance from 201 to 8214."
pi_tab <- data.frame(
target = c("CI for the mean $\\E(y \\mid x_0)$",
"PI for the outcome $y_0$"),
varterm = c("$\\hat\\sigma^2[\\tfrac1N + \\tfrac{(x_0-\\bar x)^2}{\\sum(\\cdot)^2}] = 201$",
"$\\hat\\sigma^2[1 + \\tfrac1N + \\tfrac{(x_0-\\bar x)^2}{\\sum(\\cdot)^2}] = 8214$"),
interval = c("$[258.91,\\ 316.31]$", "$[104.13,\\ 471.09]$")
)
knitr::kable(pi_tab,
col.names = c("Quantity", "Variance term", "95% interval"),
align = "lll")
```
The prediction variance ($8214$) is almost all the "$1 \cdot \hat\sigma^2 =
8013$" term. The lesson is that **individual household behavior is intrinsically
noisy**: income explains the *average* spending well, but any one family's
spending swings for a hundred reasons the model never sees.
::: {.callout-note appearance="simple"}
Collecting more data shrinks the line-estimation terms ($1/N \to 0$) but does
*not* touch the "$1$." To predict individuals better you need better
*variables*, not just more rows — which is one motivation for
[multiple regression](13-multiple-regression.qmd).
:::
### The prediction band fans out
Because the variance grows with $(x_0 - \bar x)^2$, the bands are **narrowest at
$\bar x$** and widen as $x_0$ moves away from the center of the data: we predict
best where we have seen the most observations. @fig-prediction-band plots both
bands around the fitted line. The inner (dashed) band is the confidence band for
the mean; the outer (solid red) band is the prediction band for an individual
outcome.
Three features stand out. Both bands pinch in at the point of the means
$(\bar x, \bar y)$. The mean band always sits *inside* the prediction band — the
gap between them is exactly the irreducible $\sigma^2$. And extrapolating far
beyond the data is **doubly** risky: the bands are widest there, *and* the
linear model may not even hold out in territory we have never observed.
```{r}
#| label: fig-prediction-band
#| fig-cap: "Prediction band (solid red) and mean confidence band (dashed) around the fitted food-expenditure line. Both pinch at $\\bar x$ and fan out."
#| fig-width: 5
#| fig-height: 3.4
b1 <- 83.42; b2 <- 10.21
s2 <- 8013.29; N <- 40; xbar <- 19.6; Sxx <- 1828.79; tc <- 2.024
xs <- seq(2, 38, length.out = 200)
fit <- b1 + b2 * xs
sef <- sqrt(s2 * (1 + 1/N + (xs - xbar)^2 / Sxx))
sem <- sqrt(s2 * ( 1/N + (xs - xbar)^2 / Sxx))
band <- data.frame(x = xs, fit = fit,
pi_lo = fit - tc * sef, pi_hi = fit + tc * sef,
ci_lo = fit - tc * sem, ci_hi = fit + tc * sem)
ggplot(band, aes(x)) +
geom_line(aes(y = pi_lo), color = ucla$red, linewidth = 1) +
geom_line(aes(y = pi_hi), color = ucla$red, linewidth = 1) +
geom_line(aes(y = ci_lo), color = ucla$darkblue, linetype = "dashed") +
geom_line(aes(y = ci_hi), color = ucla$darkblue, linetype = "dashed") +
geom_line(aes(y = fit), color = ucla$blue, linewidth = 1) +
geom_vline(xintercept = xbar, linetype = "dashed", color = ucla$gray) +
scale_x_continuous(breaks = xbar, labels = expression(bar(x))) +
labs(x = expression(x[0]), y = "y")
```
## Decomposing the variation {#sec-decomposition}
Now we switch from *prediction* to *fit*: how much of the variation in $y$ does
the line account for? Start from the estimated decomposition of each observation
into its fitted value and residual, $y_i = \hat y_i + \hat e_i$, and subtract the
sample mean $\bar y$ from both sides:
$$
\underbrace{(y_i - \bar y)}_{\text{total deviation}}
= \underbrace{(\hat y_i - \bar y)}_{\text{explained by the line}}
+ \underbrace{\hat e_i}_{\text{unexplained}} .
$$
The term $\hat y_i - \bar y$ measures how far the *fitted* value has moved away
from the mean — the part of the deviation the regression explains. The residual
$\hat e_i = y_i - \hat y_i$ is the leftover the line could not capture. This
clean split holds because the OLS line passes through the point of the means
$(\bar x, \bar y)$. @fig-deviation-split shows the decomposition for a single
observation.
```{r}
#| label: fig-deviation-split
#| fig-cap: "Splitting one observation's deviation $y_i - \\bar y$ into an explained part $\\hat y_i - \\bar y$ and an unexplained residual $\\hat e_i$."
#| fig-width: 5
#| fig-height: 3.4
b1 <- 1.5; b2 <- 0.7; ybar <- 4.8
xi <- 8; yi <- 8.4; yhat <- b1 + b2 * xi
line_df <- data.frame(x = c(0.5, 9.5), y = b1 + b2 * c(0.5, 9.5))
ggplot() +
geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
geom_hline(yintercept = ybar, linetype = "dotted", color = ucla$gray) +
geom_segment(aes(x = xi + 0.6, xend = xi + 0.6, y = ybar, yend = yhat),
color = ucla$blue, linewidth = 0.8,
arrow = arrow(ends = "both", length = unit(0.05, "in"))) +
geom_segment(aes(x = xi, xend = xi, y = yhat, yend = yi),
color = ucla$red, linewidth = 0.8,
arrow = arrow(ends = "both", length = unit(0.05, "in"))) +
geom_point(aes(x = xi, y = yi), color = ucla$darkblue, size = 2.4) +
annotate("text", x = xi + 0.1, y = yi + 0.45, label = "(list(x[i], y[i]))",
parse = TRUE, color = ucla$darkblue, size = 3.2) +
annotate("text", x = xi + 1.4, y = (ybar + yhat) / 2,
label = "hat(y)[i] - bar(y)", parse = TRUE,
color = ucla$blue, size = 3.2) +
annotate("text", x = xi - 0.9, y = (yhat + yi) / 2, label = "hat(e)[i]",
parse = TRUE, color = ucla$red, size = 3.2) +
scale_x_continuous(limits = c(0, 10)) +
scale_y_continuous(limits = c(0, 10), breaks = ybar,
labels = expression(bar(y))) +
labs(x = "x", y = "y")
```
### From deviations to sums of squares
Squaring both sides of the deviation identity and summing over all observations
produces a remarkably clean result. The cross-product term vanishes — it is a
standard algebraic fact for OLS that $\sum (\hat y_i - \bar y)\hat e_i = 0$ — so
we are left with
$$
\sum (y_i - \bar y)^2 = \sum (\hat y_i - \bar y)^2 + \sum \hat e_i^2 ,
$$
which we abbreviate as
$$
\boxed{\; \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE} \;}
$$
::: {.definition title="The three sums of squares"}
- **SST** $= \sum (y_i - \bar y)^2$ — the **total** sample variation in $y$.
- **SSR** $= \sum (\hat y_i - \bar y)^2$ — the variation **explained by the
regression** (also called the explained sum of squares, ESS).
- **SSE** $= \sum \hat e_i^2$ — the **unexplained** (residual) variation; the
very quantity that OLS minimized.
:::
::: {.warningbox title="Notation collides across textbooks"}
Be careful: the labels are not standardized. Stock & Watson call the explained
part ESS and the residual part SSR. We follow Hill, Griffiths & Lim throughout:
**SSR** = sum of squares due to the *regression* (explained), and **SSE** = sum
of squared *errors* (residual).
:::
## The coefficient of determination {#sec-r2}
Dividing the sum-of-squares decomposition through by SST turns it into a
unit-free measure of fit, the **coefficient of determination**:
$$
\boxed{\; R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \;}
\qquad 0 \le R^2 \le 1 .
$$
At the extremes, $R^2 = 1$ means every point lies on the line, so $\mathrm{SSE} =
0$ — a perfect fit. $R^2 = 0$ means the fitted line is flat at $\bar y$, so
$\mathrm{SSR} = 0$ and $x$ explains nothing. In between, $R^2$ is the
**proportion of the variation in $y$ about its mean that the model explains**.
::: {.example title="Food expenditure"}
For the `food` regression,
$$
\mathrm{SST} = 495132, \qquad \mathrm{SSE} = 304505,
$$
so
$$
R^2 = 1 - \frac{304505}{495132} = 0.385 .
$$
Income explains **38.5%** of the variation in food expenditure across
households; the other 61.5% is everything else.
:::
We can confirm all of these quantities directly from the data. The
`broom::glance()` helper reads $R^2$ off the fitted model, and the sums of
squares fall straight out of the residuals and fitted values.
```{r}
#| code-fold: false
data(food)
fit <- lm(food_exp ~ income, data = food)
SST <- sum((food$food_exp - mean(food$food_exp))^2)
SSE <- sum(residuals(fit)^2)
SSR <- sum((fitted(fit) - mean(food$food_exp))^2)
c(SST = SST, SSR = SSR, SSE = SSE, R2 = 1 - SSE / SST)
glance(fit)$r.squared
```
### $R^2$ is a squared correlation
In *simple* regression there are two neat identities:
$$
R^2 = r_{xy}^2 = r_{y\hat y}^2 .
$$
Here $r_{xy}$ is the ordinary sample correlation between $x$ and $y$. For the
`food` data $r_{xy} = 0.62$, and indeed $0.62^2 = 0.385 = R^2$. The second
quantity, $r_{y\hat y}$, is the correlation between the actual outcomes $y$ and
the fitted values $\hat y$ — a measure of "how well do the predictions track the
outcomes?" This second version is the one that **generalizes** to
[multiple regression](13-multiple-regression.qmd), where there is no single $x$
to correlate with.
```{r}
#| code-fold: false
c(r_xy = cor(food$income, food$food_exp),
r_xy_sq = cor(food$income, food$food_exp)^2,
r_yyhat_sq = cor(food$food_exp, fitted(fit))^2)
```
So $R^2$ and the **standard error of the regression** $\hat\sigma$ are two
complementary summaries of fit. $R^2$ is a *relative* share — unit-free, easy to
compare across models. The standard error $\hat\sigma$ is an *absolute* typical
miss, measured in the units of $y$.
### Using $R^2$ wisely
A natural question is "Is $R^2 = 0.385$ good?" — but that is usually not a useful
thing to ask. What counts as a high or low $R^2$ depends entirely on the kind of
data:
- With microeconomic **cross-sectional** data, an $R^2$ of $0.10$–$0.40$ is
completely normal. Human behavior is genuinely hard to explain.
- With macroeconomic **time-series** data that trend together over time, an $R^2$
of $0.90$ or higher is routine. A high $R^2$ is not, by itself, a badge of a
better model.
- $R^2$ **never falls** when you add another regressor, so it cannot, on its own,
tell you whether a new variable belongs in the model. We fix this defect with
the *adjusted* $R^2$ when we get to
[model specification](18-model-specification.qmd).
::: {.keyidea title="Judge a model by more than fit"}
The signs and magnitudes of the coefficients, statistical *and* economic
significance, whether the regression assumptions actually hold, and how the model
predicts on **out-of-sample** data — all of these matter more than a high
in-sample $R^2$.
:::
## Recap {#sec-recap}
On **prediction intervals**: the forecast error has variance
$$
\Var(f) = \sigma^2 \left[1 + \tfrac{1}{N} + \tfrac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}\right],
$$
and the prediction interval is $\hat y_0 \pm t_c\,\mathrm{se}(f)$. For the `food`
data at $x_0 = 20$ it runs from \$104 to \$471, because the irreducible "$1$"
(the new shock) dominates the variance. The bands pinch at $\bar x$ and fan out,
and the mean confidence band always sits inside the prediction band.
On **goodness of fit**: the total variation splits as $\mathrm{SST} =
\mathrm{SSR} + \mathrm{SSE}$, and the coefficient of determination
$$
R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
$$
measures the share of variation in $y$ that the model explains — $0.385$ for the
`food` data. In simple regression $R^2 = r_{xy}^2 = r_{y\hat y}^2$, but a high or
low value should never be over-read.
**Next time:** the regression line need not be straight in the *variables*. By
logging or squaring $x$ or $y$ we can model curves, elasticities, and growth
rates — all still "linear regression," because the model stays linear in the
*parameters*. See [functional forms](12-functional-forms.qmd).