\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

11  Prediction & Goodness of Fit

Reading. SW 4.3, HGL 4.1<80><93>4.2

Before the midterm we built the full simple-regression toolkit: we learned to estimate the slope and intercept (\(b_1, b_2\)), to assess the estimator (it is BLUE), to quantify its precision (the standard error), and to test hypotheses about it (the \(t\)-statistic). Two loose ends remain. In the variance-and-prediction chapter we made a point prediction \(\hat y_0 = b_1 + b_2 x_0\) but never said how uncertain it is; this chapter builds the prediction interval around it. And we never gave a single number for how well the line fits; this chapter builds the sum-of-squares decomposition and \(R^2\).

The link between the two questions

Both come from the same split of \(y\) into an explained part (the fitted line) and an unexplained part (the residual). Prediction asks how big the unexplained part can be for one new observation; \(R^2\) asks how small it is on average across the sample.

11.1 Prediction intervals

To forecast \(y\) at a new value \(x_0\), we assume the new observation obeys the same model as the rest of the population, \[ y_0 = \beta_1 + \beta_2 x_0 + e_0, \] and our point predictor is the fitted value \[ \hat y_0 = b_1 + b_2 x_0 . \] We judge this forecast by its forecast error, the gap between what actually happens and what we predicted: \[ f = y_0 - \hat y_0 = (\beta_1 + \beta_2 x_0 + e_0) - (b_1 + b_2 x_0), \qquad \E(f \given x) = 0 . \] Because the forecast error has mean zero, \(\hat y_0\) is unbiased <80><94> it is the best linear unbiased predictor (BLUP). But “unbiased” only tells us the forecast is correctly centered. To put a band around it we need the variance of \(f\).

Two sources of forecast error

A forecast misses for two independent reasons, and the variance simply adds them up (HGL Appendix 4A): \[ \Var(f \given x) = \sigma^2 \left[\, \underbrace{1}_{\text{new shock } e_0} + \underbrace{\frac{1}{N} + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}}_{\text{estimating the line}} \,\right]. \] The “\(1\)” is the irreducible part of the error: household \(0\) has its own fresh disturbance \(e_0\) that we can never anticipate. Even if we knew the true parameters \(\beta_1\) and \(\beta_2\) exactly, this term would remain. The other two terms are the line-estimation error <80><94> they are exactly the variance of the estimated mean \(\hat\lambda = b_1 + b_2 x_0\) that we met when estimating a conditional mean.

Replacing the unknown \(\sigma^2\) by its estimate \(\hat\sigma^2\) and taking the square root gives the standard error of the forecast, \[ \mathrm{se}(f) = \sqrt{\widehat{\Var}(f \given x)} . \]

Building the interval

With \(\mathrm{se}(f)\) in hand, the \(100(1-\alpha)\%\) prediction interval for the outcome \(y_0\) has the now-familiar “estimate \(\pm\) critical value times standard error” shape: \[ \boxed{\; \hat y_0 \pm t_c \, \mathrm{se}(f), \qquad t_c = t_{(1 - \alpha/2,\, N-2)} \;} \]

Food expenditure at $x_0 = 20$

For the food data, predicting at an income of \(x_0 = 20\) ($2{,}000 per week) gives \(\hat y_0 = 287.61\). With \(\hat\sigma^2 = 8013.29\), \(N = 40\), and \(\bar x = 19.60\), the forecast variance is \[ \widehat{\Var}(f) = 8013.29\left[1 + \tfrac{1}{40} + \tfrac{(20 - 19.60)^2}{1828.79}\right] = 8214.31, \] so \(\mathrm{se}(f) = \sqrt{8214.31} = 90.63\). Using \(t_c = 2.024\), \[ 287.61 \pm 2.024(90.63) = [\,104.13,\ 471.09\,]. \]

A household with $2{,}000 of weekly income is predicted to spend somewhere between $104 and $471 on food <80><94> an enormously wide band. The point forecast of $287.61, taken by itself, is not very useful.

Why is the interval so wide?

The width is driven almost entirely by the irreducible “\(1\).” Compare the two intervals at \(x_0 = 20\) side by side: a confidence interval for the mean \(\E(y \given x_0)\), which uses only the line-estimation terms, against the prediction interval for an individual outcome \(y_0\), which adds the new shock.

Show the R code
pi_tab <- data.frame(
  target  = c("CI for the mean $\\E(y \\mid x_0)$",
              "PI for the outcome $y_0$"),
  varterm = c("$\\hat\\sigma^2[\\tfrac1N + \\tfrac{(x_0-\\bar x)^2}{\\sum(\\cdot)^2}] = 201$",
              "$\\hat\\sigma^2[1 + \\tfrac1N + \\tfrac{(x_0-\\bar x)^2}{\\sum(\\cdot)^2}] = 8214$"),
  interval = c("$[258.91,\\ 316.31]$", "$[104.13,\\ 471.09]$")
)
knitr::kable(pi_tab,
  col.names = c("Quantity", "Variance term", "95% interval"),
  align = "lll")
Table 11.1: Mean CI vs. prediction interval at \(x_0 = 20\). Adding the irreducible “\(1\)” inflates the variance from 201 to 8214.
Quantity Variance term 95% interval
CI for the mean \(\E(y \mid x_0)\) \(\hat\sigma^2[\tfrac1N + \tfrac{(x_0-\bar x)^2}{\sum(\cdot)^2}] = 201\) \([258.91,\ 316.31]\)
PI for the outcome \(y_0\) \(\hat\sigma^2[1 + \tfrac1N + \tfrac{(x_0-\bar x)^2}{\sum(\cdot)^2}] = 8214\) \([104.13,\ 471.09]\)

The prediction variance (\(8214\)) is almost all the “\(1 \cdot \hat\sigma^2 = 8013\)” term. The lesson is that individual household behavior is intrinsically noisy: income explains the average spending well, but any one family’s spending swings for a hundred reasons the model never sees.

Collecting more data shrinks the line-estimation terms (\(1/N \to 0\)) but does not touch the “\(1\).” To predict individuals better you need better variables, not just more rows <80><94> which is one motivation for multiple regression.

The prediction band fans out

Because the variance grows with \((x_0 - \bar x)^2\), the bands are narrowest at \(\bar x\) and widen as \(x_0\) moves away from the center of the data: we predict best where we have seen the most observations. Figure 11.1 plots both bands around the fitted line. The inner (dashed) band is the confidence band for the mean; the outer (solid red) band is the prediction band for an individual outcome.

Three features stand out. Both bands pinch in at the point of the means \((\bar x, \bar y)\). The mean band always sits inside the prediction band <80><94> the gap between them is exactly the irreducible \(\sigma^2\). And extrapolating far beyond the data is doubly risky: the bands are widest there, and the linear model may not even hold out in territory we have never observed.

Show the R code
b1 <- 83.42; b2 <- 10.21
s2 <- 8013.29; N <- 40; xbar <- 19.6; Sxx <- 1828.79; tc <- 2.024
xs <- seq(2, 38, length.out = 200)
fit  <- b1 + b2 * xs
sef  <- sqrt(s2 * (1 + 1/N + (xs - xbar)^2 / Sxx))
sem  <- sqrt(s2 * (    1/N + (xs - xbar)^2 / Sxx))
band <- data.frame(x = xs, fit = fit,
                   pi_lo = fit - tc * sef, pi_hi = fit + tc * sef,
                   ci_lo = fit - tc * sem, ci_hi = fit + tc * sem)
ggplot(band, aes(x)) +
  geom_line(aes(y = pi_lo), color = ucla$red, linewidth = 1) +
  geom_line(aes(y = pi_hi), color = ucla$red, linewidth = 1) +
  geom_line(aes(y = ci_lo), color = ucla$darkblue, linetype = "dashed") +
  geom_line(aes(y = ci_hi), color = ucla$darkblue, linetype = "dashed") +
  geom_line(aes(y = fit), color = ucla$blue, linewidth = 1) +
  geom_vline(xintercept = xbar, linetype = "dashed", color = ucla$gray) +
  scale_x_continuous(breaks = xbar, labels = expression(bar(x))) +
  labs(x = expression(x[0]), y = "y")
Figure 11.1: Prediction band (solid red) and mean confidence band (dashed) around the fitted food-expenditure line. Both pinch at \(\bar x\) and fan out.

11.2 Decomposing the variation

Now we switch from prediction to fit: how much of the variation in \(y\) does the line account for? Start from the estimated decomposition of each observation into its fitted value and residual, \(y_i = \hat y_i + \hat e_i\), and subtract the sample mean \(\bar y\) from both sides: \[ \underbrace{(y_i - \bar y)}_{\text{total deviation}} = \underbrace{(\hat y_i - \bar y)}_{\text{explained by the line}} + \underbrace{\hat e_i}_{\text{unexplained}} . \] The term \(\hat y_i - \bar y\) measures how far the fitted value has moved away from the mean <80><94> the part of the deviation the regression explains. The residual \(\hat e_i = y_i - \hat y_i\) is the leftover the line could not capture. This clean split holds because the OLS line passes through the point of the means \((\bar x, \bar y)\). Figure 11.2 shows the decomposition for a single observation.

Show the R code
b1 <- 1.5; b2 <- 0.7; ybar <- 4.8
xi <- 8; yi <- 8.4; yhat <- b1 + b2 * xi
line_df <- data.frame(x = c(0.5, 9.5), y = b1 + b2 * c(0.5, 9.5))
ggplot() +
  geom_line(data = line_df, aes(x, y), color = ucla$blue, linewidth = 1) +
  geom_hline(yintercept = ybar, linetype = "dotted", color = ucla$gray) +
  geom_segment(aes(x = xi + 0.6, xend = xi + 0.6, y = ybar, yend = yhat),
               color = ucla$blue, linewidth = 0.8,
               arrow = arrow(ends = "both", length = unit(0.05, "in"))) +
  geom_segment(aes(x = xi, xend = xi, y = yhat, yend = yi),
               color = ucla$red, linewidth = 0.8,
               arrow = arrow(ends = "both", length = unit(0.05, "in"))) +
  geom_point(aes(x = xi, y = yi), color = ucla$darkblue, size = 2.4) +
  annotate("text", x = xi + 0.1, y = yi + 0.45, label = "(list(x[i], y[i]))",
           parse = TRUE, color = ucla$darkblue, size = 3.2) +
  annotate("text", x = xi + 1.4, y = (ybar + yhat) / 2,
           label = "hat(y)[i] - bar(y)", parse = TRUE,
           color = ucla$blue, size = 3.2) +
  annotate("text", x = xi - 0.9, y = (yhat + yi) / 2, label = "hat(e)[i]",
           parse = TRUE, color = ucla$red, size = 3.2) +
  scale_x_continuous(limits = c(0, 10)) +
  scale_y_continuous(limits = c(0, 10), breaks = ybar,
                     labels = expression(bar(y))) +
  labs(x = "x", y = "y")
Figure 11.2: Splitting one observation’s deviation \(y_i - \bar y\) into an explained part \(\hat y_i - \bar y\) and an unexplained residual \(\hat e_i\).

From deviations to sums of squares

Squaring both sides of the deviation identity and summing over all observations produces a remarkably clean result. The cross-product term vanishes <80><94> it is a standard algebraic fact for OLS that \(\sum (\hat y_i - \bar y)\hat e_i = 0\) <80><94> so we are left with \[ \sum (y_i - \bar y)^2 = \sum (\hat y_i - \bar y)^2 + \sum \hat e_i^2 , \] which we abbreviate as \[ \boxed{\; \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE} \;} \]

The three sums of squares
  • SST \(= \sum (y_i - \bar y)^2\) <80><94> the total sample variation in \(y\).
  • SSR \(= \sum (\hat y_i - \bar y)^2\) <80><94> the variation explained by the regression (also called the explained sum of squares, ESS).
  • SSE \(= \sum \hat e_i^2\) <80><94> the unexplained (residual) variation; the very quantity that OLS minimized.
Notation collides across textbooks

Be careful: the labels are not standardized. Stock & Watson call the explained part ESS and the residual part SSR. We follow Hill, Griffiths & Lim throughout: SSR = sum of squares due to the regression (explained), and SSE = sum of squared errors (residual).

11.3 The coefficient of determination

Dividing the sum-of-squares decomposition through by SST turns it into a unit-free measure of fit, the coefficient of determination: \[ \boxed{\; R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \;} \qquad 0 \le R^2 \le 1 . \] At the extremes, \(R^2 = 1\) means every point lies on the line, so \(\mathrm{SSE} = 0\) <80><94> a perfect fit. \(R^2 = 0\) means the fitted line is flat at \(\bar y\), so \(\mathrm{SSR} = 0\) and \(x\) explains nothing. In between, \(R^2\) is the proportion of the variation in \(y\) about its mean that the model explains.

Food expenditure

For the food regression, \[ \mathrm{SST} = 495132, \qquad \mathrm{SSE} = 304505, \] so \[ R^2 = 1 - \frac{304505}{495132} = 0.385 . \] Income explains 38.5% of the variation in food expenditure across households; the other 61.5% is everything else.

We can confirm all of these quantities directly from the data. The broom::glance() helper reads \(R^2\) off the fitted model, and the sums of squares fall straight out of the residuals and fitted values.

data(food)
fit <- lm(food_exp ~ income, data = food)

SST <- sum((food$food_exp - mean(food$food_exp))^2)
SSE <- sum(residuals(fit)^2)
SSR <- sum((fitted(fit) - mean(food$food_exp))^2)
c(SST = SST, SSR = SSR, SSE = SSE, R2 = 1 - SSE / SST)
#>          SST          SSR          SSE           R2 
#> 4.951322e+05 1.906270e+05 3.045052e+05 3.850022e-01

glance(fit)$r.squared
#> [1] 0.3850022

\(R^2\) is a squared correlation

In simple regression there are two neat identities: \[ R^2 = r_{xy}^2 = r_{y\hat y}^2 . \] Here \(r_{xy}\) is the ordinary sample correlation between \(x\) and \(y\). For the food data \(r_{xy} = 0.62\), and indeed \(0.62^2 = 0.385 = R^2\). The second quantity, \(r_{y\hat y}\), is the correlation between the actual outcomes \(y\) and the fitted values \(\hat y\) <80><94> a measure of “how well do the predictions track the outcomes?” This second version is the one that generalizes to multiple regression, where there is no single \(x\) to correlate with.

c(r_xy = cor(food$income, food$food_exp),
  r_xy_sq = cor(food$income, food$food_exp)^2,
  r_yyhat_sq = cor(food$food_exp, fitted(fit))^2)
#>       r_xy    r_xy_sq r_yyhat_sq 
#>  0.6204855  0.3850022  0.3850022

So \(R^2\) and the standard error of the regression \(\hat\sigma\) are two complementary summaries of fit. \(R^2\) is a relative share <80><94> unit-free, easy to compare across models. The standard error \(\hat\sigma\) is an absolute typical miss, measured in the units of \(y\).

Using \(R^2\) wisely

A natural question is “Is \(R^2 = 0.385\) good?” <80><94> but that is usually not a useful thing to ask. What counts as a high or low \(R^2\) depends entirely on the kind of data:

  • With microeconomic cross-sectional data, an \(R^2\) of \(0.10\)<80><93>\(0.40\) is completely normal. Human behavior is genuinely hard to explain.
  • With macroeconomic time-series data that trend together over time, an \(R^2\) of \(0.90\) or higher is routine. A high \(R^2\) is not, by itself, a badge of a better model.
  • \(R^2\) never falls when you add another regressor, so it cannot, on its own, tell you whether a new variable belongs in the model. We fix this defect with the adjusted \(R^2\) when we get to model specification.
Judge a model by more than fit

The signs and magnitudes of the coefficients, statistical and economic significance, whether the regression assumptions actually hold, and how the model predicts on out-of-sample data <80><94> all of these matter more than a high in-sample \(R^2\).

11.4 Recap

On prediction intervals: the forecast error has variance \[ \Var(f) = \sigma^2 \left[1 + \tfrac{1}{N} + \tfrac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}\right], \] and the prediction interval is \(\hat y_0 \pm t_c\,\mathrm{se}(f)\). For the food data at \(x_0 = 20\) it runs from $104 to $471, because the irreducible “\(1\)” (the new shock) dominates the variance. The bands pinch at \(\bar x\) and fan out, and the mean confidence band always sits inside the prediction band.

On goodness of fit: the total variation splits as \(\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}\), and the coefficient of determination \[ R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \] measures the share of variation in \(y\) that the model explains <80><94> \(0.385\) for the food data. In simple regression \(R^2 = r_{xy}^2 = r_{y\hat y}^2\), but a high or low value should never be over-read.

Next time: the regression line need not be straight in the variables. By logging or squaring \(x\) or \(y\) we can model curves, elasticities, and growth rates <80><94> all still “linear regression,” because the model stays linear in the parameters. See functional forms.