\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

10  Hypothesis Testing

Reading. SW 5.1, HGL 3.2–3.5

Last chapter, the \(t\)-statistic gave us a range of plausible values for a slope: \[ \frac{b_k - \beta_k}{\mathrm{se}(b_k)} \sim t_{(N-2)} \;\;\Longrightarrow\;\; \beta_2 \in [5.97,\ 14.45]. \] A confidence interval answers “how big is the effect, give or take?” But decision-makers usually ask sharper, yes/no questions. Is there any relationship between income and food spending (is \(\beta_2 = 0\))? Will households spend more than \(\$5.50\) of each extra \(\$100\) (is \(\beta_2 > 5.5\))? This chapter points the same inferential engine at specific conjectures like these. We set up null and alternative hypotheses, build rejection regions, compute \(p\)-values, and, crucially, learn to separate statistical significance from economic significance.

10.1 The logic of a hypothesis test

Every hypothesis test, no matter how complicated the setting, is built from the same five pieces.

Components of a hypothesis test
  1. a null hypothesis \(H_0\),
  2. an alternative hypothesis \(H_1\),
  3. a test statistic,
  4. a rejection region,
  5. a conclusion, stated in economic context.

The null hypothesis \(H_0: \beta_k = c\) is the maintained belief: the claim we hold to be true until the data convince us otherwise. The value \(c\) is one that matters in context, and it is very often \(0\). The alternative hypothesis \(H_1\) is what we are prepared to accept if we reject \(H_0\). It comes in three flavors: \(\beta_k > c\), \(\beta_k < c\), or \(\beta_k \neq c\).

The test statistic and its logic

Recall that \(t = (b_k - \beta_k)/\mathrm{se}(b_k) \sim t_{(N-2)}\). If the null \(H_0: \beta_k = c\) is true, we can substitute the hypothesized value \(c\) for the unknown \(\beta_k\), and the same quantity becomes computable: \[ t = \frac{b_k - c}{\mathrm{se}(b_k)} \sim t_{(N-2)} \qquad\text{under } H_0 . \] This is what makes it a test statistic: it has a known distribution when \(H_0\) is true, and some other distribution when \(H_0\) is false. That single fact powers the whole procedure.

The chain of reasoning is short. If \(H_0\) holds, the computed \(t\) should land in the middle of the \(t\)-curve, where most of the probability sits. A value of \(t\) way out in a tail is unlikely under \(H_0\). So observing such a value is evidence that \(H_0\) is false, and we reject it.

How unlikely is “unlikely”?

We draw the line between “plausible” and “too extreme” with the level of significance \(\alpha\): the probability of landing in the rejection region when \(H_0\) is true. There are two ways a test can go wrong, and they are not symmetric.

Type I and Type II errors
  • A Type I error is rejecting \(H_0\) when it is actually true. Its probability is exactly the level of significance: \(\Prob(\text{Type I error}) = \alpha\). We choose \(\alpha\), usually \(0.01\), \(0.05\), or \(0.10\).
  • A Type II error is failing to reject a false \(H_0\). Its probability depends on the unknown true \(\beta_k\), so we cannot set it directly.

Choosing \(\alpha\) is choosing how much risk of a false rejection you are willing to bear. A costly false rejection calls for a small \(\alpha\), say \(0.01\).

The ubiquitous “\(\alpha = 0.05\)” is convention, not law: it descends from Fisher’s old rule of thumb that “\(t > 2\) is significant,” and there is nothing sacred about it. Pick \(\alpha\) to fit the decision at hand.
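The claim that the Type I error probability equals \(\alpha\) can be checked by simulation. The sketch below is only an illustration: the sample size, number of replications, and data-generating process are assumptions made up for the example, not part of the food data. It draws many samples in which \(H_0: \beta_2 = 0\) is true and records how often a two-tail test at \(\alpha = 0.05\) rejects.

```r
# Monte Carlo check: when H0 (true slope = 0) holds, a two-tail test at level
# alpha should reject in roughly a fraction alpha of samples.
# N, reps, and the data-generating process are illustrative assumptions.
set.seed(1)
N      <- 40
reps   <- 2000
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = N - 2)   # two-tail critical value

reject <- replicate(reps, {
  x <- rnorm(N)
  y <- 1 + 0 * x + rnorm(N)               # true slope is zero, so H0 is true
  t_stat <- coef(summary(lm(y ~ x)))["x", "t value"]
  abs(t_stat) >= t_crit                   # TRUE means a (false) rejection
})

type1_rate <- mean(reject)
type1_rate                                # should be close to alpha
```

The rejection frequency will not equal \(0.05\) exactly in any one run, because it is itself an estimate; it should settle near \(\alpha\) as the number of replications grows.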

10.2 Rejection regions for the three alternatives

Where the rejection region sits depends entirely on the alternative \(H_1\). There is a handy memory trick: the rejection region is in the direction the arrow points in the alternative.

If \(H_1: \beta_k > c\), the arrow points right, so we reject for large positive \(t\): a right-tail test, rejecting when \(t \ge t_{(1-\alpha,\,N-2)}\). If \(H_1: \beta_k < c\), the arrow points left, so we reject for large negative \(t\): a left-tail test, rejecting when \(t \le t_{(\alpha,\,N-2)}\). If \(H_1: \beta_k \neq c\), deviations in either direction count against \(H_0\), so the rejection region splits across both tails, and we reject when \(|t| \ge t_{(1-\alpha/2,\,N-2)}\). Figure 10.1 shows all three.

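library(ggplot2)  # `ucla` (used below) is assumed to be a colour list defined in the book's setup code

# Illustrative bell curve: a standard normal, with approximate cutoffs 1.7 and 2 below
xs <- seq(-4, 4, length.out = 400)
curve_df <- data.frame(x = xs, y = dnorm(xs))

# Shaded rejection regions, one data frame per tail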
right <- subset(curve_df, x >= 1.7);  right$panel  <- "H1: beta > c  (right tail)"
left  <- subset(curve_df, x <= -1.7); left$panel   <- "H1: beta < c  (left tail)"
twoU  <- subset(curve_df, x >= 2);    twoU$panel   <- "H1: beta != c  (two tails)"
twoL  <- subset(curve_df, x <= -2);   twoL$panel   <- "H1: beta != c  (two tails)"

lvl   <- c("H1: beta > c  (right tail)",
           "H1: beta < c  (left tail)",
           "H1: beta != c  (two tails)")
base  <- do.call(rbind, lapply(lvl, function(l) transform(curve_df, panel = l)))
base$panel  <- factor(base$panel, levels = lvl)
shade <- rbind(right, left, twoU, twoL)
shade$panel <- factor(shade$panel, levels = lvl)

ggplot(base, aes(x, y)) +
  geom_area(data = shade, aes(x, y), fill = ucla$red, alpha = 0.30) +
  geom_line(color = ucla$darkblue, linewidth = 1) +
  facet_wrap(~ panel) +
  scale_y_continuous(breaks = NULL) +
  labs(x = "t", y = NULL)
Figure 10.1: Rejection regions (shaded) for the three alternatives. The region lies in the direction the alternative’s arrow points; the two-tail test splits \(\alpha\) across both tails.

For a two-tail test the area \(\alpha\) is split, \(\alpha/2\) into each tail, so the critical value \(t_c\) has to sit farther out than for a one-tail test at the same \(\alpha\). That is why a two-tail test demands a larger \(|t|\) before rejecting than a one-tail test pointed in the right direction: you pay for hedging your bets about the direction.
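The three critical values can be read off the \(t\)-distribution quantile function `qt()`. The sketch below assumes \(\alpha = 0.05\) and \(38\) degrees of freedom (the food regression's \(N - 2\)); both are illustrative choices, and any other values could be substituted.

```r
alpha <- 0.05
df    <- 38                            # assumed: N - 2 with N = 40

crit_right <- qt(1 - alpha, df)        # right-tail test: reject if t >= this
crit_left  <- qt(alpha, df)            # left-tail test:  reject if t <= this
crit_two   <- qt(1 - alpha / 2, df)    # two-tail test:   reject if |t| >= this

round(c(right = crit_right, left = crit_left, two_tail = crit_two), 3)
```

The two-tail value is the largest in absolute size, and the left-tail value is the mirror image of the right-tail one because the \(t\)-distribution is symmetric about zero.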

The five-step procedure

Every test we run follows the same checklist.

The five-step recipe
  1. Hypotheses. State \(H_0\) and \(H_1\).
  2. Test statistic. \(t = \dfrac{b_k - c}{\mathrm{se}(b_k)}\), which is \(t_{(N-2)}\) if \(H_0\) is true.
  3. Rejection region. Pick \(\alpha\); find the critical value(s) for the relevant tail(s).
  4. Compute. Plug in \(b_k\), \(c\), and \(\mathrm{se}(b_k)\).
  5. Conclude. Reject or do not reject, and say what it means for the economics.

One subtle but important habit of language goes with step 5.

Never "accept" the null

We say “reject \(H_0\)” or “fail to reject \(H_0\),” never “accept \(H_0\).” Failing to reject only means the data are compatible with \(H_0\), not that \(H_0\) is true. Absence of evidence is not evidence of absence.
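Steps 2 through 5 of the recipe can be sketched as a small R function. The helper `coef_t_test()` and its argument names are hypothetical, written here only for illustration; the example numbers are made up. Its verdict uses the “reject” / “fail to reject” wording and never “accept.”

```r
# Hypothetical helper (not from the text): t-test for one coefficient, given
# the estimate b, its standard error se, the hypothesized value c0, and df.
coef_t_test <- function(b, se, c0, df,
                        alternative = c("greater", "less", "two.sided"),
                        alpha = 0.05) {
  alternative <- match.arg(alternative)
  t_stat <- (b - c0) / se                        # steps 2 and 4: the statistic
  reject <- switch(alternative,                  # step 3: rejection region
    greater   = t_stat >= qt(1 - alpha, df),
    less      = t_stat <= qt(alpha, df),
    two.sided = abs(t_stat) >= qt(1 - alpha / 2, df)
  )
  list(t = t_stat,
       reject = reject,
       conclusion = if (reject) "reject H0" else "fail to reject H0")  # step 5
}

# Made-up example: b = 1.2, se = 0.5, H0: beta = 0, 30 degrees of freedom
res <- coef_t_test(b = 1.2, se = 0.5, c0 = 0, df = 30, alternative = "two.sided")
res$t
res$conclusion
```

The function stops at the statistical verdict; translating it into the economic context is the part of step 5 that no code can do for you.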

10.3 Worked tests on the food data

To make the recipe concrete, we run three tests on the food-expenditure regression from earlier chapters. The fitted model on the 40 households is \(\widehat{\text{food\_exp}} = 83.42 + 10.21\,\text{income}\), with \(\mathrm{se}(b_2) = 2.09\) on \(N - 2 = 38\) degrees of freedom.

data(food)
fit <- lm(food_exp ~ income, data = food)
tidy(fit)
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)     83.4     43.4       1.92 0.0622   
#> 2 income          10.2      2.09      4.88 0.0000195

A one-tail test of significance

Economic theory says food is a normal good, so we expect \(\beta_2 > 0\). We put that conjecture in the alternative, so that rejecting \(H_0\) actively establishes it.

  1. Hypotheses. \(H_0: \beta_2 = 0\) versus \(H_1: \beta_2 > 0\).
  2. Test statistic. \(t = b_2/\mathrm{se}(b_2) \sim t_{(38)}\) under \(H_0\) (here \(c = 0\)).
  3. Rejection region. With \(\alpha = 0.05\), the right-tail critical value is \(t_{(0.95,\,38)} = 1.686\). Reject if \(t \ge 1.686\).
  4. Compute. \(t = 10.21/2.09 = 4.88\).
  5. Conclude. Since \(4.88 > 1.686\), we reject \(H_0\). There is a statistically significant positive relationship between income and food expenditure.

Note the asymmetry built into step 5. Had we failed to reject, we would not have concluded “theory is wrong,” only that this particular sample lacks the evidence to confirm it.

The two-tail test of significance (what software reports)

If we have no prior sign in mind, we test against \(\neq\) instead.

  1. Hypotheses. \(H_0: \beta_2 = 0\) versus \(H_1: \beta_2 \neq 0\).
  2. Test statistic. \(t = b_2/\mathrm{se}(b_2) \sim t_{(38)}\) under \(H_0\).
  3. Rejection region. With \(\alpha = 0.05\), the critical values are \(\pm t_{(0.975,\,38)} = \pm 2.024\). Reject if \(|t| \ge 2.024\).
  4. Compute. \(t = 10.21/2.09 = 4.88\).
  5. Conclude. Since \(|4.88| > 2.024\), we reject \(H_0\): \(\beta_2\) is significantly different from zero.

This test is automatic

Every regression printout reports, for each coefficient, the statistic \(t = b_k/\mathrm{se}(b_k)\) for the default null \(H_0: \beta_k = 0\). In the food output, income has \(t = 4.88\), so the slope is significant, while the intercept has \(t = 1.92\), which is not significant at the \(5\%\) level. The two-tail critical value \(2.024\) separates them.

Testing an economic value, and why \(\alpha\) matters

Now a genuinely economic question. A developer will build a new supermarket only if households spend more than \(\$5.50\) of each extra \(\$100\) on food. We put the make-or-break claim in the alternative.

  1. Hypotheses. \(H_0: \beta_2 \le 5.5\) versus \(H_1: \beta_2 > 5.5\).
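  2. Test statistic. \(t = \dfrac{b_2 - 5.5}{\mathrm{se}(b_2)} \sim t_{(38)}\) when \(\beta_2 = 5.5\), the boundary of \(H_0\); for any \(\beta_2 < 5.5\) a false rejection is even less likely, so the boundary case fixes the level.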
  3. Rejection region. A wrong “build” decision is costly, so we choose a conservative \(\alpha = 0.01\); the critical value is \(t_{(0.99,\,38)} = 2.429\). Reject if \(t \ge 2.429\).
  4. Compute. \(t = \dfrac{10.21 - 5.5}{2.09} = 2.25\).
  5. Conclude. Since \(2.25 < 2.429\), we do not reject \(H_0\). There is not enough evidence of profitability, so do not build (yet).

The punchline is the role of \(\alpha\). At \(\alpha = 0.05\) the critical value would be only \(1.686\), and \(2.25 > 1.686\), so we would have rejected. The decision flips with \(\alpha\). This is exactly why you must choose \(\alpha\) before seeing the data, by weighing the cost of a Type I error (here, the cost of building an unprofitable store).
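The flip can be seen by holding the supermarket statistic fixed and varying only \(\alpha\). The sketch below uses the rounded estimates quoted in this section; the third level, \(\alpha = 0.10\), and the data frame name `flip` are added here purely for illustration.

```r
t_stat <- (10.21 - 5.5) / 2.09         # supermarket test statistic, about 2.25
alphas <- c(0.01, 0.05, 0.10)
crit   <- qt(1 - alphas, df = 38)      # right-tail critical value for each alpha

flip <- data.frame(alpha    = alphas,
                   critical = round(crit, 3),
                   reject   = t_stat >= crit)
flip
```

The same data give “do not reject” in the first row and “reject” in the others, which is why the level has to be fixed by the cost of a wrong decision rather than chosen after the statistic is known.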

10.4 The \(p\)-value approach

Looking up critical values for every \(\alpha\) is tedious. The modern alternative is to report the \(p\)-value and let the reader compare it to whatever \(\alpha\) they care about.

\(p\)-value

The \(p\)-value is the probability, if \(H_0\) is true, of getting a test statistic at least as extreme as the one we actually observed.

The decision rule is then a single comparison.

The \(p\)-value rule

\[ p \le \alpha \;\Rightarrow\; \text{reject } H_0, \qquad p > \alpha \;\Rightarrow\; \text{do not reject } H_0 . \]

What counts as “at least as extreme” depends on \(H_1\), and we use the same memory trick, looking where the arrow points:

  • \(H_1: \beta_k > c\): \(\;p = \Prob(t_{(N-2)} \ge t)\) (right tail);
  • \(H_1: \beta_k < c\): \(\;p = \Prob(t_{(N-2)} \le t)\) (left tail);
  • \(H_1: \beta_k \neq c\): \(\;p = 2\,\Prob(t_{(N-2)} \ge |t|)\) (both tails).
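The three cases above map directly onto R's \(t\)-distribution function `pt()`. The wrapper `p_value_t()` below is a hypothetical helper written for illustration, and the statistic and degrees of freedom in the example calls are made up.

```r
# Hypothetical helper: p-value of an observed t statistic for each form of H1.
p_value_t <- function(t_stat, df,
                      alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    greater   = pt(t_stat, df, lower.tail = FALSE),          # right-tail area
    less      = pt(t_stat, df, lower.tail = TRUE),           # left-tail area
    two.sided = 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # both tails
  )
}

# Made-up example: t = 1.5 on 25 degrees of freedom
p_value_t(1.5, df = 25, alternative = "greater")
p_value_t(1.5, df = 25, alternative = "less")
p_value_t(1.5, df = 25, alternative = "two.sided")
```

For a positive statistic the two-tail \(p\)-value is exactly twice the right-tail one, and the left- and right-tail values sum to one, so a statistic in the “wrong” tail for a one-sided alternative yields a large \(p\)-value.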

\(p\)-values on the food data

Return to the supermarket test, where \(H_1: \beta_2 > 5.5\) and \(t = 2.25\). The right-tail \(p\)-value is the area under the \(t_{(38)}\) curve beyond the observed statistic: \[ p = \Prob(t_{(38)} \ge 2.25) = 0.0152 . \] Since \(0.0152 > 0.01\), we do not reject at \(\alpha = 0.01\), but we would reject at \(\alpha = 0.05\), since \(0.0152 < 0.05\). This is exactly the same answer the critical-value approach gave; the two methods are always consistent. Figure 10.2 shows the \(p\)-value as the shaded right-tail area.

xs   <- seq(-4, 4, length.out = 400)
dat  <- data.frame(x = xs, y = dt(xs, df = 38))
tail <- subset(dat, x >= 2.25)
ggplot(dat, aes(x, y)) +
  geom_area(data = tail, aes(x, y), fill = ucla$red, alpha = 0.30) +
  geom_line(color = ucla$darkblue, linewidth = 1) +
  geom_segment(aes(x = 2.25, xend = 2.25, y = 0, yend = dt(2.25, 38)),
               linetype = "dashed", color = ucla$gray) +
  annotate("text", x = 3.0, y = 0.06, label = "p = 0.0152",
           color = ucla$red, size = 3.4) +
  scale_x_continuous(breaks = 2.25, labels = "t = 2.25") +
  scale_y_continuous(breaks = NULL) +
  labs(x = "t", y = NULL)
Figure 10.2: The right-tail \(p\)-value for the supermarket test: the area under the \(t_{(38)}\) curve beyond the observed \(t = 2.25\) is \(0.0152\).

For the two-tail significance test, where \(H_1: \beta_2 \neq 0\) and \(t = 4.88\), the \(p\)-value is the combined area in both tails beyond \(\pm 4.88\): \[ p = 2\,\Prob(t_{(38)} \ge 4.88) \approx 0.0000 . \] We reject at any usual \(\alpha\). We can confirm both values directly with the \(t\)-distribution functions in R.

# Right-tail p-value for the supermarket test (t = 2.25, df = 38)
pt(2.25, df = 38, lower.tail = FALSE)
#> [1] 0.01515999

# Two-tail p-value for the significance test (t = 4.88, df = 38)
2 * pt(4.88, df = 38, lower.tail = FALSE)
#> [1] 1.930065e-05

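The “\(\Pr(>|t|)\)” column in a regression printout (labelled p.value by tidy()) is exactly this two-tail \(p\)-value for the default null \(H_0: \beta_k = 0\). The \(1.95\times 10^{-5}\) next to income in the tidy() output above is the same quantity; it differs slightly from the \(1.93\times 10^{-5}\) we just computed only because we rounded \(t\) to \(4.88\).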

Why report \(p\) instead of just “reject” or “do not reject”? Because it lets every reader apply their own \(\alpha\). A reader who cares about a \(0.01\) standard and one who is happy with \(0.10\) can both read off the verdict from the same number, a far more informative summary than a bare yes/no.

10.5 Statistical vs. economic significance

The final, and most important, lesson of the chapter is that statistically significant does not mean economically important. A coefficient can clear every significance hurdle and still be too small to matter for any real decision.

A large sample, a tiny effect

Suppose an estimate comes back as \(b_2 = 0.0001\) with \(\mathrm{se}(b_2) = 0.00001\), so that \[ t = \frac{0.0001}{0.00001} = 10.0 . \] We resoundingly reject \(H_0: \beta_2 = 0\); \(b_2\) is statistically different from zero by a mile. But \(0.0001\) may be far too small an effect to matter for any real decision. Statistically significant, economically negligible.

The reverse can happen too: an economically large effect can come out statistically insignificant if the sample is small or noisy, producing a wide confidence interval that fails to exclude the null. Significance and importance are simply different questions.
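The first case, a tiny effect that becomes “significant” once the sample is large enough, can be illustrated by simulation. Everything in this sketch is an assumption chosen for the example: the true slope of \(0.05\), the two sample sizes, and the helper name `slope_t()`.

```r
# Illustrative simulation (all numbers made up): the same tiny true slope,
# estimated once in a small sample and once in a very large one.
set.seed(42)
slope_t <- function(N, beta = 0.05) {
  x <- rnorm(N)
  y <- 1 + beta * x + rnorm(N)          # unit-variance noise around the line
  coef(summary(lm(y ~ x)))["x", c("Estimate", "Std. Error", "t value")]
}

small <- slope_t(100)       # N = 100
large <- slope_t(100000)    # N = 100,000
rbind(small, large)
```

The estimated slope is about the same size in both rows, but the standard error shrinks roughly like \(1/\sqrt{N}\), so the \(t\)-statistic in the large sample is far beyond any usual critical value even though the effect is no more important than before.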

Reading significance responsibly

A few habits keep the two straight.

  • Sample size inflates significance. As \(N\) grows, \(\mathrm{se}(b_k) \to 0\), so almost any \(\beta_k \neq 0\), however tiny, eventually becomes “significant.” A significant result in a huge sample says little, on its own, about importance.
  • Look at the magnitude. Is the estimated effect big enough to change a decision? Report the coefficient and its units, not just stars.
  • Use the confidence interval. It shows both significance (does it exclude the null?) and the range of economically relevant values, in one object.
  • State the economic conclusion. “We reject \(H_0\)” is not an answer. “Income significantly raises food spending, by about \(\$10\) per extra \(\$100\), give or take \(\$4\)” is.

Procedures are means, not ends

Statistical tests are tools, not the goal. Always translate the result back into the economic question that motivated the test in the first place.
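The confidence-interval habit can be sketched with the rounded food estimates, \(b_2 = 10.21\) and \(\mathrm{se}(b_2) = 2.09\) on \(38\) degrees of freedom. Because the inputs are rounded, the endpoints may differ in the second decimal from the \([5.97,\ 14.45]\) quoted at the start of the chapter.

```r
b2  <- 10.21
se2 <- 2.09
tc  <- qt(0.975, df = 38)              # two-tail critical value for alpha = 0.05

ci <- b2 + c(-1, 1) * tc * se2         # 95% interval estimate for beta_2
round(ci, 2)

# Significance: does the interval exclude the null value 0?
ci[1] > 0
```

The interval excludes zero, which matches the two-tail rejection of \(H_0: \beta_2 = 0\) at the \(5\%\) level, and its endpoints give the range of effects, in dollars per extra \(\$100\) of income, that the data support.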

10.6 Recap

A hypothesis test pits a maintained null \(H_0: \beta_k = c\) against an alternative \(H_1\) (one of \(>\), \(<\), or \(\neq\)). Under \(H_0\) the test statistic \[ t = \frac{b_k - c}{\mathrm{se}(b_k)} \sim t_{(N-2)}, \] and we reject when \(t\) falls in the tail(s); the arrow in \(H_1\) points to the rejection region. The level \(\alpha = \Prob(\text{Type I error})\) is chosen by the cost of a false rejection, before seeing the data. Equivalently, report the \(p\)-value, the probability of a \(t\) this extreme under \(H_0\), and reject if and only if \(p \le \alpha\).

On the food data, \(H_0: \beta_2 = 0\) against \(> 0\) gives \(t = 4.88\), a significant positive effect; \(H_0: \beta_2 \le 5.5\) against \(> 5.5\) gives \(t = 2.25\) with \(p = 0.015\), which rejects at \(5\%\) but not at \(1\%\). And throughout, statistical significance is not economic significance: with a big enough \(N\) almost any nonzero effect becomes “significant,” so judge an effect by its magnitude and confidence interval, in economic context.

Rejection rule and \(p\)-value for each alternative.

| Alternative \(H_1\) | Reject when | \(p\)-value |
|---|---|---|
| \(\beta_k > c\) | \(t \ge t_{(1-\alpha,\,N-2)}\) | \(\Prob(t_{(N-2)} \ge t)\) |
| \(\beta_k < c\) | \(t \le t_{(\alpha,\,N-2)}\) | \(\Prob(t_{(N-2)} \le t)\) |
| \(\beta_k \neq c\) | \(\lvert t\rvert \ge t_{(1-\alpha/2,\,N-2)}\) | \(2\,\Prob(t_{(N-2)} \ge \lvert t\rvert)\) |

This rounds out our inference toolkit for the simple regression model: we can now estimate, quantify precision, predict, and test.

Next time: measuring how well the line fits the data (goodness of fit and prediction), then functional forms and, after the midterm, multiple regression.