---
title: "Hypothesis Testing"
---
{{< include _setup.qmd >}}
> **Reading.** SW §5.1, HGL §3.2–3.5
Last chapter, the $t$-statistic gave us a 95% *range* of plausible values for the food-expenditure slope:
$$
\frac{b_k - \beta_k}{\mathrm{se}(b_k)} \sim t_{(N-2)}
\;\;\Longrightarrow\;\;
\beta_2 \in [5.97,\ 14.45].
$$
A [confidence interval](09-confidence-intervals.qmd) answers "how big is the
effect, give or take?" But decision-makers usually ask sharper, yes/no
questions. Is there *any* relationship between income and food spending — is
$\beta_2 = 0$? Will households spend more than $\$5.50$ of each extra $\$100$ —
is $\beta_2 > 5.5$? This chapter points the *same* inferential engine at specific
conjectures like these. We set up null and alternative hypotheses, build
rejection regions, compute $p$-values, and — crucially — learn to separate
**statistical** significance from **economic** significance.
## The logic of a hypothesis test {#sec-logic}
Every hypothesis test, no matter how complicated the setting, is built from the
same five pieces.
::: {.keyidea title="Components of a hypothesis test"}
1. a **null hypothesis** $H_0$,
2. an **alternative hypothesis** $H_1$,
3. a **test statistic**,
4. a **rejection region**,
5. a **conclusion**, stated in economic context.
:::
The **null hypothesis** $H_0: \beta_k = c$ is the maintained belief — the claim
we hold to be true until the data convince us otherwise. The value $c$ is one
that matters in context, and it is very often $0$. The **alternative
hypothesis** $H_1$ is what we are prepared to accept if we reject $H_0$. It comes
in three flavors: $\beta_k > c$, $\beta_k < c$, or $\beta_k \neq c$.
### The test statistic and its logic
Recall that $t = (b_k - \beta_k)/\mathrm{se}(b_k) \sim t_{(N-2)}$. *If the null
$H_0: \beta_k = c$ is true*, we can substitute the hypothesized value $c$ for the
unknown $\beta_k$, and the same quantity becomes computable:
$$
t = \frac{b_k - c}{\mathrm{se}(b_k)} \sim t_{(N-2)} \qquad\text{under } H_0 .
$$
This is what makes it a *test statistic*: it has a **known distribution when
$H_0$ is true**, and some other distribution when $H_0$ is false. That single
fact powers the whole procedure.
The chain of reasoning is short. If $H_0$ holds, the computed $t$ should land in
the **middle** of the $t$-curve, where most of the probability sits. A value of
$t$ way out in a **tail** is *unlikely* under $H_0$. So observing such a value is
evidence that $H_0$ is **false** — and we reject it.
### How unlikely is "unlikely"?
We draw the line between "plausible" and "too extreme" with the **level of
significance** $\alpha$: the probability of landing in the rejection region
*when $H_0$ is true*. There are two ways a test can go wrong, and they are not
symmetric.
::: {.definition title="Type I and Type II errors"}
- A **Type I error** is rejecting $H_0$ when it is actually true. Its probability
is exactly the level of significance: $\Prob(\text{Type I error}) = \alpha$. We
*choose* $\alpha$, usually $0.01$, $0.05$, or $0.10$.
- A **Type II error** is failing to reject a *false* $H_0$. Its probability
depends on the unknown true $\beta_k$, so we cannot set it directly.
:::
Choosing $\alpha$ is choosing how much risk of a false rejection you are willing
to bear. A costly false rejection calls for a small $\alpha$ — $0.01$, say.
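A small Monte Carlo sketch can make the two error types tangible. Everything in it is made up for illustration (the sample size, the regressor grid, the error standard deviation, and the "true" slopes); it only assumes base R. With $H_0: \beta_2 = 0$ true, the share of rejections should sit near $\alpha$; with $H_0$ false, the share of non-rejections estimates the Type II error probability at that particular $\beta_2$.

```r
# Monte Carlo sketch: how often does a two-tail 5% test reject H0: beta2 = 0?
# All design values (N, the x grid, sd = 5, the true slopes) are made up.
set.seed(1)
N <- 40
alpha <- 0.05
reps <- 2000
x <- seq(1, 20, length.out = N)            # fixed regressor values
t_crit <- qt(1 - alpha / 2, df = N - 2)    # two-tail critical value

reject_rate <- function(beta2_true, c0 = 0) {
  rejections <- replicate(reps, {
    y  <- 1 + beta2_true * x + rnorm(N, sd = 5)
    cf <- summary(lm(y ~ x))$coefficients
    t_stat <- (cf["x", "Estimate"] - c0) / cf["x", "Std. Error"]
    abs(t_stat) >= t_crit
  })
  mean(rejections)
}

type1_rate <- reject_rate(beta2_true = 0)    # H0 true: share of false rejections
power_alt  <- reject_rate(beta2_true = 0.3)  # H0 false: share of correct rejections
type2_rate <- 1 - power_alt                  # Type II error rate at beta2 = 0.3
c(type1 = type1_rate, type2 = type2_rate)
```

Changing `beta2_true` in the second call changes `type2_rate`, which is the sense in which the Type II error probability depends on the unknown true slope.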
::: {.callout-note appearance="simple"}
The ubiquitous "$\alpha = 0.05$" is convention, not law: it descends from
Fisher's old rule of thumb that "$t > 2$ is significant," and there is nothing
sacred about it. Pick $\alpha$ to fit the decision at hand.
:::
## Rejection regions for the three alternatives {#sec-rejection}
Where the rejection region sits depends entirely on the alternative $H_1$. There
is a handy memory trick: **the rejection region is in the direction the arrow
points** in the alternative.
If $H_1: \beta_k > c$, the arrow points right, so we reject for large positive
$t$ — a **right-tail** test, rejecting when $t \ge t_{(1-\alpha,\,N-2)}$. If
$H_1: \beta_k < c$, the arrow points left, so we reject for large negative $t$ —
a **left-tail** test, rejecting when $t \le t_{(\alpha,\,N-2)}$. If
$H_1: \beta_k \neq c$, deviations in *either* direction count against $H_0$, so
the rejection region splits across **both tails**, and we reject when
$|t| \ge t_{(1-\alpha/2,\,N-2)}$. @fig-rejection shows all three.
```{r}
#| label: fig-rejection
#| fig-cap: "Rejection regions (shaded) for the three alternatives. The region lies in the direction the alternative's arrow points; the two-tail test splits $\\alpha$ across both tails."
#| fig-width: 7.2
#| fig-height: 2.8
# Schematic bell curve; the cutoffs 1.7 and 2 stand in for the 5% critical values
xs <- seq(-4, 4, length.out = 400)
curve_df <- data.frame(x = xs, y = dnorm(xs))
lvl <- c("H1: beta > c (right tail)",
         "H1: beta < c (left tail)",
         "H1: beta != c (two tails)")
# Shaded tails; `region` keeps the two tails of the two-tail panel as separate
# ribbons, so the shading does not bridge the middle of the curve
right <- transform(subset(curve_df, x >= 1.7),  panel = lvl[1], region = "right")
left  <- transform(subset(curve_df, x <= -1.7), panel = lvl[2], region = "left")
twoU  <- transform(subset(curve_df, x >= 2),    panel = lvl[3], region = "upper")
twoL  <- transform(subset(curve_df, x <= -2),   panel = lvl[3], region = "lower")
# One copy of the full curve per panel
base <- do.call(rbind, lapply(lvl, function(l) transform(curve_df, panel = l)))
base$panel <- factor(base$panel, levels = lvl)
shade <- rbind(right, left, twoU, twoL)
shade$panel <- factor(shade$panel, levels = lvl)
ggplot(base, aes(x, y)) +
  geom_ribbon(data = shade, aes(x = x, ymin = 0, ymax = y, group = region),
              inherit.aes = FALSE, fill = ucla$red, alpha = 0.30) +
  geom_line(color = ucla$darkblue, linewidth = 1) +
  facet_wrap(~ panel) +
  scale_y_continuous(breaks = NULL) +
  labs(x = "t", y = NULL)
```
For a two-tail test the area $\alpha$ is split, $\alpha/2$ into each tail, so the
critical value $t_c$ has to sit *farther* out than for a one-tail test at the
same $\alpha$. That is why a one-sided conjecture is harder to "pass" with a
two-tail test than with a one-tail test: you pay for hedging your bets about the
direction.
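The three critical values can be read off the $t$ quantile function. The sketch below assumes $\alpha = 0.05$ and $38$ degrees of freedom (the food-expenditure sample size minus two); the variable names are illustrative only.

```r
alpha <- 0.05
df    <- 38   # N - 2 for a 40-observation simple regression

crit_right <- qt(1 - alpha, df)       # right-tail test: reject if t >= crit_right
crit_left  <- qt(alpha, df)           # left-tail test:  reject if t <= crit_left
crit_two   <- qt(1 - alpha / 2, df)   # two-tail test:   reject if |t| >= crit_two

round(c(right = crit_right, left = crit_left, two_tail = crit_two), 3)
```

The two-tail value is larger than the right-tail value at the same $\alpha$, which is the "price" of not committing to a direction.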
### The five-step procedure
Every test we run follows the same checklist.
::: {.keyidea title="The five-step recipe"}
1. **Hypotheses.** State $H_0$ and $H_1$.
2. **Test statistic.** $t = \dfrac{b_k - c}{\mathrm{se}(b_k)}$, which is
$t_{(N-2)}$ if $H_0$ is true.
3. **Rejection region.** Pick $\alpha$; find the critical value(s) for the
relevant tail(s).
4. **Compute.** Plug in $b_k$, $c$, and $\mathrm{se}(b_k)$.
5. **Conclude.** Reject or do not reject — and say what it *means* for the
economics.
:::
One subtle but important habit of language goes with step 5.
::: {.warningbox title="Never \"accept\" the null"}
We say "reject $H_0$" or "**fail to reject** $H_0$" — never "accept $H_0$."
Failing to reject only means the data are *compatible* with $H_0$, not that $H_0$
is true. Absence of evidence is not evidence of absence.
:::
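As a rough sketch, the five steps can be wrapped in one small function. `t_test_coef()` is a hypothetical helper written for this illustration, not part of any package, and the numbers in the example call are made up.

```r
# Hypothetical helper: the five-step recipe for a single coefficient.
# b = estimate, se = its standard error, c0 = hypothesized value, df = N - 2.
t_test_coef <- function(b, se, c0 = 0, df, alpha = 0.05,
                        alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)        # step 1: which H1?
  t_stat <- (b - c0) / se                      # steps 2 and 4: test statistic
  reject <- switch(alternative,                # step 3: rejection region
    greater   = t_stat >= qt(1 - alpha, df),
    less      = t_stat <= qt(alpha, df),
    two.sided = abs(t_stat) >= qt(1 - alpha / 2, df)
  )
  # step 5: the statistical verdict; the economic reading is still up to you
  list(t = t_stat, alternative = alternative, alpha = alpha,
       decision = if (reject) "reject H0" else "fail to reject H0")
}

# Made-up inputs: b = 1.2, se = 0.5, 30 degrees of freedom
res_two  <- t_test_coef(b = 1.2, se = 0.5, c0 = 0, df = 30)
res_less <- t_test_coef(b = 1.2, se = 0.5, c0 = 0, df = 30, alternative = "less")
res_two$decision
res_less$decision
```

Note that the function deliberately returns "fail to reject H0" rather than "accept H0".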
## Worked tests on the food data {#sec-worked}
To make the recipe concrete, we run three tests on the food-expenditure
regression from earlier chapters. The fitted model on the 40 households is
$\widehat{\text{food\_exp}} = 83.42 + 10.21\,\text{income}$, with
$\mathrm{se}(b_2) = 2.09$ on $N - 2 = 38$ degrees of freedom.
```{r}
#| label: food-fit
#| code-fold: false
data(food)
fit <- lm(food_exp ~ income, data = food)
tidy(fit)
```
### A one-tail test of significance
Economic theory says food is a normal good, so we expect $\beta_2 > 0$. We put
that conjecture in the *alternative*, so that rejecting $H_0$ actively
*establishes* it.
1. **Hypotheses.** $H_0: \beta_2 = 0$ versus $H_1: \beta_2 > 0$.
2. **Test statistic.** $t = b_2/\mathrm{se}(b_2) \sim t_{(38)}$ under $H_0$
(here $c = 0$).
3. **Rejection region.** With $\alpha = 0.05$, the right-tail critical value is
$t_{(0.95,\,38)} = 1.686$. Reject if $t \ge 1.686$.
4. **Compute.** $t = 10.21/2.09 = 4.88$.
5. **Conclude.** Since $4.88 > 1.686$, we **reject $H_0$**. There is a
statistically significant *positive* relationship between income and food
expenditure.
Note the asymmetry built into step 5. Had we *failed* to reject, we would
**not** have concluded "theory is wrong" — only that this particular sample
lacks the evidence to confirm it.
### The two-tail test of significance (what software reports)
If we have no prior sign in mind, we test against $\neq$ instead.
1. **Hypotheses.** $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$.
2. **Test statistic.** $t = b_2/\mathrm{se}(b_2) \sim t_{(38)}$ under $H_0$.
3. **Rejection region.** With $\alpha = 0.05$, the critical values are
$\pm t_{(0.975,\,38)} = \pm 2.024$. Reject if $|t| \ge 2.024$.
4. **Compute.** $t = 10.21/2.09 = 4.88$.
5. **Conclude.** Since $|4.88| > 2.024$, we **reject $H_0$**: $\beta_2$ is
significantly different from zero.
::: {.keyidea title="This test is automatic"}
Every regression printout reports, for each coefficient, the statistic
$t = b_k/\mathrm{se}(b_k)$ for the default null $H_0: \beta_k = 0$. In the food
output, `income` has $t = 4.88$ — the slope is significant — while the intercept
has $t = 1.92$, which is *not* significant at the $5\%$ level. The two-tail
critical value $2.024$ separates them.
:::
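To see that the reported statistic really is just the estimate divided by its standard error, here is a minimal check. It uses R's built-in `cars` data as a stand-in (an assumption made so the snippet runs on its own, without the `food` data); `fit_cars` is an illustrative name.

```r
# Stand-in regression on a built-in dataset (not the food data)
fit_cars <- lm(dist ~ speed, data = cars)
coefs <- summary(fit_cars)$coefficients

# Recompute the t column by hand for the default null H0: beta_k = 0
t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]

cbind(reported = coefs[, "t value"], by_hand = t_by_hand)
```

The two columns should agree row by row, for the intercept and the slope alike.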
### Testing an economic value, and why $\alpha$ matters
Now a genuinely economic question. A developer will build a new supermarket only
if households spend more than $\$5.50$ of each extra $\$100$ on food. We put the
make-or-break claim in the alternative.
1. **Hypotheses.** $H_0: \beta_2 \le 5.5$ versus $H_1: \beta_2 > 5.5$.
2. **Test statistic.** $t = \dfrac{b_2 - 5.5}{\mathrm{se}(b_2)} \sim t_{(38)}$
when $\beta_2 = 5.5$, the boundary value of $H_0$ (the case in which a false
rejection is most likely).
3. **Rejection region.** A wrong "build" decision is costly, so we choose a
**conservative** $\alpha = 0.01$; the critical value is
$t_{(0.99,\,38)} = 2.429$. Reject if $t \ge 2.429$.
4. **Compute.** $t = \dfrac{10.21 - 5.5}{2.09} = 2.25$.
5. **Conclude.** Since $2.25 < 2.429$, we **do not reject $H_0$**. There is not
enough evidence of profitability — do not build (yet).
The punchline is the role of $\alpha$. At $\alpha = 0.05$ the critical value
would be only $1.686$, and $2.25 > 1.686$, so we *would* have rejected. The
decision flips with $\alpha$. This is exactly why you must choose $\alpha$
*before* seeing the data, by weighing the cost of a Type I error — here, the cost
of building an unprofitable store.
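The flip is easy to tabulate. The sketch below plugs the rounded values from the text ($b_2 = 10.21$, $\mathrm{se}(b_2) = 2.09$, $38$ degrees of freedom) into the right-tail rule at three common levels; the variable names are illustrative.

```r
b2 <- 10.21
se_b2 <- 2.09
c0 <- 5.5      # hypothesized value under H0
df <- 38

t_stat <- (b2 - c0) / se_b2
alphas <- c(0.10, 0.05, 0.01)
t_crit <- qt(1 - alphas, df)     # right-tail critical value at each alpha

data.frame(alpha = alphas,
           critical = round(t_crit, 3),
           reject = t_stat >= t_crit)
```

The same $t$ statistic clears the bar at the two looser levels but not at $\alpha = 0.01$, so the level has to be fixed by the cost of a wrong "build" decision rather than by the result.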
## The $p$-value approach {#sec-pvalue}
Looking up critical values for every $\alpha$ is tedious. The modern alternative
is to report the **$p$-value** and let the reader compare it to whatever $\alpha$
they care about.
::: {.definition title="$p$-value"}
The **$p$-value** is the probability, *if $H_0$ is true*, of getting a test
statistic **at least as extreme** as the one we actually observed.
:::
The decision rule is then a single comparison.
::: {.keyidea title="The $p$-value rule"}
$$
p \le \alpha \;\Rightarrow\; \text{reject } H_0,
\qquad
p > \alpha \;\Rightarrow\; \text{do not reject } H_0 .
$$
:::
What counts as "at least as extreme" depends on $H_1$ — and we use the same
memory trick, looking where the arrow points:
- $H_1: \beta_k > c$: $\;p = \Prob(t_{(N-2)} \ge t)$ (right tail);
- $H_1: \beta_k < c$: $\;p = \Prob(t_{(N-2)} \le t)$ (left tail);
- $H_1: \beta_k \neq c$: $\;p = 2\,\Prob(t_{(N-2)} \ge |t|)$ (both tails).
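These three formulas map directly onto `pt()`, the $t$ distribution function in base R. The helper `p_value_t()` below is a hypothetical name for this sketch, and the $t$ value and degrees of freedom in the example are made up.

```r
# Hypothetical helper: p-value of an observed t statistic under each alternative
p_value_t <- function(t_stat, df,
                      alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    greater   = pt(t_stat, df, lower.tail = FALSE),          # right tail
    less      = pt(t_stat, df),                              # left tail
    two.sided = 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # both tails
  )
}

# Made-up example: t = 1.8 on 25 degrees of freedom
p_all <- sapply(c("greater", "less", "two.sided"),
                function(a) p_value_t(1.8, df = 25, alternative = a))
p_all
```

For a positive $t$, the two-tail $p$-value is twice the right-tail one, and the right-tail and left-tail values sum to one.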
### $p$-values on the food data
Return to the supermarket test, where $H_1: \beta_2 > 5.5$ and $t = 2.25$. The
right-tail $p$-value is the area under the $t_{(38)}$ curve beyond the observed
statistic:
$$
p = \Prob(t_{(38)} \ge 2.25) = 0.0152 .
$$
Since $0.0152 > 0.01$, we do not reject at $\alpha = 0.01$ — but we *would*
reject at $\alpha = 0.05$, since $0.0152 < 0.05$. This is exactly the same answer
the critical-value approach gave; the two methods are always consistent.
@fig-pvalue shows the $p$-value as the shaded right-tail area.
```{r}
#| label: fig-pvalue
#| fig-cap: "The right-tail $p$-value for the supermarket test: the area under the $t_{(38)}$ curve beyond the observed $t = 2.25$ is $0.0152$."
#| fig-width: 5
#| fig-height: 3.4
xs <- seq(-4, 4, length.out = 400)
dat <- data.frame(x = xs, y = dt(xs, df = 38))
# Right tail beyond the observed statistic (named to avoid masking utils::tail)
tail_df <- subset(dat, x >= 2.25)
ggplot(dat, aes(x, y)) +
  geom_area(data = tail_df, aes(x, y), fill = ucla$red, alpha = 0.30) +
  geom_line(color = ucla$darkblue, linewidth = 1) +
  # annotate() draws the marker once, rather than once per row of `dat`
  annotate("segment", x = 2.25, xend = 2.25, y = 0, yend = dt(2.25, 38),
           linetype = "dashed", color = ucla$gray) +
  annotate("text", x = 3.0, y = 0.06, label = "p = 0.0152",
           color = ucla$red, size = 3.4) +
  scale_x_continuous(breaks = 2.25, labels = "t = 2.25") +
  scale_y_continuous(breaks = NULL) +
  labs(x = "t", y = NULL)
```
For the two-tail significance test, where $H_1: \beta_2 \neq 0$ and $t = 4.88$,
the $p$-value is the combined area in both tails beyond $\pm 4.88$:
$$
p = 2\,\Prob(t_{(38)} \ge 4.88) \approx 0.00002 .
$$
We reject at any usual $\alpha$. We can confirm both values directly with the
$t$-distribution functions in R.
```{r}
#| label: pvalues
#| code-fold: false
# Right-tail p-value for the supermarket test (t = 2.25, df = 38)
pt(2.25, df = 38, lower.tail = FALSE)
# Two-tail p-value for the significance test (t = 4.88, df = 38)
2 * pt(4.88, df = 38, lower.tail = FALSE)
```
::: {.callout-note appearance="simple"}
The "$\Pr(>|t|)$" column in a regression printout (labeled `p.value` in `tidy()`
output) is exactly this two-tail $p$-value for the default null
$H_0: \beta_k = 0$. The $1.95\times 10^{-5}$ next to `income` in the `tidy()`
output above is the same number we just computed, up to rounding in $t$.
:::
Why report $p$ instead of just "reject" or "do not reject"? Because it lets every
reader apply **their own** $\alpha$. A reader who cares about a $0.01$ standard
and one who is happy with $0.10$ can both read off the verdict from the same
number — a far more informative summary than a bare yes/no.
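One way to convince yourself that the $p$-value rule and the critical-value rule never disagree is to run both over a grid of statistics. The sketch below does this for right-tail tests on $38$ degrees of freedom; the grid of $t$ values and the three levels are arbitrary choices for illustration.

```r
df <- 38
t_grid <- seq(0, 4, by = 0.25)     # arbitrary observed t statistics
alphas <- c(0.10, 0.05, 0.01)

agree <- sapply(alphas, function(a) {
  by_critical <- t_grid >= qt(1 - a, df)                    # critical-value rule
  by_pvalue   <- pt(t_grid, df, lower.tail = FALSE) <= a    # p-value rule
  all(by_critical == by_pvalue)
})
setNames(agree, paste0("alpha=", alphas))
```

Every entry should be `TRUE`: the single number $p$ carries the verdict for each reader's $\alpha$.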
## Statistical vs. economic significance {#sec-significance}
The final, and most important, lesson of the chapter is that **statistically
significant does not mean economically important.** A coefficient can clear every
significance hurdle and still be too small to matter for any real decision.
::: {.example title="A large sample, a tiny effect"}
Suppose an estimate comes back as $b_2 = 0.0001$ with $\mathrm{se}(b_2) =
0.00001$, so that
$$
t = \frac{0.0001}{0.00001} = 10.0 .
$$
We resoundingly reject $H_0: \beta_2 = 0$ — $b_2$ is *statistically* different
from zero by a mile. But $0.0001$ may be far too small an effect to **matter** for
any real decision. Statistically significant, economically negligible.
:::
The reverse can happen too: an economically large effect can come out
statistically *insignificant* if the sample is small or noisy, producing a wide
confidence interval that fails to exclude the null. Significance and importance
are simply different questions.
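A quick numerical sketch of the reverse case, with entirely made-up numbers: a large point estimate from a small, noisy sample, so the standard error is nearly as large as the estimate itself.

```r
# Made-up numbers: big estimate, big standard error, only 8 degrees of freedom
b  <- 50
se <- 40
df <- 8

t_stat <- b / se                       # test of H0: beta = 0
t_crit <- qt(0.975, df)                # two-tail 5% critical value
ci95   <- b + c(-1, 1) * t_crit * se   # 95% confidence interval

list(t = t_stat, critical = round(t_crit, 3), ci95 = round(ci95, 1))
```

Here $|t|$ falls short of the critical value and the interval straddles zero, yet the interval also contains very large effects. The data are uninformative, which is different from showing that the effect is small.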
### Reading significance responsibly
A few habits keep the two straight.
- **Sample size inflates significance.** As $N$ grows, $\mathrm{se}(b_k) \to 0$,
so almost any $\beta_k \neq 0$ — however tiny — eventually becomes
"significant." A significant result in a huge sample says little, on its own,
about importance.
- **Look at the magnitude.** Is the estimated effect big enough to change a
decision? Report the coefficient and its *units*, not just stars.
- **Use the confidence interval.** It shows both significance — does it exclude
the null? — *and* the range of economically relevant values, in one object.
- **State the economic conclusion.** "We reject $H_0$" is not an answer. "Income
significantly raises food spending, by about $\$10$ per extra $\$100$, give or
take $\$4$" is.
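The first habit can be illustrated with a short simulation. The sketch assumes a deliberately tiny true slope of $0.01$ and standard normal regressor and errors (all made-up choices), and refits the same model at three sample sizes.

```r
# Simulation sketch: a negligible true slope becomes "significant" as N grows.
# tiny_beta and the data-generating process are made up for illustration.
set.seed(42)
tiny_beta <- 0.01
sizes <- c(1e2, 1e4, 1e6)

t_by_n <- sapply(sizes, function(n) {
  x <- rnorm(n)
  y <- tiny_beta * x + rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "t value"]
})

data.frame(N = sizes, t = round(t_by_n, 2))
```

The slope is the same negligible $0.01$ in every row; only the standard error changes, shrinking roughly like $1/\sqrt{N}$, so the $t$ statistic at the largest $N$ ends up far out in the tail.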
::: {.warningbox title="Procedures are means, not ends"}
Statistical tests are tools, not the goal. Always translate the result back into
the economic question that motivated the test in the first place.
:::
## Recap {#sec-recap}
A hypothesis test pits a maintained null $H_0: \beta_k = c$ against an
alternative $H_1$ (one of $>$, $<$, or $\neq$). Under $H_0$ the test statistic
$$
t = \frac{b_k - c}{\mathrm{se}(b_k)} \sim t_{(N-2)},
$$
and we reject when $t$ falls in the tail(s) — the arrow in $H_1$ points to the
rejection region. The level $\alpha = \Prob(\text{Type I error})$ is chosen by
the cost of a false rejection, *before* seeing the data. Equivalently, report the
$p$-value, the probability of a $t$ this extreme under $H_0$, and reject if and
only if $p \le \alpha$.
On the food data, testing $H_0: \beta_2 = 0$ against $H_1: \beta_2 > 0$ gives
$t = 4.88$, a significant positive effect; testing $H_0: \beta_2 \le 5.5$ against
$H_1: \beta_2 > 5.5$ gives $t = 2.25$ with $p = 0.015$, which rejects at $5\%$
but not at $1\%$. And throughout, statistical significance is not economic
significance: with a big enough $N$ almost any nonzero effect becomes
"significant," so judge an effect by its magnitude and confidence interval, in
economic context.
| Alternative $H_1$ | Reject when | $p$-value |
|---|---|---|
| $\beta_k > c$ | $t \ge t_{(1-\alpha,\,N-2)}$ | $\Prob(t_{(N-2)} \ge t)$ |
| $\beta_k < c$ | $t \le t_{(\alpha,\,N-2)}$ | $\Prob(t_{(N-2)} \le t)$ |
| $\beta_k \neq c$ | $\lvert t\rvert \ge t_{(1-\alpha/2,\,N-2)}$ | $2\,\Prob(t_{(N-2)} \ge \lvert t\rvert)$ |
: Rejection rule and $p$-value for each alternative. {.striped}
This rounds out our inference toolkit for the simple regression model: we can now
**estimate**, **quantify precision**, **predict**, and **test**.
**Next time:** measuring how well the line fits the data — goodness of fit and
[prediction](11-prediction-fit.qmd), then [functional
forms](12-functional-forms.qmd) and, after the midterm, *multiple* regression.