---
title: "The Normal Distribution, Sampling & the CLT"
---
{{< include _setup.qmd >}}
> **Reading.** SW §2.4–2.6, HGL Probability Primer §P.7 & Appendix C
In the [last chapter](03-expectation.qmd) we learned how to summarize a
distribution with its mean and variance, how those quantities combine, and how
to standardize a variable to mean $0$ and variance $1$ via $Z = (X-\mu)/\sigma$.
This chapter closes out the probability toolkit and brings us to the doorstep of
inference. We meet the **Normal distribution** — the bell curve — and learn to
read probabilities off it; we study the **sample mean** $\bar Y$ as a random
variable with its own distribution; and we arrive at the **Central Limit
Theorem**, the reason the Normal turns up everywhere.
The payoff is concrete. By the end of the chapter we can say how close $\bar Y$
is likely to be to the truth $\mu$. That single fact powers *every* confidence
interval and hypothesis test in the rest of the course.
## The Normal distribution {#sec-normal}
Some distributions are special enough to earn a name. The most important of all
is the **Normal**.
::: {.definition title="Normal distribution"}
$X$ is **normally distributed** with mean $\mu$ and variance $\sigma^2$, written
$X \sim N(\mu,\sigma^2)$, if its density is the bell curve
$$
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,
\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],
\qquad -\infty < x < \infty .
$$
:::
Two features are worth fixing in mind. First, the Normal is **symmetric** and
centered at $\mu$, so its mean equals its median and its skewness is $0$.
Second, the two parameters play distinct roles: $\mu$ sets the **location** of
the curve, while $\sigma^2$ sets its **spread**.
Changing the parameters simply slides and stretches the same bell shape. Moving
$\mu$ shifts the center left or right; raising $\sigma^2$ flattens and widens the
curve, while shrinking it makes the curve tall and tight. Throughout, the total
area under each curve stays equal to $1$ — so a wider curve is necessarily a
shorter one. @fig-normal-family shows three members of the family.
```{r}
#| label: fig-normal-family
#| fig-cap: "Same family, different parameters: $\\mu$ moves the center, $\\sigma^2$ controls the spread."
#| fig-width: 5.4
#| fig-height: 3.4
xs <- seq(-6, 8, length.out = 400)
fam <- rbind(
data.frame(x = xs, f = dnorm(xs, 0, 1), dist = "N(0, 1)"),
data.frame(x = xs, f = dnorm(xs, 2, 1), dist = "N(2, 1)"),
data.frame(x = xs, f = dnorm(xs, 0, 2), dist = "N(0, 4)")
)
ggplot(fam, aes(x, f, color = dist, linetype = dist)) +
geom_line(linewidth = 1) +
scale_color_manual(values = c(ucla$blue, ucla$darkblue, ucla$red)) +
scale_linetype_manual(values = c("solid", "dashed", "dotted")) +
labs(x = "x", y = expression(f[X](x)), color = NULL, linetype = NULL)
```
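The claim that a wider curve must also be a shorter one can be checked numerically. The sketch below is illustrative only: it reuses the three parameter pairs from the figure, integrates each density with base R's `integrate`, and compares the peak heights. The object names (`pars`, `areas`, `peaks`) are our own.

```{r}
# Parameters (mean, sd) of the three curves plotted above.
pars <- list(c(0, 1), c(2, 1), c(0, 2))

# Total area under each density, computed by numerical integration.
areas <- sapply(pars, function(p) {
  integrate(dnorm, -Inf, Inf, mean = p[1], sd = p[2])$value
})

# Peak height of each density, attained at x = mu.
peaks <- sapply(pars, function(p) dnorm(p[1], mean = p[1], sd = p[2]))

round(areas, 4)  # each area is 1 up to numerical error
round(peaks, 4)  # the sd = 2 curve peaks at half the height of the sd = 1 curves
```

Note that `dnorm` takes the standard deviation, not the variance, so $N(0,4)$ is entered with `sd = 2`.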
### A fact worth memorizing: the 95% rule
For *any* Normal, about **95%** of the probability lies within $1.96$ standard
deviations of the mean:
$$
\Prob\!\left(\mu - 1.96\,\sigma \le X \le \mu + 1.96\,\sigma\right)
\approx 0.95 .
$$
It is convenient to keep the round-number version in your head as well: about
$68\%$ of the probability falls within $\pm 1\sigma$ of the mean, about $95\%$
within $\pm 2\sigma$, and about $99.7\%$ within $\pm 3\sigma$. @fig-95-rule
shades the central $95\%$.
```{r}
#| label: fig-95-rule
#| fig-cap: "The 95% rule: about 95% of a Normal's probability lies within $\\pm 1.96\\sigma$ of the mean."
#| fig-width: 5.4
#| fig-height: 3.4
xs <- seq(-3.6, 3.6, length.out = 400)
dat <- data.frame(x = xs, y = dnorm(xs))
sh <- subset(dat, x >= -1.96 & x <= 1.96)
ggplot(dat, aes(x, y)) +
geom_area(data = sh, aes(x, y), fill = ucla$blue, alpha = 0.30) +
geom_line(color = ucla$blue, linewidth = 1) +
annotate("text", x = 0, y = 0.16, label = "95%",
color = ucla$darkblue, size = 4) +
scale_x_continuous(
breaks = c(-1.96, 0, 1.96),
labels = c(expression(mu - 1.96 * sigma), expression(mu),
expression(mu + 1.96 * sigma))
) +
scale_y_continuous(limits = c(0, 0.45)) +
labs(x = expression(x ~ "(in units of " * sigma ~ "from " * mu * ")"),
y = expression(f[X](x)))
```
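These round numbers are easy to reproduce. A minimal sketch, assuming only base R's `pnorm`: after standardizing, the probability of landing within $k$ standard deviations of the mean depends on $k$ alone, so one helper (here called `within_k`, a name of our own) covers every Normal.

```{r}
# Probability that a Normal falls within k standard deviations of its mean.
# Standardizing removes mu and sigma, so only k matters.
within_k <- function(k) pnorm(k) - pnorm(-k)

k <- c(1, 1.96, 2, 3)
round(setNames(within_k(k), paste0("k=", k)), 4)
```

The values for $k = 1.96$ and $k = 2$ differ only in the third decimal, which is why "two standard deviations" is a serviceable stand-in for $1.96$.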
## Standardizing & the standard Normal {#sec-standard}
Rather than keep a separate table for every pair $(\mu,\sigma^2)$, we convert
every Normal problem to one reference distribution.
::: {.definition title="Standard Normal"}
The **standard Normal** is $Z \sim N(0,1)$. If $X \sim N(\mu,\sigma^2)$, then
$$
Z = \frac{X-\mu}{\sigma} \sim N(0,1).
$$
:::
This is exactly the standardizing move from the [previous
chapter](03-expectation.qmd) — subtract the mean, divide by the standard
deviation to get mean $0$ and variance $1$ — now applied to a Normal, which
keeps the variable Normal. The cdf of $Z$ is important enough to get its own
symbol,
$$
\Phi(z) = \Prob(Z \le z),
$$
tabulated in the textbook's Statistical Table 1 and built into R as `pnorm`. By
symmetry of the bell curve around $0$, the upper tail past $a$ equals the lower
tail before $-a$:
$$
\Prob(Z > a) = \Prob(Z < -a).
$$
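The symmetry identity can be confirmed directly with `pnorm`. In this small sketch the cutoff `a = 1.5` is an arbitrary choice made for illustration; any value would do.

```{r}
a <- 1.5                                 # arbitrary cutoff, for illustration only
upper <- pnorm(a, lower.tail = FALSE)    # P(Z > a)
lower <- pnorm(-a)                       # P(Z < -a)

c(upper = upper, lower = lower)
all.equal(upper, lower)                  # the two tails match
```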
### Reading probabilities off the Normal
To get any Normal probability, **standardize, then look up $\Phi$.** For
$X \sim N(\mu,\sigma^2)$ and constants $a < b$, three rules cover everything.
::: {.property title="Three rules for Normal probabilities"}
$$
\begin{aligned}
\Prob(X \le a) &= \Phi\!\left(\tfrac{a-\mu}{\sigma}\right),\\[4pt]
\Prob(X \ge a) &= 1 - \Phi\!\left(\tfrac{a-\mu}{\sigma}\right),\\[4pt]
\Prob(a \le X \le b) &= \Phi\!\left(\tfrac{b-\mu}{\sigma}\right)
- \Phi\!\left(\tfrac{a-\mu}{\sigma}\right).
\end{aligned}
$$
:::
Everything reduces to standard-Normal cdf values $\Phi(\cdot)$ — which is why a
single table, or one R command (`pnorm`), does all the work.
::: {.example title="A worked probability"}
Let $X \sim N(3,\,9)$, so $\mu = 3$ and $\sigma = 3$. Find
$\Prob(4 \le X \le 6)$.
First standardize the endpoints:
$$
\tfrac{4-3}{3} \approx 0.33, \qquad \tfrac{6-3}{3} = 1.
$$
Then take the difference of cdf values:
$$
\begin{aligned}
\Prob(4 \le X \le 6) &\approx \Phi(1) - \Phi(0.33)\\
&= 0.8413 - 0.6293 = 0.2120.
\end{aligned}
$$
The answer is the shaded area between $4$ and $6$ under the $N(3,9)$ density in
@fig-worked-normal.
:::
```{r}
#| label: fig-worked-normal
#| fig-cap: "Shaded area between 4 and 6 under the $N(3, 9)$ density, equal to about 0.21."
#| fig-width: 5
#| fig-height: 3.4
xs <- seq(-6, 12, length.out = 400)
dat <- data.frame(x = xs, y = dnorm(xs, 3, 3))
sh <- subset(dat, x >= 4 & x <= 6)
ggplot(dat, aes(x, y)) +
geom_area(data = sh, aes(x, y), fill = ucla$blue, alpha = 0.30) +
geom_line(color = ucla$blue, linewidth = 1) +
annotate("text", x = 5, y = 0.045, label = "0.21",
color = ucla$darkblue, size = 3.6) +
scale_x_continuous(breaks = c(3, 4, 6)) +
scale_y_continuous(limits = c(0, 0.16)) +
labs(x = "x", y = expression(f[X](x)))
```
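The same calculation takes one line in R. The sketch below applies the third rule in two ways: exactly, by passing `mean` and `sd` to `pnorm`, and table-style, by rounding the standardized endpoint to $0.33$ first. The variable names are our own.

```{r}
# Exact answer from R, with no rounding of the standardized endpoint.
exact <- pnorm(6, mean = 3, sd = 3) - pnorm(4, mean = 3, sd = 3)

# Table-style answer: round (4 - 3) / 3 to 0.33 before looking up Phi.
table_style <- pnorm(1) - pnorm(0.33)

round(c(exact = exact, table_style = table_style), 4)
```

The two answers differ slightly in the third decimal, purely because of the rounding to $0.33$; both round to the $0.21$ shown in the figure.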
### Key percentiles you will reuse all term
We will constantly need the value $z_\alpha$ for which
$\Prob(Z \le z_\alpha) = \alpha$ — the $\alpha$-quantile of the standard Normal.
The handful of values in @tbl-z-quantiles recur throughout the course.
```{r}
#| label: tbl-z-quantiles
#| tbl-cap: "Standard-Normal quantiles $z_\\alpha$ with $\\Prob(Z \\le z_\\alpha) = \\alpha$."
z_tab <- data.frame(
alpha = c(0.90, 0.95, 0.975, 0.99, 0.995),
z = c(1.28, 1.645, 1.96, 2.33, 2.58)
)
knitr::kable(z_tab, col.names = c("$\\alpha$", "$z_\\alpha$"), align = "cc")
```
::: {.keyidea title="The three to memorize"}
$$
1.645,\qquad 1.96,\qquad 2.58.
$$
:::
Because the Normal is symmetric, a **two-sided** $95\%$ range uses $\pm 1.96$,
leaving $2.5\%$ of the probability in each tail. That is the source of the
"$1.96$" in the $95\%$ rule above — and of the confidence intervals we build in
[the confidence-intervals chapter](09-confidence-intervals.qmd).
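The table entries are rounded values of R's quantile function `qnorm`, the inverse of `pnorm`. A short sketch that regenerates them to more digits:

```{r}
alpha <- c(0.90, 0.95, 0.975, 0.99, 0.995)
z <- qnorm(alpha)                 # solves P(Z <= z) = alpha for z
round(data.frame(alpha = alpha, z = z), 3)

# Two-sided 95% range: 2.5% in each tail, so the cutoffs are +/- qnorm(0.975).
qnorm(c(0.025, 0.975))
```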
## Linear combinations of Normals {#sec-combinations}
In the last chapter we found the *mean* and *variance* of a linear combination of
random variables. For Normals we now get the **shape** for free.
::: {.property title="Closure under linear combinations"}
If $X_1 \sim N(\mu_1,\sigma_1^2)$ and $X_2 \sim N(\mu_2,\sigma_2^2)$ are jointly
normal, then for constants $a_1, a_2$,
$$
a_1 X_1 + a_2 X_2 \sim
N\!\left(a_1\mu_1 + a_2\mu_2,\;
a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + 2a_1 a_2\,\sigma_{12}\right),
$$
where $\sigma_{12} = \Cov(X_1, X_2)$.
:::
This property is special. Most distributions *change shape* when you add them
together, but the Normal does not: any linear combination of jointly normal
variables is again Normal. The mean and variance follow exactly the rules from
the previous chapter; closure just hands us the bell shape on top. We will lean
on this fact for the sample mean in a moment.
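A simulation makes the mean and variance formulas concrete. The sketch below is a rough check, not a proof: every parameter value is hypothetical, and the jointly normal pair is built by hand from two independent standard Normals so that only base R is needed.

```{r}
set.seed(1)
n_sim <- 1e5

# Hypothetical parameters, chosen only for illustration.
mu1 <- 1;  sd1 <- 2
mu2 <- -1; sd2 <- 1
rho <- 0.5                       # correlation between X1 and X2
s12 <- rho * sd1 * sd2           # covariance sigma_12
a1 <- 2; a2 <- 3

# Build a jointly normal pair from two independent standard Normals.
z1 <- rnorm(n_sim)
z2 <- rnorm(n_sim)
x1 <- mu1 + sd1 * z1
x2 <- mu2 + sd2 * (rho * z1 + sqrt(1 - rho^2) * z2)

w <- a1 * x1 + a2 * x2

# Simulated moments next to the values the formula predicts.
rbind(
  simulated = c(mean = mean(w), var = var(w)),
  formula   = c(mean = a1 * mu1 + a2 * mu2,
                var  = a1^2 * sd1^2 + a2^2 * sd2^2 + 2 * a1 * a2 * s12)
)
```

With these inputs the formula gives mean $-1$ and variance $16 + 9 + 12 = 37$; the simulated values should land close to both, up to sampling noise.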
### Three properties of the bivariate Normal
When $X$ and $Y$ are *jointly* normal, three useful facts hold.
::: {.property title="The jointly normal pair"}
1. Each **marginal** is normal: $X \sim N(\mu_X,\sigma_X^2)$ and
$Y \sim N(\mu_Y,\sigma_Y^2)$.
2. **Zero covariance implies independence.** (Recall this is *false* in general —
it is a special gift of the Normal.)
3. The conditional mean is **linear** in the conditioning variable:
$$
\E(Y \given X = x) = \alpha + \beta x,
\qquad \beta = \frac{\sigma_{XY}}{\sigma_X^2},
\qquad \alpha = \mu_Y - \beta\,\mu_X .
$$
:::
Property 3 deserves a second look. The conditional mean $\E(Y \given X)$ is a
straight line whose slope is $\Cov(X,Y)/\Var(X)$ — exactly the regression slope
previewed in the last chapter. This **linear regression function** is where the
whole course is heading; we build it from scratch in the [next
chapter](05-simple-regression.qmd).
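The slope formula can be seen at work in simulated data. This is a sketch under assumed values (intercept $1$, slope $0.5$, both hypothetical): $Y$ is generated as a line in $X$ plus independent Normal noise, which makes the pair jointly normal. The ratio of sample covariance to sample variance is then compared with the least-squares slope from `lm`.

```{r}
set.seed(2)
x <- rnorm(5000, mean = 0, sd = 2)
y <- 1 + 0.5 * x + rnorm(5000)   # hypothetical line: alpha = 1, beta = 0.5

slope_moments <- cov(x, y) / var(x)        # sample analogue of Cov(X, Y) / Var(X)
slope_lm <- unname(coef(lm(y ~ x))["x"])   # least-squares slope

c(moments = slope_moments, lm = slope_lm)
```

The two numbers agree to machine precision, because least squares computes exactly this ratio, and both sit near the assumed slope of $0.5$.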
::: {.callout-note appearance="simple"}
**A signpost: relatives of the Normal.** Three distributions *built from* the
Normal run our later inference, and we will meet each properly when we need it.
The **chi-squared** $\chi^2_m$ is the sum of $m$ independent squared standard
Normals, and shows up in variance and joint tests. **Student's $t$** with $m$
degrees of freedom is bell-shaped but *fatter-tailed* than the Normal and
approaches $N(0,1)$ as $m \to \infty$; we switch to it when we estimate $\sigma$
rather than know it (the [confidence-intervals
chapter](09-confidence-intervals.qmd)). The **$F$** distribution with $(m,n)$
degrees of freedom is a ratio of scaled chi-squareds, used to test *several*
restrictions at once (the [$F$-tests chapter](17-ftests.qmd)). For now, just
remember that the $t$ behaves like a slightly wider Normal, which we adopt once
$\sigma$ is unknown.
:::
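How much wider is the $t$? One way to see it is to compare $97.5\%$ quantiles, using R's `qt` and `qnorm`. The degrees of freedom below are arbitrary illustrative choices.

```{r}
dof <- c(5, 30, 120, 1000)
t_crit <- qt(0.975, df = dof)    # 97.5% quantile of Student's t
round(setNames(t_crit, paste0("df=", dof)), 3)

qnorm(0.975)                     # the Normal benchmark, about 1.96
```

The $t$ cutoffs always exceed $1.96$ and fall toward it as the degrees of freedom grow.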
## Random sampling & the sample mean {#sec-sampling}
All of our methods rest on *how the data were drawn*. The simplest and most
important sampling scheme is the one that makes the observations independent and
identically distributed.
::: {.definition title="Simple random sampling and i.i.d. data"}
Draw $n$ observations $Y_1,\dots,Y_n$ at random from a population with mean
$\mu$ and variance $\sigma^2$. Then they are **i.i.d.**:
- **identically distributed** — each $Y_i$ has the population's distribution,
with mean $\mu$ and variance $\sigma^2$;
- **independent** — knowing the value of one observation tells you nothing about
the others.
:::
Notice the shift in perspective. *Before* we look at the data, each $Y_i$ is a
random variable; *after* sampling, it is a recorded number. Different draws would
have produced different numbers — and that is the source of **sampling
variation**, the central object of everything that follows.
### The sample mean is a random variable
Our estimator of the population mean $\mu$ is the **sample mean**
$$
\bar Y = \frac{1}{n}\sum_{i=1}^{n} Y_i .
$$
Because the $Y_i$ are random, $\bar Y$ is **itself random**: a different sample
yields a different $\bar Y$. The distribution of $\bar Y$ over all possible
samples is called its **sampling distribution**.
A concrete illustration: in the HGL hip-width data, ten samples of size $50$ drawn
from the same population gave sample means ranging from $16.75$ to $17.41$ —
same population, a different $\bar y$ each time.
::: {.keyidea title="The key shift in thinking"}
We stop asking "is *this* estimate right?" — which is unanswerable — and instead
ask "how does the *procedure* $\bar Y$ behave across samples?" That second
question we *can* answer, through the mean and variance of $\bar Y$.
:::
### Mean and variance of $\bar Y$
Apply the previous chapter's rules to $\bar Y = \tfrac1n\sum_i Y_i$ with i.i.d.
draws. The mean follows from linearity alone — no independence needed:
$$
\E(\bar Y) = \frac1n\sum_{i=1}^n \E(Y_i) = \frac1n\,(n\mu) = \mu .
$$
So $\bar Y$ is **unbiased**: on average across samples it equals $\mu$. The
variance uses independence, which makes every $\Cov(Y_i, Y_j) = 0$ for $i \ne j$:
$$
\Var(\bar Y) = \frac{1}{n^2}\sum_{i=1}^n \Var(Y_i)
= \frac{1}{n^2}\,(n\sigma^2)
= \frac{\sigma^2}{n}.
$$
The standard deviation of the sample mean is therefore
$\sigma_{\bar Y} = \sigma/\sqrt{n}$, called the **standard error**.
::: {.keyidea title="Read these off"}
The sampling distribution of $\bar Y$ is centered at the truth $\mu$, and its
spread $\sigma/\sqrt{n}$ **shrinks as $n$ grows**. More data means $\bar Y$
clusters more tightly around $\mu$. These two facts hold for *any* population
distribution with a finite mean and variance.
:::
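Both facts can be checked by brute force. The sketch below assumes a deliberately skewed population, an exponential with rate $1$ (so $\mu = 1$ and $\sigma^2 = 1$), and draws many samples of size $n = 25$; the population, sample size, and number of replications are all illustrative choices.

```{r}
set.seed(3)
n <- 25          # sample size
reps <- 20000    # number of simulated samples

# Hypothetical skewed population: exponential with rate 1, so mu = 1, sigma^2 = 1.
samples <- matrix(rexp(n * reps, rate = 1), nrow = reps)
ybar <- rowMeans(samples)        # one sample mean per row

c(mean_ybar = mean(ybar),   # should sit near mu = 1
  var_ybar  = var(ybar),    # should sit near sigma^2 / n = 0.04
  sd_ybar   = sd(ybar))     # should sit near sigma / sqrt(n) = 0.2
```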
### If the population is Normal, so is $\bar Y$
Since $\bar Y$ is a linear combination of the $Y_i$, closure under linear
combinations gives an *exact* result whenever the population itself is Normal:
$$
Y_i \sim N(\mu,\sigma^2) \quad\Longrightarrow\quad
\bar Y \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right).
$$
::: {.example title="Precision and sample size (HGL)"}
Suppose the population is Normal with $\sigma^2 = 10$. With $n = 40$ we have
$\bar Y \sim N(\mu,\,0.25)$, since $\sigma^2/n = 10/40 = 0.25$, so
$$
\Prob\!\left(|\bar Y - \mu| \le 1\right)
= \Prob(-2 \le Z \le 2) = 0.954 .
$$
Raising $n$ to $80$ halves the variance to $0.125$, so the bound of $1$ becomes
$1/\sqrt{0.125} \approx 2.83$ standard errors and the probability rises to about
$0.995$. More data, more precision.
:::
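The two probabilities in the example follow from one formula, $2\,\Phi(1/\sigma_{\bar Y}) - 1$, which is simple to evaluate for both sample sizes at once. A minimal sketch with names of our own choosing:

```{r}
sigma2 <- 10
n <- c(40, 80)
se <- sqrt(sigma2 / n)                   # standard deviation of the sample mean
prob_within_1 <- 2 * pnorm(1 / se) - 1   # P(|Ybar - mu| <= 1), by symmetry

round(data.frame(n = n, se = se, prob_within_1 = prob_within_1), 4)
```

Doubling $n$ cuts the standard error by a factor of $\sqrt{2}$, not $2$, so precision improves more slowly than the sample size grows.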
## Law of large numbers & the CLT {#sec-lln-clt}
We now have the center and spread of $\bar Y$. Two limit theorems describe what
happens to $\bar Y$ as the sample grows without bound.
::: {.property title="Law of large numbers (LLN)"}
As the sample size grows, the sample mean converges in probability to the
population mean:
$$
\bar Y \;\xrightarrow{\;p\;}\; \mu \qquad \text{as } n \to \infty .
$$
:::
This is the formal "law of averages": with many draws, high and low values
balance out and $\bar Y$ settles on $\mu$. It is the reason large samples are
trustworthy, and it makes $\bar Y$ a **consistent** estimator of $\mu$. But the
LLN only says that $\bar Y$ *gets close* to $\mu$ — it says nothing about the
*shape* of $\bar Y$'s distribution around $\mu$. For inference we need that
shape, and that is what the Central Limit Theorem supplies.
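The LLN is easy to watch happen. The sketch below uses a hypothetical population, rolls of a fair die with $\mu = 3.5$, and tracks the running sample mean as the number of rolls grows.

```{r}
set.seed(4)
rolls <- sample(1:6, size = 1e5, replace = TRUE)  # fair die, so mu = 3.5
running_mean <- cumsum(rolls) / seq_along(rolls)

# Running mean after 10, 100, ..., 100000 rolls.
checkpoints <- 10^(1:5)
round(setNames(running_mean[checkpoints], checkpoints), 3)
```

Early values can wander noticeably; by the last checkpoint the running mean should be very close to $3.5$.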
::: {.property title="Central Limit Theorem (CLT)"}
If $Y_1,\dots,Y_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then the
**standardized** sample mean converges in distribution to a standard Normal:
$$
\frac{\bar Y - \mu}{\sigma/\sqrt{n}} \;\xrightarrow{\;d\;}\; N(0,1)
\qquad\text{as } n \to \infty,
$$
so for large $n$, $\ \bar Y \overset{a}{\sim} N\!\left(\mu,\ \sigma^2/n\right)$.
:::
The remarkable part is the generality. The CLT holds **whatever the population
distribution** — skewed, discrete, fat-tailed, anything — as long as $\sigma^2$
is finite. The bell curve emerges from the *averaging*, not from any bell shape
in the data themselves.
::: {.callout-note appearance="simple"}
**Rule of thumb.** A sample of $n \ge 30$ is usually enough for the Normal
approximation to be good; real samples in the hundreds or thousands make it
excellent.
:::
### The CLT in action
To see how striking this is, take a decidedly *non*-Normal population: the
triangular density $f(y) = 2y$ on $[0,1]$ (HGL's example), which piles its mass
toward $1$ and so is plainly skewed. Now repeatedly draw a sample, average it, and
standardize the
average. Even for a sample as small as $n = 10$, the histogram of the
standardized means is already bell-shaped and centered at $0$, and it only
sharpens toward $N(0,1)$ as $n$ grows. @fig-clt-action contrasts the skewed
population with the near-Normal distribution of its standardized sample mean.
```{r}
#| label: fig-clt-action
#| fig-cap: "Left: a skewed population $f(y) = 2y$ on $[0,1]$. Right: the standardized sample mean for $n = 10$ is already close to $N(0, 1)$."
#| fig-width: 3.2
#| fig-height: 3.0
#| layout-ncol: 2
# Left panel: the skewed triangular population f(y) = 2y on [0, 1].
pop <- data.frame(y = seq(0, 1, length.out = 200))
pop$f <- 2 * pop$y
p_pop <- ggplot(pop, aes(y, f)) +
geom_area(fill = ucla$red, alpha = 0.25) +
geom_line(color = ucla$red, linewidth = 1) +
scale_x_continuous(breaks = c(0, 1)) +
scale_y_continuous(limits = c(0, 2.3)) +
labs(title = "population: f(y) = 2y (skewed)", x = "y", y = NULL)
# Right panel: simulate the standardized sample mean for n = 10.
set.seed(103)
n <- 10
mu <- 2 / 3 # mean of the triangular population
v <- 1 / 18 # variance of the triangular population
draws <- matrix(sqrt(runif(n * 5000)), nrow = 5000) # inverse-cdf of f(y) = 2y
zbar <- (rowMeans(draws) - mu) / sqrt(v / n)
sim <- data.frame(z = zbar)
zs <- seq(-3.4, 3.4, length.out = 200)
ref <- data.frame(z = zs, f = dnorm(zs))
p_sim <- ggplot(sim, aes(z)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = ucla$blue, color = ucla$darkblue, alpha = 0.7) +
geom_line(data = ref, aes(z, f), color = ucla$darkblue, linewidth = 1) +
scale_x_continuous(limits = c(-3.4, 3.4), breaks = c(-2, 0, 2)) +
labs(title = "std. mean for n = 10 -> N(0, 1)", x = "z", y = NULL)
p_pop
p_sim
```
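A histogram shows the shape; a single number can summarize how good the approximation is. The sketch below reuses the same triangular population and asks what share of standardized sample means falls inside $\pm 1.96$, which the CLT says should approach $0.95$. The sample sizes and replication count are illustrative choices.

```{r}
set.seed(5)
mu <- 2 / 3; v <- 1 / 18         # mean and variance of f(y) = 2y on [0, 1]
reps <- 20000

coverage <- sapply(c(2, 10, 50), function(n) {
  draws <- matrix(sqrt(runif(n * reps)), nrow = reps)  # inverse-cdf draws
  z <- (rowMeans(draws) - mu) / sqrt(v / n)            # standardized means
  mean(abs(z) <= 1.96)           # share of standardized means inside +/- 1.96
})
round(setNames(coverage, c("n=2", "n=10", "n=50")), 3)
```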
## The bridge to inference {#sec-bridge}
Putting the pieces together shows why this chapter is the gateway to everything
that follows. For large $n$, the CLT gives
$\dfrac{\bar Y - \mu}{\sigma/\sqrt n} \approx N(0,1)$, so applying the $95\%$
rule,
$$
\Prob\!\left(-1.96 \;\le\; \frac{\bar Y - \mu}{\sigma/\sqrt n} \;\le\; 1.96\right)
\;\approx\; 0.95 .
$$
This single statement can be read in two complementary ways.
::: {.keyidea title="Rearrange one way: confidence intervals"}
Solving the inequality for $\mu$ brackets it inside a **confidence interval**
$\bar Y \pm 1.96\,\sigma/\sqrt n$. We develop this in the
[confidence-intervals chapter](09-confidence-intervals.qmd).
:::
::: {.keyidea title="Read it another way: hypothesis tests"}
Comparing $\bar Y$ to a hypothesized value of $\mu$ gives a **hypothesis test** —
is the standardized gap beyond $\pm 1.96$? We develop this in the
[hypothesis-testing chapter](10-hypothesis-testing.qmd).
:::
And because regression estimators are themselves (weighted) averages, the *same*
CLT logic will make *their* sampling distributions Normal too — which is how the
entire apparatus of regression inference gets off the ground.
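Both readings can be previewed in a few lines. The sketch below is a toy illustration under stated assumptions: the data are simulated, $\sigma$ is treated as known, and the "true" mean and the hypothesized value $\mu_0 = 5$ are hypothetical.

```{r}
set.seed(6)
sigma <- 2                       # treated as known, as in the formula above
mu_true <- 5                     # hypothetical truth, unknown in practice
y <- rnorm(100, mean = mu_true, sd = sigma)

ybar <- mean(y)
se <- sigma / sqrt(length(y))

ci <- ybar + c(-1, 1) * 1.96 * se   # 95% confidence interval for mu
z_stat <- (ybar - 5) / se           # standardized gap under H0: mu = 5
reject <- abs(z_stat) > 1.96        # two-sided test at the 5% level

list(ci = round(ci, 3), z_stat = round(z_stat, 3), reject = reject)
```

The two readings always agree: the test rejects $\mu_0$ exactly when $\mu_0$ falls outside the interval.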
### A caution: when "Normal" fails
The Normal is powerful, but assuming it blindly is dangerous.
::: {.warningbox title="The Swiss franc, 15 January 2015 (Stock & Watson)"}
On a single day in January 2015 the euro fell $17.5\%$ against the Swiss franc —
a move of about $156$ standard deviations. Under a Normal model, an event that
extreme has probability on the order of $10^{-5000}$: effectively impossible. Yet
it happened.
:::
The lesson is that real financial returns have **fat tails**, so extreme moves
are far more common than a Normal would predict. Keep the scope of the CLT
straight: it is a statement about the *sample mean's* distribution, not a license
to assume the *data themselves* are Normal.
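A probability this small cannot be stored as an ordinary floating-point number, but its order of magnitude can be recovered on the log scale. The sketch below assumes a one-sided tail, $\Prob(Z \le -156)$, and uses the `log.p` argument of `pnorm`.

```{r}
pnorm(-156)    # underflows to 0 in double precision

# Work on the log scale instead: log.p = TRUE returns log P(Z <= -156).
log10_tail <- pnorm(-156, log.p = TRUE) / log(10)
log10_tail     # base-10 exponent of the tail probability
```

The exponent comes out in the neighborhood of $-5300$, consistent with the order of magnitude quoted above.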
## Recap {#sec-recap}
The **Normal distribution** $N(\mu,\sigma^2)$ is the symmetric bell curve with
location $\mu$ and spread $\sigma^2$. We standardize with
$Z = (X-\mu)/\sigma \sim N(0,1)$, read probabilities off its cdf $\Phi(\cdot)$,
and remember that about $95\%$ of the probability lies within $\pm 1.96\sigma$ of the
mean. The key quantiles to keep are $1.645$, $1.96$, and $2.58$, and linear
combinations of jointly normal variables are again Normal.
For the **sample mean** of i.i.d. draws,
$$
\E(\bar Y) = \mu, \qquad \Var(\bar Y) = \frac{\sigma^2}{n},
$$
so $\bar Y$ is unbiased and grows more precise as $n$ rises. If the population is
Normal, then $\bar Y \sim N(\mu,\sigma^2/n)$ *exactly*. In general, the **LLN**
gives $\bar Y \xrightarrow{p} \mu$, and the **CLT** gives
$\tfrac{\bar Y - \mu}{\sigma/\sqrt n} \xrightarrow{d} N(0,1)$ for *any*
population with finite variance.
The bridge to inference is the statement
$$
\Prob\!\left(-1.96 \le \frac{\bar Y - \mu}{\sigma/\sqrt n} \le 1.96\right)
\approx 0.95,
$$
which becomes confidence intervals and hypothesis tests in the chapters ahead.
**Next time:** we leave one variable behind and study how one variable depends
on another — the [Simple Linear Regression Model](05-simple-regression.qmd).