\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

4  The Normal Distribution, Sampling & the CLT

Reading. SW 2.4–2.6, HGL Probability Primer P.7 & Appendix C

In the last chapter we learned how to summarize a distribution with its mean and variance, how those quantities combine, and how to standardize a variable to mean \(0\) and variance \(1\) via \(Z = (X-\mu)/\sigma\). This chapter closes out the probability toolkit and brings us to the doorstep of inference. We meet the Normal distribution (the bell curve) and learn to read probabilities off it; we study the sample mean \(\bar Y\) as a random variable with its own distribution; and we arrive at the Central Limit Theorem, the reason the Normal turns up everywhere.

The payoff is concrete. By the end of the chapter we can say how close \(\bar Y\) is likely to be to the truth \(\mu\). That single fact powers every confidence interval and hypothesis test in the rest of the course.

4.1 The Normal distribution

Some distributions are special enough to earn a name. The most important of all is the Normal.

Normal distribution

\(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), written \(X \sim N(\mu,\sigma^2)\), if its density is the bell curve \[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad -\infty < x < \infty . \]

Two features are worth fixing in mind. First, the Normal is symmetric and centered at \(\mu\), so its mean equals its median and its skewness is \(0\). Second, the two parameters play distinct roles: \(\mu\) sets the location of the curve, while \(\sigma^2\) sets its spread.

Changing the parameters simply slides and stretches the same bell shape. Moving \(\mu\) shifts the center left or right; raising \(\sigma^2\) flattens and widens the curve, while shrinking it makes the curve tall and tight. Throughout, the total area under each curve stays equal to \(1\), so a wider curve is necessarily a shorter one. Figure 4.1 shows three members of the family.

Show the R code
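library(ggplot2)  # `ucla` (the colour palette) is assumed to be defined in the book's setup code
xs <- seq(-6, 8, length.out = 400)
# dnorm() takes the standard deviation, not the variance: sd = 2 gives N(0, 4).
fam <- rbind(
  data.frame(x = xs, f = dnorm(xs, 0, 1), dist = "N(0, 1)"),
  data.frame(x = xs, f = dnorm(xs, 2, 1), dist = "N(2, 1)"),
  data.frame(x = xs, f = dnorm(xs, 0, 2), dist = "N(0, 4)")
)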
ggplot(fam, aes(x, f, color = dist, linetype = dist)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c(ucla$blue, ucla$darkblue, ucla$red)) +
  scale_linetype_manual(values = c("solid", "dashed", "dotted")) +
  labs(x = "x", y = expression(f[X](x)), color = NULL, linetype = NULL)
Figure 4.1: Same family, different parameters: \(\mu\) moves the center, \(\sigma^2\) controls the spread.

A fact worth memorizing: the 95% rule

For any Normal, about 95% of the probability lies within \(1.96\) standard deviations of the mean: \[ \Prob\!\left(\mu - 1.96\,\sigma \le X \le \mu + 1.96\,\sigma\right) \approx 0.95 . \]

It is convenient to keep the round-number version in your head as well: about \(68\%\) of the probability falls within \(\pm 1\sigma\) of the mean, about \(95\%\) within \(\pm 2\sigma\), and about \(99.7\%\) within \(\pm 3\sigma\). Figure 4.2 shades the central \(95\%\).

Show the R code
xs  <- seq(-3.6, 3.6, length.out = 400)
dat <- data.frame(x = xs, y = dnorm(xs))
sh  <- subset(dat, x >= -1.96 & x <= 1.96)
ggplot(dat, aes(x, y)) +
  geom_area(data = sh, aes(x, y), fill = ucla$blue, alpha = 0.30) +
  geom_line(color = ucla$blue, linewidth = 1) +
  annotate("text", x = 0, y = 0.16, label = "95%",
           color = ucla$darkblue, size = 4) +
  scale_x_continuous(
    breaks = c(-1.96, 0, 1.96),
    labels = c(expression(mu - 1.96 * sigma), expression(mu),
               expression(mu + 1.96 * sigma))
  ) +
  scale_y_continuous(limits = c(0, 0.45)) +
  labs(x = expression(x ~ "(in units of " * sigma ~ "from " * mu * ")"),
       y = expression(f[X](x)))
Figure 4.2: The 95% rule: about 95% of a Normal’s probability lies within \(\pm 1.96\sigma\) of the mean.
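The percentages quoted above are easy to check numerically. The following is a minimal sketch in base R that uses only `pnorm`, the standard-Normal cdf; the names `k` and `coverage` are illustrative choices, not part of the text.

```r
# Probability that a Normal variable falls within k standard deviations
# of its mean. After standardizing, this is Phi(k) - Phi(-k) for any mu, sigma.
k <- c(1, 1.96, 2, 3)
coverage <- pnorm(k) - pnorm(-k)
print(round(data.frame(k = k, coverage = coverage), 4))
```

The four values come out at roughly 0.683, 0.950, 0.954 and 0.997, which is where the 68/95/99.7 shorthand comes from and why \(1.96\) rather than \(2\) is the exact \(95\%\) multiplier.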

4.2 Standardizing & the standard Normal

Rather than keep a separate table for every pair \((\mu,\sigma^2)\), we convert every Normal problem to one reference distribution.

Standard Normal

The standard Normal is \(Z \sim N(0,1)\). If \(X \sim N(\mu,\sigma^2)\), then \[ Z = \frac{X-\mu}{\sigma} \sim N(0,1). \]

This is exactly the standardizing move from the previous chapter (subtract the mean, then divide by the standard deviation, to get mean \(0\) and variance \(1\)), now applied to a Normal, which keeps the variable Normal. The cdf of \(Z\) is important enough to get its own symbol, \[ \Phi(z) = \Prob(Z \le z), \] tabulated in the textbook’s Statistical Table 1 and built into R as pnorm. By symmetry of the bell curve around \(0\), the upper tail past \(a\) equals the lower tail before \(-a\): \[ \Prob(Z > a) = \Prob(Z < -a). \]

Reading probabilities off the Normal

To get any Normal probability, standardize, then look up \(\Phi\). For \(X \sim N(\mu,\sigma^2)\) and constants \(a < b\), three rules cover everything.

Three rules for Normal probabilities

\[ \begin{aligned} \Prob(X \le a) &= \Phi\!\left(\tfrac{a-\mu}{\sigma}\right),\\[4pt] \Prob(X \ge a) &= 1 - \Phi\!\left(\tfrac{a-\mu}{\sigma}\right),\\[4pt] \Prob(a \le X \le b) &= \Phi\!\left(\tfrac{b-\mu}{\sigma}\right) - \Phi\!\left(\tfrac{a-\mu}{\sigma}\right). \end{aligned} \]

Everything reduces to standard-Normal cdf values \(\Phi(\cdot)\), which is why a single table, or one R command (pnorm), does all the work.
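As a sketch of how the three rules translate into R, the snippet below evaluates each one with `pnorm`. The distribution \(N(10, 4)\) and the endpoints \(a = 9\), \(b = 13\) are made-up values chosen only for illustration.

```r
# Illustrative values (not from the text): X ~ N(10, 4), so sigma = 2.
mu <- 10; sigma <- 2
a <- 9;  b <- 13

# Rule 1: P(X <= a)
p_le  <- pnorm((a - mu) / sigma)
# Rule 2: P(X >= a)
p_ge  <- 1 - pnorm((a - mu) / sigma)
# Rule 3: P(a <= X <= b)
p_mid <- pnorm((b - mu) / sigma) - pnorm((a - mu) / sigma)

# pnorm() can also standardize for us when given the mean and the
# standard deviation (note: sd, not variance).
p_mid_direct <- pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)

print(c(p_le = p_le, p_ge = p_ge, p_mid = p_mid, p_mid_direct = p_mid_direct))
```

Standardizing by hand and passing `mean` and `sd` to `pnorm` give the same answer; the second form simply hides the standardizing step.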

A worked probability

Let \(X \sim N(3,\,9)\), so \(\mu = 3\) and \(\sigma = 3\). Find \(\Prob(4 \le X \le 6)\).

First standardize the endpoints: \[ \tfrac{4-3}{3} \approx 0.33, \qquad \tfrac{6-3}{3} = 1. \] Then take the difference of cdf values: \[ \begin{aligned} \Prob(4 \le X \le 6) &= \Phi(1) - \Phi(0.33)\\ &= 0.8413 - 0.6293 = 0.2120. \end{aligned} \] The answer is the shaded area between \(4\) and \(6\) under the \(N(3,9)\) density in Figure 4.3.

Show the R code
xs  <- seq(-6, 12, length.out = 400)
dat <- data.frame(x = xs, y = dnorm(xs, 3, 3))
sh  <- subset(dat, x >= 4 & x <= 6)
ggplot(dat, aes(x, y)) +
  geom_area(data = sh, aes(x, y), fill = ucla$blue, alpha = 0.30) +
  geom_line(color = ucla$blue, linewidth = 1) +
  annotate("text", x = 5, y = 0.045, label = "0.21",
           color = ucla$darkblue, size = 3.6) +
  scale_x_continuous(breaks = c(3, 4, 6)) +
  scale_y_continuous(limits = c(0, 0.16)) +
  labs(x = "x", y = expression(f[X](x)))
Figure 4.3: Shaded area between 4 and 6 under the \(N(3, 9)\) density, equal to about 0.21.
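The worked probability can be reproduced in R. This sketch computes it two ways: once with the z-scores rounded to two decimals, as one would when reading a printed table, and once at full precision. The variable names are illustrative.

```r
mu <- 3; sigma <- 3   # X ~ N(3, 9): the sd is sqrt(9) = 3

# Table-style calculation: round the z-scores to two decimals first.
z_lo <- round((4 - mu) / sigma, 2)   # 0.33
z_hi <- round((6 - mu) / sigma, 2)   # 1.00
p_table <- pnorm(z_hi) - pnorm(z_lo)

# Full-precision calculation, letting pnorm() do the standardizing.
p_exact <- pnorm(6, mean = mu, sd = sigma) - pnorm(4, mean = mu, sd = sigma)

print(round(c(table = p_table, exact = p_exact), 4))
```

The two answers differ slightly in the third decimal place (about 0.212 versus 0.211) only because the table lookup rounds \(1/3\) to \(0.33\); both round to the \(0.21\) shown in the figure.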

Key percentiles you will reuse all term

We will constantly need the value \(z_\alpha\) for which \(\Prob(Z \le z_\alpha) = \alpha\), that is, the \(\alpha\)-quantile of the standard Normal. The handful of values in Table 4.1 recur throughout the course.

Show the R code
z_tab <- data.frame(
  alpha = c(0.90, 0.95, 0.975, 0.99, 0.995),
  z     = c(1.28, 1.645, 1.96, 2.33, 2.58)
)
knitr::kable(z_tab, col.names = c("$\\alpha$", "$z_\\alpha$"), align = "cc")
Table 4.1: Standard-Normal quantiles \(z_\alpha\) with \(\Prob(Z \le z_\alpha) = \alpha\).
\(\alpha\) \(z_\alpha\)
0.900 1.280
0.950 1.645
0.975 1.960
0.990 2.330
0.995 2.580
The three to memorize

\[ 1.645,\qquad 1.96,\qquad 2.58. \]

Because the Normal is symmetric, a two-sided \(95\%\) range uses \(\pm 1.96\), leaving \(2.5\%\) of the probability in each tail. That is the source of the “\(1.96\)” in the \(95\%\) rule above, and of the confidence intervals we build in the confidence-intervals chapter.
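Rather than typing the quantiles by hand, they can be recovered with `qnorm`, the inverse of `pnorm`. A minimal sketch (the object names are illustrative):

```r
alpha <- c(0.90, 0.95, 0.975, 0.99, 0.995)
z <- qnorm(alpha)              # solves P(Z <= z) = alpha for z
print(round(data.frame(alpha = alpha, z = z), 3))

# Two-sided 95% range: 2.5% of the probability in each tail.
print(c(lower = qnorm(0.025), upper = qnorm(0.975)))
```

To three decimals the quantiles are 1.282, 1.645, 1.960, 2.326 and 2.576; the table's 1.28, 2.33 and 2.58 are the usual two-decimal roundings.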

4.3 Linear combinations of Normals

In the last chapter we found the mean and variance of a linear combination of random variables. For Normals we now get the shape for free.

Closure under linear combinations

If \(X_1 \sim N(\mu_1,\sigma_1^2)\) and \(X_2 \sim N(\mu_2,\sigma_2^2)\) are jointly normal, then for constants \(a_1, a_2\), \[ a_1 X_1 + a_2 X_2 \sim N\!\left(a_1\mu_1 + a_2\mu_2,\; a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + 2a_1 a_2\,\sigma_{12}\right), \] where \(\sigma_{12} = \Cov(X_1, X_2)\).

This property is special. Most distributions change shape when you add them together, but the Normal does not: any linear combination of jointly normal variables is again Normal. The mean and variance follow exactly the rules from the previous chapter; closure just hands us the bell shape on top. We will lean on this fact for the sample mean in a moment.
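A quick simulation can make the closure property tangible. The sketch below builds a jointly normal pair from two independent standard Normals, forms a linear combination, and compares its simulated mean and variance with the formula. All parameter values are assumptions chosen for illustration.

```r
set.seed(1)
n_sim <- 200000

# Illustrative parameters (assumptions, not from the text).
mu1 <- 1;  s1 <- 2      # X1 ~ N(1, 4)
mu2 <- -3; s2 <- 1.5    # X2 ~ N(-3, 2.25)
rho <- 0.4              # correlation, so sigma_12 = rho * s1 * s2
a1 <- 2;  a2 <- -1

# Build a jointly normal pair from two independent standard Normals.
z1 <- rnorm(n_sim); z2 <- rnorm(n_sim)
x1 <- mu1 + s1 * z1
x2 <- mu2 + s2 * (rho * z1 + sqrt(1 - rho^2) * z2)

w <- a1 * x1 + a2 * x2

theory_mean <- a1 * mu1 + a2 * mu2
theory_var  <- a1^2 * s1^2 + a2^2 * s2^2 + 2 * a1 * a2 * rho * s1 * s2

print(rbind(simulated = c(mean = mean(w), var = var(w)),
            theory    = c(mean = theory_mean, var = theory_var)))

# Shape check: the standardized combination should put about 95% of
# its draws within +/- 1.96, as a Normal would.
share_95 <- mean(abs((w - theory_mean) / sqrt(theory_var)) <= 1.96)
print(share_95)
```

With these values the formula gives mean \(5\) and variance \(13.45\); the simulated moments should land close to both, and the share inside \(\pm 1.96\) close to \(0.95\).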

Three properties of the bivariate Normal

When \(X\) and \(Y\) are jointly normal, three useful facts hold.

The jointly normal pair
  1. Each marginal is normal: \(X \sim N(\mu_X,\sigma_X^2)\) and \(Y \sim N(\mu_Y,\sigma_Y^2)\).
  2. Zero covariance implies independence. (Recall this is false in general; it is a special property of jointly normal variables.)
  3. The conditional mean is linear in the conditioning variable: \[ \E(Y \given X = x) = \alpha + \beta x, \qquad \beta = \frac{\sigma_{XY}}{\sigma_X^2}, \qquad \alpha = \mu_Y - \beta\,\mu_X . \]

Property 3 deserves a second look. The conditional mean \(\E(Y \given X)\) is a straight line whose slope is \(\Cov(X,Y)/\Var(X)\), exactly the regression slope previewed in the last chapter. This linear regression function is where the whole course is heading; we build it from scratch in the next chapter.
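To see the link between the conditional mean and regression, one can simulate a jointly normal pair and compare the covariance ratio with a fitted least-squares line. This is a sketch under assumed parameter values; `lm` is base R's least-squares fitter, which the course introduces properly later.

```r
set.seed(2)
n_sim <- 100000

# Illustrative jointly normal pair (parameter values are assumptions).
mu_x <- 2; s_x <- 1.5
mu_y <- 5; s_y <- 2
rho  <- 0.6
x <- mu_x + s_x * rnorm(n_sim)
y <- mu_y + s_y * (rho * (x - mu_x) / s_x + sqrt(1 - rho^2) * rnorm(n_sim))

# Population values implied by the parameters.
beta  <- rho * s_x * s_y / s_x^2      # Cov(X, Y) / Var(X)
alpha <- mu_y - beta * mu_x           # intercept of the conditional mean

# Sample analogues: the covariance ratio and the least-squares line.
slope_cov <- cov(x, y) / var(x)
fit <- lm(y ~ x)

print(rbind(population = c(alpha, beta), least_squares = coef(fit)))
```

The sample covariance ratio and the least-squares slope agree exactly, and both sit close to the population slope (here \(0.8\), with intercept \(3.4\)).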

A signpost: relatives of the Normal. Three distributions built from the Normal run our later inference, and we will meet each properly when we need it. The chi-squared \(\chi^2_m\) is the sum of \(m\) independent squared standard Normals, and shows up in variance and joint tests. Student’s \(t\) with \(m\) degrees of freedom is bell-shaped but fatter-tailed than the Normal and approaches \(N(0,1)\) as \(m \to \infty\); we switch to it when we estimate \(\sigma\) rather than know it (the confidence-intervals chapter). The \(F\) distribution with \((m,n)\) degrees of freedom is a ratio of scaled chi-squareds, used to test several restrictions at once (the \(F\)-tests chapter). For now, just remember that the \(t\) looks like a slightly wider Normal, and that we adopt it once \(\sigma\) is unknown.
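One way to see how quickly the \(t\) approaches the Normal is to compare their \(97.5\%\) quantiles. The sketch below uses base R's `qt` and `qnorm`; the particular degrees of freedom listed are arbitrary illustrative choices.

```r
df <- c(5, 10, 30, 100, 1000)
t_crit <- qt(0.975, df)       # 97.5% quantile of Student's t
z_crit <- qnorm(0.975)        # about 1.96

print(round(data.frame(df = df, t_crit = t_crit, excess = t_crit - z_crit), 3))
```

The \(t\) critical value always exceeds \(1.96\), reflecting the fatter tails, but the gap shrinks steadily as the degrees of freedom grow.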

4.4 Random sampling & the sample mean

All of our methods rest on how the data were drawn. The simplest and most important sampling scheme is the one that makes the observations independent and identically distributed.

Simple random sampling and i.i.d. data

Draw \(n\) observations \(Y_1,\dots,Y_n\) at random from a population with mean \(\mu\) and variance \(\sigma^2\). Then they are i.i.d.:

  • identically distributed: each \(Y_i\) has the population’s distribution, with mean \(\mu\) and variance \(\sigma^2\);
  • independent: knowing the value of one observation tells you nothing about the others.

Notice the shift in perspective. Before we look at the data, each \(Y_i\) is a random variable; after sampling, it is a recorded number. Different draws would have produced different numbers, and that is the source of sampling variation, the central object of everything that follows.

The sample mean is a random variable

Our estimator of the population mean \(\mu\) is the sample mean \[ \bar Y = \frac{1}{n}\sum_{i=1}^{n} Y_i . \] Because the \(Y_i\) are random, \(\bar Y\) is itself random: a different sample yields a different \(\bar Y\). The distribution of \(\bar Y\) over all possible samples is called its sampling distribution.

A concrete illustration: in the HGL hip-width data, ten samples of size \(50\) drawn from the same population gave sample means ranging from \(16.75\) to \(17.41\): same population, a different \(\bar y\) each time.
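The same experiment can be mimicked by simulation. The sketch below does not use the HGL data; it draws from a stand-in population, \(N(17, 1.8^2)\), whose parameters are assumptions picked only to give numbers on a similar scale.

```r
set.seed(3)
# Stand-in population (an assumption, NOT the HGL data): N(17, 1.8^2).
pop_mean <- 17; pop_sd <- 1.8
n <- 50

# Draw ten independent samples of size n and record each sample mean.
ybar <- replicate(10, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))
print(round(ybar, 2))
print(range(ybar))
```

Each run of `replicate` plays the role of going back to the population for a fresh sample, so the ten values of `ybar` differ even though the population never changes.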

The key shift in thinking

We stop asking “is this estimate right?” (which is unanswerable) and instead ask “how does the procedure \(\bar Y\) behave across samples?” That second question we can answer, through the mean and variance of \(\bar Y\).

Mean and variance of \(\bar Y\)

Apply the previous chapter’s rules to \(\bar Y = \tfrac1n\sum_i Y_i\) with i.i.d. draws. The mean follows from linearity alone (no independence needed): \[ \E(\bar Y) = \frac1n\sum_{i=1}^n \E(Y_i) = \frac1n\,(n\mu) = \mu . \] So \(\bar Y\) is unbiased: on average across samples it equals \(\mu\). The variance uses independence, which makes every \(\Cov(Y_i, Y_j) = 0\) for \(i \ne j\): \[ \Var(\bar Y) = \frac{1}{n^2}\sum_{i=1}^n \Var(Y_i) = \frac{1}{n^2}\,(n\sigma^2) = \frac{\sigma^2}{n}. \] The standard deviation of the sample mean is therefore \(\sigma_{\bar Y} = \sigma/\sqrt{n}\), called the standard error.

Read these off

The sampling distribution of \(\bar Y\) is centered at the truth \(\mu\), and its spread \(\sigma/\sqrt{n}\) shrinks as \(n\) grows. More data means \(\bar Y\) clusters more tightly around \(\mu\). These two facts hold for any population distribution with a finite mean and variance.
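Both facts can be checked by simulation on a population that is far from Normal. The sketch below uses an Exponential population with rate \(1\) (mean \(1\), variance \(1\)); that choice, the sample sizes and the helper `sim_ybar` are all illustrative assumptions.

```r
set.seed(4)
reps <- 20000
mu <- 1; sigma2 <- 1            # Exponential(rate = 1): mean 1, variance 1

# For a given sample size, simulate many samples and summarize the
# resulting sample means.
sim_ybar <- function(n) {
  ybar <- replicate(reps, mean(rexp(n, rate = 1)))
  c(n = n, mean_ybar = mean(ybar), var_ybar = var(ybar), theory_var = sigma2 / n)
}
res <- t(sapply(c(5, 20, 80), sim_ybar))
print(round(res, 4))
```

In each row the average of the simulated sample means should be close to \(\mu = 1\), and their variance close to \(\sigma^2/n\), even though the population itself is strongly skewed.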

If the population is Normal, so is \(\bar Y\)

Since \(\bar Y\) is a linear combination of the \(Y_i\), closure under linear combinations gives an exact result whenever the population itself is Normal: \[ Y_i \sim N(\mu,\sigma^2) \quad\Longrightarrow\quad \bar Y \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right). \]

Precision and sample size (HGL)

Suppose the population is normal with \(\sigma^2 = 10\). With \(n = 40\) we have \(\bar Y \sim N(\mu,\,0.25)\), since \(\sigma^2/n = 10/40 = 0.25\). The standard deviation of \(\bar Y\) is then \(0.5\), so \[ \Prob\!\left(|\bar Y - \mu| \le 1\right) = \Prob(-2 \le Z \le 2) \approx 0.954 . \] Raising \(n\) to \(80\) halves the variance to \(0.125\) and raises this probability to about \(0.995\). More data, more precision.
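The two probabilities in this example follow from one line of R each. The sketch below reproduces them and, as an extra illustrative case not in the text, adds \(n = 160\).

```r
sigma2 <- 10
n <- c(40, 80, 160)                    # 160 is an extra illustrative case
se <- sqrt(sigma2 / n)                 # standard deviation of the sample mean
p_within_1 <- pnorm(1 / se) - pnorm(-1 / se)   # P(|Ybar - mu| <= 1)

print(round(data.frame(n = n, var_ybar = sigma2 / n, se = se,
                       p_within_1 = p_within_1), 4))
```

For \(n = 40\) and \(n = 80\) this gives about \(0.9545\) and \(0.9953\), matching the rounded figures above.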

4.5 Law of large numbers & the CLT

We now have the center and spread of \(\bar Y\). Two limit theorems describe what happens to \(\bar Y\) as the sample grows without bound.

Law of large numbers (LLN)

If \(Y_1,\dots,Y_n\) are i.i.d. with finite mean \(\mu\), then as the sample size grows the sample mean converges in probability to the population mean: \[ \bar Y \;\xrightarrow{\;p\;}\; \mu \qquad \text{as } n \to \infty . \]

This is the formal “law of averages”: with many draws, high and low values balance out and \(\bar Y\) settles on \(\mu\). It is the reason large samples are trustworthy, and it makes \(\bar Y\) a consistent estimator of \(\mu\). But the LLN only says that \(\bar Y\) gets close to \(\mu\); it says nothing about the shape of \(\bar Y\)’s distribution around \(\mu\). For inference we need that shape, and that is what the Central Limit Theorem supplies.
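A running average shows the LLN at work. The sketch below tracks \(\bar Y\) along a single growing sample from an Exponential population with mean \(2\); the population and the checkpoints are illustrative assumptions.

```r
set.seed(5)
# Illustrative population (an assumption): Exponential with mean mu = 2.
mu <- 2
y <- rexp(100000, rate = 1 / mu)

# Running sample mean after 1, 2, ..., n draws.
running_mean <- cumsum(y) / seq_along(y)

checkpoints <- c(10, 100, 1000, 10000, 100000)
print(data.frame(n = checkpoints,
                 ybar = round(running_mean[checkpoints], 4),
                 abs_error = round(abs(running_mean[checkpoints] - mu), 4)))
```

Along one path the error need not fall at every checkpoint, but its typical size shrinks in proportion to \(1/\sqrt{n}\), so by the last checkpoint \(\bar Y\) should sit very close to \(2\).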

Central Limit Theorem (CLT)

If \(Y_1,\dots,Y_n\) are i.i.d. with mean \(\mu\) and variance \(\sigma^2\), then the standardized sample mean converges in distribution to a standard Normal: \[ \frac{\bar Y - \mu}{\sigma/\sqrt{n}} \;\xrightarrow{\;d\;}\; N(0,1) \qquad\text{as } n \to \infty, \] so for large \(n\), \(\ \bar Y \overset{a}{\sim} N\!\left(\mu,\ \sigma^2/n\right)\).

The remarkable part is the generality. The CLT holds whatever the population distribution (skewed, discrete, fat-tailed, anything) as long as \(\sigma^2\) is finite. The bell curve emerges from the averaging, not from any bell shape in the data themselves.

Rule of thumb. A sample of \(n \ge 30\) is usually enough for the Normal approximation to be good; real samples in the hundreds or thousands make it excellent.

The CLT in action

To see how striking this is, take a decidedly non-Normal population: the triangular density \(f(y) = 2y\) on \([0,1]\) (HGL’s example), which is plainly skewed, with its mass piled up near \(1\) and a long tail toward \(0\). Now repeatedly draw a sample, average it, and standardize the average. Even for a sample as small as \(n = 10\), the histogram of the standardized means is already bell-shaped and centered at \(0\), and it only sharpens toward \(N(0,1)\) as \(n\) grows. Figure 4.4 contrasts the skewed population with the near-Normal distribution of its standardized sample mean.

Show the R code
# Left panel: the skewed triangular population f(y) = 2y on [0, 1].
pop <- data.frame(y = seq(0, 1, length.out = 200))
pop$f <- 2 * pop$y
p_pop <- ggplot(pop, aes(y, f)) +
  geom_area(fill = ucla$red, alpha = 0.25) +
  geom_line(color = ucla$red, linewidth = 1) +
  scale_x_continuous(breaks = c(0, 1)) +
  scale_y_continuous(limits = c(0, 2.3)) +
  labs(title = "population: f(y) = 2y (skewed)", x = "y", y = NULL)

# Right panel: simulate the standardized sample mean for n = 10.
set.seed(103)
n  <- 10
mu <- 2 / 3                       # mean of the triangular population
v  <- 1 / 18                      # variance of the triangular population
draws <- matrix(sqrt(runif(n * 5000)), nrow = 5000)  # inverse-cdf of f(y) = 2y
zbar  <- (rowMeans(draws) - mu) / sqrt(v / n)
sim   <- data.frame(z = zbar)
zs    <- seq(-3.4, 3.4, length.out = 200)
ref   <- data.frame(z = zs, f = dnorm(zs))
p_sim <- ggplot(sim, aes(z)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = ucla$blue, color = ucla$darkblue, alpha = 0.7) +
  geom_line(data = ref, aes(z, f), color = ucla$darkblue, linewidth = 1) +
  scale_x_continuous(limits = c(-3.4, 3.4), breaks = c(-2, 0, 2)) +
  labs(title = "std. mean for n = 10 -> N(0, 1)", x = "z", y = NULL)

p_pop
p_sim
Figure 4.4: Left: a skewed population \(f(y) = 2y\) on \([0,1]\). Right: the standardized sample mean for \(n = 10\) is already close to \(N(0, 1)\).
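The histogram gives a visual check; a numerical one is to ask what share of standardized sample means falls inside \(\pm 1.96\), which should approach \(0.95\) if the Normal approximation is working. The sketch below reuses the inverse-cdf draw from the figure code for the same triangular population; the helper `coverage` and the list of sample sizes are illustrative assumptions.

```r
set.seed(6)
reps <- 20000
mu <- 2 / 3; v <- 1 / 18        # mean and variance of f(y) = 2y on [0, 1]

# Share of standardized sample means inside +/- 1.96 for a given n.
coverage <- function(n) {
  draws <- matrix(sqrt(runif(n * reps)), nrow = reps)  # inverse-cdf draws
  z <- (rowMeans(draws) - mu) / sqrt(v / n)
  mean(abs(z) <= 1.96)
}
ns <- c(2, 10, 30, 100)
cov_n <- sapply(ns, coverage)
print(data.frame(n = ns, share_within_1.96 = round(cov_n, 3)))
```

Shares near \(0.95\) for moderate \(n\) are what the CLT leads us to expect; how close they are at very small \(n\) depends on the population.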

4.6 The bridge to inference

Putting the pieces together shows why this chapter is the gateway to everything that follows. For large \(n\), the CLT gives \(\dfrac{\bar Y - \mu}{\sigma/\sqrt n} \approx N(0,1)\), so applying the \(95\%\) rule, \[ \Prob\!\left(-1.96 \;\le\; \frac{\bar Y - \mu}{\sigma/\sqrt n} \;\le\; 1.96\right) \;\approx\; 0.95 . \] This single statement can be read in two complementary ways.

Rearrange one way: confidence intervals

Solving the inequality for \(\mu\) brackets it inside a confidence interval \(\bar Y \pm 1.96\,\sigma/\sqrt n\). We develop this in the confidence-intervals chapter.
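As a preview, here is a minimal sketch of that interval in R, treating \(\sigma\) as known. The population values (\(\mu = 50\), \(\sigma = 8\), \(n = 100\)) are assumptions made up for the illustration; the second half checks by simulation how often such intervals cover the true mean.

```r
set.seed(7)
# Illustrative setup (assumptions): sigma is treated as known.
mu_true <- 50; sigma <- 8; n <- 100
y <- rnorm(n, mean = mu_true, sd = sigma)

ybar <- mean(y)
se   <- sigma / sqrt(n)
ci   <- c(lower = ybar - 1.96 * se, upper = ybar + 1.96 * se)
print(ci)

# Across many repeated samples, about 95% of such intervals cover mu.
covers <- replicate(10000, {
  yb <- mean(rnorm(n, mean = mu_true, sd = sigma))
  abs(yb - mu_true) <= 1.96 * se
})
print(mean(covers))
```

The interval covers \(\mu\) exactly when \(|\bar Y - \mu| \le 1.96\,\sigma/\sqrt n\), which is why the simulated coverage rate should come out near \(0.95\).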

Read it another way: hypothesis tests

Comparing \(\bar Y\) to a hypothesized value of \(\mu\) gives a hypothesis test: is the standardized gap beyond \(\pm 1.96\)? We develop this in the hypothesis-testing chapter.
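A sketch of that comparison in R, again with \(\sigma\) treated as known. The hypothesized value `mu0`, the data-generating mean of \(52\) and the other numbers are illustrative assumptions.

```r
set.seed(8)
# Illustrative setup (assumptions): known sigma, hypothesized mean mu0.
sigma <- 8; n <- 100
mu0 <- 50
y <- rnorm(n, mean = 52, sd = sigma)   # data generated with true mean 52

z_stat  <- (mean(y) - mu0) / (sigma / sqrt(n))   # standardized gap
reject  <- abs(z_stat) > 1.96                    # two-sided test at the 5% level
p_value <- 2 * pnorm(-abs(z_stat))               # two-sided p-value

print(c(z_stat = z_stat, reject = reject, p_value = p_value))
```

Rejecting when \(|z| > 1.96\) is the same decision as rejecting when the two-sided p-value falls below \(0.05\); the later chapter explains both readings.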

And because regression estimators are themselves (weighted) averages, the same CLT logic will make their sampling distributions approximately Normal in large samples too, which is how the entire apparatus of regression inference gets off the ground.

A caution: when “Normal” fails

The Normal is powerful, but assuming it blindly is dangerous.

The Swiss franc, 15 January 2015 (Stock & Watson)

On a single day in January 2015 the euro fell \(17.5\%\) against the Swiss franc, a move of about \(156\) standard deviations. Under a Normal model, an event that extreme has probability on the order of \(10^{-5000}\): effectively impossible. Yet it happened.

The lesson is that real financial returns have fat tails, so extreme moves are far more common than a Normal would predict. Keep the scope of the CLT straight: it is a statement about the sample mean’s distribution, not a license to assume the data themselves are Normal.
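To get a feel for how small Normal tail probabilities become, they can be computed on the log scale, since the raw values underflow ordinary floating point. The sketch below uses `pnorm` with `log.p = TRUE`; for contrast it also evaluates a Student's \(t\) with \(3\) degrees of freedom, rescaled to unit variance. The \(t_3\) comparison is an illustrative assumption, not a model proposed in the text.

```r
# log10 of P(Z > k) for a standard Normal, computed on the log scale
# because the probabilities underflow ordinary double precision.
k <- c(5, 20, 156)
log10_tail <- pnorm(k, lower.tail = FALSE, log.p = TRUE) / log(10)
print(data.frame(sd_move = k, log10_prob = round(log10_tail, 1)))

# For contrast: a fat-tailed Student's t with 3 degrees of freedom
# (an illustrative choice). Its variance is 3, so a move of k standard
# deviations corresponds to k * sqrt(3) on the t scale.
log10_tail_t <- pt(k * sqrt(3), df = 3, lower.tail = FALSE, log.p = TRUE) / log(10)
print(round(log10_tail_t, 1))
```

For a \(156\)-standard-deviation move the Normal log-probability lands several thousand orders of magnitude below zero, in line with the figure quoted above, while the fat-tailed alternative assigns the same move a probability that is small but not absurd.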

4.7 Recap

The Normal distribution \(N(\mu,\sigma^2)\) is the symmetric bell curve with location \(\mu\) and spread \(\sigma^2\). We standardize with \(Z = (X-\mu)/\sigma \sim N(0,1)\), read probabilities off its cdf \(\Phi(\cdot)\), and remember that \(95\%\) of the probability lies within \(\pm 1.96\sigma\) of the mean. The key quantiles to keep are \(1.645\), \(1.96\), and \(2.58\), and linear combinations of jointly normal variables are again Normal.

For the sample mean of i.i.d. draws, \[ \E(\bar Y) = \mu, \qquad \Var(\bar Y) = \frac{\sigma^2}{n}, \] so \(\bar Y\) is unbiased and grows more precise as \(n\) rises. If the population is Normal, then \(\bar Y \sim N(\mu,\sigma^2/n)\) exactly. In general, the LLN gives \(\bar Y \xrightarrow{p} \mu\), and the CLT gives \(\tfrac{\bar Y - \mu}{\sigma/\sqrt n} \xrightarrow{d} N(0,1)\) for any population with finite variance.

The bridge to inference is the statement \[ \Prob\!\left(-1.96 \le \frac{\bar Y - \mu}{\sigma/\sqrt n} \le 1.96\right) \approx 0.95, \] which becomes confidence intervals and hypothesis tests in the chapters ahead.

Next time: we leave one variable behind and study how one variable depends on another: the Simple Linear Regression Model.