---
title: "Expectation, Variance & Covariance"
---
{{< include _setup.qmd >}}
> **Reading.** SW §2.2–2.3, HGL Probability Primer §P.3, P.5–P.6
A [random variable](02-random-vars.qmd) is described by its whole *distribution*
— a pmf, a pdf, a cdf. That is a lot of information. This chapter does the
opposite of the last one: it boils a distribution down to a few **numbers**. We
summarize where a distribution sits (its *center*, the mean), how spread out it
is (its *variance* and standard deviation), and — for *two* variables at once —
how they *move together* (covariance and correlation).
::: {.keyidea title="Why these three ideas matter"}
Every regression coefficient we estimate later is built from exactly these
pieces. The slope of a regression line, for instance, will turn out to be
$\Cov(x,y)/\Var(x)$ — so this chapter is the toolkit for the rest of the course.
:::
### A running example: the "slips" population
We reuse the population behind the pmf from the [last
chapter](02-random-vars.qmd). Ten slips sit in a hat; we draw one at random.
Define two random variables on that draw:
- $X$ = the **number** printed on the slip $(1,2,3,4)$;
- $Y$ = an **indicator**: $Y = 1$ if the slip is shaded, $0$ if not.
The full description of how $X$ and $Y$ behave *together* is their **joint pmf**,
$f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$. We can read it as a table, with the
**marginal** distributions of $X$ and $Y$ sitting in the margins.
```{r}
#| label: tbl-joint
#| tbl-cap: "The joint pmf $f_{X,Y}(x,y)$, with marginals in the margins."
# Rows: Y = 0 (white), Y = 1 (shaded), then the column totals f_X(x)
joint <- data.frame(
  Y      = c("$0$", "$1$", "$f_X(x)$"),
  x1     = c(0.0, 0.1, 0.1),
  x2     = c(0.1, 0.1, 0.2),
  x3     = c(0.2, 0.1, 0.3),
  x4     = c(0.3, 0.1, 0.4),
  margin = c(0.6, 0.4, 1.0)
)
knitr::kable(
  joint,
  col.names = c("$Y \\backslash X$", "$1$", "$2$", "$3$", "$4$", "$f_Y(y)$"),
  align = "cccccc"
)
```
There are two ways to read it. The **body** gives the joint probabilities,
$f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$. The **right and bottom margins** give the
distributions of $Y$ and of $X$ on their own. We will compute *every* number in
this chapter from this one table.
## Expected value (the mean) {#sec-expectation}
::: {.definition title="Expected value"}
The **expected value** (or **mean**) of a discrete random variable $X$ is the
probability-weighted average of its values:
$$
\E(X) \;=\; \sum_{x} x\,f_X(x) \;=\; \mu_X .
$$
:::
The expected value is the **long-run average** of $X$ over many repetitions of
the experiment. Notice that $\mu_X$ is a **population parameter** — a fixed
feature of the population, written with a Greek letter. Later we will *estimate*
these parameters from a sample.
::: {.callout-note appearance="simple"}
**Heads-up on names.** The "mean" can refer to this *population* mean $\mu_X$ *or*
to a *sample* average $\bar x$. They are different objects — keep track of which
one is meant.
:::
### Example: the mean of $X$, and the mean of an indicator
For the **number on the slip**, $X$, we weight each value by its marginal
probability:
$$
\E(X) = \sum_x x\,f_X(x)
= 1(0.1) + 2(0.2) + 3(0.3) + 4(0.4)
= 3 .
$$
Draw thousands of slips and average the numbers — the running average settles
down to $3$.
::: {.example title="Paying off a promise about indicators"}
For the **indicator** $Y$ (a Bernoulli variable), with $p = \Prob(Y = 1)$,
$$
\E(Y) = 0(1-p) + 1(p) = p .
$$
The mean of a $0/1$ variable *is* the proportion of ones. Here
$\E(Y) = 0.4 = \Prob(\text{shaded})$.
:::
This is why, later, a regression on an indicator recovers a group's *share* or a
*treatment effect*; see [dummy variables](19-dummy-variables.qmd)
and [treatment effects](20-treatment-effects.qmd).
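The two means in this subsection can be checked numerically. The chunk below is a minimal sketch in base R; the variable names (`f_X`, `mu_X`, and so on) are illustrative choices, and the probabilities are copied from the margins of the slips table.

```{r}
#| echo: true
x   <- 1:4
f_X <- c(0.1, 0.2, 0.3, 0.4)   # marginal pmf of X, from the bottom margin
y   <- 0:1
f_Y <- c(0.6, 0.4)             # marginal pmf of Y, from the right margin

mu_X <- sum(x * f_X)           # probability-weighted average of the slip numbers
mu_Y <- sum(y * f_Y)           # for a 0/1 variable this is just P(Y = 1)
c(mu_X = mu_X, mu_Y = mu_Y)
```

The same pattern, `sum(values * probabilities)`, is all that any discrete expected value requires.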
### The expected value of a function of $X$
Any function $g(X)$ of a random variable is itself random. Its mean weights the
*transformed* values by the *same* probabilities:
$$
\E\!\left[g(X)\right] \;=\; \sum_{x} g(x)\,f_X(x).
$$
::: {.example title="Second moment of $X$"}
With $g(X) = X^2$,
$$
\E(X^2) = \sum_x x^2 f_X(x)
= 1(0.1) + 4(0.2) + 9(0.3) + 16(0.4)
= 10 .
$$
:::
::: {.warningbox title="A trap to avoid"}
In general
$$
\E\!\left[g(X)\right] \;\neq\; g\!\left(\E(X)\right).
$$
Here $\E(X^2) = 10$ but $\bigl(\E X\bigr)^2 = 3^2 = 9$. We will use $\E(X^2)$ in a
moment to get the variance.
:::
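To see the gap between $\E[g(X)]$ and $g(\E X)$ concretely, here is a small sketch for $g(x) = x^2$ on the slips. The helper `expect()` is hypothetical, written only for this illustration.

```{r}
#| echo: true
x   <- 1:4
f_X <- c(0.1, 0.2, 0.3, 0.4)

# Hypothetical helper: E[g(X)] for a discrete X with values x and pmf f
expect <- function(g, x, f) sum(g(x) * f)

E_g <- expect(function(v) v^2, x, f_X)   # E(X^2): transform first, then average
g_E <- expect(identity, x, f_X)^2        # (E X)^2: average first, then transform
c(E_of_g = E_g, g_of_E = g_E)
```

The order of "transform" and "average" matters whenever $g$ is not linear.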
### Rules for expected values
Let $a, b, c$ be constants and $X, Y$ random variables. Expectation is a
**linear** operator.
::: {.property title="Linearity of expectation"}
$$
\begin{aligned}
\E(aX + b) &= a\,\E(X) + b,\\
\E\!\left[g_1(X) + g_2(X)\right] &= \E\!\left[g_1(X)\right] + \E\!\left[g_2(X)\right],\\
\E(aX + bY + c) &= a\,\E(X) + b\,\E(Y) + c.
\end{aligned}
$$
:::
In words: *the expected value of a sum is the sum of the expected values*, and
constants pass straight through.
::: {.warningbox title="One caution about products"}
Linearity is about *sums*. For *products*, $\E(XY) = \E(X)\,\E(Y)$ is guaranteed
when $X$ and $Y$ are **independent**. In general it holds exactly when the
covariance (later in this chapter) is zero; otherwise the covariance gets in the way.
:::
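Both the linearity rule and the caution about products can be checked against the slips joint pmf. This is a sketch under stated assumptions: the constants `a`, `b`, `c0` are arbitrary, and `expect_joint()` is a hypothetical helper that averages over the eight cells of the table.

```{r}
#| echo: true
# Slips joint pmf in long form: one row per (x, y) cell
slips <- expand.grid(x = 1:4, y = 0:1)
slips$p <- c(0.0, 0.1, 0.2, 0.3,   # y = 0 (white)
             0.1, 0.1, 0.1, 0.1)   # y = 1 (shaded)

expect_joint <- function(values) sum(values * slips$p)   # hypothetical helper

a <- 2; b <- 3; c0 <- 1                                   # arbitrary constants
lhs <- expect_joint(a * slips$x + b * slips$y + c0)       # E(aX + bY + c) cell by cell
rhs <- a * expect_joint(slips$x) + b * expect_joint(slips$y) + c0
c(lhs = lhs, rhs = rhs)

# The product does not split for these (dependent) variables
E_XY  <- expect_joint(slips$x * slips$y)
EX_EY <- expect_joint(slips$x) * expect_joint(slips$y)
c(E_XY = E_XY, EX_EY = EX_EY)
```

The sum rule matches for any constants; the product rule fails here because $X$ and $Y$ are related.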
## Variance & standard deviation {#sec-variance}
::: {.definition title="Variance and standard deviation"}
The **variance** of $X$ is the expected squared distance from the mean:
$$
\Var(X) \;=\; \E\!\left[(X - \mu_X)^2\right] \;=\; \sigma_X^2 .
$$
The **standard deviation** $\sigma_X = \sqrt{\Var(X)}$ is in the *same units* as
$X$.
:::
A larger variance means the distribution is more spread out about its mean.
@fig-spread shows two distributions with the same mean but different spreads: the
flatter one has the larger variance.
```{r}
#| label: fig-spread
#| fig-cap: "Two distributions with the same mean but different variances. The wider, flatter curve has the larger spread."
#| fig-width: 5
#| fig-height: 3.4
xs <- seq(-6, 6, length.out = 400)
# Normal densities centered at 0 with standard deviations 1 and 2.2
dat <- rbind(
  data.frame(x = xs, y = dnorm(xs, 0, 1), spread = "small variance"),
  data.frame(x = xs, y = dnorm(xs, 0, 2.2), spread = "large variance")
)
ggplot(dat, aes(x, y, color = spread)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = ucla$gray) +
  scale_color_manual(values = c("small variance" = ucla$blue,
                                "large variance" = ucla$red)) +
  labs(x = "x", y = expression(f[X](x)), color = NULL)
```
In practice we almost never compute the variance straight from the definition.
The following algebraically equivalent formula is far easier to use.
::: {.property title="The computational formula (use this one)"}
$$
\Var(X) \;=\; \E(X^2) - \mu_X^2 .
$$
:::
The derivation is a one-line expansion: $\E[(X - \mu)^2] = \E(X^2) - 2\mu\,\E(X)
+ \mu^2 = \E(X^2) - \mu^2$, since $\E(X) = \mu$.
### Example: variance of $X$ and of an indicator
For the **number on the slip**, $X$, we already found $\E(X) = 3$ and
$\E(X^2) = 10$, so
$$
\Var(X) = \E(X^2) - \mu_X^2 = 10 - 3^2 = 1,
$$
and $\sigma_X = \sqrt{1} = 1$.
::: {.example title="Variance of a Bernoulli"}
For the indicator $Y$ with $\E(Y) = p$ — and noting $Y^2 = Y$, so $\E(Y^2) = p$ —
$$
\Var(Y) = p - p^2 = p(1-p).
$$
Here $\Var(Y) = 0.4(0.6) = 0.24$, so $\sigma_Y = \sqrt{0.24} \approx 0.49$.
:::
A coin is most uncertain at $p = \tfrac{1}{2}$, where $p(1-p)$ is largest.
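As a check that the definition and the computational formula agree, the sketch below computes $\Var(X)$ both ways for the slips and then the Bernoulli variance of $Y$. Variable names are illustrative only.

```{r}
#| echo: true
x    <- 1:4
f_X  <- c(0.1, 0.2, 0.3, 0.4)
mu_X <- sum(x * f_X)

var_def   <- sum((x - mu_X)^2 * f_X)   # definition: E[(X - mu)^2]
var_short <- sum(x^2 * f_X) - mu_X^2   # computational formula: E(X^2) - mu^2
c(definition = var_def, shortcut = var_short, sd = sqrt(var_def))

p     <- 0.4                           # P(shaded)
var_Y <- p * (1 - p)                   # Bernoulli variance
c(var_Y = var_Y, sd_Y = sqrt(var_Y))
```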
### Variance under a linear transformation
What happens to spread when we rescale and shift? Let $a, b$ be constants.
::: {.property title="Mean and variance of $a + bX$"}
$$
\E(a + bX) = a + b\,\mu_X,
\qquad
\Var(a + bX) = b^2\,\Var(X),
\qquad
\sigma_{a + bX} = |b|\,\sigma_X .
$$
:::
The two constants play very different roles. An additive constant $a$ **shifts**
the whole distribution — it moves the mean but leaves the spread unchanged. A
multiplicative constant $b$ **rescales** — it multiplies the standard deviation
by $|b|$ and the variance by $b^2$.
::: {.example title="After-tax earnings"}
Tax pre-tax earnings $X$ at $20\%$ and add a \$2000 grant: $Y = 2000 + 0.8X$.
Then $\mu_Y = 2000 + 0.8\,\mu_X$ and $\sigma_Y = 0.8\,\sigma_X$ — the spread of
take-home pay is $80\%$ that of pre-tax pay.
:::
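The shift-and-rescale rules can be verified by brute force. The earnings example gives no distribution for $X$, so this sketch borrows the slips pmf as a stand-in (an assumption made purely for illustration) and applies the same $a = 2000$, $b = 0.8$. The helpers `pmf_mean()` and `pmf_var()` are hypothetical.

```{r}
#| echo: true
x   <- 1:4
f_X <- c(0.1, 0.2, 0.3, 0.4)             # stand-in distribution for X
a <- 2000; b <- 0.8                      # shift and scale from the earnings example

pmf_mean <- function(v, f) sum(v * f)                       # hypothetical helper
pmf_var  <- function(v, f) sum((v - pmf_mean(v, f))^2 * f)  # hypothetical helper

w <- a + b * x                           # transformed values keep the same probabilities
mean_direct <- pmf_mean(w, f_X); mean_rule <- a + b * pmf_mean(x, f_X)
var_direct  <- pmf_var(w, f_X);  var_rule  <- b^2 * pmf_var(x, f_X)
c(mean_direct = mean_direct, mean_rule = mean_rule)
c(var_direct = var_direct, var_rule = var_rule)
c(sd_direct = sqrt(var_direct), sd_rule = abs(b) * sqrt(pmf_var(x, f_X)))
```

The additive constant shows up only in the mean; the variance and standard deviation depend on $b$ alone.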
### A useful special case: standardization
Combining the two rules, we can turn *any* $X$ into a variable with mean $0$ and
variance $1$. Subtract the mean and divide by the standard deviation:
$$
Z \;=\; \frac{X - \mu_X}{\sigma_X}.
$$
Reading this as a linear transformation with $a = -\mu_X/\sigma_X$ and
$b = 1/\sigma_X$, the rules give
$$
\E(Z) = 0, \qquad \Var(Z) = \frac{\Var(X)}{\sigma_X^2} = 1 .
$$
::: {.keyidea title="Why we care"}
$Z$ is **unit-free** and measures "how many standard deviations from the mean."
This is exactly the move behind the *$Z$-score* and the standard Normal table —
the heart of the [next chapter](04-normal-clt.qmd).
:::
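Standardization works for any distribution with positive variance, not only bell-shaped ones. As a small sketch, the chunk below standardizes the slips indicator $Y$ (chosen here only as a convenient example) and confirms the result has mean $0$ and variance $1$.

```{r}
#| echo: true
y   <- 0:1
f_Y <- c(0.6, 0.4)                       # slips indicator: P(Y = 1) = 0.4
mu_Y    <- sum(y * f_Y)
sigma_Y <- sqrt(sum((y - mu_Y)^2 * f_Y))

z <- (y - mu_Y) / sigma_Y                # standardized values, same probabilities
z
mean_Z <- sum(z * f_Y)
var_Z  <- sum(z^2 * f_Y) - mean_Z^2
c(mean_Z = mean_Z, var_Z = var_Z)
```

The two values of `z` say how many standard deviations a white or a shaded slip sits from the mean of $Y$.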
## Two variables: joint, marginal, conditional {#sec-joint}
Most economic questions involve *two* variables at once: income *and* education,
price *and* quantity. We have already met the **joint pmf** in the running
example; here we develop the two distributions we can extract from it.
::: {.definition title="Joint and marginal pmf"}
The **joint pmf** is $f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$ — the probability the
two outcomes occur *together*. Its entries sum to $1$.
The **marginal pmf** is the distribution of one variable alone, obtained by
*summing the joint over the other*:
$$
f_X(x) = \sum_y f_{X,Y}(x,y).
$$
:::
From the slips table, summing **down each column** gives
$f_X = (0.1, 0.2, 0.3, 0.4)$, and summing **across each row** gives
$f_Y = (0.6,\,0.4)$. For instance,
$$
\Prob(\text{shaded}) = f_Y(1) = 0.1 + 0.1 + 0.1 + 0.1 = 0.4 .
$$
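"Summing out" is a one-liner once the joint pmf is stored as a matrix. A minimal sketch, with the object name `pmf` and its layout (rows for $Y$, columns for $X$) chosen here to mirror the table:

```{r}
#| echo: true
# Joint pmf as a matrix: rows index Y (0 = white, 1 = shaded), columns index X
pmf <- matrix(
  c(0.0, 0.1, 0.2, 0.3,
    0.1, 0.1, 0.1, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4"))
)
f_X <- colSums(pmf)   # sum over y within each column: marginal of X
f_Y <- rowSums(pmf)   # sum over x within each row: marginal of Y
f_X
f_Y
sum(pmf)              # the joint probabilities sum to 1
```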
### Conditional distributions
Often we want the distribution of $X$ *within a subpopulation* fixed by $Y$.
Conditioning **shrinks the population** to just those cases, then renormalizes so
the probabilities sum to one again.
::: {.definition title="Conditional pmf"}
$$
f_{X \given Y}(x \given y) = \Prob(X = x \given Y = y)
= \frac{f_{X,Y}(x,y)}{f_Y(y)} .
$$
:::
::: {.example title="Shaded slips only"}
Among shaded slips ($Y = 1$, probability $0.4$),
$$
f_{X \given Y}(x \given 1) = \frac{0.1}{0.4} = 0.25
$$
for each $x$ — once we know the slip is shaded, all four numbers are equally
likely.
:::
::: {.example title="Rain and the commute"}
Let $X = 0$ mean rain and $Y = 0$ a long commute. With
$\Prob(\text{rain}) = 0.30$ and a rainy-*and*-long probability of $0.15$,
$$
\Prob(\text{long} \given \text{rain}) = \frac{0.15}{0.30} = 0.50 .
$$
:::
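The "shrink, then renormalize" recipe can be applied to both rows of the slips table at once. This sketch uses base R's `prop.table()` with `margin = 1`, which divides each row by its own total, that is, by $f_Y(y)$. Object names are illustrative.

```{r}
#| echo: true
pmf <- matrix(
  c(0.0, 0.1, 0.2, 0.3,
    0.1, 0.1, 0.1, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4"))
)
cond_X_given_Y <- prop.table(pmf, margin = 1)  # each row divided by f_Y(y)
cond_X_given_Y
rowSums(cond_X_given_Y)                        # each conditional pmf sums to 1
```

Row `"1"` reproduces the shaded-slips example; row `"0"` is the conditional pmf of $X$ among white slips.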
### Independence
::: {.definition title="Independence"}
$X$ and $Y$ are **independent** if knowing one tells you *nothing* about the
other — equivalently, for *all* $x, y$,
$$
f_{X \given Y}(x \given y) = f_X(x)
\quad\Longleftrightarrow\quad
f_{X,Y}(x,y) = f_X(x)\,f_Y(y).
$$
That is, the joint factors into the product of the marginals.
:::
::: {.example title="The slips are *not* independent"}
Check the corner $x = 1,\ y = 1$:
$$
f_{X,Y}(1,1) = 0.1
\;\neq\;
f_X(1)\,f_Y(1) = (0.1)(0.4) = 0.04 .
$$
A single violated cell is enough: $X$ and $Y$ are **dependent**. This makes
sense: every slip numbered "1" is shaded, so white slips are never a "1."
:::
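Checking independence cell by cell amounts to comparing the joint pmf with the table that independence *would* imply. A sketch, assuming the matrix layout below (rows for $Y$, columns for $X$); `outer()` builds the product $f_Y(y)\,f_X(x)$ for every cell.

```{r}
#| echo: true
pmf <- matrix(
  c(0.0, 0.1, 0.2, 0.3,
    0.1, 0.1, 0.1, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4"))
)
indep <- outer(rowSums(pmf), colSums(pmf))   # what the joint would be under independence
indep
is_independent <- isTRUE(all.equal(pmf, indep, check.attributes = FALSE))
is_independent
pmf - indep                                  # nonzero entries are the violated cells
```

Independence requires every entry of the difference to be zero, so one nonzero cell settles the question.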
## Conditional expectation {#sec-cond-expectation}
::: {.definition title="Conditional expectation"}
The **conditional expectation** $\E(X \given Y = y)$ is the mean computed with
the *conditional* pmf:
$$
\E(X \given Y = y) \;=\; \sum_x x\,f_{X \given Y}(x \given y).
$$
:::
This answers questions like "what is the mean wage *among* people with $16$ years
of education?", that is, $\E(\text{WAGE} \given \text{EDUC} = 16)$.
::: {.example title="Slips, given shaded"}
$$
\E(X \given Y = 1) = \sum_x x\,f_{X \given Y}(x \given 1)
= (1 + 2 + 3 + 4)(0.25) = 2.5 .
$$
:::
Note that $2.5$ is **not a value $X$ can take** — an expected value need not be
attainable. Conditioning on white slips instead gives
$$
\E(X \given Y = 0) = \tfrac{10}{3} \approx 3.33,
$$
while the *unconditional* mean is $\E(X) = 3$. So $\E(X \given Y)$ **varies with
$Y$**: it is itself a function of the conditioning value.
### The law of iterated expectations
The conditional means must "average back" to the overall mean, weighted by how
often each condition occurs.
::: {.property title="Law of iterated expectations"}
$$
\E(X) \;=\; \sum_y \E(X \given Y = y)\,f_Y(y) \;=\; \E\!\left[\E(X \given Y)\right].
$$
:::
::: {.example title="Check it on the slips"}
$$
\E(X) = \underbrace{\tfrac{10}{3}}_{\E(X \given Y = 0)}(0.6)
+ \underbrace{2.5}_{\E(X \given Y = 1)}(0.4)
= 2.0 + 1.0 = 3 \;\checkmark
$$
:::
::: {.callout-note appearance="simple"}
**Intuition.** Mean adult height is a weighted average of the mean height of men
and the mean height of women, with weights equal to their population shares.
:::
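Both conditional means, and the way they average back to $\E(X)$, can be computed directly from the table. A minimal sketch with illustrative object names:

```{r}
#| echo: true
pmf <- matrix(
  c(0.0, 0.1, 0.2, 0.3,
    0.1, 0.1, 0.1, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4"))
)
x    <- 1:4
f_Y  <- rowSums(pmf)
cond <- prop.table(pmf, margin = 1)     # rows: pmf of X given Y = 0 and given Y = 1

cond_mean <- drop(cond %*% x)           # E(X | Y = y) for y = 0, 1
cond_mean
lie <- sum(cond_mean * f_Y)             # weight each conditional mean by f_Y(y)
lie
```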
### Conditional variance — and a preview of regression
We can also measure *spread* within a subpopulation:
$$
\Var(X \given Y = y) = \E\!\left[(X - \E(X \given Y = y))^2 \,\middle|\, Y = y\right].
$$
For the slips, $\Var(X \given Y = 1) = \tfrac{5}{4}$ while
$\Var(X \given Y = 0) = \tfrac{5}{9}$: the spread of $X$ differs across
subpopulations, and either can exceed or fall short of the unconditional
$\Var(X) = 1$.
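The two conditional variances quoted above come from applying the computational formula within each row of the conditional pmf. A sketch, with object names chosen for illustration:

```{r}
#| echo: true
pmf <- matrix(
  c(0.0, 0.1, 0.2, 0.3,
    0.1, 0.1, 0.1, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4"))
)
x    <- 1:4
cond <- prop.table(pmf, margin = 1)          # conditional pmf of X given each y

cond_mean <- drop(cond %*% x)                # E(X | Y = y)
cond_var  <- drop(cond %*% (x^2)) - cond_mean^2   # E(X^2 | Y = y) - E(X | Y = y)^2
cond_var
```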
::: {.keyidea title="Why conditional expectation is the punchline of the course"}
Among *all* functions $g(X)$, the conditional mean $\E(Y \given X)$ is the **best
predictor** of $Y$ from $X$ — it minimizes the mean squared prediction error
$\E\!\left[(Y - g(X))^2\right]$. The [regression line](05-simple-regression.qmd)
we build later is precisely a model for $\E(Y \given X)$.
:::
## Covariance & correlation {#sec-covariance}
::: {.definition title="Covariance"}
The **covariance** of $X$ and $Y$ measures their *linear* association:
$$
\Cov(X,Y) = \E\!\left[(X - \mu_X)(Y - \mu_Y)\right]
= \E(XY) - \mu_X\mu_Y = \sigma_{XY}.
$$
:::
The sign tells the story. When $\sigma_{XY} > 0$, an above-average $X$ *tends* to
come with an above-average $Y$ (points fall mostly in quadrants I and III of the
mean-centered scatter). When $\sigma_{XY} < 0$, they move in *opposite*
directions (quadrants II and IV). When $\sigma_{XY} \approx 0$, there is no
*linear* tendency. @fig-cov-quadrants shows a cloud with positive covariance.
```{r}
#| label: fig-cov-quadrants
#| fig-cap: "Positive covariance: mean-centered points fall mostly in quadrants I and III."
#| fig-width: 5
#| fig-height: 3.6
pts <- data.frame(
x = c(-3, -2.4, -2, -1.5, -1, -0.6, -0.3, 0.4, 0.7, 1, 1.4, 1.8, 2.2, 2.6, 3, 3.2),
y = c(-2.4, -1.2, -2.6, -0.7, -1.6, 0.4, -1.1, 0.6, -0.5, 1.7, 0.6, 2.4, 1.1, 2.9, 1.8, 2.6)
)
# Positions for the quadrant labels
quad <- data.frame(
  lab = c("I", "II", "III", "IV"),
  x = c(2.6, -2.6, -2.6, 2.6),
  y = c(3.4, 3.4, -3.4, -3.4)
)
ggplot(pts, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_point(color = ucla$blue, size = 1.6) +
  geom_text(data = quad, aes(x, y, label = lab), color = ucla$darkblue, size = 3.4) +
  scale_x_continuous(limits = c(-4, 4)) +
  scale_y_continuous(limits = c(-4, 4)) +
  labs(x = expression(X - mu[X]), y = expression(Y - mu[Y]))
```
### Example: covariance of the slips
First the cross-moment. Only the shaded row $Y = 1$ contributes, since $Y = 0$
kills the product:
$$
\E(XY) = \sum_{x,y} xy\,f_{X,Y}(x,y)
= (1 + 2 + 3 + 4)(1)(0.1) = 1 .
$$
Then, using $\E(X) = 3$ and $\E(Y) = 0.4$,
$$
\Cov(X,Y) = \E(XY) - \mu_X\mu_Y = 1 - (3)(0.4) = -0.2 .
$$
The covariance is **negative**: larger numbers are relatively more common on the
*white* slips, so a high $X$ goes with $Y = 0$. This is consistent with the
dependence we found earlier.
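The covariance can be computed from the joint pmf using either the definition or the shortcut. The sketch below stores the table in long form (one row per cell), which is an illustrative choice rather than anything required.

```{r}
#| echo: true
slips <- expand.grid(x = 1:4, y = 0:1)
slips$p <- c(0.0, 0.1, 0.2, 0.3,   # y = 0 (white)
             0.1, 0.1, 0.1, 0.1)   # y = 1 (shaded)

mu_X <- with(slips, sum(x * p))
mu_Y <- with(slips, sum(y * p))
cov_def   <- with(slips, sum((x - mu_X) * (y - mu_Y) * p))  # E[(X - mu_X)(Y - mu_Y)]
cov_short <- with(slips, sum(x * y * p)) - mu_X * mu_Y      # E(XY) - mu_X * mu_Y
c(definition = cov_def, shortcut = cov_short)
```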
### Correlation: a unit-free covariance
Covariance has awkward units — here "slip-number $\times$ shaded" — and its size
is hard to read. Dividing by the standard deviations fixes both.
::: {.definition title="Correlation"}
$$
\rho_{XY} \;=\; \frac{\Cov(X,Y)}{\sqrt{\Var(X)}\,\sqrt{\Var(Y)}}
\;=\; \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y},
\qquad -1 \le \rho_{XY} \le 1 .
$$
:::
For the slips,
$$
\rho_{XY} = \frac{-0.2}{\sqrt{1}\,\sqrt{0.24}} \approx -0.41 .
$$
The correlation hits $\rho = \pm 1$ exactly when $X$ is a perfect linear function
of $Y$, and $\rho = 0$ means no linear association.
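One way to see what "unit-free" buys us is to rescale $X$ and watch the correlation stay put. In this sketch `pmf_cor()` is a hypothetical helper that computes a correlation from cell values and cell probabilities, and the factor of 100 is an arbitrary change of units.

```{r}
#| echo: true
slips <- expand.grid(x = 1:4, y = 0:1)
slips$p <- c(0.0, 0.1, 0.2, 0.3,   # y = 0 (white)
             0.1, 0.1, 0.1, 0.1)   # y = 1 (shaded)

# Hypothetical helper: correlation of two cell-level variables under probabilities p
pmf_cor <- function(u, v, p) {
  mu_u <- sum(u * p)
  mu_v <- sum(v * p)
  cov_uv <- sum((u - mu_u) * (v - mu_v) * p)
  cov_uv / sqrt(sum((u - mu_u)^2 * p) * sum((v - mu_v)^2 * p))
}

rho_slips  <- with(slips, pmf_cor(x, y, p))
rho_scaled <- with(slips, pmf_cor(100 * x, y, p))   # X measured in different units
c(rho = rho_slips, rho_after_rescaling = rho_scaled)
```

Multiplying $X$ by 100 multiplies the covariance by 100 but leaves $\rho_{XY}$ unchanged.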
::: {.example title="A real-data anchor"}
The food-expenditure vs. income data from the [first
chapter](01-introduction.qmd) has correlation $\rho \approx 0.62$ — a moderate,
*positive* linear association, matching its upward-sloping cloud (@fig-food-cor).
:::
```{r}
#| label: fig-food-cor
#| fig-cap: "Weekly food expenditure against income (POE5 `food`); the correlation is about 0.62."
#| fig-width: 5
#| fig-height: 3.4
data(food)
rho <- cor(food$income, food$food_exp)
ggplot(food, aes(income, food_exp)) +
  geom_point(color = ucla$blue, size = 1.8, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = ucla$red, linewidth = 1) +
  annotate("text", x = min(food$income), y = max(food$food_exp),
           hjust = 0, vjust = 1, color = ucla$darkblue,
           label = paste0("rho = ", round(rho, 2))) +
  labs(x = "income ($100/week)", y = "food expenditure ($/week)")
```
### Independence, covariance, and a crucial caveat
::: {.property title="Independence implies zero covariance"}
If $X$ and $Y$ are **independent**, then $\Cov(X,Y) = 0$ and $\rho_{XY} = 0$.
:::
::: {.warningbox title="The converse does *not* hold"}
$\Cov(X,Y) = 0$ does **not** imply independence. Covariance only sees *linear*
association; variables can be tightly related in a *nonlinear* way yet have zero
covariance.
:::
::: {.example title="Zero covariance, total dependence"}
Let $(X, Y)$ be a random point on the circle $X^2 + Y^2 = 1$, with a distribution
symmetric about both axes (for example, uniform on the circle). Then
$\Cov(X,Y) = 0$, yet $X$ and $Y$ are completely dependent: knowing $X$ pins $Y$
down to $\pm\sqrt{1 - X^2}$ (@fig-circle).
:::
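The circle claim is easy to confirm numerically. As a sketch, take `n` equally spaced points on the circle, each with probability $1/n$; the choice `n = 360` and the variable names are assumptions made for illustration.

```{r}
#| echo: true
n     <- 360
angle <- 2 * pi * (0:(n - 1)) / n        # equally spaced angles, endpoint not repeated
cx <- cos(angle)
cy <- sin(angle)

cov_circle <- mean(cx * cy) - mean(cx) * mean(cy)   # E(XY) - E(X)E(Y) with weights 1/n
cov_circle                                          # zero up to rounding error
range(cx^2 + cy^2)                                  # yet every point satisfies X^2 + Y^2 = 1
```

The covariance vanishes even though the second line shows an exact (nonlinear) relationship between the two coordinates.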
```{r}
#| label: fig-circle
#| fig-cap: "Points on a circle have zero covariance yet are completely dependent."
#| fig-width: 4
#| fig-height: 3.6
theta <- seq(0, 2 * pi, length.out = 200)
circ <- data.frame(x = cos(theta), y = sin(theta))
ggplot(circ, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_path(color = ucla$blue, linewidth = 1) +
  coord_equal() +
  labs(x = "X", y = "Y")
```
## Mean & variance of linear combinations {#sec-linear-comb}
We constantly build new variables as weighted sums of others — a portfolio, a
sample average, a regression fit. Start with the mean: it is *always* linear.
::: {.property title="Mean of a linear combination"}
$$
\E(aX + bY + c) \;=\; a\,\E(X) + b\,\E(Y) + c,
$$
*whether or not* $X$ and $Y$ are independent. This extends to any number of terms,
$$
\E\!\left(\sum_i a_i X_i\right) = \sum_i a_i\,\E(X_i).
$$
:::
No assumptions are needed — expectation does not care about dependence.
Variance is a different story.
::: {.property title="Variance of a linear combination"}
$$
\Var(aX + bY) = a^2\Var(X) + b^2\Var(Y) + 2ab\,\Cov(X,Y).
$$
:::
A **covariance term** appears, so variance is *not* linear. Two special cases are
worth memorizing:
$$
\Var(X + Y) = \Var(X) + \Var(Y) + 2\Cov(X,Y),
$$
$$
\Var(X - Y) = \Var(X) + \Var(Y) - 2\Cov(X,Y).
$$
::: {.warningbox title="The headline"}
**The variance of a sum is *not* the sum of the variances** — unless the
variables are uncorrelated.
:::
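The slips make the role of the covariance term concrete. This sketch computes $\Var(X + Y)$ and $\Var(X - Y)$ directly from the joint pmf and compares them with the formulas; `var_from_pmf()` is a hypothetical helper.

```{r}
#| echo: true
slips <- expand.grid(x = 1:4, y = 0:1)
slips$p <- c(0.0, 0.1, 0.2, 0.3,   # y = 0 (white)
             0.1, 0.1, 0.1, 0.1)   # y = 1 (shaded)

var_from_pmf <- function(v, p) sum(v^2 * p) - sum(v * p)^2   # hypothetical helper

var_X  <- with(slips, var_from_pmf(x, p))
var_Y  <- with(slips, var_from_pmf(y, p))
cov_XY <- with(slips, sum(x * y * p) - sum(x * p) * sum(y * p))

var_sum  <- with(slips, var_from_pmf(x + y, p))   # direct, from the joint pmf
var_diff <- with(slips, var_from_pmf(x - y, p))
c(direct = var_sum,  formula = var_X + var_Y + 2 * cov_XY)
c(direct = var_diff, formula = var_X + var_Y - 2 * cov_XY)
var_X + var_Y                                     # what naive addition would give
```

Because the covariance is negative here, the variance of the sum is smaller than the naive total and the variance of the difference is larger.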
### The independent (or uncorrelated) case
When $\Cov(X,Y) = 0$ — in particular when $X$ and $Y$ are **independent** — the
cross term vanishes and variance *does* add:
$$
\Var(aX + bY) = a^2\Var(X) + b^2\Var(Y),
\qquad
\Var(X \pm Y) = \Var(X) + \Var(Y).
$$
::: {.keyidea title="Looking ahead"}
The **sample mean** $\bar X = \tfrac{1}{n}\sum_{i=1}^n X_i$ is a linear
combination of independent draws. These rules give
$$
\E(\bar X) = \mu, \qquad \Var(\bar X) = \frac{\sigma^2}{n}.
$$
The variance shrinks as $n$ grows — the reason larger samples are more
informative, and the seed of the [Central Limit Theorem](04-normal-clt.qmd).
:::
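Both results for $\bar X$ follow in one line each from the rules in this section. A short derivation, assuming the draws $X_1, \dots, X_n$ are independent and that each has the same mean $\mu$ and variance $\sigma^2$ (the common mean and variance are an assumption about draws from a single population):

$$
\begin{aligned}
\E(\bar X) &= \frac{1}{n}\sum_{i=1}^n \E(X_i) = \frac{1}{n}\,(n\mu) = \mu,\\
\Var(\bar X) &= \frac{1}{n^2}\sum_{i=1}^n \Var(X_i) = \frac{1}{n^2}\,(n\sigma^2) = \frac{\sigma^2}{n}.
\end{aligned}
$$

The first line uses only linearity of expectation. Independence enters in the second line, where it removes every covariance term, and the constant $1/n$ comes out squared.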
## Recap {#sec-recap}
For a **single variable**, the mean $\E(X) = \sum_x x\,f_X(x)$ locates the center
and the variance $\Var(X) = \E(X^2) - \mu^2$ measures the spread. Expectation is
linear, but in general $\E[g(X)] \neq g(\E X)$; a linear rescaling obeys
$\Var(a + bX) = b^2\Var(X)$; and for an indicator, $\E = p$ and $\Var = p(1-p)$.
For **two variables**, we move from the joint pmf to a marginal (by summing out)
to a conditional (by dividing), with independence characterized by
$f_{X,Y} = f_X f_Y$. Their linear association is captured by
$\Cov = \E(XY) - \mu_X\mu_Y$ and the unit-free $\rho = \sigma_{XY}/(\sigma_X
\sigma_Y)$. Independence implies $\Cov = 0$ — but **not** conversely. And the
variance of a sum carries a covariance term:
$\Var(X + Y) = \Var X + \Var Y + 2\Cov(X,Y)$.
::: {.keyidea title="The thread to regression"}
$\E(Y \given X)$ is the best predictor of $Y$, and the regression slope will turn
out to be $\Cov(X,Y)/\Var(X)$. These two facts are the bridge from probability to
the estimation that follows.
:::
**Next time:** the [Normal distribution, sampling, and the Central Limit
Theorem](04-normal-clt.qmd).