3 Expectation, Variance & Covariance

Reading. SW 2.2–2.3, HGL Probability Primer P.3, P.5–P.6

A random variable is described by its whole distribution: a pmf, a pdf, a cdf. That is a lot of information. This chapter does the opposite of the last one: it boils a distribution down to a few numbers. We summarize where a distribution sits (its center, the mean), how spread out it is (its variance and standard deviation), and, for two variables at once, how they move together (covariance and correlation).

Why these three ideas matter

Every regression coefficient we estimate later is built from exactly these pieces. The slope of a regression line, for instance, will turn out to be $\Cov(x,y)/\Var(x)$, so this chapter is the toolkit for the rest of the course.

A running example: the “slips” population

We reuse the population behind the pmf from the last chapter. Ten slips sit in a hat; we draw one at random. Define two random variables on that draw:

$X$ = the number printed on the slip $(1,2,3,4)$;
$Y$ = an indicator: $Y = 1$ if the slip is shaded, $0$ if not.

The full description of how $X$ and $Y$ behave together is their joint pmf, $f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$. We can read it as a table, with the marginal distributions of $X$ and $Y$ sitting in the margins.

Show the R code

joint <- data.frame(
  Y      = c("$0$", "$1$", "$f_X(x)$"),
  x1     = c(0.0, 0.1, 0.1),
  x2     = c(0.1, 0.1, 0.2),
  x3     = c(0.2, 0.1, 0.3),
  x4     = c(0.3, 0.1, 0.4),
  margin = c(0.6, 0.4, 1.0)
)
knitr::kable(
  joint,
  col.names = c("$Y \\backslash X$", "$1$", "$2$", "$3$", "$4$", "$f_Y(y)$"),
  align = "cccccc"
)

Table 3.1: The joint pmf $f_{X,Y}(x,y)$, with marginals in the margins.

$Y \backslash X$	$1$	$2$	$3$	$4$	$f_Y(y)$
$0$	0.0	0.1	0.2	0.3	0.6
$1$	0.1	0.1	0.1	0.1	0.4
$f_X(x)$	0.1	0.2	0.3	0.4	1.0

There are two ways to read it. The body gives the joint probabilities, $f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$. The right and bottom margins give the distributions of $Y$ and of $X$ on their own. We will compute every number in this chapter from this one table.

3.1 Expected value (the mean)

Expected value

The expected value (or mean) of a discrete random variable $X$ is the probability-weighted average of its values: \[ \E(X) \;=\; \sum_{x} x\,f_X(x) \;=\; \mu_X . \]

The expected value is the long-run average of $X$ over many repetitions of the experiment. Notice that $\mu_X$ is a population parameter: a fixed feature of the population, written with a Greek letter. Later we will estimate these parameters from a sample.

Heads-up on names. The “mean” can refer to this population mean $\mu_X$ or to a sample average $\bar x$. They are different objects; keep track of which one is meant.

Example: the mean of $X$, and the mean of an indicator

For the number on the slip, $X$, we weight each value by its marginal probability: \[ \E(X) = \sum_x x\,f_X(x) = 1(0.1) + 2(0.2) + 3(0.3) + 4(0.4) = 3 . \] Draw thousands of slips and average the numbers: the running average settles down to $3$.

Paying off a promise about indicators

For the indicator $Y$ (a Bernoulli variable), with $p = \Prob(Y = 1)$, \[ \E(Y) = 0(1-p) + 1(p) = p . \] The mean of a $0/1$ variable is the proportion of ones. Here $\E(Y) = 0.4 = \Prob(\text{shaded})$.
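Both means can be reproduced in a few lines of base R. This is a minimal sketch: the vector names (`x`, `fx`, `y`, `fy`) are illustrative choices, and the probabilities are copied from the margins of Table 3.1.

```r
# Values and marginal probabilities read off the margins of Table 3.1
x  <- 1:4
fx <- c(0.1, 0.2, 0.3, 0.4)   # f_X(x)
y  <- 0:1
fy <- c(0.6, 0.4)             # f_Y(y)

# Expected value = probability-weighted average of the values
EX <- sum(x * fx)
EY <- sum(y * fy)   # for a 0/1 variable this equals P(Y = 1)

print(c(EX = EX, EY = EY))
```

The printed values should match the hand calculations above: $\E(X) = 3$ and $\E(Y) = 0.4$.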

This is the reason that, later, a regression on an indicator reads off a group’s share or a treatment effect; see dummy variables and treatment effects.

The expected value of a function of $X$

Any function $g(X)$ of a random variable is itself random. Its mean weights the transformed values by the same probabilities: \[ \E\!\left[g(X)\right] \;=\; \sum_{x} g(x)\,f_X(x). \]

Second moment of $X$

With $g(X) = X^2$, \[ \E(X^2) = \sum_x x^2 f_X(x) = 1(0.1) + 4(0.2) + 9(0.3) + 16(0.4) = 10 . \]

A trap to avoid

In general \[ \E\!\left[g(X)\right] \;\neq\; g\!\left(\E(X)\right). \] Here $\E(X^2) = 10$ but $\bigl(\E X\bigr)^2 = 3^2 = 9$. We will use $\E(X^2)$ in a moment to get the variance.
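The gap between $\E[g(X)]$ and $g(\E(X))$ is easy to see numerically. The sketch below uses the slip distribution with $g(x) = x^2$; the names `g`, `E_gX` and `g_EX` are illustrative.

```r
x  <- 1:4
fx <- c(0.1, 0.2, 0.3, 0.4)   # marginal pmf of X from Table 3.1

g <- function(v) v^2          # the transformation g(X) = X^2

E_gX <- sum(g(x) * fx)        # E[g(X)]: transform first, then average
g_EX <- g(sum(x * fx))        # g(E(X)): average first, then transform

print(c(E_gX = E_gX, g_EX = g_EX))
```

Swapping in any other nonlinear `g` will generally give two different numbers as well; the two agree for every distribution only when `g` is linear.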

Rules for expected values

Let $a, b, c$ be constants and $X, Y$ random variables. Expectation is a linear operator.

Linearity of expectation

\[ \begin{aligned} \E(aX + b) &= a\,\E(X) + b,\\ \E\!\left[g_1(X) + g_2(X)\right] &= \E\!\left[g_1(X)\right] + \E\!\left[g_2(X)\right],\\ \E(aX + bY + c) &= a\,\E(X) + b\,\E(Y) + c. \end{aligned} \]

In words: the expected value of a sum is the sum of the expected values, and constants pass straight through.

One caution about products

Linearity is about sums. For products, $\E(XY) = \E(X)\,\E(Y)$ is guaranteed when $X$ and $Y$ are independent but fails in general: the two sides differ by exactly the covariance (later in this chapter).

3.2 Variance & standard deviation

Variance and standard deviation

The variance of $X$ is the expected squared distance from the mean: \[ \Var(X) \;=\; \E\!\left[(X - \mu_X)^2\right] \;=\; \sigma_X^2 . \] The standard deviation $\sigma_X = \sqrt{\Var(X)}$ is in the same units as $X$.

A larger variance means the distribution is more spread out about its mean. Figure 3.1 shows two distributions with the same mean but different spreads: the flatter one has the larger variance.

Show the R code

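# Assumes ggplot2 is attached and `ucla` is a named list of colours from the course setup (not shown here)
xs <- seq(-6, 6, length.out = 400)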
dat <- rbind(
  data.frame(x = xs, y = dnorm(xs, 0, 1),   spread = "small variance"),
  data.frame(x = xs, y = dnorm(xs, 0, 2.2), spread = "large variance")
)
ggplot(dat, aes(x, y, color = spread)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = ucla$gray) +
  scale_color_manual(values = c("small variance" = ucla$blue,
                                "large variance" = ucla$red)) +
  labs(x = "x", y = expression(f[X](x)), color = NULL)

Figure 3.1: Two distributions with the same mean but different variances. The wider, flatter curve has the larger spread.

In practice we almost never compute the variance straight from the definition. The following algebraically equivalent formula is far easier to use.

The computational formula (use this one)

\[ \Var(X) \;=\; \E(X^2) - \mu_X^2 . \]

The derivation is a one-line expansion: $\E[(X - \mu)^2] = \E(X^2) - 2\mu\,\E(X) + \mu^2 = \E(X^2) - \mu^2$, since $\E(X) = \mu$.

Example: variance of $X$ and of an indicator

For the number on the slip, $X$, we already found $\E(X) = 3$ and $\E(X^2) = 10$, so \[ \Var(X) = \E(X^2) - \mu_X^2 = 10 - 3^2 = 1, \] and $\sigma_X = \sqrt{1} = 1$.

Variance of a Bernoulli

For the indicator $Y$ with $\E(Y) = p$ (and noting that $Y^2 = Y$, so $\E(Y^2) = p$), \[ \Var(Y) = p - p^2 = p(1-p). \] Here $\Var(Y) = 0.4(0.6) = 0.24$, so $\sigma_Y = \sqrt{0.24} \approx 0.49$.

A coin is most uncertain at $p = \tfrac{1}{2}$, where $p(1-p)$ is largest.
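As a check, the sketch below computes $\Var(X)$ both from the definition and from the computational formula, then evaluates $p(1-p)$ on a grid. The grid spacing of $0.01$ and all variable names are illustrative assumptions.

```r
x  <- 1:4
fx <- c(0.1, 0.2, 0.3, 0.4)   # marginal pmf of X from Table 3.1
mu_x <- sum(x * fx)

var_def  <- sum((x - mu_x)^2 * fx)   # definition: E[(X - mu)^2]
var_comp <- sum(x^2 * fx) - mu_x^2   # computational formula: E(X^2) - mu^2

# Bernoulli variance for the shaded indicator, p = P(Y = 1) = 0.4
p     <- 0.4
var_y <- p * (1 - p)

# Where is p(1 - p) largest? Search an evenly spaced grid on [0, 1].
p_grid <- seq(0, 1, by = 0.01)
p_star <- p_grid[which.max(p_grid * (1 - p_grid))]

print(c(var_def = var_def, var_comp = var_comp, var_y = var_y, p_star = p_star))
```

The two variance calculations should agree, and the grid search should land on $p = \tfrac{1}{2}$.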

Variance under a linear transformation

What happens to spread when we rescale and shift? Let $a, b$ be constants.

Mean and variance of $a + bX$

\[ \E(a + bX) = a + b\,\mu_X, \qquad \Var(a + bX) = b^2\,\Var(X), \qquad \sigma_{a + bX} = |b|\,\sigma_X . \]

The two constants play very different roles. An additive constant $a$ shifts the whole distribution: it moves the mean but leaves the spread unchanged. A multiplicative constant $b$ rescales: it multiplies the standard deviation by $|b|$ and the variance by $b^2$.

After-tax earnings

Tax pre-tax earnings $X$ at $20\%$ and add a \$2000 grant: $Y = 2000 + 0.8X$. Then $\mu_Y = 2000 + 0.8\,\mu_X$ and $\sigma_Y = 0.8\,\sigma_X$, so the spread of take-home pay is $80\%$ that of pre-tax pay.
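The rules for $a + bX$ can be verified by brute force. The earnings example gives no distribution for $X$, so this sketch applies the same constants ($a = 2000$, $b = 0.8$) to the slip distribution as a stand-in; that substitution is an assumption made purely for illustration.

```r
x  <- 1:4
fx <- c(0.1, 0.2, 0.3, 0.4)   # stand-in distribution: the slips from Table 3.1
a  <- 2000
b  <- 0.8

mu_x  <- sum(x * fx)
var_x <- sum((x - mu_x)^2 * fx)

# Transform each value; the probabilities are unchanged
w     <- a + b * x
mu_w  <- sum(w * fx)
var_w <- sum((w - mu_w)^2 * fx)

# Compare the direct calculation with the rules
print(c(mu_w = mu_w, rule_mean = a + b * mu_x))
print(c(var_w = var_w, rule_var = b^2 * var_x))
print(c(sd_w = sqrt(var_w), rule_sd = abs(b) * sqrt(var_x)))
```

The additive constant shows up only in the mean; the variance depends on `b` alone.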

A useful special case: standardization

Combining the two rules, we can turn any $X$ into a variable with mean $0$ and variance $1$. Subtract the mean and divide by the standard deviation: \[ Z \;=\; \frac{X - \mu_X}{\sigma_X}. \] Reading this as a linear transformation with $a = -\mu_X/\sigma_X$ and $b = 1/\sigma_X$, the rules give \[ \E(Z) = 0, \qquad \Var(Z) = \frac{\Var(X)}{\sigma_X^2} = 1 . \]
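Standardization can be checked on both slip variables. The helper `standardize_pmf()` below is a hypothetical function written for this sketch, not part of any package; it returns the mean and variance of $Z$ for a discrete distribution.

```r
# Hypothetical helper: mean and variance of Z = (X - mu) / sigma
standardize_pmf <- function(values, probs) {
  mu    <- sum(values * probs)
  sigma <- sqrt(sum((values - mu)^2 * probs))
  z     <- (values - mu) / sigma          # standardized values, same probabilities
  z_mean <- sum(z * probs)
  z_var  <- sum(z^2 * probs) - z_mean^2
  c(mean = z_mean, var = z_var)
}

zx <- standardize_pmf(1:4, c(0.1, 0.2, 0.3, 0.4))   # slip number X
zy <- standardize_pmf(0:1, c(0.6, 0.4))             # shaded indicator Y

print(rbind(X = zx, Y = zy))
```

Up to floating-point rounding, both rows should show mean $0$ and variance $1$, even though $X$ and $Y$ start on very different scales.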

Why we care

$Z$ is unit-free and measures “how many standard deviations from the mean.” This is exactly the move behind the $Z$-score and the standard Normal table, the heart of the next chapter.

3.3 Two variables: joint, marginal, conditional

Most economic questions involve two variables at once: income and education, price and quantity. We have already met the joint pmf in the running example; here we develop the two distributions we can extract from it.

Joint and marginal pmf

The joint pmf is $f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$, the probability that the two outcomes occur together. Its entries sum to $1$.

The marginal pmf is the distribution of one variable alone, obtained by summing the joint over the other: \[ f_X(x) = \sum_y f_{X,Y}(x,y). \]

From the slips table, summing down each column gives $f_X = (0.1, 0.2, 0.3, 0.4)$, and summing across each row gives $f_Y = (0.6,\,0.4)$. For instance, \[ \Prob(\text{shaded}) = f_Y(1) = 0.1 + 0.1 + 0.1 + 0.1 = 0.4 . \]
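Storing the body of the table as a numeric matrix makes "summing out" a one-liner. This is a sketch; the matrix name `pmf` and its layout (rows for $Y$, columns for $X$) are illustrative choices that mirror Table 3.1.

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE,
              dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))

total <- sum(pmf)      # a valid joint pmf sums to 1
fx    <- colSums(pmf)  # sum down each column -> marginal of X
fy    <- rowSums(pmf)  # sum across each row  -> marginal of Y

print(total)
print(fx)
print(fy)
```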

Conditional distributions

Often we want the distribution of $X$ within a subpopulation fixed by $Y$. Conditioning shrinks the population to just those cases, then renormalizes so the probabilities sum to one again.

Conditional pmf

\[ f_{X \given Y}(x \given y) = \Prob(X = x \given Y = y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} . \]

Shaded slips only

Among shaded slips ($Y = 1$, probability $0.4$), \[ f_{X \given Y}(x \given 1) = \frac{0.1}{0.4} = 0.25 \] for each $x$: once we know the slip is shaded, all four numbers are equally likely.
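In code, conditioning is "pick the row, then renormalize." The sketch below does this for both values of $Y$; the matrix `pmf` is an illustrative encoding of Table 3.1.

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE,
              dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))

# Keep one row (shrink the population), divide by its total (renormalize)
fx_given_shaded <- pmf["1", ] / sum(pmf["1", ])
fx_given_white  <- pmf["0", ] / sum(pmf["0", ])

print(fx_given_shaded)
print(fx_given_white)
```

Each conditional pmf sums to one. The shaded row is uniform, while the white row puts no mass on $x = 1$.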

Rain and the commute

Let $X = 0$ mean rain and $Y = 0$ a long commute. With $\Prob(\text{rain}) = 0.30$ and a rainy-and-long probability of $0.15$, \[ \Prob(\text{long} \given \text{rain}) = \frac{0.15}{0.30} = 0.50 . \]

Independence

$X$ and $Y$ are independent if knowing one tells you nothing about the other; equivalently, for all $x, y$, \[ f_{X \given Y}(x \given y) = f_X(x) \quad\Longleftrightarrow\quad f_{X,Y}(x,y) = f_X(x)\,f_Y(y). \] That is, the joint factors into the product of the marginals.

The slips are *not* independent

Check the corner $x = 1,\ y = 1$: \[ f_{X,Y}(1,1) = 0.1 \;\neq\; f_X(1)\,f_Y(1) = (0.1)(0.4) = 0.04 . \] A single violated cell is enough: $X$ and $Y$ are dependent. This makes sense: a slip numbered “1” is always shaded, since $f_{X,Y}(1,0) = 0$.
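The factorization test can be run on every cell at once by comparing the joint table with the outer product of its marginals. A minimal sketch, with `pmf` as an illustrative encoding of Table 3.1:

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE,
              dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))

fx <- colSums(pmf)
fy <- rowSums(pmf)

# What the joint table would look like if X and Y were independent
product <- outer(fy, fx)

# Independence requires the two tables to agree in every cell
is_independent <- isTRUE(all.equal(pmf, product, check.attributes = FALSE))

print(round(product, 2))
print(is_independent)
```

Comparing `product` with `pmf` cell by cell shows where the slips depart from independence, including the corner checked above.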

3.4 Conditional expectation

Conditional expectation

The conditional expectation $\E(X \given Y = y)$ is the mean computed with the conditional pmf: \[ \E(X \given Y = y) \;=\; \sum_x x\,f_{X \given Y}(x \given y). \]

This answers questions like “what is the mean wage among people with $16$ years of education?”, that is, $\E(\text{WAGE} \given \text{EDUC} = 16)$.

Slips, given shaded

\[ \E(X \given Y = 1) = \sum_x x\,f_{X \given Y}(x \given 1) = (1 + 2 + 3 + 4)(0.25) = 2.5 . \]

Note that $2.5$ is not a value $X$ can take: an expected value need not be attainable. Conditioning on white slips instead gives \[ \E(X \given Y = 0) = \tfrac{10}{3} \approx 3.33, \] while the unconditional mean is $\E(X) = 3$. So $\E(X \given Y)$ varies with $Y$: it is itself a function of the conditioning value.
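Both conditional means can be computed together by renormalizing every row of the joint table. This is a sketch; `pmf`, `cond` and `cond_mean` are illustrative names.

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE,
              dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))
x <- 1:4

# Divide each row by its total: row y now holds f_{X|Y}(x | y)
cond <- pmf / rowSums(pmf)

# E(X | Y = y) for y = 0 and y = 1
cond_mean <- as.vector(cond %*% x)
names(cond_mean) <- c("Y=0", "Y=1")

print(cond_mean)
```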

The law of iterated expectations

The conditional means must “average back” to the overall mean, weighted by how often each condition occurs.

Law of iterated expectations

\[ \E(X) \;=\; \sum_y \E(X \given Y = y)\,f_Y(y) \;=\; \E\!\left[\E(X \given Y)\right]. \]

Check it on the slips

\[ \E(X) = \underbrace{\tfrac{10}{3}}_{\E(X \given Y = 0)}(0.6) + \underbrace{2.5}_{\E(X \given Y = 1)}(0.4) = 2.0 + 1.0 = 3 \;\checkmark \]
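The same check in R: average the conditional means using the marginal of $Y$ as weights, then compare with the mean computed directly from the marginal of $X$. Variable names are illustrative.

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE)
x  <- 1:4
fy <- rowSums(pmf)

cond_mean <- as.vector((pmf / fy) %*% x)   # E(X | Y = 0), E(X | Y = 1)

EX_iterated <- sum(cond_mean * fy)         # E[ E(X | Y) ]
EX_direct   <- sum(x * colSums(pmf))       # E(X) from the marginal of X

print(c(iterated = EX_iterated, direct = EX_direct))
```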

Intuition. Mean adult height is the weighted average of the mean height of men and the mean height of women, with weights equal to their population shares.

Conditional variance, and a preview of regression

We can also measure spread within a subpopulation: \[ \Var(X \given Y = y) = \E\!\left[(X - \E(X \given Y = y))^2 \,\middle|\, Y = y\right]. \] For the slips, $\Var(X \given Y = 1) = \tfrac{5}{4}$ while $\Var(X \given Y = 0) = \tfrac{5}{9}$: the spread of $X$ differs across subpopulations, and either can exceed or fall short of the unconditional $\Var(X) = 1$.
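The two conditional variances quoted above can be reproduced by applying the computational formula within each row of the table. A sketch with illustrative names:

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE)
x <- 1:4

cond <- pmf / rowSums(pmf)                 # conditional pmfs, one per row

cond_mean <- as.vector(cond %*% x)         # E(X | Y = y)
cond_m2   <- as.vector(cond %*% x^2)       # E(X^2 | Y = y)
cond_var  <- cond_m2 - cond_mean^2         # Var(X | Y = y)
names(cond_var) <- c("Y=0", "Y=1")

print(cond_var)
```

The output should match $\tfrac{5}{9}$ for the white slips and $\tfrac{5}{4}$ for the shaded ones.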

Why conditional expectation is the punchline of the course

Among all functions $g(X)$, the conditional mean $\E(Y \given X)$ is the best predictor of $Y$ from $X$: it minimizes the mean squared prediction error $\E\!\left[(Y - g(X))^2\right]$. The regression line we build later is precisely a model for $\E(Y \given X)$.

3.5 Covariance & correlation

Covariance

The covariance of $X$ and $Y$ measures their linear association: \[ \Cov(X,Y) = \E\!\left[(X - \mu_X)(Y - \mu_Y)\right] = \E(XY) - \mu_X\mu_Y = \sigma_{XY}. \]

The sign tells the story. When $\sigma_{XY} > 0$, an above-average $X$ tends to come with an above-average $Y$ (points fall mostly in quadrants I and III of the mean-centered scatter). When $\sigma_{XY} < 0$, they move in opposite directions (quadrants II and IV). When $\sigma_{XY} \approx 0$, there is no linear tendency. Figure 3.2 shows a cloud with positive covariance.

Show the R code

pts <- data.frame(
  x = c(-3, -2.4, -2, -1.5, -1, -0.6, -0.3, 0.4, 0.7, 1, 1.4, 1.8, 2.2, 2.6, 3, 3.2),
  y = c(-2.4, -1.2, -2.6, -0.7, -1.6, 0.4, -1.1, 0.6, -0.5, 1.7, 0.6, 2.4, 1.1, 2.9, 1.8, 2.6)
)
quad <- data.frame(
  lab = c("I", "II", "III", "IV"),
  x   = c(2.6, -2.6, -2.6, 2.6),
  y   = c(3.4, 3.4, -3.4, -3.4)
)
ggplot(pts, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_point(color = ucla$blue, size = 1.6) +
  geom_text(data = quad, aes(x, y, label = lab), color = ucla$darkblue, size = 3.4) +
  scale_x_continuous(limits = c(-4, 4)) +
  scale_y_continuous(limits = c(-4, 4)) +
  labs(x = expression(X - mu[X]), y = expression(Y - mu[Y]))

Figure 3.2: Positive covariance: mean-centered points fall mostly in quadrants I and III.

Example: covariance of the slips

First the cross-moment. Only the shaded row $Y = 1$ contributes, since $Y = 0$ kills the product: \[ \E(XY) = \sum_{x,y} xy\,f_{X,Y}(x,y) = (1 + 2 + 3 + 4)(1)(0.1) = 1 . \] Then, using $\E(X) = 3$ and $\E(Y) = 0.4$, \[ \Cov(X,Y) = \E(XY) - \mu_X\mu_Y = 1 - (3)(0.4) = -0.2 . \] The covariance is negative: larger numbers are relatively more common on the white slips, so a high $X$ goes with $Y = 0$. This is consistent with the dependence we found earlier.
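The covariance calculation translates directly to R: build a table of the products $xy$, weight it by the joint pmf, and subtract the product of the means. A minimal sketch with illustrative names:

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE)
x <- 1:4
y <- 0:1

xy  <- outer(y, x)             # table of products y * x, same layout as pmf
EXY <- sum(xy * pmf)           # cross-moment E(XY)

EX <- sum(x * colSums(pmf))
EY <- sum(y * rowSums(pmf))

cov_xy <- EXY - EX * EY        # Cov(X, Y) = E(XY) - mu_X * mu_Y

print(c(EXY = EXY, cov_xy = cov_xy))
```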

Correlation: a unit-free covariance

Covariance has awkward units (here “slip-number $\times$ shaded”), and its size is hard to read. Dividing by the standard deviations fixes both.

Correlation

\[ \rho_{XY} \;=\; \frac{\Cov(X,Y)}{\sqrt{\Var(X)}\,\sqrt{\Var(Y)}} \;=\; \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y}, \qquad -1 \le \rho_{XY} \le 1 . \]

For the slips, \[ \rho_{XY} = \frac{-0.2}{\sqrt{1}\,\sqrt{0.24}} \approx -0.41 . \] The correlation hits $\rho = \pm 1$ exactly when $X$ is a perfect linear function of $Y$, and $\rho = 0$ means no linear association.
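Putting the pieces together, the population correlation of the slips can be computed from the joint table alone. This is a sketch with illustrative names; note that base R's `cor()` works on data vectors, not on a pmf, so the formula is applied by hand.

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE)
x  <- 1:4
y  <- 0:1
fx <- colSums(pmf)
fy <- rowSums(pmf)

EX <- sum(x * fx);  var_x <- sum(x^2 * fx) - EX^2
EY <- sum(y * fy);  var_y <- sum(y^2 * fy) - EY^2
cov_xy <- sum(outer(y, x) * pmf) - EX * EY

# Correlation = covariance divided by both standard deviations
rho <- cov_xy / (sqrt(var_x) * sqrt(var_y))

print(round(rho, 2))
```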

A real-data anchor

The food-expenditure vs. income data from the first chapter has correlation $\rho \approx 0.62$, a moderate, positive linear association, matching its upward-sloping cloud (Figure 3.3).

Show the R code

data(food)
rho <- cor(food$income, food$food_exp)
ggplot(food, aes(income, food_exp)) +
  geom_point(color = ucla$blue, size = 1.8, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = ucla$red, linewidth = 1) +
  annotate("text", x = min(food$income), y = max(food$food_exp),
           hjust = 0, vjust = 1, color = ucla$darkblue,
           label = paste0("rho = ", round(rho, 2))) +
  labs(x = "income ($100/week)", y = "food expenditure ($/week)")

Figure 3.3: Weekly food expenditure against income (POE5 `food`); the correlation is about 0.62.

Independence, covariance, and a crucial caveat

Independence implies zero covariance

If $X$ and $Y$ are independent, then $\Cov(X,Y) = 0$ and $\rho_{XY} = 0$.

The converse does *not* hold

$\Cov(X,Y) = 0$ does not imply independence. Covariance only sees linear association; variables can be tightly related in a nonlinear way yet have zero covariance.

Zero covariance, total dependence

Let points lie on the circle $X^2 + Y^2 = 1$, symmetric about the axes. Then $\Cov(X,Y) = 0$, yet $X$ and $Y$ are completely dependent: knowing $X$ pins $Y$ down to $\pm\sqrt{1 - X^2}$ (Figure 3.4).

Show the R code

theta <- seq(0, 2 * pi, length.out = 200)
circ  <- data.frame(x = cos(theta), y = sin(theta))
ggplot(circ, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_path(color = ucla$blue, linewidth = 1) +
  coord_equal() +
  labs(x = "X", y = "Y")

Figure 3.4: Points on a circle have zero covariance yet are completely dependent.
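A numerical version of the circle example, assuming a population of 200 equally likely points at evenly spaced angles (the count and the variable names are illustrative choices):

```r
# 200 equally likely points evenly spaced around the unit circle
# (drop the last angle so that 0 and 2*pi are not counted twice)
theta <- seq(0, 2 * pi, length.out = 201)[-201]
x <- cos(theta)
y <- sin(theta)

# Population covariance with equal weights: E(XY) - E(X) E(Y)
cov_xy <- mean(x * y) - mean(x) * mean(y)

# Yet every point satisfies the exact relationship X^2 + Y^2 = 1
max_dev <- max(abs(x^2 + y^2 - 1))

print(c(cov_xy = cov_xy, max_dev = max_dev))
```

Both numbers should be zero up to floating-point rounding: there is no linear association, but the nonlinear relationship holds exactly.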

3.6 Mean & variance of linear combinations

We constantly build new variables as weighted sums of others: a portfolio, a sample average, a regression fit. Start with the mean: it is always linear.

Mean of a linear combination

\[ \E(aX + bY + c) \;=\; a\,\E(X) + b\,\E(Y) + c, \] whether or not $X$ and $Y$ are independent. This extends to any number of terms, \[ \E\!\left(\sum_i a_i X_i\right) = \sum_i a_i\,\E(X_i). \]

No assumptions are needed: expectation does not care about dependence.

Variance is a different story.

Variance of a linear combination

\[ \Var(aX + bY) = a^2\Var(X) + b^2\Var(Y) + 2ab\,\Cov(X,Y). \]

A covariance term appears, so variance is not linear. Two special cases are worth memorizing: \[ \Var(X + Y) = \Var(X) + \Var(Y) + 2\Cov(X,Y), \] \[ \Var(X - Y) = \Var(X) + \Var(Y) - 2\Cov(X,Y). \]

The headline

The variance of a sum is not the sum of the variances, unless the variables are uncorrelated.
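The slips make a convenient test case because $\Cov(X,Y) = -0.2 \neq 0$. The sketch below computes $\Var(X + Y)$ and $\Var(X - Y)$ directly from the joint table and compares them with the formulas; the helper `var_of()` is a hypothetical function defined here for illustration.

```r
# Body of Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4
pmf <- matrix(c(0.0, 0.1, 0.2, 0.3,
                0.1, 0.1, 0.1, 0.1),
              nrow = 2, byrow = TRUE)
x <- 1:4
y <- 0:1

# Hypothetical helper: variance of a function of (X, Y), given its table of values
var_of <- function(values) sum(values^2 * pmf) - sum(values * pmf)^2

var_sum  <- var_of(outer(y, x, function(yy, xx) xx + yy))   # Var(X + Y), direct
var_diff <- var_of(outer(y, x, function(yy, xx) xx - yy))   # Var(X - Y), direct

# Ingredients for the formulas
var_x  <- var_of(outer(y, x, function(yy, xx) xx))
var_y  <- var_of(outer(y, x, function(yy, xx) yy))
cov_xy <- sum(outer(y, x) * pmf) - sum(x * colSums(pmf)) * sum(y * rowSums(pmf))

print(c(direct = var_sum,  formula = var_x + var_y + 2 * cov_xy))
print(c(direct = var_diff, formula = var_x + var_y - 2 * cov_xy))
```

Simply adding the two variances would give $1.24$, which matches neither the sum nor the difference.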

The independent (or uncorrelated) case

When $\Cov(X,Y) = 0$ (in particular, when $X$ and $Y$ are independent), the cross term vanishes and variance does add: \[ \Var(aX + bY) = a^2\Var(X) + b^2\Var(Y), \qquad \Var(X \pm Y) = \Var(X) + \Var(Y). \]

Looking ahead

The sample mean $\bar X = \tfrac{1}{n}\sum_{i=1}^n X_i$ is a linear combination of independent draws. These rules give \[ \E(\bar X) = \mu, \qquad \Var(\bar X) = \frac{\sigma^2}{n}. \] The variance shrinks as $n$ grows: this is the reason larger samples are more informative, and the seed of the Central Limit Theorem.
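For a small case the result can be verified exactly, with no simulation. The sketch below assumes $n = 2$ independent draws of the slip number (that is, drawing with replacement, so each draw has pmf $f_X$) and enumerates all 16 pairs; the choice of $n$ and the variable names are illustrative.

```r
x  <- 1:4
fx <- c(0.1, 0.2, 0.3, 0.4)   # marginal pmf of X from Table 3.1

mu     <- sum(x * fx)
sigma2 <- sum(x^2 * fx) - mu^2

n <- 2
pair_prob <- outer(fx, fx)          # P(X1 = i, X2 = j) under independence
pair_mean <- outer(x, x, "+") / n   # sample mean for each pair (i, j)

E_xbar   <- sum(pair_mean * pair_prob)
Var_xbar <- sum(pair_mean^2 * pair_prob) - E_xbar^2

print(c(E_xbar = E_xbar, mu = mu))
print(c(Var_xbar = Var_xbar, sigma2_over_n = sigma2 / n))
```

With $\sigma^2 = 1$ and $n = 2$, the variance of the sample mean should come out at one half.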

3.7 Recap

For a single variable, the mean $\E(X) = \sum_x x\,f_X(x)$ locates the center and the variance $\Var(X) = \E(X^2) - \mu^2$ measures the spread. Expectation is linear, but in general $\E[g(X)] \neq g(\E X)$; a linear rescaling obeys $\Var(a + bX) = b^2\Var(X)$; and for an indicator, $\E = p$ and $\Var = p(1-p)$.

For two variables, we move from the joint pmf to a marginal (by summing out) to a conditional (by dividing), with independence characterized by $f_{X,Y} = f_X f_Y$. Their linear association is captured by $\Cov = \E(XY) - \mu_X\mu_Y$ and the unit-free $\rho = \sigma_{XY}/(\sigma_X \sigma_Y)$. Independence implies $\Cov = 0$, but not conversely. And the variance of a sum carries a covariance term: $\Var(X + Y) = \Var X + \Var Y + 2\Cov(X,Y)$.

The thread to regression

$\E(Y \given X)$ is the best predictor of $Y$, and the regression slope will turn out to be $\Cov(X,Y)/\Var(X)$. These two facts are the bridge from probability to the estimation that follows.

Next time: the Normal distribution, sampling, and the Central Limit Theorem.

--- title: "Expectation, Variance & Covariance" --- {{< include _setup.qmd >}} > **Reading.** SW sec. 2.2--2.3, HGL Probability Primer sec. P.3, P.5--P.6 A [random variable](02-random-vars.qmd) is described by its whole *distribution* --- a pmf, a pdf, a cdf. That is a lot of information. This chapter does the opposite of the last one: it boils a distribution down to a few **numbers**. We summarize where a distribution sits (its *center*, the mean), how spread out it is (its *variance* and standard deviation), and --- for *two* variables at once --- how they *move together* (covariance and correlation). ::: {.keyidea title="Why these three ideas matter"} Every regression coefficient we estimate later is built from exactly these pieces. The slope of a regression line, for instance, will turn out to be $\Cov(x,y)/\Var(x)$ --- so this chapter is the toolkit for the rest of the course. ::: ### A running example: the "slips" population We reuse the population behind the pmf from the [last chapter](02-random-vars.qmd). Ten slips sit in a hat; we draw one at random. Define two random variables on that draw: - $X$ = the **number** printed on the slip $(1,2,3,4)$; - $Y$ = an **indicator**: $Y = 1$ if the slip is shaded, $0$ if not. The full description of how $X$ and $Y$ behave *together* is their **joint pmf**, $f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$. We can read it as a table, with the **marginal** distributions of $X$ and $Y$ sitting in the margins. ```{r} #| label: tbl-joint #| tbl-cap: "The joint pmf $f_{X,Y}(x,y)$, with marginals in the margins." joint <- data.frame( Y = c("$0$", "$1$", "$f_X(x)$"), x1 = c(0.0, 0.1, 0.1), x2 = c(0.1, 0.1, 0.2), x3 = c(0.2, 0.1, 0.3), x4 = c(0.3, 0.1, 0.4), margin = c(0.6, 0.4, 1.0) ) knitr::kable( joint, col.names = c("$Y \\backslash X$", "$1$", "$2$", "$3$", "$4$", "$f_Y(y)$"), align = "cccccc" ) ``` There are two ways to read it. The **body** gives the joint probabilities, $f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$. 
The **right and bottom margins** give the distributions of $Y$ and of $X$ on their own. We will compute *every* number in this chapter from this one table. ## Expected value (the mean) {#sec-expectation} ::: {.definition title="Expected value"} The **expected value** (or **mean**) of a discrete random variable $X$ is the probability-weighted average of its values: $$ \E(X) \;=\; \sum_{x} x\,f_X(x) \;=\; \mu_X . $$ ::: The expected value is the **long-run average** of $X$ over many repetitions of the experiment. Notice that $\mu_X$ is a **population parameter** --- a fixed feature of the population, written with a Greek letter. Later we will *estimate* these parameters from a sample. ::: {.callout-note appearance="simple"} **Heads-up on names.** The "mean" can refer to this *population* mean $\mu_X$ *or* to a *sample* average $\bar x$. They are different objects --- keep track of which one is meant. ::: ### Example: the mean of $X$, and the mean of an indicator For the **number on the slip**, $X$, we weight each value by its marginal probability: $$ \E(X) = \sum_x x\,f_X(x) = 1(0.1) + 2(0.2) + 3(0.3) + 4(0.4) = 3 . $$ Draw thousands of slips and average the numbers --- the running average settles down to $3$. ::: {.example title="Paying off a promise about indicators"} For the **indicator** $Y$ (a Bernoulli variable), with $p = \Prob(Y = 1)$, $$ \E(Y) = 0(1-p) + 1(p) = p . $$ The mean of a $0/1$ variable *is* the proportion of ones. Here $\E(Y) = 0.4 = \Prob(\text{shaded})$. ::: This is the reason that, later, a regression on an indicator reads off a group's *share* or a *treatment effect* --- see [dummy variables](19-dummy-variables.qmd) and [treatment effects](20-treatment-effects.qmd). ### The expected value of a function of $X$ Any function $g(X)$ of a random variable is itself random. Its mean weights the *transformed* values by the *same* probabilities: $$ \E\!\left[g(X)\right] \;=\; \sum_{x} g(x)\,f_X(x). 
$$ ::: {.example title="Second moment of $X$"} With $g(X) = X^2$, $$ \E(X^2) = \sum_x x^2 f_X(x) = 1(0.1) + 4(0.2) + 9(0.3) + 16(0.4) = 10 . $$ ::: ::: {.warningbox title="A trap to avoid"} In general $$ \E\!\left[g(X)\right] \;\neq\; g\!\left(\E(X)\right). $$ Here $\E(X^2) = 10$ but $\bigl(\E X\bigr)^2 = 3^2 = 9$. We will use $\E(X^2)$ in a moment to get the variance. ::: ### Rules for expected values Let $a, b, c$ be constants and $X, Y$ random variables. Expectation is a **linear** operator. ::: {.property title="Linearity of expectation"} $$ \begin{aligned} \E(aX + b) &= a\,\E(X) + b,\\ \E\!\left[g_1(X) + g_2(X)\right] &= \E\!\left[g_1(X)\right] + \E\!\left[g_2(X)\right],\\ \E(aX + bY + c) &= a\,\E(X) + b\,\E(Y) + c. \end{aligned} $$ ::: In words: *the expected value of a sum is the sum of the expected values*, and constants pass straight through. ::: {.warningbox title="One caution about products"} Linearity is about *sums*. For *products*, $\E(XY) = \E(X)\,\E(Y)$ holds **only when $X$ and $Y$ are independent** --- otherwise the covariance (later in this chapter) gets in the way. ::: ## Variance & standard deviation {#sec-variance} ::: {.definition title="Variance and standard deviation"} The **variance** of $X$ is the expected squared distance from the mean: $$ \Var(X) \;=\; \E\!\left[(X - \mu_X)^2\right] \;=\; \sigma_X^2 . $$ The **standard deviation** $\sigma_X = \sqrt{\Var(X)}$ is in the *same units* as $X$. ::: A larger variance means the distribution is more spread out about its mean. @fig-spread shows two distributions with the same mean but different spreads: the flatter one has the larger variance. ```{r} #| label: fig-spread #| fig-cap: "Two distributions with the same mean but different variances. The wider, flatter curve has the larger spread." 
#| fig-width: 5 #| fig-height: 3.4 xs <- seq(-6, 6, length.out = 400) dat <- rbind( data.frame(x = xs, y = dnorm(xs, 0, 1), spread = "small variance"), data.frame(x = xs, y = dnorm(xs, 0, 2.2), spread = "large variance") ) ggplot(dat, aes(x, y, color = spread)) + geom_line(linewidth = 1) + geom_vline(xintercept = 0, linetype = "dashed", color = ucla$gray) + scale_color_manual(values = c("small variance" = ucla$blue, "large variance" = ucla$red)) + labs(x = "x", y = expression(f[X](x)), color = NULL) ``` In practice we almost never compute the variance straight from the definition. The following algebraically equivalent formula is far easier to use. ::: {.property title="The computational formula (use this one)"} $$ \Var(X) \;=\; \E(X^2) - \mu_X^2 . $$ ::: The derivation is a one-line expansion: $\E[(X - \mu)^2] = \E(X^2) - 2\mu\,\E(X) + \mu^2 = \E(X^2) - \mu^2$, since $\E(X) = \mu$. ### Example: variance of $X$ and of an indicator For the **number on the slip**, $X$, we already found $\E(X) = 3$ and $\E(X^2) = 10$, so $$ \Var(X) = \E(X^2) - \mu_X^2 = 10 - 3^2 = 1, $$ and $\sigma_X = \sqrt{1} = 1$. ::: {.example title="Variance of a Bernoulli"} For the indicator $Y$ with $\E(Y) = p$ --- and noting $Y^2 = Y$, so $\E(Y^2) = p$ --- $$ \Var(Y) = p - p^2 = p(1-p). $$ Here $\Var(Y) = 0.4(0.6) = 0.24$, so $\sigma_Y = \sqrt{0.24} \approx 0.49$. ::: A coin is most uncertain at $p = \tfrac{1}{2}$, where $p(1-p)$ is largest. ### Variance under a linear transformation What happens to spread when we rescale and shift? Let $a, b$ be constants. ::: {.property title="Mean and variance of $a + bX$"} $$ \E(a + bX) = a + b\,\mu_X, \qquad \Var(a + bX) = b^2\,\Var(X), \qquad \sigma_{a + bX} = |b|\,\sigma_X . $$ ::: The two constants play very different roles. An additive constant $a$ **shifts** the whole distribution --- it moves the mean but leaves the spread unchanged. 
A multiplicative constant $b$ **rescales** --- it multiplies the standard deviation by $|b|$ and the variance by $b^2$. ::: {.example title="After-tax earnings"} Tax pre-tax earnings $X$ at $20\%$ and add a \$2000 grant: $Y = 2000 + 0.8X$. Then $\mu_Y = 2000 + 0.8\,\mu_X$ and $\sigma_Y = 0.8\,\sigma_X$ --- the spread of take-home pay is $80\%$ that of pre-tax pay. ::: ### A useful special case: standardization Combining the two rules, we can turn *any* $X$ into a variable with mean $0$ and variance $1$. Subtract the mean and divide by the standard deviation: $$ Z \;=\; \frac{X - \mu_X}{\sigma_X}. $$ Reading this as a linear transformation with $a = -\mu_X/\sigma_X$ and $b = 1/\sigma_X$, the rules give $$ \E(Z) = 0, \qquad \Var(Z) = \frac{\Var(X)}{\sigma_X^2} = 1 . $$ ::: {.keyidea title="Why we care"} $Z$ is **unit-free** and measures "how many standard deviations from the mean." This is exactly the move behind the *$Z$-score* and the standard Normal table --- the heart of the [next chapter](04-normal-clt.qmd). ::: ## Two variables: joint, marginal, conditional {#sec-joint} Most economic questions involve *two* variables at once: income *and* education, price *and* quantity. We have already met the **joint pmf** in the running example; here we develop the two distributions we can extract from it. ::: {.definition title="Joint and marginal pmf"} The **joint pmf** is $f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)$ --- the probability the two outcomes occur *together*. Its entries sum to $1$. The **marginal pmf** is the distribution of one variable alone, obtained by *summing the joint over the other*: $$ f_X(x) = \sum_y f_{X,Y}(x,y). $$ ::: From the slips table, summing **down each column** gives $f_X = (0.1, 0.2, 0.3, 0.4)$, and summing **across each row** gives $f_Y = (0.6,\,0.4)$. For instance, $$ \Prob(\text{shaded}) = f_Y(1) = 0.1 + 0.1 + 0.1 + 0.1 = 0.4 . $$ ### Conditional distributions Often we want the distribution of $X$ *within a subpopulation* fixed by $Y$. 
Conditioning **shrinks the population** to just those cases, then renormalizes so the probabilities sum to one again. ::: {.definition title="Conditional pmf"} $$ f_{X \given Y}(x \given y) = \Prob(X = x \given Y = y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} . $$ ::: ::: {.example title="Shaded slips only"} Among shaded slips ($Y = 1$, probability $0.4$), $$ f_{X \given Y}(x \given 1) = \frac{0.1}{0.4} = 0.25 $$ for each $x$ --- once we know the slip is shaded, all four numbers are equally likely. ::: ::: {.example title="Rain and the commute"} Let $X = 0$ mean rain and $Y = 0$ a long commute. With $\Prob(\text{rain}) = 0.30$ and a rainy-*and*-long probability of $0.15$, $$ \Prob(\text{long} \given \text{rain}) = \frac{0.15}{0.30} = 0.50 . $$ ::: ### Independence ::: {.definition title="Independence"} $X$ and $Y$ are **independent** if knowing one tells you *nothing* about the other --- equivalently, for *all* $x, y$, $$ f_{X \given Y}(x \given y) = f_X(x) \quad\Longleftrightarrow\quad f_{X,Y}(x,y) = f_X(x)\,f_Y(y). $$ That is, the joint factors into the product of the marginals. ::: ::: {.example title="The slips are *not* independent"} Check the corner $x = 1,\ y = 1$: $$ f_{X,Y}(1,1) = 0.1 \;\neq\; f_X(1)\,f_Y(1) = (0.1)(0.4) = 0.04 . $$ A single violated cell is enough --- $X$ and $Y$ are **dependent**. This makes sense: shaded slips are never a "1." ::: ## Conditional expectation {#sec-cond-expectation} ::: {.definition title="Conditional expectation"} The **conditional expectation** $\E(X \given Y = y)$ is the mean computed with the *conditional* pmf: $$ \E(X \given Y = y) \;=\; \sum_x x\,f_{X \given Y}(x \given y). $$ ::: This answers questions like "what is the mean wage *among* people with $16$ years of education?", that is, $\E(\text{WAGE} \given \text{EDUC} = 16)$. ::: {.example title="Slips, given shaded"} $$ \E(X \given Y = 1) = \sum_x x\,f_{X \given Y}(x \given 1) = (1 + 2 + 3 + 4)(0.25) = 2.5 . 
$$ ::: Note that $2.5$ is **not a value $X$ can take** --- an expected value need not be attainable. Conditioning on white slips instead gives $$ \E(X \given Y = 0) = \tfrac{10}{3} \approx 3.33, $$ while the *unconditional* mean is $\E(X) = 3$. So $\E(X \given Y)$ **varies with $Y$**: it is itself a function of the conditioning value. ### The law of iterated expectations The conditional means must "average back" to the overall mean, weighted by how often each condition occurs. ::: {.property title="Law of iterated expectations"} $$ \E(X) \;=\; \sum_y \E(X \given Y = y)\,f_Y(y) \;=\; \E\!\left[\E(X \given Y)\right]. $$ ::: ::: {.example title="Check it on the slips"} $$ \E(X) = \underbrace{\tfrac{10}{3}}_{\E(X \given Y = 0)}(0.6) + \underbrace{2.5}_{\E(X \given Y = 1)}(0.4) = 2.0 + 1.0 = 3 \;\checkmark $$ ::: ::: {.callout-note appearance="simple"} **Intuition.** Mean adult height is the mean height of men and of women, weighted by their population shares. ::: ### Conditional variance --- and a preview of regression We can also measure *spread* within a subpopulation: $$ \Var(X \given Y = y) = \E\!\left[(X - \E(X \given Y = y))^2 \,\middle|\, Y = y\right]. $$ For the slips, $\Var(X \given Y = 1) = \tfrac{5}{4}$ while $\Var(X \given Y = 0) = \tfrac{5}{9}$: the spread of $X$ differs across subpopulations, and either can exceed or fall short of the unconditional $\Var(X) = 1$. ::: {.keyidea title="Why conditional expectation is the punchline of the course"} Among *all* functions $g(X)$, the conditional mean $\E(Y \given X)$ is the **best predictor** of $Y$ from $X$ --- it minimizes the mean squared prediction error $\E\!\left[(Y - g(X))^2\right]$. The [regression line](05-simple-regression.qmd) we build later is precisely a model for $\E(Y \given X)$. 
:::

## Covariance & correlation {#sec-covariance}

::: {.definition title="Covariance"}
The **covariance** of $X$ and $Y$ measures their *linear* association:
$$
\Cov(X,Y) = \E\!\left[(X - \mu_X)(Y - \mu_Y)\right] = \E(XY) - \mu_X\mu_Y = \sigma_{XY}.
$$
:::

The sign tells the story. When $\sigma_{XY} > 0$, an above-average $X$ *tends* to come with an above-average $Y$ (points fall mostly in quadrants I and III of the mean-centered scatter). When $\sigma_{XY} < 0$, they move in *opposite* directions (quadrants II and IV). When $\sigma_{XY} \approx 0$, there is no *linear* tendency. @fig-cov-quadrants shows a cloud with positive covariance.

```{r}
#| label: fig-cov-quadrants
#| fig-cap: "Positive covariance: mean-centered points fall mostly in quadrants I and III."
#| fig-width: 5
#| fig-height: 3.6
# Mean-centered points with a positive linear tendency
pts <- data.frame(
  x = c(-3, -2.4, -2, -1.5, -1, -0.6, -0.3, 0.4, 0.7, 1, 1.4, 1.8, 2.2, 2.6, 3, 3.2),
  y = c(-2.4, -1.2, -2.6, -0.7, -1.6, 0.4, -1.1, 0.6, -0.5, 1.7, 0.6, 2.4, 1.1, 2.9, 1.8, 2.6)
)
# Quadrant labels placed near the corners of the plot
quad <- data.frame(
  lab = c("I", "II", "III", "IV"),
  x = c(2.6, -2.6, -2.6, 2.6),
  y = c(3.4, 3.4, -3.4, -3.4)
)
ggplot(pts, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_point(color = ucla$blue, size = 1.6) +
  geom_text(data = quad, aes(x, y, label = lab), color = ucla$darkblue, size = 3.4) +
  scale_x_continuous(limits = c(-4, 4)) +
  scale_y_continuous(limits = c(-4, 4)) +
  labs(x = expression(X - mu[X]), y = expression(Y - mu[Y]))
```

### Example: covariance of the slips

First the cross-moment. Only the shaded row $Y = 1$ contributes, since $Y = 0$ kills the product:
$$
\E(XY) = \sum_{x,y} xy\,f_{X,Y}(x,y) = (1 + 2 + 3 + 4)(1)(0.1) = 1 .
$$
Then, using $\E(X) = 3$ and $\E(Y) = 0.4$,
$$
\Cov(X,Y) = \E(XY) - \mu_X\mu_Y = 1 - (3)(0.4) = -0.2 .
$$
The covariance is **negative**: larger numbers are relatively more common on the *white* slips, so a high $X$ goes with $Y = 0$.
This is consistent with the dependence we found earlier.

### Correlation: a unit-free covariance

Covariance has awkward units --- here "slip-number $\times$ shaded" --- and its size is hard to read. Dividing by the standard deviations fixes both.

::: {.definition title="Correlation"}
$$
\rho_{XY} \;=\; \frac{\Cov(X,Y)}{\sqrt{\Var(X)}\,\sqrt{\Var(Y)}} \;=\; \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y}, \qquad -1 \le \rho_{XY} \le 1 .
$$
:::

For the slips, using $\Var(X) = 1$ and $\Var(Y) = (0.4)(0.6) = 0.24$,
$$
\rho_{XY} = \frac{-0.2}{\sqrt{1}\,\sqrt{0.24}} \approx -0.41 .
$$
The correlation hits $\rho = \pm 1$ exactly when $X$ is a perfect linear function of $Y$ (with nonzero slope), and $\rho = 0$ means no linear association.

::: {.example title="A real-data anchor"}
The food-expenditure vs. income data from the [first chapter](01-introduction.qmd) has correlation $\rho \approx 0.62$ --- a moderate, *positive* linear association, matching its upward-sloping cloud (@fig-food-cor).
:::

```{r}
#| label: fig-food-cor
#| fig-cap: "Weekly food expenditure against income (POE5 `food`); the correlation is about 0.62."
#| fig-width: 5
#| fig-height: 3.4
data(food)
# Sample correlation between income and food expenditure
rho <- cor(food$income, food$food_exp)
ggplot(food, aes(income, food_exp)) +
  geom_point(color = ucla$blue, size = 1.8, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = ucla$red, linewidth = 1) +
  # Print the correlation in the top-left corner of the panel
  annotate("text", x = min(food$income), y = max(food$food_exp),
           hjust = 0, vjust = 1, color = ucla$darkblue,
           label = paste0("rho = ", round(rho, 2))) +
  labs(x = "income ($100/week)", y = "food expenditure ($/week)")
```

### Independence, covariance, and a crucial caveat

::: {.property title="Independence implies zero covariance"}
If $X$ and $Y$ are **independent**, then $\Cov(X,Y) = 0$ and $\rho_{XY} = 0$.
:::

::: {.warningbox title="The converse does *not* hold"}
$\Cov(X,Y) = 0$ does **not** imply independence. Covariance only sees *linear* association; variables can be tightly related in a *nonlinear* way yet have zero covariance.
:::

::: {.example title="Zero covariance, total dependence"}
Let $(X, Y)$ be distributed on the circle $X^2 + Y^2 = 1$, symmetrically about both axes. Then $\Cov(X,Y) = 0$, yet $X$ and $Y$ are completely dependent --- knowing $X$ pins $Y$ down to $\pm\sqrt{1 - X^2}$ (@fig-circle).
:::

```{r}
#| label: fig-circle
#| fig-cap: "Points on a circle have zero covariance yet are completely dependent."
#| fig-width: 4
#| fig-height: 3.6
# Trace the unit circle with 200 evenly spaced angles
theta <- seq(0, 2 * pi, length.out = 200)
circ <- data.frame(x = cos(theta), y = sin(theta))
ggplot(circ, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_path(color = ucla$blue, linewidth = 1) +
  coord_equal() +
  labs(x = "X", y = "Y")
```

## Mean & variance of linear combinations {#sec-linear-comb}

We constantly build new variables as weighted sums of others --- a portfolio, a sample average, a regression fit. Start with the mean: it is *always* linear.

::: {.property title="Mean of a linear combination"}
$$
\E(aX + bY + c) \;=\; a\,\E(X) + b\,\E(Y) + c,
$$
*whether or not* $X$ and $Y$ are independent. This extends to any number of terms,
$$
\E\!\left(\sum_i a_i X_i\right) = \sum_i a_i\,\E(X_i).
$$
:::

No assumptions are needed --- expectation does not care about dependence. Variance is a different story.

::: {.property title="Variance of a linear combination"}
$$
\Var(aX + bY) = a^2\Var(X) + b^2\Var(Y) + 2ab\,\Cov(X,Y).
$$
:::

A **covariance term** appears, so variance is *not* linear. Two special cases are worth memorizing:
$$
\Var(X + Y) = \Var(X) + \Var(Y) + 2\Cov(X,Y),
$$
$$
\Var(X - Y) = \Var(X) + \Var(Y) - 2\Cov(X,Y).
$$

::: {.warningbox title="The headline"}
**The variance of a sum is *not* the sum of the variances** --- unless the variables are uncorrelated.
:::

### The independent (or uncorrelated) case

When $\Cov(X,Y) = 0$ --- in particular when $X$ and $Y$ are **independent** --- the cross term vanishes and variance *does* add:
$$
\Var(aX + bY) = a^2\Var(X) + b^2\Var(Y), \qquad \Var(X \pm Y) = \Var(X) + \Var(Y).
$$

::: {.keyidea title="Looking ahead"}
The **sample mean** $\bar X = \tfrac{1}{n}\sum_{i=1}^n X_i$ is a linear combination of independent draws, each with mean $\mu$ and variance $\sigma^2$. These rules give
$$
\E(\bar X) = \mu, \qquad \Var(\bar X) = \frac{\sigma^2}{n}.
$$
The variance shrinks as $n$ grows --- the reason larger samples are more informative, and the seed of the [Central Limit Theorem](04-normal-clt.qmd).
:::

## Recap {#sec-recap}

For a **single variable**, the mean $\E(X) = \sum_x x\,f_X(x)$ locates the center and the variance $\Var(X) = \E(X^2) - \mu^2$ measures the spread. Expectation is linear, but in general $\E[g(X)] \neq g(\E X)$; a linear rescaling obeys $\Var(a + bX) = b^2\Var(X)$; and for an indicator, $\E = p$ and $\Var = p(1-p)$.

For **two variables**, we move from the joint pmf to a marginal (by summing out) to a conditional (by dividing), with independence characterized by $f_{X,Y} = f_X f_Y$. Their linear association is captured by $\Cov = \E(XY) - \mu_X\mu_Y$ and the unit-free $\rho = \sigma_{XY}/(\sigma_X \sigma_Y)$. Independence implies $\Cov = 0$ --- but **not** conversely. And the variance of a sum carries a covariance term: $\Var(X + Y) = \Var X + \Var Y + 2\Cov(X,Y)$.

::: {.keyidea title="The thread to regression"}
$\E(Y \given X)$ is the best predictor of $Y$, and the regression slope will turn out to be $\Cov(X,Y)/\Var(X)$. These two facts are the bridge from probability to the estimation that follows.
:::

**Next time:** the [Normal distribution, sampling, and the Central Limit Theorem](04-normal-clt.qmd).
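
**Checking the slips by computer.** The hand calculations for the slips (conditional means and variances, the law of iterated expectations, covariance, and correlation) can be reproduced from the joint pmf table at the start of the chapter. What follows is a minimal base-R sketch rather than part of the chapter's own code; the object names (`joint_pmf`, `x_vals`, `y_vals`, `f_X_given_Y`, and the rest) are illustrative choices.

```{r}
# Joint pmf of the slips: rows are Y = 0, 1; columns are X = 1, 2, 3, 4.
# (Object names are illustrative, not taken from the chapter's own code.)
x_vals <- 1:4
y_vals <- 0:1
joint_pmf <- matrix(
  c(0.0, 0.1, 0.2, 0.3,
    0.1, 0.1, 0.1, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Y = as.character(y_vals), X = as.character(x_vals))
)

# Marginals: sum out the other variable.
f_X <- colSums(joint_pmf)
f_Y <- rowSums(joint_pmf)

# Conditional pmf of X given Y = y: divide each row by its row total.
f_X_given_Y <- sweep(joint_pmf, 1, f_Y, "/")

# Conditional means and variances of X, one entry per value of Y.
cond_mean <- as.vector(f_X_given_Y %*% x_vals)
cond_var  <- as.vector(f_X_given_Y %*% x_vals^2) - cond_mean^2

# Unconditional means and variances.
mu_X  <- sum(x_vals * f_X)
mu_Y  <- sum(y_vals * f_Y)
var_X <- sum(x_vals^2 * f_X) - mu_X^2
var_Y <- sum(y_vals^2 * f_Y) - mu_Y^2

# Law of iterated expectations: conditional means weighted by f_Y.
lie_mean <- sum(cond_mean * f_Y)

# Covariance via E(XY) - mu_X * mu_Y, then the correlation.
E_XY   <- sum(outer(y_vals, x_vals) * joint_pmf)
cov_XY <- E_XY - mu_X * mu_Y
rho_XY <- cov_XY / sqrt(var_X * var_Y)

round(c(E_X_given_Y0 = cond_mean[1], E_X_given_Y1 = cond_mean[2],
        LIE = lie_mean, Cov = cov_XY, rho = rho_XY), 3)
```

The printed values should agree, up to rounding, with the numbers worked out by hand above: conditional means of $\tfrac{10}{3}$ and $2.5$, an iterated-expectations average equal to $\E(X) = 3$, covariance $-0.2$, and correlation of about $-0.41$. `cond_var` holds the two conditional variances, $\tfrac{5}{9}$ and $\tfrac{5}{4}$.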
