\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

3  Expectation, Variance & Covariance

Reading. SW 2.2–2.3, HGL Probability Primer P.3, P.5–P.6

A random variable is described by its whole distribution: a pmf, a pdf, a cdf. That is a lot of information. This chapter does the opposite of the last one: it boils a distribution down to a few numbers. We summarize where a distribution sits (its center, the mean), how spread out it is (its variance and standard deviation), and, for two variables at once, how they move together (covariance and correlation).

Why these three ideas matter

Every regression coefficient we estimate later is built from exactly these pieces. The slope of a regression line, for instance, will turn out to be \(\Cov(x,y)/\Var(x)\), so this chapter is the toolkit for the rest of the course.

A running example: the “slips” population

We reuse the population behind the pmf from the last chapter. Ten slips sit in a hat; we draw one at random. Define two random variables on that draw:

  • \(X\) = the number printed on the slip \((1,2,3,4)\);
  • \(Y\) = an indicator: \(Y = 1\) if the slip is shaded, \(0\) if not.

The full description of how \(X\) and \(Y\) behave together is their joint pmf, \(f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)\). We can read it as a table, with the marginal distributions of \(X\) and \(Y\) sitting in the margins.

joint <- data.frame(
  Y      = c("$0$", "$1$", "$f_X(x)$"),
  x1     = c(0.0, 0.1, 0.1),
  x2     = c(0.1, 0.1, 0.2),
  x3     = c(0.2, 0.1, 0.3),
  x4     = c(0.3, 0.1, 0.4),
  margin = c(0.6, 0.4, 1.0)
)
knitr::kable(
  joint,
  col.names = c("$Y \\backslash X$", "$1$", "$2$", "$3$", "$4$", "$f_Y(y)$"),
  align = "cccccc"
)
Table 3.1: The joint pmf \(f_{X,Y}(x,y)\), with marginals in the margins.

\(Y \backslash X\)   \(1\)   \(2\)   \(3\)   \(4\)   \(f_Y(y)\)
\(0\)                0.0     0.1     0.2     0.3     0.6
\(1\)                0.1     0.1     0.1     0.1     0.4
\(f_X(x)\)           0.1     0.2     0.3     0.4     1.0

There are two ways to read it. The body gives the joint probabilities, \(f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)\). The right and bottom margins give the distributions of \(Y\) and of \(X\) on their own. We will compute every number in this chapter from this one table.
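As a minimal sketch, the table can be stored as a matrix in R and the margins recovered by summing. The object names (`joint`, `f_X`, `f_Y`) are illustrative choices, not part of the chapter's own code.

```r
# Joint pmf from Table 3.1: rows are Y = 0, 1; columns are X = 1, 2, 3, 4.
joint <- matrix(
  c(0.0, 0.1, 0.2, 0.3,
    0.1, 0.1, 0.1, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4"))
)

f_X <- colSums(joint)  # marginal of X: sum down each column
f_Y <- rowSums(joint)  # marginal of Y: sum across each row

stopifnot(isTRUE(all.equal(sum(joint), 1)))  # a pmf must sum to one
print(f_X)
print(f_Y)
```

Keeping the whole joint pmf in one object mirrors the text: every later quantity can be computed from it.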

3.1 Expected value (the mean)

Expected value

The expected value (or mean) of a discrete random variable \(X\) is the probability-weighted average of its values: \[ \E(X) \;=\; \sum_{x} x\,f_X(x) \;=\; \mu_X . \]

The expected value is the long-run average of \(X\) over many repetitions of the experiment. Notice that \(\mu_X\) is a population parameter: a fixed feature of the population, written with a Greek letter. Later we will estimate these parameters from a sample.

Heads-up on names. The “mean” can refer to this population mean \(\mu_X\) or to a sample average \(\bar x\). They are different objects, so keep track of which one is meant.

Example: the mean of \(X\), and the mean of an indicator

For the number on the slip, \(X\), we weight each value by its marginal probability: \[ \E(X) = \sum_x x\,f_X(x) = 1(0.1) + 2(0.2) + 3(0.3) + 4(0.4) = 3 . \] Draw thousands of slips and average the numbers: the running average settles down to \(3\).

Paying off a promise about indicators

For the indicator \(Y\) (a Bernoulli variable), with \(p = \Prob(Y = 1)\), \[ \E(Y) = 0(1-p) + 1(p) = p . \] The mean of a \(0/1\) variable is the proportion of ones. Here \(\E(Y) = 0.4 = \Prob(\text{shaded})\).
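The two means above can be checked numerically. The sketch below types in the marginals from Table 3.1 and also simulates draws of \(X\) to illustrate the long-run-average reading; the seed and the number of draws are arbitrary choices made here.

```r
x   <- 1:4
f_X <- c(0.1, 0.2, 0.3, 0.4)  # marginal pmf of X from Table 3.1
E_X <- sum(x * f_X)           # probability-weighted average

y   <- c(0, 1)
f_Y <- c(0.6, 0.4)            # marginal pmf of Y from Table 3.1
E_Y <- sum(y * f_Y)           # equals P(Y = 1)

# Long-run average: simulate many draws of X (seed and size are arbitrary).
set.seed(1)
draws   <- sample(x, size = 1e5, replace = TRUE, prob = f_X)
sim_avg <- mean(draws)

print(c(E_X = E_X, E_Y = E_Y, sim_avg = sim_avg))
```

The simulated average should land close to \(3\) but will not equal it exactly; only the weighted sum is the population mean.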

This is the reason that, later, a regression on an indicator reads off a group’s share or a treatment effect (see dummy variables and treatment effects).

The expected value of a function of \(X\)

Any function \(g(X)\) of a random variable is itself random. Its mean weights the transformed values by the same probabilities: \[ \E\!\left[g(X)\right] \;=\; \sum_{x} g(x)\,f_X(x). \]

Second moment of \(X\)

With \(g(X) = X^2\), \[ \E(X^2) = \sum_x x^2 f_X(x) = 1(0.1) + 4(0.2) + 9(0.3) + 16(0.4) = 10 . \]

A trap to avoid

In general \[ \E\!\left[g(X)\right] \;\neq\; g\!\left(\E(X)\right). \] Here \(\E(X^2) = 10\) but \(\bigl(\E X\bigr)^2 = 3^2 = 9\). We will use \(\E(X^2)\) in a moment to get the variance.
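A short numerical sketch of the trap, using the slips marginal and \(g(x) = x^2\). The function name `g` and the variable names are illustrative.

```r
x   <- 1:4
f_X <- c(0.1, 0.2, 0.3, 0.4)   # marginal pmf of X from Table 3.1
g   <- function(v) v^2         # the transformation g(X) = X^2

E_gX <- sum(g(x) * f_X)        # E[g(X)]: transform first, then average
g_EX <- g(sum(x * f_X))        # g(E[X]): average first, then transform

print(c(E_gX = E_gX, g_EX = g_EX))
```

The order of operations matters: transforming and then averaging gives \(10\), while averaging and then transforming gives \(9\).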

Rules for expected values

Let \(a, b, c\) be constants and \(X, Y\) random variables. Expectation is a linear operator.

Linearity of expectation

\[ \begin{aligned} \E(aX + b) &= a\,\E(X) + b,\\ \E\!\left[g_1(X) + g_2(X)\right] &= \E\!\left[g_1(X)\right] + \E\!\left[g_2(X)\right],\\ \E(aX + bY + c) &= a\,\E(X) + b\,\E(Y) + c. \end{aligned} \]

In words: the expected value of a sum is the sum of the expected values, and constants pass straight through.
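The third rule can be verified on the slips by brute force: average \(aX + bY + c\) over every cell of the joint pmf and compare with the formula. The constants below are arbitrary, and the additive constant is named `k` in the code only to avoid shadowing R's `c()` function.

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE)
xv <- 1:4       # values of X (columns)
yv <- c(0, 1)   # values of Y (rows)
a <- 2; b <- -3; k <- 5   # arbitrary constants; k plays the role of c in the text

# Left side: average aX + bY + k over every cell of the joint pmf.
vals <- outer(yv, xv, function(y, x) a * x + b * y + k)
lhs  <- sum(vals * joint)

# Right side: plug the two means into the linearity rule.
rhs <- a * sum(xv * colSums(joint)) + b * sum(yv * rowSums(joint)) + k

print(c(lhs = lhs, rhs = rhs))
```

Both sides agree even though the joint pmf was used as given, with no independence assumption imposed.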

One caution about products

Linearity is about sums. For products, \(\E(XY) = \E(X)\,\E(Y)\) is guaranteed when \(X\) and \(Y\) are independent; in general the two sides differ by the covariance (later in this chapter), so the equality holds exactly when \(X\) and \(Y\) are uncorrelated.

3.2 Variance & standard deviation

Variance and standard deviation

The variance of \(X\) is the expected squared distance from the mean: \[ \Var(X) \;=\; \E\!\left[(X - \mu_X)^2\right] \;=\; \sigma_X^2 . \] The standard deviation \(\sigma_X = \sqrt{\Var(X)}\) is in the same units as \(X\).

A larger variance means the distribution is more spread out about its mean. Figure 3.1 shows two distributions with the same mean but different spreads: the flatter one has the larger variance.

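library(ggplot2)  # `ucla` (a named list of colors) is assumed to be defined in the book's setup code
xs <- seq(-6, 6, length.out = 400)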
dat <- rbind(
  data.frame(x = xs, y = dnorm(xs, 0, 1),   spread = "small variance"),
  data.frame(x = xs, y = dnorm(xs, 0, 2.2), spread = "large variance")
)
ggplot(dat, aes(x, y, color = spread)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dashed", color = ucla$gray) +
  scale_color_manual(values = c("small variance" = ucla$blue,
                                "large variance" = ucla$red)) +
  labs(x = "x", y = expression(f[X](x)), color = NULL)
Figure 3.1: Two distributions with the same mean but different variances. The wider, flatter curve has the larger spread.

In practice we almost never compute the variance straight from the definition. The following algebraically equivalent formula is far easier to use.

The computational formula (use this one)

\[ \Var(X) \;=\; \E(X^2) - \mu_X^2 . \]

The derivation is a one-line expansion: \(\E[(X - \mu)^2] = \E(X^2) - 2\mu\,\E(X) + \mu^2 = \E(X^2) - \mu^2\), since \(\E(X) = \mu\).

Example: variance of \(X\) and of an indicator

For the number on the slip, \(X\), we already found \(\E(X) = 3\) and \(\E(X^2) = 10\), so \[ \Var(X) = \E(X^2) - \mu_X^2 = 10 - 3^2 = 1, \] and \(\sigma_X = \sqrt{1} = 1\).

Variance of a Bernoulli

For the indicator \(Y\) with \(\E(Y) = p\), and noting that \(Y^2 = Y\) so \(\E(Y^2) = p\), \[ \Var(Y) = p - p^2 = p(1-p). \] Here \(\Var(Y) = 0.4(0.6) = 0.24\), so \(\sigma_Y = \sqrt{0.24} \approx 0.49\).

A coin is most uncertain at \(p = \tfrac{1}{2}\), where \(p(1-p)\) is largest.
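The sketch below computes \(\Var(X)\) both from the definition and from the computational formula, then the Bernoulli variance, and finally scans a grid to see where \(p(1-p)\) peaks. The grid step of \(0.01\) is an arbitrary choice.

```r
x   <- 1:4
f_X <- c(0.1, 0.2, 0.3, 0.4)   # marginal pmf of X from Table 3.1
mu  <- sum(x * f_X)

var_def   <- sum((x - mu)^2 * f_X)   # definition: E[(X - mu)^2]
var_short <- sum(x^2 * f_X) - mu^2   # computational formula: E(X^2) - mu^2

p     <- 0.4                         # P(Y = 1) for the shaded indicator
var_Y <- p * (1 - p)                 # Bernoulli variance

# Where is p(1 - p) largest? Scan a grid of candidate p values.
p_grid <- seq(0, 1, by = 0.01)
p_star <- p_grid[which.max(p_grid * (1 - p_grid))]

print(c(var_def = var_def, var_short = var_short, var_Y = var_Y, p_star = p_star))
```

The two variance calculations agree, as the algebra says they must; the shortcut simply needs fewer subtractions.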

Variance under a linear transformation

What happens to spread when we rescale and shift? Let \(a, b\) be constants.

Mean and variance of \(a + bX\)

\[ \E(a + bX) = a + b\,\mu_X, \qquad \Var(a + bX) = b^2\,\Var(X), \qquad \sigma_{a + bX} = |b|\,\sigma_X . \]

The two constants play very different roles. An additive constant \(a\) shifts the whole distribution: it moves the mean but leaves the spread unchanged. A multiplicative constant \(b\) rescales: it multiplies the standard deviation by \(|b|\) and the variance by \(b^2\).

After-tax earnings

Tax pre-tax earnings \(X\) at \(20\%\) and add a $2000 grant: \(Y = 2000 + 0.8X\). Then \(\mu_Y = 2000 + 0.8\,\mu_X\) and \(\sigma_Y = 0.8\,\sigma_X\), so the spread of take-home pay is \(80\%\) that of pre-tax pay.
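To see the rule at work, the sketch below invents a small earnings distribution (the three earnings levels and their probabilities are hypothetical, chosen only for illustration) and computes the moments of take-home pay directly, without using the shortcut formulas.

```r
# Hypothetical pre-tax earnings distribution (illustrative numbers only).
earn <- c(20000, 40000, 60000)
prob <- c(0.3, 0.5, 0.2)

# Mean and standard deviation of a discrete variable with values v and pmf p.
mean_sd <- function(v, p) {
  m <- sum(v * p)
  c(mean = m, sd = sqrt(sum((v - m)^2 * p)))
}

pre  <- mean_sd(earn, prob)                # moments of X
post <- mean_sd(2000 + 0.8 * earn, prob)   # moments of Y = 2000 + 0.8 X, computed directly

print(rbind(pre = pre, post = post))
```

The directly computed mean of \(Y\) equals \(2000 + 0.8\,\mu_X\), and its standard deviation equals \(0.8\,\sigma_X\): the grant moves the center only, the tax rate scales the spread.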

A useful special case: standardization

Combining the two rules, we can turn any \(X\) into a variable with mean \(0\) and variance \(1\). Subtract the mean and divide by the standard deviation: \[ Z \;=\; \frac{X - \mu_X}{\sigma_X}. \] Reading this as a linear transformation with \(a = -\mu_X/\sigma_X\) and \(b = 1/\sigma_X\), the rules give \[ \E(Z) = 0, \qquad \Var(Z) = \frac{\Var(X)}{\sigma_X^2} = 1 . \]
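As a quick check on the slips, the sketch below standardizes \(X\) and recomputes the mean and variance of the result. Standardizing changes the values but not the probabilities attached to them.

```r
x   <- 1:4
f_X <- c(0.1, 0.2, 0.3, 0.4)            # marginal pmf of X from Table 3.1

mu    <- sum(x * f_X)                   # mean of X
sigma <- sqrt(sum((x - mu)^2 * f_X))    # standard deviation of X

z     <- (x - mu) / sigma               # standardized values, same probabilities
E_Z   <- sum(z * f_X)
Var_Z <- sum((z - E_Z)^2 * f_X)

print(c(E_Z = E_Z, Var_Z = Var_Z))
```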

Why we care

\(Z\) is unit-free and measures “how many standard deviations from the mean.” This is exactly the move behind the \(Z\)-score and the standard Normal table, the heart of the next chapter.

3.3 Two variables: joint, marginal, conditional

Most economic questions involve two variables at once: income and education, price and quantity. We have already met the joint pmf in the running example; here we develop the two distributions we can extract from it.

Joint and marginal pmf

The joint pmf is \(f_{X,Y}(x,y) = \Prob(X = x,\,Y = y)\), the probability that the two outcomes occur together. Its entries sum to \(1\).

The marginal pmf is the distribution of one variable alone, obtained by summing the joint over the other: \[ f_X(x) = \sum_y f_{X,Y}(x,y). \]

From the slips table, summing down each column gives \(f_X = (0.1, 0.2, 0.3, 0.4)\), and summing across each row gives \(f_Y = (0.6,\,0.4)\). For instance, \[ \Prob(\text{shaded}) = f_Y(1) = 0.1 + 0.1 + 0.1 + 0.1 = 0.4 . \]

Conditional distributions

Often we want the distribution of \(X\) within a subpopulation fixed by \(Y\). Conditioning shrinks the population to just those cases, then renormalizes so the probabilities sum to one again.

Conditional pmf

\[ f_{X \given Y}(x \given y) = \Prob(X = x \given Y = y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} . \]

Shaded slips only

Among shaded slips (\(Y = 1\), probability \(0.4\)), \[ f_{X \given Y}(x \given 1) = \frac{0.1}{0.4} = 0.25 \] for each \(x\): once we know the slip is shaded, all four numbers are equally likely.
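Conditioning on \(Y\) amounts to dividing each row of the joint table by its row total. A minimal sketch using base R's `prop.table()`; the object name `cond_X_given_Y` is an illustrative choice.

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE,
                dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))

# Divide each row by its row sum f_Y(y): row y now holds f_{X|Y}(x | y).
cond_X_given_Y <- prop.table(joint, margin = 1)

print(round(cond_X_given_Y, 3))
print(rowSums(cond_X_given_Y))   # each conditional pmf sums to one
```

The `Y = 1` row comes out flat at \(0.25\), while the `Y = 0` row puts no weight on \(x = 1\) and the most weight on \(x = 4\).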

Rain and the commute

Let \(X = 1\) mean rain and \(Y = 1\) a long commute. With \(\Prob(\text{rain}) = 0.30\) and a rainy-and-long probability of \(0.15\), \[ \Prob(\text{long} \given \text{rain}) = \frac{0.15}{0.30} = 0.50 . \]

Independence

Independence

\(X\) and \(Y\) are independent if knowing one tells you nothing about the other; equivalently, for all \(x, y\), \[ f_{X \given Y}(x \given y) = f_X(x) \quad\Longleftrightarrow\quad f_{X,Y}(x,y) = f_X(x)\,f_Y(y). \] That is, the joint factors into the product of the marginals.

The slips are not independent

Check the corner \(x = 1,\ y = 1\): \[ f_{X,Y}(1,1) = 0.1 \;\neq\; f_X(1)\,f_Y(1) = (0.1)(0.4) = 0.04 . \] A single violated cell is enough: \(X\) and \(Y\) are dependent. This makes sense: white slips are never a “1,” so drawing a “1” tells us the slip is shaded.
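The same check can be run on every cell at once by comparing the joint table with the outer product of its margins. This is a sketch; the tolerance `1e-12` is an arbitrary guard against floating-point noise.

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE,
                dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))
f_X <- colSums(joint)
f_Y <- rowSums(joint)

product <- outer(f_Y, f_X)   # what the joint would be under independence
gap     <- joint - product   # zero in every cell if and only if X and Y are independent
independent <- all(abs(gap) < 1e-12)

print(round(gap, 2))
print(independent)
```

The gap is nonzero in several cells, including \(0.06\) at \(x = 1,\ y = 1\), so the factorization fails.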

3.4 Conditional expectation

Conditional expectation

The conditional expectation \(\E(X \given Y = y)\) is the mean computed with the conditional pmf: \[ \E(X \given Y = y) \;=\; \sum_x x\,f_{X \given Y}(x \given y). \]

This answers questions like “what is the mean wage among people with \(16\) years of education?”, that is, \(\E(\text{WAGE} \given \text{EDUC} = 16)\).

Slips, given shaded

\[ \E(X \given Y = 1) = \sum_x x\,f_{X \given Y}(x \given 1) = (1 + 2 + 3 + 4)(0.25) = 2.5 . \]

Note that \(2.5\) is not a value \(X\) can take: an expected value need not be attainable. Conditioning on white slips instead gives \[ \E(X \given Y = 0) = \tfrac{10}{3} \approx 3.33, \] while the unconditional mean is \(\E(X) = 3\). So \(\E(X \given Y)\) varies with \(Y\): it is itself a function of the conditioning value.

The law of iterated expectations

The conditional means must “average back” to the overall mean, weighted by how often each condition occurs.

Law of iterated expectations

\[ \E(X) \;=\; \sum_y \E(X \given Y = y)\,f_Y(y) \;=\; \E\!\left[\E(X \given Y)\right]. \]

Check it on the slips

\[ \E(X) = \underbrace{\tfrac{10}{3}}_{\E(X \given Y = 0)}(0.6) + \underbrace{2.5}_{\E(X \given Y = 1)}(0.4) = 2.0 + 1.0 = 3 \;\checkmark \]
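A sketch of the same check in R: compute both conditional means from the joint table, then weight them by \(f_Y\) and compare with the mean computed directly from \(f_X\). Variable names are illustrative.

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE,
                dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))
xv  <- 1:4
f_Y <- rowSums(joint)

cond        <- prop.table(joint, margin = 1)   # rows hold f_{X|Y}(x | y)
E_X_given_Y <- as.vector(cond %*% xv)          # E(X | Y = 0), E(X | Y = 1)

E_X_lie    <- sum(E_X_given_Y * f_Y)           # average the conditional means
E_X_direct <- sum(xv * colSums(joint))         # mean from the marginal of X

print(E_X_given_Y)
print(c(E_X_lie = E_X_lie, E_X_direct = E_X_direct))
```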

Intuition. Mean adult height is the average of the mean height of men and the mean height of women, weighted by their population shares.

Conditional variance, and a preview of regression

We can also measure spread within a subpopulation: \[ \Var(X \given Y = y) = \E\!\left[(X - \E(X \given Y = y))^2 \,\middle|\, Y = y\right]. \] For the slips, \(\Var(X \given Y = 1) = \tfrac{5}{4}\) while \(\Var(X \given Y = 0) = \tfrac{5}{9}\): the spread of \(X\) differs across subpopulations, and either can exceed or fall short of the unconditional \(\Var(X) = 1\).
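The two conditional variances can be reproduced with the computational formula applied row by row, that is, \(\E(X^2 \given Y = y) - \E(X \given Y = y)^2\). A minimal sketch with illustrative names:

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE,
                dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))
xv   <- 1:4
cond <- prop.table(joint, margin = 1)           # rows hold f_{X|Y}(x | y)

E_X_given_Y   <- as.vector(cond %*% xv)         # conditional means
E_X2_given_Y  <- as.vector(cond %*% xv^2)       # conditional second moments
Var_X_given_Y <- E_X2_given_Y - E_X_given_Y^2   # one variance per value of Y

names(Var_X_given_Y) <- c("Y=0", "Y=1")
print(Var_X_given_Y)
```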

Why conditional expectation is the punchline of the course

Among all functions \(g(X)\), the conditional mean \(\E(Y \given X)\) is the best predictor of \(Y\) from \(X\): it minimizes the mean squared prediction error \(\E\!\left[(Y - g(X))^2\right]\). The regression line we build later is precisely a model for \(\E(Y \given X)\).
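As an illustration on the slips (an example chosen here, not one worked in the text), the sketch below predicts the shaded indicator \(Y\) from the number \(X\). It compares the mean squared error of the conditional mean \(\E(Y \given X = x)\) with that of one competing predictor, the constant \(\E(Y) = 0.4\). The helper `mse()` is a hypothetical name.

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE,
                dimnames = list(Y = c("0", "1"), X = c("1", "2", "3", "4")))

# E(Y | X = x) = P(Y = 1 | X = x): the shaded share within each column.
cond_mean <- joint["1", ] / colSums(joint)

# Mean squared error of a predictor that assigns pred[x] to each value of X.
mse <- function(pred) {
  sum((0 - pred)^2 * joint["0", ] + (1 - pred)^2 * joint["1", ])
}

mse_cond  <- mse(cond_mean)     # predictor g(X) = E(Y | X)
mse_const <- mse(rep(0.4, 4))   # predictor g(X) = E(Y), ignoring X

print(c(mse_cond = mse_cond, mse_const = mse_const))
```

The constant predictor's error equals \(\Var(Y) = 0.24\); the conditional mean does better. One comparison does not prove optimality, but it shows the direction the result points.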

3.5 Covariance & correlation

Covariance

The covariance of \(X\) and \(Y\) measures their linear association: \[ \Cov(X,Y) = \E\!\left[(X - \mu_X)(Y - \mu_Y)\right] = \E(XY) - \mu_X\mu_Y = \sigma_{XY}. \]

The sign tells the story. When \(\sigma_{XY} > 0\), an above-average \(X\) tends to come with an above-average \(Y\) (points fall mostly in quadrants I and III of the mean-centered scatter). When \(\sigma_{XY} < 0\), they move in opposite directions (quadrants II and IV). When \(\sigma_{XY} \approx 0\), there is no linear tendency. Figure 3.2 shows a cloud with positive covariance.

pts <- data.frame(
  x = c(-3, -2.4, -2, -1.5, -1, -0.6, -0.3, 0.4, 0.7, 1, 1.4, 1.8, 2.2, 2.6, 3, 3.2),
  y = c(-2.4, -1.2, -2.6, -0.7, -1.6, 0.4, -1.1, 0.6, -0.5, 1.7, 0.6, 2.4, 1.1, 2.9, 1.8, 2.6)
)
quad <- data.frame(
  lab = c("I", "II", "III", "IV"),
  x   = c(2.6, -2.6, -2.6, 2.6),
  y   = c(3.4, 3.4, -3.4, -3.4)
)
ggplot(pts, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_point(color = ucla$blue, size = 1.6) +
  geom_text(data = quad, aes(x, y, label = lab), color = ucla$darkblue, size = 3.4) +
  scale_x_continuous(limits = c(-4, 4)) +
  scale_y_continuous(limits = c(-4, 4)) +
  labs(x = expression(X - mu[X]), y = expression(Y - mu[Y]))
Figure 3.2: Positive covariance: mean-centered points fall mostly in quadrants I and III.

Example: covariance of the slips

First the cross-moment. Only the shaded row \(Y = 1\) contributes, since \(Y = 0\) kills the product: \[ \E(XY) = \sum_{x,y} xy\,f_{X,Y}(x,y) = (1 + 2 + 3 + 4)(1)(0.1) = 1 . \] Then, using \(\E(X) = 3\) and \(\E(Y) = 0.4\), \[ \Cov(X,Y) = \E(XY) - \mu_X\mu_Y = 1 - (3)(0.4) = -0.2 . \] The covariance is negative: larger numbers are relatively more common on the white slips, so a high \(X\) goes with \(Y = 0\). This is consistent with the dependence we found earlier.

Correlation: a unit-free covariance

Covariance has awkward units (here “slip-number \(\times\) shaded”), and its size is hard to read. Dividing by the standard deviations fixes both.

Correlation

\[ \rho_{XY} \;=\; \frac{\Cov(X,Y)}{\sqrt{\Var(X)}\,\sqrt{\Var(Y)}} \;=\; \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y}, \qquad -1 \le \rho_{XY} \le 1 . \]

For the slips, \[ \rho_{XY} = \frac{-0.2}{\sqrt{1}\,\sqrt{0.24}} \approx -0.41 . \] The correlation hits \(\rho = \pm 1\) exactly when \(X\) is a perfect linear function of \(Y\), and \(\rho = 0\) means no linear association.
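The covariance and correlation of the slips can be reproduced from the joint table in a few lines. A sketch with illustrative variable names:

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE)
xv <- 1:4       # values of X (columns)
yv <- c(0, 1)   # values of Y (rows)
f_X <- colSums(joint)
f_Y <- rowSums(joint)

E_X  <- sum(xv * f_X)
E_Y  <- sum(yv * f_Y)
E_XY <- sum(outer(yv, xv) * joint)    # cross-moment: sum of x * y * f(x, y)

cov_XY <- E_XY - E_X * E_Y
sd_X   <- sqrt(sum(xv^2 * f_X) - E_X^2)
sd_Y   <- sqrt(sum(yv^2 * f_Y) - E_Y^2)
rho    <- cov_XY / (sd_X * sd_Y)      # unit-free, always between -1 and 1

print(c(E_XY = E_XY, cov_XY = cov_XY, rho = rho))
```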

A real-data anchor

The food-expenditure vs. income data from the first chapter has correlation \(\rho \approx 0.62\), a moderate, positive linear association, matching its upward-sloping cloud (Figure 3.3).

data(food)
rho <- cor(food$income, food$food_exp)
ggplot(food, aes(income, food_exp)) +
  geom_point(color = ucla$blue, size = 1.8, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = ucla$red, linewidth = 1) +
  annotate("text", x = min(food$income), y = max(food$food_exp),
           hjust = 0, vjust = 1, color = ucla$darkblue,
           label = paste0("rho = ", round(rho, 2))) +
  labs(x = "income ($100/week)", y = "food expenditure ($/week)")
Figure 3.3: Weekly food expenditure against income (POE5 food); the correlation is about 0.62.

Independence, covariance, and a crucial caveat

Independence implies zero covariance

If \(X\) and \(Y\) are independent, then \(\Cov(X,Y) = 0\) and \(\rho_{XY} = 0\).

The converse does not hold

\(\Cov(X,Y) = 0\) does not imply independence. Covariance only sees linear association; variables can be tightly related in a nonlinear way yet have zero covariance.

Zero covariance, total dependence

Let points lie on the circle \(X^2 + Y^2 = 1\), symmetric about the axes. Then \(\Cov(X,Y) = 0\), yet \(X\) and \(Y\) are completely dependent: knowing \(X\) pins \(Y\) down to \(\pm\sqrt{1 - X^2}\) (Figure 3.4).

theta <- seq(0, 2 * pi, length.out = 200)
circ  <- data.frame(x = cos(theta), y = sin(theta))
ggplot(circ, aes(x, y)) +
  geom_hline(yintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_vline(xintercept = 0, color = ucla$gray, linewidth = 0.4) +
  geom_path(color = ucla$blue, linewidth = 1) +
  coord_equal() +
  labs(x = "X", y = "Y")
Figure 3.4: Points on a circle have zero covariance yet are completely dependent.
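A numerical sketch of the circle example, assuming the points are equally weighted and evenly spaced around the circle (one way to make the distribution symmetric about both axes). The number of points, 360, is arbitrary.

```r
n     <- 360
theta <- 2 * pi * (0:(n - 1)) / n   # evenly spaced angles, endpoint not repeated
x <- cos(theta)
y <- sin(theta)

# Population-style covariance with equal weight 1/n on each point.
cov_xy <- mean(x * y) - mean(x) * mean(y)

# Dependence: every point satisfies x^2 + y^2 = 1 exactly (up to rounding).
on_circle <- all(abs(x^2 + y^2 - 1) < 1e-12)

print(c(cov_xy = cov_xy))
print(on_circle)
```

The covariance is zero up to floating-point rounding, even though each point obeys an exact equation linking \(X\) and \(Y\).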

3.6 Mean & variance of linear combinations

We constantly build new variables as weighted sums of others: a portfolio, a sample average, a regression fit. Start with the mean: it is always linear.

Mean of a linear combination

\[ \E(aX + bY + c) \;=\; a\,\E(X) + b\,\E(Y) + c, \] whether or not \(X\) and \(Y\) are independent. This extends to any number of terms, \[ \E\!\left(\sum_i a_i X_i\right) = \sum_i a_i\,\E(X_i). \]

No assumptions are needed: expectation does not care about dependence.

Variance is a different story.

Variance of a linear combination

\[ \Var(aX + bY) = a^2\Var(X) + b^2\Var(Y) + 2ab\,\Cov(X,Y). \]

A covariance term appears, so variance is not linear. Two special cases are worth memorizing: \[ \Var(X + Y) = \Var(X) + \Var(Y) + 2\Cov(X,Y), \] \[ \Var(X - Y) = \Var(X) + \Var(Y) - 2\Cov(X,Y). \]

The headline

The variance of a sum is not the sum of the variances, unless the variables are uncorrelated.

The independent (or uncorrelated) case

When \(\Cov(X,Y) = 0\) (in particular, when \(X\) and \(Y\) are independent), the cross term vanishes and variance does add: \[ \Var(aX + bY) = a^2\Var(X) + b^2\Var(Y), \qquad \Var(X \pm Y) = \Var(X) + \Var(Y). \]
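The slips make a convenient test case because \(X\) and \(Y\) are correlated. The sketch below computes \(\Var(X + Y)\) directly from the joint table and compares it with the rule that includes the covariance term and with the naive sum of variances. The inputs \(\Var(X) = 1\), \(\Var(Y) = 0.24\) and \(\Cov(X,Y) = -0.2\) are the values worked out earlier in the chapter.

```r
joint <- matrix(c(0.0, 0.1, 0.2, 0.3,
                  0.1, 0.1, 0.1, 0.1), nrow = 2, byrow = TRUE)
xv <- 1:4
yv <- c(0, 1)

S          <- outer(yv, xv, "+")          # value of X + Y in each cell
E_S        <- sum(S * joint)
var_direct <- sum(S^2 * joint) - E_S^2    # Var(X + Y) from the joint pmf

var_X <- 1; var_Y <- 0.24; cov_XY <- -0.2 # from the chapter's worked examples
var_rule  <- var_X + var_Y + 2 * cov_XY   # rule with the covariance term
var_naive <- var_X + var_Y                # wrong here: ignores the covariance

print(c(var_direct = var_direct, var_rule = var_rule, var_naive = var_naive))
```

The direct calculation matches the rule at \(0.84\); dropping the covariance term would overstate the variance as \(1.24\).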

Looking ahead

The sample mean \(\bar X = \tfrac{1}{n}\sum_{i=1}^n X_i\) is a linear combination of independent draws, each with mean \(\mu\) and variance \(\sigma^2\). These rules give \[ \E(\bar X) = \mu, \qquad \Var(\bar X) = \frac{\sigma^2}{n}. \] The variance shrinks as \(n\) grows, which is why larger samples are more informative, and the seed of the Central Limit Theorem.
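One way to see where these two results come from, assuming the draws \(X_1, \dots, X_n\) are independent and each has mean \(\mu\) and variance \(\sigma^2\), is to apply the linear-combination rules with every weight equal to \(1/n\): \[ \begin{aligned} \E(\bar X) &= \frac{1}{n}\sum_{i=1}^n \E(X_i) = \frac{1}{n}\,(n\mu) = \mu,\\ \Var(\bar X) &= \frac{1}{n^2}\sum_{i=1}^n \Var(X_i) = \frac{1}{n^2}\,(n\sigma^2) = \frac{\sigma^2}{n}. \end{aligned} \] Independence is used only in the second line, where it makes every covariance term between different draws vanish; the first line needs no such assumption.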

3.7 Recap

For a single variable, the mean \(\E(X) = \sum_x x\,f_X(x)\) locates the center and the variance \(\Var(X) = \E(X^2) - \mu^2\) measures the spread. Expectation is linear, but in general \(\E[g(X)] \neq g(\E X)\); a linear rescaling obeys \(\Var(a + bX) = b^2\Var(X)\); and for an indicator, \(\E = p\) and \(\Var = p(1-p)\).

For two variables, we move from the joint pmf to a marginal (by summing out) to a conditional (by dividing), with independence characterized by \(f_{X,Y} = f_X f_Y\). Their linear association is captured by \(\Cov = \E(XY) - \mu_X\mu_Y\) and the unit-free \(\rho = \sigma_{XY}/(\sigma_X \sigma_Y)\). Independence implies \(\Cov = 0\), but not conversely. And the variance of a sum carries a covariance term: \(\Var(X + Y) = \Var X + \Var Y + 2\Cov(X,Y)\).

The thread to regression

\(\E(Y \given X)\) is the best predictor of \(Y\), and the regression slope will turn out to be \(\Cov(X,Y)/\Var(X)\). These two facts are the bridge from probability to the estimation that follows.

Next time: the Normal distribution, sampling, and the Central Limit Theorem.