\( \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\given}{\,\vert\,} \newcommand{\indic}[1]{\mathbf{1}\!\left\{#1\right\}} \newcommand{\pmf}{\text{p.m.f.}} \newcommand{\pdf}{\text{p.d.f.}} \newcommand{\cdf}{\text{c.d.f.}} \)

16  Interaction Terms

Reading. SW 8.2–8.3, HGL 5.6

Every multiple-regression coefficient so far has been a constant partial effect. In Big Andy’s burger model, \(\beta_3\) was the effect of advertising: the same effect at every level of advertising, for every firm. But economics is full of effects that change. Advertising shows diminishing returns: the next dollar buys less than the last. The return to experience may depend on education. How income drives spending may depend on age. A model with only constant slopes cannot speak to any of these.

This chapter lets marginal effects vary, using two devices that are still ordinary OLS. Polynomials (\(x^2\)) make an effect depend on its own level; interactions (\(x_2 \times x_3\)) make an effect depend on another variable. Once marginal effects can vary, we can finally do genuine economic optimization: pushing a choice to the point where marginal benefit equals marginal cost. None of this requires any new estimator: it is all the multiple regression you already know, applied to cleverly constructed regressors.

This chapter sits in the multiple-regression sequence. It builds directly on multiple regression and the functional forms we met in simple regression; the next chapter gives us a way to test whether a whole block of these curvature and interaction terms is worth keeping.

16.1 Polynomials in multiple regression

A linear model forces a constant advertising effect \(\beta_3\), but the 10th $1,000 of ads surely does less than the 1st. We fix this by adding a squared term, so the slope is free to bend: \[ \text{SALES} = \beta_1 + \beta_2\,\text{PRICE} + \beta_3\,\text{ADVERT} + \beta_4\,\text{ADVERT}^2 + e . \] Now the marginal effect of advertising is no longer a single number; it is a function of how much you already advertise. Differentiating the conditional mean, \[ \frac{\partial\,\E(\text{SALES})}{\partial\,\text{ADVERT}} = \beta_3 + 2\beta_4\,\text{ADVERT}. \] For diminishing returns we expect \(\beta_3 > 0\) (advertising helps at first) and \(\beta_4 < 0\) (the help tapers off as advertising grows).

Crucially, this is a multiple regression: ADVERT and ADVERT\(^2\) are two distinct regressors. In simple regression we could only fit one of them at a time; here both enter the same equation.
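In R's formula language this distinction has a practical consequence. A minimal sketch on simulated data (the variables and parameter values below are made up for illustration and are not Big Andy's data): inside a formula, `^` is a formula operator rather than arithmetic, so the squared term has to be wrapped in `I()` to enter as its own regressor.

```r
# Simulated data for illustration only (not Big Andy's data).
set.seed(1)
x <- runif(100, 0, 3)
y <- 100 + 12 * x - 3 * x^2 + rnorm(100)

m_wrong <- lm(y ~ x + x^2)      # `^` is formula crossing here: x^2 collapses to x
m_right <- lm(y ~ x + I(x^2))   # I() protects the arithmetic, adding a second regressor

names(coef(m_wrong))   # "(Intercept)" "x"
names(coef(m_right))   # "(Intercept)" "x" "I(x^2)"
```

Without `I()`, the model silently falls back to the linear specification, so it is worth checking the coefficient names after fitting.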

Big Andy’s, with diminishing returns

Estimating the quadratic model on Big Andy’s data gives \[ \widehat{\text{SALES}} = 109.72 - 7.640\,\text{PRICE} + \underset{(3.556)}{12.151}\,\text{ADVERT} - \underset{(0.941)}{2.768}\,\text{ADVERT}^2 , \] with standard errors in parentheses below the advertising coefficients. Both signs come out as expected, and the ADVERT\(^2\) term is statistically significant (\(t = -2.768/0.941 \approx -2.94\)), so the curvature is real, not noise. The estimated marginal effect of advertising is \[ \widehat{\frac{\partial\text{SALES}}{\partial\text{ADVERT}}} = 12.151 - 5.536\,\text{ADVERT} . \] Evaluated at a low and a high level of advertising (ADVERT is measured in thousands of dollars), this is \(9.38\) at $500 and only \(1.08\) at $2,000. An extra $1,000 of ads is worth far less once you already advertise heavily, which is exactly the diminishing-returns story the linear model could not tell. Figure 16.1 shows the fitted sales curve flattening as advertising rises.

library(POE5Rdata)  # assumption: the andy data come from the POE5Rdata package
data(andy)
m_andy <- lm(sales ~ price + advert + I(advert^2), andy)
round(coef(m_andy), 3)
#> (Intercept)       price      advert I(advert^2) 
#>     109.719      -7.640      12.151      -2.768
b <- coef(m_andy)
price_bar <- mean(andy$price)
sales_hat <- function(a) b[1] + b[2]*price_bar + b[3]*a + b[4]*a^2
me <- function(a) b[3] + 2*b[4]*a            # marginal effect at advert = a

curve_df <- data.frame(a = seq(0.2, 3, length.out = 200))
curve_df$s <- sales_hat(curve_df$a)

tangent <- function(a0, lo, hi) {
  aa <- seq(lo, hi, length.out = 2)
  data.frame(a = aa, s = sales_hat(a0) + me(a0) * (aa - a0))
}
t_steep <- tangent(0.5, 0.2, 0.95)
t_flat  <- tangent(2.0, 1.55, 2.45)

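library(ggplot2)  # `ucla` is assumed to be a color palette (named list) defined in the book's setup code
ggplot(curve_df, aes(a, s)) +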
  geom_line(color = ucla$blue, linewidth = 1) +
  geom_line(data = t_steep, aes(a, s),
            linetype = "dashed", color = ucla$red) +
  geom_line(data = t_flat, aes(a, s),
            linetype = "dashed", color = ucla$red) +
  annotate("text", x = 1.05, y = sales_hat(0.5),
           label = "steep", color = ucla$red, size = 3.4, hjust = 0) +
  annotate("text", x = 2.0, y = sales_hat(2.0) + 1.3,
           label = "flat", color = ucla$red, size = 3.4) +
  scale_x_continuous(breaks = c(0.5, 2)) +
  labs(x = "ADVERT ($000)", y = "SALES")
Figure 16.1: Big Andy’s fitted sales as advertising rises (price held at its mean). The curve flattens, showing diminishing returns. Dashed tangent lines show the steep slope at low advertising and the flat slope at high advertising.

Polynomials are everywhere in economics

Cost and product curves are inherently curved, and polynomials capture that curvature while staying linear in the parameters, so OLS applies unchanged.

Cost curves as polynomials

A U-shaped average cost curve is a quadratic, \[ \text{AC} = \beta_1 + \beta_2 Q + \beta_3 Q^2 + e, \qquad \text{slope } \beta_2 + 2\beta_3 Q , \] where we expect \(\beta_2 < 0\) (cost falls at first) and \(\beta_3 > 0\) (cost eventually rises). An S-shaped total cost curve is a cubic, \[ \text{TC} = \alpha_1 + \alpha_2 Q + \alpha_3 Q^2 + \alpha_4 Q^3 + e, \qquad \text{marginal cost } = \alpha_2 + 2\alpha_3 Q + 3\alpha_4 Q^2 . \]
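As a sketch of how these formulas are used, the R snippet below evaluates the slope of the average cost curve, the output level where average cost bottoms out, and marginal cost from the cubic. All coefficient values are hypothetical, chosen only to have the expected signs; they are not estimates from any data set.

```r
# Hypothetical coefficients (illustration only, not estimates).
ac_b <- c(b1 = 50, b2 = -4, b3 = 0.1)                    # AC = b1 + b2*Q + b3*Q^2
tc_a <- c(a1 = 100, a2 = 30, a3 = -2, a4 = 0.05)         # TC = a1 + a2*Q + a3*Q^2 + a4*Q^3

ac_slope <- function(Q) ac_b["b2"] + 2 * ac_b["b3"] * Q  # d AC / dQ
mc <- function(Q) tc_a["a2"] + 2 * tc_a["a3"] * Q + 3 * tc_a["a4"] * Q^2  # d TC / dQ

# AC is minimized where its slope is zero: Q = -b2 / (2 * b3)
q_min_ac <- unname(-ac_b["b2"] / (2 * ac_b["b3"]))
q_min_ac                          # 20
unname(ac_slope(c(10, 20, 30)))   # -2  0  2
unname(mc(c(5, 20)))              # 13.75 10.00
```

The sign pattern \(\beta_2 < 0\), \(\beta_3 > 0\) is what puts the turning point at a positive output level.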

The interpretation habit

A polynomial coefficient is not a slope. Always report the marginal effect \(dy/dx\) evaluated at chosen values of \(x\) (say a low, median, and high value) or simply plot the curve. With curvature, “the effect of \(x\)” is a moving target, and quoting a single coefficient is meaningless.

One practical wrinkle: \(x\) and \(x^2\) can be highly correlated, which sometimes inflates their standard errors. This is the collinearity problem in disguise.
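A quick way to see the wrinkle, sketched on a made-up grid of \(x\) values rather than real data: when \(x\) is positive over its whole range, \(x\) and \(x^2\) move almost in lockstep. Centering \(x\) before squaring is one common remedy (mentioned here as an aside, not something this chapter relies on); it changes how the individual coefficients are read but not the fitted curve.

```r
# Made-up regressor values for illustration.
x <- seq(1, 10, by = 0.5)

cor_raw <- cor(x, x^2)            # close to 1: the two columns are nearly collinear

xc <- x - mean(x)                 # center at the sample mean
cor_centered <- cor(xc, xc^2)     # essentially 0 here because the grid is symmetric

round(c(raw = cor_raw, centered = cor_centered), 3)
```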

16.2 Interaction terms

A polynomial lets an effect depend on its own level. An interaction lets it depend on another variable. We build one by including the product of two regressors: \[ y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4\,(x_2 \times x_3) + e . \] Differentiating, each variable’s marginal effect now slides with the other: \[ \frac{\partial\,\E(y)}{\partial x_2} = \beta_2 + \beta_4\,x_3, \qquad \frac{\partial\,\E(y)}{\partial x_3} = \beta_3 + \beta_4\,x_2 . \]

What \(\beta_4\) means

\(\beta_4\) is the effect of raising both \(x_2\) and \(x_3\), above and beyond the sum of their separate effects. If \(\beta_4 = 0\), the two effects are additive and separable. If not, they either reinforce each other (\(\beta_4 > 0\)) or offset each other (\(\beta_4 < 0\)).
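A short check of this reading, using the interaction model above and unit increases chosen purely for convenience. Raising \(x_2\) alone by one unit changes \(\E(y)\) by \(\beta_2 + \beta_4 x_3\); raising \(x_3\) alone changes it by \(\beta_3 + \beta_4 x_2\). Raising both together changes it by \[ \beta_2 + \beta_3 + \beta_4\,\bigl[(x_2 + 1)(x_3 + 1) - x_2 x_3\bigr] = \beta_2 + \beta_3 + \beta_4\,(x_2 + x_3 + 1), \] which exceeds the sum of the two separate changes, \((\beta_2 + \beta_4 x_3) + (\beta_3 + \beta_4 x_2)\), by exactly \(\beta_4\).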

The motivating case: age × income → pizza

Does the effect of income on pizza spending depend on age? Write the model as \[ \text{PIZZA} = \beta_1 + \beta_2\,\text{AGE} + \beta_3\,\text{INCOME} + \beta_4\,(\text{AGE}\times\text{INCOME}) + e . \] The marginal effect of income is then \[ \frac{\partial\,\E(\text{PIZZA})}{\partial\,\text{INCOME}} = \beta_3 + \beta_4\,\text{AGE} . \] If \(\beta_4 < 0\), an extra dollar of income raises pizza spending less for older people: the income effect fades with age. Without the interaction term you would be forced to report a single income effect for everyone, hiding exactly the pattern of interest.

Never read a main effect in isolation

With the interaction present, \(\beta_3\) alone is the income effect only at \(\text{AGE} = 0\), rarely a meaningful quantity. Once you include an interaction, a so-called “main effect” coefficient is the effect when the other variable is zero, and should never be interpreted on its own.
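A minimal sketch of this point on simulated data (the variable names echo the pizza example, but every number below is invented for illustration): recentering AGE changes the reported INCOME coefficient, because it moves the point at which the income effect is evaluated, while the interaction coefficient and the fitted values stay the same. In R's formula syntax, `age * income` expands to both main effects plus their product.

```r
# Simulated data for illustration only; parameter values are made up.
set.seed(42)
n      <- 200
age    <- runif(n, 18, 70)
income <- runif(n, 20, 100)                       # in $000
pizza  <- 150 - 3 * age + 7 * income - 0.12 * age * income + rnorm(n, sd = 50)

m_raw <- lm(pizza ~ age * income)                 # age, income, and age:income
age_c <- age - mean(age)                          # AGE measured from its sample mean
m_ctr <- lm(pizza ~ age_c * income)

b <- coef(m_raw)
# "Main effect" of income: effect at AGE = 0 (raw) versus at mean AGE (centered)
c(at_age_0 = unname(b["income"]), at_mean_age = unname(coef(m_ctr)["income"]))

# The centered coefficient is the moving marginal effect evaluated at mean AGE
me_income <- function(a) unname(b["income"] + b["age:income"] * a)
me_income(mean(age))
```

The two fits are the same model written in different coordinates; only the evaluation point hidden inside the "main effect" has moved.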

A worked interaction: education and experience

Do education and experience reinforce each other in the labor market? Interact them in a wage equation: \[ \text{WAGE} = \beta_1 + \beta_2\,\text{EDUC} + \beta_3\,\text{EXPER} + \beta_4\,(\text{EDUC}\times\text{EXPER}) + e . \] OLS on the CPS data gives

data(cps5_small)
m_wage <- lm(wage ~ educ + exper + I(educ * exper), cps5_small)
round(coef(m_wage), 6)
#>     (Intercept)            educ           exper I(educ * exper) 
#>      -18.759265        2.655739        0.238374       -0.002747

so that \[ \widehat{\text{WAGE}} = -18.76 + 2.656\,\text{EDUC} + 0.2384\,\text{EXPER} - 0.002747\,(\text{EDUC}\times\text{EXPER}) . \] The return to an extra year of experience is \[ \frac{\partial\text{WAGE}}{\partial\text{EXPER}} = 0.2384 - 0.002747\,\text{EDUC}, \] which is about $0.22/hr at \(\text{EDUC} = 8\) and $0.19/hr at \(\text{EDUC} = 16\). The small (and here statistically insignificant) negative \(\beta_4\) hints that more schooling makes an extra year of experience slightly less valuable: a substitutes story rather than a reinforcing one. Figure 16.2 plots this declining return as a function of education.
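The same product term also makes the return to education depend on experience. As a sketch using only the rounded coefficients from the fitted wage equation (the EXPER values of 0 and 20 are arbitrary evaluation points, not values from the text):

```r
# Rounded estimates copied from the fitted wage equation.
b_educ  <- 2.656
b_exper <- 0.2384
b_inter <- -0.002747

me_exper <- function(educ)  b_exper + b_inter * educ    # d WAGE / d EXPER
me_educ  <- function(exper) b_educ  + b_inter * exper   # d WAGE / d EDUC

round(me_exper(c(8, 16)), 2)    # 0.22 0.19
round(me_educ(c(0, 20)), 3)     # 2.656 2.601
```

Both marginal effects share the single interaction coefficient, which is why a tiny \(\beta_4\) moves each of them only slightly over the plausible range of the other variable.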

bw <- coef(m_wage)
me_df <- data.frame(educ = seq(0, 21, length.out = 100))
me_df$me <- bw["exper"] + bw["I(educ * exper)"] * me_df$educ
pts <- data.frame(educ = c(8, 16))
pts$me <- bw["exper"] + bw["I(educ * exper)"] * pts$educ

ggplot(me_df, aes(educ, me)) +
  geom_line(color = ucla$blue, linewidth = 1) +
  geom_point(data = pts, aes(educ, me), color = ucla$darkblue, size = 2.4) +
  geom_segment(data = pts,
               aes(x = educ, xend = educ, y = 0, yend = me),
               linetype = "dashed", color = ucla$gray) +
  scale_x_continuous(breaks = c(0, 8, 16, 21)) +
  labs(x = "EDUC (years)",
       y = "return to a year of EXPER ($/hr)")
Figure 16.2: The estimated return to an extra year of experience falls as education rises. The negative interaction makes education and experience mild substitutes.

Binary interactions: a preview

Interactions are even more common when one of the variables is a 0/1 indicator. Interacting a dummy with a continuous \(x\) gives the two groups different slopes, while a dummy on its own merely shifts the intercept.

A dummy shifts the intercept; a dummy interacted with \(x\) also changes the slope.

Model | Effect
\(y = \beta_1 + \beta_2 x + \beta_3 D\) | different intercepts, same slope
\(y = \beta_1 + \beta_2 x + \beta_3 D + \beta_4 (x \times D)\) | different intercepts and slopes

For example, does the return to the student–teacher ratio differ in districts with many versus few English learners? You would interact STR with a high-EL dummy and read off two slopes.
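A minimal sketch of reading off two slopes, on simulated data (the names `stratio`, `hi_el`, and `score`, and all parameter values, are hypothetical stand-ins for the student–teacher ratio example, not real district data):

```r
# Simulated data for illustration only.
set.seed(7)
n       <- 300
stratio <- runif(n, 14, 26)               # student-teacher ratio
hi_el   <- rbinom(n, 1, 0.5)              # 1 = high share of English learners
score   <- 680 - 1.0 * stratio + 5 * hi_el - 1.5 * stratio * hi_el + rnorm(n, sd = 10)

m_el <- lm(score ~ stratio * hi_el)       # stratio, hi_el, and stratio:hi_el
b <- coef(m_el)

slope_low  <- unname(b["stratio"])                        # slope when hi_el = 0
slope_high <- unname(b["stratio"] + b["stratio:hi_el"])   # slope when hi_el = 1
c(low_EL = slope_low, high_EL = slope_high)
```

Because the dummy enters both on its own and in the product, these two slopes match what separate regressions on each group would give; the pooled form makes it easy to ask whether the slopes differ, through the coefficient on the product term.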

Indicator variables get their own chapter, dummy variables, which covers intercept shifts, slope dummies, reference groups, and the house-price UTOWN/POOL example. Today’s tools, the product term and the moving marginal effect, are exactly the machinery you will use there.

16.3 Economic optimization

A constant slope can never have an interior optimum; a straight line just keeps going. But a varying marginal effect can. The economic logic is the familiar one: push a choice until marginal benefit equals marginal cost.

Big Andy’s optimal advertising

From the quadratic ADVERT model, the marginal revenue of $1 more advertising is \(\beta_3 + 2\beta_4\,\text{ADVERT}\). The marginal cost of $1 of advertising is exactly $1. Setting them equal and solving for the optimal advertising level, \[ \beta_3 + 2\beta_4\,\text{ADVERT}_0 = 1 \quad\Longrightarrow\quad \text{ADVERT}_0 = \frac{1 - \beta_3}{2\beta_4} . \] Plugging in the estimates, \[ \widehat{\text{ADVERT}}_0 = \frac{1 - 12.151}{2(-2.768)} = 2.014 \;\Rightarrow\; \text{optimal} \approx \$2{,}014/\text{month}. \]

b3 <- coef(m_andy)["advert"]
b4 <- coef(m_andy)["I(advert^2)"]
advert0 <- (1 - b3) / (2 * b4)
round(advert0, 3)
#> advert 
#>  2.014

The optimum is a nonlinear function of the coefficients

Notice that \(\widehat{\text{ADVERT}}_0 = (1 - b_3)/(2 b_4)\) divides one estimator by another. That makes it a nonlinear function of the coefficients, so the tidy variance rule for linear combinations no longer applies exactly.

The delta method (in brief)

The delta method approximates the standard error of a smooth function \(g(b_3, b_4)\) using its derivatives and the estimated variance–covariance matrix of the coefficients. The approximation is valid in large samples, and software computes it automatically. For Big Andy’s, \(\mathrm{se}(\widehat{\text{ADVERT}}_0) = 0.129\), giving an approximate 95% interval \[ 2.014 \pm t_c(0.129) = [\,1.757,\ 2.271\,] \;\Rightarrow\; \$1{,}757 \text{ to } \$2{,}271 . \]
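For readers who want to see the mechanics, here is a sketch of the calculation. With \(g(b_3, b_4) = (1 - b_3)/(2 b_4)\), the derivatives are \(\partial g/\partial b_3 = -1/(2 b_4)\) and \(\partial g/\partial b_4 = -(1 - b_3)/(2 b_4^2)\), and the approximate variance is the quadratic form of this gradient with the coefficient covariance matrix. The variance and covariance entries and the 71 degrees of freedom below are assumptions for illustration, since this chapter does not report them; with the fitted model in hand, take them from `vcov(m_andy)` and `df.residual(m_andy)` instead.

```r
# Reported point estimates for ADVERT and ADVERT^2
b3 <- 12.151
b4 <- -2.768

# ASSUMED covariance matrix of (b3, b4); these entries are not reported in the text.
# In practice: V <- vcov(m_andy)[c("advert", "I(advert^2)"), c("advert", "I(advert^2)")]
V <- matrix(c(12.646,  -3.2887,
              -3.2887,  0.88477), nrow = 2, byrow = TRUE)

g    <- (1 - b3) / (2 * b4)                 # estimated optimum
grad <- c(-1 / (2 * b4),                    # dg/db3
          -(1 - b3) / (2 * b4^2))           # dg/db4

se_g <- sqrt(drop(t(grad) %*% V %*% grad))  # delta-method standard error

df  <- 71                                   # ASSUMED residual degrees of freedom
t_c <- qt(0.975, df)
ci  <- g + c(-1, 1) * t_c * se_g            # approximate 95% interval

round(c(optimum = g, se = se_g), 3)
round(ci, 2)
```

The negative covariance between the two coefficients does most of the work: it is what keeps the standard error of the ratio small even though each coefficient is estimated with noticeable error.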

The same idea answers the question “how many years of experience maximize wages?” Set \(\partial\text{WAGE}/\partial\text{EXPER} = 0\) in a quadratic and solve. In both cases the regression is doing genuine economic optimization, and the delta method attaches an honest margin of error to the answer.

16.4 Recap

This chapter let marginal effects vary while staying inside ordinary OLS.

  • Polynomials. Adding \(x^2\) makes the marginal effect \(\beta_3 + 2\beta_4 x\), which varies with \(x\)’s own level. For Big Andy’s, the advertising effect fell from \(9.38\) to \(1.08\), a clear case of diminishing returns. Cost and product curves are natural polynomials; always report the slope at chosen values of \(x\).
  • Interactions. Adding \(x_2 \times x_3\) makes \(\partial y / \partial x_2 = \beta_2 + \beta_4 x_3\), so one variable’s effect slides with another (age × income → pizza, educ × exper → wage). Never read a “main effect” in isolation.
  • Optimization. Setting marginal benefit equal to marginal cost, \(\beta_3 + 2\beta_4\,\text{ADVERT}_0 = 1\), gives Big Andy’s optimum of $2,014 with a 95% interval of \([\$1{,}757,\ \$2{,}271]\). Because the optimum is a nonlinear function of the coefficients, its standard error comes from the delta method.
  • Binary interactions. A dummy interacted with a continuous variable gives groups different slopes; this is the topic of dummy variables.

Next time: is the whole curvature-or-interaction block worth keeping? Testing several coefficients at once (\(\beta_4 = 0\) and \(\beta_5 = 0\)) needs the \(F\)-test: restricted versus unrestricted models, overall significance, and economic restrictions like constant returns to scale.