1 What Is Econometrics?

Reading. SW 1.1–1.3, HGL 1.1–1.5

Welcome to ECON 103. The goal of this course is to learn the core tool of empirical economics, linear regression, and how to use it with real data in R. Over six weeks we build a single idea in three stages: a probability review (the language of uncertainty), simple regression (one explanatory variable), and finally multiple regression (many explanatory variables, plus causality and treatment effects). This first chapter paints the big picture: what econometrics is, what it is for, and the single distinction that runs through the entire course: correlation vs. causation.

1.1 Course mechanics

Before the economics, a few logistics. Lectures meet Tuesday and Thursday, 3:15–5:20 pm, and the paired lab meets Tuesday, 1:00–2:15 pm; both are on Zoom, with recordings posted whenever technically possible. You must enroll in both the lecture and its paired lab. Office hours are Wednesday and Friday, 11:30–1:00, in Bunche 2265. Email the instructor at rlongmuir@g.ucla.edu with “ECON 103” in the subject line, and post questions to Campuswire (join code 6251), where the whole class benefits from the answer.

The main text is Principles of Econometrics by Hill, Griffiths & Lim (5th ed.); we cover roughly the first eight chapters, and the slides are posted as well. Stock & Watson is a useful second reference and is mapped chapter-by-chapter in the lecture notes. All the computing is done in R and RStudio, both free; install them before Lab 1 (campus and virtual desktops have them too). No prior coding experience is required.

Prerequisites

The formal prerequisites are Econ 11 and Econ 41. We lean heavily on Econ 41: the next three chapters refresh exactly the probability and statistics we need and nothing more.

Your grade is a weighted average of homework, the midterm, and the final, and we take whichever weighting is better for you. There are about four homework assignments, submitted online and due Mondays at midnight; late work earns half credit absent an accommodation or prior arrangement. Every assignment includes some coding questions (R recommended). The exams are multiple choice, online, and closed-book / closed-internet.

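```r
# Grade weights under the two options (each column sums to 100%).
grading <- data.frame(
  Component = c("Homework", "Midterm", "Final"),
  Option1  = c("20%", "30%", "50%"),
  Option2  = c("20%", "40%", "40%")
)
# Render as a table: left-align the first column, centre the other two.
knitr::kable(grading, col.names = c("Component", "Option 1", "Option 2"),
             align = "lcc")
```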

Table 1.1: Grade weighting (we use whichever option is better for you).

| Component | Option 1 | Option 2 |
|:----------|:--------:|:--------:|
| Homework  | 20%      | 20%      |
| Midterm   | 30%      | 40%      |
| Final     | 50%      | 40%      |

The midterm is Thursday, August 20 and covers Lectures 1–10; the final is Thursday, September 10 and is cumulative.
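To see how the two weightings play out, here is a minimal sketch in R. The scores are made up for illustration, and taking the larger of the two weighted averages is our reading of "whichever weighting is better for you", not an official grading script.

```r
# Hypothetical scores on a 0-100 scale (made up for illustration).
scores <- c(Homework = 90, Midterm = 70, Final = 85)

# Weights from Table 1.1; each option sums to 1.
option1 <- c(Homework = 0.20, Midterm = 0.30, Final = 0.50)
option2 <- c(Homework = 0.20, Midterm = 0.40, Final = 0.40)

# Weighted average under each option.
grade1 <- sum(option1 * scores)
grade2 <- sum(option2 * scores)

# Assumption: "better for you" means the larger weighted average.
course_grade <- max(grade1, grade2)

c(option1 = grade1, option2 = grade2, used = course_grade)
```

For this hypothetical student the final is stronger than the midterm, so Option 1 (which weights the final more heavily) gives the higher grade.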

1.2 What is econometrics?

Ask a half-dozen econometricians what econometrics is and you get a half-dozen answers: the science of testing theories; a toolkit for forecasting; fitting models to data; making numerical policy recommendations. They are all right.

Working definition

Econometrics is the science and art of using economic theory and statistical methods to analyze economic data in order to estimate relationships, test hypotheses, and predict outcomes.

Econometrics bridges the gap between being “a student of economics” and being a practicing economist. Theory tells you that “a price increase lowers quantity demanded.” Econometrics answers the question theory cannot: by how much.

Economics runs on “how much?” questions

Theory gives you the direction of a relationship, but real decisions need a number. Consider:

- Cut class size by 5 students: how much do test scores rise?
- Raise cigarette prices 1%: how much does consumption fall?
- A Pizza Hut buys more newspaper ads: how much do sales rise?
- The Fed raises the discount rate: how much does inflation slow?

The recurring structure

Each question is about a parameter: an unknown number (an elasticity, a slope, a multiplier) describing how one variable relates to another. The parameter’s value is unknown and must be estimated from data.

Because the answer comes from data, it carries uncertainty: a different sample gives a different number. For that reason we never report a bare estimate; we always report an estimate and a measure of how precise it is. Quantifying that precision is much of what the second half of the course is about.
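The claim that "a different sample gives a different number" can be checked by simulation. The sketch below is illustrative only: the "true" intercept and slope, the sample size, and the error spread are all assumed values, not course data.

```r
set.seed(103)  # arbitrary seed so the sketch is reproducible

# Hypothetical "true" relationship, chosen only for illustration.
beta1 <- 2
beta2 <- 0.5

# Draw one sample of size n and return the estimated slope.
one_slope <- function(n = 50) {
  x <- runif(n, 0, 10)
  e <- rnorm(n, sd = 2)
  y <- beta1 + beta2 * x + e
  unname(coef(lm(y ~ x))[2])
}

# Repeat the sampling 1000 times: each sample yields a different estimate.
slopes <- replicate(1000, one_slope())

mean(slopes)  # centre of the estimates, close to beta2
sd(slopes)    # spread across samples: one way to measure precision
```

The spread of `slopes` is what a standard error tries to estimate when we only have one sample in hand.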

From an economic model to an econometric model

Economic theory writes relationships as exact functions. A demand equation, for instance, might say that quantity demanded depends on the good’s own price $P$, the price of a substitute $P^s$, the price of a complement $P^c$, and income: \[ Q^d = \beta_1 + \beta_2 P + \beta_3 P^s + \beta_4 P^c + \beta_5 \text{INC}. \] But real outcomes are not exact: countless small factors are omitted from any model. Econometrics confronts this by adding a random error $e$: \[ Q^d = \underbrace{\beta_1 + \beta_2 P + \beta_3 P^s + \beta_4 P^c + \beta_5 \text{INC}}_{\text{systematic part (from theory)}} \;+\; \underbrace{e}_{\text{random ``noise''}}. \] The $\beta$’s are the unknown parameters we want to learn, and $e$ collects everything we left out together with the intrinsic randomness of behavior.

This “systematic part $+$ error” template is the skeleton of every model in the course. We meet it formally when we set up simple regression.
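As a rough sketch of the template in action, we can build demand data from a systematic part plus a random error and then ask `lm()` to recover the $\beta$’s. Every number here (the parameter values, the price and income ranges, the error spread) is an assumption made up for the illustration.

```r
set.seed(1)
n <- 500

# Hypothetical regressors (units and ranges are made up).
P   <- runif(n, 1, 10)    # own price
Ps  <- runif(n, 1, 10)    # price of a substitute
Pc  <- runif(n, 1, 10)    # price of a complement
INC <- runif(n, 20, 100)  # income

# Hypothetical parameter values beta_1, ..., beta_5.
beta <- c(50, -3, 1.5, -1, 0.4)

# Systematic part (from theory) plus random error e.
systematic <- beta[1] + beta[2] * P + beta[3] * Ps + beta[4] * Pc + beta[5] * INC
e  <- rnorm(n, sd = 5)
Qd <- systematic + e

# Estimate the parameters from the simulated data.
fit <- lm(Qd ~ P + Ps + Pc + INC)
round(cbind(true = beta, estimate = coef(fit)), 2)
```

The estimates land near the assumed values but not exactly on them, because the error $e$ moves the data off the systematic part.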

1.3 Three goals: explain, predict, optimize

The same regression can serve different purposes, and the purpose changes what we must assume. It is worth being explicit about why we are building a model in the first place, because the answer determines how hard the job is.

1. Explain (causal)

Estimate the effect of $X$ on $Y$, holding all else equal: “How much do test scores rise because of smaller classes?”

2. Predict (forecast)

Use $X$ to make a good guess of $Y$. Causality is not required here: “What will GDP growth be next year?”

3. Optimize (decide)

Plug the estimates into a decision: “How much should the Fed raise rates? How much should we advertise?”

The goal sets the bar

For prediction, you only need a stable association: a variable that reliably moves with $Y$. For explanation and for sound decisions, you need the parameter to be causal, a much higher bar. Clearing that bar is most of the work of this course.

1.4 Correlation vs. causation

This is the central distinction of the course. Consider a regression of a student’s grade on the fraction of lectures they skip: \[ \text{GRADE} = \beta_1 + \beta_2\,\text{SKIP} + e . \] We expect $\beta_2 < 0$: more skipping goes with lower grades. But does skipping cause low grades?

Does skipping cause low grades?

Maybe not. A demanding job, or low motivation, could drive both skipping and poor grades. In that case SKIP and GRADE are correlated, which is useful for prediction, yet $\beta_2$ is not the causal effect of skipping. The slope also soaks up the influence of those omitted factors.

The lesson, repeated throughout the course: correlation is not causation. A good predictor need not be a cause. Umbrellas predict rain, but umbrellas don’t make it rain.
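A small simulation makes the point concrete. In this sketch we invent a world where the causal effect of skipping is known by construction, and where motivation (a hypothetical omitted factor) drives both skipping and grades. All numbers are assumptions chosen for illustration.

```r
set.seed(41)
n <- 5000

# Hypothetical omitted factor: motivation (standardised).
motivation <- rnorm(n)

# Less motivated students skip more; SKIP is a fraction in [0, 1].
skip <- pmin(pmax(0.3 - 0.15 * motivation + rnorm(n, sd = 0.1), 0), 1)

# By construction the causal effect of SKIP on GRADE is -10,
# and motivation also raises grades directly.
true_effect <- -10
grade <- 80 + true_effect * skip + 5 * motivation + rnorm(n, sd = 5)

# Regress GRADE on SKIP alone, as in the equation above.
slope <- unname(coef(lm(grade ~ skip))["skip"])

c(true_effect = true_effect, estimated_slope = slope)
```

The estimated slope is far more negative than the effect we built in: it is a fine summary of how grades move with skipping, but it is not the causal effect.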

What would pin down a causal effect?

The gold standard for measuring a causal effect is a randomized experiment.

The ideal: a randomized experiment

Split units into a treatment group and a control group by random assignment, then compare average outcomes between them.

Randomization makes the groups comparable in every other way, so the only systematic difference between them is the treatment. The causal effect is in fact defined as the difference such an experiment would reveal. Two classic illustrations: in agronomy, fertilize randomly chosen plots and compare yields; in education, Tennessee’s Project STAR randomly assigned students to small versus regular classes (we revisit it when we study treatment effects).
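The logic of random assignment can be sketched with simulated data. Here the treatment effect is set by construction, and `income` stands in for any background factor that also affects the outcome; the variable names and values are assumptions for the illustration, not Project STAR data.

```r
set.seed(7)
n <- 2000

# Hypothetical background factor that also affects the outcome.
income <- rnorm(n, mean = 50, sd = 10)

# Random assignment: a coin flip, unrelated to income by design.
treated <- rbinom(n, size = 1, prob = 0.5)

# By construction the treatment effect is 4 points.
score <- 60 + 4 * treated + 0.3 * income + rnorm(n, sd = 5)

# Balance check: average income is similar in the two groups.
balance <- tapply(income, treated, mean)
balance

# Difference in mean outcomes estimates the causal effect.
effect_hat <- mean(score[treated == 1]) - mean(score[treated == 0])
effect_hat
```

Because assignment ignores income, the two groups have nearly the same average income, and the simple difference in means comes out close to the built-in effect.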

Why we usually can't experiment

Real experiments are often unethical, infeasible, or too expensive. As a result, most economic data are observational: we watch the world rather than run it.

When the data are observational, the danger is confounding: a third variable that drives both the treatment and the outcome.

The confounding trap

Districts with small classes also tend to be wealthier. So a raw “small classes $\to$ higher scores” comparison mixes the class-size effect with the income effect. Untangling the two is precisely the job of multiple regression in Part III.
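To preview how multiple regression untangles the two effects, here is a sketch on simulated districts. The class-size and income effects are fixed by construction, and every number is an assumption made up for the example rather than an estimate from real districts.

```r
set.seed(420)
n <- 500  # simulated districts

# Hypothetical district income; wealthier districts get smaller classes.
income     <- rnorm(n, mean = 15, sd = 5)
class_size <- 25 - 0.4 * (income - 15) + rnorm(n, sd = 1.5)

# By construction: one more student lowers scores by 1 point,
# and income raises scores directly.
score <- 680 - 1 * class_size + 2 * income + rnorm(n, sd = 8)

short <- lm(score ~ class_size)           # omits income
long  <- lm(score ~ class_size + income)  # holds income fixed

c(short = unname(coef(short)["class_size"]),
  long  = unname(coef(long)["class_size"]))
```

The short regression overstates the harm from larger classes because it also picks up the income effect; adding income as a regressor brings the class-size coefficient back near the value we built in.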

1.5 A first look at data

Data come in three types

Before modeling anything, it helps to know what kind of data we have. Economic data fall into three structural types:

- Cross-section: many entities observed in one time period. For example, 420 California school districts in 1999, or a CPS wage survey. The number of entities is denoted $n$.
- Time series: one entity observed over many periods, such as U.S. GDP growth, quarterly, from 1960 to 2017. The number of periods is denoted $T$.
- Panel: many entities, each observed over many periods, such as 48 states across 11 years of cigarette sales, giving $n \times T$ observations.

Cutting across this classification, data are also experimental (generated by a designed experiment) or observational (surveys, administrative records), and, as noted above, most economic data are observational. We start with, and spend most of the course on, cross-sectional data.
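In R, the three structures differ mainly in what a row represents. The sketch below builds one toy data frame of each kind; the column names and all values are placeholders invented for illustration.

```r
# Cross-section: one row per entity, a single period (placeholder values).
cross_section <- data.frame(
  district = c("A", "B", "C"),
  year     = 1999,
  score    = c(650, 672, 641)
)

# Time series: one entity, one row per period (placeholder values).
time_series <- data.frame(
  quarter = paste0(rep(1960:1961, each = 4), "Q", 1:4),
  growth  = seq(0.5, 4, by = 0.5)
)

# Panel: one row per entity-period pair, so n * T rows in total.
panel <- expand.grid(state = paste0("state_", 1:48), period = 1:11)
panel$sales <- NA_real_  # outcome column left empty in this sketch

n_entities <- length(unique(panel$state))
n_periods  <- length(unique(panel$period))
c(n = n_entities, T = n_periods, rows = nrow(panel))
```

With 48 entities and 11 periods the panel has 528 rows, which is the $n \times T$ count from the list above.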

A first look: food expenditure and income

Our running example throughout the course (from Hill, Griffiths & Lim) is a cross-section of 40 households, recording each household’s weekly income (in hundreds of dollars) and weekly food expenditure (in dollars). The natural first look at the relationship between two variables is a scatterplot, shown in Figure 1.1.

Two things stand out. Higher income tends to go with higher food spending: the cloud of points slopes upward, and the correlation is about $0.62$. But the points plainly do not lie on a line: at any given income, spending varies a great deal. That vertical scatter is exactly the random error $e$ from our “systematic part $+$ error” template.

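```r
# Assumes the course setup has loaded the package that supplies `food`
# and defined the `ucla` colour list used below.
library(ggplot2)

data(food)
fit <- lm(food_exp ~ income, data = food)  # OLS fit previewed in the figure

ggplot(food, aes(income, food_exp)) +
  # one point per household
  geom_point(color = ucla$darkblue, fill = ucla$blue,
             shape = 21, size = 1.8) +
  # fitted OLS line, drawn from the estimated intercept and slope
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2],
              color = ucla$red, linewidth = 1) +
  scale_x_continuous(limits = c(0, 35), breaks = c(0, 10, 20, 30)) +
  scale_y_continuous(limits = c(0, 620), breaks = c(0, 200, 400, 600)) +
  labs(x = "weekly income ($100s)", y = "weekly food exp. ($)")
```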

Figure 1.1: Weekly food expenditure against weekly income for 40 households. The red line previews the OLS fit we build in Part II.

The course in one picture

The OLS fit through this cloud is food_exp $= 83.42 + 10.21 \times$ income: each extra $100 of weekly income is associated with about $10.21 of additional weekly food spending. Simple regression and OLS estimation make the idea of “draw the best line through this cloud” precise; that line is ordinary least squares.
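To read the fitted line as a formula, we can plug in an income level. This sketch uses only the rounded coefficients quoted above; the helper name `predict_food` and the income level of 20 are our own choices for illustration.

```r
# Rounded OLS estimates quoted in the text.
b1 <- 83.42   # intercept, in dollars
b2 <- 10.21   # slope, in dollars per $100 of weekly income

# Fitted weekly food expenditure at a given income (income in $100s).
predict_food <- function(income) b1 + b2 * income

at_20 <- predict_food(20)                      # household earning $2,000 a week
step  <- predict_food(21) - predict_food(20)   # one more $100 of income

c(fitted_at_20 = at_20, extra_per_100 = step)
```

With these rounded coefficients, the fitted value at an income of 20 is $83.42 + 10.21 \times 20 = 287.62$ dollars, and moving from 20 to 21 adds exactly the slope, 10.21 dollars.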

1.6 Recap

Econometrics uses economic theory together with statistics to turn data into numbers: estimating parameters, testing hypotheses, and predicting outcomes. We organize that work around three goals and a small vocabulary of data structures.

The three goals of modeling and the three structures of data.
|          | Three goals                              | Data types                           |
|----------|------------------------------------------|--------------------------------------|
| Explain  | causal effects, holding all else equal   | cross-section, time series, panel    |
| Predict  | a good forecast; causality not required  | experimental vs. observational       |
| Optimize | plug estimates into a decision / policy  | most economic data are observational |

The big ideas to carry forward: every model is a systematic part $+$ a random error $e$; correlation $\neq$ causation; and randomized experiments define causal effects, while with observational data we must work to approximate them.

Next time: the language of uncertainty (random variables and distributions).
