---
title: "What Is Econometrics?"
---
{{< include _setup.qmd >}}
> **Reading.** SW §1.1–1.3, HGL §1.1–1.5
Welcome to ECON 103. The goal of this course is to learn the core tool of
empirical economics — **linear regression** — and how to use it with real data
in R. Over six weeks we build a single idea in three stages: a **probability
review** (the language of uncertainty), **simple regression** (one explanatory
variable), and finally **multiple regression** (many explanatory variables, plus
causality and treatment effects). This first chapter paints the big picture:
what econometrics *is*, what it is *for*, and the single distinction that runs
through the entire course — **correlation vs. causation**.
## Course mechanics {#sec-mechanics}
Before the economics, a few logistics. Lectures meet Tuesday and Thursday,
3:15–5:20 pm, and the paired lab meets Tuesday, 1:00–2:15 pm — both on Zoom, with
recordings posted whenever technically possible. You must enroll in **both** the
lecture and its paired lab. Office hours are Wednesday and Friday, 11:30 am–1:00 pm, in
Bunche 2265. Email the instructor at `rlongmuir@g.ucla.edu` with "ECON 103" in
the subject line, and post questions to **Campuswire** (join code `6251`), where
the whole class benefits from the answer.
The main text is *Principles of Econometrics* by Hill, Griffiths & Lim (5th ed.);
we cover roughly the first eight chapters, and the slides are posted as well.
Stock & Watson is a useful second reference and is mapped chapter-by-chapter in
the lecture notes. All the computing is done in **R** and **RStudio**, both free;
install them before Lab 1 (campus and virtual desktops have them too). No prior
coding experience is required.
::: {.example title="Prerequisites"}
The formal prerequisites are Econ 11 and Econ 41. We lean heavily on **Econ 41** —
the next three chapters refresh exactly the probability and statistics we need
and nothing more.
:::
Your grade is a weighted average of homework, the midterm, and the final, and we
take **whichever weighting is better for you**. There are about four homework
assignments, submitted online and due Mondays at midnight; late work earns half
credit absent an accommodation or prior arrangement. Every assignment includes
some **coding questions** (R recommended). The exams are multiple choice, online,
and closed-book / closed-internet.
```{r}
#| label: tbl-grading
#| tbl-cap: "Grade weighting: we use whichever option is better for you."
# Two candidate weightings; the one that yields the higher grade counts.
grading <- data.frame(
  Component = c("Homework", "Midterm", "Final"),
  Option1   = c("20%", "30%", "50%"),
  Option2   = c("20%", "40%", "40%")
)
knitr::kable(grading, col.names = c("Component", "Option 1", "Option 2"),
             align = "lcc")
```
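To see how the "better of two weightings" rule works, here is a minimal sketch. The function name `course_grade` is hypothetical, and it assumes each component is scored on a 0 to 100 scale; the weights are the ones in the table above.

```r
# Hypothetical helper: all three scores are assumed to be on a 0-100 scale.
course_grade <- function(homework, midterm, final) {
  option1 <- 0.20 * homework + 0.30 * midterm + 0.50 * final
  option2 <- 0.20 * homework + 0.40 * midterm + 0.40 * final
  max(option1, option2)  # keep whichever weighting gives the higher grade
}

# A stronger midterm than final means Option 2 applies: 82 rather than 81.
course_grade(homework = 90, midterm = 85, final = 75)
```

A student whose final is stronger than the midterm would instead be graded under Option 1.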
The **midterm** is Thursday, August 20 and covers Lectures 1–10; the **final** is
Thursday, September 10 and is cumulative.
## What is econometrics? {#sec-what}
Ask a half-dozen econometricians what econometrics is and you get a half-dozen
answers: the science of testing theories; a toolkit for forecasting; fitting
models to data; making numerical policy recommendations. They are all right.
::: {.keyidea title="Working definition"}
**Econometrics** is the science and art of using **economic theory** and
**statistical methods** to analyze **economic data** — to estimate relationships,
test hypotheses, and predict outcomes.
:::
Econometrics bridges the gap between being "a student of economics" and being a
*practicing* economist. Theory tells you that "a price increase lowers quantity
demanded." Econometrics answers the question theory cannot: **by how much.**
### Economics runs on "how much?" questions
Theory gives you the *direction* of a relationship, but real decisions need a
*number*. Consider:
- Cut class size by 5 students — how much do test scores rise?
- Raise cigarette prices 1% — how much does consumption fall?
- A Pizza Hut buys more newspaper ads — how much do sales rise?
- The Fed raises the discount rate — how much does inflation slow?
::: {.keyidea title="The recurring structure"}
Each question is about a **parameter** — an unknown number (an elasticity, a
slope, a multiplier) describing how one variable relates to another. The
parameter's value is **unknown** and must be **estimated from data**.
:::
Because the answer comes from data, it carries **uncertainty**: a different sample
gives a different number. For that reason we never report a bare estimate — we
always report an estimate *and* a measure of how precise it is. Quantifying that
precision is much of what the second half of the course is about.
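A small simulation makes "a different sample gives a different number" concrete. This is only a sketch: the relationship $y = 2 + 0.5x + e$, the sample size, and the seed are all assumptions chosen for illustration, not course data.

```r
set.seed(103)  # arbitrary seed so the sketch is reproducible

# Assumed "true" relationship, for illustration only: y = 2 + 0.5 x + e
draw_slope <- function(n = 40) {
  x <- runif(n, 0, 10)
  y <- 2 + 0.5 * x + rnorm(n, sd = 2)
  unname(coef(lm(y ~ x))[2])  # slope estimate from this one sample
}

# Repeat the whole "collect a sample, estimate the slope" exercise 1000 times
slopes <- replicate(1000, draw_slope())

mean(slopes)  # the estimates center near the assumed slope of 0.5
sd(slopes)    # their spread is the imprecision of any single estimate
```

In practice we see only one sample, so the spread has to be estimated from that sample alone; that is what a standard error does.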
### From an economic model to an econometric model
Economic theory writes relationships as exact functions. A demand equation, for
instance, might say that quantity demanded depends on the good's own price $P$,
the price of a substitute $P^s$, the price of a complement $P^c$, and income:
$$
Q^d = \beta_1 + \beta_2 P + \beta_3 P^s + \beta_4 P^c + \beta_5 \text{INC}.
$$
But real outcomes are *not* exact — countless small factors are omitted from any
model. Econometrics confronts this by adding a **random error** $e$:
$$
Q^d = \underbrace{\beta_1 + \beta_2 P + \beta_3 P^s + \beta_4 P^c + \beta_5 \text{INC}}_{\text{systematic part (from theory)}} \;+\; \underbrace{e}_{\text{random “noise”}}.
$$
The $\beta$'s are the **unknown parameters** we want to learn, and $e$ collects
everything we left out together with the intrinsic randomness of behavior.
::: {.callout-note appearance="simple"}
This "systematic part $+$ error" template is the skeleton of *every* model in the
course. We meet it formally when we set up [simple
regression](05-simple-regression.qmd).
:::
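To make the template concrete, the sketch below simulates data from the demand equation and then estimates the $\beta$'s from the simulated data. Every number in it (the parameter values, the price and income ranges, the error standard deviation) is a hypothetical choice made for illustration.

```r
set.seed(1)
n <- 200

# Hypothetical parameter values (beta_1, ..., beta_5), for illustration only
b <- c(50, -2, 1.5, -1, 0.3)

P   <- runif(n, 5, 15)    # own price
Ps  <- runif(n, 5, 15)    # price of a substitute
Pc  <- runif(n, 5, 15)    # price of a complement
INC <- runif(n, 20, 80)   # income

systematic <- b[1] + b[2] * P + b[3] * Ps + b[4] * Pc + b[5] * INC
e  <- rnorm(n, sd = 3)    # random error: everything the model leaves out
Qd <- systematic + e      # observed quantity = systematic part + error

# In real data we observe Qd and the regressors, never b or e.
demand_fit <- lm(Qd ~ P + Ps + Pc + INC)
round(cbind(true = b, estimate = coef(demand_fit)), 2)
```

The estimates land near the values used to generate the data but do not match them exactly, because the error $e$ blurs the systematic part.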
## Three goals: explain, predict, optimize {#sec-goals}
The same regression can serve different purposes — and the purpose changes what
we must assume. It is worth being explicit about why we are building a model in
the first place, because the answer determines how hard the job is.
::: {.keyidea title="1. Explain (causal)"}
Estimate the **effect** of $X$ on $Y$, *holding all else equal*: "How much do
test scores rise *because of* smaller classes?"
:::
::: {.keyidea title="2. Predict (forecast)"}
Use $X$ to make a good **guess** of $Y$. Causality is *not* required here: "What
will GDP growth be next year?"
:::
::: {.keyidea title="3. Optimize (decide)"}
Plug the estimates into a **decision**: "*How much* should the Fed raise rates?
How much should we advertise?"
:::
::: {.example title="The goal sets the bar"}
For prediction, you only need a stable association — a variable that reliably
moves with $Y$. For explanation and for sound decisions, you need the parameter
to be **causal**, a much higher bar. Clearing that bar is most of the work of
this course.
:::
## Correlation vs. causation {#sec-correlation}
This is the central distinction of the course. Consider a regression of a
student's *grade* on the fraction of lectures they *skip*:
$$
\text{GRADE} = \beta_1 + \beta_2\,\text{SKIP} + e .
$$
We expect $\beta_2 < 0$: more skipping goes with lower grades. But does skipping
*cause* low grades?
::: {.warningbox title="Does skipping cause low grades?"}
Maybe not. A demanding job, or low motivation, could drive **both** skipping and
poor grades. In that case SKIP and GRADE are **correlated** — which is useful for
*prediction* — yet $\beta_2$ is **not** the causal effect of skipping. The slope
also soaks up the influence of those omitted factors.
:::
The lesson, repeated throughout the course: **correlation is not causation.** A
good predictor need not be a cause. Umbrellas predict rain, but umbrellas don't
*make* it rain.
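The sketch below simulates an extreme version of this story. By assumption, skipping has *no* causal effect on grades; an unobserved `motivation` variable drives both. All variable names and numbers are hypothetical.

```r
set.seed(2)
n <- 500

# Assumption for illustration: skipping has NO causal effect on grades.
motivation <- rnorm(n)                                  # unobserved confounder
skip  <- plogis(-1 - motivation + rnorm(n, sd = 0.5))   # fraction skipped, in (0, 1)
grade <- 75 + 8 * motivation + rnorm(n, sd = 5)         # depends on motivation only

slope_naive <- unname(coef(lm(grade ~ skip))["skip"])
slope_naive        # clearly negative, even though the causal effect is zero
cor(skip, grade)   # yet skipping remains a useful predictor of grades
```

The regression is doing its job as a description of the data. It is only the causal reading of the slope that fails.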
### What *would* pin down a causal effect?
The gold standard for measuring a causal effect is a **randomized experiment**.
::: {.keyidea title="The ideal: a randomized experiment"}
Split units into a **treatment group** and a **control group** by **random
assignment**, then compare average outcomes between them.
:::
Randomization makes the groups comparable on average in every other respect, so the only
systematic difference between them is the treatment. The **causal effect** is in
fact *defined* as the difference such an experiment would reveal. Two classic
illustrations: in agronomy, fertilize randomly chosen plots and compare yields;
in education, Tennessee's **Project STAR** randomly assigned students to small
versus regular classes (we revisit it when we study [treatment
effects](20-treatment-effects.qmd)).
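Here is a minimal simulated experiment showing why random assignment works. It is not the Project STAR data: the sample size, the score scale, and the assumed true effect of 5 points are hypothetical.

```r
set.seed(3)
n <- 2000

# Hypothetical setup: the true effect of the treatment is +5 points.
ability <- rnorm(n, mean = 0, sd = 10)        # unobserved student characteristic
treated <- rbinom(n, size = 1, prob = 0.5)    # coin-flip assignment
score   <- 650 + 5 * treated + ability + rnorm(n, sd = 5)

# Randomization balances ability across the two groups (on average) ...
tapply(ability, treated, mean)

# ... so the difference in mean outcomes estimates the causal effect.
effect_hat <- mean(score[treated == 1]) - mean(score[treated == 0])
effect_hat
```

Because assignment ignores `ability`, there is no need to observe it: the simple comparison of group means is already close to the assumed effect.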
::: {.warningbox title="Why we usually can't experiment"}
Real experiments are often **unethical, infeasible, or too expensive**. As a
result most economic data are **observational** — we watch the world, we don't
run it.
:::
When the data are observational, the danger is **confounding**: a third variable
(a *confounder*) drives both the treatment and the outcome.
::: {.example title="The confounding trap"}
Districts with small classes also tend to be *wealthier*. So a raw "small classes
$\to$ higher scores" comparison mixes the class-size effect with the income
effect. Untangling the two is precisely the job of [multiple
regression](13-multiple-regression.qmd) in Part III.
:::
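As a preview of how that untangling works, the sketch below simulates 420 districts in which income lowers class size and raises scores. The data and every coefficient are hypothetical; the assumed causal effect of one more student per class is $-1$ point.

```r
set.seed(4)
n <- 420  # same count as the California sample, but these data are simulated

# Hypothetical data-generating process, for illustration only
income     <- rnorm(n, mean = 15, sd = 5)                    # district income
class_size <- 25 - 0.4 * (income - 15) + rnorm(n, sd = 2)    # richer -> smaller classes
score      <- 650 - 1 * class_size + 2 * income + rnorm(n, sd = 8)

# Raw comparison: the slope mixes the class-size effect with the income effect
short <- unname(coef(lm(score ~ class_size))["class_size"])

# Holding income fixed: the slope is close to the assumed effect of -1
long <- unname(coef(lm(score ~ class_size + income))["class_size"])

c(short = short, long = long)
```

This works here only because the confounder is observed and included; a confounder we cannot measure would leave the same kind of distortion in place.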
## A first look at data {#sec-data}
### Data come in three types
Before modeling anything, it helps to know what kind of data we have. Economic
data fall into three structural types:
- **Cross-section** — many *entities* observed in one time period. For example,
420 California school districts in 1999, or a CPS wage survey. The number of
entities is denoted $n$.
- **Time series** — one *entity* observed over many periods, such as U.S. GDP
growth, quarterly, from 1960 to 2017. The number of periods is denoted $T$.
- **Panel** — many entities, *each* observed over many periods: 48 states across
11 years of cigarette sales, giving $n \times T = 48 \times 11 = 528$ observations.
Cutting across this classification, data are either **experimental** (generated by a
designed experiment) or **observational** (surveys, administrative records), and,
as noted above, most economic data are observational. We start with, and spend
most of the course on, **cross-sectional** data.
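In R, the three structures differ mainly in what a row represents. The sketch below uses made-up values and hypothetical column names purely to show the shapes.

```r
# Cross-section: one row per entity, a single period (made-up values)
cross_section <- data.frame(
  district = c("A", "B", "C"),
  score    = c(655, 640, 671)
)

# Time series: one entity, one row per period (made-up values)
time_series <- data.frame(
  period = 1:4,
  growth = c(2.1, 1.8, -0.4, 0.9)
)

# Panel: one row per entity-period pair, so n * T rows in total
panel <- expand.grid(state = 1:48, year = 1:11)

nrow(cross_section)  # n = 3
nrow(time_series)    # T = 4
nrow(panel)          # 48 * 11 = 528
```

A panel therefore needs two identifiers (here `state` and `year`) to pick out a single observation, while the other two structures need only one.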
### A first look: food expenditure and income
Our running example throughout the course (from Hill, Griffiths & Lim) is a
cross-section of **40 households**, recording each household's weekly *income* (in
hundreds of dollars) and weekly *food expenditure* (in dollars). The natural
first look at the relationship between two variables is a **scatterplot**, shown
in @fig-food-scatter.
Two things stand out. Higher income **tends to** go with higher food spending —
the cloud of points slopes upward, and the correlation is about $0.62$. But the
points plainly do not lie on a line: at any given income, spending varies a great
deal. That vertical scatter is exactly the **random error** $e$ from our
"systematic part $+$ error" template.
```{r}
#| label: fig-food-scatter
#| fig-cap: "Weekly food expenditure against weekly income for 40 households. The red line previews the OLS fit we build in Part II."
#| fig-width: 5
#| fig-height: 3.4
# ggplot2, the `food` data, and the `ucla` palette are assumed to come from _setup.qmd.
data(food)
fit <- lm(food_exp ~ income, data = food)  # OLS fit previewed in the plot
ggplot(food, aes(income, food_exp)) +
  geom_point(color = ucla$darkblue, fill = ucla$blue,
             shape = 21, size = 1.8) +
  # Draw the fitted line from its estimated intercept and slope
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2],
              color = ucla$red, linewidth = 1) +
  scale_x_continuous(limits = c(0, 35), breaks = c(0, 10, 20, 30)) +
  scale_y_continuous(limits = c(0, 620), breaks = c(0, 200, 400, 600)) +
  labs(x = "weekly income ($100s)", y = "weekly food exp. ($)")
```
::: {.keyidea title="The course in one picture"}
The OLS fit through this cloud is `food_exp` $= 83.42 + 10.21 \times$ `income`:
each extra \$100 of weekly income is associated with about \$10.21 of additional
weekly food spending. [Simple regression](05-simple-regression.qmd) and [OLS
estimation](06-ols-estimation.qmd) make the idea of "draw the best line through
this cloud" precise — that line is **ordinary least squares**.
:::
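To read the fitted line as a prediction rule, here is a short sketch that plugs in the rounded coefficients reported above. The function name `predict_food_exp` is hypothetical; remember that `income` is measured in hundreds of dollars.

```r
# Coefficients as reported in the text (rounded)
b1 <- 83.42   # intercept, in dollars
b2 <- 10.21   # slope: dollars of food spending per extra 100 dollars of weekly income

predict_food_exp <- function(income) b1 + b2 * income

predict_food_exp(20)                          # 287.62: weekly income of 2,000 dollars
predict_food_exp(21) - predict_food_exp(20)   # 10.21: one more unit of income adds the slope
```

The second line is the slope interpretation in action: moving `income` up by one unit changes the prediction by exactly $10.21$, whatever the starting income.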
## Recap {#sec-recap}
**Econometrics** uses economic theory together with statistics to turn *data*
into *numbers*: estimating parameters, testing hypotheses, and predicting
outcomes. We organize that work around three goals and a small vocabulary of
data structures.
| **Goal**       | **In short**                             | **Data vocabulary**                         |
|----------------|------------------------------------------|---------------------------------------------|
| Explain | causal effects, holding all else equal | cross-section, time series, panel |
| Predict | a good forecast; causality not required | experimental vs. observational |
| Optimize | plug estimates into a decision / policy | most economic data are observational |
: The three goals of modeling, with the data vocabulary listed alongside (the last column is a separate list, not matched to the goals).
The big ideas to carry forward: every model is a **systematic part $+$ a random
error** $e$; **correlation $\neq$ causation**; and randomized experiments
*define* causal effects, while with observational data we must work to
approximate them.
**Next time:** the language of uncertainty — [random variables and
distributions](02-random-vars.qmd).