GV249 Seminar AT5: Causation

Lennard Metson

2024-11-12

Correlation \(\neq\) causation

📖 Key reading

Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chapters 2 & 9

Causation

Correlation \(\neq\) causation… but what does?
Counterfactuals!

Causation

Definition

A is caused by B when A would not have happened if B had not happened.

A causal effect is the comparison of the value of A in a world in which B happened versus a world in which B didn’t happen.
These two values of A are called potential outcomes.

Causation - Potential outcomes

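\(Y_{i}^{1}\): the outcome for individual \(i\) if they are treated.
\(Y_{i}^{0}\): the outcome for individual \(i\) if they are untreated.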

😟 The fundamental problem of causal inference

⚠️ The problem

We cannot observe both potential outcomes for an individual.

Once they are treated, we only observe their treated potential outcome.
And when they are untreated, we only observe their untreated potential outcome.

Causal inference is a field of statistics focused on getting around this problem.

😌 Solving the fundamental problem

Causal inference methods try to fill in missing potential outcomes using information from treated and untreated units.

😌 Solving the fundamental problem

The answer lies in comparing average potential outcomes across treated and untreated units.

\[ \text{mean}(Y_{i}^{1}-Y_{i}^{0}) = \text{mean}(Y_{i}^{1}) - \text{mean}(Y_{i}^{0}) \]

🧪 Treatment effects

🧪 ATE

The ATE is the mean of the individual-level treatment effect (\(\tau_i\)) for all units (\(\text{mean}(Y_i^1 - Y_i^0)\)).
This is mathematically equal to \(\text{mean}(Y_{i}^{1}) - \text{mean}(Y_{i}^{0})\), because the mean of a difference is the difference of the means.

🧪 ATT & ATU

ATT is the mean \(\tau_i\) for the subset of units who received treatment.
ATU is the mean \(\tau_i\) for the subset of units who did not receive treatment.
The ATT and ATU can differ from the ATE when there is biased (i.e., non-random) selection into who receives treatment.
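
As a sketch in notation, the three quantities can be written with a treatment indicator \(D_i\) (1 if unit \(i\) is treated, 0 if not; the same \(D\) that appears in the example later on):

\[ \text{ATE} = \text{mean}(\tau_i) \qquad \text{ATT} = \text{mean}(\tau_i \mid D_i = 1) \qquad \text{ATU} = \text{mean}(\tau_i \mid D_i = 0) \]

If a share \(p\) of units is treated (\(p\) is notation assumed here for illustration), the ATE is the weighted average of the other two:

\[ \text{ATE} = p \cdot \text{ATT} + (1 - p) \cdot \text{ATU} \]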

📺 Example: news and attitudes

📺 Set-up

Imagine political attitudes range from 0 to 5, with 0 being the most left-wing and 5 being the most right-wing.
We are interested in whether watching a left-wing news channel (IV) makes individuals more left-wing (DV). This is a causal research question!

📺 Set-up

We have some variables:

\(X\) = individual \(i\)’s political attitudes before watching left-leaning news (or not).
\(D\) (IV) = whether \(i\) watches left-leaning news.
\(Y\) (DV) = \(i\)’s political attitudes after watching left-leaning news (or not).

📺 The counter-factual world

In the counter-factual world, we can see both potential outcomes, which means we can calculate the individual causal effect for each individual.
On the next slide we can see the full schedule of potential outcomes for 10 people.

📺 The counter-factual world

\(i\)	\(X_i\)	\(Y_{i}^{0}\)	\(Y_{i}^{1}\)
\(i_1\)	5	5	4
\(i_2\)	4	3	3
\(i_3\)	3	4	5
\(i_4\)	4	4	3
\(i_5\)	5	5	5
\(i_6\)	2	3	3
\(i_7\)	3	2	3
\(i_8\)	2	1	2
\(i_9\)	1	1	1
\(i_{10}\)	1	1	2

📺 The counter-factual world

\(i\)	\(X_i\)	\(Y_{i}^{0}\)	\(Y_{i}^{1}\)	\(\tau_i\)
\(i_1\)	5	5	4	-1
\(i_2\)	4	3	3	0
\(i_3\)	3	4	5	1
\(i_4\)	4	4	3	-1
\(i_5\)	5	5	5	0
\(i_6\)	2	3	3	0
\(i_7\)	3	2	3	1
\(i_8\)	2	1	2	1
\(i_9\)	1	1	1	0
\(i_{10}\)	1	1	2	1

\(\text{ATE} = \text{mean}(\tau_i) = 0.2\)
\(\text{mean}(Y_{i}^{1}) = 3.1\)
\(\text{mean}(Y_{i}^{0}) = 2.9\)
→ \(\text{ATE} = 3.1 - 2.9 = 0.2\)
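
The arithmetic on this slide can be checked with a minimal Python sketch. The lists copy the \(Y_{i}^{0}\) and \(Y_{i}^{1}\) columns from the table; the variable names and the `mean` helper are illustrative choices, not part of the seminar materials.

```python
# Potential outcomes for the ten individuals in the table (i_1 ... i_10).
y0 = [5, 3, 4, 4, 5, 3, 2, 1, 1, 1]  # untreated potential outcomes
y1 = [4, 3, 5, 3, 5, 3, 3, 2, 1, 2]  # treated potential outcomes


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Individual treatment effects: tau_i = Y_i^1 - Y_i^0.
tau = [treated - untreated for treated, untreated in zip(y1, y0)]

# Two routes to the same ATE.
ate_from_tau = mean(tau)              # mean of the individual effects
ate_from_means = mean(y1) - mean(y0)  # difference of the two means

print(tau)
print(round(ate_from_tau, 2), round(ate_from_means, 2))
```

Both routes give the same number because the mean of a difference equals the difference of the means.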

📺 The real-world (1)

However, in the real world, we only see one potential outcome for each individual.
More right-wing people are less likely to choose to watch left-wing news.
This correlation between pre-treatment characteristics and selecting into treatment creates selection bias.
Let’s see what happens when we compare outcomes after letting individuals choose whether they watch left-wing news.

📺 The real-world (1)

\(i\)	\(X_i\)	\(D_i\)	\(Y_{i}^{0}\)	\(Y_{i}^{1}\)
\(i_1\)	5	0	5	?
\(i_2\)	4	0	3	?
\(i_3\)	3	0	4	?
\(i_4\)	4	1	?	3
\(i_5\)	5	0	5	?
\(i_6\)	2	1	?	3
\(i_7\)	3	1	?	3
\(i_8\)	2	0	1	?
\(i_9\)	1	1	?	1
\(i_{10}\)	1	1	?	2

In the real world, we only see one potential outcome. Which one we see is determined by the value of \(D_i\).
We refer to the observed outcome as \(Y_i\).
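
One common way to write this rule, given here as a sketch rather than as notation from the slides, is:

\[ Y_i = D_i \, Y_{i}^{1} + (1 - D_i) \, Y_{i}^{0} \]

For example, \(i_4\) has \(D_i = 1\), so \(Y_i = Y_{i}^{1} = 3\); \(i_1\) has \(D_i = 0\), so \(Y_i = Y_{i}^{0} = 5\).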

📺 The real-world (1)

\(i\)	\(X_i\)	\(D_i\)	\(Y_i\)
\(i_1\)	5	0	5
\(i_2\)	4	0	3
\(i_3\)	3	0	4
\(i_4\)	4	1	3
\(i_5\)	5	0	5
\(i_6\)	2	1	3
\(i_7\)	3	1	3
\(i_8\)	2	0	1
\(i_9\)	1	1	1
\(i_{10}\)	1	1	2

\(\text{mean}(Y_i \mid D_i = 1) = 2.4\)
\(\text{mean}(Y_i \mid D_i = 0) = 3.6\)
Estimated ATE = \(2.4 - 3.6 = -1.2\)

What’s the problem?
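
One way to see the problem is to split the naive comparison into two parts. The Python sketch below uses the full schedule of potential outcomes from the counter-factual slides, which we would never have in real data. It relies on a standard identity that the slides do not state: the naive difference in means equals the ATT plus the gap in untreated outcomes between those who chose treatment and those who did not. Variable names are illustrative.

```python
# Full potential outcomes (only knowable in the counter-factual world).
y0 = [5, 3, 4, 4, 5, 3, 2, 1, 1, 1]
y1 = [4, 3, 5, 3, 5, 3, 3, 2, 1, 2]
# Self-selected treatment from "The real-world (1)": 1 = watches left-wing news.
d = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


treated = [i for i, di in enumerate(d) if di == 1]
untreated = [i for i, di in enumerate(d) if di == 0]

# Observed outcome: Y^1 for the treated, Y^0 for the untreated.
y_obs = [y1[i] if d[i] == 1 else y0[i] for i in range(len(d))]

# Naive comparison of observed group means (the slide's "Estimated ATE").
naive = mean([y_obs[i] for i in treated]) - mean([y_obs[i] for i in untreated])

# ATT: mean individual effect among those who chose treatment.
att = mean([y1[i] - y0[i] for i in treated])

# Selection bias: how the two groups would differ even with no one treated.
selection_bias = mean([y0[i] for i in treated]) - mean([y0[i] for i in untreated])

print(round(naive, 2), round(att, 2), round(selection_bias, 2))
```

Those who chose to watch were already more left-wing, so their untreated outcomes are lower than the other group’s. That gap, not the treatment, drives the negative naive estimate.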

📺 The real-world (2)

Now imagine we can decide who watches left-wing news or not.
We can randomly select which individuals reveal their treated potential outcome; the rest reveal their untreated potential outcome.
Because we select them randomly, there is no selection bias: individuals have the same probability of watching left-wing news regardless of their underlying attitudes.

📺 The real-world (2)

\(i\)	\(X_i\)	\(D_i\)	\(Y_i\)
\(i_1\)	5	0	5
\(i_2\)	4	1	3
\(i_3\)	3	1	5
\(i_4\)	4	0	4
\(i_5\)	5	1	5
\(i_6\)	2	0	3
\(i_7\)	3	0	2
\(i_8\)	2	0	1
\(i_9\)	1	1	1
\(i_{10}\)	1	1	2

\(\text{mean}(Y_i \mid D_i = 1) = 3.2\)
\(\text{mean}(Y_i \mid D_i = 0) = 3\)
Estimated ATE = \(3.2 - 3 = 0.2\)
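
A single random draw will not always land exactly on the true ATE. The Python sketch below assumes exactly five of the ten individuals are treated (as in the table) and enumerates every possible assignment of that kind. It again uses the full potential outcomes, which are only available in the counter-factual world. Variable names are illustrative.

```python
from itertools import combinations

# Full potential outcomes (only knowable in the counter-factual world).
y0 = [5, 3, 4, 4, 5, 3, 2, 1, 1, 1]
y1 = [4, 3, 5, 3, 5, 3, 3, 2, 1, 2]
n = len(y0)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def diff_in_means(treated_ids):
    """Difference in observed group means for a given set of treated units."""
    treated_set = set(treated_ids)
    treated_mean = mean([y1[i] for i in treated_set])
    control_mean = mean([y0[i] for i in range(n) if i not in treated_set])
    return treated_mean - control_mean


# Every way of treating exactly 5 of the 10 individuals (assumed design).
estimates = [diff_in_means(ids) for ids in combinations(range(n), 5)]

# The assignment shown on the slide: i_2, i_3, i_5, i_9, i_10 are treated.
slide_estimate = diff_in_means([1, 2, 4, 8, 9])

print(len(estimates))
print(round(mean(estimates), 2), round(slide_estimate, 2))
print(round(min(estimates), 2), round(max(estimates), 2))
```

Individual assignments give estimates above or below the truth, but their average across all equally likely assignments equals the true ATE. This is the sense in which randomisation removes selection bias.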

📺 What does this tell us about causal inference?

Sometimes, we can use information about the average outcomes for treated and untreated units to estimate average treatment effects.
Where there is no selection bias, we can directly compare the mean outcomes of two groups.
When there is selection bias, we need to use alternative strategies to “fill in” missing potential outcomes.

📺 Broockman and Kalla (2022) did this for real!

On a sample of \(\approx\) 5,500 American voters, Broockman and Kalla (2022) randomised encouragement to watch CNN (rather than Fox News).
They find evidence that there is an effect, even when you randomise!

📊 Visualisation: extra resources

Leland Wilkinson - Grammar of Graphics
- This is what the package ggplot2 is based on - hence “gg”
- This blog has some detailed information about the grammar of graphics framework.
Claus Wilke - Fundamentals of Data Visualization.
- This chapter - on visualising uncertainty - links nicely to what we covered in AT4.
Edward Tufte - The Visual Display of Quantitative Information
- Leans more heavily into graphic design
- Tufte’s six principles: this blog post discusses them.

References

Broockman, David, and Joshua Kalla. 2022. “Consuming Cross-Cutting Media Causes Learning and Moderates Attitudes: A Field Experiment with Fox News Viewers.” Preprint. Open Science Framework. https://doi.org/10.31219/osf.io/jrw26.