GV249 Seminar AT4: Correlation

Lennard Metson

2024-10-29

✅ Plan for the class

📏 Descriptive Statistics

Note

Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chapter 2

  • The key reading has clear explanations of these statistics (pp. 24-30).
  • The list of definitions (p. 33) is really useful for revision.

Mean

\[ \mu_X = \frac{\sum^{N}_{i=1}{X_i}}{N} \]

mean()
# In code: 
mean(df$variable1, na.rm = TRUE) 
# Or
with(df, mean(variable1, na.rm = TRUE))
# Note the `na.rm = TRUE`! This tells R to ignore NA values.
# Otherwise, the mean of a variable that has an NA will be NA.
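
To see the formula at work, here is a minimal sketch with a made-up vector x: the sum of the values over the number of non-missing observations matches mean() with na.rm = TRUE.

# Made-up example: the mean formula by hand vs. mean()
x <- c(2, 4, 6, 8, NA)
sum(x, na.rm = TRUE) / sum(!is.na(x))  # sum of X_i over N: 20 / 4 = 5
mean(x, na.rm = TRUE)                  # same result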

Variance

\[ \sigma^{2}_{X} = \frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N} \]

  • Take the deviations (\(X_i - \mu_X\))
  • Square these deviations to remove negatives
  • Take the mean of the squared deviations to get the variance

var()
# In code:
var(df$variable1, na.rm = TRUE)
with(df, var(variable1, na.rm = TRUE))
# Note: var() divides by N - 1 (the sample variance), not by N as in the
# formula above; the difference is negligible when N is large.
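
A quick check on the formula with a made-up vector: because var() divides by N - 1, the population formula above gives a slightly smaller number.

# Made-up example: population vs. sample variance
x <- c(2, 4, 6, 8)
mean((x - mean(x))^2)                   # divide by N (formula above): 5
sum((x - mean(x))^2) / (length(x) - 1)  # divide by N - 1: ~6.67
var(x)                                  # var() matches the N - 1 version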

Standard deviation

\[ \sigma_X = \sqrt{\sigma^{2}_{X}} = \sqrt{\frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N}} \]

  • The standard deviation is the square root of the variance.
    • Squaring puts the variance in squared units of the variable (and makes it very sensitive to extreme deviations); taking the square root returns the measure to the same scale as the variable itself.

sd()
# In code:
sd(df$variable1, na.rm = TRUE)
with(df, sd(variable1, na.rm = TRUE))
# Like var(), sd() uses the N - 1 (sample) denominator.
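
A quick check with a made-up vector that the standard deviation is just the square root of the variance:

x <- c(2, 4, 6, 8)
sqrt(var(x))  # square root of the variance: ~2.58
sd(x)         # same result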

📏 Correlations

Correlation needs variance

Note

Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chapter 4

  • To see the correlation between two variables, you need to look at the whole distribution of each variable, not just its extremes.
  • Selecting on the dependent variable (i.e., studying only part of a variable’s distribution) can lead to misleading answers (see the sketch below). Check out this blog post for a clear explanation.
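
As a minimal simulated sketch of this problem (all names and numbers here are made up): if we keep only observations with high values of the outcome, the correlation we measure is far weaker than the one in the full data.

# Made-up simulation: selecting on the dependent variable
set.seed(1)
x <- rnorm(1000)
y <- x + rnorm(1000)   # true positive relationship between x and y
cor(x, y)              # correlation in the full sample (~0.7)
keep <- y > 1          # keep only high values of the outcome
cor(x[keep], y[keep])  # restricted range -> much weaker correlation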

Covariance

\[ cov(X,Y) = \frac{\sum^{N}_{i=1}(X_i - \mu_X) \cdot (Y_i - \mu_Y)}{N} \]

  • For each unit \(i\), multiply its deviation on \(X\) by its deviation on \(Y\) → the product of the deviations
  • Take the mean of these products of deviations.
cov()
# In code
cov(df$variable1, df$variable2, use = "complete.obs")
with(df, cov(variable1, variable2, use = "complete.obs"))
# cov() has no na.rm argument; use = "complete.obs" drops rows with NAs.
# Like var(), cov() divides by N - 1 rather than N.
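
The formula by hand, with made-up vectors and no NAs; the by-hand version divides by N - 1 so that it matches cov():

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # products of deviations, over N - 1
cov(x, y)                                             # same result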

Correlation coefficient (\(r\))

\[ corr(X,Y) = \frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y} \]

  • Divide the covariance of the two variables by the product of their standard deviations.
  • Dividing by the product of standard deviations standardizes the covariance → \(-1 \leq corr(X,Y) \leq 1\)
cor()
# In code
cor(df$variable1, df$variable2, use = "complete.obs")
with(df, cor(variable1, variable2, use = "complete.obs"))
# As with cov(), use = "complete.obs" handles NAs.
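
A quick check of the definition with the same made-up vectors: the N - 1 in the covariance and in the standard deviations cancels, so the by-hand version matches cor() exactly.

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
cov(x, y) / (sd(x) * sd(y))  # covariance over product of SDs
cor(x, y)                    # same result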

\(r^2\)

\[ r^2 = corr(X,Y)^2 = \left(\frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y}\right)^2 \]

  • We simply square the correlation coefficient.
  • We can think of \(r^2\) as the proportion of the variation in one variable that is explained by the other (note that “explained” has no causal meaning here).
cor()^2
# In code
cor(df$variable1, df$variable2, use = "complete.obs")^2
with(df, cor(variable1, variable2, use = "complete.obs"))^2
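
In a bivariate regression this is the same “R-squared” that lm() reports, which we can verify with the made-up vectors from above:

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
cor(x, y)^2                   # squared correlation coefficient
summary(lm(y ~ x))$r.squared  # R-squared from lm(): same result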

Slope of the OLS line of best fit

\[ \hat{\beta} = \frac{cov(X, Y)}{\sigma^{2}_{X}} \]

  • Divide the covariance of the two variables by the variance of the explanatory variable \(X\) (a quick check follows below).
  • We will return to OLS in a future lab, where we will cover:
    • Standard errors
    • \(t\)-values
    • \(p\)-values
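
A quick check of the slope formula with the made-up vectors from above: the N - 1 terms cancel, so cov() over var() matches the slope that lm() estimates.

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
cov(x, y) / var(x)    # covariance over variance of X: 0.55
coef(lm(y ~ x))["x"]  # OLS slope: same result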

💻 Lab