GV249 Seminar AT4: Correlation

Lennard Metson

2024-10-29

✅ Plan for the class

📏 Descriptive Statistics

Note

Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chapter 2

  • The key reading has clear explanations of these statistics (pp. 24-30).
  • The list of definitions (p. 33) is really useful for revision.

Mean

\[ \mu_X = \frac{\sum^{N}_{i=1}{X_i}}{N} \]

mean()
# In code: 
mean(df$variable1, na.rm = TRUE) 
# Or
with(df, mean(variable1, na.rm = TRUE))
# Note the `na.rm = TRUE`! This tells R to ignore NA values.
# Otherwise, the mean of a variable that has an NA will be NA.
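
To see the formula at work, here is a minimal sketch with a made-up vector x: the sum of the values over the number of non-missing observations matches mean() with na.rm = TRUE.

# Made-up example: the mean formula by hand vs. mean()
x <- c(2, 4, 6, 8, NA)
sum(x, na.rm = TRUE) / sum(!is.na(x))  # sum of X_i over N: 20 / 4 = 5
mean(x, na.rm = TRUE)                  # same result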

Variance

\[ \sigma^{2}_{X} = \frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N} \]

  • Take the deviations (\(X_i - \mu_X\))
  • Square these deviations to remove negatives
  • Take the mean of the squared deviations to get the variance

var()
# In code:
var(df$variable1, na.rm = TRUE)
with(df, var(variable1, na.rm = TRUE))
# Note: var() divides by N - 1 (the sample variance), not by N as in the
# formula above; the difference is negligible when N is large.
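
A quick check on the formula with a made-up vector: because var() divides by N - 1, the population formula above gives a slightly smaller number.

# Made-up example: population vs. sample variance
x <- c(2, 4, 6, 8)
mean((x - mean(x))^2)                   # divide by N (formula above): 5
sum((x - mean(x))^2) / (length(x) - 1)  # divide by N - 1: ~6.67
var(x)                                  # var() matches the N - 1 version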

Standard deviation

\[ \sigma_X = \sqrt{\sigma^{2}_{X}} = \sqrt{\frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N}} \]

  • The standard deviation is the square root of the variance.
    • Squaring puts the variance in squared units of the variable (and makes it very sensitive to extreme deviations); taking the square root returns the measure to the same scale as the variable itself.

sd()
# In code:
sd(df$variable1, na.rm = TRUE)
with(df, sd(variable1, na.rm = TRUE))
# Like var(), sd() uses the N - 1 (sample) denominator.
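
A quick check with a made-up vector that the standard deviation is just the square root of the variance:

x <- c(2, 4, 6, 8)
sqrt(var(x))  # square root of the variance: ~2.58
sd(x)         # same result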

📏 Correlations

Correlation needs variance

Note

Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chapter 4

  • To see the correlation between two variables, you need to look at the whole distribution of each variable, not just its extremes.
  • Selecting on the dependent variable (i.e., studying only part of a variable’s distribution) can lead to misleading answers (see the sketch below). Check out this blog post for a clear explanation.
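
As a minimal simulated sketch of this problem (all names and numbers here are made up): if we keep only observations with high values of the outcome, the correlation we measure is far weaker than the one in the full data.

# Made-up simulation: selecting on the dependent variable
set.seed(1)
x <- rnorm(1000)
y <- x + rnorm(1000)   # true positive relationship between x and y
cor(x, y)              # correlation in the full sample (~0.7)
keep <- y > 1          # keep only high values of the outcome
cor(x[keep], y[keep])  # restricted range -> much weaker correlation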

Covariance

\[ cov(X,Y) = \frac{\sum^{N}_{i=1}(X_i - \mu_X) \cdot (Y_i - \mu_Y)}{N} \]

  • For each unit \(i\), multiply its deviation on \(X\) by its deviation on \(Y\) → the product of the deviations
  • Take the mean of these products of deviations.
cov()
# In code
cov(df$variable1, df$variable2, use = "complete.obs")
with(df, cov(variable1, variable2, use = "complete.obs"))
# cov() has no na.rm argument; use = "complete.obs" drops rows with NAs.
# Like var(), cov() divides by N - 1 rather than N.
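
The formula by hand, with made-up vectors and no NAs; the by-hand version divides by N - 1 so that it matches cov():

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # products of deviations, over N - 1
cov(x, y)                                             # same result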

Correlation coefficient (\(r\))

\[ corr(X,Y) = \frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y} \]

  • Divide the covariance of the two variables by the product of their standard deviations.
  • Dividing by the product of standard deviations standardizes the covariance → \(-1 \leq corr(X,Y) \leq 1\)
cor()
# In code
cor(df$variable1, df$variable2, use = "complete.obs")
with(df, cor(variable1, variable2, use = "complete.obs"))
# As with cov(), use = "complete.obs" handles NAs.
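
A quick check of the definition with the same made-up vectors: the N - 1 in the covariance and in the standard deviations cancels, so the by-hand version matches cor() exactly.

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
cov(x, y) / (sd(x) * sd(y))  # covariance over product of SDs
cor(x, y)                    # same result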

\(r^2\)

\[ r^2 = corr(X,Y)^2 = \left(\frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y}\right)^2 \]

  • We simply square the correlation coefficient.
  • We can think of \(r^2\) as the proportion of the variation in one variable that is explained by the other (note that “explained” has no causal meaning here).
cor()^2
# In code
cor(df$variable1, df$variable2, use = "complete.obs")^2
with(df, cor(variable1, variable2, use = "complete.obs"))^2
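
In a bivariate regression this is the same “R-squared” that lm() reports, which we can verify with the made-up vectors from above:

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
cor(x, y)^2                   # squared correlation coefficient
summary(lm(y ~ x))$r.squared  # R-squared from lm(): same result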

Slope of the OLS line of best fit

\[ \hat{\beta} = \frac{cov(X, Y)}{\sigma^{2}_{X}} \]

  • Divide the covariance of the two variables by the variance of the explanatory variable \(X\) (a quick check follows below).
  • We will return to OLS in a future lab, where we will cover:
    • Standard errors
    • \(t\)-values
    • \(p\)-values
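
A quick check of the slope formula with the made-up vectors from above: the N - 1 terms cancel, so cov() over var() matches the slope that lm() estimates.

x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
cov(x, y) / var(x)    # covariance over variance of X: 0.55
coef(lm(y ~ x))["x"]  # OLS slope: same result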

💻 Lab