GV249 Seminar AT8: Regression

Lennard Metson

2024-11-26

✅ Lesson plan

Today

Next week, we will go through the formative coding section together.

📈 OLS regression

Recap: correlation

  • Because it is standardised, \(r\) ranges from -1 to 1.
  • What is this useful for?
  • What is the limitation of this?
  • Solution: measure the relationship in the variables' own units, using the slope of a line (OLS regression); the identity below makes the link explicit.
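
A standard identity (not on the original slide) connects the two: in bivariate OLS, the slope is just the correlation rescaled into the variables' units:

\[ \hat{\beta} = \frac{cov(X, Y)}{\sigma^{2}_{X}} = r \cdot \frac{\sigma_Y}{\sigma_X} \]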

📈 OLS: lines

Lines can be described using the following formula: \(y = \alpha + \beta x\)
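
A minimal R sketch, with made-up values, of how \(\alpha\) and \(\beta\) shape a line:

# alpha moves the line up and down; beta tilts it
x <- 0:10
plot(x, 2 + 0.5 * x, type = "l")  # alpha = 2, beta = 0.5
abline(a = 2, b = -0.5, lty = 2)  # same intercept, negative slope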

📈 OLS: lines

OLS regression finds the line that minimises the sum of squared errors:

\[ SSE = \sum^{n}_{i=1}{(y_i - \hat{y}_i)^2} \]

Where:

  • \(y_i\) is the actual value of the data point
  • \(\hat{y}_i\) is the value predicted by our line
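
A short sketch in R with simulated data (all numbers illustrative): lm() chooses the \(\alpha\) and \(\beta\) that minimise exactly this quantity.

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)  # a true line plus noise
fit <- lm(y ~ x)
sum(residuals(fit)^2)        # SSE: the sum of squared errors above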

📈 OLS: fitting lines

Screenshot from course textbook PDF version, p. 104 (= p. 78 in published book).

📈 OLS: interpreting results

  • \(\alpha\) is the “intercept”—the predicted value of \(y\) when \(x = 0\)
  • \(\beta\) is the regression coefficient—the predicted change in \(y\) for a one-unit change in \(x\)
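
Reading these off a fitted model in R (same simulated setup as before):

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
coef(fit)["(Intercept)"]  # alpha-hat: predicted y when x = 0
coef(fit)["x"]            # beta-hat: predicted change in y per one-unit change in x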

📈 OLS: multivariate regression

  • We can extend this to include more than one independent variable:
    • \(Y = \alpha + \beta_1 X_1 + \beta_2 X_2\)
  • To help visualise this intuitively, this interactive website shows a “surface” of best fit; a short R sketch follows.
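
A minimal sketch of the two-variable case in R (variable names and data are hypothetical):

set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(100)
coef(lm(y ~ x1 + x2, data = df))  # alpha-hat, beta_1-hat, beta_2-hat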

📈 OLS: non-linear relationships

  • This online tool is useful for showing how different values of \(\alpha\), \(\beta\), and functional forms change the shape of the line.
  • Let’s use it to show what polynomials do to our line!
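
One way to sketch this in R is to add a squared term (data simulated for illustration; the website demo is interactive, this is just the fitting side):

set.seed(1)
x <- runif(100, -2, 2)
y <- 1 + x - x^2 + rnorm(100, sd = 0.5)
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))  # fits y = a + b1*x + b2*x^2
coef(fit_poly)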

📈 OLS: common problems

  • Out-of-sample predictions
  • Overfitting

📈 OLS: common problems

Out-of-sample prediction:

Coefficients:
             Estimate Std. Error t value
(Intercept)  0.793346  0.0304079   26.09
age         -0.006956  0.0005246  -13.26
  • What proportion of 12-year-olds vote for Labour in the UK?
  • \(0.793 + 12 \times (-0.00696) = 0.709\), i.e. 70.9%?
  • But 12-year-olds cannot vote: the model is extrapolating far outside the data it was fitted on.
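
The same arithmetic in R, reusing the coefficients from the table above (the underlying model and data are not reproduced here):

alpha_hat <- 0.793346
beta_hat <- -0.006956
alpha_hat + beta_hat * 12  # about 0.71, a "prediction" for people who cannot vote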

📈 OLS: common problems

Overfitting: a model that chases every wiggle in the sample (e.g., a very high-degree polynomial) fits that sample well but predicts new data poorly.
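
A rough sketch of the problem with simulated data (exact numbers will vary, but the pattern is typical):

set.seed(1)
train <- data.frame(x = runif(20))
train$y <- train$x + rnorm(20, sd = 0.3)
test <- data.frame(x = runif(20))
test$y <- test$x + rnorm(20, sd = 0.3)
fit_line <- lm(y ~ x, data = train)             # simple model
fit_wiggly <- lm(y ~ poly(x, 10), data = train) # very flexible model
sum(residuals(fit_line)^2)    # in-sample SSE
sum(residuals(fit_wiggly)^2)  # smaller: the flexible model chases the noise
sum((test$y - predict(fit_line, newdata = test))^2)    # out-of-sample SSE
sum((test$y - predict(fit_wiggly, newdata = test))^2)  # typically much larger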

💻 Lab


Appendix

Recap: Variance

\[ \sigma^{2}_{X} = \frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N} \]

  • Take the deviations (\(X_i - \mu_X\))
  • Square these deviations to remove negatives
  • Take the mean of the squared deviations to get the variance
# In code: var()
var(df$variable1, na.rm = TRUE)
with(df, var(variable1, na.rm = TRUE))
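
One caveat when matching the formula to the code (a property of R, not of the slide): the formula divides by \(N\), while var() divides by \(N - 1\):

x <- c(1, 2, 3, 4)
mean((x - mean(x))^2)  # divides by N, matching the formula: 1.25
var(x)                 # divides by N - 1: ~1.67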

Recap: Covariance

\[ cov(X,Y) = \frac{\sum^{N}_{i=1}(X_i - \mu_X) \cdot (Y_i - \mu_Y)}{N} \]

  • For each unit \(i\), take the product of the two variables’ deviations, \((X_i - \mu_X)(Y_i - \mu_Y)\)
  • Take the mean of these products of deviations.
# In code: cov()
cov(df$variable1, df$variable2)
with(df, cov(variable1, variable2))
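
As with var(), R's cov() divides by \(N - 1\) rather than the \(N\) in the formula:

x <- c(1, 2, 3, 4)
y <- c(2, 4, 5, 8)
mean((x - mean(x)) * (y - mean(y)))  # divides by N, matching the formula
cov(x, y)                            # divides by N - 1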

Recap: Correlation coefficient (\(r\))

\[ corr(X,Y) = \frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y} \]

# In code: cor()
cor(df$variable1, df$variable2)
with(df, cor(variable1, variable2))
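
A quick check that cor() matches the formula (the \(N\) vs \(N - 1\) issue cancels in the ratio):

x <- c(1, 2, 3, 4)
y <- c(2, 4, 5, 8)
cov(x, y) / (sd(x) * sd(y))  # the formula, using R's sample versions
cor(x, y)                    # identical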

Recap: \(r^2\)

\[ r^2 = corr(X,Y)^2 = \left(\frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y}\right)^2 \]

  • We simply square the correlation coefficient.
  • We can think of \(r^2\) as the proportion of variation that is explained by the correlation (note that “explained” has no causal meaning here).
# In code: cor()^2
cor(df$variable1, df$variable2)^2
with(df, cor(variable1, variable2))^2
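
For a one-variable regression, the R-squared that lm() reports is exactly this squared correlation, which we can check with simulated data:

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
cor(x, y)^2                   # r squared
summary(lm(y ~ x))$r.squared  # identical for bivariate OLS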

Recap: Standard deviation

\[ \sigma_X = \sqrt{\sigma^{2}_{X}} = \sqrt{\frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N}} \]

  • The standard deviation is the square root of the variance.
    • Squaring the deviations puts the variance in the squared units of the variable, so taking the square root returns the measure to the variable’s original scale.
# In code: sd()
sd(df$variable1, na.rm = TRUE)
with(df, sd(variable1, na.rm = TRUE))
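
A one-line check that sd() is just the square root of var():

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sqrt(var(x))  # square root of the (sample) variance
sd(x)         # identical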