GV249 Seminar AT8: Regression

Lennard Metson

2024-11-26

✅ Lesson plan

Today

Next week, we will go through the formative coding section together.

📈 OLS regression

Recap: correlation

  • Because it is standardised, \(r\) ranges from -1 to 1.
  • What is this useful for?
  • What is the limitation of this?
  • Solution: measure the relationship in the variables' own units, using the slope of a line (OLS regression); the identity below makes the link explicit.
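
A standard identity (not on the original slide) connects the two: in bivariate OLS, the slope is just the correlation rescaled into the variables' units:

\[ \hat{\beta} = \frac{cov(X, Y)}{\sigma^{2}_{X}} = r \cdot \frac{\sigma_Y}{\sigma_X} \]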

📈 OLS: lines

Lines can be described using the following formula: \(y = \alpha + \beta x\)
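
A minimal R sketch, with made-up values, of how \(\alpha\) and \(\beta\) shape a line:

# alpha moves the line up and down; beta tilts it
x <- 0:10
plot(x, 2 + 0.5 * x, type = "l")  # alpha = 2, beta = 0.5
abline(a = 2, b = -0.5, lty = 2)  # same intercept, negative slope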

📈 OLS: lines

OLS regression finds the line that minimises the sum of squared errors:

\[ SSE = \sum^{n}_{i=1}{(y_i - \hat{y}_i)^2} \]

Where:

  • \(y_i\) is the actual value of the data point
  • \(\hat{y}_i\) is the value predicted by our line
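
A short sketch in R with simulated data (all numbers illustrative): lm() chooses the \(\alpha\) and \(\beta\) that minimise exactly this quantity.

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)  # a true line plus noise
fit <- lm(y ~ x)
sum(residuals(fit)^2)        # SSE: the sum of squared errors above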

📈 OLS: fitting lines

Screenshot from course textbook PDF version, p. 104 (= p. 78 in published book).

📈 OLS: interpreting results

  • \(\alpha\) is the “intercept”—the predicted value of \(y\) when \(x = 0\)
  • \(\beta\) is the regression coefficient—the predicted change in \(y\) for a one-unit change in \(x\)
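
Reading these off a fitted model in R (same simulated setup as before):

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
coef(fit)["(Intercept)"]  # alpha-hat: predicted y when x = 0
coef(fit)["x"]            # beta-hat: predicted change in y per one-unit change in x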

📈 OLS: multivariate regression

  • We can extend this to include more than one independent variable:
    • \(Y = \alpha + \beta_1 X_1 + \beta_2 X_2\)
  • To help visualise this intuitively, this interactive website shows a “surface” of best fit; a short R sketch follows.
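
A minimal sketch of the two-variable case in R (variable names and data are hypothetical):

set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(100)
coef(lm(y ~ x1 + x2, data = df))  # alpha-hat, beta_1-hat, beta_2-hat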

📈 OLS: non-linear relationships

  • This online tool is useful for showing how different values of \(\alpha\), \(\beta\), and functional forms change the shape of the line.
  • Let’s use it to show what polynomials do to our line!
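
One way to sketch this in R is to add a squared term (data simulated for illustration; the website demo is interactive, this is just the fitting side):

set.seed(1)
x <- runif(100, -2, 2)
y <- 1 + x - x^2 + rnorm(100, sd = 0.5)
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))  # fits y = a + b1*x + b2*x^2
coef(fit_poly)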

📈 OLS: common problems

  • Out-of-sample predictions
  • Overfitting

📈 OLS: common problems

Out-of-sample prediction:

Coefficients:
             Estimate Std. Error t value
(Intercept)  0.793346  0.0304079   26.09
age         -0.006956  0.0005246  -13.26
  • What proportion of 12-year-olds vote for Labour in the UK?
  • \(0.793 + 12 \times (-0.00696) = 0.709\), i.e. 70.9%?
  • But 12-year-olds cannot vote: the model is extrapolating far outside the data it was fitted on.
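
The same arithmetic in R, reusing the coefficients from the table above (the underlying model and data are not reproduced here):

alpha_hat <- 0.793346
beta_hat <- -0.006956
alpha_hat + beta_hat * 12  # about 0.71, a "prediction" for people who cannot vote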

📈 OLS: common problems

Overfitting: a model that chases every wiggle in the sample (e.g., a very high-degree polynomial) fits that sample well but predicts new data poorly.
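
A rough sketch of the problem with simulated data (exact numbers will vary, but the pattern is typical):

set.seed(1)
train <- data.frame(x = runif(20))
train$y <- train$x + rnorm(20, sd = 0.3)
test <- data.frame(x = runif(20))
test$y <- test$x + rnorm(20, sd = 0.3)
fit_line <- lm(y ~ x, data = train)             # simple model
fit_wiggly <- lm(y ~ poly(x, 10), data = train) # very flexible model
sum(residuals(fit_line)^2)    # in-sample SSE
sum(residuals(fit_wiggly)^2)  # smaller: the flexible model chases the noise
sum((test$y - predict(fit_line, newdata = test))^2)    # out-of-sample SSE
sum((test$y - predict(fit_wiggly, newdata = test))^2)  # typically much larger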

💻 Lab


Appendix

Recap: Variance

\[ \sigma^{2}_{X} = \frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N} \]

  • Take the deviations (\(X_i - \mu_X\))
  • Square these deviations to remove negatives
  • Take the mean of the squared deviations to get the variance
# In code: var()
var(df$variable1, na.rm = TRUE)
with(df, var(variable1, na.rm = TRUE))
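
One caveat when matching the formula to the code (a property of R, not of the slide): the formula divides by \(N\), while var() divides by \(N - 1\):

x <- c(1, 2, 3, 4)
mean((x - mean(x))^2)  # divides by N, matching the formula: 1.25
var(x)                 # divides by N - 1: ~1.67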

Recap: Covariance

\[ cov(X,Y) = \frac{\sum^{N}_{i=1}(X_i - \mu_X) \cdot (Y_i - \mu_Y)}{N} \]

  • For each unit \(i\), take the product of the two variables’ deviations, \((X_i - \mu_X)(Y_i - \mu_Y)\)
  • Take the mean of these products of deviations.
# In code: cov()
cov(df$variable1, df$variable2)
with(df, cov(variable1, variable2))
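
As with var(), R's cov() divides by \(N - 1\) rather than the \(N\) in the formula:

x <- c(1, 2, 3, 4)
y <- c(2, 4, 5, 8)
mean((x - mean(x)) * (y - mean(y)))  # divides by N, matching the formula
cov(x, y)                            # divides by N - 1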

Recap: Correlation coefficient (\(r\))

\[ corr(X,Y) = \frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y} \]

# In code: cor()
cor(df$variable1, df$variable2)
with(df, cor(variable1, variable2))
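
A quick check that cor() matches the formula (the \(N\) vs \(N - 1\) issue cancels in the ratio):

x <- c(1, 2, 3, 4)
y <- c(2, 4, 5, 8)
cov(x, y) / (sd(x) * sd(y))  # the formula, using R's sample versions
cor(x, y)                    # identical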

Recap: \(r^2\)

\[ r^2 = corr(X,Y)^2 = \left(\frac{cov(X, Y)}{\sigma_X \cdot \sigma_Y}\right)^2 \]

  • We simply square the correlation coefficient.
  • We can think of \(r^2\) as the proportion of variation that is explained by the correlation (note that “explained” has no causal meaning here).
# In code: cor()^2
cor(df$variable1, df$variable2)^2
with(df, cor(variable1, variable2))^2
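
For a one-variable regression, the R-squared that lm() reports is exactly this squared correlation, which we can check with simulated data:

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
cor(x, y)^2                   # r squared
summary(lm(y ~ x))$r.squared  # identical for bivariate OLS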

Recap: Standard deviation

\[ \sigma_X = \sqrt{\sigma^{2}_{X}} = \sqrt{\frac{\sum^{N}_{i=1}{(X_i - \mu_{X})^2}}{N}} \]

  • The standard deviation is the square root of the variance.
    • Squaring the deviations puts the variance in the squared units of the variable, so taking the square root returns the measure to the variable’s original scale.
# In code: sd()
sd(df$variable1, na.rm = TRUE)
with(df, sd(variable1, na.rm = TRUE))
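
A one-line check that sd() is just the square root of var():

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sqrt(var(x))  # square root of the (sample) variance
sd(x)         # identical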