GV249 Seminar WT0: Formative 2 Revision

Lennard Metson

2025-01-21

✅ Lesson plan

If you feel confident about the Formative and would like more R practice, you can carry on with the extra lab materials:

  1. Download cses5_br.csv and extra-questions.R from the Len Seminar Slides folder.
  2. Work through the questions. Let me know if you need help; you can also consult extra-questions_solutions.R.

Formative 2

Part 1, Question 1

  1. At a conceptual level, what is the difference between validity and reliability in measurement? (3 marks)

  2. Give an example of a measure that is reliable, but not valid. (2 marks)

Look at:

  • 📑 Week: AT7 - Measurement
  • 📖 Reading: Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chs 15-16.
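The distinction in this question can be made concrete with a small simulation. The sketch below is only an illustration: the true value of 70, the two instruments, and their error sizes are all made-up assumptions, not anything from the course data.

```r
# Hypothetical illustration: two instruments measuring a true value of 70.
set.seed(249)
true_value <- 70

# Instrument A: almost no random noise, but always reads about 5 units too high.
# Repeated readings agree with each other (reliable) but miss the target (not valid).
instrument_a <- true_value + 5 + rnorm(1000, mean = 0, sd = 0.1)

# Instrument B: centred on the true value, but individual readings are very noisy.
instrument_b <- true_value + rnorm(1000, mean = 0, sd = 5)

# Average distance from the truth (a rough check on validity)
bias_a <- mean(instrument_a) - true_value
bias_b <- mean(instrument_b) - true_value

# Spread across repeated readings (a rough check on reliability)
spread_a <- sd(instrument_a)
spread_b <- sd(instrument_b)

round(c(bias_a = bias_a, bias_b = bias_b,
        spread_a = spread_a, spread_b = spread_b), 2)
```

Instrument A has a small spread but a large bias, which is the pattern a "reliable, but not valid" measure would show.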

Part 1, Question 1

  1. Imagine you intend to measure a party’s ideological position on the libertarian-authoritarian dimension and you rely on an expert survey. How would you assess the validity and reliability of your measure of libertarian-authoritarian ideology? (5 marks)

Look at:

  • 📑 Week: AT7 - Measurement
  • 📖 Reading: Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chs 15-16.

Part 1, Question 1

  1. Why, as researchers, do we worry about measurement error? Should we worry more about systematic or random error? (3 marks)

Look at:

  • 📑 Week: AT7 - Measurement
  • 📖 Reading: Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chs 15-16.

Part 1, Question 2

  1. What are potential outcomes? (2 marks)

  2. Explain the following notations: \(Y_i(1)\), \(Y_i(1)|D_i = 0\), and \(Y_i(0)|D_i = 1\). (3 marks)

  3. What is the fundamental problem of causal inference? Use potential outcomes notation. (4 marks)

Look at:

  • 📑 Week: AT3 - Data
  • 📖 Reading: Gerring (2012) Mere Description
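The potential outcomes notation in this question can be illustrated with a toy data set. Everything here is invented for illustration: eight hypothetical units, a constant treatment effect of 2, and random treatment assignment. The variable names `y0`, `y1`, and `d` stand in for \(Y_i(0)\), \(Y_i(1)\), and \(D_i\).

```r
# Hypothetical example: both potential outcomes exist, but only one is observed.
set.seed(249)
n  <- 8
y0 <- round(rnorm(n, mean = 10), 1)  # Y_i(0): outcome if untreated
y1 <- y0 + 2                         # Y_i(1): outcome if treated (assumed effect of 2)
d  <- rbinom(n, 1, 0.5)              # D_i: treatment actually received

# What the researcher can actually see: the other potential outcome is missing
observed <- data.frame(
  unit  = 1:n,
  d     = d,
  y1    = ifelse(d == 1, y1, NA),    # Y_i(1) is seen only when D_i = 1
  y0    = ifelse(d == 0, y0, NA),    # Y_i(0) is seen only when D_i = 0
  y_obs = ifelse(d == 1, y1, y0)
)
observed
```

Every row has exactly one `NA` among `y1` and `y0`, so the unit-level difference `y1 - y0` can never be computed from observed data.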

Part 1, Question 2

  1. Agree or disagree: “There has been too much focus on causal identification in Political Science, often at the expense of good measurement.” Make sure to define the key concepts in your answer. (6 marks)

Look at:

  • 📑 Week: AT5 - Causation

Part 1, Question 3

  1. What is the difference between estimand and estimate? (3 marks)

  2. How can any one estimate be too low or too high if the estimator used to obtain it is unbiased? (3 marks)

Look at:

  • 📑 Week: AT4 - Correlation & AT8 Regression
  • 📖 Reading: Bueno de Mesquita & Fowler - Thinking Clearly with Data - Chapters from those two weeks
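The second question here is easier to see with repeated sampling. In this sketch the population, its mean of 50, and the sample size of 100 are all made-up assumptions; the sample mean is used as an example of an unbiased estimator.

```r
# Hypothetical simulation: one estimand, many estimates.
set.seed(249)
population <- rnorm(100000, mean = 50, sd = 10)
estimand   <- mean(population)   # the fixed quantity we want to learn

# Draw 2000 samples of 100 and compute the sample mean (the estimate) each time
estimates <- replicate(2000, mean(sample(population, size = 100)))

c(estimand          = estimand,
  mean_of_estimates = mean(estimates),            # close to the estimand
  share_too_low     = mean(estimates < estimand), # roughly half fall below it
  sd_of_estimates   = sd(estimates))              # individual estimates still vary
```

The estimates average out to the estimand, yet any single one can land above or below it because of sampling variation.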

Part 2

  1. Formulate a descriptive research question covering any subfield of Political Science, which includes two variables, \(x\) (the predictor) and \(y\) (the dependent variable). (10 marks)

  2. Define the theoretical concepts that these variables attempt to measure, operationalise these concepts, and propose valid measurement instruments for them. (10 marks)

  3. Formulate a clear, testable hypothesis. (5 marks)

  4. Write down a linear regression model of \(y\) on \(x\), and define each term in the regression model. (8 marks)
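For item 4, a generic bivariate linear regression model can be written as below. The symbols \(\alpha\), \(\beta\), and \(\varepsilon_i\) are one common choice and are an assumption here; the course materials may use different letters (for example \(\beta_0\) and \(\beta_1\)).

```latex
\[
y_i = \alpha + \beta x_i + \varepsilon_i
\]
```

Here \(y_i\) is the dependent variable for unit \(i\), \(x_i\) is the predictor, \(\alpha\) is the intercept (the expected value of \(y\) when \(x = 0\)), \(\beta\) is the slope (the expected change in \(y\) for a one-unit increase in \(x\)), and \(\varepsilon_i\) is the error term.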

Part 3

#install.packages(c("estimatr","readstata13", "tidyverse"))

library(tidyverse)
library(readstata13)
library(estimatr)

# setwd("WT/WT0_formative-revision")

df <- read.dta13("BES_2017_problemset2.dta")

Part 3, Exercise 1

  1. Regress self-reported turnout on levels of education (edu_num). Remember, your dependent variable needs to be numeric before you can run your regression. Assume for now that levels of education is a continuous variable. Display and interpret the substantive results of your model. (3 marks)
table(df$turnout_self_reported, useNA = "ifany")

Did not vote        Voted         <NA> 
         458         1732            4 
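# Recode the factor to a numeric 0/1 variable; missing responses stay NA
df$turnout_sr_num <- NA
df$turnout_sr_num[df$turnout_self_reported == "Voted"] <- 1
df$turnout_sr_num[df$turnout_self_reported == "Did not vote"] <- 0

# Check that the counts match the original variable
table(df$turnout_sr_num, useNA = "ifany")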

   0    1 <NA> 
 458 1732    4 
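A more compact way to do the same recode is to convert a logical comparison to numeric. The sketch below uses a small made-up factor rather than the BES data, so `turnout` and its values are hypothetical stand-ins for `df$turnout_self_reported`.

```r
# Toy factor standing in for df$turnout_self_reported (hypothetical values)
turnout <- factor(c("Voted", "Did not vote", NA, "Voted"))

# The comparison returns TRUE / FALSE / NA; as.numeric() maps these to 1 / 0 / NA
turnout_num <- as.numeric(turnout == "Voted")

table(turnout_num, useNA = "ifany")
```

Note that this shortcut codes every non-missing value other than "Voted" as 0, so it is only safe when the variable has exactly two substantive categories, as the table above suggests it does here.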

Part 3, Exercise 1

  1. Regress self-reported turnout on levels of education (edu_num). Remember, your dependent variable needs to be numeric before you can run your regression. Assume for now that levels of education is a continuous variable. Display and interpret the substantive results of your model. (3 marks)
summary(lm_robust(turnout_sr_num ~ edu_num, data = df))

Call:
lm_robust(formula = turnout_sr_num ~ edu_num, data = df)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper   DF
(Intercept)  0.66348   0.021934  30.248 9.184e-168  0.62046  0.70649 2150
edu_num      0.03519   0.005099   6.901  6.767e-12  0.02519  0.04518 2150

Multiple R-squared:  0.02176 ,  Adjusted R-squared:  0.02131 
F-statistic: 47.62 on 1 and 2150 DF,  p-value: 6.767e-12

Part 3, Exercise 1

  1. Write down the regression model that you are estimating. (2 marks)

  2. What is the predicted turnout for someone with “GCSEs”/level 3? (2 marks)

  3. What is the null hypothesis? Can you reject it at the 0.05 level? (3 marks)


Call:
lm_robust(formula = turnout_sr_num ~ edu_num, data = df)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper   DF
(Intercept)  0.66348   0.021934  30.248 9.184e-168  0.62046  0.70649 2150
edu_num      0.03519   0.005099   6.901  6.767e-12  0.02519  0.04518 2150

Multiple R-squared:  0.02176 ,  Adjusted R-squared:  0.02131 
F-statistic: 47.62 on 1 and 2150 DF,  p-value: 6.767e-12
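The prediction and the hypothesis test can be worked through by hand from the output above. This sketch simply copies the rounded coefficients and p-value from the table; it assumes, as the question states, that "GCSEs" corresponds to `edu_num = 3`.

```r
# Coefficients copied from the lm_robust output above (rounded as printed)
intercept <- 0.66348
slope     <- 0.03519

# Predicted probability of self-reported turnout at education level 3
edu_level <- 3
predicted_turnout <- intercept + slope * edu_level
predicted_turnout   # about 0.769

# Null hypothesis: the slope on edu_num is zero.
# Compare the reported p-value with the 0.05 threshold.
p_value <- 6.767e-12
p_value < 0.05
```

With the data loaded, the same prediction could be obtained by storing the fitted model and calling `predict()` on it with `newdata = data.frame(edu_num = 3)`.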

Part 3, Exercise 2

  1. Regress validated turnout on levels of education. Remember, your dependent variable needs to be numeric before you can run your regression. Display and interpret the substantive results of your model. (4 marks)
table(df$turnout_validated, useNA = "ifany")

   0    1 <NA> 
 380 1095  719 

Part 3, Exercise 2

  1. Regress validated turnout on levels of education. Remember, your dependent variable needs to be numeric before you can run your regression. Display and interpret the substantive results of your model. (4 marks)
summary(lm_robust(turnout_validated ~ edu_num,data = df))

Call:
lm_robust(formula = turnout_validated ~ edu_num, data = df)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper   DF
(Intercept)  0.66507   0.027754  23.963 3.839e-107 0.610622  0.71951 1447
edu_num      0.02116   0.006649   3.183  1.489e-03 0.008121  0.03421 1447

Multiple R-squared:  0.006686 , Adjusted R-squared:  0.005999 
F-statistic: 10.13 on 1 and 1447 DF,  p-value: 0.001489

Part 3, Exercise 2

  1. What is the null hypothesis? Can you reject it at the 0.01 level? What can you say based on this test about the relationship between turnout and levels of education in the population of British citizens? (4 marks)

  2. Compare the estimates from the two models you ran and interpret what you find. (3 marks)

  3. Do some research. How is turnout “validated”? (2 marks)
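For the test at the 0.01 level, the p-value and a 99% confidence interval can be reconstructed from the validated-turnout output. The numbers below are copied from that table in their rounded form, so the results will match the printed values only approximately.

```r
# Values copied from the validated-turnout regression output (rounded as printed)
estimate  <- 0.02116
std_error <- 0.006649
t_value   <- 3.183
df_resid  <- 1447

# Two-sided p-value for the null hypothesis that the slope is zero
p_value <- 2 * pt(-abs(t_value), df = df_resid)
p_value
p_value < 0.01

# 99% confidence interval: does it exclude zero?
critical <- qt(0.995, df = df_resid)
ci_99 <- c(lower = estimate - critical * std_error,
           upper = estimate + critical * std_error)
ci_99
```

The two checks agree with each other: a p-value below 0.01 goes together with a 99% interval that does not contain zero.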

Part 3, Exercise 2

Self-reported:


Call:
lm_robust(formula = turnout_sr_num ~ edu_num, data = df)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper   DF
(Intercept)  0.66348   0.021934  30.248 9.184e-168  0.62046  0.70649 2150
edu_num      0.03519   0.005099   6.901  6.767e-12  0.02519  0.04518 2150

Multiple R-squared:  0.02176 ,  Adjusted R-squared:  0.02131 
F-statistic: 47.62 on 1 and 2150 DF,  p-value: 6.767e-12

Part 3, Exercise 2

Validated:


Call:
lm_robust(formula = turnout_validated ~ edu_num, data = df)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper   DF
(Intercept)  0.66507   0.027754  23.963 3.839e-107 0.610622  0.71951 1447
edu_num      0.02116   0.006649   3.183  1.489e-03 0.008121  0.03421 1447

Multiple R-squared:  0.006686 , Adjusted R-squared:  0.005999 
F-statistic: 10.13 on 1 and 1447 DF,  p-value: 0.001489
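A quick way to set the two models side by side is to compare the slopes and their 95% confidence intervals. The values are copied from the two outputs above. This is only a descriptive sketch: checking whether intervals overlap is not a formal test of the difference, and the two models are estimated on different numbers of respondents.

```r
# Slope on edu_num and its 95% CI, copied from each output above
self_reported <- c(estimate = 0.03519, ci_lower = 0.02519,  ci_upper = 0.04518)
validated     <- c(estimate = 0.02116, ci_lower = 0.008121, ci_upper = 0.03421)

# How much larger is the slope when turnout is self-reported?
slope_gap <- self_reported[["estimate"]] - validated[["estimate"]]
slope_gap

# Do the two 95% intervals overlap?
intervals_overlap <- self_reported[["ci_lower"]] <= validated[["ci_upper"]] &&
  validated[["ci_lower"]] <= self_reported[["ci_upper"]]
intervals_overlap
```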

Part 3, Exercise 3

  1. Would you use self-reported turnout or validated turnout as your dependent variable if you had to choose? Explain your choice. (4 marks)

Part 3, Exercise 3

  1. Use either the crude (binary) gender variable included in the dataset or the age variable and regress your chosen turnout measure on your predictor of choice. Interpret your results, and make an inference about the population of interest. (6 marks)
summary(df$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18.00   37.00   53.00   52.58   67.00   99.00      19 

Part 3, Exercise 3

  1. Use either the crude (binary) gender variable included in the dataset or the age variable and regress your chosen turnout measure on your predictor of choice. Interpret your results, and make an inference about the population of interest. (6 marks)
summary(lm_robust(turnout_validated ~ Age, data = df))

Call:
lm_robust(formula = turnout_validated ~ Age, data = df)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept) 0.495903  0.0370289  13.392 1.126e-38 0.423268 0.568538 1469
Age         0.004655  0.0006304   7.384 2.562e-13 0.003418 0.005891 1469

Multiple R-squared:  0.03857 ,  Adjusted R-squared:  0.03792 
F-statistic: 54.52 on 1 and 1469 DF,  p-value: 2.562e-13
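Because the Age coefficient is small per year, it can help to translate it into predictions at a few ages and into a ten-year difference. The coefficients are copied from the output above; the ages 18, 53, and 99 are the minimum, median, and maximum from `summary(df$Age)`.

```r
# Coefficients copied from the Age regression output (rounded as printed)
intercept <- 0.495903
slope_age <- 0.004655

# Predicted probability of validated turnout at the youngest, median, and oldest age
ages <- c(18, 53, 99)
predicted <- intercept + slope_age * ages
data.frame(age = ages, predicted_turnout = round(predicted, 3))

# Expected difference in turnout between respondents ten years apart
ten_year_gap <- slope_age * 10
ten_year_gap
```

All three predictions stay between 0 and 1 within the observed age range, which is worth checking whenever a linear model is fitted to a binary outcome.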