Make sure to save the dataset in an easy-to-access folder
A place you can also access from elsewhere (e.g., N-drive)
Objectives of This Bootcamp
Overarching goal
Be on common starting point + ready to take on the quantitative MSc modules
Refresh key statistical concepts
E.g., sampling distributions, hypothesis testing
Obtain basic familiarity with R & Stata
Statistical Inference
Statistical inference is used to learn from incomplete data
We wish to learn some characteristics of a population (e.g., the mean and standard deviation of the heights of all women in the UK), which we must estimate from a sample or subset of that population
Two types of inference:
Descriptive inference: What is going on, or what exists?
Causal inference: Why is something going on, why does it exist? Does X cause Y?
Example data
The code below loads the WDI packages and searches for an indicator on CO2 per capita. It uses the statistics software R (more later).
# load packagelibrary(WDI)# Search GDP per capita (log-transformed)WDIsearch("CO2.*capita")
indicator
6032 EN.ATM.CO2E.PC
6048 EN.ATM.METH.PC
6059 EN.ATM.NOXE.PC
name
6032 CO2 emissions (metric tons per capita)
6048 Methane emissions (kt of CO2 equivalent per capita)
6059 Nitrous oxide emissions (metric tons of CO2 equivalent per capita)
The code below uses the WDI API to retrieve the data and creates a dataframe of three indicators.
# Define countries, indicators form above, and time periodwd.df <-WDI(country ="all", indicator =c('population'="SP.POP.TOTL", 'gdp_pc'="NY.GDP.PCAP.KD", 'co2_pc'="EN.ATM.CO2E.PC"),extra =TRUE,start =2019, end =2019)# Drop all country aggregrateswd.df <- wd.df[which(wd.df$region !="Aggregates"), ]# Save datasave(wd.df, file ="WDI_short.RData")
Descriptive Inference
expand for full code
library(ggplot2)pl <-ggplot(wd.df, aes(x = gdp_pc, y = co2_pc, size = population, color = region)) +geom_point(alpha =0.5) +theme_minimal() +scale_y_log10() +scale_x_log10(labels = scales::dollar_format()) +labs(y ="CO2 emissions per capita (log-transformed)", x ="GDP per capita (log-transformed)")pl
Descriptive: What are the average CO2 emissions of all European countries?
Descriptive Inference
What are the average CO2 emissions of all European countries?
Mean by region
mean_co2 <-mean(wd.df$co2_pc[which(wd.df$region =="Europe & Central Asia")], na.rm =TRUE)mean_co2
[1] 5.608387
This is the average across European countries.
Why is it not the average across all Europeans?
Descriptive Inference
What are the average CO2 emissions of all European countries?
Code
# Create group means (only for complete observations)tmp.df <- wd.df[, c("region", "gdp_pc", "co2_pc")]tmp.df <- tmp.df[complete.cases(tmp.df),]means.df <-aggregate(tmp.df[, c("gdp_pc", "co2_pc")],by =list(region = tmp.df$region),FUN =function(x) mean(x, na.rm =TRUE))# Plotpl_mean <-ggplot(wd.df, aes(x = gdp_pc, y = co2_pc, color = region)) +geom_point(data = wd.df, mapping =aes(x = gdp_pc, y = co2_pc, size = population, color = region),alpha =0.5) +geom_point(data = means.df, mapping =aes(x = gdp_pc, y = co2_pc, fill = region),size =4, shape =25, color ="yellow") +theme_minimal() +scale_y_log10() +scale_x_log10(labels = scales::dollar_format()) +labs(y ="CO2 emissions per capita (log-transformed)", x ="GDP per capita (log-transformed)")pl_mean
Causal Inference
expand for full code
pl2 <-ggplot(wd.df, aes(x = gdp_pc, y = co2_pc, size = population, color = region)) +geom_smooth(aes(group =1), show.legend ="none") +geom_point(alpha =0.5) +theme_minimal() +scale_y_log10() +scale_x_log10(labels = scales::dollar_format()) +labs(y ="CO2 emissions per capita (log-transformed)", x ="GDP per capita (log-transformed)")pl2
Causal: How does GDP influence the amount of CO2 emissions?
Variables and Observations
A variable is anything that can vary across units of analysis: it is a characteristic that can have multiple values
E.g., sex, age, diameter, financial revenue, temperature
A unit of analysis is the major entity that you analyse
E.g., individuals, objects, schools, countries
An observation is the value of a particular variable for a particular unit (sometimes a unit is in its entirety referred to as observation)
E.g., the individual King Charles III is 73 years of age
Different Types of Variables
Continuous / interval-ratio variables:
They have an ordering, they can take on infinitely many values, and you can do calculations with them
E.g., income, age, weight, minutes
Categorical variables:
Each observation belongs to one out of a fixed number of categories
Ordinal variables: there is a natural ordering of the categories
Nominal variables: there is no natural ordering of the categories
May give a distorted impression if there are outliers
Median: value that falls in the middle if we order all units by their value on the variable
Mode: most frequently occurring value across all units
Measures of Dispersion
Variance: average of the squared differences between each observed value on a variable and its mean \[
\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2
\]
Why the square? To treat + and – differences alike
Standard deviation: average departure of the observed values on a variable from its mean
It is the square root of the variance; it “reverts” the square-taking in the variance calculation, to bring the statistic back to the original scale of the variable
Typically, we use Roman letters for sample statistics and Greek letters for population statistics:
Sample mean \(= \bar{x}\) population mean \(= \mu\)
Sample variance \(= s^2\), population variance \(= \sigma^2\)
Sample standard error \(= s\), population standard deviation \(= \sigma\)
Recall: the sample is what we observe, the population is what we want to make inferences about
Decribing relationships
Decribing relationships
Often, we are not just interested in the distribution of a single variable. 1
Instead, we are interested in relationships between several variables. If there is a relationship between variable, knowing one variable can tell you something about the other variable.
Age and height
GDP and CO2 emissions
Education and income
Hypothesis testing
A hypothesis is a theory-based statement about a relationship that we expect to observe
E.g., girls achieve higher scores than boys on reading tests
For every hypothesis there is a corresponding null hypothesis about what we would expect if our theory is incorrect
E.g., there is no association between Y and X in the population
In our example: girls are not better readers than boys
Covariance
Covariance refers to the idea that the pattern of variation for one variable corresponds to the pattern of variation for another variable: the two variables “vary together”
Statistically speaking, covariance is the multiplication of the deviations from the mean for the first variables and the deviations from the mean for the second variable:
\(y_i\): log-transformed wage of individual \(i\);
\(X_{1i}\): education years of individual \(i\).
\(X_{2i}\): whether or not individual \(i\) is union member.
\(\epsilon_i\): error term. It contains all factors affecting the wage, but not contained in education years and union membership (such as age, gender, subject, quantitative methods skills, and lots more).
An example: wage
Code
# Get example data library(plm)data("Wages")# Reduce data frame to cross-sectionWages$year <-rep(c(1976:1982), 595)wages_cs.df <- Wages[which(Wages$year ==1976), ]# Reduce to 50 individual random sampleset.seed(653210)wages_cs.df <- wages_cs.df[sample(1:595, 50), ]# Meansmeans.df <-aggregate(wages_cs.df[, c("ed", "lwage")],by =list(union = wages_cs.df$union),FUN =function(x) mean(x, na.rm =TRUE))# Plot with linear linepl3 <-ggplot(wages_cs.df, aes(x = ed, y = lwage, color = union, shape = union)) +geom_point(alpha =1) +geom_point(data = means.df, mapping =aes(color = union), alpha =0.5, stroke =2, size =2, shape =c(1,2), show.legend =FALSE) +theme_minimal() +labs(y ="log-transformed wage", x ="Education years") +ggtitle("50 individuals from the Panel Study of Income Dynamics")pl3
An example: wage
The OLS estimator
mod1.lm <-lm(lwage ~ ed + union, data = wages_cs.df)summary(mod1.lm)
Call:
lm(formula = lwage ~ ed + union, data = wages_cs.df)
Residuals:
Min 1Q Median 3Q Max
-0.86123 -0.19264 0.01753 0.22068 0.59062
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.49218 0.25015 21.956 < 2e-16 ***
ed 0.06330 0.01746 3.626 0.000707 ***
unionyes 0.16195 0.11259 1.438 0.156938
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3526 on 47 degrees of freedom
Multiple R-squared: 0.2186, Adjusted R-squared: 0.1853
F-statistic: 6.574 on 2 and 47 DF, p-value: 0.003039
An example: wage
Code
library(gganimate)# Residualise ytmp.mod1 <-lm(lwage ~ union, data = wages_cs.df)wages_cs.df$lwage_resid <-resid(tmp.mod1)# Residualise xtmp.mod1 <-lm(ed ~ union, data = wages_cs.df)wages_cs.df$ed_resid <-resid(tmp.mod1)# Create data of states to animatedf <-rbind(data.frame(lwage = wages_cs.df$lwage,ed = wages_cs.df$ed,union = wages_cs.df$union,resid =0),data.frame(lwage = wages_cs.df$lwage_resid,ed = wages_cs.df$ed,union = wages_cs.df$union,resid =1),data.frame(lwage = wages_cs.df$lwage_resid, # just to duplicateed = wages_cs.df$ed,union = wages_cs.df$union,resid =2),data.frame(lwage = wages_cs.df$lwage_resid,ed = wages_cs.df$ed_resid,union = wages_cs.df$union,resid =3),data.frame(lwage = wages_cs.df$lwage_resid, # just to duplicateed = wages_cs.df$ed_resid,union = wages_cs.df$union,resid =4),data.frame(lwage = wages_cs.df$lwage_resid, # just to duplicateed = wages_cs.df$ed_resid,union = wages_cs.df$union,resid =5))means.df <-aggregate(df[, c("lwage", "ed")],by =list(union = df$union,resid = df$resid),FUN =function(x) mean(x, na.rm =TRUE))# Plot with linear linepl4 <-ggplot(df, aes(x = ed, y = lwage, color = union, shape = union)) +geom_point(alpha =1) +geom_point(data = means.df, alpha =0.5, stroke =2, size =2, show.legend =FALSE) +theme_minimal() +labs(y ="log-transformed wage", x ="Education years") +ggtitle("50 individuals from the Panel Study of Income Dynamics")+transition_states(resid, wrap =FALSE) +shadow_mark(alpha =0.7, size =1) # + view_zoom(nsteps = 3, include = TRUE, # look_ahead = 4, wrap = FALSE)pl4
An example: wage
Code
# Meansmeans.df <-aggregate(wages_cs.df[, c("ed_resid", "lwage_resid")],by =list(union = wages_cs.df$union),FUN =function(x) mean(x, na.rm =TRUE))# Plot with linear linepl5_0 <-ggplot(wages_cs.df, aes(x = ed_resid, y = lwage_resid)) +geom_point(aes(color = union, shape = union), alpha =1) +geom_point(data = means.df, mapping =aes(color = union), alpha =0.5, stroke =2, size =2, shape =c(1,2), show.legend =FALSE) +theme_minimal() +labs(y ="residualized log-transformed wage", x ="residualized Education years") +ggtitle("50 individuals from the Panel Study of Income Dynamics")pl5_0
An example: wage
Code
# fit the modelfit <-lm(lwage_resid ~ ed_resid, data = wages_cs.df)# Save the predicted valueswages_cs.df$predicted <-predict(fit)# Save the residual valueswages_cs.df$residuals <-residuals(fit)library(ggpubr)# Plot with linear linepl5 <-ggplot(wages_cs.df, aes(x = ed_resid, y = lwage_resid)) +geom_point(aes(color = union, shape = union), alpha =1) +geom_point(data = means.df, mapping =aes(color = union), alpha =0.5, stroke =2, size =2, shape =c(1,2), show.legend =FALSE) +geom_smooth(aes(group =1), method ='lm', show.legend ="none", se =FALSE) +geom_segment(aes(xend = ed_resid, yend = predicted), alpha = .2, color ="purple") +geom_point(aes(x = ed_resid, y = predicted), alpha =0.5, shape =1, color ="red") +theme_minimal() +labs(y ="residualized log-transformed wage", x ="residualized Education years") +ggtitle("50 individuals from the Panel Study of Income Dynamics") +stat_regline_equation(aes(label = ..eq.label..))pl5