---
title: "Exercises"
format:
  html:
    code-fold: true
    code-tools: true
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    number_sections: true
author:
  - name: Tobias Rüttenauer
    orcid: 0000-0001-5747-9735
    email: t.ruttenauer@ucl.ac.uk
    url: https://ruettenauer.github.io/
    affiliations:
      - name: UCL Social Research Institute
        address: University College London
        city: London WC1H 0AL
---
### Required packages
```{r, message = FALSE, warning = FALSE, results = 'hide'}
pkgs <- c("plm", "feisr", "sandwich", "texreg", "tidyr", "haven", "dplyr", "ggplot2", "ggforce")
lapply(pkgs, require, character.only = TRUE)
```
### Session info
```{r}
sessionInfo()
```
### Load data
For this exercise, we use a real-world data set. Rather than constructing our own data, we take a shortcut and use the data from the replication package of @Hospido.2012, which can be found [here](http://qed.econ.queensu.ca/jae/datasets/hospido001/).
This is an unbalanced panel from the PSID with 32,066 observations on 2,066 individuals for the period 1968–1993. It consists of male household heads aged 25–55 with at least 9 years of usable wage data.
```{r}
# Load the Stata file
data.df <- read_dta("_data/h-data.dta")
# Put the panel identifiers first, then sort by person and year
vars <- names(data.df)
vars <- c("pid", "year", setdiff(vars, c("pid", "year")))
data.df <- data.df[, vars]
data.df <- data.df[order(data.df$pid, data.df$year), ]
```
| variable name | description|
|-----------------|-------------------------------------------------------|
| pid | INDIVIDUAL IDENTIFIER|
| year | YEAR OF INTERVIEW|
| age | AGE OF INDIVIDUAL|
| white | WHITE DUMMY|
| dropout | DROPOUT DUMMY|
| grad | GRADUATE DUMMY|
| college | COLLEGE DUMMY|
| married | MARRIED DUMMY|
| child | NUMBER OF CHILDREN|
| fsize | FAMILY SIZE|
| hours | YEARLY HOURS OF WORK|
| logwages | LOG OF REAL ANNUAL WAGES|
| changejob | JOB CHANGE DUMMY|
| ten1 | TENURE DUMMY less than a year|
| ten2 | TENURE DUMMY a year|
| ten3 | TENURE DUMMY 2-3 years|
| ten4 | TENURE DUMMY 4 through 9 years|
| ten5 | TENURE DUMMY 10 through 19 years|
| ten6 | TENURE DUMMY 20 years or more|
| profes | PROFESSIONAL, TECHNICAL, AND KINDRED WORKERS DUMMY|
| admin | MANAGERS AND ADMINISTRATORS DUMMY|
| sales | CLERICAL AND SALES WORKERS DUMMY|
| crafts | CRAFTSMAN AND KINDRED WORKERS DUMMY|
| operat | OPERATIVES WORKERS DUMMY|
| servic | LABORERS AND SERVICES WORKERS DUMMY|
| smsa | SMSA (Standard Metropolitan Statistical Area) DUMMY|
| neast | NORTH-EAST DUMMY|
| ncentr | NORTH-CENTRAL DUMMY|
| south | SOUTH DUMMY|
| west | WEST DUMMY|
# Exercise 1
Download and load the data.
Have a look at the data.
* How many observations do we have in 1968? How many in 1990?
* What is the average age in 1968? What was it in 1984? And how is this possible?
* At what age did the individual with ID "5790002" become a father?
* Please calculate the average age for each person.
* Please calculate the lagged age (the age in the previous period).
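
As a starting point for the tasks above, here is a minimal sketch on a made-up toy panel. The data frame `toy.df` and its values are assumptions for illustration only, not the PSID data; the same steps should carry over to `data.df`.

```r
library(dplyr)

# Hypothetical toy panel (NOT the PSID data): 3 persons, unbalanced
toy.df <- data.frame(
  pid   = c(1, 1, 1, 2, 2, 3, 3, 3),
  year  = c(1968, 1969, 1970, 1968, 1969, 1969, 1970, 1971),
  age   = c(30, 31, 32, 40, 41, 25, 26, 27),
  child = c(0, 0, 1, 2, 2, 0, 1, 1)
)

# Number of observations per year
table(toy.df$year)

# Average age in a given year
mean(toy.df$age[toy.df$year == 1968])

# Age at the first observation with at least one child, for one person
toy.df %>%
  filter(pid == 1, child > 0) %>%
  summarise(age_first_child = min(age))

# Person-specific mean of age and within-person lag of age.
# Note: dplyr::lag() takes the previous *row* within each person, which is
# the previous period only if there are no gaps in the years.
toy.df <- toy.df %>%
  arrange(pid, year) %>%
  group_by(pid) %>%
  mutate(m_age   = mean(age),
         lag_age = dplyr::lag(age)) %>%
  ungroup()
```

Calling `dplyr::lag()` with the namespace prefix avoids picking up a different `lag()` when several packages are loaded.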
# Exercise 2
To play around with the data a little, let us estimate some models.
* What is the correlation between age and wage? Please use different estimators to determine different types of correlations: Pooled, Between, FE, RE, CRE.
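
The estimators listed above can be sketched with `plm()` by changing the `model` argument. The example below uses simulated data (an assumption for illustration, not the PSID data) in which the regressor `x` is correlated with time-constant heterogeneity, so the estimators should disagree. The CRE model is written here as a Mundlak-style random effects model that adds the person-specific mean of `x`.

```r
library(plm)

# Simulated toy panel (an assumption for illustration, NOT the PSID data)
set.seed(123)
n   <- 100   # persons
n_t <- 5     # periods per person (balanced)
sim.df <- data.frame(id = rep(1:n, each = n_t), time = rep(1:n_t, times = n))
alpha  <- rep(rnorm(n), each = n_t)        # time-constant heterogeneity
sim.df$x <- 0.5 * alpha + rnorm(n * n_t)   # x is correlated with alpha
sim.df$y <- sim.df$x + alpha + rnorm(n * n_t)  # true within effect of x is 1

# Same formula, different estimators
idx <- c("id", "time")
pols.mod <- plm(y ~ x, data = sim.df, index = idx, model = "pooling")
be.mod   <- plm(y ~ x, data = sim.df, index = idx, model = "between")
fe.mod   <- plm(y ~ x, data = sim.df, index = idx, model = "within")
re.mod   <- plm(y ~ x, data = sim.df, index = idx, model = "random")

# CRE (Mundlak): random effects plus the person-specific mean of x
sim.df$m_x <- ave(sim.df$x, sim.df$id)
cre.mod <- plm(y ~ x + m_x, data = sim.df, index = idx, model = "random")

# Compare the coefficient on x across estimators
round(c(POLS = coef(pols.mod)[["x"]],
        BE   = coef(be.mod)[["x"]],
        FE   = coef(fe.mod)[["x"]],
        RE   = coef(re.mod)[["x"]],
        CRE  = coef(cre.mod)[["x"]]), 3)
```

In a balanced panel like this one, the CRE coefficient on `x` should coincide with the FE coefficient, while POLS and BE pick up the correlation between `x` and the unobserved heterogeneity.
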
# Exercise 3
Can we use this dataset to replicate our earlier analysis of the marital wage premium? What might be a problem here? (Tip: have a look at the `married` variable.)
Instead, we do something similar and investigate whether there is a fatherhood wage premium. In other words, do men experience an increase in wages when they become fathers?
* Restrict the age at the start (first wage) to people aged 25-35
* Use number of children to construct a binary indicator of whether there is a child in the household or not
* Make sure we start only with men who are not yet fathers in the first period.
__Note: this feels like dropping a lot of information! However, it makes sense if we want to correctly identify the effect of interest.__
* Do we need to drop observations where people go from child to no child?
* Calculate the effect of having a child on the wage of men (including controls if reasonable).
* Calculate effects for POLS, RE, and FE (if you have some extra time, also FEIS). Hint: `feis()` needs an object of class `data.frame` as input data.
* Compare using cluster robust standard errors (and screenreg).
* Interpret the results
* Try to perform a placebo test: what happens if you use the "lead" of becoming a father? Why is this an interesting test?
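
The sample restrictions and the FE model with cluster-robust standard errors might be set up roughly as follows. This is a sketch on simulated data (an assumption for illustration, not the PSID data). The names `anychild`, `lead_child`, `age_start` and `child_start` are hypothetical helpers, and the choice of `age` as the only control is likewise just for illustration.

```r
library(dplyr)
library(plm)
library(sandwich)
library(lmtest)

# Simulated toy panel (an assumption for illustration, NOT the PSID data)
set.seed(42)
n     <- 200
years <- 1968:1975
n_t   <- length(years)
sim.df <- data.frame(pid = rep(1:n, each = n_t), year = rep(years, times = n))
sim.df$age <- rep(sample(25:45, n, replace = TRUE), each = n_t) + (sim.df$year - 1968)
# Year of first child: before the panel, during the panel, or never (Inf)
birth_year <- rep(sample(c(1966, 1970:1974, Inf), n, replace = TRUE), each = n_t)
sim.df$child <- as.numeric(sim.df$year >= birth_year)
sim.df$logwages <- 2 + 0.02 * sim.df$age + 0.1 * sim.df$child +
  rep(rnorm(n), each = n_t) + rnorm(n * n_t, sd = 0.1)

# Binary child indicator, its lead, and the first-period values per person
fat.df <- sim.df %>%
  arrange(pid, year) %>%
  group_by(pid) %>%
  mutate(anychild    = as.numeric(child > 0),
         lead_child  = dplyr::lead(anychild),
         age_start   = age[1],
         child_start = anychild[1]) %>%
  ungroup() %>%
  # Keep men aged 25-35 at the start who are not yet fathers in period one
  filter(age_start >= 25, age_start <= 35, child_start == 0) %>%
  as.data.frame()

# FE model with standard errors clustered by person
fe.mod <- plm(logwages ~ anychild + age, data = fat.df,
              index = c("pid", "year"), model = "within")
coeftest(fe.mod, vcov = vcovHC(fe.mod, method = "arellano", cluster = "group"))

# Placebo: add the lead of the child indicator (next period's status)
placebo.mod <- plm(logwages ~ anychild + lead_child + age, data = fat.df,
                   index = c("pid", "year"), model = "within")
coeftest(placebo.mod, vcov = vcovHC(placebo.mod, method = "arellano", cluster = "group"))
```

In this simulation the lead has no true effect, so its coefficient should be close to zero; a clearly non-zero lead in real data would hint at anticipation or selection into fatherhood.
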
# Exercise 4
Can we use one of the new event-study approaches, such as the Callaway and Sant'Anna estimator?
* Preprocess data (treatment group and timing indicator)
* Estimate the model using `att_gt`
* Show the group-time specific estimates (use `est_method = "ipw"` and only a restricted set of controls)
* Interpret the results
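
A rough sketch of these steps is shown below. It assumes that `att_gt()` comes from the `did` package, which is not part of the package list loaded above, and it uses simulated data rather than the PSID data. The variable `g` is a hypothetical name for the timing indicator: the first year with a child, set to 0 for men who never become fathers.

```r
library(did)

# Simulated toy panel (an assumption for illustration, NOT the PSID data)
set.seed(1)
n     <- 300
years <- as.numeric(1968:1975)
n_t   <- length(years)
cs.df <- data.frame(pid = rep(1:n, each = n_t), year = rep(years, times = n))
first_year <- rep(sample(c(1970, 1972, 1974, Inf), n, replace = TRUE), each = n_t)
cs.df$anychild <- as.numeric(cs.df$year >= first_year)
cs.df$logwages <- rep(rnorm(n), each = n_t) + 0.02 * (cs.df$year - 1968) +
  0.1 * cs.df$anychild + rnorm(n * n_t, sd = 0.1)

# Preprocessing: timing indicator g = first year with a child,
# and 0 for the never-treated
cs.df$g <- ave(ifelse(cs.df$anychild == 1, cs.df$year, NA), cs.df$pid,
               FUN = function(z) if (all(is.na(z))) 0 else min(z, na.rm = TRUE))

# Group-time average treatment effects with inverse probability weighting,
# here without covariates (xformla = ~ 1)
cs.mod <- att_gt(yname = "logwages",
                 tname = "year",
                 idname = "pid",
                 gname = "g",
                 xformla = ~ 1,
                 data = cs.df,
                 control_group = "notyettreated",
                 est_method = "ipw")

# Group-time specific estimates as a table and as a plot
summary(cs.mod)
ggdid(cs.mod)
```

With real data, a restricted set of controls would go into `xformla`, and the choice between `"notyettreated"` and `"nevertreated"` as the control group is worth discussing when interpreting the results.
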
# References