<- c("sf", "gstat", "mapview", "nngeo", "rnaturalearth", "dplyr",
pkgs "nomisr", "osmdata", "tidyr", "texreg")
lapply(pkgs, require, character.only = TRUE)
1 Refresher
Required packages
Session info
sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8
[2] LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] texreg_1.38.6 tidyr_1.3.0 osmdata_0.2.3
[4] nomisr_0.4.7 dplyr_1.1.2 rnaturalearth_0.3.3
[7] nngeo_0.4.7 mapview_2.11.0 gstat_2.1-1
[10] sf_1.0-13
loaded via a namespace (and not attached):
[1] xfun_0.39 raster_3.6-20 htmlwidgets_1.6.2
[4] lattice_0.21-8 vctrs_0.6.3 tools_4.3.1
[7] crosstalk_1.2.0 generics_0.1.3 stats4_4.3.1
[10] tibble_3.2.1 proxy_0.4-27 spacetime_1.3-0
[13] fansi_1.0.4 xts_0.13.1 pkgconfig_2.0.3
[16] KernSmooth_2.23-21 satellite_1.0.4 data.table_1.14.8
[19] leaflet_2.1.2 webshot_0.5.4 lifecycle_1.0.3
[22] compiler_4.3.1 FNN_1.1.3.2 rsdmx_0.6-2
[25] munsell_0.5.0 terra_1.7-29 codetools_0.2-19
[28] snakecase_0.11.0 htmltools_0.5.5 class_7.3-22
[31] pillar_1.9.0 classInt_0.4-9 tidyselect_1.2.0
[34] digest_0.6.31 purrr_1.0.1 fastmap_1.1.1
[37] grid_4.3.1 colorspace_2.1-0 cli_3.6.1
[40] magrittr_2.0.3 base64enc_0.1-3 XML_3.99-0.14
[43] utf8_1.2.3 leafem_0.2.0 e1071_1.7-13
[46] scales_1.2.1 sp_1.6-1 rmarkdown_2.22
[49] httr_1.4.6 zoo_1.8-12 png_0.1-8
[52] evaluate_0.21 knitr_1.43 rlang_1.1.1
[55] Rcpp_1.0.10 glue_1.6.2 DBI_1.1.3
[58] rstudioapi_0.14 jsonlite_1.8.5 R6_2.5.1
[61] plyr_1.8.8 intervals_0.15.3 units_0.8-2
1.1 Coordinates
In general, spatial data is structured like conventional/tidy data (e.g. data.frames, matrices), but has one additional dimension: every observation is linked to some sort of geo-spatial information. Most common types of spatial information are:
Points (one coordinate pair)
Lines (two coordinate pairs)
Polygons (at least three coordinate pairs)
Regular grids (one coordinate pair for centroid + raster / grid size)
1.1.1 Coordinate reference system (CRS)
In its raw form, a pair of coordinates consists of two numerical values. For instance, the pair c(51.752595, -1.262801)
describes the location of Nuffield College in Oxford (one point). The fist number represents the latitude (north-south direction), the second number is the longitude (west-east direction), both are in decimal degrees.
However, we need to specify a reference point for latitudes and longitudes (in the Figure above: equator and Greenwich). For instance, the pair of coordinates above comes from Google Maps which returns GPS coordinates in ‘WGS 84’ (EPSG:4326).
# Coordinate pairs of two locations
<- c(51.752595, -1.262801)
coords1 <- c(51.753237, -1.253904)
coords2 <- rbind(coords1, coords2)
coords
# Conventional data frame
<- data.frame(name = c("Nuffield College", "Radcliffe Camera"),
nuffield.df address = c("New Road", "Radcliffe Sq"),
lat = coords[,1], lon = coords[,2])
head(nuffield.df)
name address lat lon
coords1 Nuffield College New Road 51.75259 -1.262801
coords2 Radcliffe Camera Radcliffe Sq 51.75324 -1.253904
# Combine to spatial data frame
<- st_as_sf(nuffield.df,
nuffield.spdf coords = c("lon", "lat"), # Order is important
crs = 4326) # EPSG number of CRS
# Map
mapview(nuffield.spdf, zcol = "name")
1.1.2 Projected CRS
However, different data providers use different CRS. For instance, spatial data in the UK usually uses ‘OSGB 1936 / British National Grid’ (EPSG:27700). Here, coordinates are in meters, and projected onto a planar 2D space.
There are a lot of different CRS projections, and different national statistics offices provide data in different projections. Data providers usually specify which reference system they use. This is important as using the correct reference system and projection is crucial for plotting and manipulating spatial data.
If you do not know the correct CRS, try starting with a standards CRS like EPSG:4326 if you have decimal degree like coordinates. If it looks like projected coordinates, try searching for the country or region in CRS libraries like https://epsg.io/. However, you must check if the projected coordinates match their real location, e.g. using mpaview()
.
1.1.3 Why different projections?
By now, (most) people agree that the earth is not flat. So, to plot data on a 2D planar surface and to perform certain operations on a planar world, we need to make some re-projections. Depending on where we are, different re-projections of our data (globe in this case) might work better than others.
<- ne_countries(scale = "medium", returnclass = "sf")
world class(world)
[1] "sf" "data.frame"
st_crs(world)
Coordinate Reference System:
User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
wkt:
BOUNDCRS[
SOURCECRS[
GEOGCRS["unknown",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ID["EPSG",6326]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8901]],
CS[ellipsoidal,2],
AXIS["longitude",east,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]],
AXIS["latitude",north,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]]]],
TARGETCRS[
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]]],
ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
METHOD["Geocentric translations (geog2D domain)",
ID["EPSG",9603]],
PARAMETER["X-axis translation",0,
ID["EPSG",8605]],
PARAMETER["Y-axis translation",0,
ID["EPSG",8606]],
PARAMETER["Z-axis translation",0,
ID["EPSG",8607]]]]
# Extract a country and plot in current CRS (WGS84)
<- world[world$name == "Germany", ]
ger.spdf plot(st_geometry(ger.spdf))
# Now, lets transform Germany into a CRS optimized for Iceland
<- st_transform(ger.spdf, crs = 5325)
ger_rep.spdf plot(st_geometry(ger_rep.spdf))
Depending on the angle, a 2D projection of the earth looks different. It is important to choose a suitable projection for the available spatial data. For more information on CRS and re-projection, see e.g. Lovelace, Nowosad, and Muenchow (2019) or Stefan Jünger & Anne-Kathrin Stroppe’s GESIS workshop materials.
1.2 Importing some real word data
sf
imports many of the most common spatial data files, like geojson, gpkg, or shp.
1.2.1 London shapefile (polygon)
Lets get some administrative boundaries for London from the London Datastore. We use the sf
package and its funtion st_read()
to import the data.
# Create subdir (all data withh be stored in "_data")
<- "_data"
dn ifelse(dir.exists(dn), "Exists", dir.create(dn))
[1] "Exists"
# Download zip file and unzip
<- tempfile()
tmpf <- "https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip"
boundary.link download.file(boundary.link, tmpf)
unzip(zipfile = tmpf, exdir = paste0(dn))
unlink(tmpf)
# This is a shapefile
# We only need the MSOA layer for now
<- st_read(dsn = paste0(dn, "/statistical-gis-boundaries-london/ESRI"),
msoa.spdf layer = "MSOA_2011_London_gen_MHW" # Note: no file ending
)
Reading layer `MSOA_2011_London_gen_MHW' from data source
`C:\work\Lehre\Geodata_Spatial_Regression\_data\statistical-gis-boundaries-london\ESRI'
using driver `ESRI Shapefile'
Simple feature collection with 983 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
Projected CRS: OSGB36 / British National Grid
The object msoa.spdf
is our spatial data.frame. It looks essentially like a conventional data.frame, but has some additional attributes and geo-graphical information stored with it. Most importantaly, notice the column geometry
, which contains a list of polygons. In most cases, we have one polygon for each line / observation.
head(msoa.spdf)
Simple feature collection with 6 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 530966.7 ymin: 180510.7 xmax: 551943.8 ymax: 191139
Projected CRS: OSGB36 / British National Grid
MSOA11CD MSOA11NM LAD11CD LAD11NM
1 E02000001 City of London 001 E09000001 City of London
2 E02000002 Barking and Dagenham 001 E09000002 Barking and Dagenham
3 E02000003 Barking and Dagenham 002 E09000002 Barking and Dagenham
4 E02000004 Barking and Dagenham 003 E09000002 Barking and Dagenham
5 E02000005 Barking and Dagenham 004 E09000002 Barking and Dagenham
6 E02000007 Barking and Dagenham 006 E09000002 Barking and Dagenham
RGN11CD RGN11NM USUALRES HHOLDRES COMESTRES POPDEN HHOLDS
1 E12000007 London 7375 7187 188 25.5 4385
2 E12000007 London 6775 6724 51 31.3 2713
3 E12000007 London 10045 10033 12 46.9 3834
4 E12000007 London 6182 5937 245 24.8 2318
5 E12000007 London 8562 8562 0 72.1 3183
6 E12000007 London 8791 8672 119 50.6 3441
AVHHOLDSZ geometry
1 1.6 MULTIPOLYGON (((531667.6 18...
2 2.5 MULTIPOLYGON (((548881.6 19...
3 2.6 MULTIPOLYGON (((549102.4 18...
4 2.6 MULTIPOLYGON (((551550 1873...
5 2.7 MULTIPOLYGON (((549099.6 18...
6 2.5 MULTIPOLYGON (((549819.9 18...
Shapefiles are still among the most common formats to store and transmit spatial data, despite them being inefficient (file size and file number).
However, sf
reads everything spatial, such as geo.json
, which usually is more efficient, but less common (but we’re getting there).
# Download file
<- "https://data.london.gov.uk/download/ultra_low_emissions_zone/3d980a29-c340-4892-8230-ed40d8c7f32d/Ultra_Low_Emissions_Zone.json"
ulez.link download.file(ulez.link, paste0(dn, "/ulez.json"))
# Read geo.json
st_layers(paste0(dn, "/ulez.json"))
Driver: GeoJSON
Available layers:
layer_name geometry_type features fields crs_name
1 Ultra_Low_Emissions_Zone Polygon 1 2 WGS 84
<- st_read(dsn = paste0(dn, "/ulez.json")) # here dsn is simply the file ulez.spdf
Reading layer `Ultra_Low_Emissions_Zone' from data source
`C:\work\Lehre\Geodata_Spatial_Regression\_data\ulez.json'
using driver `GeoJSON'
Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -0.167077 ymin: 51.48627 xmax: -0.07213563 ymax: 51.53182
Geodetic CRS: WGS 84
head(ulez.spdf)
Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -0.167077 ymin: 51.48627 xmax: -0.07213563 ymax: 51.53182
Geodetic CRS: WGS 84
boundary
1 Ultra Low Emission Zone
url
1 https://tfl.gov.uk/modes/driving/ultra-low-emission-zone
geometry
1 POLYGON ((-0.1046873 51.531...
Again, this looks like a conventional data.frame
but has the additional column geometry
containing the coordinates of each observation. st_geometry()
returns only the geographic object and st_drop_geometry()
only the data.frame
without the coordinates. We can plot the object using mapview()
.
mapview(msoa.spdf[, "POPDEN"])
1.2.2 Census API (admin units)
Now that we have some boundaries and shapes of spatial units in London, we can start looking for different data sources to populate the geometries.
A good source for demographic data is for instance the 2011 census. Below we use the nomis API to retrieve population data for London, See the Vignette for more information (Guest users are limited to 25,000 rows per query). Below is a wrapper to avoid some errors with sex and urban-rural cross-tabulation in some of the data.
### For larger request, register and set key
# Sys.setenv(NOMIS_API_KEY = "XXX")
# nomis_api_key(check_env = TRUE)
<- nomis_data_info()
x
# Get London ids
<- msoa.spdf$MSOA11CD
london_ids
### Get key statistics ids
# select requires tables (https://www.nomisweb.co.uk/sources/census_2011_ks)
# Let's get KS201EW (ethnic group), KS205EW (passport held), and KS402EW (housing tenure)
# Get internal ids
<- c("KS201EW", "KS402EW", "KS205EW")
stats <- which(grepl(paste(stats, collapse = "|"), x$name.value))
oo <- x$id[oo]
ksids # This are the internal ids ksids
[1] "NM_608_1" "NM_612_1" "NM_619_1"
### look at meta information
<- nomis_overview(ksids[1])
q head(q)
# A tibble: 6 × 2
name value
<chr> <list>
1 analyses <named list [1]>
2 analysisname <chr [1]>
3 analysisnumber <int [1]>
4 contact <named list [4]>
5 contenttypes <named list [1]>
6 coverage <chr [1]>
<- nomis_get_metadata(id = ksids[1], concept = "GEOGRAPHY", type = "type")
a # TYPE297 is MSOA level a
# A tibble: 24 × 3
id label.en description.en
<chr> <chr> <chr>
1 TYPE265 NHS area teams NHS area teams
2 TYPE266 clinical commissioning groups clinical comm…
3 TYPE267 built-up areas including subdivisions built-up area…
4 TYPE269 built-up areas built-up areas
5 TYPE273 national assembly for wales electoral reg… national asse…
6 TYPE274 postcode areas postcode areas
7 TYPE275 postcode districts postcode dist…
8 TYPE276 postcode sectors postcode sect…
9 TYPE277 national assembly for wales constituencie… national asse…
10 TYPE279 parishes 2011 parishes 2011
# ℹ 14 more rows
<- nomis_get_metadata(id = ksids[1], concept = "MEASURES", type = "TYPE297")
b # 20100 is the measure of absolute numbers b
# A tibble: 2 × 3
id label.en description.en
<chr> <chr> <chr>
1 20100 value value
2 20301 percent percent
### Query data in loop over the required statistics
for(i in ksids){
# Determin if data is divided by sex or urban-rural
<- nomis_get_metadata(id = i)
nd if("RURAL_URBAN" %in% nd$conceptref){
<- TRUE
UR else{
}<- FALSE
UR
}if("C_SEX" %in% nd$conceptref){
<- TRUE
SEX else{
}<- FALSE
SEX
}
# make data request
if(UR == TRUE){
if(SEX == TRUE){
<- nomis_get_data(id = i, time = "2011",
tmp_en geography = london_ids, # replace with "TYPE297" for all MSOAs
measures = 20100, RURAL_URBAN = 0, C_SEX = 0)
else{
}<- nomis_get_data(id = i, time = "2011",
tmp_en geography = london_ids, # replace with "TYPE297" for all MSOAs
measures = 20100, RURAL_URBAN = 0)
}else{
}if(SEX == TRUE){
<- nomis_get_data(id = i, time = "2011",
tmp_en geography = london_ids, # replace with "TYPE297" for all MSOAs
measures = 20100, C_SEX = 0)
else{
}<- nomis_get_data(id = i, time = "2011",
tmp_en geography = london_ids, # replace with "TYPE297" for all MSOAs
measures = 20100)
}
}
# Append (in case of different regions)
<- tmp_en
ks_tmp
# Make lower case names
names(ks_tmp) <- tolower(names(ks_tmp))
names(ks_tmp)[names(ks_tmp) == "geography_code"] <- "msoa11"
names(ks_tmp)[names(ks_tmp) == "geography_name"] <- "name"
# replace weird cell codes
<- which(grepl("^[[:digit:]]+$", ks_tmp$cell_code))
onlynum if(length(onlynum) != 0){
<- substr(ks_tmp$cell_code[-onlynum][1], 1, 7)
code if(is.na(code)){
<- i
code
}$cell_code[onlynum] <- paste0(code, "_", ks_tmp$cell_code[onlynum])
ks_tmp
}
# save codebook
<- unique(ks_tmp[, c("date", "cell_type", "cell", "cell_code", "cell_name")])
ks_cb
### Reshape
<- tidyr::pivot_wider(ks_tmp, id_cols = c("msoa11", "name"),
ks_res names_from = "cell_code",
values_from = "obs_value")
### Merge
if(i == ksids[1]){
<- ks_res
census_keystat.df <- ks_cb
census_keystat_cb.df else{
}<- merge(census_keystat.df, ks_res, by = c("msoa11", "name"), all = TRUE)
census_keystat.df <- rbind(census_keystat_cb.df, ks_cb)
census_keystat_cb.df
}
}
# Descriptions are saved in the codebook
head(census_keystat_cb.df)
# A tibble: 6 × 5
date cell_type cell cell_code cell_name
<dbl> <chr> <dbl> <chr> <chr>
1 2011 Ethnic Group 0 KS201EW0001 All usual residents
2 2011 Ethnic Group 100 KS201EW_100 White
3 2011 Ethnic Group 1 KS201EW0002 White: English/Welsh/Scottis…
4 2011 Ethnic Group 2 KS201EW0003 White: Irish
5 2011 Ethnic Group 3 KS201EW0004 White: Gypsy or Irish Travel…
6 2011 Ethnic Group 4 KS201EW0005 White: Other White
save(census_keystat_cb.df, file = "_data/Census_codebook.RData")
Now, we have one file containing the geometries of MSOAs and one file with the census information on ethnic groups. Obviously, we can easily merge them together using the MSOA identifiers.
<- merge(msoa.spdf, census_keystat.df,
msoa.spdf by.x = "MSOA11CD", by.y = "msoa11", all.x = TRUE)
And we can, for instance, plot the spatial distribution of ethnic groups.
$per_white <- msoa.spdf$KS201EW_100 / msoa.spdf$KS201EW0001 * 100
msoa.spdf
mapview(msoa.spdf[, "per_white"])
If you’re interested in more data sources, see for instance APIs for social scientists: A collaborative review by Paul C. Bauer, Camille Landesvatter, Lion Behrens. It’s a collection of several APIs for social sciences.
1.2.3 Gridded data
So far, we have queried data on administrative units. However, often data comes on other spatial scales. For instance, we might be interested in the amount of air pollution, which is provided on a regular grid across the UK from Defra.
# Download
<- "https://uk-air.defra.gov.uk/datastore/pcm/mapno22011.csv"
pol.link download.file(pol.link, paste0(dn, "/mapno22011.csv"))
<- read.csv(paste0(dn, "/mapno22011.csv"), skip = 5, header = T, sep = ",",
pol.df stringsAsFactors = F, na.strings = "MISSING")
head(pol.df)
ukgridcode x y no22011
1 54291 460500 1221500 NA
2 54292 461500 1221500 NA
3 54294 463500 1221500 NA
4 54979 458500 1220500 NA
5 54980 459500 1220500 NA
6 54981 460500 1220500 NA
The data comes as point data with x and y as coordinates. We have to transform this into spatial data first. We first setup a spatial points object with st_as_sf
. Subsequently, we transform the point coordinates into a regular grid. We use a buffer method st_buffer
with “diameter”, and only one segment per quadrant (nQuadSegs
). This gives us a 1x1km regular grid.
# Build spatial object
<- st_as_sf(pol.df, coords = c("x", "y"),
pol.spdf crs = 27700)
# we transform the point coordinates into a regular grid with "diameter" 500m
<- st_buffer(pol.spdf, dist = 500, nQuadSegs = 1,
pol.spdf endCapStyle = 'SQUARE')
# Plot NO2
plot(pol.spdf[, "no22011"], border = NA)
1.2.4 OpenStreetMap (points)
Another interesting data source is the OpenStreetMap API, which provides information about the geographical location of a serious of different indicators. Robin Lovelace provides a nice introduction to the osmdata API. Available features can be found on OSM wiki.
First we create a bounding box of where we want to query data. st_bbox()
can be used to get bounding boxes of an existing spatial object (needs CRS = 4326
). An alternative would be to use opq(bbox = 'greater london uk')
.
# bounding box of where we want to query data
<- opq(bbox = st_bbox(st_transform(msoa.spdf, 4326))) q
And we want to get data for all pubs and bars which are within this bounding box.
# First build the query of location of pubs in London
<- add_osm_feature(q, key = "amenity", value = "pub")
osmq
# And then query the data
<- osmdata_sf(osmq) pubs.osm
Right now there are some results in polygons, some in points, and they overlap. Often, data from OSM needs some manual cleaning. Sometimes the same features are represented by different spatial objects (e.g. points + polygons).
# Make unique points / polygons
<- unique_osmdata(pubs.osm)
pubs.osm
# Get points and polygons (there are barley any pubs as polygons, so we ignore them)
<- pubs.osm$osm_points
pubs.points <- pubs.osm$osm_multipolygons
pubs.polys
# # Drop OSM file
# rm(pubs.osm); gc()
# Reduce to point object only
<- pubs.points
pubs.spdf
# Reduce to a few variables
<- pubs.spdf[, c("osm_id", "name", "addr:postcode", "diet:vegan")] pubs.spdf
Again, we can inspect the results with mapview
.
mapview(st_geometry(pubs.spdf))
Note that OSM is solely based on contribution by users, and the quality of OSM data varies. Usually data quality is better in larger cities, and better for more stable features (such as hospitals, train stations, highways) rahter than pubs or restaurants which regularly appear and disappear. However, data from London Datastore would indicate more pubs than what we find with OSM.
1.2.5 Save
We will store the created data to use them again in the next session.
save(msoa.spdf, file = "_data/msoa_spatial.RData")
save(ulez.spdf, file = "_data/ulez_spatial.RData")
save(pol.spdf, file = "_data/pollution_spatial.RData")
save(pubs.spdf, file = "_data/pubs_spatial.RData")