Once again, create your own .Rmd file from File > New File > R Markdown. Only one report is needed per group.
Run the following code to load the needed packages. You may need to install them first with install.packages(); note that they’re both pretty big, so installation may take a bit:
library(tidyverse)
library(sf)
We will use the sf package, which stands for simple features. Simple features are a commonly used spatial data standard that specifies storage and access for map geometries used by geographic information systems (GIS). The sf package in R represents simple features in a tidy way, in which each row stands for a simple feature, and each column corresponds to a variable.
Simple features use 2D geometries, such as points, lines, and polygons, which “connect the dots” between defined points in space.
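To make the "one row per feature" idea concrete, here is a minimal sketch that builds a tiny sf object by hand. The coordinates, the `name` column, and the choice of `crs = 4326` are all made up for illustration; they are not part of the lab data.

```r
library(sf)

# Three toy geometries, each defined by explicit coordinates
pt <- st_point(c(0, 0))
ln <- st_linestring(rbind(c(0, 0), c(1, 1)))
# A polygon ring must be closed: the last vertex repeats the first
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))

# Combine them into a geometry column (CRS chosen arbitrarily for this sketch)
geoms <- st_sfc(pt, ln, pg, crs = 4326)

# One row per feature, one ordinary column plus the geometry column
toy <- st_sf(name = c("a point", "a line", "a polygon"), geometry = geoms)
toy
```

Printing `toy` should show the same kind of header and geometry column as the county data used in the rest of the lab.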
To read simple features from a file or database, use the st_read() function. The sf package already includes some example files. We will load one that contains the counties of North Carolina:
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc
## Simple feature collection with 100 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## First 10 features:
## AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
## 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
## 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
## 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
## 4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
## 5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
## 6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
## 7 0.062 1.547 1834 1834 Camden 37029 37029 15 286 0
## 8 0.091 1.284 1835 1835 Gates 37073 37073 37 420 0
## 9 0.118 1.421 1836 1836 Warren 37185 37185 93 968 4
## 10 0.124 1.428 1837 1837 Stokes 37169 37169 85 1612 1
## NWBIR74 BIR79 SID79 NWBIR79 geometry
## 1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
## 2 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
## 3 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
## 4 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
## 5 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
## 6 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
## 7 115 350 2 139 MULTIPOLYGON (((-76.00897 3...
## 8 254 594 2 371 MULTIPOLYGON (((-76.56251 3...
## 9 748 1190 2 844 MULTIPOLYGON (((-78.30876 3...
## 10 160 2038 5 176 MULTIPOLYGON (((-80.02567 3...
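When you print the nc object, you should see some metadata associated with these data: the number of features and fields, the geometry type (MULTIPOLYGON), the dimension (XY), the bounding box, and the coordinate reference system (NAD27).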
Other than these metadata, we have a tidy dataset as we’re used to, except the last column is a geometry.
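Each piece of metadata in the printed header can also be pulled out on its own. The following is a small sketch using accessor functions from sf; it reloads `nc` so that it runs by itself.

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Overall geometry type for the whole object (rather than one value per row)
st_geometry_type(nc, by_geometry = FALSE)

# Bounding box: xmin, ymin, xmax, ymax
st_bbox(nc)

# Coordinate reference system
st_crs(nc)

# An sf object is still a data frame underneath
class(nc)
```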
The dataset for today examines Sudden Infant Death Syndrome (SIDS) in each county in North Carolina. SIDS is the unexplained death of an apparently healthy infant, often during sleep; it was a big cause for concern in the 1970s and 1980s (before people figured out a better way to lay infants down in the crib). We will create a basic visualization that maps the number of SIDS cases to each county. A few of the variables in the dataset are as follows: NAME (county name), BIR74 and BIR79 (number of births), and SID74 and SID79 (number of SIDS deaths).
Now let’s create a basic plot using the nc object! sf objects work well with the tidyverse. For instance, try the following code, noting that we did not specify any aesthetic mapping:
ggplot(data = nc) +
  geom_sf()
We can also set some fixed plot options that apply to every county, along with a theme:
ggplot(data = nc) +
  geom_sf(color = "purple", fill = "lightblue") +
  theme_bw()
Remember that in our dataset, we had some data that corresponded to each county. How might we incorporate them into our plot? We need to set an aesthetic mapping. Importantly: for sf geometries, the aesthetic mapping is done in the geom_sf() layer.
Let’s create a choropleth map that displays the number of births in 1974 for each county. Note how the following code is different from the earlier code:
ggplot(data = nc) +
  geom_sf(aes(fill = BIR74)) +
  theme_bw()
We can also manually set the color scheme by adding a scale_fill_gradient() layer. Here, we’ll specify hex values for our color extremes:
ggplot(data = nc) +
  geom_sf(aes(fill = BIR74)) +
  scale_fill_gradient(low = "#fee8c8", high = "#7f0000") +
  theme_bw()
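The same scale layer can also carry a legend title, and labs() can add a plot title. The sketch below is one possible variation; the title and legend text are made up for illustration and are not required by the lab.

```r
library(sf)
library(ggplot2)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

p <- ggplot(data = nc) +
  geom_sf(aes(fill = BIR74)) +
  # name sets the legend title; low/high set the two ends of the gradient
  scale_fill_gradient(name = "Births, 1974", low = "#fee8c8", high = "#7f0000") +
  labs(title = "Births by county, North Carolina") +
  theme_bw()

p
```

Storing the plot in `p` before printing it makes it easy to add or swap layers later.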
As noted earlier, simple features work well with the tidyverse. For instance, we can use dplyr functions to manipulate data. Let’s filter for counties with over 10,000 births in 1974, and select only the county name and number-of-births variables:
nc %>%
  filter(BIR74 > 10000) %>%
  select(NAME, BIR74)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -81.06555 ymin: 34.45701 xmax: -77.12939 ymax: 36.25709
## geographic CRS: NAD27
## NAME BIR74 geometry
## 1 Forsyth 11858 MULTIPOLYGON (((-80.0381 36...
## 2 Guilford 16184 MULTIPOLYGON (((-79.53782 3...
## 3 Wake 14484 MULTIPOLYGON (((-78.92107 3...
## 4 Mecklenburg 21588 MULTIPOLYGON (((-81.0493 35...
## 5 Cumberland 20366 MULTIPOLYGON (((-78.49929 3...
## 6 Onslow 11158 MULTIPOLYGON (((-77.53864 3...
Notice that the geometry is “sticky”: the geometry and the metadata associated with it are still carried with the dataset, even though we didn’t select it. To remove the geometry, include the function st_drop_geometry() in your pipeline:
nc %>%
  filter(BIR74 > 10000) %>%
  select(NAME, BIR74) %>%
  st_drop_geometry()
## NAME BIR74
## 1 Forsyth 11858
## 2 Guilford 16184
## 3 Wake 14484
## 4 Mecklenburg 21588
## 5 Cumberland 20366
## 6 Onslow 11158
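To see the difference between the two results directly, the following sketch stores both versions and compares their columns and classes. The object names `big_sf` and `big_df` are just illustrative.

```r
library(sf)
library(dplyr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Geometry sticks around even though we only selected NAME and BIR74
big_sf <- nc %>%
  filter(BIR74 > 10000) %>%
  select(NAME, BIR74)

# Dropping the geometry leaves a plain data frame
big_df <- st_drop_geometry(big_sf)

names(big_sf)           # includes the geometry column
names(big_df)           # only NAME and BIR74
inherits(big_sf, "sf")  # still a spatial object
inherits(big_df, "sf")  # no longer a spatial object
```

Keep the sf version if you still want to map the result with geom_sf(); use the dropped version for ordinary tables.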
We will be using the nc dataset again, as loaded from the sf package.
Hint: look up the documentation for the scale_fill_gradient() function.
Hint: be careful about how you calculate the SIDS rate per 1,000 births.
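As a sketch of the arithmetic behind this hint, a rate per 1,000 births is the number of cases divided by the number of births, multiplied by 1,000. The column name `sids_rate_74` below is made up for illustration, and this is only one way to set it up.

```r
library(sf)
library(dplyr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Cases divided by births, then scaled to "per 1,000 births"
nc_rate <- nc %>%
  mutate(sids_rate_74 = SID74 / BIR74 * 1000)

nc_rate %>%
  st_drop_geometry() %>%
  select(NAME, SID74, BIR74, sids_rate_74) %>%
  head()
```

For example, Ashe County has 1 case out of 1,091 births, which works out to a little under 1 case per 1,000 births. Dividing in the other direction, or leaving out the factor of 1,000, are the easy mistakes to make here.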
Drop the geometries from all three tables.