class: center, middle, inverse, title-slide # Spatial data wrangling ## Statistical Computing & Programming ### Shawn Santo --- ## Supplementary materials Full video lecture available in Zoom Cloud Recordings Additional resources - Simple Features for R [vignettes](https://r-spatial.github.io/sf/) - [CRS in R](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/OverviewCoordinateReferenceSystems.pdf) by Melanie Frazier - [Leaflet for R](https://rstudio.github.io/leaflet/) --- ## Data Data for today is available at - Package `sf` - NC counties - [NC OneMap](https://www.nconemap.gov) - lots of spatial data on all things NC - [NC PMTW Streams 2020](https://www.nconemap.gov/datasets/0cc135e6c6244c9e9646b45ee3cb6c1e_0) - [NC Public Schools](https://www.nconemap.gov/datasets/public-schools) --- ## Access data Create a folder `data/`. NC PMTW Streams: .small[ ```r download.file(paste0("https://opendata.arcgis.com/datasets/", "0cc135e6c6244c9e9646b45ee3cb6c1e_0.zip", "?outSR=%7B%22latestWkid%22%3A3857%2C%22", "wkid%22%3A102100%7D"), destfile = "data/pmtw.zip") unzip("data/pmtw.zip", exdir = "data/") file.remove("data/pmtw.zip") ``` ] NC Public Schools: .small[ ```r download.file(paste0("https://opendata.arcgis.com/datasets/", "dea6ff0e8b4743a0ba361e13a85a4c70_3.zip", "?outSR=%7B%22latestWkid%22%3A32119%2C%22", "wkid%22%3A32119%7D"), destfile = "data/schools.zip") unzip("data/schools.zip", exdir = "data/") file.remove("data/schools.zip") ``` ] --- ## Read data and load packages Packages ```r library(tidyverse) library(sf) library(janitor) ``` Read in the shapefiles ```r nc_counties <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = T) %>% clean_names() %>% select(name, area) ``` ```r nc_pmtw <- st_read("data/PMTW_Streams_2020.shp", quiet = T) %>% clean_names() %>% select(name = displ_name, designation = first_wrc, length = shape_len) ``` ```r nc_schools <- st_read("data/Public_Schools.shp", quiet = T) %>% clean_names() ``` --- class: inverse, center, middle # Recall --- ## Spatial data is different Our **typical tidy data frame**: .tiny[ ``` #> # A tibble: 336,776 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- ## Spatial data is different A **simple features object**: .tiny[ ``` #> Simple feature collection with 100 features and 5 fields #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 #> geographic CRS: NAD27 #> First 10 features: #> AREA PERIMETER CNTY_ CNTY_ID NAME geometry #> 1 0.114 1.442 1825 1825 Ashe MULTIPOLYGON (((-81.47276 3... #> 2 0.061 1.231 1827 1827 Alleghany MULTIPOLYGON (((-81.23989 3... #> 3 0.143 1.630 1828 1828 Surry MULTIPOLYGON (((-80.45634 3... #> 4 0.070 2.968 1831 1831 Currituck MULTIPOLYGON (((-76.00897 3... #> 5 0.153 2.206 1832 1832 Northampton MULTIPOLYGON (((-77.21767 3... #> 6 0.097 1.670 1833 1833 Hertford MULTIPOLYGON (((-76.74506 3... #> 7 0.062 1.547 1834 1834 Camden MULTIPOLYGON (((-76.00897 3... #> 8 0.091 1.284 1835 1835 Gates MULTIPOLYGON (((-76.56251 3... #> 9 0.118 1.421 1836 1836 Warren MULTIPOLYGON (((-78.30876 3... #> 10 0.124 1.428 1837 1837 Stokes MULTIPOLYGON (((-80.02567 3... ``` ] --- ## Analysis of spatial data in R .pull-left[ <br/> - Package `raster` contains classes and tools for handling spatial raster data. <br/><br/> - Package `sf` combines the functionality of `sp`, `rgdal`, and `rgeos` into a single package based on tidy simple features. ] .pull-right[ ![](images/vector_raster_comparison.png) ] <br/> Whether or not you use vector or raster data depends on the type of problem and the data source. Our focus will be on vector data and package `sf`. *Source:* https://commons.wikimedia.org/wiki/File:Raster_vector_tikz.png --- class: inverse, center, middle # Data processing --- ## Align the CRS ```r st_crs(nc_counties)[1] ``` ``` #> $input #> [1] "NAD27" ``` ```r st_crs(nc_pmtw)[1] ``` ``` #> $input #> [1] "WGS 84 / Pseudo-Mercator" ``` ```r st_crs(nc_schools)[1] ``` ``` #> $input #> [1] "NAD83 / North Carolina" ``` -- <br/> Let's put everything on `NAD83` - [more on these datums](https://gisgeography.com/geodetic-datums-nad27-nad83-wgs84/) --- ## Align the CRS (continued) ```r nc_counties <- st_transform(nc_counties, crs = st_crs(nc_schools)) nc_pmtw <- st_transform(nc_pmtw, crs = st_crs(nc_schools)) ``` -- ```r st_crs(nc_counties)[1] ``` ``` #> $input #> [1] "NAD83 / North Carolina" ``` ```r st_crs(nc_pmtw)[1] ``` ``` #> $input #> [1] "NAD83 / North Carolina" ``` ```r st_crs(nc_schools)[1] ``` ``` #> $input #> [1] "NAD83 / North Carolina" ``` --- class: inverse, center, middle # Geometric confirmation --- ## Touches ```r durham <- nc_counties %>% filter(str_detect(name, "^Durham")) ``` -- ```r st_touches(x = durham, y = nc_counties) ``` ``` #> Sparse geometry binary predicate list of length 1, #> where the predicate was `touches' #> 1: 13, 14, 29, 37, 48 ``` -- Setting the argument `sparse = FALSE` will return a logical matrix. Element `i, j` is `TRUE` when geometry feature `i` and `j` touch. --- ## Touches What are the counties that `touch` Durham? ```r nc_counties %>% filter(st_touches(durham, nc_counties, sparse = FALSE)) ``` ``` #> Simple feature collection with 5 features and 2 fields #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: 559249.4 ymin: 195329 xmax: 676988.9 ymax: 310237 #> projected CRS: NAD83 / North Carolina #> name area geometry #> 1 Granville 0.143 MULTIPOLYGON (((632225.8 25... #> 2 Person 0.109 MULTIPOLYGON (((626993.2 27... #> 3 Orange 0.104 MULTIPOLYGON (((607985.9 23... #> 4 Wake 0.219 MULTIPOLYGON (((616777.2 20... #> 5 Chatham 0.180 MULTIPOLYGON (((559249.4 19... ``` -- Where is Durham? -- Specifically, `st_touches()` checks if `x` and `y` share a common point but their interiors do not intersect. --- ## Contains Recall that `nc_schools` has a point geometry. Let's identify the Durham public schools. First, let's check if there are any schools in Durham. ```r durham %>% st_contains(nc_schools, sparse = FALSE) %>% apply(., MARGIN = 1, FUN = sum) ``` ``` #> [1] 66 ``` --- ## Contains .small.pull-left[ ```r durham %>% st_contains(nc_schools, sparse = FALSE) %>% filter(.data = nc_schools, .) %>% ggplot() + geom_sf(color = "red", size = 2, alpha = .5) + geom_sf(data = durham, color = "purple", alpha = 0, size = 2) + theme_bw(base_size = 14) ``` ] .pull-right[ <img src="lec_12_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] --- ## Exercise Use `st_intersects()` and `st_contains()` to plot all the public schools in Durham and the counties that neighbor Durham. <br/> <img src="lec_12_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- ## Summary: geo-predicates - More geometric predicates on pairs of `sf` objects can be found by looking at the help for any one of them. - The vignette for the `sf` package also provides some convenient visual summaries for these geometric predicate functions. - Don't assume by the name alone that you know what the function does. There are subtle differences, e.g. `st_covers()` vs. `st_overlaps()`. --- class: inverse, center, middle # Geometric operations --- ## Centroids Identify the geometric center of the geometry. ```r ggplot(nc_counties) + geom_sf() + geom_sf(data = st_centroid(nc_counties), color = "purple", size = 2) + theme_bw(base_size = 14) ``` <img src="lec_12_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- ## Convex hull Create a convex hull around the Durham high schools. Find Durham high schools: ```r durham_hs <- durham %>% st_contains(nc_schools, sparse = FALSE) %>% filter(.data = nc_schools, .) %>% filter(high == "yes") ``` -- Create a geometry for the convex hull: ```r hull <- durham_hs %>% st_union() %>% st_convex_hull ``` --- ## Convex hull .pull-left[ ```r ggplot(durham) + geom_sf(alpha = 0) + geom_sf(data = hull, color = "red") + geom_sf(data = durham_hs, color = "darkgreen", size = 2) + theme_bw(base_size = 14) ``` ] .pull-right[ <img src="lec_12_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] --- ## Exercise Use `st_buffer()` to place a 500 meter buffer around the trout waters in Ashe county North Carolina. <img src="lec_12_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> --- ## Summary: geo-operations - More geometric predicates on pairs of `sf` objects can be found by looking at the help for any one of them, e.g. `?st_boundary()`. - The vignette for the `sf` package also provides some convenient visual summaries for these geometric predicate functions. --- class: inverse, center, middle # Geometric measurements --- ## Length Compute the length of a 2D geometry. ```r st_length(nc_pmtw)[1:10] ``` ``` #> Units: [m] #> [1] 30.16081 14.14364 4014.76214 2350.58316 4654.37640 9506.25856 #> [7] 24942.73248 60.82258 13222.55721 24454.80161 ``` -- Our `nc_pmtw` had a length variable, let's check if they match. ```r nc_pmtw$length[1:10] ``` ``` #> [1] 37.44214 17.63478 4996.62215 2925.42681 5796.97361 11832.80079 #> [7] 31060.31155 75.62282 16460.30354 30455.43357 ``` -- What's going on here? --- ## Area ```r st_area(durham) ``` ``` #> 770501348 [m^2] ``` Look at the area variable for Durham in our `sf` object. ```r durham$area ``` ``` #> [1] 0.077 ``` Same issue as before. --- ## Distance Compute the distance between the two furthest schools in NC. ```r school_distances <- st_distance(x = nc_schools, nc_schools) ``` ```r school_distances[1:5, 1:5] ``` ``` #> Units: [m] #> [,1] [,2] [,3] [,4] [,5] #> [1,] 0.000 36352.19 6881.309 23836.74 36645.07 #> [2,] 36352.192 0.00 34752.409 18857.48 59691.60 #> [3,] 6881.309 34752.41 0.000 25817.11 31551.79 #> [4,] 23836.736 18857.48 25817.108 0.00 56549.32 #> [5,] 36645.074 59691.60 31551.794 56549.32 0.00 ``` -- By default, `st_distance()` computes all the pairwise distances between features in `sf` objects. You get a matrix as a result. -- How can we find the pair of schools with the largest distance between them? ```r max(school_distances) ``` ``` #> 789774.1 [m] ``` --- ## Distance ```r nc_schools %>% filter(apply(max(school_distances) == school_distances, 1, any)) %>% ggplot() + geom_sf(color = "red", size = 3) + geom_sf(data = nc_counties, alpha = 0) ``` <img src="lec_12_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Geocoding --- ## Geocoding in R Geocoding converts addresses into geographic coordinates to be placed on a map. ```r library(tidygeocoder) ``` Tidygeocoder makes getting data from geocoder services easy. A unified interface is provided for the supported geocoder services listed below. All results are returned in tibble format. -- Available geocoder options: .small-text[ | Service | Geography | Batch Geocoding | API Key Required | Query Rate Limit | |-----------------|---------------|-----------------|------------------|-------------------------| | US Census | US | Yes | No | N/A | | Nominatim (OSM) | Worldwide | No | No | 1/second | | Geocodio | US and Canada | Yes | Yes | 1000/minute (free tier) | | Location IQ | Worldwide | No | Yes | 2/second (free tier) | | Google | Worldwide | No | Yes | 50/second | | OpenCage | Worldwide | No | Yes | 1/second (free tier) | ] --- ## Example: geocode the schools ```r nc_schools %>% st_drop_geometry() %>% select(phys_addr, phys_city, phys_zip) %>% slice(1:8) %>% geocode(street = phys_addr, city = phys_city, postalcode = phys_zip, method = "census") ``` ``` #> # A tibble: 8 x 5 #> phys_addr phys_city phys_zip lat long #> <chr> <chr> <int> <dbl> <dbl> #> 1 820 North Bridge St Washington 27899 35.6 -77.1 #> 2 693 North 7th Street Aurora 27806 35.3 -76.8 #> 3 606 Gray Road Chocowinity 27817 35.5 -77.1 #> 4 110 S King Street Bath 27808 35.5 -76.8 #> 5 7653 NC 11 South Ayden 28513 35.4 -77.4 #> 6 192 Third St Ayden 28513 35.5 -77.4 #> 7 2657 Church Street Winterville 28590 35.5 -77.4 #> 8 1815 W Berkley Rd Greenville 27858 35.6 -77.4 ``` --- ## Exercise Check the accuracy of the geocoding where `method = census`. Only use a sample of the schools. You can gauge the accuracy visually or by computing distances with `st_distance()`. Don't forget to align the CRS! --- ## References 1. Interactive Viewing of Spatial Data in R. (2021). https://r-spatial.github.io/mapview/index.html. 2. "Leaflet For R - Introduction". Rstudio.Github.Io, 2021, https://rstudio.github.io/leaflet/. 3. Melanie Frazier. Coordinate Reference Systems in R. https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/OverviewCoordinateReferenceSystems.pdf. 4. Simple Features for R. (2021). https://r-spatial.github.io/sf/.