08 - Tidy data + data wrangling

September 26, 2017

Getting started

Recap

Any questions from last time?
Any questions from the reading?

Midterm

Assigned on Thursday Sep 28 at 12:30pm, due Tuesday Oct 3 at 10:05am
Ask questions via direct message to Kyle and/or I, do not post on other public channels
Individual assessment:

This exam is open book, open internet, closed other people. You may use any online or book based resource you would like, but you should include citations for any code that you use (directly or indirectly). You may not consult with anyone else for this exam other than the professor and the TA. You cannot ask direct questions on the internet, or consult with each other, not even for hypothetical questions.

Duke Community Standard:

Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and non-academic endeavors, and to protect and promote a culture of integrity. Cheating on exams or plagiarism on homework assignments, lying about an illness or absence and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the Duke Community Standard, and will not be tolerated. Such incidences will result in a 0 grade for all parties involved. Additionally, there may be penalties to your final class grade along with being reported to the Undergraduate Conduct Board.

Tidy data

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Is each of the following a dataset or a summary table?

## # A tibble: 1,314 x 3
##    outcome smoker   age
##     <fctr> <fctr> <int>
##  1   Alive    Yes    23
##  2   Alive    Yes    18
##  3    Dead    Yes    71
##  4   Alive     No    67
##  5   Alive     No    64
##  6   Alive    Yes    38
##  7   Alive    Yes    45
##  8    Dead     No    76
##  9   Alive     No    28
## 10   Alive     No    27
## # ... with 1,304 more rows

## # A tibble: 133 x 4
##      age smoker Alive  Dead
##  * <int> <fctr> <int> <int>
##  1    18     No    10     1
##  2    18    Yes    10    NA
##  3    19     No     8    NA
##  4    19    Yes     6    NA
##  5    20     No    11    NA
##  6    20    Yes     7     1
##  7    21     No     8    NA
##  8    21    Yes     7     1
##  9    22     No     8    NA
## 10    22    Yes     4    NA
## # ... with 123 more rows

Pipes

magrittr

You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.

Expressed as a set of nested functions in R pseudocode this would look like:

park(drive(start_car(find("keys")), to="campus"))

Writing it out using pipes give it a more natural (and easier to read) structure:

find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()

What about other arguments

To send results to a function argument other than first one or to use the previous result for multiple arguments, use .:

acs12 %>%
  filter(employment == "employed") %>%
  ggplot(mapping = aes(x = hrs_work, y = income)) +
    geom_point()

acs12 %>%
  filter(employment == "employed") %>%
  lm(income ~ hrs_work, data = .)

## 
## Call:
## lm(formula = income ~ hrs_work, data = .)
## 
## Coefficients:
## (Intercept)     hrs_work  
##      -13227         1472

Data wrangling

NC DOT Fatal Crashes in North Carolina

From https://opendurham.nc.gov

bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", 
                  na = c("NA", "", "."))

## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   FID = col_integer(),
##   OBJECTID = col_integer(),
##   Bike_Age = col_integer(),
##   Crash_Date = col_date(format = ""),
##   Crash_Hour = col_integer(),
##   Crash_Time = col_datetime(format = ""),
##   Crash_Ty_1 = col_integer(),
##   Crash_Year = col_integer(),
##   Drvr_Age = col_integer(),
##   Num_Units = col_integer()
## )

## See spec(...) for full column specifications.

Variables

View the names of variables via

names(bike)

##  [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
##  [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
## [11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
## [16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
## [31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
## [36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
## [41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
## [46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
## [51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"

and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html.

Viewing your data

In the Environment, click on the name of the data frame to view it in the data viewer
Use the str function to compactly display the internal structure of an R object

str(bike)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5716 obs. of  54 variables:
##  $ FID       : int  18 29 33 35 49 53 56 60 63 66 ...
##  $ OBJECTID  : int  19 30 34 36 50 54 57 61 64 67 ...
##  $ AmbulanceR: chr  "No" "Yes" "No" "Yes" ...
##  $ BikeAge_Gr: chr  NA "50-59" NA "16-19" ...
##  $ Bike_Age  : int  6 51 10 17 6 52 18 40 6 7 ...
##  $ Bike_Alc_D: chr  "No" "No" "No" "No" ...
##  $ Bike_Dir  : chr  "Not Applicable" "With Traffic" "With Traffic" NA ...
##  $ Bike_Injur: chr  "C: Possible Injury" "C: Possible Injury" "Injury" "B: Evident Injury" ...
##  $ Bike_Pos  : chr  "Driveway / Alley" "Travel Lane" "Travel Lane" "Travel Lane" ...
##  $ Bike_Race : chr  "Black" "Black" "Black" "White" ...
##  $ Bike_Sex  : chr  "Female" "Male" "Male" "Male" ...
##  $ City      : chr  "Durham" "Greenville" "Farmville" "Charlotte" ...
##  $ County    : chr  "Durham" "Pitt" "Pitt" "Mecklenburg" ...
##  $ CrashAlcoh: chr  "No" "No" "No" "No" ...
##  $ CrashDay  : chr  "01-01-06" "01-01-02" "01-01-07" "01-01-05" ...
##  $ Crash_Date: Date, format: "2007-01-06" "2007-01-09" ...
##  $ Crash_Grp : chr  "Bicyclist Failed to Yield - Midblock" "Crossing Paths - Other Circumstances" "Bicyclist Failed to Yield - Sign-Controlled Intersection" "Loss of Control / Turning Error" ...
##  $ Crash_Hour: int  13 23 16 19 12 20 19 14 16 0 ...
##  $ Crash_Loc : chr  "Non-Intersection" "Intersection-Related" "Intersection" "Intersection" ...
##  $ Crash_Mont: chr  NA NA NA NA ...
##  $ Crash_Time: POSIXct, format: "0001-01-01 13:17:58" "0001-01-01 23:08:58" ...
##  $ Crash_Type: chr  "Bicyclist Ride Out - Residential Driveway" "Crossing Paths - Intersection - Other /" "Bicyclist Ride Through - Sign-Controlled Intersection" "Motorist Lost Control - Other /" ...
##  $ Crash_Ty_1: int  353311 211180 111144 119139 112114 311231 119144 132180 112142 460910 ...
##  $ Crash_Year: int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Crsh_Sevri: chr  "C: Possible Injury" "C: Possible Injury" "O: No Injury" "B: Evident Injury" ...
##  $ Developmen: chr  "Residential" "Commercial" "Residential" "Residential" ...
##  $ DrvrAge_Gr: chr  "60-69" "30-39" "50-59" "30-39" ...
##  $ Drvr_Age  : int  66 34 52 33 NA 20 40 NA 17 51 ...
##  $ Drvr_Alc_D: chr  "No" "No" "No" "No" ...
##  $ Drvr_EstSp: chr  "11-15 mph" "0-5 mph" "21-25 mph" "46-50 mph" ...
##  $ Drvr_Injur: chr  "O: No Injury" "O: No Injury" "O: No Injury" "O: No Injury" ...
##  $ Drvr_Race : chr  "Black" "Black" "White" "White" ...
##  $ Drvr_Sex  : chr  "Male" "Male" "Female" "Female" ...
##  $ Drvr_VehTy: chr  "Pickup" "Passenger Car" "Passenger Car" "Sport Utility" ...
##  $ ExcsSpdInd: chr  "No" "No" "No" "No" ...
##  $ Hit_Run   : chr  "No" "No" "No" "No" ...
##  $ Light_Cond: chr  "Daylight" "Dark - Lighted Roadway" "Daylight" "Dark - Roadway Not Lighted" ...
##  $ Locality  : chr  "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" ...
##  $ Num_Lanes : chr  "2 lanes" "5 lanes" "2 lanes" "4 lanes" ...
##  $ Num_Units : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ Rd_Charact: chr  "Straight - Level" "Straight - Level" "Straight - Level" "Straight - Level" ...
##  $ Rd_Class  : chr  "Local Street" "Local Street" "Local Street" "NC Route" ...
##  $ Rd_Conditi: chr  "Dry" "Dry" "Dry" "Dry" ...
##  $ Rd_Config : chr  "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" ...
##  $ Rd_Defects: chr  "None" "None" "None" "None" ...
##  $ Rd_Feature: chr  "No Special Feature" "Four-Way Intersection" "Four-Way Intersection" "Four-Way Intersection" ...
##  $ Rd_Surface: chr  "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" ...
##  $ Region    : chr  "Piedmont" "Coastal" "Coastal" "Piedmont" ...
##  $ Rural_Urba: chr  "Urban" "Urban" "Rural" "Urban" ...
##  $ Speed_Limi: chr  "20 - 25  MPH" "40 - 45  MPH" "30 - 35  MPH" "40 - 45  MPH" ...
##  $ Traff_Cntr: chr  "No Control Present" "Stop And Go Signal" "Stop Sign" "Stop And Go Signal" ...
##  $ Weather   : chr  "Clear" "Clear" "Clear" "Cloudy" ...
##  $ Workzone_I: chr  "No" "No" "No" "No" ...
##  $ Location  : chr  "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" "35.076767, -80.7728" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 54
##   .. ..$ FID       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ OBJECTID  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ AmbulanceR: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BikeAge_Gr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Age  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Bike_Alc_D: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Dir  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Injur: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Pos  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Race : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Sex  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ City      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ County    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CrashAlcoh: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CrashDay  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Date:List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
##   .. ..$ Crash_Grp : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Hour: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Crash_Loc : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Mont: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Time:List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_datetime" "collector"
##   .. ..$ Crash_Type: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Ty_1: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Crash_Year: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Crsh_Sevri: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Developmen: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ DrvrAge_Gr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Age  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Drvr_Alc_D: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_EstSp: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Injur: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Race : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Sex  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_VehTy: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ ExcsSpdInd: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Hit_Run   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Light_Cond: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Locality  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Num_Lanes : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Num_Units : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Rd_Charact: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Class  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Conditi: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Config : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Defects: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Feature: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Surface: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Region    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rural_Urba: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Speed_Limi: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Traff_Cntr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Weather   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Workzone_I: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Location  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.

Single data frame functions / verbs:

filter: pick rows matching criteria
slice: pick rows using index(es)
select: pick columns by name
pull: grab a column as a vector
rename: rename specific columns
arrange: reorder rows
mutate: add new variables
transmute: create new data frame with variables
distinct: filter for unique rows
sample_n / sample_frac: randomly sample rows
summarise: reduce variables to values
… (many more)

dplyr rules for functions

First argument is always a data frame
Subsequent arguments say what to do with that data frame
Always\(^*\) return a data frame
Don't modify in place
Performance via lazy evaluation

A note on piping and layering

The %>% operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
The + operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +.

Filter rows with `filter`

Select a subset of rows in a data frame.
Easily filter for many conditions at once.

`filter`

for crashes in Durham County

bike %>%
  filter(County == "Durham")

## # A tibble: 253 x 54
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
##  1    18       19         No       <NA>        6         No Not Applicable
##  2    53       54        Yes      50-59       52         No   With Traffic
##  3    56       57        Yes      16-19       18         No           <NA>
##  4   209      210         No      16-19       16         No Facing Traffic
##  5   228      229        Yes      40-49       40         No   With Traffic
##  6   620      621        Yes      50-59       55         No   With Traffic
##  7   667      668        Yes      60-69       61         No Not Applicable
##  8   458      459        Yes      60-69       62         No   With Traffic
##  9   576      577         No      40-49       49         No   With Traffic
## 10   618      619         No      20-24       23         No   With Traffic
## # ... with 243 more rows, and 47 more variables: Bike_Injur <chr>,
## #   Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>,
## #   County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>,
## #   Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

`filter`

for crashes in Durham County where biker was < 10 yrs old

bike %>%
  filter(County == "Durham", Bike_Age < 10)

## # A tibble: 20 x 54
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
##  1    18       19         No       <NA>        6         No Not Applicable
##  2    47       48         No     10-Jun        9         No Not Applicable
##  3   124      125        Yes     10-Jun        8         No   With Traffic
##  4   531      532        Yes     10-Jun        7         No   With Traffic
##  5   704      705        Yes     10-Jun        9         No Not Applicable
##  6    42       43         No     10-Jun        8         No   With Traffic
##  7   392      393        Yes        0-5        2         No Not Applicable
##  8   941      942         No     10-Jun        9         No   With Traffic
##  9   436      437        Yes     10-Jun        6         No Not Applicable
## 10   160      161        Yes     10-Jun        7         No   With Traffic
## 11   273      274        Yes     10-Jun        7         No Facing Traffic
## 12    78       79        Yes     10-Jun        7         No   With Traffic
## 13   422      423         No     10-Jun        9         No Not Applicable
## 14   570      571         No       <NA>        0    Missing Not Applicable
## 15   683      684        Yes     10-Jun        8         No Not Applicable
## 16    62       63        Yes     10-Jun        7         No   With Traffic
## 17   248      249         No        0-5        4         No Not Applicable
## 18   306      307        Yes     10-Jun        8         No   With Traffic
## 19   231      232        Yes     10-Jun        8         No   With Traffic
## 20   361      362        Yes     10-Jun        9         No   With Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

Commonly used logical operators in R

operator	definition
`<`	less than
`<=`	less than or equal to
`>`	greater than
`>=`	greater than or equal to
`==`	exactly equal to
`!=`	not equal to
`x \| y`	`x` OR `y`
`x & y`	`x` AND `y`

Commonly used logical operators in R

operator	definition
`is.na(x)`	test if `x` is `NA`
`!is.na(x)`	test if `x` is not `NA`
`x %in% y`	test if `x` is in `y`
`!(x %in% y)`	test if `x` is not in `y`
`!x`	not `x`

Aside: real data is messy!

What in the world does a BikeAge_gr of 10-Jun or 15-Nov mean?

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n())

## # A tibble: 13 x 2
##    BikeAge_Gr crash_count
##         <chr>       <int>
##  1        0-5          60
##  2     10-Jun         421
##  3     15-Nov         747
##  4      16-19         605
##  5      20-24         680
##  6      25-29         430
##  7      30-39         658
##  8      40-49         920
##  9      50-59         739
## 10      60-69         274
## 11         70          12
## 12        70+          58
## 13       <NA>         112

Careful data scientists clean up their data first!

We're going to need to do some text parsing to clean up these data
- 10-Jun should be 6-10
- 15-Nov should be 11-15
New R package: stringr

Install and load: `stringr`

Install:

install.packages("stringr") # only have to do this once, in your Console

Load:

library(stringr)

Package reference: Most R packages come with a vignette that describe in detail what each function does and how to use them, they're incredibly useful resources (in addition to other worked out examples on the web) http://stringr.tidyverse.org/articles/stringr.html

Replace with `str_replace` and add new variables with `mutate`

Remember we want to do the following in the BikeAge_Gr variable
- 10-Jun should be 6-10
- 15-Nov should be 11-15

bike <- bike %>%
  mutate(
    BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10"),
    BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15")
    )

Note that we're overwriting existing data and columns, so be careful!
- But remember, it's easy to revert if you make a mistake since we didn't touch the raw data, we can always reload it and start over

Check before you move on

Always check your changes and confirm code did what you wanted it to do

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(count = n())

## # A tibble: 13 x 2
##    BikeAge_Gr count
##         <chr> <int>
##  1        0-5    60
##  2      11-15   747
##  3      16-19   605
##  4      20-24   680
##  5      25-29   430
##  6      30-39   658
##  7      40-49   920
##  8      50-59   739
##  9       6-10   421
## 10      60-69   274
## 11         70    12
## 12        70+    58
## 13       <NA>   112

`mutate` to add new variables

How is the new alcohol variable determined?

bike %>%
  mutate(alcohol = case_when(
    Bike_Alc_D == "No" & Drvr_Alc_D == "No"      ~ "No",
    Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes"    ~ "Yes",
    Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing",
    Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing"
  ))

"Save" when you `mutate`

Most often when you define a new variable with mutate you'll also want to save the resulting data frame, often by writing over the original data frame.

bike <- bike %>%
  mutate(alcohol = case_when(
    Bike_Alc_D == "No" & Drvr_Alc_D == "No"      ~ "No",
    Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes"    ~ "Yes",
    Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing",
    Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing"
  ))

`transmute` to create a new dataset

You'll use this much less often than mutate but when you need it, you need it.

bike %>% 
  transmute(ID = paste(FID, OBJECTID, sep = "-"))

## # A tibble: 5,716 x 1
##       ID
##    <chr>
##  1 18-19
##  2 29-30
##  3 33-34
##  4 35-36
##  5 49-50
##  6 53-54
##  7 56-57
##  8 60-61
##  9 63-64
## 10 66-67
## # ... with 5,706 more rows

`slice` for certain row numbers

First five

bike %>%
  slice(1:5)

## # A tibble: 5 x 54
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##   <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
## 1    18       19         No       <NA>        6         No Not Applicable
## 2    29       30        Yes      50-59       51         No   With Traffic
## 3    33       34         No       <NA>       10         No   With Traffic
## 4    35       36        Yes      16-19       17         No           <NA>
## 5    49       50         No       <NA>        6         No Facing Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

`slice` for certain row numbers

Last five

last_row <- nrow(bike)
bike %>%
  slice((last_row-4):last_row)

## # A tibble: 5 x 54
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##   <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
## 1   460      461        Yes       6-10        7         No Not Applicable
## 2   474      475        Yes      50-59       50         No   With Traffic
## 3   479      480        Yes      16-19       16         No Not Applicable
## 4   487      488         No      40-49       47        Yes   With Traffic
## 5   488      489        Yes      30-39       35         No Facing Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

`select` to keep only the variables you mention

bike %>%
  select(Crash_Loc, Hit_Run) %>%
  table()

##                       Hit_Run
## Crash_Loc                No  Yes
##   Intersection         2223  275
##   Intersection-Related  252   42
##   Location                3    7
##   Non-Intersection     2213  462
##   Non-Roadway           205   30

or `select` to exclude variables

bike %>%
  select(-OBJECTID)

## # A tibble: 5,716 x 53
##      FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    <int>      <chr>      <chr>    <int>      <chr>          <chr>
##  1    18         No       <NA>        6         No Not Applicable
##  2    29        Yes      50-59       51         No   With Traffic
##  3    33         No       <NA>       10         No   With Traffic
##  4    35        Yes      16-19       17         No           <NA>
##  5    49         No       <NA>        6         No Facing Traffic
##  6    53        Yes      50-59       52         No   With Traffic
##  7    56        Yes      16-19       18         No           <NA>
##  8    60         No      40-49       40         No Facing Traffic
##  9    63        Yes       6-10        6         No Facing Traffic
## 10    66        Yes       6-10        7         No           <NA>
## # ... with 5,706 more rows, and 47 more variables: Bike_Injur <chr>,
## #   Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>,
## #   County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>,
## #   Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

or `select` a range of variables

bike %>%
  select(OBJECTID:Bike_Injur)

## # A tibble: 5,716 x 7
##    OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##       <int>      <chr>      <chr>    <int>      <chr>          <chr>
##  1       19         No       <NA>        6         No Not Applicable
##  2       30        Yes      50-59       51         No   With Traffic
##  3       34         No       <NA>       10         No   With Traffic
##  4       36        Yes      16-19       17         No           <NA>
##  5       50         No       <NA>        6         No Facing Traffic
##  6       54        Yes      50-59       52         No   With Traffic
##  7       57        Yes      16-19       18         No           <NA>
##  8       61         No      40-49       40         No Facing Traffic
##  9       64        Yes       6-10        6         No Facing Traffic
## 10       67        Yes       6-10        7         No           <NA>
## # ... with 5,706 more rows, and 1 more variables: Bike_Injur <chr>

`pull` to extract a column as a vector

bike %>%
  pull(Location) %>%
  head()

## [1] "36.002743, -78.8785"  "35.612984, -77.39265" "35.595676, -77.59074"
## [4] "35.076767, -80.7728"  "35.19999, -80.75713"  "35.966644, -78.96749"

bike %>%
  select(Location)

## # A tibble: 5,716 x 1
##                Location
##                   <chr>
##  1  36.002743, -78.8785
##  2 35.612984, -77.39265
##  3 35.595676, -77.59074
##  4  35.076767, -80.7728
##  5  35.19999, -80.75713
##  6 35.966644, -78.96749
##  7 35.996304, -78.88007
##  8 35.09925, -78.877464
##  9  35.74463, -81.69225
## 10 35.184418, -78.50205
## # ... with 5,706 more rows

The two `pull`s in your lives

Don't get pull happy when wrangling data! Only extract out variables if you truly need to, otherwise keep in data frame.
But always ⬇️ Pull before starting your work when collaborating on GitHub.

`rename` specific columns

Useful for correcting typos, and renaming to make variable names shorter and/or more informative

Original names:

names(bike)

##  [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
##  [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
## [11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
## [16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
## [31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
## [36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
## [41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
## [46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
## [51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"

Rename Speed_Limi to Speed_Limit:

bike <- bike %>%
  rename(Speed_Limit = Speed_Limi)

Check before you move on

Always check your changes and confirm code did what you wanted it to do

names(bike)

##  [1] "FID"         "OBJECTID"    "AmbulanceR"  "BikeAge_Gr"  "Bike_Age"   
##  [6] "Bike_Alc_D"  "Bike_Dir"    "Bike_Injur"  "Bike_Pos"    "Bike_Race"  
## [11] "Bike_Sex"    "City"        "County"      "CrashAlcoh"  "CrashDay"   
## [16] "Crash_Date"  "Crash_Grp"   "Crash_Hour"  "Crash_Loc"   "Crash_Mont" 
## [21] "Crash_Time"  "Crash_Type"  "Crash_Ty_1"  "Crash_Year"  "Crsh_Sevri" 
## [26] "Developmen"  "DrvrAge_Gr"  "Drvr_Age"    "Drvr_Alc_D"  "Drvr_EstSp" 
## [31] "Drvr_Injur"  "Drvr_Race"   "Drvr_Sex"    "Drvr_VehTy"  "ExcsSpdInd" 
## [36] "Hit_Run"     "Light_Cond"  "Locality"    "Num_Lanes"   "Num_Units"  
## [41] "Rd_Charact"  "Rd_Class"    "Rd_Conditi"  "Rd_Config"   "Rd_Defects" 
## [46] "Rd_Feature"  "Rd_Surface"  "Region"      "Rural_Urba"  "Speed_Limit"
## [51] "Traff_Cntr"  "Weather"     "Workzone_I"  "Location"

`summarise` to reduce variables to values

The values are summarised in a data frame

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n())

## # A tibble: 13 x 2
##    BikeAge_Gr crash_count
##         <chr>       <int>
##  1        0-5          60
##  2      11-15         747
##  3      16-19         605
##  4      20-24         680
##  5      25-29         430
##  6      30-39         658
##  7      40-49         920
##  8      50-59         739
##  9       6-10         421
## 10      60-69         274
## 11         70          12
## 12        70+          58
## 13       <NA>         112

and `arrange` to order rows

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n()) %>%
  arrange(desc(crash_count))

## # A tibble: 13 x 2
##    BikeAge_Gr crash_count
##         <chr>       <int>
##  1      40-49         920
##  2      11-15         747
##  3      50-59         739
##  4      20-24         680
##  5      30-39         658
##  6      16-19         605
##  7      25-29         430
##  8       6-10         421
##  9      60-69         274
## 10       <NA>         112
## 11        0-5          60
## 12        70+          58
## 13         70          12

`count` to group by then count

bike %>%
  count(BikeAge_Gr)

## # A tibble: 13 x 2
##    BikeAge_Gr     n
##         <chr> <int>
##  1        0-5    60
##  2      11-15   747
##  3      16-19   605
##  4      20-24   680
##  5      25-29   430
##  6      30-39   658
##  7      40-49   920
##  8      50-59   739
##  9       6-10   421
## 10      60-69   274
## 11         70    12
## 12        70+    58
## 13       <NA>   112

If you wanted to arrange these in ascending order what would you add to the pipe?

Select rows with `sample_n` or `sample_frac`

sample_n: randomly sample 5 observations

bike_n5 <- bike %>%
  sample_n(5, replace = FALSE)
dim(bike_n5)

## [1]  5 54

sample_frac: randomly sample 20% of observations

bike_perc20 <-bike %>%
  sample_frac(0.2, replace = FALSE)
dim(bike_perc20)

## [1] 1143   54

`distinct` to filter for unique rows

bike %>% 
  select(County, City) %>% 
  distinct() %>% 
  arrange(County, City)

## # A tibble: 360 x 2
##       County               City
##        <chr>              <chr>
##  1  Alamance           Alamance
##  2  Alamance         Burlington
##  3  Alamance       Elon College
##  4  Alamance        Gibsonville
##  5  Alamance             Graham
##  6  Alamance        Green Level
##  7  Alamance             Mebane
##  8  Alamance None - Rural Crash
##  9 Alexander None - Rural Crash
## 10 Alleghany None - Rural Crash
## # ... with 350 more rows

Getting started

Recap

Midterm

Tidy data

Tidy data

Pipes

magrittr

What about other arguments

Data wrangling

NC DOT Fatal Crashes in North Carolina

Variables

Viewing your data

A Grammar of Data Manipulation

dplyr rules for functions

A note on piping and layering

Filter rows with filter

filter

filter

Commonly used logical operators in R

Commonly used logical operators in R

Aside: real data is messy!

Careful data scientists clean up their data first!

Install and load: stringr

Replace with str_replace and add new variables with mutate

Check before you move on

mutate to add new variables

"Save" when you mutate

transmute to create a new dataset

slice for certain row numbers

slice for certain row numbers

select to keep only the variables you mention

or select to exclude variables

or select a range of variables

pull to extract a column as a vector

The two pulls in your lives

rename specific columns

Check before you move on

summarise to reduce variables to values

and arrange to order rows

count to group by then count

Select rows with sample_n or sample_frac

distinct to filter for unique rows

Filter rows with `filter`

`filter`

`filter`

Install and load: `stringr`

Replace with `str_replace` and add new variables with `mutate`

`mutate` to add new variables

"Save" when you `mutate`

`transmute` to create a new dataset

`slice` for certain row numbers

`slice` for certain row numbers

`select` to keep only the variables you mention

or `select` to exclude variables

or `select` a range of variables

`pull` to extract a column as a vector

The two `pull`s in your lives

`rename` specific columns

`summarise` to reduce variables to values

and `arrange` to order rows

`count` to group by then count

Select rows with `sample_n` or `sample_frac`

`distinct` to filter for unique rows