September 26, 2017
Any questions from last time?
Any questions from the reading?
Assigned on Thursday Sep 28 at 12:30pm, due Tuesday Oct 3 at 10:05am
Ask questions via direct message to Kyle and/or I, do not post on other public channels
Individual assessment:
This exam is open book, open internet, closed other people. You may use any online or book based resource you would like, but you should include citations for any code that you use (directly or indirectly). You may not consult with anyone else for this exam other than the professor and the TA. You cannot ask direct questions on the internet, or consult with each other, not even for hypothetical questions.
Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and non-academic endeavors, and to protect and promote a culture of integrity. Cheating on exams or plagiarism on homework assignments, lying about an illness or absence and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the Duke Community Standard, and will not be tolerated. Such incidences will result in a 0 grade for all parties involved. Additionally, there may be penalties to your final class grade along with being reported to the Undergraduate Conduct Board.
Is each of the following a dataset or a summary table?
## # A tibble: 1,314 x 3 ## outcome smoker age ## <fctr> <fctr> <int> ## 1 Alive Yes 23 ## 2 Alive Yes 18 ## 3 Dead Yes 71 ## 4 Alive No 67 ## 5 Alive No 64 ## 6 Alive Yes 38 ## 7 Alive Yes 45 ## 8 Dead No 76 ## 9 Alive No 28 ## 10 Alive No 27 ## # ... with 1,304 more rows
## # A tibble: 133 x 4 ## age smoker Alive Dead ## * <int> <fctr> <int> <int> ## 1 18 No 10 1 ## 2 18 Yes 10 NA ## 3 19 No 8 NA ## 4 19 Yes 6 NA ## 5 20 No 11 NA ## 6 20 Yes 7 1 ## 7 21 No 8 NA ## 8 21 Yes 7 1 ## 9 22 No 8 NA ## 10 22 Yes 4 NA ## # ... with 123 more rows
You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.
Expressed as a set of nested functions in R pseudocode this would look like:
park(drive(start_car(find("keys")), to="campus"))
Writing it out using pipes give it a more natural (and easier to read) structure:
find("keys") %>% start_car() %>% drive(to = "campus") %>% park()
To send results to a function argument other than first one or to use the previous result for multiple arguments, use .
:
acs12 %>% filter(employment == "employed") %>% ggplot(mapping = aes(x = hrs_work, y = income)) + geom_point()
acs12 %>% filter(employment == "employed") %>% lm(income ~ hrs_work, data = .)
## ## Call: ## lm(formula = income ~ hrs_work, data = .) ## ## Coefficients: ## (Intercept) hrs_work ## -13227 1472
From https://opendurham.nc.gov
bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", na = c("NA", "", "."))
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification: ## cols( ## .default = col_character(), ## FID = col_integer(), ## OBJECTID = col_integer(), ## Bike_Age = col_integer(), ## Crash_Date = col_date(format = ""), ## Crash_Hour = col_integer(), ## Crash_Time = col_datetime(format = ""), ## Crash_Ty_1 = col_integer(), ## Crash_Year = col_integer(), ## Drvr_Age = col_integer(), ## Num_Units = col_integer() ## )
## See spec(...) for full column specifications.
View the names of variables via
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html.
In the Environment, click on the name of the data frame to view it in the data viewer
Use the str
function to compactly display the internal structure of an R object
str(bike)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5716 obs. of 54 variables: ## $ FID : int 18 29 33 35 49 53 56 60 63 66 ... ## $ OBJECTID : int 19 30 34 36 50 54 57 61 64 67 ... ## $ AmbulanceR: chr "No" "Yes" "No" "Yes" ... ## $ BikeAge_Gr: chr NA "50-59" NA "16-19" ... ## $ Bike_Age : int 6 51 10 17 6 52 18 40 6 7 ... ## $ Bike_Alc_D: chr "No" "No" "No" "No" ... ## $ Bike_Dir : chr "Not Applicable" "With Traffic" "With Traffic" NA ... ## $ Bike_Injur: chr "C: Possible Injury" "C: Possible Injury" "Injury" "B: Evident Injury" ... ## $ Bike_Pos : chr "Driveway / Alley" "Travel Lane" "Travel Lane" "Travel Lane" ... ## $ Bike_Race : chr "Black" "Black" "Black" "White" ... ## $ Bike_Sex : chr "Female" "Male" "Male" "Male" ... ## $ City : chr "Durham" "Greenville" "Farmville" "Charlotte" ... ## $ County : chr "Durham" "Pitt" "Pitt" "Mecklenburg" ... ## $ CrashAlcoh: chr "No" "No" "No" "No" ... ## $ CrashDay : chr "01-01-06" "01-01-02" "01-01-07" "01-01-05" ... ## $ Crash_Date: Date, format: "2007-01-06" "2007-01-09" ... ## $ Crash_Grp : chr "Bicyclist Failed to Yield - Midblock" "Crossing Paths - Other Circumstances" "Bicyclist Failed to Yield - Sign-Controlled Intersection" "Loss of Control / Turning Error" ... ## $ Crash_Hour: int 13 23 16 19 12 20 19 14 16 0 ... ## $ Crash_Loc : chr "Non-Intersection" "Intersection-Related" "Intersection" "Intersection" ... ## $ Crash_Mont: chr NA NA NA NA ... ## $ Crash_Time: POSIXct, format: "0001-01-01 13:17:58" "0001-01-01 23:08:58" ... ## $ Crash_Type: chr "Bicyclist Ride Out - Residential Driveway" "Crossing Paths - Intersection - Other /" "Bicyclist Ride Through - Sign-Controlled Intersection" "Motorist Lost Control - Other /" ... ## $ Crash_Ty_1: int 353311 211180 111144 119139 112114 311231 119144 132180 112142 460910 ... ## $ Crash_Year: int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ... ## $ Crsh_Sevri: chr "C: Possible Injury" "C: Possible Injury" "O: No Injury" "B: Evident Injury" ... ## $ Developmen: chr "Residential" "Commercial" "Residential" "Residential" ... ## $ DrvrAge_Gr: chr "60-69" "30-39" "50-59" "30-39" ... ## $ Drvr_Age : int 66 34 52 33 NA 20 40 NA 17 51 ... ## $ Drvr_Alc_D: chr "No" "No" "No" "No" ... ## $ Drvr_EstSp: chr "11-15 mph" "0-5 mph" "21-25 mph" "46-50 mph" ... ## $ Drvr_Injur: chr "O: No Injury" "O: No Injury" "O: No Injury" "O: No Injury" ... ## $ Drvr_Race : chr "Black" "Black" "White" "White" ... ## $ Drvr_Sex : chr "Male" "Male" "Female" "Female" ... ## $ Drvr_VehTy: chr "Pickup" "Passenger Car" "Passenger Car" "Sport Utility" ... ## $ ExcsSpdInd: chr "No" "No" "No" "No" ... ## $ Hit_Run : chr "No" "No" "No" "No" ... ## $ Light_Cond: chr "Daylight" "Dark - Lighted Roadway" "Daylight" "Dark - Roadway Not Lighted" ... ## $ Locality : chr "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" ... ## $ Num_Lanes : chr "2 lanes" "5 lanes" "2 lanes" "4 lanes" ... ## $ Num_Units : int 2 2 2 3 2 2 2 2 2 2 ... ## $ Rd_Charact: chr "Straight - Level" "Straight - Level" "Straight - Level" "Straight - Level" ... ## $ Rd_Class : chr "Local Street" "Local Street" "Local Street" "NC Route" ... ## $ Rd_Conditi: chr "Dry" "Dry" "Dry" "Dry" ... ## $ Rd_Config : chr "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" ... ## $ Rd_Defects: chr "None" "None" "None" "None" ... ## $ Rd_Feature: chr "No Special Feature" "Four-Way Intersection" "Four-Way Intersection" "Four-Way Intersection" ... ## $ Rd_Surface: chr "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" ... ## $ Region : chr "Piedmont" "Coastal" "Coastal" "Piedmont" ... ## $ Rural_Urba: chr "Urban" "Urban" "Rural" "Urban" ... ## $ Speed_Limi: chr "20 - 25 MPH" "40 - 45 MPH" "30 - 35 MPH" "40 - 45 MPH" ... ## $ Traff_Cntr: chr "No Control Present" "Stop And Go Signal" "Stop Sign" "Stop And Go Signal" ... ## $ Weather : chr "Clear" "Clear" "Clear" "Cloudy" ... ## $ Workzone_I: chr "No" "No" "No" "No" ... ## $ Location : chr "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" "35.076767, -80.7728" ... ## - attr(*, "spec")=List of 2 ## ..$ cols :List of 54 ## .. ..$ FID : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ OBJECTID : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ AmbulanceR: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ BikeAge_Gr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Age : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Bike_Alc_D: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Dir : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Injur: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Pos : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Race : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Sex : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ City : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ County : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ CrashAlcoh: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ CrashDay : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Date:List of 1 ## .. .. ..$ format: chr "" ## .. .. ..- attr(*, "class")= chr "collector_date" "collector" ## .. ..$ Crash_Grp : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Hour: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crash_Loc : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Mont: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Time:List of 1 ## .. .. ..$ format: chr "" ## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector" ## .. ..$ Crash_Type: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Ty_1: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crash_Year: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crsh_Sevri: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Developmen: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ DrvrAge_Gr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Age : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Drvr_Alc_D: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_EstSp: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Injur: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Race : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Sex : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_VehTy: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ ExcsSpdInd: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Hit_Run : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Light_Cond: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Locality : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Num_Lanes : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Num_Units : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Rd_Charact: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Class : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Conditi: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Config : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Defects: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Feature: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Surface: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Region : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rural_Urba: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Speed_Limi: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Traff_Cntr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Weather : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Workzone_I: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Location : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## ..$ default: list() ## .. ..- attr(*, "class")= chr "collector_guess" "collector" ## ..- attr(*, "class")= chr "col_spec"
dplyr is based on the concepts of functions as verbs that manipulate data frames.
Single data frame functions / verbs:
filter
: pick rows matching criteriaslice
: pick rows using index(es)select
: pick columns by namepull
: grab a column as a vectorrename
: rename specific columnsarrange
: reorder rowsmutate
: add new variablestransmute
: create new data frame with variablesdistinct
: filter for unique rowssample_n
/ sample_frac
: randomly sample rowssummarise
: reduce variables to valuesFirst argument is always a data frame
Subsequent arguments say what to do with that data frame
Always\(^*\) return a data frame
Don't modify in place
Performance via lazy evaluation
The %>%
operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
The +
operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +
.
filter
filter
for crashes in Durham County
bike %>% filter(County == "Durham")
## # A tibble: 253 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 53 54 Yes 50-59 52 No With Traffic ## 3 56 57 Yes 16-19 18 No <NA> ## 4 209 210 No 16-19 16 No Facing Traffic ## 5 228 229 Yes 40-49 40 No With Traffic ## 6 620 621 Yes 50-59 55 No With Traffic ## 7 667 668 Yes 60-69 61 No Not Applicable ## 8 458 459 Yes 60-69 62 No With Traffic ## 9 576 577 No 40-49 49 No With Traffic ## 10 618 619 No 20-24 23 No With Traffic ## # ... with 243 more rows, and 47 more variables: Bike_Injur <chr>, ## # Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>, ## # County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, ## # Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
filter
for crashes in Durham County where biker was < 10 yrs old
bike %>% filter(County == "Durham", Bike_Age < 10)
## # A tibble: 20 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 47 48 No 10-Jun 9 No Not Applicable ## 3 124 125 Yes 10-Jun 8 No With Traffic ## 4 531 532 Yes 10-Jun 7 No With Traffic ## 5 704 705 Yes 10-Jun 9 No Not Applicable ## 6 42 43 No 10-Jun 8 No With Traffic ## 7 392 393 Yes 0-5 2 No Not Applicable ## 8 941 942 No 10-Jun 9 No With Traffic ## 9 436 437 Yes 10-Jun 6 No Not Applicable ## 10 160 161 Yes 10-Jun 7 No With Traffic ## 11 273 274 Yes 10-Jun 7 No Facing Traffic ## 12 78 79 Yes 10-Jun 7 No With Traffic ## 13 422 423 No 10-Jun 9 No Not Applicable ## 14 570 571 No <NA> 0 Missing Not Applicable ## 15 683 684 Yes 10-Jun 8 No Not Applicable ## 16 62 63 Yes 10-Jun 7 No With Traffic ## 17 248 249 No 0-5 4 No Not Applicable ## 18 306 307 Yes 10-Jun 8 No With Traffic ## 19 231 232 Yes 10-Jun 8 No With Traffic ## 20 361 362 Yes 10-Jun 9 No With Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
operator | definition |
---|---|
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
== |
exactly equal to |
!= |
not equal to |
x | y |
x OR y |
x & y |
x AND y |
operator | definition |
---|---|
is.na(x) |
test if x is NA |
!is.na(x) |
test if x is not NA |
x %in% y |
test if x is in y |
!(x %in% y) |
test if x is not in y |
!x |
not x |
What in the world does a BikeAge_gr
of 10-Jun
or 15-Nov
mean?
bike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n())
## # A tibble: 13 x 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 0-5 60 ## 2 10-Jun 421 ## 3 15-Nov 747 ## 4 16-19 605 ## 5 20-24 680 ## 6 25-29 430 ## 7 30-39 658 ## 8 40-49 920 ## 9 50-59 739 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
10-Jun
should be 6-10
15-Nov
should be 11-15
stringr
stringr
install.packages("stringr") # only have to do this once, in your Console
library(stringr)
str_replace
and add new variables with mutate
BikeAge_Gr
variable
10-Jun
should be 6-10
15-Nov
should be 11-15
bike <- bike %>% mutate( BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10"), BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15") )
Always check your changes and confirm code did what you wanted it to do
bike %>% group_by(BikeAge_Gr) %>% summarise(count = n())
## # A tibble: 13 x 2 ## BikeAge_Gr count ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
mutate
to add new variablesHow is the new alcohol
variable determined?
bike %>% mutate(alcohol = case_when( Bike_Alc_D == "No" & Drvr_Alc_D == "No" ~ "No", Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes" ~ "Yes", Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing", Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing" ))
mutate
Most often when you define a new variable with mutate
you'll also want to save the resulting data frame, often by writing over the original data frame.
bike <- bike %>% mutate(alcohol = case_when( Bike_Alc_D == "No" & Drvr_Alc_D == "No" ~ "No", Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes" ~ "Yes", Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing", Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing" ))
transmute
to create a new datasetYou'll use this much less often than mutate
but when you need it, you need it.
bike %>% transmute(ID = paste(FID, OBJECTID, sep = "-"))
## # A tibble: 5,716 x 1 ## ID ## <chr> ## 1 18-19 ## 2 29-30 ## 3 33-34 ## 4 35-36 ## 5 49-50 ## 6 53-54 ## 7 56-57 ## 8 60-61 ## 9 63-64 ## 10 66-67 ## # ... with 5,706 more rows
slice
for certain row numbersFirst five
bike %>% slice(1:5)
## # A tibble: 5 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 29 30 Yes 50-59 51 No With Traffic ## 3 33 34 No <NA> 10 No With Traffic ## 4 35 36 Yes 16-19 17 No <NA> ## 5 49 50 No <NA> 6 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
slice
for certain row numbersLast five
last_row <- nrow(bike) bike %>% slice((last_row-4):last_row)
## # A tibble: 5 x 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 460 461 Yes 6-10 7 No Not Applicable ## 2 474 475 Yes 50-59 50 No With Traffic ## 3 479 480 Yes 16-19 16 No Not Applicable ## 4 487 488 No 40-49 47 Yes With Traffic ## 5 488 489 Yes 30-39 35 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
select
to keep only the variables you mentionbike %>% select(Crash_Loc, Hit_Run) %>% table()
## Hit_Run ## Crash_Loc No Yes ## Intersection 2223 275 ## Intersection-Related 252 42 ## Location 3 7 ## Non-Intersection 2213 462 ## Non-Roadway 205 30
select
to exclude variablesbike %>% select(-OBJECTID)
## # A tibble: 5,716 x 53 ## FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <chr> <chr> <int> <chr> <chr> ## 1 18 No <NA> 6 No Not Applicable ## 2 29 Yes 50-59 51 No With Traffic ## 3 33 No <NA> 10 No With Traffic ## 4 35 Yes 16-19 17 No <NA> ## 5 49 No <NA> 6 No Facing Traffic ## 6 53 Yes 50-59 52 No With Traffic ## 7 56 Yes 16-19 18 No <NA> ## 8 60 No 40-49 40 No Facing Traffic ## 9 63 Yes 6-10 6 No Facing Traffic ## 10 66 Yes 6-10 7 No <NA> ## # ... with 5,706 more rows, and 47 more variables: Bike_Injur <chr>, ## # Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>, ## # County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, ## # Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
select
a range of variablesbike %>% select(OBJECTID:Bike_Injur)
## # A tibble: 5,716 x 7 ## OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <chr> <chr> <int> <chr> <chr> ## 1 19 No <NA> 6 No Not Applicable ## 2 30 Yes 50-59 51 No With Traffic ## 3 34 No <NA> 10 No With Traffic ## 4 36 Yes 16-19 17 No <NA> ## 5 50 No <NA> 6 No Facing Traffic ## 6 54 Yes 50-59 52 No With Traffic ## 7 57 Yes 16-19 18 No <NA> ## 8 61 No 40-49 40 No Facing Traffic ## 9 64 Yes 6-10 6 No Facing Traffic ## 10 67 Yes 6-10 7 No <NA> ## # ... with 5,706 more rows, and 1 more variables: Bike_Injur <chr>
pull
to extract a column as a vectorbike %>% pull(Location) %>% head()
## [1] "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" ## [4] "35.076767, -80.7728" "35.19999, -80.75713" "35.966644, -78.96749"
bike %>% select(Location)
## # A tibble: 5,716 x 1 ## Location ## <chr> ## 1 36.002743, -78.8785 ## 2 35.612984, -77.39265 ## 3 35.595676, -77.59074 ## 4 35.076767, -80.7728 ## 5 35.19999, -80.75713 ## 6 35.966644, -78.96749 ## 7 35.996304, -78.88007 ## 8 35.09925, -78.877464 ## 9 35.74463, -81.69225 ## 10 35.184418, -78.50205 ## # ... with 5,706 more rows
pull
s in your lives
Don't get pull
happy when wrangling data! Only extract out variables if you truly need to, otherwise keep in data frame.
But always ⬇️ Pull
before starting your work when collaborating on GitHub.
rename
specific columnsUseful for correcting typos, and renaming to make variable names shorter and/or more informative
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
Speed_Limi
to Speed_Limit
:bike <- bike %>% rename(Speed_Limit = Speed_Limi)
Always check your changes and confirm code did what you wanted it to do
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limit" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
summarise
to reduce variables to valuesThe values are summarised in a data frame
bike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n())
## # A tibble: 13 x 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
arrange
to order rowsbike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) %>% arrange(desc(crash_count))
## # A tibble: 13 x 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 40-49 920 ## 2 11-15 747 ## 3 50-59 739 ## 4 20-24 680 ## 5 30-39 658 ## 6 16-19 605 ## 7 25-29 430 ## 8 6-10 421 ## 9 60-69 274 ## 10 <NA> 112 ## 11 0-5 60 ## 12 70+ 58 ## 13 70 12
count
to group by then countbike %>% count(BikeAge_Gr)
## # A tibble: 13 x 2 ## BikeAge_Gr n ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
If you wanted to arrange these in ascending order what would you add to the pipe?
sample_n
or sample_frac
sample_n
: randomly sample 5 observationsbike_n5 <- bike %>% sample_n(5, replace = FALSE) dim(bike_n5)
## [1] 5 54
sample_frac
: randomly sample 20% of observationsbike_perc20 <-bike %>% sample_frac(0.2, replace = FALSE) dim(bike_perc20)
## [1] 1143 54
distinct
to filter for unique rowsbike %>% select(County, City) %>% distinct() %>% arrange(County, City)
## # A tibble: 360 x 2 ## County City ## <chr> <chr> ## 1 Alamance Alamance ## 2 Alamance Burlington ## 3 Alamance Elon College ## 4 Alamance Gibsonville ## 5 Alamance Graham ## 6 Alamance Green Level ## 7 Alamance Mebane ## 8 Alamance None - Rural Crash ## 9 Alexander None - Rural Crash ## 10 Alleghany None - Rural Crash ## # ... with 350 more rows