class: center, middle, inverse, title-slide

# Functions and Automation
## Intro to Data Science
### Shawn Santo
### 02-11-20

---

## Announcements

- Homework 2 due Feb 13
- Exam 1 assigned Feb 14
- Today's application exercise: https://classroom.github.com/a/gstr-lrh

---

class: center, middle, inverse

# Functions

---

## Function components

A function is composed of arguments (formals), a body, and an environment. The first two will be our focus as we use and develop these objects.

Let's look at the help of a few functions to see their arguments.

```r
?typeof
?log
?mean
```

---

## Function calls

Function calls involve the function's name and, at a minimum, values for its required arguments. Arguments can be given values by

1. position

```r
z <- 1:10
mean(z, .3, FALSE)
```

```
#> [1] 5.5
```

--

2. name

```r
mean(x = z, trim = .3, na.rm = FALSE)
```

```
#> [1] 5.5
```

--

3. partial name matching

```r
mean(x = z, na = FALSE, t = .3)
```

```
#> [1] 5.5
```

**Which option do you think is best?**

---

## Call style

A common choice is a combination of 1 and 2.

```r
mean(z, trim = .3)
```

```
#> [1] 5.5
```

Leave the argument's name out for the commonly used arguments, and always specify the argument names for the optional arguments.

---

## Why create functions?

Package `nycflights13` contains five datasets. Let's write code to preview some of them.

```r
library(tidyverse)     # provides %>%, sample_n(), read_csv(), tibble(), ggplot()
library(nycflights13)
```

```r
airlines %>% sample_n(size = 5)
```

```
#> # A tibble: 5 x 2
#>   carrier name
#>   <chr>   <chr>
#> 1 MQ      Envoy Air
#> 2 UA      United Air Lines Inc.
#> 3 US      US Airways Inc.
#> 4 AA      American Airlines Inc.
#> 5 HA      Hawaiian Airlines Inc.
```

---

```r
planes %>% sample_n(size = 5)
```

```
#> # A tibble: 5 x 9
#>   tailnum  year type       manufacturer   model  engines seats speed engine
#>   <chr>   <int> <chr>      <chr>          <chr>    <int> <int> <int> <chr>
#> 1 N825MH   2000 Fixed win… BOEING         767-4…       2   300    NA Turbo…
#> 2 N3756    2001 Fixed win… BOEING         737-8…       2   189    NA Turbo…
#> 3 N625AW   1989 Fixed win… AIRBUS INDUST… A320-…       2   182    NA Turbo…
#> 4 N413UA   1994 Fixed win… AIRBUS INDUST… A320-…       2   200    NA Turbo…
#> 5 N17233   1999 Fixed win… BOEING         737-8…       2   149    NA Turbo…
```

--

```r
flights %>% sample_n(size = 5)
```

```
#> # A tibble: 5 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     7    31     1821           1825        -4     1948
#> 2  2013    11    24     1705           1615        50     1850
#> 3  2013     2    17     1058           1100        -2     1157
#> 4  2013    11    30     1655           1700        -5     2002
#> 5  2013     1    30     2053           2100        -7       14
#> # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function should have a data argument.

**First, name your function. Carefully follow the syntax style.**

```r
preview_data <- function() {

}
```

<br/>

- `preview_data` is our function's name

- `function()` is a keyword in R and will always be used in the type of functions we create

- inside `{` `}` will be our function's body

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function should have a data argument.

**Second, add descriptive argument names.**

```r
preview_data <- function(data) {

}
```

<br/>

- `preview_data` is our function's name

- `function()` is a keyword in R and will always be used in the type of functions we create

- inside `{` `}` will be our function's body

- `data` is our argument's name, all arguments live inside the `(` `)`

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function should have a data argument.

**Third, add the function's body - code that does the work.**

```r
preview_data <- function(data) {
* data %>% sample_n(size = 5)
}
```

<br/>

- `preview_data` is our function's name

- `function()` is a keyword in R and will always be used in the type of functions we create

- inside `{` `}` will be our function's body

- `data` is our argument's name, all arguments live inside the `(` `)`

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function should have a data argument.

<b>Fourth (optional), save the object you want to return and pass it to `return()`.</b>

```r
preview_data <- function(data) {
* result <- data %>% sample_n(size = 5)
* return(result)
}
```

<br/>

- `preview_data` is our function's name

- `function()` is a keyword in R and will always be used in the type of functions we create

- inside `{` `}` will be our function's body

- `data` is our argument's name, all arguments live inside the `(` `)`

---

## Test `preview_data()`

```r
preview_data(data = airlines)
```

--

.pull-left[

Function's body

<br/><br/>

```r
* result <- data %>% sample_n(size = 5)
  return(result)
```

]

--

.pull-right[

With `preview_data(data = airlines)`, the function's body will operate as if we have

<br/>

```r
* result <- airlines %>% sample_n(size = 5)
  return(result)
```

]

<br/>

--

```
#> # A tibble: 5 x 2
#>   carrier name
#>   <chr>   <chr>
#> 1 MQ      Envoy Air
#> 2 VX      Virgin America
#> 3 FL      AirTran Airways Corporation
#> 4 B6      JetBlue Airways
#> 5 F9      Frontier Airlines Inc.
```

---

```r
preview_data(planes)
```

```
#> # A tibble: 5 x 9
#>   tailnum  year type      manufacturer     model engines seats speed engine
#>   <chr>   <int> <chr>     <chr>            <chr>   <int> <int> <int> <chr>
#> 1 N17984   2000 Fixed wi… EMBRAER          EMB-…       2    55    NA Turbo…
#> 2 N754SW   1999 Fixed wi… BOEING           737-…       2   140    NA Turbo…
#> 3 N640VA   2007 Fixed wi… AIRBUS           A320…       2   182    NA Turbo…
#> 4 N519UW   2009 Fixed wi… AIRBUS           A321…       2   379    NA Turbo…
#> 5 N920DE   1993 Fixed wi… MCDONNELL DOUGL… MD-88       2   142    NA Turbo…
```

```r
preview_data(flights)
```

```
#> # A tibble: 5 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     5    24      827            830        -3     1118
#> 2  2013     4    23     1956           2000        -4     2148
#> 3  2013     4    29     1611           1600        11     1815
#> 4  2013     1     7     1456           1445        11     1714
#> 5  2013     1     4     1603           1600         3     1919
#> # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
```

---

## Application exercise - Tasks 1, 2

Write a function called `center_measures` that has one argument, `x`, a numeric vector. The function should return the mean and median of `x`. Below is some sample code.

```r
nums <- c(10, 5, -1, 0, 4)
c(mean(nums), median(nums))
```

```
#> [1] 3.6 4.0
```

<br/>

Turn the following code into a function. How many arguments does it need? What is a good name for this function? Assume `x` will be a nonnegative numeric vector.

```r
x / sum(x)
```

---

## Best practices

- Write a function when you have copied code more than twice.

- Try to use a verb for your function's name.

- Keep argument names short but descriptive.

- Add code comments to explain the "why" of your code.

```r
# this is a code comment
```

- Link a family of functions with a common prefix: `pnorm()`, `pbinom()`, `ppois()`.

- Put data arguments first, then other required arguments, then arguments with default values.

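---

## Best practices in action

The argument-order guideline is easier to see in a small example. Below is a minimal sketch, assuming a hypothetical helper `get_col_mean()` and a made-up data frame `scores`; neither is part of the lecture code. It uses base R only, so it runs without extra packages.

```r
# Hypothetical helper for illustration (not from the lecture code).
# Argument order: data first, then the required argument `col`,
# then arguments that have default values.
get_col_mean <- function(data, col, trim = 0, na.rm = TRUE) {
  # [[ ]] extracts one column as a plain vector
  values <- data[[col]]
  mean(values, trim = trim, na.rm = na.rm)
}

# Made-up data with one missing value
scores <- data.frame(points = c(10, 5, NA, 0, 4))

get_col_mean(scores, "points")                 # returns 4.75 (the NA is dropped)
get_col_mean(scores, "points", na.rm = FALSE)  # returns NA
```

Because `trim` and `na.rm` have defaults, most calls only supply the data and the column name; the optional arguments are named only when their values change.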
---

## Naming examples

```r
# too short
f()

# not a verb
my_function_for_df()

# good
get_html()
impute_missing()
fit_models()

# not good
min_col()
max_col()
mean_col()

# good (take advantage of autocomplete)
col_min()
col_max()
col_mean()
```

---

## Function with multiple arguments

```r
congress <- read_csv("http://www2.stat.duke.edu/~sms185/data/politics/congress_long.csv")
congress
```

```
#> # A tibble: 432 x 5
#>    year_start year_end party  branch seats
#>         <dbl>    <dbl> <chr>  <chr>  <dbl>
#>  1       1913     1915 dem    house    290
#>  2       1913     1915 dem    senate    51
#>  3       1913     1915 gop    house    127
#>  4       1913     1915 gop    senate    44
#>  5       1913     1915 other  house     18
#>  6       1913     1915 other  senate     1
#>  7       1913     1915 vacant house     NA
#>  8       1913     1915 vacant senate    NA
#>  9       1915     1917 dem    house    231
#> 10       1915     1917 dem    senate    56
#> # … with 422 more rows
```

<br/>

Let's write a function that will return a subset of this data based on the `year_start` and `branch`. Start with some working code for a given year and branch.

---

## Example

```r
congress %>%
  filter(year_start == 1931, branch == "senate")
```

```
#> # A tibble: 4 x 5
#>   year_start year_end party  branch seats
#>        <dbl>    <dbl> <chr>  <chr>  <dbl>
#> 1       1931     1933 dem    senate    47
#> 2       1931     1933 gop    senate    48
#> 3       1931     1933 other  senate     1
#> 4       1931     1933 vacant senate    NA
```

--

We'll need two arguments for our function.

```r
get_congress <- function(year, leg_branch) {
  congress %>%
    filter(year_start == year, branch == leg_branch)
}
```

---

## Test `get_congress()`

```r
get_congress(year = 1929, leg_branch = "senate")
```

```
#> # A tibble: 4 x 5
#>   year_start year_end party  branch seats
#>        <dbl>    <dbl> <chr>  <chr>  <dbl>
#> 1       1929     1931 dem    senate    39
#> 2       1929     1931 gop    senate    56
#> 3       1929     1931 other  senate     1
#> 4       1929     1931 vacant senate    NA
```

--

```r
get_congress(1931, "house")
```

```
#> # A tibble: 4 x 5
#>   year_start year_end party  branch seats
#>        <dbl>    <dbl> <chr>  <chr>  <dbl>
#> 1       1931     1933 dem    house    217
#> 2       1931     1933 gop    house    217
#> 3       1931     1933 other  house      1
#> 4       1931     1933 vacant house     NA
```

---

## Let's build on this some more

Package `ggpol` has a function `geom_parliament()` that allows us to create parliament plots. We can use our newly crafted function to create this type of plot and integrate functions from package `ggplot2`.

```r
library(ggpol)
```

```r
*ggplot(get_congress(year = 1931, leg_branch = "house")) +
* geom_parliament(aes(seats = seats, fill = factor(party))) +
  scale_fill_manual(values = c("#3A89CB", "#D65454", "#BF6FF0", "Grey"),
                    labels = c("Dem", "GOP", "Other", "Vacant")) +
  labs(fill = "Party", caption = "1931 House") +
  coord_fixed() +
  theme_void(base_size = 16)
```

---

## Parliament plot

<img src="lec-06a-functions_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---

## Create a plot function

Think about the plot's code and our plot.
**What arguments do we need?**

--

```r
*plot_congress <- function(year, leg_branch, cap_lbl = "") {

}
```

---

## Create a plot function

Think about the plot's code and our plot.
**What arguments do we need?**

```r
plot_congress <- function(year, leg_branch, cap_lbl = "") {
* get_congress(year, leg_branch)
}
```

---

## Create a plot function

Think about the plot's code and our plot.
**What arguments do we need?**

```r
plot_congress <- function(year, leg_branch, cap_lbl = "") {
* get_congress(year, leg_branch) %>%
*   ggplot() +
    geom_parliament(aes(seats = seats, fill = factor(party))) +
    scale_fill_manual(values = c("#3A89CB", "#D65454", "#BF6FF0", "Grey"),
                      labels = c("Dem", "GOP", "Other", "Vacant")) +
*   labs(fill = "Party", caption = cap_lbl) +
    coord_fixed() +
    theme_void(base_size = 16)
}
```

---

## Test `plot_congress()`

.tiny[

```r
# year is numeric, matching the <dbl> type of year_start
plot_congress(year = 1929, leg_branch = "senate", cap_lbl = "1929 Senate")
```

<img src="lec-06a-functions_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

]

---

## Test `plot_congress()`

.tiny[

```r
# year is numeric, matching the <dbl> type of year_start
plot_congress(year = 2001, leg_branch = "house", cap_lbl = "2001 House")
```

<img src="lec-06a-functions_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

]

---

class: center, middle, inverse

# Automation

---

## `for` loops

- One tool for reducing duplication is functions. Another tool is iteration via `for` loops.

- This will help if you need to do the same thing to multiple inputs.

- For example, we can iterate through elements of a vector and evaluate code based on each vector element's value.

--

Let's create a small tibble `x`.

.pull-left[

```r
x <- tibble(
  col_a = c(3, -1, 0, 10),
  col_b = c(2, -2, 2, -2),
  col_c = c(8, sqrt(131), log(4), 33),
  col_d = 1:4
)
```

]

.pull-right[

```r
x
```

```
#> # A tibble: 4 x 4
#>   col_a col_b col_c col_d
#>   <dbl> <dbl> <dbl> <int>
#> 1     3     2  8        1
#> 2    -1    -2 11.4      2
#> 3     0     2  1.39     3
#> 4    10    -2 33        4
```

]

---

## Compute column means

A first attempt might be ...

.tiny[

```r
x %>% pull(1) %>% mean()
```

```
#> [1] 3
```

]

--

.tiny[

```r
x %>% pull(2) %>% mean()
```

```
#> [1] 0
```

]

--

.tiny[

```r
x %>% pull(3) %>% mean()
```

```
#> [1] 13.45795
```

]

--

.tiny[

```r
x %>% pull(4) %>% mean()
```

```
#> [1] 2.5
```

]

---

## How can we automate this process?
Looking at our previous code, we see that the only variation is in the column index being pulled. A `for` loop can easily automate our process from the previous slide.

**First, create an output object.** This is where we will save our results.

```r
*result <- numeric(4)
```

---

## How can we automate this process?

Looking at our previous code, we see that the only variation is in the column index being pulled. A `for` loop can easily automate our process from the previous slide.

**Second, define the loop sequence.** Here `i` is the looping variable. In each run of the loop, `i` is assigned the next value in the vector `c(1, 2, 3, 4)`.

```r
result <- numeric(4)

*for (i in c(1, 2, 3, 4)) {

*}
```

---

## How can we automate this process?

Looking at our previous code, we see that the only variation is in the column index being pulled. A `for` loop can easily automate our process from the previous slide.

**Third, add the loop's body.** This is the code that does the work.

```r
result <- numeric(4)

for (i in c(1, 2, 3, 4)) {
* result[i] <- x %>%
*   pull(i) %>%
*   mean()
}
```

---

## `for` loop in action

```r
result <- numeric(4)

for (i in c(1, 2, 3, 4)) {
  result[i] <- x %>%
    pull(i) %>%
    mean()
}
```

```r
result
```

```
#> [1]  3.00000  0.00000 13.45795  2.50000
```

--

<br/>

This is a small example, but it's easy to see the benefits of using a `for` loop if we needed to scale this computation to 100 columns. All we would have to change is our loop sequence.

---

## Example

Let's write a `for` loop so we can get the R data type from each column in tibble `congress`.

```
#> # A tibble: 4 x 5
#>   year_start year_end party branch seats
#>        <dbl>    <dbl> <chr> <chr>  <dbl>
#> 1       1913     1915 dem   house    290
#> 2       1913     1915 dem   senate    51
#> 3       1913     1915 gop   house    127
#> 4       1913     1915 gop   senate    44
```

Below are a few functions that are helpful with data frames.

```r
length(congress)
```

```
#> [1] 5
```

```r
seq_along(congress)
```

```
#> [1] 1 2 3 4 5
```

--

<br/>

**How will these functions help us with our task?**

---

## Example

```r
var_types <- character(length(congress))

for (i in seq_along(congress)) {
  var_types[i] <- congress %>%
    pull(i) %>%
    typeof()
}
```

--

```r
var_types
```

```
#> [1] "double"    "double"    "character" "character" "double"
```

---

## Application exercise - Task 3

Write a `for` loop to compute the number of unique values in each column of tibble `congress`. Try to stay within the `tidyverse` syntax as much as possible. You may find function `length()` useful. Below are some examples of it in action.

```r
length(1:10)
```

```
#> [1] 10
```

```r
length(c("a", "ab", "abc", "abcd"))
```

```
#> [1] 4
```

```r
length(seq(from = 2, to = 10, by = 2))
```

```
#> [1] 5
```

???

## Solution

```r
result <- numeric(length(congress))

for (i in seq_along(congress)) {
  result[i] <- congress %>%
    select(all_of(i)) %>%  # all_of() marks i as an external column position
    distinct() %>%
    pull() %>%
    length()
}

result
```

```
#> [1]  54  54   4   2 136
```

---

## Revisit `plot_congress()`

```r
plot_congress <- function(year, leg_branch, cap_lbl = "") {
  get_congress(year, leg_branch) %>%
    ggplot() +
    geom_parliament(aes(seats = seats, fill = factor(party))) +
    scale_fill_manual(values = c("#3A89CB", "#D65454", "#BF6FF0", "Grey"),
                      labels = c("Dem", "GOP", "Other", "Vacant")) +
    labs(fill = "Party", caption = cap_lbl) +
    coord_fixed() +
    theme_void(base_size = 16)
}
```

--

<br/>

<b>What would we have to do if we want to create multiple plots for different years of the senate?</b>

---

## GIF - Task 4

<div class="figure" style="text-align: center">
<img src="lec-06a-functions_files/figure-html/unnamed-chunk-55-.gif" alt="Dem: blue, GOP: red, Other: purple, Vacant: grey" />
<p class="caption">Dem: blue, GOP: red, Other: purple, Vacant: grey</p>
</div>

???
## Solutions

```r
for (i in seq(1993, 2019, 2)) {
  print({
    plot_congress(year = i, leg_branch = "senate")
  })
}
```

---

## References

1. Grolemund, G., & Wickham, H. (2020). R for Data Science. R4ds.had.co.nz. Retrieved 9 February 2020, from https://r4ds.had.co.nz/

2. erocoar/ggpol. (2020). GitHub. Retrieved 9 February 2020, from https://github.com/erocoar/ggpol