# Functions and Automation
## Intro to Data Science
### Shawn Santo
### 02-11-20

---

## Announcements

- Homework 2 due Feb 13

- Exam 1 assigned Feb 14

- Today's application exercise: https://classroom.github.com/a/gstr-lrh

---

# Functions

---

## Function components

A function has three components: its arguments (formals), its body, and its
environment. We will focus on the first two as we use and write functions.

Let's look at the help of a few functions to see their arguments.

```r
?typeof

?log

?mean
```
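
As a rough sketch of these components, base R's `formals()`, `body()`, and
`environment()` can pull a function apart. The function `add_by()` below is a
hypothetical example, not part of the course materials.

```r
# Hypothetical function used only for illustration
add_by <- function(x, by = 1) {
  x + by
}

arg_names <- names(formals(add_by))  # the arguments (formals)
fn_body <- body(add_by)              # the code that does the work
fn_env <- environment(add_by)        # where the function was created

arg_names
fn_body
```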

---

## Function calls

Function calls involve the function's name and, at a minimum, values for
its required arguments. Arguments can be given values by

1. position
 
 ```r
 z <- 1:10
 mean(z, .3, FALSE)
 ```
 
 ```
 #> [1] 5.5
 ```
--
2. name
 
 ```r
 mean(x = z, trim = .3, na.rm = FALSE)
 ```
 
 ```
 #> [1] 5.5
 ```
--
3. partial name matching
 
 ```r
 mean(x = z, na = FALSE, t = .3)
 ```
 
 ```
 #> [1] 5.5
 ```

**Which option do you think is best?**

---

## Call style

A common choice is a combination of 1 and 2.

```r
mean(z, trim = .3)
```

```
#> [1] 5.5
```

Leave the argument's name out for the commonly used arguments, and 
always specify the argument names for the optional arguments.
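
As a small check, the sketch below suggests that the three call styles give the
same result, and that named arguments may appear in any order. The object names
`by_position`, `by_name`, and `mixed` are only illustrative.

```r
z <- 1:10

by_position <- mean(z, .3, FALSE)
by_name <- mean(na.rm = FALSE, trim = .3, x = z)  # order is free when named
mixed <- mean(z, trim = .3)                       # position for x, name for trim

identical(by_position, by_name)
identical(by_name, mixed)
```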

---

## Why create functions?

Package `nycflights13` contains five datasets. Let's write code to preview
some of them.

```r
library(tidyverse)
library(nycflights13)
```

```r
airlines %>% 
  sample_n(size = 5)
```

```
#> # A tibble: 5 x 2
#> carrier name 
#> <chr> <chr> 
#> 1 MQ Envoy Air 
#> 2 UA United Air Lines Inc. 
#> 3 US US Airways Inc. 
#> 4 AA American Airlines Inc.
#> 5 HA Hawaiian Airlines Inc.
```

---

```r
planes %>% 
  sample_n(size = 5)
```

```
#> # A tibble: 5 x 9
#> tailnum year type manufacturer model engines seats speed engine
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> 
#> 1 N825MH 2000 Fixed win… BOEING 767-4… 2 300 NA Turbo…
#> 2 N3756 2001 Fixed win… BOEING 737-8… 2 189 NA Turbo…
#> 3 N625AW 1989 Fixed win… AIRBUS INDUST… A320-… 2 182 NA Turbo…
#> 4 N413UA 1994 Fixed win… AIRBUS INDUST… A320-… 2 200 NA Turbo…
#> 5 N17233 1999 Fixed win… BOEING 737-8… 2 149 NA Turbo…
```

```r
flights %>% 
  sample_n(size = 5)
```

```
#> # A tibble: 5 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 7 31 1821 1825 -4 1948
#> 2 2013 11 24 1705 1615 50 1850
#> 3 2013 2 17 1058 1100 -2 1157
#> 4 2013 11 30 1655 1700 -5 2002
#> 5 2013 1 30 2053 2100 -7 14
#> # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
```

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function
should have a data argument.

**First, name your function. Carefully follow the syntax style.**

```r
preview_data <- function() {
 
 
 
}
```

- `preview_data` is our function's name
- `function` is the R keyword that creates a function; every function we 
  write will use it
- inside `{`  `}` will be our function's body

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function
should have a data argument.

**Second, add descriptive argument names.**

```r
preview_data <- function(data) {
 
 
 
}
```

- `preview_data` is our function's name
- `function` is the R keyword that creates a function; every function we 
  write will use it
- inside `{`  `}` will be our function's body
- `data` is our argument's name; all arguments go inside the `(` `)`

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function
should have a data argument.

**Third, add the function's body - code that does the work.**

```r
preview_data <- function(data) {
 
* data %>%
    sample_n(size = 5)
}
```

---

## Create a function

We want to use our code for multiple datasets. Therefore, our function
should have a data argument.

**Fourth (optional), save the object you want to return and pass 
it to `return()`.**

```r
preview_data <- function(data) {
 
* result <- data %>%
    sample_n(size = 5)
 
* return(result)
}
```
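
The `return()` step is optional because an R function returns the value of the
last expression it evaluates. A minimal sketch with two hypothetical functions,
written in base R so it runs without any packages:

```r
# Hypothetical functions used only for illustration
square_implicit <- function(x) {
  x^2  # the last evaluated expression is returned
}

square_explicit <- function(x) {
  result <- x^2
  return(result)
}

identical(square_implicit(4), square_explicit(4))
```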

---

## Test `preview_data()`

```r
preview_data(data = airlines)
```
--
.pull-left[
Function's body

```r
* result <- data %>%
    sample_n(size = 5)

  return(result)
```
]
--
.pull-right[
With `preview_data(data = airlines)`, the function's body will operate as if
we had written

```r
* result <- airlines %>%
    sample_n(size = 5)

  return(result)
```
]

--

```
#> # A tibble: 5 x 2
#> carrier name 
#> <chr> <chr> 
#> 1 MQ Envoy Air 
#> 2 VX Virgin America 
#> 3 FL AirTran Airways Corporation
#> 4 B6 JetBlue Airways 
#> 5 F9 Frontier Airlines Inc.
```

---

```r
preview_data(planes)
```

```
#> # A tibble: 5 x 9
#> tailnum year type manufacturer model engines seats speed engine
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> 
#> 1 N17984 2000 Fixed wi… EMBRAER EMB-… 2 55 NA Turbo…
#> 2 N754SW 1999 Fixed wi… BOEING 737-… 2 140 NA Turbo…
#> 3 N640VA 2007 Fixed wi… AIRBUS A320… 2 182 NA Turbo…
#> 4 N519UW 2009 Fixed wi… AIRBUS A321… 2 379 NA Turbo…
#> 5 N920DE 1993 Fixed wi… MCDONNELL DOUGL… MD-88 2 142 NA Turbo…
```

```r
preview_data(flights)
```

```
#> # A tibble: 5 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 5 24 827 830 -3 1118
#> 2 2013 4 23 1956 2000 -4 2148
#> 3 2013 4 29 1611 1600 11 1815
#> 4 2013 1 7 1456 1445 11 1714
#> 5 2013 1 4 1603 1600 3 1919
#> # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
```
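
Each call returns different rows because the sample is random. If a
reproducible preview is needed, one option is to set a seed before the call.
The sketch below assumes a hypothetical base R variant, `preview_rows()`, with
an extra `size` argument, and uses the built-in `mtcars` data so it runs
without `nycflights13`.

```r
# Hypothetical base R variant of preview_data(); `size` is an added argument
preview_rows <- function(data, size = 5) {
  rows <- sample(nrow(data), size = size)  # random row indices
  data[rows, , drop = FALSE]
}

set.seed(1)
first <- preview_rows(mtcars)

set.seed(1)
second <- preview_rows(mtcars)

identical(first, second)  # same seed, same rows
```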

---

## Application exercise - Tasks 1, 2

Write a function called `center_measures` that has one argument, `x`, a 
numeric vector. The function should return the mean and median of `x`. 
Below is some sample code.

```r
nums <- c(10, 5, -1, 0, 4)
c(mean(nums), median(nums))
```

```
#> [1] 3.6 4.0
```

Turn the following code into a function. How many arguments does it need?
What is a good name for this function? Assume `x` will be a nonnegative numeric
vector.

```r
x / sum(x)
```

---

## Best practices

- Write a function when you have copied code more than twice.

- Try to use a verb for your function's name.

- Keep argument names short but descriptive.

- Add code comments to explain the "why" of your code.
    
    ```r
    # this is a code comment
    ```
  
- Link a family of functions with a common prefix: `pnorm()`, `pbinom()`, 
  `ppois()`.

- Put data arguments first, then other required arguments, followed by 
  arguments with default values.
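
A short sketch of several of these practices together, using a hypothetical
function `clip_values()` (the name and its arguments are assumptions made for
illustration): a verb-based name, the data argument first, required arguments
next, a default argument last, and a comment that explains the "why".

```r
# Hypothetical function: data first, required arguments next, default last
clip_values <- function(x, lower, upper, na.rm = FALSE) {
  # Drop missing values only on request so callers notice unexpected NAs
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  pmin(pmax(x, lower), upper)
}

clip_values(c(-5, 2, 12, NA), lower = 0, upper = 10)
clip_values(c(-5, 2, 12, NA), lower = 0, upper = 10, na.rm = TRUE)
```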
        
---

## Naming examples

```r
# too short
f()

# not a verb
my_function_for_df()

# good
get_html()
impute_missing()
fit_models()

# not good
min_col()
max_col()
mean_col()

# good (take advantage of autocomplete)
col_min()
col_max()
col_mean()
```

---

## Function with multiple arguments

```r
congress <- read_csv("http://www2.stat.duke.edu/~sms185/data/politics/congress_long.csv")
congress
```

```
#> # A tibble: 432 x 5
#> year_start year_end party branch seats
#> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1913 1915 dem house 290
#> 2 1913 1915 dem senate 51
#> 3 1913 1915 gop house 127
#> 4 1913 1915 gop senate 44
#> 5 1913 1915 other house 18
#> 6 1913 1915 other senate 1
#> 7 1913 1915 vacant house NA
#> 8 1913 1915 vacant senate NA
#> 9 1915 1917 dem house 231
#> 10 1915 1917 dem senate 56
#> # … with 422 more rows
```

Let's write a function that will return a subset of this data based on the
`year_start` and `branch`. Start with some working code for a given year and
branch.

---

## Example

```r
congress %>% 
  filter(year_start == 1931, branch == "senate")
```

```
#> # A tibble: 4 x 5
#> year_start year_end party branch seats
#> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1931 1933 dem senate 47
#> 2 1931 1933 gop senate 48
#> 3 1931 1933 other senate 1
#> 4 1931 1933 vacant senate NA
```

We'll need two arguments for our function.

```r
get_congress <- function(year, leg_branch) {
  congress %>%
    filter(year_start == year, branch == leg_branch)
}
```
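
The argument names `year` and `leg_branch` differ from the column names
`year_start` and `branch` on purpose: inside `filter()`, a condition such as
`branch == branch` would compare the column with itself and keep every row.
The sketch below shows the same two-argument pattern in base R, assuming a
hypothetical function `get_rows()` that also takes the data as an argument.
`congress_small` holds a few of the 1913 rows shown earlier.

```r
# A few rows copied from the congress output shown earlier
congress_small <- data.frame(
  year_start = c(1913, 1913, 1913, 1913),
  year_end   = c(1915, 1915, 1915, 1915),
  party      = c("dem", "dem", "gop", "gop"),
  branch     = c("house", "senate", "house", "senate"),
  seats      = c(290, 51, 127, 44)
)

# Hypothetical base R version; the data is passed in rather than looked up
get_rows <- function(data, year, leg_branch) {
  keep <- data$year_start == year & data$branch == leg_branch
  data[keep, , drop = FALSE]
}

senate_1913 <- get_rows(congress_small, year = 1913, leg_branch = "senate")
senate_1913
```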

---

## Test `get_congress()`

```r
get_congress(year = 1929, leg_branch = "senate")
```

```
#> # A tibble: 4 x 5
#> year_start year_end party branch seats
#> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1929 1931 dem senate 39
#> 2 1929 1931 gop senate 56
#> 3 1929 1931 other senate 1
#> 4 1929 1931 vacant senate NA
```

```r
get_congress(1931, "house")
```

```
#> # A tibble: 4 x 5
#> year_start year_end party branch seats
#> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1931 1933 dem house 217
#> 2 1931 1933 gop house 217
#> 3 1931 1933 other house 1
#> 4 1931 1933 vacant house NA
```

---

## Let's build on this some more

Package `ggpol` has a function `geom_parliament()` that allows us to create
parliament plots. We can use our newly crafted function to create this type of
plot and integrate functions from package `ggplot2`.

```r
library(ggpol)
```

```r
*ggplot(get_congress(year = 1931, leg_branch = "house")) +
* geom_parliament(aes(seats = seats, fill = factor(party))) +
  scale_fill_manual(values = c("#3A89CB", "#D65454", "#BF6FF0", "Grey"), 
                    labels = c("Dem", "GOP", "Other", "Vacant")) +
  labs(fill = "Party", caption = "1931 House") +
  coord_fixed() +
  theme_void(base_size = 16)
```

---

## Parliament plot

---

## Create a plot function

Think about the plot's code and our plot. **What arguments do we need?**

```r
*plot_congress <- function(year, leg_branch, cap_lbl = "") {

}
```

---

## Create a plot function

Think about the plot's code and our plot. **What arguments do we need?**

```r
plot_congress <- function(year, leg_branch, cap_lbl = "") {
* get_congress(year, leg_branch)
 
}
```

---

## Create a plot function

Think about the plot's code and our plot. **What arguments do we need?**

```r
plot_congress <- function(year, leg_branch, cap_lbl = "") {
* get_congress(year, leg_branch) %>%
*   ggplot() +
    geom_parliament(aes(seats = seats, fill = factor(party))) +
    scale_fill_manual(values = c("#3A89CB", "#D65454", "#BF6FF0", "Grey"), 
                      labels = c("Dem", "GOP", "Other", "Vacant")) +
*   labs(fill = "Party", caption = cap_lbl) +
    coord_fixed() +
    theme_void(base_size = 16)
}
```
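
Because `cap_lbl` has a default value, callers can leave it out. As one
possible extension, a caption could be built from the other arguments when
none is supplied. `make_caption()` below is a hypothetical helper, not part of
the original function.

```r
# Hypothetical helper: build a caption when the caller does not supply one
make_caption <- function(year, leg_branch, cap_lbl = "") {
  if (cap_lbl == "") {
    # Capitalize the first letter of the branch, e.g. "senate" to "Senate"
    branch_lbl <- paste0(toupper(substring(leg_branch, 1, 1)),
                         substring(leg_branch, 2))
    cap_lbl <- paste(year, branch_lbl)
  }
  cap_lbl
}

make_caption(1929, "senate")
make_caption(2001, "house", cap_lbl = "Custom caption")
```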

---

## Test `plot_congress()`

```r
plot_congress(year = 1929, leg_branch = "senate", cap_lbl = "1929 Senate")
```

<img src="lec-06a-functions_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

---

## Test `plot_congress()`

```r
plot_congress(year = 2001, leg_branch = "house", cap_lbl = "2001 House")
```

<img src="lec-06a-functions_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

---

# Automation

---

## `for` loops

- One tool for reducing duplication is functions. Another tool is iteration
  via `for` loops.
  
- This will help if you need to do the same thing to multiple inputs.

- For example, we can iterate through elements of a vector and evaluate code
  based on each vector element's value.

Let's create a small tibble `x`.

.pull-left[
```r
x <- tibble(
  col_a = c(3, -1, 0, 10),
  col_b = c(2, -2, 2, -2),
  col_c = c(8, sqrt(131), log(4), 33),
  col_d = 1:4
)
```
]
.pull-right[

```r
x
```

```
#> # A tibble: 4 x 4
#> col_a col_b col_c col_d
#> <dbl> <dbl> <dbl> <int>
#> 1 3 2 8 1
#> 2 -1 -2 11.4 2
#> 3 0 2 1.39 3
#> 4 10 -2 33 4
```
]

---

## Compute column means

A first attempt might be ...

```r
x %>% 
  pull(1) %>% 
  mean()
```

```
#> [1] 3
```

```r
x %>% 
  pull(2) %>% 
  mean()
```

```
#> [1] 0
```

```r
x %>% 
  pull(3) %>% 
  mean()
```

```
#> [1] 13.45795
```

```r
x %>% 
  pull(4) %>% 
  mean()
```

```
#> [1] 2.5
```

---

## How can we automate this process?

Looking at our previous code, the only thing that varies is the column index
being pulled. A `for` loop can automate the process from the previous slide.

**First, create an output object.** This is where we will save our results.

```r
*result <- numeric(4)
```
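
Creating the output object up front is often called preallocation. A minimal
sketch of a few base R constructors that create empty containers of a given
length and type (the object names are only illustrative):

```r
num_out <- numeric(4)           # four zeros
chr_out <- character(2)         # two empty strings
lst_out <- vector("list", 3)    # a list with three NULL elements

num_out
chr_out
length(lst_out)
```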

---

## How can we automate this process?

Looking at our previous code, the only thing that varies is the column index
being pulled. A `for` loop can automate the process from the previous slide.

**Second, define the loop sequence.** Here `i` is the looping variable. In each
run of the loop, `i` is assigned the next value in the vector
`c(1, 2, 3, 4)`.

```r
result <- numeric(4)

*for (i in c(1, 2, 3, 4)) {
  
  
  
*} 
```

---

## How can we automate this process?

Looking at our previous code, the only thing that varies is the column index
being pulled. A `for` loop can automate the process from the previous slide.

**Third, add the loop's body.** This is the code that does the work.

```r
result <- numeric(4)

for (i in c(1, 2, 3, 4)) {
 
* result[i] <- x %>%
*   pull(i) %>%
*   mean()
 
}
```

---

## `for` loop in action

```r
result <- numeric(4)

for (i in c(1, 2, 3, 4)) {
 
  result[i] <- x %>% 
    pull(i) %>%
    mean()
 
}
```

```r
result
```

```
#> [1]  3.00000  0.00000 13.45795  2.50000
```

This is a small example, but the benefit of a `for` loop is clear if we need
to scale this computation to 100 columns: all we would have to change is the
length of `result` and the loop sequence.
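
One way to avoid editing the loop at all is to derive both the output length
and the loop sequence from the data. The sketch below is a base R version of
the same computation, with `x` rebuilt as a plain data frame so it runs without
the `tidyverse`.

```r
x <- data.frame(
  col_a = c(3, -1, 0, 10),
  col_b = c(2, -2, 2, -2),
  col_c = c(8, sqrt(131), log(4), 33),
  col_d = 1:4
)

# Output length and loop sequence both follow the number of columns
result <- numeric(ncol(x))

for (i in seq_len(ncol(x))) {
  result[i] <- mean(x[[i]])
}

result
```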

---

## Example

Let's write a `for` loop so we can get the R data type from each column in
tibble `congress`.

```
#> # A tibble: 4 x 5
#> year_start year_end party branch seats
#> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1913 1915 dem house 290
#> 2 1913 1915 dem senate 51
#> 3 1913 1915 gop house 127
#> 4 1913 1915 gop senate 44
```

Below are a few functions that are helpful with data frames.

```r
length(congress)
```

```
#> [1] 5
```

```r
seq_along(congress)
```

```
#> [1] 1 2 3 4 5
```

**How will these functions be useful for our task?**

---

## Example

```r
var_types <- character(length(congress))

for (i in seq_along(congress)) {
 
  var_types[i] <- congress %>% 
    pull(i) %>% 
    typeof()
}
```

```r
var_types
```

```
#> [1] "double"    "double"    "character" "character" "double"
```
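
One reason to prefer `seq_along()` over `1:length()` is the empty case. The
sketch below uses a hypothetical empty list, `empty`, to show the difference:
`1:length(empty)` counts down from 1 to 0, so a loop over it would run twice,
while a loop over `seq_along(empty)` does not run at all.

```r
empty <- list()  # hypothetical empty input

bad_seq <- 1:length(empty)    # c(1, 0): the loop body would run twice
good_seq <- seq_along(empty)  # integer(0): the loop body never runs

runs <- 0
for (i in good_seq) {
  runs <- runs + 1
}

bad_seq
good_seq
runs
```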

---

## Application exercise - Task 3

Write a `for` loop to compute the number of unique values in each column of
tibble `congress`. Try to stay within the `tidyverse` syntax as much as
possible.

You may find function `length()` useful. Below are some examples of it in 
action.

```r
length(1:10)
```

```
#> [1] 10
```

```r
length(c("a", "ab", "abc", "abcd"))
```

```
#> [1] 4
```

```r
length(seq(from = 2, to = 10, by = 2))
```

```
#> [1] 5
```

???

## Solution

```r
result <- numeric(length(congress))

for (i in seq_along(congress)) {
  result[i] <- congress %>% 
    select(all_of(i)) %>% 
    distinct() %>% 
    pull() %>% 
    length()
}

result
```

```
#> [1]  54  54   4   2 136
```

---

## Revisit `plot_congress()`

```r
plot_congress <- function(year, leg_branch, cap_lbl = "") {
  get_congress(year, leg_branch) %>% 
    ggplot() + 
    geom_parliament(aes(seats = seats, fill = factor(party))) +
    scale_fill_manual(values = c("#3A89CB", "#D65454", "#BF6FF0", "Grey"), 
                      labels = c("Dem", "GOP", "Other", "Vacant")) +
    labs(fill = "Party", caption = cap_lbl) + 
    coord_fixed() +
    theme_void(base_size = 16)
}
}
```

What would we have to do if we want to create multiple plots for different
years of the senate?

---

## GIF - Task 4

<div class="figure" style="text-align: center">
<img src="lec-06a-functions_files/figure-html/unnamed-chunk-55-.gif" alt="Dem: blue, GOP: red, Other: purple, Vacant: grey" />
Dem: blue, GOP: red, Other: purple, Vacant: grey
</div>

???

## Solutions

```r
for (i in seq(1993, 2019, 2)) {
  print({
    plot_congress(year = i, leg_branch = "senate")
  })
}
```
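
The loop above leaves `cap_lbl` at its default. If each plot should carry a
caption, one option is to build the captions in the same kind of loop. The
sketch below only builds the labels, in base R, and assumes the hypothetical
names `years` and `captions`.

```r
years <- seq(1993, 2019, by = 2)
captions <- character(length(years))

for (i in seq_along(years)) {
  captions[i] <- paste(years[i], "Senate")
}

length(years)
head(captions, 3)
```

Each element of `captions` could then be passed as `cap_lbl` alongside the
matching element of `years`.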

---
        
## References

1. Grolemund, G., & Wickham, H. (2020). R for Data Science. R4ds.had.co.nz. 
   Retrieved 9 February 2020, from https://r4ds.had.co.nz/
   
2. erocoar/ggpol. (2020). GitHub. Retrieved 9 February 2020, from
   https://github.com/erocoar/ggpol