class: center, middle, inverse, title-slide # Introduction and Fundamentals of R ## Statistical Computing & Programming ### Shawn Santo --- class: center, inverse, middle # Introduction --- ## Who am I? - [Shawn Santo](https://www.shawnsanto.com) - [shawn.santo@duke.edu](mailto:shawn.santo@duke.edu) - Office hours (Zoom link in Sakai) - Monday 8:00pm - 9:00pm ET - Thursday 12:30pm - 1:30pm ET --- ## Who else is involved? - [Quinn Frank](https://github.com/quinnfrank) - [quinn.frank@duke.edu](mailto:quinn.frank@duke.edu) - Office hours (Zoom link in Sakai) - Wednesday 11:30am - 1:30pm ET - [Pierre Gardan](https://github.com/gardanpm) - [pierre.gardan@duke.edu](mailto:pierre.gardan@duke.edu) - Office hours (Zoom link in Sakai) - Tuesday 9:00am - 11:00am ET - [Sarah Mansfield](https://github.com/sarahmansfield) - [sarah.b.mansfield@duke.edu](mailto:sarah.b.mansfield@duke.edu) - Office hours (Zoom link in Sakai) - Friday 11:30am - 1:30pm ET --- ## What is statistical computing? .middle.center[ <img src="images/statistical_computing_venn.png" height="400px"> ] .small-text[ *Source:* Deborah Nolan & Duncan Temple Lang (2010) Computing in the Statistics Curricula, The American Statistician, 64:2, 97-107, DOI: [10.1198/tast.2010.09132](https://doi.org/10.1198/tast.2010.09132) ] --- ## Some topics covered .pull-left[ - Fundamentals of R - S3 OO system - Package `tidyverse` - Regular expressions - Web scraping - Web based applications with RShiny - Wrangling and managing big data - Databases, SQL, and NoSQL ] .pull-right[ - Data types and functions - Parallelization - Version control with git and GitHub - Shell - Reproducible reports with R Markdown - Debugging and testing - Spark - Make ] <br><br> [Full course schedule](http://www2.stat.duke.edu/courses/Spring21/sta323.001/schedule.html) --- ## Why this class matters >Programming Languages: Python and R continue to dominate. The new entry this year was Javascript, which got a respectable 6.8% share. Julia share has increased, while most other languages have declined. .tiny-text[ | Platform | 2019 % share | 2018 % share | % change | |--------------------------------------|--------------|--------------|----------| | Python | 65.8% | 65.6% | 0.2% | | R Language | 46.6% | 48.5% | -4.0% | | SQL Language | 32.8% | 39.6% | -17.2% | | Java | 12.4% | 15.1% | -17.7% | | Unix shell/awk | 7.9% | 9.2% | -13.4% | | C/C++ | 7.1% | 6.8% | 3.7% | | Javascript | 6.8% | na | na | | Other programming and data languages | 5.7% | 6.9% | -17.1% | | Scala | 3.5% | 5.9% | -41.0% | | Julia | 1.7% | 0.7% | 150.4% | | Perl | 1.3% | 1.0% | 25.2% | | Lisp | 0.4% | 0.3% | 46.1% | ] .small-text[ *Source:* https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html ] --- class: center, inverse, middle # Course essentials --- ## Text toolkit .middle.center[  ] These are recommended textbooks - **all are available for free online**. There is no required textbook for this course. --- ## Software toolkit .middle.center[  ] --- ## Course structure This class is about you doing as opposed to you just watching or listening. In-person and Zoom lectures and labs will be interactive. My role as an instructor is to introduce you to new tools and techniques, but it is up to you to take them and make use of them. If you only read the code and never run it or experiment with it, then you will not get much out of this course. Most lectures will include supplemental resources for you to delve deeper into the topic of discussion. Occasionally, there will be pre-class readings and activities to enrich our lecture and lab experiences. --- ## Grading | Assessment | Weight | |-----------:|:----------:| | Homework | 45% | | Exam 1 | 20% | | Exam 2 | 20% | | Project | 15% | <br/><br/> The exact ranges for letter grades may be curved and cutoffs will be determined at the end of the semester. However, if you have a cumulative numerical average of 90 - 100, you are guaranteed at least an A-, 80 - 89 at least a B-, 70 - 79 at least a C-, and so on. --- ## Teams I will initially construct teams based on an introductory survey. Team expectations: - Each member must commit to giving equal effort. - Each member must read, run, and understand all code. - Each member must honestly complete the intragroup peer evaluations. --- ## Policy: sharing / reusing code It is your responsibility to carefully read each assignment so you know what is permitted and what is not. If you are ever unsure what is allowed, please ask me or one of the TAs. - Similar reproducible examples (reprex) exist online that will help you answer many of the questions posed on labs and homework assignments. Use of these resources is allowed unless it is written explicitly on the assignment. - You must always cite any code you copy or use as inspiration. Copied code without citation is plagiarism and will result in a 0 for the assignment. There will be additional punitive measures based on the severity of plagiarism. - Copying and citing a large amount of code to satisfy a main objective of an assignment will result in a 0 for the assignment. - Discussion (not code sharing / copying) with other students and groups is allowed unless stated otherwise on an assignment. --- ## Getting help - Post your content and course related questions on Slack. - Attend office hours. - Set up a Zoom meeting with me or one of the TAs outside of office hours. We are always willing to accommodate your schedule. - Email me or one of the TAs grade and personal related questions. --- ## Links to bookmark - Course page: http://www2.stat.duke.edu/courses/Spring21/sta323.001 - GitHub organization: https://github.com/sta323-523-sp21 - I'll send out an invite to join once I have all usernames - Department of Statistical Science (DSS) RStudio servers: - Rook - http://rook.stat.duke.edu:8787/ - Knight - http://knight.stat.duke.edu:8787/ <b>To access Rook or Knight, you must be connected to Duke's network via VPN (off-campus) or Dukeblue (on-campus).</b> --- ## To-do list Before Friday's lecture, please complete the following (in order). 1. Create a GitHub account at https://github.com/join. 2. Join the course's Slack Workspace. The link is available in Sakai. 3. Complete the introductory survey. Posted in Slack under `#general`. 4. Verify you can log in to the DSS RStudio Pro servers. 5. If you plan to use R/RStudio locally, download the most up-to-date versions or update your current versions. Also, update any packages you have previously installed, especially `tidyverse`. --- class: inverse, center, middle # Fundamentals of R --- ## Supplementary materials Full video lecture available in Zoom Cloud Recordings Companion videos - [RStudio Tour](https://warpwire.duke.edu/w/zQ4FAA/) .small-text[ Videos were created for STA 323 & 523 - Summer 2020 ] Additional resources - [The tidyverse style guide](https://style.tidyverse.org) - [Sections 3.1 – 3.2](https://adv-r.hadley.nz/vectors-chap.html) Advanced R - [Chapter 5](https://adv-r.hadley.nz/control-flow.html) Advanced R --- class: inverse, center, middle # Vectors --- ## Vectors The fundamental building block of data in R is a vector (collections of related values, objects, other data structures, etc). -- R has two types of vectors: * **atomic** vectors - homogeneous collections of the *same* type (e.g. all logical values, all numbers, or all character strings). * **generic** vectors - heterogeneous collections of *any* type of R object, even other lists (meaning they can have a hierarchical/tree-like structure). <br><br> I will use the term component or element when referring to a value inside a vector. --- ## Vector interrelationships .middle.center[ <img src="images/vector_types.png" height="400px"> ] .small-text[ *Source*: https://r4ds.had.co.nz/vectors.html ] --- ## Atomic vectors R has six atomic vector types: .center[ `logical`, `integer`, `double`, `character`, `complex`, `raw` ] In this course we will mostly work with the first four. You will rarely work with the last two types - complex and raw. -- ```r x <- c(T, F, TRUE, FALSE) typeof(x) ``` ``` #> [1] "logical" ``` ```r y <- c("a", "few", "more", "slides") typeof(y) ``` ``` #> [1] "character" ``` <br/> Note: `c()` is a function that combines its arguments to form a vector. It's a quick way to make small vectors for testing and experimentation. Later, we'll see better ways to create vectors. --- ## Coercion hierarchy If you try to combine components of different types into a single atomic vector, R will try to coerce all elements so they can be represented as the simplest type. The ordering is `logical` < `integer` < `double` < `character`, where logical is considered the "simplest". -- ```r x <- c(T, 5, F, 0, 1) y <- c("a", 1, T) z <- c(3.0, 4L, 0L) ``` -- .pull-left[ ```r x ``` ``` #> [1] 1 5 0 0 1 ``` ```r y ``` ``` #> [1] "a" "1" "TRUE" ``` ```r z ``` ``` #> [1] 3 4 0 ``` ] .pull-right[ ```r typeof(x) ``` ``` #> [1] "double" ``` ```r typeof(y) ``` ``` #> [1] "character" ``` ```r typeof(z) ``` ``` #> [1] "double" ``` ] --- ## Concatenation One way to construct atomic vectors is with function `c()`. ```r c(1, 0, 1, 1, 6) ``` ``` #> [1] 1 0 1 1 6 ``` ```r c(c(3, 4), c(10, TRUE)) ``` ``` #> [1] 3 4 10 1 ``` ```r c(pi) ``` ``` #> [1] 3.141593 ``` --- class: inverse, center, middle # Operators, vectorization, and length coercion --- ## Logical (Boolean) operators | Operator | Operation | Vectorized? |:-----------------------------:|:-------------:|:------------: | <code>x | y</code> | or | Yes | `x & y` | and | Yes | `!x` | not | Yes | <code>x || y</code> | or | No | `x && y` | and | No |`xor(x,y)` | exclusive or | Yes <br><br> -- What do we mean if we say a function or operation is vectorized? --- ## Boolean examples ```r x <- c(T, F, T, T) y <- c(F, F, T, F) ``` -- .pull-left[ ```r !x ``` ``` #> [1] FALSE TRUE FALSE FALSE ``` ```r x | y ``` ``` #> [1] TRUE FALSE TRUE TRUE ``` ```r x || y ``` ``` #> [1] TRUE ``` ] -- .pull-right[ ```r x & y ``` ``` #> [1] FALSE FALSE TRUE FALSE ``` ```r x && y ``` ``` #> [1] FALSE ``` ```r xor(x, y) ``` ``` #> [1] TRUE FALSE FALSE TRUE ``` ] --- ## Comparison operators | Operator | Comparison | Vectorized? |:----------:|:--------------------------:|:----------------: | `x < y` | less than | Yes | `x > y` | greater than | Yes | `x <= y` | less than or equal to | Yes | `x >= y` | greater than or equal to | Yes | `x != y` | not equal to | Yes | `x == y` | equal to | Yes | `x %in% y` | contains | Yes (over `x`) --- ## Comparison examples ```r x <- c(4, 10, -5) y <- c(0, 51, 9 / 5) z <- c("four", "for", "4") ``` -- .pull-left[ ```r x > y ``` ``` #> [1] TRUE FALSE FALSE ``` ```r x != y ``` ``` #> [1] TRUE TRUE TRUE ``` ] -- .pull-right[ ```r x == z ``` ``` #> [1] FALSE FALSE FALSE ``` ```r x %in% z ``` ``` #> [1] TRUE FALSE FALSE ``` ] --- ## What else is vectorized? - Most of the mathematical operators - Many functions in `base` R and created by user's in packages ```r a <- c(0, -3, sqrt(75)) b <- c(1, 3, 2) ``` .pull-left[ ```r a + b ``` ``` #> [1] 1.00000 0.00000 10.66025 ``` ```r a ^ b ``` ``` #> [1] 0 -27 75 ``` ] .pull-right[ ```r rnorm(n = 3, mean = a, sd = b) ``` ``` #> [1] -0.6498462 -3.8206122 11.8322392 ``` ```r exp(a / b) ``` ``` #> [1] 1.0000000 0.3678794 75.9539335 ``` ] --- ## Length coercion (vector recycling) The shorter of two atomic vectors in an operation is recycled until it is the same length as the longer atomic vector. ```r x <- c(2, 4, 6) y <- c(1, 1, 1, 2, 2) ``` .small-text[ ```r x > y ``` ``` #> [1] TRUE TRUE TRUE FALSE TRUE ``` ```r x == y ``` ``` #> [1] FALSE FALSE FALSE TRUE FALSE ``` ```r 10 / x ``` ``` #> [1] 5.000000 2.500000 1.666667 ``` ] --- class: inverse, center, middle # Control flow --- ## Conditional control flow Conditional (choice) control flow is governed by `if` and `switch()`. .pull-left[ ```r if (condition) { # code to run # when condition is # TRUE } ``` ] .pull-right[ ```r if (TRUE) { print("The condition must have been true!") } ``` ] --- ## `if` examples ```r if (1 > 0) { print("Yes, 1 is greater than 0.") } ``` ``` #> [1] "Yes, 1 is greater than 0." ``` -- ```r x <- c(1, 2, 3, 4) if (3 %in% x) { print("Yes, 3 is in x.") } ``` ``` #> [1] "Yes, 3 is in x." ``` -- ```r if (-6) { print("Other types are coerced to logical if possible.") } ``` ``` #> [1] "Other types are coerced to logical if possible." ``` --- ## More `if` examples ```r if (c(F, T, T)) { print("How many logical values can if handle?") } ``` ``` #> Warning in if (c(F, T, T)) {: the condition has length > 1 and only the first element will be used ``` -- ```r x <- c(1, 2, 3, 4) if (x %in% 3) { print("This works?") } ``` ```r if (c(1, 0, 1)) { print("Other types are coerced to logical if possible.") } ``` ``` #> [1] "Other types are coerced to logical if possible." ``` <br/> .small-text[ Note: I suppressed warnings in the last two examples. ] --- ## `if` is not vectorized To remedy this potential problem of a non-vectorized `if`, you can 1. try to collapse a logical vector of length greater than 1 to a logical vector of length 1 with functions - `any()` - `all()` 2. use a vectorized conditional function such as `ifelse()` or `dplyr::case_when()`. --- ## Functions `any()` and `all()` ```r x <- c(-5, 0, 5, 10, 15) any(x >= 5) ``` ``` #> [1] TRUE ``` ```r all(x >= 5) ``` ``` #> [1] FALSE ``` <br/><br/> Functions `any()` and `all()` require a logical vector as input. --- ## Vectorized `if` ```r z <- c(-4:-1, 1:3) z ``` ``` #> [1] -4 -3 -2 -1 1 2 3 ``` ```r ifelse(test = z < 0, yes = "neg", no = "pos") ``` ``` #> [1] "neg" "neg" "neg" "neg" "pos" "pos" "pos" ``` <br> -- ```r set.seed(532) x <- rnorm(n = 4, mean = 0, sd = 1) x ``` ``` #> [1] 3.105059 -1.329432 -1.466140 -0.345289 ``` ```r ifelse(test = abs(x) > 3, yes = "outlier", no = "no outlier") ``` ``` #> [1] "outlier" "no outlier" "no outlier" "no outlier" ``` --- ## Nested conditionals .pull-left[ ```r if (condition_one) { ## ## Code to run ## } else if (condition_two) { ## ## Code to run ## } else { ## ## Code to run ## } ``` ] .pull-right[ ```r x <- 0 if (x < 0) { "Negative" } else if (x > 0) { "Positive" } else { "Zero" } ``` ``` #> [1] "Zero" ``` ] --- class: inverse, middle, center # Error action --- ## Execute error action Functions `stop()` and `stopifnot()` execute an error action. These are useful if you want to validate inputs or function arguments. ```r x <- -1 if (x < 0) { stop("Negative numbers not allowed!") } ``` ``` #> Error in eval(expr, envir, enclos): Negative numbers not allowed! ``` <br> -- ```r x <- c(3, 9, 28) stopifnot(any(x >= 0), all(x %% 3 == 0)) ``` ``` #> Error: all(x%%3 == 0) is not TRUE ``` If any of the expressions in function `stopifnot()` are not `TRUE`, then function `stop()` is called and an error message is shown. --- ## Exercises 1. What do each of the following return? Run the code to check your answer. ```r if (1 == "1") "coercion works" else "no coercion " ifelse(5 > c(1, 10, 2), "hello", "olleh") ``` <br> 2. Consider two vectors, `x` and `y`, each of length one. Write a set of conditionals that satisfy the following. - If `x` is positive and `y` is negative or `y` is positive and `x` is negative, print "knits". - If `x` divided by `y` is positive, print "stink". - Stop execution if `x` or `y` are zero. Test your code with various `x` and `y` values. Where did you place the stop execution code? ??? ## Solutions 1. .solution[ ```r if (1 == "1") "coercion works" else "no coercion " ``` ``` #> [1] "coercion works" ``` ```r ifelse(5 > c(1, 10, 2), "hello", "olleh") ``` ``` #> [1] "hello" "olleh" "hello" ``` ] 2. .solution[ ```r x <- 4 y <- -10 if (x == 0 | y == 0) { stop("One of x or y is 0!") } else if (x / y > 0) { print("stink") } else { print("knits") } ``` ``` #> [1] "knits" ``` ] --- class: inverse, middle, center # Loops --- ## Loop types R supports three types of loops: `for`, `while`, and `repeat`. ```r for (item in vector) { ## ## Iterate this code ## } ``` ```r while (we_have_a_true_condition) { ## ## Iterate this code ## } ``` ```r repeat { ## ## Iterate this code ## } ``` In `repeat`, we will need a `break` statement to end iteration. --- ## `for` loop A `for` loop allows you to iterate code over items in a vector. ```r k <- 0 for (i in c(2, 4, 6, 8)) { print(i ^ 2) k <- k + i ^ 2 } ``` ``` #> [1] 4 #> [1] 16 #> [1] 36 #> [1] 64 ``` -- ```r k ``` ``` #> [1] 120 ``` -- ```r for (i in c(2, 4, 6, 8)) { i ^ 2 } ``` *Automatic printing is turned off inside loops.* ??? .tiny[ ```r i <- 10 for (i in 1:3) { print("Today is Tuesday.") } ``` ``` #> [1] "Today is Tuesday." #> [1] "Today is Tuesday." #> [1] "Today is Tuesday." ``` ```r i ``` ``` #> [1] 3 ``` ] Variable `i` is overwritten when `for` is executed. Thus, `i` is assigned to the current environment. Convention is to use `i`, `j`, `k` as loop items. It will be best to avoid using these as named objects elsewhere in your code. --- ## `while` loop A `while` loop will iterate code until a given condition is `FALSE`. ```r i <- 1 res <- rep(0, 10) res ``` ``` #> [1] 0 0 0 0 0 0 0 0 0 0 ``` -- ```r while (i <= 10) { res[i] <- i ^ 2 i <- i + 1 } res ``` ``` #> [1] 1 4 9 16 25 36 49 64 81 100 ``` Note: `[` `]` are use to index vectors. Indexing begins at 1 in R, not 0. We'll discuss this more when we get to subsetting objects. --- ## `repeat` loop A `repeat` loop will iterate code until a `break` statement is executed. ```r i <- 1 res <- rep(NA, 10) repeat { res[i] <- i ^ 2 i <- i + 1 if (i > 10) {break} } res ``` ``` #> [1] 1 4 9 16 25 36 49 64 81 100 ``` --- ## Loop keywords: `next` and `break` - `next` exits the current iteration and advances the looping index - `break` exits the loop - Both `break` and `next` apply only to the innermost of nested loops. ```r for (i in 1:10) { * if (i %% 2 == 0) {next} print(paste("Number", i, "is odd.")) * if (i %% 7 == 0) {break} } ``` -- ``` #> [1] "Number 1 is odd." #> [1] "Number 3 is odd." #> [1] "Number 5 is odd." #> [1] "Number 7 is odd." ``` --- ## Ancillary loop functions You may want to loop over indices of an object as opposed to the object's values. To do this, consider using one of `length()`, `seq()`, `seq_along()`, and `seq_len()`. .pull-left[ ```r 4:7 ``` ``` #> [1] 4 5 6 7 ``` ```r length(4:7) ``` ``` #> [1] 4 ``` ```r seq(4, 7) ``` ``` #> [1] 4 5 6 7 ``` ] .pull-right[ ```r seq_along(4:7) ``` ``` #> [1] 1 2 3 4 ``` ```r seq_len(length(4:7)) ``` ``` #> [1] 1 2 3 4 ``` ```r seq(4, 7, by = 2) ``` ``` #> [1] 4 6 ``` ] <br> Iterating over `seq_along(x)` is a better option than `1:length(x)`. --- ## Loop tips 1. Preallocate your output object when possible. 2. Don't use a `while` or `repeat` loop if a `for` loop is possible. 3. Don't use any type of loop if vectorization is possible. -- .pull-left[ Slow... ```r a <- c() for (i in seq_len(10)) { a <- c(a, i ^ 3) } ``` ] -- .pull-right[ Faster... ```r a <- numeric(10) for (i in seq_len(10)) { a[i] <- i ^ 3 } ``` ] -- Even faster... ```r (1:10) ^ 3 ``` --- ## Exercises 1. Consider the vector `x` below. ```r x <- c(3, 4, 12, 19, 23, 49, 100, 63, 70) ``` Write R code that prints the perfect squares in `x`. <br><br> 2. Consider `z <- c(-1, .5, 0, .5, 1)`. Write R code that prints the smallest non-negative integer `\(k\)` satisfying the inequality `$$\lvert cos(k) - z \rvert < 0.001$$` for each component of `z`. ??? ## Solution 1. .solution[ ```r x <- c(3, 4, 12, 19, 23, 49, 100, 63, 70) for (i in x) { if (sqrt(i) %% 1) { next } print(i) } ``` ``` #> [1] 4 #> [1] 49 #> [1] 100 ``` ] 2. .solution[ ```r for (z in c(-1, .5, 0, .5, 1)) { k <- 0 while (abs(cos(k) - z) >= .001) { k <- k + 1 } print(k) } ``` ``` #> [1] 22 #> [1] 21766 #> [1] 40459 #> [1] 21766 #> [1] 0 ``` ] --- ## References 1. Deborah Nolan & Duncan Temple Lang (2010) Computing in the Statistics Curricula, The American Statistician, 64:2, 97-107, DOI: 10.1198/tast.2010.09132 2. Grolemund, G., & Wickham, H. (2021). R for Data Science. https://r4ds.had.co.nz/ 3. Piatetsky, G. (2019). Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis. Kdnuggets.com. Retrieved from https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html 4. Wickham, H. (2021). Advanced R. https://adv-r.hadley.nz/