# Introduction & R Fundamentals
## Statistical Computing & Programming
### Shawn Santo
### 05-13-20

---

# Introduction

---

## Who am I?

- Shawn Santo

- [shawn.santo@duke.edu](mailto:shawn.santo@duke.edu)

- Office hours (Zoom)
  - Wednesday 1:00 – 2:00pm
  - Friday 8:30 – 9:30am
  
*All times listed are in Eastern Time.*

---

## Who else is involved?

- [Pierre Gardan](https://stat.duke.edu/people/pierre-gardan-0)
  - [pierre.gardan@duke.edu](mailto:pierre.gardan@duke.edu)
  - Office hours (Zoom): Monday 10:00 - 11:00am, Thursday 3:30 - 4:30pm
  
- Abbey List
  - [abbey.list@duke.edu](mailto:abbey.list@duke.edu)
  - Office hours (Zoom): Tuesday 1:00 - 2:00pm, Thursday 1:00 - 2:00pm
  
- [Bo Liu](https://stat.duke.edu/people/bo-liu-0)
  - [bo.liu1997@duke.edu](mailto:bo.liu1997@duke.edu)
  - Office hours (Zoom): Monday 4:30 - 5:30pm, Friday 3:00 - 4:00pm
  
*All times listed are in Eastern Time.*

---

## What is statistical computing / programming?

.middle.center[

<img src="images/statistical-computing-venn.png" height="400px">
]

*Source:* Deborah Nolan & Duncan Temple Lang (2010) Computing in the Statistics 
Curricula, The American Statistician, 64:2, 97-107, DOI: 10.1198/tast.2010.09132

---

## What you will learn

- Fundamentals of R

- S3 objects

- Data visualization with package `ggplot2`

- Package `tidyverse`

- Web scraping

- Web based applications with RShiny

- Wrangling and managing big data

- SQL and databases

- Data types and functions

- Parallelization

- Git and GitHub

- Shell

- Reproducible reports with R Markdown

- Debugging and testing

- Spark

- Make

[Full course schedule](http://www2.stat.duke.edu/courses/Summer20/sta323.001-1/schedule.html)

---

## Why this class matters

.middle.center[

![](images/data-science-popularity.jpg)
]

*Source:* https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html

---

## Why this class matters

### Some 2020 internships:

- Mayo Clinic: Interns will work with statisticians,
bioinformaticists, and clinical investigators on research projects in areas 
such as clinical trials, statistical genetics, and bioinformatics. 
Experience with SAS and/or R preferred.

- Netflix (Science and Analytics): Comfortable coding
in at least one language (e.g., R, Python, Java, Scala, C++), experience
preferred with version control (e.g., git), great communication skills,
both oral and written.

- Two Sigma: 
 Use the scientific method to develop sophisticated investment models and 
shape our insights into how the markets will behave. Create and test complex 
investment ideas and partner with our engineers to test your theories.
You should possess the following qualifications: Demonstrate intermediate 
skills in at least one programming language, performed an in-depth research 
project, examining real-world data, are an independent thinker who can 
creatively approach data analysis and communicate complex ideas clearly.

*Source:* https://stattrak.amstat.org/2019/12/01/2020-internship-listings/

---

## Why this class matters

.middle.center[

]

100 data science job ads from LinkedIn across USA, UK, Canada, and Australia
between April 22, 2019 and May 5, 2019

*Source*: https://towardsdatascience.com/which-programming-language-should-data-scientists-learn-first-aac4d3fd3038

---

## Text toolkit

.middle.center[

![](images/texts.png)
]

These are recommended textbooks - **all are available for free online**. 
There is no required textbook for this course.

---

## Software toolkit

.middle.center[

![](images/software.png)
]

---

## Course structure

This class is about you doing as opposed to you just watching or listening. 
Video lectures and labs will be interactive. My role as instructor is to 
introduce you to new tools and techniques, but it is up to you to take them 
and make use of them. If you only read the code and never run it or experiment 
with it, then you will not get much out of this course. Most slides will 
include supplemental resources for you to delve deeper into the topic of 
discussion. Occasionally, there will be readings assigned.

To be successful in this course as an undergraduate student, you will need to 
commit up to 20 hours per week of your time. If you are a graduate student, 
you need to commit up to 25 hours per week of your time. In this online summer 
version, we will cover the same topics and at the same depth as what is 
covered in the typical 15-week semester. I have detailed a course schedule 
that you should follow in order to be successful in this course.
      
---

## Grading

| Grade Item | Percentage |
|-----------:|:----------:|
| Homework | 40% |
| Exam | 25% |
| Project | 25% |
| Labs | 10% |
 
 
 
The exact ranges for letter grades may be curved and cutoffs will be determined 
at the end of the semester. However, if you have a cumulative numerical average 
of 90 - 100, you are guaranteed at least an A-, 80 - 89 at least a B-, 70 - 79 
at least a C-, and so on.

---

## Teams

I will construct teams based on the results of a survey you complete.

Team expectations:

- Each member must commit to giving equal effort.

- Each member must read, run, and understand all code in a final submission.

- Each member must honestly complete the intragroup peer evaluation.

---

## Policies - sharing / reusing code

- Similar reproducible examples (reprex) exist online that will help you 
  answer many of the questions posed on labs and homework assignments. Use of 
  these resources is allowed unless the assignment explicitly says otherwise.

- You must always cite any code you copy or use as inspiration. Copied code
  without citation is plagiarism and will result in a 0 for the assignment.
  There may also be additional punitive measures taken depending on the
  severity of plagiarism.
  
- Copying and citing a large amount of code to satisfy a main objective of an
  assignment will result in a 0 for the assignment.

- Discussion (not code sharing / copying) with other students and groups is 
  allowed unless the assignment explicitly says otherwise.

- Carefully read each assignment so you know what is permitted and 
  what is not. If you are ever unsure what is allowed, please ask me or one
  of the TAs.

---

## Getting help

- Post your content- and course-related questions on Slack

- Set up a Zoom meeting with me or one of the TAs

- Email me or one of the TAs

---

## Links to bookmark

- Course page: http://www2.stat.duke.edu/courses/Summer20/sta323.001-1/

- GitHub organization: https://github.com/sta323-523-su20

- To access the DSS RStudio servers, use `AnyConnect` to connect to Duke's VPN.
  Then navigate to:

    - Server http://pawn.stat.duke.edu:8787/ for undergraduate students
    - Server http://rook.stat.duke.edu:8787/ for graduate students

---

## To do list

Before moving to the next section, please

1. create a GitHub account at https://github.com/join,

2. join Slack,

3. verify you can log in to the Department's RStudio servers,

4. complete the first-day survey.

Links for the necessary items were sent via email on 05-12-20.

---

# Fundamentals of R

---

## Supplementary materials

Companion videos

- [Vectors](https://warpwire.duke.edu/w/YcADAA/)
- [Operators, vectorization, and length coercion](https://warpwire.duke.edu/w/W8ADAA/)
- [Control flow](https://warpwire.duke.edu/w/V8ADAA/)
- [Error action](https://warpwire.duke.edu/w/WcADAA/)
- [Loops](https://warpwire.duke.edu/w/Y8ADAA/)
- [Introduction to using RMarkdown](https://warpwire.duke.edu/w/XcADAA/)

Additional resources

- [Google’s R Style Guide](https://google.github.io/styleguide/Rguide.html)
- [Hadley's R Style Guide](http://r-pkgs.had.co.nz/style.html)
- [Sections 3.1 – 3.2](https://adv-r.hadley.nz/vectors-chap.html) Advanced R
- [Chapter 5](https://adv-r.hadley.nz/control-flow.html) Advanced R

---

# Vectors

---

## Vectors

The fundamental building blocks of data in R are vectors (collections of related 
values, objects, other data structures, etc.).

R has two types of vectors:

* **atomic** vectors

    - homogeneous collections of the *same* type (e.g. all logical values, 
      all numbers, or all character strings).

* **generic** vectors (lists)

    - heterogeneous collections of *any* type of R object, even other lists 
      (meaning they can have a hierarchical/tree-like structure).

I will use the term component or element when referring to a value
inside a vector.
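
As a minimal sketch of the distinction (the object names `v` and `l` are 
made up for illustration), an atomic vector reports a single type, while a 
generic vector (a list) lets each component keep its own type:

```r
# An atomic vector holds values of a single type.
v <- c(1.5, 2, 3)
typeof(v)       # "double"
is.atomic(v)    # TRUE

# A generic vector (list) can mix types and even contain other lists.
l <- list(1L, "two", c(TRUE, FALSE), list(3.0, "four"))
typeof(l)       # "list"
is.atomic(l)    # FALSE

# Each component of the list keeps its own type.
sapply(l, typeof)
```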

---

## Vector interrelationships

.middle.center[

]

*Source*: https://r4ds.had.co.nz/vectors.html

---

## Atomic vectors

R has six atomic vector types: 
  
.center[
`logical`,  `double`, `integer`, `character`, `complex`, `raw`      
]

In this course we will mostly work with the first four. You will rarely work
with the last two types - complex and raw.

```r
x <- c(T, F, TRUE, FALSE)
typeof(x)
```

```
#> [1] "logical"
```

```r
y <- c("a", "few", "more", "slides")
typeof(y)
```

```
#> [1] "character"
```

---

## Coercion hierarchy

If you try to combine components of different types into a single atomic vector, 
R will coerce all elements to the most flexible type among them. For the four 
common types, the order from least to most flexible is logical, integer, 
double, character.

```r
x <- c(T, 5, F, 0, 1)
y <- c("a", 1, T)
z <- c(3.0, 4L, 0L)
```

--
.pull-left[

```r
x
```

```
#> [1] 1 5 0 0 1
```

```r
y
```

```
#> [1] "a"    "1"    "TRUE"
```

```r
z
```

```
#> [1] 3 4 0
```
]

.pull-right[

```r
typeof(x)
```

```
#> [1] "double"
```

```r
typeof(y)
```

```
#> [1] "character"
```

```r
typeof(z)
```

```
#> [1] "double"
```
]

---

## Concatenation

One way to construct atomic vectors is with function `c()`.

```r
c(1, 0, 1, 1, 6)
```

```
#> [1] 1 0 1 1 6
```

```r
c(c(3, 4), c(10, TRUE))
```

```
#> [1]  3  4 10  1
```

```r
c(pi)
```

```
#> [1] 3.141593
```

---

# Operators, vectorization, and length coercion

---

## Logical (Boolean) operators

| Operator | Operation | Vectorized? 
|:-----------------------------:|:-------------:|:------------:
| <code>x &#124; y</code> | or | Yes 
| `x & y` | and | Yes 
| `!x` | not | Yes 
| <code>x &#124;&#124; y</code> | or | No 
| `x && y` | and | No 
| `xor(x, y)` | exclusive or | Yes

What do we mean if we say a function or operation is vectorized?
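
One way to answer this, as a small sketch with made-up vectors `x` and `y`: 
a vectorized operation works on whole vectors element by element in a single 
call, so you do not have to spell out each position yourself.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Vectorized: & pairs x[1] with y[1], x[2] with y[2], and so on.
vectorized <- x & y
vectorized

# The same result written out one element at a time.
by_hand <- c(x[1] & y[1], x[2] & y[2], x[3] & y[3])
by_hand
```

The non-vectorized `&&` and `||` are meant for single `TRUE` / `FALSE` values, 
such as the condition of an `if` statement.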

---

## Boolean examples

```r
x <- c(T, F, T, T)
y <- c(F, F, T, F)
```

```r
!x
```

```
#> [1] FALSE  TRUE FALSE FALSE
```

```r
x | y
```

```
#> [1]  TRUE FALSE  TRUE  TRUE
```

```r
x || y  # uses only x[1] and y[1]; an error in R >= 4.3.0
```

```
#> [1] TRUE
```

```r
x & y
```

```
#> [1] FALSE FALSE  TRUE FALSE
```

```r
x && y  # uses only x[1] and y[1]; an error in R >= 4.3.0
```

```
#> [1] FALSE
```

```r
xor(x, y)
```

```
#> [1]  TRUE FALSE FALSE  TRUE
```

---

## Comparison operators

| Operator | Comparison | Vectorized?
|:----------:|:--------------------------:|:----------------:
| `x < y` | less than | Yes
| `x > y` | greater than | Yes
| `x <= y` | less than or equal to | Yes
| `x >= y` | greater than or equal to | Yes
| `x != y` | not equal to | Yes
| `x == y` | equal to | Yes
| `x %in% y` | is an element of | Yes (over `x`)

---

## Comparison examples

```r
x <- c(4, 10, -5)
y <- c(0, 51, 9 / 5)
z <- c("four", "for", "4")
```

```r
x > y
```

```
#> [1]  TRUE FALSE FALSE
```

```r
x != y
```

```
#> [1] TRUE TRUE TRUE
```

```r
x == z
```

```
#> [1] FALSE FALSE FALSE
```

```r
x %in% z
```

```
#> [1]  TRUE FALSE FALSE
```

---

## What else is vectorized?

- Most of the mathematical operators

- Many functions built into R and created by users in packages

```r
a <- c(0, -3, sqrt(75))
b <- c(1, 3, 2)
```

```r
a + b
```

```
#> [1]  1.00000  0.00000 10.66025
```

```r
a ^ b
```

```
#> [1]   0 -27  75
```

```r
rnorm(n = 3, mean = a, sd = b)
```

```
#> [1]  0.1255788 -4.7301223  7.4654714
```

```r
exp(a / b)
```

```
#> [1]  1.0000000  0.3678794 75.9539335
```

---

## Length coercion (vector recycling)

The shorter of two atomic vectors in an operation is recycled until it is the 
same length as the longer atomic vector.

```r
x <- c(2, 4, 6)
y <- c(1, 1, 1, 2, 2)
```

```r
x > y
```

```
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE
```

```r
x == y
```

```
#> [1] FALSE FALSE FALSE  TRUE FALSE
```

```r
10 / x
```

```
#> [1] 5.000000 2.500000 1.666667
```
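
A short sketch of one detail the outputs above do not show: when the longer 
length is not a multiple of the shorter one, R still recycles but also emits a 
warning. The vector `w` is introduced here only for illustration.

```r
x <- c(2, 4, 6)
y <- c(1, 1, 1, 2, 2)

# x is recycled to c(2, 4, 6, 2, 4); because 5 is not a multiple of 3,
# R also warns about the mismatched lengths.
x + y

# No warning when the longer length is a multiple of the shorter one.
w <- c(1, 2, 3, 4, 5, 6)
x + w   # x is recycled to c(2, 4, 6, 2, 4, 6)
```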

---

# Control flow

---

## Conditional control flow

Conditional (choice) control flow is governed by `if` and `switch()`.

```r
if (condition) {
  # code to run
  # when condition is
  # TRUE
}
```

```r
if (TRUE) {
  print("The condition must have been true!")
}
```
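
Function `switch()` is named above but not shown. A minimal sketch, assuming a 
made-up character value `type` that selects which summary to compute:

```r
x <- c(1, 2, 9)
type <- "median"   # hypothetical choice; try "mean" or "range" as well

result <- switch(type,
  mean   = mean(x),
  median = median(x),
  range  = max(x) - min(x),
  stop("Unknown type: ", type)   # unnamed final argument is the fallback
)
result
```

Only the matching branch is evaluated, so the `stop()` fallback runs only when 
`type` matches none of the names.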

---

## `if` examples

```r
if (1 > 0) {
  print("Yes, 1 is greater than 0.")
}
```

```
#> [1] "Yes, 1 is greater than 0."
```

```r
x <- c(1, 2, 3, 4)
if (3 %in% x) {
 print("Yes, 3 is in x.")
}
```

```
#> [1] "Yes, 3 is in x."
```

```r
if (-6) {
  print("Other types are coerced to logical if possible.")
}
```

```
#> [1] "Other types are coerced to logical if possible."
```

---

## More `if` examples

```r
if (c(F, T, T)) {
  print("How many logical values can if handle?")
}
```

```r
x <- c(1, 2, 3, 4)
if (x %in% 3) {
 print("This works?")
}
```

```r
if (c(1, 0, 1)) {
  print("Other types are coerced to logical if possible.")
}
```

```
#> [1] "Other types are coerced to logical if possible."
```

---

## `if` is not vectorized

Before R 4.2.0, `if` used only the first element of a condition with length
greater than 1 and issued a warning; since R 4.2.0, such a condition is an
error. To handle a logical vector of length greater than 1, you can

1. collapse it to a logical vector of length 1 with functions

    - `any()`
    - `all()`

2. use a vectorized conditional function such as `ifelse()` or
   `dplyr::case_when()`.
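
A brief sketch of the first option, using an illustrative vector `x`: the 
comparison produces one logical value per element, and `any()` or `all()` 
reduces it to the single value that `if` expects.

```r
x <- c(1, 2, 3, 4)

# x %in% 3 has length 4, so collapse it to a single TRUE / FALSE first.
if (any(x %in% 3)) {
  print("At least one element of x equals 3.")
}

if (all(x > 0)) {
  print("Every element of x is positive.")
}
```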

---

## Functions `any()` and `all()`

```r
x <- c(-5, 0, 5, 10, 15)
any(x >= 5)
```

```
#> [1] TRUE
```

```r
all(x >= 5)
```

```
#> [1] FALSE
```

Functions `any()` and `all()` require a logical vector as input.

---

## Vectorized `if`

```r
z <- c(-4:-1, 1:3)
z
```

```
#> [1] -4 -3 -2 -1  1  2  3
```

```r
ifelse(test = z < 0, yes = "neg", no = "pos")
```

```
#> [1] "neg" "neg" "neg" "neg" "pos" "pos" "pos"
```

--

```r
x <- rnorm(n = 4, mean = 0, sd = 1)
x
```

```
#> [1] -0.5738760  0.7314262  0.4771350  1.1926576
```

```r
ifelse(test = abs(x) > 3, yes = "outlier", no = "no outlier")
```

```
#> [1] "no outlier" "no outlier" "no outlier" "no outlier"
```
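
Calls to `ifelse()` can also be nested when there are more than two outcomes. 
A small sketch with a made-up vector `w` that includes zero:

```r
w <- c(-2, 0, 3)

# The inner ifelse() handles every element that is not negative.
sign_label <- ifelse(test = w < 0, yes = "neg",
                     no = ifelse(test = w > 0, yes = "pos", no = "zero"))
sign_label
```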

---

## Nested conditionals

```r
if (condition_one) {
  ##
  ## Code to run
  ##
} else if (condition_two) {
  ##
  ## Code to run
  ##
} else {
  ##
  ## Code to run
  ##
}
```

```r
x <- 0
if (x < 0) {
 "Negative"
} else if (x > 0) {
 "Positive"
} else {
 "Zero"
}
```

```
#> [1] "Zero"
```

---

# Error action

---

## Execute error action

Functions `stop()` and `stopifnot()` execute an error action. These are useful
if you want to validate inputs or function arguments.

```r
x <- -1
if (x < 0) {
 stop("Negative numbers not allowed!")
}
```

```
#> Error in eval(expr, envir, enclos): Negative numbers not allowed!
```

--

```r
x <- c(3, 9, 28)
stopifnot(any(x >= 0), all(x %% 3 == 0))
```

```
#> Error: all(x%%3 == 0) is not TRUE
```

If any of the expressions in function `stopifnot()` are not `TRUE`, then
function `stop()` is called and an error message is shown.
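
To illustrate validating function arguments, here is a minimal sketch. The 
function `geometric_mean()` is hypothetical and written only for this example; 
it checks its input before doing any computation.

```r
# Hypothetical function used only to illustrate input validation.
geometric_mean <- function(x) {
  # Stop early unless x is a non-empty numeric vector of positive values.
  stopifnot(is.numeric(x), length(x) > 0, all(x > 0))
  exp(mean(log(x)))
}

geometric_mean(c(1, 10, 100))   # approximately 10

# geometric_mean(c(1, -10, 100)) would stop with an error because
# all(x > 0) is not TRUE.
```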

---

## Exercises

1. What does each of the following return? Run the code to check your answer.
 
 ```r
 if (1 == "1") "coercion works" else "no coercion "
 
 ifelse(5 > c(1, 10, 2), "hello", "olleh")
 ```

2. Consider two vectors, `x` and `y`, each of length one. Write a set of
   conditionals that satisfy the following.
   
    - If `x` is positive and `y` is negative or `y` is positive and `x` is
      negative, print "knits".
    - If `x` divided by `y` is positive, print "stink".
    - Stop execution if `x` or `y` is zero.
    
  Test your code with various `x` and `y` values. Where did you
  place the stop execution code?
  
???

## Solutions

1.
.solution[

```r
if (1 == "1") "coercion works" else "no coercion "
```

```
#> [1] "coercion works"
```

```r
ifelse(5 > c(1, 10, 2), "hello", "olleh")
```

```
#> [1] "hello" "olleh" "hello"
```
]
2.
.solution[

```r
x <- 4
y <- -10

if (x == 0 | y == 0) {
  stop("One of x or y is 0!")
} else if (x / y > 0) {
  print("stink")
} else {
  print("knits")
}
```

```
#> [1] "knits"
```
]

---

# Loops

---

## Loop types

R supports three types of loops: `for`, `while`, and `repeat`.

```r
for (item in vector) {
  ##
  ## Iterate this code
  ##
}
```

```r
while (we_have_a_true_condition) {
  ##
  ## Iterate this code
  ##
}
```

```r
repeat {
  ##
  ## Iterate this code
  ##
}
```

In the `repeat` loop we will need a `break` statement to end iteration.

---

## `for` loop

A `for` loop allows you to iterate code over items in a vector.

```r
k <- 0
for (i in c(2, 4, 6, 8)) {
  print(i ^ 2)
  k <- k + i ^ 2
}
```

```
#> [1] 4
#> [1] 16
#> [1] 36
#> [1] 64
```

```r
k
```

```
#> [1] 120
```

```r
for (i in c(2, 4, 6, 8)) {
  i ^ 2
}
```

*Automatic printing is turned off inside loops.*

???
.tiny[

```r
i <- 10
for (i in 1:3) {
 print("Today is Tuesday.")
}
```

```
#> [1] "Today is Tuesday."
#> [1] "Today is Tuesday."
#> [1] "Today is Tuesday."
```

```r
i
```

```
#> [1] 3
```
]

Variable `i` is overwritten when `for` is executed. Thus, `i` is assigned
to the current environment. Convention is to use `i`, `j`, `k` as loop items.
It will be best to avoid using these as named objects elsewhere in your code.

---

## `while` loop

A `while` loop will iterate code until a given condition is `FALSE`.

```r
i <- 1
res <- rep(0, 10)

i
```

```
#> [1] 1
```

```r
res
```

```
#>  [1] 0 0 0 0 0 0 0 0 0 0
```

```r
while (i <= 10) {
 res[i] <- i ^ 2
 i <- i + 1
}

res
```

```
#>  [1]   1   4   9  16  25  36  49  64  81 100
```

---

## `repeat` loop

A `repeat` loop will iterate code until a `break` statement is executed.

```r
i <- 1
res <- rep(NA, 10)

repeat {
 res[i] <- i ^ 2
 i <- i + 1
 if (i > 10) {break}
}

res
```

```
#>  [1]   1   4   9  16  25  36  49  64  81 100
```

---

## Loop keywords: `next` and `break`

- `next` exits the current iteration and advances the looping index

- `break` exits the loop

- Both `break` and `next` apply only to the innermost of nested loops.

```r
for (i in 1:10) {
  if (i %% 2 == 0) {next}   # skip even numbers
  
  print(paste("Number ", i, " is odd."))
  
  if (i %% 7 == 0) {break}  # stop once a multiple of 7 is reached
}
```

```
#> [1] "Number  1  is odd."
#> [1] "Number  3  is odd."
#> [1] "Number  5  is odd."
#> [1] "Number  7  is odd."
```
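
The third point can be sketched with two nested loops. The counter `count` is 
added only to make the effect easy to check: `break` leaves the inner loop 
over `j`, while the outer loop over `i` keeps running.

```r
count <- 0
for (i in 1:3) {
  for (j in 1:3) {
    if (j == 2) {break}   # exits only the inner loop over j
    print(paste("i =", i, "j =", j))
    count <- count + 1
  }
}
count
```

The body prints once for each value of `i`, always with `j` equal to 1.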

---

## Ancillary loop functions

You may want to loop over indices of an object as opposed to the object's values.
To do this, consider using one of `length()`, `seq()`, `seq_along()`, or
`seq_len()`.

```r
4:7
```

```
#> [1] 4 5 6 7
```

```r
length(4:7)
```

```
#> [1] 4
```

```r
seq(4, 7)
```

```
#> [1] 4 5 6 7
```

```r
seq_along(4:7)
```

```
#> [1] 1 2 3 4
```

```r
seq_len(length(4:7))
```

```
#> [1] 1 2 3 4
```

```r
seq(4, 7, by = 2)
```

```
#> [1] 4 6
```

Iterating over `seq_along(x)` is a better option than `1:length(x)`.
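
A short sketch of why, using an empty vector as the edge case (the counter 
`iterations` is only for illustration):

```r
x <- numeric(0)   # an empty vector, length 0

1:length(x)       # 1 0, counts down, so a loop would run twice
seq_along(x)      # integer(0), so a loop would run zero times

iterations <- 0
for (i in seq_along(x)) {
  iterations <- iterations + 1
}
iterations        # still 0
```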

---

## Loop tips

1. Preallocate your output object when possible.

2. Don't use a `while` or `repeat` loop if a `for` loop is possible.

3. Don't use any type of loop if vectorization is possible.

```r
a <- c()
for (i in seq_len(10)) {
 a <- c(a, i ^ 3)
}
```

```r
a <- numeric(10)
for (i in seq_len(10)) {
 a[i] <- i ^ 3
}
```

Even faster...

```r
(1:10) ^ 3
```

---

## Exercises

1. Consider the vector `x` below.
 
 ```r
 x <- c(3, 4, 12, 19, 23, 49, 100, 63, 70)
 ```
Write R code that prints the perfect squares in `x`.
 
2. Consider `z <- c(-1, .5, 0, .5, 1)`. For each component $z_i$ of `z`, write
 R code that prints the smallest non-negative integer $k$ satisfying the
 inequality
 $$\lvert \cos(k) - z_i \rvert < 0.001$$

???

## Solutions

1.
.solution[

```r
x <- c(3, 4, 12, 19, 23, 49, 100, 63, 70)

for (i in x) {
  if (sqrt(i) %% 1 != 0) {  # square root is not a whole number
    next
  }
  print(i)
}
```

```
#> [1] 4
#> [1] 49
#> [1] 100
```
]

2.
.solution[

```r
for (z in c(-1, .5, 0, .5, 1)) {
  k <- 0
  while (abs(cos(k) - z) >= .001) {
    k <- k + 1
  }
  print(k)
}
```

```
#> [1] 22
#> [1] 21766
#> [1] 40459
#> [1] 21766
#> [1] 0
```
]

---

## References

- Deborah Nolan & Duncan Temple Lang (2010) Computing in the Statistics 
  Curricula, The American Statistician, 64:2, 97-107, 
  DOI: 10.1198/tast.2010.09132

- Piatetsky, G. (2019). Python leads the 11 top Data Science, Machine Learning 
  platforms: Trends and Analysis. Kdnuggets.com. Retrieved 21 August 2019, from
  https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html

- Which Programming Language Should Data Scientists Learn First? (2019). 
  Medium. 
  https://towardsdatascience.com/which-programming-language-should-data-scientists-learn-first-aac4d3fd3038
  
- Grolemund, G., & Wickham, H. (2019). R for Data Science. https://r4ds.had.co.nz/

- Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/