# Code Style and Data Types
## Intro to Data Science
### Yue Jiang
### 02.05.20

---

# Announcements

- HW 02

- Follow along with the second half of today's lecture at [https://classroom.github.com/a/DGPcCDWK](https://classroom.github.com/a/DGPcCDWK)

---

# Coding style

---

## Style guide

>"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
>
>Hadley Wickham

- Style guide for this course is based on the Tidyverse style guide: 
  [http://style.tidyverse.org/](http://style.tidyverse.org/)

- There's more to it than what we'll cover today. We'll mention more as we 
  introduce more functionality throughout the semester.

---

## File names and code chunk labels

- Do not use spaces in file names; use `-` or `_` to separate words.

- Use all lowercase letters.

```r
# Good
ucb-admit.csv

# Bad
UCB Admit.csv
```

---

## Assignment, object creation

Use `<-`, not `=`

```r
# Good
x <- 2

# Bad
x = 2
```

In an `R` chunk, Windows users may use Alt and - together (the hyphen key) as a
shortcut. Mac users may use Option and -.

---

## Object names

- Use an `_` to separate words in object names.

- Use informative but short object names.

- Do not reuse object names within an analysis.

- Don't choose existing function names or names that have special meaning in
  R such as `NA`, `T`, `NaN`, `pi`, etc.

```r
# Good
acs_employed

# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
mean
NA
log
```
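A minimal sketch of why names with special meaning are risky, using `T` as the example: unlike `TRUE`, `T` is an ordinary variable and can be silently overwritten. The helper names `masked` and `restored` are made up for illustration.

```r
# TRUE is a reserved word, but T is just a variable that defaults to TRUE.
T <- 0                  # legal, and now T no longer means TRUE
masked <- isTRUE(T)     # FALSE: T currently holds 0
rm(T)                   # delete our copy; base R's T is visible again
restored <- isTRUE(T)   # TRUE
```

Any later code that relied on `T` meaning `TRUE` would have misbehaved between the first and third lines.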

---

## Spacing

- Put a space before and after all infix operators (`=`, `+`, `-`, `<-`, etc.), 
  including the `=` used when naming arguments in function calls.

- Always put a space after a comma, and never before 
  (just like in regular English).

```r
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```

---

## `ggplot2`

- Always end a line with `+`; never start a new line with it

- Always indent the next line (this should happen automatically)

```r
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()

# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
```

---

## Long lines

- Try to limit your code to 80 characters per line. This fits comfortably on a 
  printed page with a reasonably sized font.
    - You can add a margin line to help you stay within 80 characters.
      <br/><br/>
      `Tools -> Global Options -> Code -> Display -> Check "Show Margin"`

- Take advantage of RStudio editor's auto formatting for indentation at line 
  breaks.
  
- We *will* be taking off points if your code "runs off the page"!

---

## Quotes

Use `" "`, not `' '`, for quoting text. The only exception is when the text 
already contains double quotes and no single quotes.

```r
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  labs(title = "Shine bright like a diamond", # Good
       x = "Diamond prices",                  # Good
       y = 'Frequency')                       # Bad
```
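The exception can be illustrated with a short sketch; the strings below are made up for this example.

```r
# Default: double quotes.
axis_label <- "Diamond prices"

# Exception: the text itself contains double quotes (and no single quotes),
# so single quotes avoid having to escape them.
plot_title <- 'The "4 Cs" of diamonds'

# Either way the result is an ordinary character string.
typeof(plot_title)
```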

---

# Data types

---

## Data types in R

- **logical**

- **double**

- **integer**

- **character**

- **lists**

<br/>

More exist, but we'll focus on the first four. If you end up needing help with
lists, come see me! We won't officially cover them in class.

---

## Logical & character

**logical** - boolean values `TRUE` and `FALSE`, or `T` and `F`

```r
typeof(TRUE)
```

```
#> [1] "logical"
```

```r
typeof(F)
```

```
#> [1] "logical"
```

**character** - character strings

```r
typeof("hello")
```

```
#> [1] "character"
```

```r
typeof("T")
```

```
#> [1] "character"
```

---

## Double & integer

**double** - floating point numerical values (default numerical type)

```r
typeof(1.335)
```

```
#> [1] "double"
```

```r
typeof(7)
```

```
#> [1] "double"
```

**integer** - integer numerical values (created with an `L` suffix, or by `:` sequences such as `1:3`)

```r
typeof(7L)
```

```
#> [1] "integer"
```

```r
x <- 1:3
x
```

```
#> [1] 1 2 3
```

```r
typeof(x)
```

```
#> [1] "integer"
```
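A short sketch of how the two numeric types interact in arithmetic; the numbers are arbitrary.

```r
typeof(7L + 1L)   # "integer": both operands are integers
typeof(7L + 1)    # "double": the plain 1 is a double, so the sum is too
typeof(7L / 2L)   # "double": division always returns a double
7L == 7           # TRUE: the values are equal...
identical(7L, 7)  # FALSE: ...but the types differ
```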

---

## Vectors

A vector is a collection of elements that are all the same data type.

- Vectors can be constructed using the `c()` function for concatenation.

- All elements of a vector must be the same data type. Implicit coercion will
  happen if you mix types.

```r
c(1, 2, 3)
```

```
#> [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
#> [1] "Hello"  "World!"
```

```r
c(1, c(2, c(3)))
```

```
#> [1] 1 2 3
```

---

## Coercion

R is a dynamically typed language: it will readily convert between types when needed.

```r
x <- c(1, "Hello")
```

```r
x
```

```
#> [1] "1"     "Hello"
```

```r
y <- c(1, -4:-8, TRUE, FALSE)
```

```r
y
```

```
#> [1]  1 -4 -5 -6 -7 -8  1  0
```

In general, R will convert all values of a vector to the simplest type needed 
to represent all the information.

---

## Missing values

R uses `NA` to represent missing values in its data structures.

```r
typeof(NA)
```

```
#> [1] "logical"
```

```r
c(4, NA, pi, log(2))
```

```
#> [1] 4.0000000        NA 3.1415927 0.6931472
```

```r
c(NA, NA, NA)
```

```
#> [1] NA NA NA
```
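Although a bare `NA` is logical, it adopts the type of the vector it sits in, and it propagates through most calculations. A minimal sketch with made-up values (`scores` is a hypothetical name):

```r
typeof(c(4, NA))            # "double": NA takes on the type of its neighbours
typeof(c("a", NA))          # "character"
NA > 1                      # NA: a comparison with an unknown value is unknown

scores <- c(4, NA, 10)
mean(scores)                # NA: one missing value makes the mean unknown
mean(scores, na.rm = TRUE)  # 7: drop missing values before averaging
sum(is.na(scores))          # 1: count the missing values
```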

---

## Other special values

`NaN`: Not a number

`Inf`: Positive infinity

`-Inf`: Negative infinity

```r
-pi / 0
```

```
#> [1] -Inf
```

```r
0 / 0
```

```
#> [1] NaN
```

```r
1/0 + 1/0
```

```
#> [1] Inf
```

```r
1/0 - 1/0
```

```
#> [1] NaN
```

```r
NaN / NA
```

```
#> [1] NaN
```

```r
NaN * NA
```

```
#> [1] NaN
```
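Base R provides predicate functions for telling these special values apart. A small sketch, where `special` is a made-up vector:

```r
special <- c(1, NA, NaN, Inf, -Inf)

is.na(special)      # FALSE  TRUE  TRUE FALSE FALSE (NaN also counts as missing)
is.nan(special)     # FALSE FALSE  TRUE FALSE FALSE (NA is not NaN)
is.finite(special)  #  TRUE FALSE FALSE FALSE FALSE
```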

---

## Activity

What is the type of the following vectors? Explain why they have that type.

1. `c(1, NA+1L, "C")`

2. `c(1L / 0, NA)`

3. `c(1:3, 5)`

4. `c(3L, NaN+1L)`

5. `c(NA, TRUE)`

**Did you figure out R's coercion hierarchy?**
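One way to check your answers is to ask R directly. The sketch below uses simpler vectors than the activity's, mixing two types at a time:

```r
typeof(c(TRUE, 1L))    # "integer":   logical is promoted to integer
typeof(c(1L, 2.5))     # "double":    integer is promoted to double
typeof(c(2.5, "a"))    # "character": double is promoted to character
typeof(c(TRUE, "a"))   # "character"
c(TRUE, "a")           # "TRUE" "a"
```

Read top to bottom, the results suggest the ordering logical, integer, double, character: a mixed vector takes the type furthest along that list.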

---

## Example: Cat lovers

A survey asked respondents their name and number of cats. The instructions 
said to enter the number of cats as a numerical value. The full table of data
is available on the next slide.

```r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

---

|    | name                | number_of_cats                                      | handedness   |
|----|---------------------|-----------------------------------------------------|--------------|
| 1  | Bernice Warren      | 0                                                   | left         |
| 2  | Woodrow Stone       | 0                                                   | left         |
| 3  | Willie Bass         | 1                                                   | left         |
| 4  | Tyrone Estrada      | 3                                                   | left         |
| 5  | Alex Daniels        | 3                                                   | left         |
| 6  | Jane Bates          | 2                                                   | left         |
| 7  | Latoya Simpson      | 1                                                   | left         |
| 8  | Darin Woods         | 1                                                   | left         |
| 9  | Agnes Cobb          | 0                                                   | left         |
| 10 | Tabitha Grant       | 0                                                   | left         |
| 11 | Perry Cross         | 0                                                   | left         |
| 12 | Wanda Silva         | 0                                                   | left         |
| 13 | Alicia Sims         | 1                                                   | left         |
| 14 | Emily Logan         | 3                                                   | right        |
| 15 | Woodrow Elliott     | 3                                                   | right        |
| 16 | Brent Copeland      | 2                                                   | right        |
| 17 | Pedro Carlson       | 1                                                   | right        |
| 18 | Patsy Luna          | 1                                                   | right        |
| 19 | Brett Robbins       | 0                                                   | right        |
| 20 | Oliver George       | 0                                                   | right        |
| 21 | Calvin Perry        | 1                                                   | right        |
| 22 | Lora Gutierrez      | 1                                                   | right        |
| 23 | Charlotte Sparks    | 0                                                   | right        |
| 24 | Earl Mack           | 0                                                   | right        |
| 25 | Leslie Wade         | 4                                                   | right        |
| 26 | Santiago Barker     | 0                                                   | right        |
| 27 | Jose Bell           | 0                                                   | right        |
| 28 | Lynda Smith         | 0                                                   | right        |
| 29 | Bradford Marshall   | 0                                                   | right        |
| 30 | Irving Miller       | 0                                                   | right        |
| 31 | Caroline Simpson    | 0                                                   | right        |
| 32 | Frances Welch       | 0                                                   | right        |
| 33 | Melba Jenkins       | 0                                                   | right        |
| 34 | Veronica Morales    | 0                                                   | right        |
| 35 | Juanita Cunningham  | 0                                                   | right        |
| 36 | Maurice Howard      | 0                                                   | right        |
| 37 | Teri Pierce         | 0                                                   | right        |
| 38 | Phil Franklin       | 0                                                   | right        |
| 39 | Jan Zimmerman       | 0                                                   | right        |
| 40 | Leslie Price        | 0                                                   | right        |
| 41 | Bessie Patterson    | 0                                                   | right        |
| 42 | Ethel Wolfe         | 0                                                   | right        |
| 43 | Naomi Wright        | 1                                                   | right        |
| 44 | Sadie Frank         | 3                                                   | right        |
| 45 | Lonnie Cannon       | 3                                                   | right        |
| 46 | Tony Garcia         | 2                                                   | right        |
| 47 | Darla Newton        | 1                                                   | right        |
| 48 | Ginger Clark        | 1.5 - honestly I think one of my cats is half human | right        |
| 49 | Lionel Campbell     | 0                                                   | right        |
| 50 | Florence Klein      | 0                                                   | right        |
| 51 | Harriet Leonard     | 0                                                   | right        |
| 52 | Terrence Harrington | 0                                                   | right        |
| 53 | Travis Garner       | 1                                                   | right        |
| 54 | Doug Bass           | three                                               | right        |
| 55 | Pat Norris          | 1                                                   | right        |
| 56 | Dawn Young          | 1                                                   | ambidextrous |
| 57 | Shari Alvarez       | 1                                                   | ambidextrous |
| 58 | Tamara Robinson     | 0                                                   | ambidextrous |
| 59 | Megan Morgan        | 0                                                   | ambidextrous |
| 60 | Kara Obrien         | 2                                                   | ambidextrous |

---

## Why isn't this working?

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
#> # A tibble: 1 x 1
#>   mean_cats
#>       <dbl>
#> 1        NA
```

---

## Why is this still not working?

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
#> # A tibble: 1 x 1
#>   mean_cats
#>       <dbl>
#> 1        NA
```
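The behaviour can be reproduced in base R without the survey file. This is a sketch on a made-up vector, `cats_chr`, standing in for a column that was read in as text:

```r
# A few responses stored as text, one of them not a number at all.
cats_chr <- c("0", "1", "three", "2")

# mean() cannot average text: it returns NA and warns that the
# "argument is not numeric or logical". suppressWarnings() only hides
# that warning so the script runs quietly.
suppressWarnings(mean(cats_chr))                # NA
suppressWarnings(mean(cats_chr, na.rm = TRUE))  # still NA
anyNA(cats_chr)                                 # FALSE: nothing is missing
```

`na.rm = TRUE` only removes missing values, and there are none here; the problem is the type of the data, not missingness.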

---

## Let's look at the data...

What is the type of the `number_of_cats` variable?

```r
glimpse(cat_lovers)
```

```
#> Observations: 60
#> Variables: 3
#> $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass", "T...
#> $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0", "0...
#> $ handedness     <chr> "left", "left", "left", "left", "left", "left", "lef...
```

---

## Let's take another look

```r
cat_lovers %>%
  count(number_of_cats)
```

```
#> # A tibble: 7 x 2
#>   number_of_cats                                          n
#>   <chr>                                               <int>
#> 1 0                                                      33
#> 2 1                                                      14
#> 3 1.5 - honestly I think one of my cats is half human     1
#> 4 2                                                       4
#> 5 3                                                       6
#> 6 4                                                       1
#> 7 three                                                   1
```

We need to fix two entries.

---

## Fix data entry errors

```r
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
#> # A tibble: 1 x 1
#>   mean_cats
#>       <dbl>
#> 1     0.817
```

---

## Now that we know what we're doing...

```r
cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )
```
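The same "repair the text, then convert" idea can be sketched in base R. The vector `responses` is hypothetical, and the long survey entry is shortened for illustration:

```r
responses <- c("0", "three", "2", "1.5 - half human")

# Step 1: find the entries that fail to convert. as.numeric() warns and
# returns NA for them; suppressWarnings() hides the warning.
converted <- suppressWarnings(as.numeric(responses))
responses[is.na(converted)]          # "three" "1.5 - half human"

# Step 2: repair the text by hand, mirroring the case_when() recoding.
responses[responses == "three"] <- "3"
responses[responses == "1.5 - half human"] <- "2"

# Step 3: convert once every entry is a clean number.
number_of_cats <- as.numeric(responses)
number_of_cats                       # 0 3 2 2
```

Fixing the strings first means the final conversion runs without warnings, so any warning that does appear signals a new data entry problem.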

---

## Moral of the story

- If your data does not behave how you expect it to, type coercion upon 
  reading in the data might be the reason.

- Go in and investigate your data, apply the fix, then **save your data**.

---

## Subsetting vectors

```r
x <- c(8, 4, 7, -1, 0, 100)
```

```r
x[1]
```

```
#> [1] 8
```

```r
x[c(1, 4)]
```

```
#> [1]  8 -1
```

```r
x[c(TRUE, FALSE)]
```

```
#> [1] 8 7 0
```

<br/>

**Note:** When using tidyverse code you'll rarely need to refer to elements 
using square brackets, but it's good to be aware of this syntax, especially 
since you might encounter it when searching for help online.
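A few more forms of square-bracket subsetting that you may encounter, sketched on the same vector:

```r
x <- c(8, 4, 7, -1, 0, 100)

x[c(TRUE, FALSE)]  # 8 7 0: the short index is recycled to T F T F T F
x[-1]              # 4 7 -1 0 100: a negative index drops that position
x[x > 5]           # 8 7 100: a logical condition keeps matching elements
```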

---

# Data "set"

---

## Data "sets" in R

- "set" is in quotation marks because it is not a formal data class

- A tidy data "set" can be one of the following types:
    + `tibble`
    + `data.frame`
    
---

## Data frames & tibbles

- A `data.frame` is the most commonly used data structure in R; it is just a 
  list of equal-length vectors. Each vector is treated as a column, and the 
  elements of the vectors as rows.

- A tibble is a type of data frame that makes data analysis easier.

- Most often a data frame will be constructed by reading in data from a file, 
  but we can also create them from scratch.
    - `readr` package (e.g. `read_csv` function) loads data as a `tibble` by 
      default
    - `tibble`s are part of the tidyverse, so they work well with other 
      packages we are using
    - they make minimal assumptions about your data, so are less likely to 
      cause hard-to-track bugs in your code
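The "list of equal-length vectors" description can be checked directly. This sketch uses a base R `data.frame` (named `df_base` here for illustration) so that it runs without loading any packages:

```r
# stringsAsFactors = FALSE is spelled out so the result is the same in
# every version of R.
df_base <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

typeof(df_base)   # "list": a data frame is stored as a list
length(df_base)   # 2: one list element per column
nrow(df_base)     # 3: the common length of the column vectors
df_base[["y"]]    # "a" "b" "c": each element is an ordinary vector
```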

---

## Creating data frames

```r
df <- tibble(x = 1:3, y = c("a", "b", "c"))
class(df)
```

```
#> [1] "tbl_df"     "tbl"        "data.frame"
```

```r
glimpse(df)
```

```
#> Observations: 3
#> Variables: 2
#> $ x <int> 1, 2, 3
#> $ y <chr> "a", "b", "c"
```

---

## Features of data frames

```r
attributes(df)
```

```
#> $names
#> [1] "x" "y"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
```

```r
class(df$x)
```

```
#> [1] "integer"
```

```r
class(df$y)
```

```
#> [1] "character"
```

---

## Working with tibbles in pipelines

**How many respondents have a below average number of cats?**

```r
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))

cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
#> [1] 60
```

---

## A result of a pipeline is almost always a tibble

```r
mean_cats
```

```
#> # A tibble: 1 x 1
#>   mean_cats
#>       <dbl>
#> 1     0.817
```

```r
class(mean_cats)
```

```
#> [1] "tbl_df"     "tbl"        "data.frame"
```

---

## `pull()` can be great

But use it sparingly!

```r
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()

cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
```

```
#> [1] 33
```

```r
mean_cats
```

```
#> [1] 0.8166667
```

```r
class(mean_cats)
```

```
#> [1] "numeric"
```
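When searching for help online you may also see base R's `$` and `[[ ]]` used to extract a column as a plain vector. A minimal sketch, where `summary_df` is a stand-in for the one-row summary and holds the mean printed above:

```r
summary_df <- data.frame(mean_cats = 0.8166667)

class(summary_df)            # "data.frame": still a table
class(summary_df$mean_cats)  # "numeric": `$` extracts the column as a vector
summary_df[["mean_cats"]]    # 0.8166667: `[[ ]]` does the same
```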

---

# Factors

---

## Factors

**Factor** objects are what R uses to store data for categorical variables 
(a fixed number of discrete values).

```r
(x <- factor(c("BS", "MS", "PhD", "MS")))
```

```
#> [1] BS  MS  PhD MS 
#> Levels: BS MS PhD
```

```r
glimpse(x)
```

```
#>  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
typeof(x)
```

```
#> [1] "integer"
```
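`typeof()` reports `"integer"` because a factor is stored as integer codes plus a lookup table of levels. A short sketch that takes the factor apart (`degrees` is just an illustrative name for the same values):

```r
degrees <- factor(c("BS", "MS", "PhD", "MS"))

levels(degrees)                       # "BS" "MS" "PhD": the lookup table
as.integer(degrees)                   # 1 2 3 2: the stored codes
levels(degrees)[as.integer(degrees)]  # "BS" "MS" "PhD" "MS": codes to labels
as.character(degrees)                 # the same labels, done for you
```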

---

## Read data in as character strings

```r
glimpse(cat_lovers)
```

```
#> Observations: 60
#> Variables: 3
#> $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass", "T...
#> $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1, 1...
#> $ handedness     <chr> "left", "left", "left", "left", "left", "left", "lef...
```

---

## But coerce when plotting

```r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

---

## Use `forcats` to manipulate factors

```r
cat_lovers %>%
  mutate(
    handedness = fct_relevel(handedness, "right", "left", "ambidextrous")
  ) %>%
  ggplot(mapping = aes(x = handedness)) +
  geom_bar()
```

<img src="06-code-style_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />
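For comparison, level order can also be set in base R through the `levels` argument of `factor()`. This is a sketch on a made-up sample of responses, not the survey data:

```r
handedness <- c("left", "right", "ambidextrous", "right")

# By default the levels are sorted alphabetically.
levels(factor(handedness))   # "ambidextrous" "left" "right"

# Supplying levels explicitly fixes the order used in tables and plots.
hand_fct <- factor(handedness, levels = c("right", "left", "ambidextrous"))
levels(hand_fct)             # "right" "left" "ambidextrous"
table(hand_fct)              # counts reported in the chosen order
```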

---

## Package `forcats`

- R uses factors to handle categorical variables, variables that have a fixed 
  and known set of possible values. Historically, factors were much easier to 
  work with than character vectors, so many base R functions automatically 
  convert character vectors to factors.

- These days, creating factors automatically is less helpful, so tidyverse 
  packages do not do it. However, factors are still useful when you have true 
  categorical data and when you want to override the ordering of character 
  vectors to improve display. The goal of the `forcats` package is to provide 
  a suite of useful tools that solve common problems with factors.

Source: [forcats.tidyverse.org](http://forcats.tidyverse.org/)

---

## Recap

- Always best to think of data as part of a tibble
    + This works nicely with the `tidyverse` as well
    + Rows are observations, columns are variables
    
- Be careful about data types / classes
    + Sometimes `R` makes silly assumptions about your data class 
        + Using `tibble`s helps, but it might not solve all issues
        + Think about your data in context, e.g., a 0/1 variable is most likely a `factor`
    + If a plot/output is not behaving the way you expect, first
      investigate the data class
    + If you are absolutely sure of a data class, overwrite it in your
      tibble so that you don't have to keep track of it
        + `mutate` the variable with the correct class
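As a sketch of the 0/1 case, the example below overwrites a numeric indicator with a labelled factor. It uses base R rather than `mutate()` so that it runs on its own; the data frame `survey`, the column `owns_cat`, and the labels are all hypothetical.

```r
survey <- data.frame(owns_cat = c(0, 1, 1, 0, 1))

class(survey$owns_cat)   # "numeric": R treats the codes as numbers

# Overwrite the column once so later code sees a categorical variable.
survey$owns_cat <- factor(
  survey$owns_cat,
  levels = c(0, 1),
  labels = c("no", "yes")
)

class(survey$owns_cat)   # "factor"
table(survey$owns_cat)   # no: 2, yes: 3
```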
        
---
        
## References

1. http://style.tidyverse.org/

2. http://forcats.tidyverse.org/