September 26, 2017

Getting started

Recap

  • Any questions from last time?

  • Any questions from the homework?

Midterm

  • Assigned today around 12:30pm

  • Rules discussed in class on Tuesday – any questions?

Data wrangling (remaining slides)

NC DOT Fatal Crashes in North Carolina

From https://opendurham.nc.gov

bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", 
                  na = c("NA", "", "."))

Data manipulations

bike <- bike %>%
  mutate(
    # Fix BikeAge_Gr
    BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10"),
    BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15"),
    # Create new alcohol variable
    alcohol = case_when(
      Bike_Alc_D == "No"      & Drvr_Alc_D == "No"      ~ "No",
      Bike_Alc_D == "Yes"     | Drvr_Alc_D == "Yes"     ~ "Yes",
      Bike_Alc_D == "Missing" & Drvr_Alc_D == "No"      ~ "Missing",
      Bike_Alc_D == "No"      & Drvr_Alc_D == "Missing" ~ "Missing"
      )
    ) %>%
  # Rename Speed_Limi to Speed_Limit
  rename(Speed_Limit = Speed_Limi)
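
A minimal sketch of how case_when treats the conditions above, assuming dplyr is installed. The toy tibble is hypothetical and is not the bike data. Conditions are checked top to bottom, and a row that matches none of them (here both values "Missing") gets NA.

```r
library(dplyr)

# Hypothetical toy data, one row per combination of interest
toy <- tibble(
  Bike_Alc_D = c("No", "Yes", "Missing", "No",      "Missing"),
  Drvr_Alc_D = c("No", "No",  "No",      "Missing", "Missing")
)

toy <- toy %>%
  mutate(
    alcohol = case_when(
      Bike_Alc_D == "No"      & Drvr_Alc_D == "No"      ~ "No",
      Bike_Alc_D == "Yes"     | Drvr_Alc_D == "Yes"     ~ "Yes",
      Bike_Alc_D == "Missing" & Drvr_Alc_D == "No"      ~ "Missing",
      Bike_Alc_D == "No"      & Drvr_Alc_D == "Missing" ~ "Missing"
    )
  )

# The last row matches no condition, so its value is NA
toy$alcohol
```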

Select rows with sample_n or sample_frac

  • sample_n: randomly sample 5 observations
bike_n5 <- bike %>%
  sample_n(5, replace = FALSE)
dim(bike_n5)
  • sample_frac: randomly sample 20% of observations
bike_perc20 <- bike %>%
  sample_frac(0.2, replace = FALSE)
dim(bike_perc20)
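
Because both functions sample at random, results change from run to run. A sketch of making them reproducible with set.seed, assuming dplyr is installed and using the built-in iris data (150 rows) in place of bike. The seed value is arbitrary.

```r
library(dplyr)

# Same seed, same sample
set.seed(2017)
s1 <- iris %>% sample_n(5, replace = FALSE)
set.seed(2017)
s2 <- iris %>% sample_n(5, replace = FALSE)
identical(s1, s2)

# 20% of 150 rows
frac <- iris %>% sample_frac(0.2, replace = FALSE)
dim(frac)
```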

distinct to filter for unique rows

bike %>% 
  select(County, City) %>% 
  distinct() %>% 
  arrange(County, City)

Wrangling diamonds

Q1

Price is a numeric variable (measured in US dollars). I would expect the distribution of price to be right-skewed, since price can never be below 0 and the number of diamonds should decrease as price increases. The histogram below confirms this expectation:

ggplot(diamonds, aes(x = price)) +
  geom_histogram(col = "white", binwidth = 500) +
  labs(x = "Price", y = "Count", title = "Price is highly right-skewed")

Q2

There are 1,610 fair diamonds, 4,906 good diamonds, 12,082 very good diamonds, 13,791 premium diamonds, and 21,551 ideal diamonds, as given by the summary table below.

Either of the two approaches below works, and arranging the rows in order is optional:

diamonds %>%
  group_by(cut) %>%
  summarise(n_diamonds = n()) %>%
  arrange(n_diamonds)
## # A tibble: 5 x 2
##         cut n_diamonds
##      <fctr>      <int>
## 1      Fair       1610
## 2      Good       4906
## 3 Very Good      12082
## 4   Premium      13791
## 5     Ideal      21551

Q2 alternative

diamonds %>%
  count(cut) %>%
  arrange(n)
## # A tibble: 5 x 2
##         cut     n
##      <fctr> <int>
## 1      Fair  1610
## 2      Good  4906
## 3 Very Good 12082
## 4   Premium 13791
## 5     Ideal 21551

Q3

The proportion of each clarity of diamonds is given in the table below:

diamonds %>%
  group_by(clarity) %>%
  summarise(freq = n()) %>%
  mutate(prop = round(freq / sum(freq), 3)) %>%
  arrange(desc(prop))
## # A tibble: 8 x 3
##   clarity  freq  prop
##    <fctr> <int> <dbl>
## 1     SI1 13065 0.242
## 2     VS2 12258 0.227
## 3     SI2  9194 0.170
## 4     VS1  8171 0.151
## 5    VVS2  5066 0.094
## 6    VVS1  3655 0.068
## 7      IF  1790 0.033
## 8      I1   741 0.014

Q4

A scatterplot of depth against price for diamonds with a Fair cut is shown below. There is no real relationship between the two, although the variability of prices increases as depth increases.

diamonds %>%
  filter(cut == "Fair") %>%
  ggplot(mapping = aes(x = depth, y = price)) +
    geom_point() +
    labs(x = "Depth", y = "Price", 
         title = "No real relationship between cut depth and price")

Q5

Summary statistics for each cut of diamond are shown below:

diamonds %>%
  group_by(cut) %>%
  summarise(min_price = min(price), 
            max_price = max(price), 
            mean_price = mean(price), 
            median_price = median(price)) %>%
  arrange(desc(median_price))
## # A tibble: 5 x 5
##         cut min_price max_price mean_price median_price
##      <fctr>     <dbl>     <dbl>      <dbl>        <dbl>
## 1      Fair       337     18574   4358.758       3282.0
## 2   Premium       326     18823   4584.258       3185.0
## 3      Good       327     18788   3928.864       3050.5
## 4 Very Good       336     18818   3981.760       2648.0
## 5     Ideal       326     18806   3457.542       1810.0

Data types

Data structures and dimensionality

Dimensions   Homogeneous              Heterogeneous
1d           Vector (atomic vector)   List (generic vector)
2d           Matrix                   Tibble or Data Frame
nd           Array
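
One object of each kind from the table, built with base R only. The object names are illustrative.

```r
v <- c(1.5, 2, 3)                              # 1d, homogeneous: atomic vector
l <- list(1.5, "two", TRUE)                    # 1d, heterogeneous: list
m <- matrix(1:6, nrow = 2)                     # 2d, homogeneous: matrix
d <- data.frame(id = 1:2, name = c("a", "b"))  # 2d, heterogeneous: data frame
a <- array(1:24, dim = c(2, 3, 4))             # nd, homogeneous: array

is.atomic(v)
is.list(l)
dim(m)   # rows, columns
dim(d)   # rows, columns
dim(a)   # one entry per dimension
```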

Vectors

Vector types

R has six basic atomic vector types, but for now we'll only focus on the first four:

  • logical

  • double

  • integer

  • character

  • complex

  • raw

Vector types - examples

logical - boolean values TRUE and FALSE

typeof(TRUE)
## [1] "logical"

character - character strings

typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"

Vector types - examples (cont.)

double - floating point numerical values (default numerical type)

typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)

typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"
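
For completeness, a quick look at the two types we are skipping for now, plus a reminder that arithmetic can change the type of a result. Base R only.

```r
t_complex <- typeof(2 + 3i)       # complex numbers
t_raw     <- typeof(as.raw(255))  # raw bytes

# Ordinary division of integers gives a double,
# integer division (%/%) stays integer
t_div    <- typeof(7L / 2L)
t_intdiv <- typeof(7L %/% 2L)

# Both integer and double count as "numeric"
both_numeric <- is.numeric(7L) && is.numeric(7)
```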

Concatenation

Vectors can be constructed using the c() function.

c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello"  "World!"
c(1, c(2, c(3)))
## [1] 1 2 3

Coercion

R is a dynamically typed language: when values of different types are combined in one vector, it will happily convert them to a common type without complaint.

c(1, "Hello")
## [1] "1"     "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
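
The conversions above follow a fixed order: logical, then integer, then double, then character. A short base R sketch of that order, and of what happens when you convert explicitly and the conversion fails.

```r
t1 <- typeof(c(TRUE, 1L))   # logical + integer   -> integer
t2 <- typeof(c(1L, 1.5))    # integer + double    -> double
t3 <- typeof(c(1.5, "a"))   # double  + character -> character

# Logicals become 0/1 in arithmetic, so sum() counts the TRUEs
n_true <- sum(c(TRUE, FALSE, TRUE))

# Explicit conversion that cannot succeed returns NA.
# Without suppressWarnings() R also warns "NAs introduced by coercion".
bad <- suppressWarnings(as.numeric("abc"))
good <- as.integer("7")
```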

Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)
## [1] "logical"
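
NA is logical on its own, but it is coerced to match the vector it sits in. A base R sketch of that, and of how missing values propagate through calculations.

```r
x <- c(4, NA, 6)

x_type   <- typeof(x)              # NA took the type of its neighbours
x_miss   <- is.na(x)               # which elements are missing
x_mean   <- mean(x)                # NA: the result is unknown
x_mean_2 <- mean(x, na.rm = TRUE)  # drop NAs before averaging

# Comparing with == does not detect NA, use is.na() instead
cmp <- NA == NA
```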

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity


pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
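
To test for these values, use the dedicated helper functions rather than ==. A base R sketch:

```r
vals <- c(1, NA, NaN, Inf, -Inf)

na_flag  <- is.na(vals)        # TRUE for both NA and NaN
nan_flag <- is.nan(vals)       # TRUE for NaN only
fin_flag <- is.finite(vals)    # TRUE for ordinary numbers only
inf_flag <- is.infinite(vals)  # TRUE for Inf and -Inf
```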

Quick activity: 5 minutes, in teams

What is the type of the following vectors? Explain why they have that type.

  • c(1, NA+1L, "C")
  • c(1L / 0, NA)
  • c(1:3, 5)
  • c(3L, NaN+1L)
  • c(NA, TRUE)

Lists

Lists

Lists are generic vectors: 1d and can contain any combination of R objects.

mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
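
"Any combination of R objects" includes other lists and even functions. A base R sketch with an illustrative list called nested:

```r
nested <- list(1:3, list("a", TRUE), mean)

n_elem   <- length(nested)            # top-level elements only
inner    <- typeof(nested[[2]])       # the second element is itself a list
computed <- nested[[3]](nested[[1]])  # call the stored function on 1:3

# Flattening a list back into a vector triggers coercion
flat <- unlist(list("A", 1:2))
```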

Named lists

Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A"           "B"           "knock knock"
myotherlist$B
## [1] 1 2 3 4
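
Names that contain spaces need backticks with $, or quotes with [[. A base R sketch using the same list, plus a named vector (the vector scores is illustrative):

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

by_dollar  <- myotherlist$`knock knock`      # backticks for non-standard names
by_bracket <- myotherlist[["knock knock"]]   # quotes work inside [[ ]]

# Atomic vectors can be named too
scores <- c(first = 1, second = 2)
second <- scores[["second"]]
```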

Vectors vs. lists - [ vs. [[

x <- c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4

Note: When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
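
The difference matters as soon as you compute with the result: [ on a list returns a smaller list, while [[ returns the element itself. A base R sketch:

```r
y <- list(8, 4, 7)

single_type <- typeof(y[2])    # still a list (of length 1)
double_type <- typeof(y[[2]])  # the element itself

plus_one <- y[[2]] + 1         # arithmetic works on the element

# Arithmetic on the one-element list fails, caught here for illustration
res <- tryCatch(y[2] + 1, error = function(e) "error")

# [ can return several elements at once, [[ cannot
two <- length(y[2:3])
```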

Data "sets"

Data "sets" in R

  • "set" is in quotation marks because it is not a formal data type

  • A tidy data "set" can be one of the following types:
    • tibble
    • data.frame
  • We'll often work with tibbles:
    • readr package (e.g. read_csv function) loads data as a tibble by default
    • tibbles are part of the tidyverse, so they work well with other packages we are using
    • they make minimal assumptions about your data, so are less likely to cause hard-to-track bugs in your code
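
One example of those "minimal assumptions", sketched under the assumption that the tibble package is installed: selecting a single column with [ keeps a tibble a tibble, while a data.frame silently drops to a vector. The objects df_demo and tb_demo are illustrative.

```r
library(tibble)

df_demo <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb_demo <- tibble(x = 1:3, y = c("a", "b", "c"))

# data.frame: one column selected with [ , ] becomes a plain vector
df_col_is_df <- is.data.frame(df_demo[, "x"])

# tibble: the result is still a tibble
tb_col_is_tb <- is_tibble(tb_demo[, "x"])

# tibble() keeps strings as character
tb_y_class <- class(tb_demo$y)
```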

Data frames

A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but they can be generic as well). Each vector is treated as a column and the elements of the vectors as rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
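
A base R sketch confirming that a data frame really is a list of columns. Note that the Factor column in the output above reflects versions of R before 4.0.0, where stringsAsFactors defaulted to TRUE. From R 4.0.0 on the default is FALSE, so the argument is set explicitly here to get the same behaviour in any version.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)

df_type <- typeof(df)   # the underlying type is a list
n_cols  <- length(df)   # list length = number of columns
x_col   <- df[["x"]]    # list-style access returns the column vector
y_class <- class(df$y)  # factor, because of stringsAsFactors = TRUE

# Keeping strings as character instead
df_chr      <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
y_chr_class <- class(df_chr$y)
```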

Data frames (cont.)

attributes(df)
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "factor"

tibbles

A tibble is a type of data frame that … makes your life (i.e. data analysis) easier.

See ?tibble for details.

df1 <- data.frame(x = 1:3, 
                  y = c("a", "b", "c"))
as_tibble(df1)
## # A tibble: 3 x 2
##       x      y
##   <int> <fctr>
## 1     1      a
## 2     2      b
## 3     3      c

as_tibble() in a pipe

You can convert data.frames to tibbles with the as_tibble function:

df2 <- data.frame(m = 3:1, 
                  n = c(TRUE, TRUE, FALSE)) %>%
  as_tibble()
df2
## # A tibble: 3 x 2
##       m     n
##   <int> <lgl>
## 1     3  TRUE
## 2     2  TRUE
## 3     1 FALSE

Summary of data structures

Recap

  • Always best to think of data as part of a tibble
    • This plays nicely with dplyr and ggplot2 as well
    • Rows are observations, columns are variables
  • Be careful about data types / classes
    • Sometimes R makes silly assumptions about your data class
      • Using tibbles instead of data.frames helps, but it might not solve all issues
      • Think about your data in context, e.g. a 0/1 variable is most likely a factor
    • If a plot/output is not behaving the way you expect, first investigate the data class
    • If you are absolutely sure of a data class, overwrite it in your tibble so that you don't have to keep track of it
      • mutate the variable with the correct class
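
A sketch of that last step, assuming dplyr is installed. The tibble survey and its smoker column are hypothetical: a 0/1 variable is read as numeric, then overwritten as a factor with readable labels.

```r
library(dplyr)

# Hypothetical data: smoker is coded 0/1 and read as numeric
survey <- tibble(id = 1:4, smoker = c(0, 1, 1, 0))
class_before <- class(survey$smoker)

# Overwrite the column with the class we actually want
survey <- survey %>%
  mutate(smoker = factor(smoker, levels = c(0, 1), labels = c("no", "yes")))

class_after <- class(survey$smoker)
smoker_chr  <- as.character(survey$smoker)
```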