September 26, 2017
Any questions from last time?
Any questions from the homework?
Assigned today around 12:30pm
Rules discussed in class on Tuesday – any questions?
From https://opendurham.nc.gov
bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", na = c("NA", "", "."))
bike <- bike %>%
  mutate(
    # Fix BikeAge_Gr
    BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10"),
    BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15"),
    # Create new alcohol variable
    alcohol = case_when(
      Bike_Alc_D == "No" & Drvr_Alc_D == "No" ~ "No",
      Bike_Alc_D == "Yes" | Drvr_Alc_D == "Yes" ~ "Yes",
      Bike_Alc_D == "Missing" & Drvr_Alc_D == "No" ~ "Missing",
      Bike_Alc_D == "No" & Drvr_Alc_D == "Missing" ~ "Missing"
    )
  ) %>%
  # Rename Speed_Limit
  rename(Speed_Limit = Speed_Limi)
sample_n or sample_frac
sample_n: randomly sample 5 observations
bike_n5 <- bike %>% sample_n(5, replace = FALSE)
dim(bike_n5)
sample_frac: randomly sample 20% of observations
bike_perc20 <- bike %>% sample_frac(0.2, replace = FALSE)
dim(bike_perc20)
distinct to filter for unique rows
bike %>%
  select(County, City) %>%
  distinct() %>%
  arrange(County, City)
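Because sample_n and sample_frac draw rows at random, the result changes from run to run unless a seed is set. The following is a minimal sketch on a small made-up tibble (the `toy` data and the seed value 2017 are assumptions for illustration, not part of the bike crash data):

```r
library(dplyr)

# Hypothetical toy data standing in for the bike crash data
toy <- tibble(id = 1:10, city = rep(c("Durham", "Raleigh"), times = 5))

set.seed(2017)  # any fixed integer works; 2017 is arbitrary
first_draw <- toy %>% sample_n(5, replace = FALSE)

set.seed(2017)  # resetting the seed reproduces the same draw
second_draw <- toy %>% sample_n(5, replace = FALSE)

identical(first_draw, second_draw)

# distinct() keeps one row per unique combination of the selected columns
unique_cities <- toy %>% select(city) %>% distinct()
nrow(unique_cities)
```

Setting the seed once at the top of an analysis is usually enough to make every later random draw repeatable.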
Price is a continuous numeric variable (measured in US dollars). I would expect the distribution of price to be right-skewed, as the price can never be below 0, and as the price increases the number of diamonds at that price should decrease. The histogram below confirms this expectation:
ggplot(diamonds, aes(x = price)) +
  geom_histogram(col = "white", binwidth = 500) +
  labs(x = "Price", y = "Count", title = "Price is highly right-skewed")
There are 1,610 fair diamonds, 4,906 good diamonds, 12,082 very good diamonds, 13,791 premium diamonds, and 21,551 ideal diamonds, as given by the summary table below.
Two alternative approaches; either works, and it is fine not to arrange the rows in order:
diamonds %>% group_by(cut) %>% summarise(n_diamonds = n()) %>% arrange(n_diamonds)
## # A tibble: 5 x 2
##         cut n_diamonds
##      <fctr>      <int>
## 1      Fair       1610
## 2      Good       4906
## 3 Very Good      12082
## 4   Premium      13791
## 5     Ideal      21551
diamonds %>% count(cut) %>% arrange(n)
## # A tibble: 5 x 2
##         cut     n
##      <fctr> <int>
## 1      Fair  1610
## 2      Good  4906
## 3 Very Good 12082
## 4   Premium 13791
## 5     Ideal 21551
The proportion of each clarity of diamonds is given in the table below:
diamonds %>%
  group_by(clarity) %>%
  summarise(freq = n()) %>%
  mutate(prop = round(freq / sum(freq), 3)) %>%
  arrange(desc(prop))
## # A tibble: 8 x 3
##   clarity  freq  prop
##    <fctr> <int> <dbl>
## 1     SI1 13065 0.242
## 2     VS2 12258 0.227
## 3     SI2  9194 0.170
## 4     VS1  8171 0.151
## 5    VVS2  5066 0.094
## 6    VVS1  3655 0.068
## 7      IF  1790 0.033
## 8      I1   741 0.014
A scatterplot depicting the relationship between depth and price for Fair cut diamonds is shown below. There is no clear relationship between the two, although the variability of prices increases as depth increases.
diamonds %>%
  filter(cut == "Fair") %>%
  ggplot(mapping = aes(x = depth, y = price)) +
  geom_point() +
  labs(x = "Depth", y = "Price", title = "No real relationship between cut depth and price")
Summary statistics for each cut of diamond are shown below:
diamonds %>%
  group_by(cut) %>%
  summarise(min_price = min(price),
            max_price = max(price),
            mean_price = mean(price),
            median_price = median(price)) %>%
  arrange(desc(median_price))
## # A tibble: 5 x 5
##         cut min_price max_price mean_price median_price
##      <fctr>     <dbl>     <dbl>      <dbl>        <dbl>
## 1      Fair       337     18574   4358.758       3282.0
## 2   Premium       326     18823   4584.258       3185.0
## 3      Good       327     18788   3928.864       3050.5
## 4 Very Good       336     18818   3981.760       2648.0
## 5     Ideal       326     18806   3457.542       1810.0
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1d | Vector (atomic vector) | List (generic vector) |
2d | Matrix | Tibble or Data Frame |
nd | Array | — |
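As a rough illustration of the table above, the sketch below builds one object of each kind in base R and checks its basic properties. All object names here are made up for the example.

```r
v  <- c(1.5, 2, 3)                     # 1d, homogeneous: atomic vector
l  <- list(1, "a", TRUE)               # 1d, heterogeneous: list
m  <- matrix(1:6, nrow = 2)            # 2d, homogeneous: matrix
a  <- array(1:24, dim = c(2, 3, 4))    # nd, homogeneous: array
df <- data.frame(x = 1:2, y = c("a", "b"))  # 2d, heterogeneous: data frame

is.atomic(v)   # TRUE
is.list(l)     # TRUE
dim(m)         # 2 3
dim(a)         # 2 3 4
is.list(df)    # TRUE: a data frame is a list of columns
```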
R has six basic atomic vector types, but for now we'll only focus on the first four:
logical
double
integer
character
complex
raw
logical - boolean values TRUE and FALSE
typeof(TRUE)
## [1] "logical"
character - character strings
typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"
double - floating point numerical values (default numerical type)
typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"
integer - integer numerical values (indicated with an L)
typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"
Vectors can be constructed using the c() function.
c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello" "World!"
c(1, c(2, c(3)))
## [1] 1 2 3
R is a dynamically typed language – it will happily convert between the various types without complaint.
c(1, "Hello")
## [1] "1" "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
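The conversions above follow a fixed ordering: when c() mixes types, the result takes the most general type present, in the order logical, integer, double, character. A short sketch of that ordering, plus explicit conversion with the as.*() functions (variable names are made up for the example):

```r
# Each vector mixes two types; c() coerces to the more general one
t1 <- typeof(c(TRUE, 2L))    # "integer"
t2 <- typeof(c(2L, 3.5))     # "double"
t3 <- typeof(c(3.5, "a"))    # "character"

# Explicit coercion uses the as.*() family
n1 <- as.integer("7")        # 7L
b1 <- as.logical(c(0, 1, 2)) # FALSE TRUE TRUE: zero is FALSE, nonzero is TRUE

# A string that cannot be parsed becomes NA (R also emits a warning)
n2 <- suppressWarnings(as.numeric("seven"))
```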
R uses NA to represent missing values in its data structures.
typeof(NA)
## [1] "logical"
NaN - Not a number
Inf - Positive infinity
-Inf - Negative infinity
pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
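Since comparisons with NA themselves return NA, these special values are usually detected with helper functions rather than with ==. A small sketch in base R (the vector `x` is made up for the example):

```r
x <- c(1, NA, NaN, Inf, -Inf)

is.na(x)       # FALSE TRUE TRUE FALSE FALSE: NaN also counts as missing
is.nan(x)      # FALSE FALSE TRUE FALSE FALSE
is.finite(x)   # TRUE FALSE FALSE FALSE FALSE

NA == NA       # NA, not TRUE

# Most summary functions propagate NA unless told to drop it
m_all  <- mean(c(1, 2, NA))                # NA
m_drop <- mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
```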
What is the type of the following vectors? Explain why they have that type.
c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)
Lists are generic vectors: 1d and can contain any combination of R objects.
mylist = list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1]  TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A" "B" "knock knock"
myotherlist$B
## [1] 1 2 3 4
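Names that contain spaces, such as "knock knock", need either backticks after $ or the name as a string inside [[ ]]. The sketch below shows both, along with a named atomic vector (the vector `v` and its names are made up for the example):

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# Three equivalent ways to reach a named element
b_dollar  <- myotherlist$B
b_bracket <- myotherlist[["B"]]
kk_tick   <- myotherlist$`knock knock`      # backticks for a name with a space
kk_string <- myotherlist[["knock knock"]]

# Atomic vectors can carry names too
v <- c(first = 8, second = 4, third = 7)
v[["second"]]   # 4
names(v)        # "first" "second" "third"
```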
x <- c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
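The difference between the two bracket forms matters most for lists: single brackets return a smaller list, while double brackets return the element itself. A brief sketch of the consequence:

```r
y <- list(8, 4, 7)

class(y[2])      # "list": still a list, of length 1
class(y[[2]])    # "numeric": the element itself

length(y[2:3])   # 2: single brackets can select several elements

# Arithmetic works on the extracted element but not on the sub-list
ok     <- y[[2]] + 1                       # 5
failed <- try(y[2] + 1, silent = TRUE)     # error: non-numeric argument
inherits(failed, "try-error")              # TRUE
```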
Note: When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
"set" is in quotation marks because it is not a formal data type
tibble
data.frame
tibbles:
the readr package (e.g. the read_csv function) loads data as a tibble by default
tibbles are part of the tidyverse, so they work well with other packages we are using

A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but they can be generic as well). Each vector is treated as a column, and the elements of the vectors as rows.
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
attributes(df)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "factor"
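The factor class shown for y comes from the stringsAsFactors argument of data.frame(), which defaulted to TRUE in R versions before 4.0.0 and defaults to FALSE from 4.0.0 onward. Setting it explicitly, as in this sketch, gives the same result on any version (the names `df_chr` and `df_fct` are made up for the example):

```r
# Keep strings as character vectors
df_chr <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
class(df_chr$y)   # "character"

# Convert strings to factors, matching the output shown above
df_fct <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
class(df_fct$y)   # "factor"
levels(df_fct$y)  # "a" "b" "c"
```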
A tibble is a type of data frame that … makes your life (i.e. data analysis) easier.
See ?tibble for details.
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
as_tibble(df1)
## # A tibble: 3 x 2
##       x      y
##   <int> <fctr>
## 1     1      a
## 2     2      b
## 3     3      c
as_tibble() in a pipe
You can convert data.frames to tibbles with the as_tibble function:
df2 <- data.frame(m = 3:1, n = c(TRUE, TRUE, FALSE)) %>%
  as_tibble()
df2
## # A tibble: 3 x 2
##       m     n
##   <int> <lgl>
## 1     3  TRUE
## 2     2  TRUE
## 3     1 FALSE
dplyr and ggplot2 as well
R makes silly assumptions about your data class
Using tibbles instead of data.frames helps, but it might not solve all issues
factor
mutate the variable with the correct class