September 15, 2015
Introduction of teams
Brief discussion of data types
Introduction of the Paris Paintings dataset
Application Exercise: explore Paris Paintings
Due Thursday: Finish App Ex + add info to your GitHub profile
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1d | Vector (atomic vector) | List (generic vector) |
2d | Matrix | Data Frame |
nd | Array | — |
R has six basic atomic vector types, but for now we'll only focus on the first four:
logical
double
integer
character
complex
raw
logical - boolean values TRUE and FALSE
typeof(TRUE)
## [1] "logical"
character - character strings
typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"
double - floating point numerical values (default numerical type)
typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"
integer - integer numerical values (indicated with an L suffix)
typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"
Vectors can be constructed using the c() function.
c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello" "World!"
c(1, c(2, c(3)))
## [1] 1 2 3
R is a dynamically typed language – when values of different types are combined in a vector, it silently coerces them to the most flexible type present (logical < integer < double < character).
c(1, "Hello")
## [1] "1" "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
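The implicit conversions above can be checked with typeof(), and the same conversions can be requested explicitly with the as.*() family. The following is a minimal sketch; the variable names are illustrative and not part of the original slides.

```r
# Implicit coercion: c() promotes every element to the most flexible type present.
mixed_lgl_int <- c(FALSE, 3L)    # logical + integer  -> integer
mixed_int_dbl <- c(1.2, 3L)      # double + integer   -> double
mixed_dbl_chr <- c(1, "Hello")   # double + character -> character

typeof(mixed_lgl_int)
typeof(mixed_int_dbl)
typeof(mixed_dbl_chr)

# Explicit coercion with the as.*() functions
one_int  <- as.integer(TRUE)     # logical to integer
three_ch <- as.character(3L)     # integer to character
num      <- as.numeric("1.5")    # character to double

# A conversion that cannot succeed yields NA (R also emits a warning,
# which is suppressed here to keep the output quiet).
not_a_number <- suppressWarnings(as.numeric("Hello"))
is.na(not_a_number)
```

Explicit coercion is usually the safer habit, since it makes the intended type visible in the code.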
R uses NA to represent missing values in its data structures.
typeof(NA)
## [1] "logical"
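Although a bare NA is logical, it is coerced like any other value when it sits in a vector, and R also provides typed missing values. The sketch below uses a made-up scores vector to show how missing values propagate and how to detect them.

```r
# Typed missing values exist for the other atomic types
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)

# A hypothetical vector with one missing entry
scores <- c(4, NA, 7)
typeof(scores)               # NA is coerced along with its neighbours

is.na(scores)                # element-wise test for missingness
mean(scores)                 # missing values propagate through computations
mean(scores, na.rm = TRUE)   # drop missing values before computing
```

Note that comparisons such as scores == NA return NA rather than TRUE or FALSE, which is why is.na() is the tool to use.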
NaN - Not a number
Inf - Positive infinity
-Inf - Negative infinity
pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
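Base R has predicate functions for telling these special values apart. The sketch below applies them to an illustrative vector (the name special is an assumption for this example).

```r
# One ordinary number plus each special value
special <- c(1, NA, NaN, Inf, -Inf)

na_flags       <- is.na(special)        # TRUE for NA and for NaN
nan_flags      <- is.nan(special)       # TRUE only for NaN
finite_flags   <- is.finite(special)    # TRUE only for ordinary numbers
infinite_flags <- is.infinite(special)  # TRUE for Inf and -Inf

na_flags
nan_flags
finite_flags
infinite_flags
```

Because is.na() is also TRUE for NaN, use is.nan() when the distinction matters.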
What is the type of the following vectors? Explain why they have that type.
c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)
Lists are generic vectors: 1d and can contain any combination of R objects.
mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1]  TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A" "B" "knock knock"
myotherlist$B
## [1] 1 2 3 4
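Names that are not syntactically valid, such as "knock knock", need backticks with $ or a quoted string with [[ ]]. The sketch below also shows a named atomic vector; the ages vector is a made-up example, not from the original slides.

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# Three equivalent ways to reach elements by name
greeting  <- myotherlist$A
numbers   <- myotherlist[["B"]]
punchline <- myotherlist$`knock knock`      # backticks for a name with a space
same_line <- myotherlist[["knock knock"]]   # or a quoted string inside [[ ]]

# Atomic vectors can carry names too (hypothetical data)
ages <- c(alice = 30, bob = 25)
names(ages)
ages["bob"]
```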
x <- c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
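The outputs above suggest the rule: on a list, [ returns a smaller list, while [[ extracts the element itself. A minimal sketch that makes the difference visible (variable names are illustrative):

```r
y <- list(8, 4, 7)

sub_list <- y[2]     # still a list, of length 1
element  <- y[[2]]   # the number stored inside

class(sub_list)
class(element)

# Arithmetic works on the extracted element...
element + 1

# ...but not on the one-element list, which raises an error.
attempt <- tryCatch(sub_list + 1, error = function(e) "error")
attempt

# [ can also select several elements at once; [[ cannot.
length(y[c(1, 3)])
```

For an atomic vector such as x, both forms return the same value, which is why the difference only shows up with lists.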
The data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column, and the elements of the vectors as rows.
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
attributes(df)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "factor"
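Since a data frame is a list of columns, the list tools from earlier apply to it directly. The sketch below uses a small numeric data frame (the name num_df and its values are made up for illustration).

```r
num_df <- data.frame(x = 1:3, z = c(1.5, 2.5, 3.5))

is.list(num_df)    # a data frame is a list underneath
typeof(num_df)
length(num_df)     # number of columns, as for any list
dim(num_df)        # rows and columns

# Columns are extracted like list elements
col_x <- num_df[["x"]]
col_z <- num_df$z

# Matrix-style indexing picks a single cell: row 2 of column z
cell <- num_df[2, "z"]
cell
```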
In R versions before 4.0.0, data.frame() converts character vectors into factors by default; since 4.0.0 the default is stringsAsFactors = FALSE. Sometimes the conversion is useful, sometimes it isn't – either way it is important to know what type/class you are working with. This behavior can be controlled using the stringsAsFactors argument.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
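Setting stringsAsFactors explicitly makes the result independent of the installed R version. The sketch below compares both settings and shows one well-known pitfall when converting a factor back to numbers; all object names here are illustrative.

```r
df_fct <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
df_chr <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

class(df_fct$y)
class(df_chr$y)

# A factor column can be converted after the fact
df_fct$y <- as.character(df_fct$y)
class(df_fct$y)

# Pitfall: as.integer() on a factor returns the internal level codes,
# not the numbers shown in the labels.
f <- factor(c("10", "20", "10"))
codes  <- as.integer(f)                 # level codes
values <- as.integer(as.character(f))   # the labelled numbers
codes
values
```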
We can add rows or columns to a data frame using rbind and cbind, respectively.
df <- data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z = TRUE)
##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE
rbind(df, c(4,"b"))
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b
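One caveat with the rbind() call above: c(4, "b") is itself a character vector, so the new row arrives as text. The sketch below, with illustrative object names, compares that with binding a one-row data frame, which keeps each column's type.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# c(4, "b") is coerced to c("4", "b") before rbind() ever sees it,
# so the x column is converted to character as well.
as_vector <- rbind(df, c(4, "b"))
class(as_vector$x)

# Binding a one-row data frame preserves the column types.
new_row <- data.frame(x = 4L, y = "b", stringsAsFactors = FALSE)
as_frame <- rbind(df, new_row)
class(as_frame$x)
nrow(as_frame)
```

Checking str() after each bind is a cheap way to catch this kind of silent type change.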
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df2 <- data.frame(m = 3:1, n = c(TRUE, TRUE, FALSE))
df3 <- cbind(df1, df2)
df3
##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE
str(df3)
## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE
tbl_df() in dplyr
library(dplyr)
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl_df(df1)
## Source: local data frame [3 x 2]
##
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
tbl_df() in dplyr (cont.)
df2 <- data.frame(m = 3:1, n = c(TRUE, TRUE, FALSE)) %>%
  tbl_df()
df2
## Source: local data frame [3 x 2]
##
##   m n
## 1 3  TRUE
## 2 2  TRUE
## 3 1 FALSE
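In more recent dplyr releases, tbl_df() is, as far as I know, deprecated in favor of as_tibble(). The sketch below assumes the tibble package is installed and falls back to a message if it is not; the object name tb1 is illustrative.

```r
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

if (requireNamespace("tibble", quietly = TRUE)) {
  # as_tibble() wraps the data frame in the tbl_df class
  tb1 <- tibble::as_tibble(df1)
  print(class(tb1))
  print(tb1)
} else {
  message("The tibble package is not installed; skipping the conversion.")
}
```

The printed header will differ from the "Source: local data frame" line shown above, since the print format has changed across versions.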
dplyr and ggplot2 as well
R makes silly assumptions about your data class
stringsAsFactors = FALSE helps, but that's not the whole story
factor
mutate the variable with the correct class
Above materials are derived in part from the following sources:
Click here for Sandra's slides
This dataset is made available for class use only. Do not post the raw dataset anywhere else. You can share your results/findings/plots etc. but not the dataset.
Codebook: https://stat.duke.edu/courses/Fall15/sta112.01/data/paris_paintings.html
Go to the Resources on Sakai and download paris_paintings.csv
Upload this file to RStudio Server
Load using the following (make sure data file is in the correct working directory):
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>% tbl_df()
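After loading, it is worth checking what class each column received. Because the real dataset cannot be reproduced here, the sketch below writes a tiny made-up CSV to a temporary file (the columns item, cost and yr are hypothetical and not taken from the codebook) and inspects it the same way one could inspect pp.

```r
# Write a small, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("item,cost,yr",
             "lot_a,12.5,1764",
             "lot_b,NA,1765"), tmp)

# Read it back the same way as the dataset above
toy <- read.csv(tmp, stringsAsFactors = FALSE)

str(toy)                              # structure at a glance
col_classes <- sapply(toy, class)     # class of every column
col_classes
missing_counts <- colSums(is.na(toy)) # number of NAs per column
missing_counts

unlink(tmp)                           # clean up the temporary file
```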
See course website