September 15, 2015

Today's agenda

  • Introduction of teams

  • Brief discussion of data types

  • Introduction of the Paris Paintings dataset

  • Application Exercise: explore Paris Paintings

  • Due Thursday: Finish App Ex + add info to your GitHub profile

Data structures and dimensionality

Dimensions   Homogeneous               Heterogeneous
1d           Vector (atomic vector)    List (generic vector)
2d           Matrix                    Data Frame
nd           Array
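
As a rough illustration of the table, the sketch below builds one small object for each cell and checks its shape. The variable names and values are made up for this example.

```r
# One small object per cell of the table (all names are illustrative)
v  <- c(1.5, 2.5, 3.5)                            # 1d, homogeneous: atomic vector
l  <- list(1L, "a", TRUE)                         # 1d, heterogeneous: list
m  <- matrix(1:6, nrow = 2)                       # 2d, homogeneous: matrix
df <- data.frame(id = 1:2, ok = c(TRUE, FALSE))   # 2d, heterogeneous: data frame
a  <- array(1:24, dim = c(2, 3, 4))               # nd, homogeneous: array

is.atomic(v)  # TRUE
is.list(l)    # TRUE
dim(m)        # 2 3
dim(df)       # 2 2
dim(a)        # 2 3 4
```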

Vectors

Vector types

R has six basic atomic vector types, but for now we'll only focus on the first four:

  • logical

  • double

  • integer

  • character

  • complex

  • raw

Vector types - examples

logical - boolean values TRUE and FALSE

typeof(TRUE)
## [1] "logical"

character - character strings

typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"

Vector types - examples (cont.)

double - floating point numerical values (default numerical type)

typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)

typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"

Concatenation

Vectors can be constructed using the c() function.

c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello"  "World!"
c(1, c(2, c(3)))
## [1] 1 2 3

Coercion

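R is a dynamically typed language – it will happily convert between the various types without complaint. When c() mixes types, every element is coerced to the most flexible type present: logical < integer < double < character.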

c(1, "Hello")
## [1] "1"     "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
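
A small sketch of the same idea, checking the resulting type directly and then coercing explicitly with the as.*() functions. The variable names are chosen for this example only.

```r
# Each c() call mixes two types; the result takes the more flexible one.
mixed_types <- c(
  logical_integer  = typeof(c(TRUE, 1L)),
  integer_double   = typeof(c(1L, 2.5)),
  double_character = typeof(c(2.5, "a"))
)
mixed_types

# Explicit coercion with the as.*() family
seven <- as.integer("7")                              # the integer 7
not_a_number <- suppressWarnings(as.numeric("abc"))   # NA; warns if not suppressed
n_true <- sum(c(TRUE, FALSE, TRUE))                   # TRUE counts as 1, so 2
```

Summing a logical vector is a common use of coercion: it counts the TRUE values.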

Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)
## [1] "logical"
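
The following sketch shows how NA behaves in practice: it takes on the type of the vector it sits in, and it propagates through most computations unless removed. The vector x is an invented example.

```r
x <- c(4, NA, 10)

typeof(x)                # "double": the NA is coerced along with the rest
is.na(x)                 # FALSE  TRUE FALSE
mean(x)                  # NA, because one value is unknown
mean(x, na.rm = TRUE)    # 7

# Comparing with NA yields NA, so use is.na() instead of ==
NA == NA                 # NA

# Typed missing values also exist
typeof(NA_integer_)      # "integer"
typeof(NA_character_)    # "character"
```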

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity


pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
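
To tell these special values apart in a vector, base R provides a few predicate functions. A brief sketch, with an invented vector vals:

```r
vals <- c(1, NaN, NA, Inf, -Inf)

is.nan(vals)       # FALSE  TRUE FALSE FALSE FALSE
is.na(vals)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN also counts as missing)
is.finite(vals)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(vals)  # FALSE FALSE FALSE  TRUE  TRUE
```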

Quick activity: 5 minutes, in teams

What is the type of the following vectors? Explain why they have that type.

  • c(1, NA+1L, "C")
  • c(1L / 0, NA)
  • c(1:3, 5)
  • c(3L, NaN+1L)
  • c(NA, TRUE)

Lists

Lists are generic vectors: 1d and can contain any combination of R objects.

mylist = list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2

Named lists

Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A"           "B"           "knock knock"
myotherlist$B
## [1] 1 2 3 4
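
Names that contain spaces need quoting or backticks when used for access. The sketch below reuses myotherlist and adds a named vector, ages, which is a hypothetical example.

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# Two ways to reach an element whose name contains a space
myotherlist[["knock knock"]]    # "who's there?"
myotherlist$`knock knock`       # "who's there?"

# Vectors can be named too (ages is made up for illustration)
ages <- c(alice = 31, bob = 27)
ages["bob"]      # keeps the name: bob 27
ages[["bob"]]    # drops the name: 27
```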

Vectors vs. lists - [ vs. [[

x <- c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
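
The difference matters as soon as you compute with the result: [ on a list returns a smaller list, while [[ returns the element itself. A minimal sketch using the same y:

```r
y <- list(8, 4, 7)

class(y[2])       # "list": still a container
class(y[[2]])     # "numeric": the element itself
length(y[2:3])    # 2: [ can select several elements at once

y[[2]] + 1        # 5

# Arithmetic on the sub-list fails; tryCatch() captures the error here
res <- tryCatch(y[2] + 1, error = function(e) "error")
res               # "error"
```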

Data Frames

The data frame is the most commonly used data structure in R. It is a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column, and the elements of the vectors form the rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

attributes(df)
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "factor"

Strings (Characters) vs Factors

In R versions before 4.0.0, data.frame() converts character vectors into factors by default (since 4.0.0 the default is stringsAsFactors = FALSE). Sometimes this is useful, sometimes it isn't – either way it is important to know what type/class you are working with. This behavior can be changed using the stringsAsFactors argument.

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
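
One reason the type/class matters: a factor stores integer level codes, so converting it straight to a number returns the codes rather than the labels. A short sketch with an invented factor f:

```r
f <- factor(c("10", "20", "10"))

levels(f)                      # "10" "20"
as.integer(f)                  # 1 2 1: the level codes, not the values
as.numeric(as.character(f))    # 10 20 10: convert via character to get the values
```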

Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df <- data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z = TRUE)
##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE
rbind(df, c(4,"b"))
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b
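
The printed result looks fine, but c(4, "b") is a character vector, so the x column is silently coerced. Binding a one-row data frame instead keeps the column types. The sketch below sets stringsAsFactors = FALSE explicitly so it behaves the same across R versions; the names bad and good are illustrative.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Binding a character vector coerces x to character
bad <- rbind(df, c(4, "b"))
class(bad$x)     # "character"

# Binding a one-row data frame preserves the column types
good <- rbind(df, data.frame(x = 4L, y = "b", stringsAsFactors = FALSE))
class(good$x)    # "integer"
```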

Combining data frames

df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df2 <- data.frame(m = 3:1, n = c(TRUE, TRUE, FALSE))
df3 <- cbind(df1, df2)
df3
##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE
str(df3)
## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE

tbl_df() in dplyr

library(dplyr)
df1 <- data.frame(x = 1:3, 
                  y = c("a", "b", "c"))
tbl_df(df1)
## Source: local data frame [3 x 2]
## 
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

tbl_df() in dplyr (cont.)

df2 <- data.frame(m = 3:1, 
                  n = c(TRUE, TRUE, FALSE)) %>%
  tbl_df()
df2
## Source: local data frame [3 x 2]
## 
##   m     n
## 1 3  TRUE
## 2 2  TRUE
## 3 1 FALSE

Summary of data structures

Recap

  • Always best to think of data as part of a data.frame
    • This plays nicely with dplyr and ggplot2 as well
    • Rows are observations, columns are variables
  • Be careful about data types / classes
    • Sometimes R makes silly assumptions about your data class
      • stringsAsFactors = FALSE helps, but that's not the whole story
      • 0/1 variable is most likely a factor
    • If a plot/output is not behaving the way you expect, first investigate the data class
    • If you are absolutely sure of a data class, overwrite it in your data frame so that you don't have to keep track of it
      • mutate the variable with the correct class
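
A minimal sketch of the last point, using base R so it runs without extra packages. The survey data frame and its smoker column are invented for illustration, and the dplyr form is shown only as a comment.

```r
# Hypothetical data: smoker is stored as 0/1 numbers
survey <- data.frame(id = 1:4, smoker = c(0, 1, 1, 0))
class(survey$smoker)    # "numeric"

# Overwrite the column with the intended class
survey$smoker <- factor(survey$smoker, levels = c(0, 1), labels = c("no", "yes"))
# With dplyr loaded, roughly:
#   survey <- mutate(survey, smoker = factor(smoker, levels = c(0, 1),
#                                            labels = c("no", "yes")))

class(survey$smoker)    # "factor"
table(survey$smoker)    # no: 2, yes: 2
```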

Acknowledgments

Paris Paintings

Paintings dataset

Click here for Sandra's slides

Data sharing

This dataset is made available for class use only. Do not post the raw dataset anywhere else. You can share your results/findings/plots etc. but not the dataset.

Accessing the data

Application exercise

App Ex 2

See course website