September 15, 2015

## Today's agenda

• Introduction of teams

• Brief discussion of data types

• Introduction of the Paris Paintings dataset

• Application Exercise: explore Paris Paintings

• Due Thursday: Finish App Ex + add info to your GitHub profile

## Data structures and dimensionality

| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | Vector (atomic vector) | List (generic vector) |
| 2d | Matrix | Data Frame |
| nd | Array | |
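
As a rough illustration of the table above, the sketch below builds one object of each kind and checks its dimensions. All object names are made up for this example.

```r
# One object per cell of the table (names are illustrative only)
v <- c(1.5, 2.5, 3.5)                           # 1d, homogeneous: atomic vector
l <- list(1.5, "a", TRUE)                       # 1d, heterogeneous: list
m <- matrix(1:6, nrow = 2)                      # 2d, homogeneous: matrix
d <- data.frame(id = 1:2, ok = c(TRUE, FALSE))  # 2d, heterogeneous: data frame
a <- array(1:24, dim = c(2, 3, 4))              # nd, homogeneous: array

dim(v)  # NULL: plain vectors have no dim attribute
dim(m)  # 2 3
dim(d)  # 2 2
dim(a)  # 2 3 4
```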

## Vector types

R has six basic atomic vector types, but for now we'll only focus on the first four:

• logical

• double

• integer

• character

• complex

• raw

## Vector types - examples

logical - boolean values TRUE and FALSE

typeof(TRUE)
## [1] "logical"

character - character strings

typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"

## Vector types - examples (cont.)

double - floating point numerical values (default numerical type)

typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)

typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"
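
A small sketch of how the double/integer distinction shows up in practice. The variable names `x` and `y` are chosen only for this example.

```r
x <- 7    # double (the default numerical type)
y <- 7L   # integer (note the L suffix)

is.integer(x)     # FALSE
is.integer(y)     # TRUE
is.numeric(x)     # TRUE: both doubles and integers count as numeric
is.numeric(y)     # TRUE

x == y            # TRUE: the values compare equal
identical(x, y)   # FALSE: the storage types differ

typeof(y + 1L)    # "integer": integer + integer stays integer
typeof(y + 1)     # "double": mixing in a double promotes the result
```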

## Concatenation

Vectors can be constructed using the c() function.

c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello"  "World!"
c(1, c(2, c(3)))
## [1] 1 2 3

## Coercion

R is a dynamically typed language: when values of different types are combined in one atomic vector, it silently converts them to a common type without complaint.

c(1, "Hello")
## [1] "1"     "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
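
The examples above follow a fixed ordering: logical is converted to integer, integer to double, and double to character, whichever is needed to hold everything. The sketch below checks that ordering and shows explicit conversion with the `as.*()` functions. The result names are made up for this example.

```r
# Implicit coercion: the result takes the most flexible type present
t1 <- typeof(c(TRUE, 1L))    # "integer"
t2 <- typeof(c(1L, 2.5))     # "double"
t3 <- typeof(c(2.5, "a"))    # "character"

# Explicit coercion
n1 <- as.integer("42")       # 42L
n2 <- as.numeric("3.14")     # 3.14
n3 <- as.integer(c(TRUE, FALSE))  # 1 0

# Logical-to-number coercion is what makes this count the TRUEs
n_true <- sum(c(TRUE, TRUE, FALSE))  # 2

# Conversions that cannot succeed give NA (with a warning, silenced here)
bad <- suppressWarnings(as.numeric("abc"))  # NA
```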

## Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)
## [1] "logical"
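
A brief sketch of how NA behaves inside a vector and in computations. The vector `x` is made up for this example.

```r
x <- c(1, NA, 3)

typeof(x)              # "double": NA takes on the type of the vector it sits in
is.na(x)               # FALSE TRUE FALSE
mean(x)                # NA: a missing value propagates through most computations
mean(x, na.rm = TRUE)  # 2: drop missing values before computing

# Comparing with == does not detect NA; use is.na() instead
NA == NA               # NA

# Typed missing values also exist
typeof(NA_integer_)    # "integer"
typeof(NA_character_)  # "character"
```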

## Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity

pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
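
To tell these special values apart in data, base R provides predicate functions. A minimal sketch, with `vals` as a made-up vector:

```r
vals <- c(1, NA, NaN, Inf, -Inf)

is.na(vals)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN also counts as NA)
is.nan(vals)       # FALSE FALSE  TRUE FALSE FALSE  (NA is not NaN)
is.finite(vals)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(vals)  # FALSE FALSE FALSE  TRUE  TRUE
```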

## Quick activity: 5 minutes, in teams

What is the type of the following vectors? Explain why they have that type.

• c(1, NA+1L, "C")
• c(1L / 0, NA)
• c(1:3, 5)
• c(3L, NaN+1L)
• c(NA, TRUE)

## Lists

Lists are generic vectors: 1d and can contain any combination of R objects.

mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1]  TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
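
One way to see that each list element keeps its own type is to ask for it element by element. The sketch below re-creates the list from this slide and contrasts it with `c()`, which coerces. The result names are made up for this example.

```r
mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4)/2)

length(mylist)                       # 4 elements, regardless of their lengths
elem_types <- sapply(mylist, typeof) # "character" "integer" "logical" "double"
elem_sizes <- sapply(mylist, length) # 1 4 2 4

# c() would instead flatten and coerce everything to one type
flat <- c("A", 1:4)
typeof(flat)                         # "character"
```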

## Named lists

Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A"           "B"           "knock knock"
myotherlist$B
## [1] 1 2 3 4
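
A short sketch of other ways to reach named elements, including a name that contains a space, plus a named atomic vector. The vector `v` is made up for this example.

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

myotherlist[["B"]]            # same as myotherlist$B
myotherlist$`knock knock`     # backticks are needed for names with spaces
myotherlist[["knock knock"]]  # or use [[ ]] with the name as a string

# Atomic vectors can carry names too
v <- c(first = 8, second = 4, third = 7)
names(v)       # "first" "second" "third"
v["second"]    # keeps the name
v[["second"]]  # drops the name, returns 4
```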

## Vectors vs. lists - [ vs. [[

x <- c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
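
The printed output above hints at the difference: on a list, `[` returns a smaller list, while `[[` returns the element itself. A minimal sketch of why that matters (the name `attempt` is made up for this example):

```r
y <- list(8, 4, 7)

typeof(y[2])    # "list": still a list, of length 1
typeof(y[[2]])  # "double": the element itself
length(y[2:3])  # 2: [ can select several elements, [[ only one

y[[2]] + 1      # 5: arithmetic works on the extracted element

# Arithmetic on the one-element list fails, so the error is caught here
attempt <- tryCatch(y[2] + 1, error = function(e) "error")
attempt         # "error"
```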

## Data Frames

The data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column, and the elements of the vectors form the rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
attributes(df)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"
class(df$x)
## [1] "integer"
##  $ y: chr  "a" "b" "c"

## Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df <- data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z = TRUE)
##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE
rbind(df, c(4,"b"))
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b

## Combining data frames

df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df2 <- data.frame(m = 3:1, n = c(TRUE, TRUE, FALSE))
df3 <- cbind(df1, df2)
df3
##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE
str(df3)
## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE
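
One thing the printed result of `rbind(df, c(4,"b"))` hides is that `c(4, "b")` is a character vector, so the new row can change the type of column `x`. The sketch below shows this and one possible alternative, adding the row as a one-row data frame. The names `bad` and `good` are made up, and `stringsAsFactors` is set explicitly because its default changed in R 4.0.0.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# The new row is the character vector c("4", "b"), so x gets coerced
bad <- rbind(df, c(4, "b"))
class(bad$x)    # "character"

# Supplying the row as a data frame keeps each column's type
good <- rbind(df, data.frame(x = 4L, y = "b", stringsAsFactors = FALSE))
class(good$x)   # "integer"
```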

## tbl_df() in dplyr

library(dplyr)
df1 <- data.frame(x = 1:3,
                  y = c("a", "b", "c"))
tbl_df(df1)
## Source: local data frame [3 x 2]
##
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

## tbl_df() in dplyr (cont.)

df2 <- data.frame(m = 3:1,
                  n = c(TRUE, TRUE, FALSE)) %>%
  tbl_df()
df2
## Source: local data frame [3 x 2]
##
##   m     n
## 1 3  TRUE
## 2 2  TRUE
## 3 1 FALSE

## Recap

• Always best to think of data as part of a data.frame
• This plays nicely with dplyr and ggplot2 as well
• Rows are observations, columns are variables
• Be careful about data types / classes
• Sometimes R makes silly assumptions about your data class
• stringsAsFactors = FALSE helps, but that's not the whole story
• A 0/1 variable is most likely a factor
• If a plot/output is not behaving the way you expect, first investigate the data class
• If you are absolutely sure of a data class, over-write it in your data frame so that you don't have to keep track of it
• mutate the variable with the correct class
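
A minimal sketch of the "investigate, then over-write" advice, written in base R so that it runs without extra packages. The data frame `toy` and its columns are hypothetical. With dplyr, the same over-writing would be done inside `mutate()`.

```r
# Hypothetical data: a number stored as text and a 0/1 indicator
toy <- data.frame(amount = c("100", "250", "75"),
                  flag   = c(0, 1, 0),
                  stringsAsFactors = FALSE)

# First investigate the classes
sapply(toy, class)   # amount: "character", flag: "numeric"

# Then over-write the columns with the classes we actually want
toy$amount <- as.numeric(toy$amount)
toy$flag   <- factor(toy$flag, levels = c(0, 1), labels = c("no", "yes"))

sapply(toy, class)   # amount: "numeric", flag: "factor"
```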

## Data sharing

This dataset is made available for class use only. Do not post the raw dataset anywhere else. You can share your results/findings/plots etc. but not the dataset.

## Accessing the data

• Go to the Resources on Sakai and download paris_paintings.csv

• Upload this file to RStudio Server

• Load using the following (make sure data file is in the correct working directory):

pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()
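
To see what `stringsAsFactors = FALSE` changes in a call like the one above, the sketch below reads a tiny made-up CSV from a temporary file both ways. The file contents and column names are hypothetical and are not taken from the Paris Paintings data.

```r
# Write a small hypothetical CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,price", "a,10", "b,20"), tmp)

with_factors    <- read.csv(tmp, stringsAsFactors = TRUE)
without_factors <- read.csv(tmp, stringsAsFactors = FALSE)

class(with_factors$name)      # "factor"
class(without_factors$name)   # "character"
class(without_factors$price)  # "integer": numeric columns are unaffected

unlink(tmp)  # remove the temporary file
```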

## App Ex 2

See course website