Sta112FS 6. Data types

September 15, 2015

Today's agenda

Introduction of teams
Brief discussion of data types
Introduction of the Paris Paintings dataset
Application Exercise: explore Paris Paintings
Due Thursday: Finish App Ex + add info to your GitHub profile

Data structures and dimensionality

Dimensions	Homogeneous	Heterogeneous
1d	Vector (atomic vector)	List (generic vector)
2d	Matrix	Data Frame
nd	Array	—

Vectors

Vector types

R has six basic atomic vector types, but for now we'll only focus on the first four:

logical
double
integer
character
complex
raw

Vector types - examples

logical - boolean values TRUE and FALSE

typeof(TRUE)

## [1] "logical"

character - character strings

typeof("hello")

## [1] "character"

typeof('world')

## [1] "character"

Vector types - examples (cont.)

double - floating point numerical values (default numerical type)

typeof(1.335)

## [1] "double"

typeof(7)

## [1] "double"

integer - integer numerical values (indicated with an L)

typeof(7L)

## [1] "integer"

typeof(1:3)

## [1] "integer"
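Arithmetic can change the type of a numeric vector. The following is a small sketch (the variable names are only for illustration) of when integers stay integers and when they become doubles:

```r
# Integer arithmetic stays integer for +, -, and *
int_sum <- typeof(7L + 1L)     # "integer"

# Division always returns a double, even for two integers
int_div <- typeof(7L / 2L)     # "double"

# Mixing an integer with a double promotes the result to double
mixed <- typeof(7L + 1)        # "double"

# is.integer() tests the storage type, not whether the value is whole
whole_double <- is.integer(7)  # FALSE: 7 is stored as a double
```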

Concatenation

Vectors can be constructed using the c() function.

c(1, 2, 3)

## [1] 1 2 3

c("Hello", "World!")

## [1] "Hello"  "World!"

c(1, c(2, c(3)))

## [1] 1 2 3

Coercion

R is a dynamically typed language – when values of different types are combined in an atomic vector, it silently converts them to a single common type without complaint.

c(1, "Hello")

## [1] "1"     "Hello"

c(FALSE, 3L)

## [1] 0 3

c(1.2, 3L)

## [1] 1.2 3.0
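The examples above suggest an ordering: logical values become integers, integers become doubles, and anything becomes character if a string is present. A minimal sketch of that ordering, plus explicit coercion with the as.*() functions (variable names are illustrative):

```r
# Implicit coercion: logical -> integer -> double -> character
types <- c(
  typeof(c(TRUE, 1L)),   # "integer"
  typeof(c(1L, 2.5)),    # "double"
  typeof(c(2.5, "a"))    # "character"
)

# Logical values count as 0/1 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))  # 2

# Explicit coercion
num <- as.numeric("3.14")  # 3.14
int <- as.integer(3.9)     # 3 (truncates toward zero)
```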

Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)

## [1] "logical"
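Although a bare NA is logical, it is coerced like any other value, and most computations that touch it return NA. A short sketch of the usual ways to work with missing values (the vector x is made up for illustration):

```r
# NA takes on the type of the vector it is placed in
t1 <- typeof(c(NA, 1L))       # "integer"
t2 <- typeof(NA_character_)   # "character"

# Most operations on NA return NA ...
x <- c(4, NA, 6)
m_na <- mean(x)               # NA

# ... unless missing values are dropped explicitly
m <- mean(x, na.rm = TRUE)    # 5

# Use is.na() to find missing values; x == NA only returns NAs
miss <- is.na(x)              # FALSE TRUE FALSE
```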

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity

pi / 0

## [1] Inf

0 / 0

## [1] NaN

1/0 + 1/0

## [1] Inf

1/0 - 1/0

## [1] NaN

NaN / NA

## [1] NaN

NaN * NA

## [1] NaN
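To tell these special values apart in practice, base R provides a family of test functions. A minimal sketch on a made-up vector:

```r
vals <- c(1, NA, NaN, Inf, -Inf)

na_flag  <- is.na(vals)        # FALSE  TRUE  TRUE FALSE FALSE (NaN counts as missing)
nan_flag <- is.nan(vals)       # FALSE FALSE  TRUE FALSE FALSE (NA is not NaN)
fin_flag <- is.finite(vals)    # TRUE  FALSE FALSE FALSE FALSE
inf_flag <- is.infinite(vals)  # FALSE FALSE FALSE  TRUE  TRUE
```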

Quick activity: 5 minutes, in teams

What is the type of the following vectors? Explain why they have that type.

c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)

Lists

Lists are generic vectors: 1d and can contain any combination of R objects.

mylist = list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist

## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0

str(mylist)

## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
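Because each element of a list keeps its own type and length, it can help to inspect them programmatically. A small sketch using the same mylist (the nested list is an added illustration):

```r
mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4) / 2)

n <- length(mylist)                   # 4 elements
elem_types <- sapply(mylist, typeof)  # "character" "integer" "logical" "double"
elem_lens <- lengths(mylist)          # 1 4 2 4

# Lists can also contain other lists
nested <- list(1, list("a", TRUE))
inner_len <- length(nested[[2]])      # 2
```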

Named lists

Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)

## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"

names(myotherlist)

## [1] "A"           "B"           "knock knock"

myotherlist$B

## [1] 1 2 3 4
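Names that contain spaces cannot be used directly after $. A short sketch of the two ways around this, followed by a named atomic vector (the scores vector is hypothetical):

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# Backticks with $, or a string with [[
a <- myotherlist$`knock knock`
b <- myotherlist[["knock knock"]]

# Atomic vectors can be named too
scores <- c(math = 90, art = 85)
s <- scores["art"]     # named vector of length 1
v <- scores[["art"]]   # the bare value 85
```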

Vectors vs. lists - [ vs. [[

x <- c(8,4,7)
x[1]

## [1] 8

x[[1]]

## [1] 8

y <- list(8,4,7)
y[2]

## [[1]]
## [1] 4

y[[2]]

## [1] 4
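The printed output above hints at the difference: [ returns a smaller list, while [[ extracts the element itself. A minimal sketch that makes the distinction explicit:

```r
y <- list(8, 4, 7)

sub <- y[2]      # a list of length 1
el  <- y[[2]]    # the number 4

sub_type <- typeof(sub)  # "list"
el_type  <- typeof(el)   # "double"

# [ can select several elements at once
several <- y[2:3]        # a list of length 2

# Arithmetic needs the element, not the sub-list;
# y[2] + 1 would stop with an error
total <- y[[2]] + 1      # 5
```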

Data Frames

The data frame is the most commonly used data structure in R. It is a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column, and the elements of the vectors as rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

attributes(df)

## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"

class(df$x)

## [1] "integer"

class(df$y)

## [1] "factor"
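Since a data frame is a list underneath, the list tools from earlier apply to it as well. A brief sketch of that relationship, using the same df:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

storage <- typeof(df)   # "list"
n_cols  <- length(df)   # 2: the length of the list is the number of columns
shape   <- dim(df)      # 3 rows, 2 columns

# Columns can be extracted like list elements
col_x <- df[["x"]]      # 1 2 3

# Matrix-style indexing by row and column also works
cell <- df[2, "x"]      # 2
```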

Strings (Characters) vs Factors

In versions of R before 4.0.0, data.frame() and read.csv() convert character vectors into factors by default when they are included in a data frame; since R 4.0.0 the default is stringsAsFactors = FALSE. Sometimes the conversion is useful, sometimes it isn't. Either way it is important to know what type/class you are working with. This behavior can be controlled using the stringsAsFactors argument.

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
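One reason the distinction matters: a factor stores integer codes plus a set of labels, so converting a factor of number-like strings directly to a number returns the codes, not the values. A small sketch with a made-up factor:

```r
f <- factor(c("10", "20", "10"))

lev <- levels(f)                        # "10" "20"

# Direct conversion returns the internal codes
codes <- as.integer(f)                  # 1 2 1

# Convert to character first to recover the values
values <- as.numeric(as.character(f))   # 10 20 10
```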

Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df <- data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z = TRUE)

##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE

rbind(df, c(4,"b"))

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b
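Note that c(4, "b") is itself a character vector because of coercion, so binding it as a row may change the type of the x column even though the printed result looks the same. One way to avoid this, sketched below, is to bind a one-row data frame instead (stringsAsFactors = FALSE is set here only so the example behaves the same across R versions):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# A one-row data frame keeps a separate type for each column
new_row <- data.frame(x = 4L, y = "b", stringsAsFactors = FALSE)
df_more <- rbind(df, new_row)

# Check that the column types survived
str(df_more)
```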

Combining data frames

df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df2 <- data.frame(m = 3:1, n = c(TRUE, TRUE, FALSE))
df3 <- cbind(df1, df2)
df3

##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE

str(df3)

## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE

`tbl_df()` in `dplyr`

library(dplyr)

df1 <- data.frame(x = 1:3, 
                  y = c("a", "b", "c"))
tbl_df(df1)

## Source: local data frame [3 x 2]
## 
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

`tbl_df()` in `dplyr` (cont.)

df2 <- data.frame(m = 3:1, 
                  n = c(TRUE, TRUE, FALSE)) %>%
  tbl_df()
df2

## Source: local data frame [3 x 2]
## 
##   m     n
## 1 3  TRUE
## 2 2  TRUE
## 3 1 FALSE
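In dplyr 1.0.0 and later, tbl_df() is deprecated in favor of as_tibble() from the tibble package. A minimal sketch of the equivalent call, assuming tibble is installed (the printed format will differ from the output shown above):

```r
library(tibble)

df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Convert a regular data frame to a tibble
tb <- as_tibble(df1)

# A tibble is still a data frame, with extra classes on top
tb_class <- class(tb)   # "tbl_df" "tbl" "data.frame"
```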

Summary of data structures

Recap

Always best to think of data as part of a data.frame
- This plays nicely with dplyr and ggplot2 as well
- Rows are observations, columns are variables
Be careful about data types / classes
- Sometimes R makes silly assumptions about your data class
  - stringsAsFactors = FALSE helps, but that's not the whole story
  - A 0/1 variable is most likely a factor
- If a plot/output is not behaving the way you expect, first investigate the data class
- If you are absolutely sure of a data class, overwrite it in your data frame so that you don't have to keep track of it
  - mutate the variable with the correct class
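As a sketch of the last point, a 0/1 column can be overwritten with a labeled factor using mutate(). The data frame d, the column sold, and the labels are hypothetical and not taken from any course dataset:

```r
library(dplyr)

# Hypothetical data: sold is coded 0/1 but read in as numeric
d <- data.frame(id = 1:4, sold = c(0, 1, 1, 0))

# Overwrite the column with the intended class
d <- d %>%
  mutate(sold = factor(sold, levels = c(0, 1), labels = c("no", "yes")))

sold_class <- class(d$sold)   # "factor"
```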

Acknowledgments

Above materials are derived in part from the following sources:

Hadley Wickham - Advanced R
R Language Definition

Paris Paintings

Paintings dataset

Click here for Sandra's slides

Data sharing

Accessing the data

Codebook: https://stat.duke.edu/courses/Fall15/sta112.01/data/paris_paintings.html
Go to the Resources on Sakai and download paris_paintings.csv
Upload this file to RStudio Server
Load using the following (make sure data file is in the correct working directory):

pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()
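After loading any CSV file it is worth checking the column classes before doing anything else. The sketch below writes a tiny made-up file to a temporary location so that it runs on its own; the column names and values are invented and are not from paris_paintings.csv:

```r
# Write a small hypothetical CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,price,year",
             "Landscape,120.5,1764",
             "Portrait,80,1771"), tmp)

# Read it back, keeping strings as character vectors
toy <- read.csv(tmp, stringsAsFactors = FALSE)

# Inspect the structure and the class of each column
str(toy)
col_classes <- sapply(toy, class)   # character, numeric, integer

unlink(tmp)  # clean up the temporary file
```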

Application exercise

App Ex 2

See course website
