Sta112FS Lecture 3 - Statistical inference, EDA, and the data science process

September 2, 2014

Skill set of a data scientist

data preparation & munging
modeling
coding
visualization
communication

Statistical inference

Making sense

The processes in our lives are actually data generating processes.

Give examples of some data generating processes from your day.

We would like ways to describe, understand, and make sense of these processes: as scientists we want to understand the world better, and understanding these processes is also part of the solution to the problems we're trying to solve.

Data as traces

Data represents traces of the real-world processes
Which traces we gather is decided by the data collection (sampling) method

Cycle of …

Data represents traces of the real-world processes
Once we have the data, we want to simplify it into something more comprehensible, something that captures it all in a much more concise way
These are called statistical estimators

The process of going from the world to the data, and then from the data back to the world is the field of statistical inference

Statistical inference

The discipline concerned with developing procedures, methods, and theorems that allow us to extract meaning and information from data generated by stochastic (random) processes.

Populations and samples

Population is the complete set of observations of whatever we are studying, e.g. people, tweets, photographs, etc. (population size = N)
Sample is a subset of the population, ideally random and representative (sample size = n)
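
A minimal sketch in base R of the population/sample distinction. The population here is simulated, and the variable names and numbers (heights, N = 10000, n = 100) are made up for illustration; they are not from the lecture.

```r
# Hypothetical population: N = 10000 simulated heights in cm (made-up values)
set.seed(112)
N <- 10000
population <- rnorm(N, mean = 170, sd = 10)

# Simple random sample of size n, drawn without replacement
n <- 100
smpl <- sample(population, size = n)

# In practice the population mean is unknown; the sample mean estimates it
pop_mean  <- mean(population)
samp_mean <- mean(smpl)
c(population = pop_mean, sample = samp_mean)
```

Rerunning with a different seed gives a different sample mean each time, while the population mean for a fixed population stays the same.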

Sampling is natural

When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis
If you generalize and conclude that your entire soup needs salt, that's an inference
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)

Sampling biases

Suppose I want to calculate the average number of kids couples have. Will data from this class (asking each of you how many kids your parents have) yield a biased or unbiased estimate of the value I am interested in? If biased, will it be an over-estimate or an under-estimate?
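
One way to explore this question is by simulation. The sketch below assumes a hypothetical population of couples whose number of kids follows a Poisson(2) distribution, and a class of 200 students; both choices are arbitrary assumptions for illustration. Each student reaches their own parents, so a couple is sampled with probability proportional to the number of kids they have.

```r
set.seed(112)

# Hypothetical population of couples; Poisson(2) is an arbitrary assumption
kids <- rpois(100000, lambda = 2)

# Asking students: a couple can only be reached through one of its children,
# so couples with zero kids never appear and large families appear more often
student_reports <- sample(kids, size = 200, replace = TRUE, prob = kids)

mean(kids)             # average over all couples in the population
mean(student_reports)  # average reported by the class
```

Comparing the two means shows the direction of the bias under these assumptions.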

Populations and samples of big data

Even if we have access to all of FB's or Google's or Twitter's data corpus, any inferences we make from that data should not be extended to draw conclusions about humans beyond those sets of users for any particular day.

Consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.

From: http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/

Ultimate goal

The ultimate goal is to say (infer) something about the population using data from the sample.

Ideally, this is a statistical statement.

But we start data analysis by describing / visualizing what's in the sample, and use these findings to make conjectures about the population, before using formal modeling and inference techniques to make statistical statements about the population.

Toolkit

ggplot2

R/RStudio + ggplot2

# install the package
install.packages("ggplot2")

# load the package
library(ggplot2)

Fuel economy

data: mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars

# see the help file
?mpg

# view data
View(mpg)
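
View() opens a spreadsheet-style viewer, which works best inside RStudio. As a console alternative, the following sketch uses only standard functions to inspect the same dataset.

```r
library(ggplot2)  # provides the mpg dataset

dim(mpg)    # number of rows, then number of columns
names(mpg)  # column (variable) names
head(mpg)   # first six rows
str(mpg)    # type of each column, with a preview of its values
```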

How many rows and columns does this dataset have? What does each row represent? What does each column represent?

Make a prediction. What relationship do you expect to see between engine size (displ) and mileage (hwy)?

Scatterplots

qplot()

qplot([x-variable], [y-variable], data = [dataset])

qplot(displ, hwy, data = mpg)
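
Recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated, so it still runs but emits a warning. A sketch of the same scatterplot written with ggplot(), in case qplot() is unavailable or noisy in your installation:

```r
library(ggplot2)

# Same scatterplot as qplot(displ, hwy, data = mpg), written with ggplot():
# aes() maps variables to x and y, geom_point() draws one point per row
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```

Typing `p` at the console draws the plot.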

Displacement vs. highway mpg

How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend?

qplot(displ, hwy, data = mpg)

Additional variables

Can display additional variables with

aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets)

Aesthetics

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

color
size
shape
alpha (transparency)

qplot([x], [y], data = [dataset],
      [aesthetic] = [variable to map it to])

Displacement vs. highway mpg + class

qplot(displ, hwy, data = mpg, color = class)

Dive further…

Work in your teams to add color, size, and shape aesthetics to your graph. Experiment. Do different things happen for discrete and continuous variables? What happens when you use more than one aesthetic?

Aesthetics summary

aesthetics	discrete	continuous
color	rainbow of colors	gradient
size	discrete steps	continuous mapping between value and point size (area since ggplot2 2.0.0; radius in earlier versions)
shape	different shape for each	shouldn't (and doesn't) work
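
The discrete versus continuous behaviour summarised above can be tried with a sketch like the following, written with ggplot() rather than qplot(). The choice of class as the discrete variable and cty as the continuous one is just for illustration.

```r
library(ggplot2)

# class is categorical (character), so colour gets a discrete palette
p_discrete <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# cty is numeric, so colour gets a continuous gradient
p_continuous <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point()

# Several aesthetics can be combined; alpha is set to a fixed value here
# (not mapped to data) to make overlapping points easier to see
p_both <- ggplot(mpg, aes(x = displ, y = hwy, colour = class, size = cty)) +
  geom_point(alpha = 0.5)
```

Typing the name of any of these objects at the console draws the corresponding plot.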

Faceting

Faceting options

Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data

qplot([x], [y], data = [dataset]) +
  facet_grid(. ~ [variable to facet by])

Displacement vs. highway mpg by cylinders

qplot(displ, hwy, data = mpg) +
  facet_grid(. ~ cyl)

Dive further…

Run the following pieces of code, and describe what each plot displays.

qplot(displ, hwy, data = mpg) +
  facet_grid(drv ~ .)

qplot(displ, hwy, data = mpg) +
  facet_grid(drv ~ cyl)

qplot(displ, hwy, data = mpg) +
  facet_wrap(~ class)

Facet summary

facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
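
The facet functions combine with ggplot() in the same way as with qplot(). A short sketch, assuming the same mpg scatterplot as a base layer (the object names are made up for illustration):

```r
library(ggplot2)

# Base scatterplot, reused for each faceting variant
base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_cols <- base + facet_grid(. ~ cyl)  # one column of panels per cyl value
p_rows <- base + facet_grid(drv ~ .)  # one row of panels per drv value
p_wrap <- base + facet_wrap(~ class)  # one panel per class, wrapped into rows
```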

Geoms

Displacement vs. highway mpg, take 2

How are these plots similar? How are they different?

[plots not shown]

Geometric object

qplot([x], [y], data = [dataset], geom = "smooth")

`point` as geom

Note: point is the default geom when you have x and y, so we can omit it when plotting a scatterplot.
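
In ggplot() syntax the geom is always named explicitly as a layer, which makes it easy to see how swapping geoms changes the plot while the data and mapping stay the same. A minimal sketch; the smoothing method and formula are set explicitly here as an assumption, to avoid relying on defaults:

```r
library(ggplot2)

# Shared data and aesthetic mapping
base <- ggplot(mpg, aes(x = displ, y = hwy))

# Point geom: one point per observation (the scatterplot)
p_point <- base + geom_point()

# Smooth geom: a fitted curve with a confidence band instead of raw points
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x)
```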

`smooth` as geom

Reconstruct the following plot.

[plot not shown]
