September 2, 2014

Skill set of a data scientist

  • data preparation & munging
  • modeling
  • coding
  • visualization
  • communication

Statistical inference

Making sense

The processes in our lives are actually data generating processes.

Give examples of some data generating processes from your day.

We would like ways to describe, understand, and make sense of these processes: as scientists we want to understand the world better, and understanding these processes is also part of the solution to the problems we're trying to solve.

Data as traces

  • Data represents traces of the real-world processes

  • Which traces we gather is decided by the data collection (sampling) method

Cycle of …

  • Data represents traces of the real-world processes

  • Once we have the data, we want to simplify it into something more comprehensible, something that captures it all in a much more concise way

  • These are called statistical estimators

  • The process of going from the world to the data, and then from the data back to the world is the field of statistical inference

Statistical inference

A discipline concerned with the development of procedures, methods, and theorems that allow us to extract meaning and information from data generated by stochastic (random) processes.

Populations and samples

  • Population is the complete set of observations of whatever we are studying, e.g. people, tweets, photographs, etc. (population size = N)

  • Sample is a subset of the population, ideally random and representative (sample size = n)
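As a minimal sketch of the population / sample distinction, the code below builds a made-up population and draws a simple random sample from it. The population (normally distributed heights), the sizes N and n, and the variable names are assumptions chosen only for illustration.

```r
set.seed(1)  # make the random draw reproducible

# Hypothetical population: heights (in cm) of N = 100,000 people
N <- 100000
population <- rnorm(N, mean = 170, sd = 10)

# Simple random sample of size n, drawn without replacement
n <- 500
sample_heights <- sample(population, size = n)

# The sample mean serves as an estimate of the population mean
mean(population)
mean(sample_heights)
```

Rerunning with a different seed or a different n shows how much the estimate moves from one sample to the next.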

Sampling is natural

  • When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis

  • If you generalize and conclude that your entire soup needs salt, that's an inference

  • For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)

Sampling biases

Suppose I want to calculate the average number of kids couples have. Will data from this class (asking each of you how many kids your parents have) yield a biased or unbiased estimate of the value I am interested in? If biased, will it be an over-estimate or an under-estimate?
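One way to check your answer is a small simulation. The sketch below assumes, purely for illustration, that the number of kids per couple follows a Poisson distribution with mean 2, and that every child is equally likely to end up in the class. Neither assumption comes from real data.

```r
set.seed(42)

# Hypothetical population of couples; kids per couple ~ Poisson(2) (an assumption)
n_couples <- 10000
kids <- rpois(n_couples, lambda = 2)

# The quantity of interest: average number of kids per couple
true_mean <- mean(kids)

# Surveying students means sampling children, not couples:
# a couple with k kids appears k times, a couple with 0 kids never appears
reports <- rep(kids, times = kids)
class_mean <- mean(reports)

# Compare the two averages
true_mean
class_mean
```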

Populations and samples of big data

Even if we have access to all of FB's or Google's or Twitter's data corpus, any inferences we make from that data should not be extended to draw conclusions about humans beyond those sets of users for any particular day.

Consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.

From: http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/

Ultimate goal

The ultimate goal is to say (infer) something about the population using data from the sample.

Ideally, this is a statistical statement.

But we start data analysis by describing / visualizing what's in the sample, and use these findings to make conjectures about the population, before using formal modeling and inference techniques to make statistical statements about the population.

Toolkit

ggplot2

R/RStudio + ggplot2

# install the package
install.packages("ggplot2")
# load the package
library(ggplot2)
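Installing is only needed once per machine, while loading is needed in every session. A possible way to combine the two steps is sketched below. Note that ensure_package() is a hypothetical helper written for this example, not a function from R or ggplot2.

```r
# Hypothetical helper: install a package only if it is missing, then attach it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("ggplot2")
```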

Fuel economy

data: mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars

# see the help file
?mpg
# view data
View(mpg)

How many rows and columns does this dataset have? What does each row represent? What does each column represent?

Make a prediction. What relationship do you expect to see between engine size (displ) and mileage (hwy)?
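Besides View(), a few base R functions can help answer the questions about rows and columns from the console. This is a sketch of one possible workflow, not the only way to inspect the data.

```r
library(ggplot2)  # the mpg dataset ships with ggplot2

dim(mpg)    # number of rows, then number of columns
names(mpg)  # column names
str(mpg)    # type and first few values of each column
head(mpg)   # first six rows
```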

Scatterplots

qplot()

qplot([x-variable], [y-variable], data = [dataset])
qplot(displ, hwy, data = mpg)
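qplot() is a shortcut. Recent ggplot2 releases mark it as deprecated, so it may print a warning. The same scatterplot can be written with the full ggplot() grammar, as sketched below (the object name p is just for illustration).

```r
library(ggplot2)

# Same scatterplot as qplot(displ, hwy, data = mpg), written out in full:
# data + aesthetic mapping + a point layer
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

p  # printing the object draws the plot
```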

Displacement vs. highway mpg

How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend?

qplot(displ, hwy, data = mpg)

Additional variables

Can display additional variables with

  • aesthetics (like shape, colour, size), or

  • faceting (small multiples displaying different subsets)

Aesthetics

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

  • color

  • size

  • shape

  • alpha (transparency)

qplot([x], [y], data = [dataset],
   [aesthetic] = [variable to map it to])

Displacement vs. highway mpg + class

qplot(displ, hwy, data = mpg, color = class)

Dive further…

Work in your teams to add color, size, and shape aesthetics to your graph. Experiment. Do different things happen for discrete and continuous variables? What happens when you use more than one aesthetic?

Aesthetics summary

aesthetic   discrete                   continuous
color       rainbow of colors          gradient
size        discrete steps             linear mapping between radius and value
shape       different shape for each   shouldn't (and doesn't) work
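The rows of the summary can be tried out with the sketch below, which maps a discrete variable (drv) and a continuous one (cty) to colour and size. The variable choices and object names are only examples. One caveat: as far as I know, since ggplot2 2.0.0 the default size scale maps values to point area rather than radius, so the size behaviour may differ from the summary depending on your version.

```r
library(ggplot2)

# Discrete variable (drv, drive train) -> one distinct colour per level
p_discrete <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point()

# Continuous variable (cty, city mpg) -> a colour gradient
p_continuous <- ggplot(mpg, aes(displ, hwy, colour = cty)) +
  geom_point()

# Continuous variable mapped to point size
p_size <- ggplot(mpg, aes(displ, hwy, size = cty)) +
  geom_point()

p_discrete
p_continuous
p_size
```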

Faceting

Faceting options

  • Smaller plots that display different subsets of the data

  • Useful for exploring conditional relationships and large data

qplot([x], [y], data = [dataset]) +
   facet_grid(. ~ [variable to facet by])

Displacement vs. highway mpg by cylinders

qplot(displ, hwy, data = mpg) +
  facet_grid(. ~ cyl)

Dive further…

Run the following pieces of code, and describe what each plot displays.

qplot(displ, hwy, data = mpg) +
  facet_grid(drv ~ .)

qplot(displ, hwy, data = mpg) +
  facet_grid(drv ~ cyl)

qplot(displ, hwy, data = mpg) +
  facet_wrap(~ class)

Facet summary

  • facet_grid(): 2d grid, rows ~ cols, . for no split

  • facet_wrap(): 1d ribbon wrapped into 2d
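The two facet functions can be compared side by side with a sketch like the one below. It uses the full ggplot() syntax and a shared base plot. The object names and the choice of ncol = 3 are illustrative assumptions.

```r
library(ggplot2)

# Shared base plot, reused for both faceting styles
base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# facet_grid(): rows ~ columns, here drv in rows and cyl in columns
p_grid <- base + facet_grid(drv ~ cyl)

# facet_wrap(): panels for one variable, wrapped into 3 columns
p_wrap <- base + facet_wrap(~ class, ncol = 3)

p_grid
p_wrap
```

Changing ncol (or using nrow instead) controls how the ribbon of panels is wrapped.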

Geoms

Displacement vs. highway mpg, take 2

How are these plots similar? How are they different?

plot of chunk unnamed-chunk-12

Geometric object

qplot([x], [y], data = [dataset], geom = "smooth")
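In the full ggplot() syntax, each geom is added as its own layer function. A minimal sketch of the smooth geom on the mpg data (the object name p_smooth is just for illustration):

```r
library(ggplot2)

# Smoothed trend of hwy against displ, drawn with a band around the line.
# With no method given, geom_smooth() chooses a smoother based on the data
# and prints a message saying which one it used
p_smooth <- ggplot(mpg, aes(displ, hwy)) +
  geom_smooth()

p_smooth
```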

point as geom

Note: point is the default geom when you have x and y, so we can omit it when plotting a scatterplot.

plot of chunk unnamed-chunk-13

smooth as geom

Reconstruct the following plot.

plot of chunk unnamed-chunk-14