September 2, 2014

Skill set of a data scientist

  • data preparation & munging
  • modeling
  • coding
  • visualization
  • communication

Statistical inference

Making sense

The processes in our lives are actually data generating processes.

Give examples of some data generating processes from your day.

We would like ways to describe, understand, and make sense of these processes, both because as scientists we want to understand the world better and because understanding these processes is part of the solution to the problems we're trying to solve.

Data as traces

  • Data represents traces of the real-world processes

  • Which traces we gather is decided by the data collection (sampling) method

Cycle of …

  • Data represents traces of the real-world processes

  • Once we have the data, we want to simplify it into something more comprehensible, something that captures it all in a much more concise way

  • These concise summaries, such as mathematical models or functions of the data, are called statistical estimators

  • The process of going from the world to the data, and then from the data back to the world is the field of statistical inference

Statistical inference

Discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.

Populations and samples

  • Population is the complete set of observations of whatever we are studying, e.g. people, tweets, photographs, etc. (population size = N)

  • Sample is a subset of the population, ideally random and representative (sample size = n)
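The population / sample distinction can be sketched in a few lines of base R. This is a minimal illustration only: the population below is simulated (N = 10,000 values from a normal distribution), and the names population, smp, pop_mean, and sample_mean are made up for this example.

```r
set.seed(1)

# Hypothetical population: N simulated values (an assumption, not real data).
N <- 10000
population <- rnorm(N, mean = 50, sd = 10)

# Simple random sample of size n, drawn without replacement.
n <- 100
smp <- sample(population, size = n)

# The sample mean is a statistical estimator of the population mean.
pop_mean <- mean(population)
sample_mean <- mean(smp)
c(population = pop_mean, sample = sample_mean)
```

Because the sample is drawn at random, the two means should be close but not identical; rerunning with a different seed gives a slightly different sample mean.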

Sampling is natural

  • When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis

  • If you generalize and conclude that your entire soup needs salt, that's an inference

  • For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)

Sampling biases

Suppose I want to calculate the average number of kids couples have. Will data from this class (asking each of you how many kids your parents have) yield a biased or unbiased estimate of the value I am interested in? If biased, will it be an over-estimate or an under-estimate?
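One way to explore this question is a small simulation. The sketch below assumes a made-up population of couples whose number of kids follows a Poisson distribution with mean 2; that distribution, the class size of 50, and all variable names are illustrative assumptions, not data from the course. The key idea is that asking students samples kids, not couples: a couple with k kids has k chances to be represented, and a couple with no kids can never appear.

```r
set.seed(42)

# Hypothetical population of couples (assumption: kids ~ Poisson(2)).
n_couples <- 100000
kids <- rpois(n_couples, lambda = 2)

# Quantity of interest: average number of kids per couple.
true_mean <- mean(kids)

# View the same population one child at a time: each child reports
# the number of kids their parents have. Couples with 0 kids drop out.
child_view <- rep(kids, times = kids)

# A "class" is a random sample of children, not of couples.
class_size <- 50
class_sample <- sample(child_view, size = class_size)

c(per_couple   = true_mean,
  per_child    = mean(child_view),
  class_sample = mean(class_sample))
```

Under these assumptions the per-child average is larger than the per-couple average, which suggests that the class survey would tend to over-estimate the value of interest.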

Populations and samples of big data

Even if we have access to all of FB's or Google's or Twitter's data corpus, any inferences we make from that data should not be extended to draw conclusions about humans beyond those sets of users for any particular day.

Consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.


Ultimate goal

The ultimate goal is to say (infer) something about the population using data from the sample.

Ideally, this is a statistical statement.

But we start data analysis by describing / visualizing what's in the sample, and use these findings to make conjectures about the population, before using formal modeling and inference techniques to make statistical statements about the population.



R/RStudio + ggplot2

# install the package
install.packages("ggplot2")
# load the package
library(ggplot2)

Fuel economy

data: mpg - fuel economy data from 1999 and 2008 for 38 popular models of cars

# see the help file
?mpg
# view data
View(mpg)
How many rows and columns does this dataset have? What does each row represent? What does each column represent?
Make a prediction. What relationship do you expect to see between engine size (displ) and mileage (hwy)?
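A few standard functions can help check your answers to the questions about rows and columns. This is only a sketch, and it assumes ggplot2 is already installed so that the mpg data set is available.

```r
library(ggplot2)  # provides the mpg data set

dim(mpg)    # number of rows, then number of columns
names(mpg)  # column (variable) names
head(mpg)   # first six rows of the data
```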



qplot([x-variable], [y-variable], data = [dataset])
qplot(displ, hwy, data = mpg)
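Recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated, so the call above may print a warning. A sketch of the same scatterplot written with ggplot() is shown here; the object name p is just a placeholder chosen for this example.

```r
library(ggplot2)

# Same plot as qplot(displ, hwy, data = mpg):
# engine displacement on the x-axis, highway mileage on the y-axis.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

print(p)
```

The template generalizes in the same way as the qplot() version: ggplot([dataset], aes(x = [x-variable], y = [y-variable])) + geom_point().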