- data preparation & munging
- modeling
- coding
- visualization
- communication

September 2, 2014


The processes in our lives are actually data generating processes.

Give examples of some data generating processes from your day.

We would like ways to describe, understand, and make sense of these processes, because as scientists we want to understand the world better, but also because understanding these processes is part of the solution to the problems we're trying to solve.

Data represents traces of the real-world processes

Which processes we gather are decided by the data collection (sampling) method

Once we have the data, we want to simplify it into something more comprehensible, something that captures it in a much more concise way.

These are called

**statistical estimators**

The process of going from the world to the data, and then from the data back to the world, is the field of **statistical inference**.

Discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.

**Population** is the complete set of **observations** of whatever we are studying, e.g. people, tweets, photographs, etc. (population size = N)

**Sample** is a subset of the population, ideally random and representative (sample size = n)
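A minimal R sketch of the population-vs-sample idea, using simulated data (the population size, the variable, and its distribution are all made up for illustration):

```r
# Hypothetical population of N = 10,000 simulated heights (cm)
set.seed(42)                                     # for reproducibility
population <- rnorm(10000, mean = 170, sd = 10)
N <- length(population)                          # population size N

# Draw a random, representative sample of size n = 100
sample_obs <- sample(population, size = 100)
n <- length(sample_obs)                          # sample size n

# The sample mean is an estimator of the population mean
mean(population)   # the "true" value we usually cannot observe
mean(sample_obs)   # our estimate from the sample
```

With a random sample, the sample mean lands close to the population mean; a non-random sample (say, only the tallest individuals) would not.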

When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's **exploratory analysis**.

If you generalize and conclude that your entire soup needs salt, that's an **inference**.

For your inference to be valid, the spoonful you tasted (the sample) needs to be **representative** of the entire pot (the population).

Suppose I want to calculate the average number of kids couples have. Will data from this class (asking each of you how many kids your parents have) yield a biased or unbiased estimate of the value I am interested in? If biased, will it be an over-estimate or an under-estimate?
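One way to see why the classroom sample is biased is a quick simulation (the distribution of kids per couple below is made up): sampling couples *through their kids* over-weights large families and misses childless couples entirely.

```r
# Hypothetical population of couples: number of kids per couple, including 0
set.seed(1)
kids <- sample(0:4, size = 10000, replace = TRUE,
               prob = c(0.2, 0.2, 0.3, 0.2, 0.1))
mean(kids)        # the true average number of kids per couple

# Classroom sampling: each *kid* reports their family size, so a couple
# with k kids is k times as likely to appear, and childless couples never do.
classroom <- rep(kids, times = kids)   # one entry per kid
mean(classroom)   # biased upward relative to mean(kids)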

Even if we have access to all of FB's or Google's or Twitter's data corpus, any inferences we make from that data should not be extended to draw conclusions about humans beyond those sets of users for any particular day.

Consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.

From: http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/

The ultimate goal is to say (infer) something about the population using data from the sample.

Ideally, this is a statistical statement.

But we start data analysis by describing / visualizing what's in the sample, and use these findings to make conjectures about the population, before using formal modeling and inference techniques to make statistical statements about the population.
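That workflow can be sketched in R (simulated sample; the data generating process here is invented for illustration):

```r
# Simulated sample from a hypothetical data generating process
set.seed(7)
x <- rexp(200, rate = 1/5)   # e.g. 200 waiting times with mean ~5 minutes

# Describe and visualize the sample before any formal inference
summary(x)   # five-number summary plus the mean
sd(x)        # spread
hist(x)      # shape of the distribution: here, right-skewed
```

The descriptive step (summary, histogram) suggests conjectures, such as "this process looks exponential", which formal modeling and inference would then test.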

R/RStudio + ggplot2

```r
# install the package
install.packages("ggplot2")

# load the package
library(ggplot2)
```

**data:** `mpg`

- fuel economy data from 1999 and 2008 for 38 popular models of cars

```r
# see the help file
?mpg

# view the data
View(mpg)
```

How many rows and columns does this dataset have? What does each row represent? What does each column represent?

Make a prediction. What relationship do you expect to see between engine size (displ) and mileage (hwy)?

```r
qplot([x-variable], [y-variable], data = [dataset])
```

```r
qplot(displ, hwy, data = mpg)
```