- data preparation & munging
- modeling
- coding
- visualization
- communication
September 2, 2014
The processes in our lives are actually data generating processes.
We would like ways to describe, understand, and make sense of these processes, because as scientists we want to understand the world better, but also because understanding these processes is part of the solution to problems we're trying to solve.
Data represents traces of the real-world processes
Which traces we gather is decided by the data collection (sampling) method
Once we have the data, we want to simplify it into something more comprehensible, something that captures it all in a much more concise way
These are called statistical estimators
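As a minimal sketch of what "simplifying the data" can look like, the R snippet below reduces 1,000 raw values to three summary numbers. The data are simulated, and the variable names and the commute-time scenario are assumptions made for illustration; they are not from the lecture.

```r
# Simulated "trace" of a process: 1,000 hypothetical commute times in minutes.
set.seed(42)                                # make the simulation reproducible
commute <- rnorm(1000, mean = 30, sd = 5)   # assumed process: Normal(30, 5)

# Three numbers that stand in for the 1,000 raw values.
est_mean   <- mean(commute)     # estimate of the typical value
est_sd     <- sd(commute)       # estimate of the spread
est_median <- median(commute)   # a centre that is less sensitive to outliers

round(c(mean = est_mean, sd = est_sd, median = est_median), 2)
```

Because the data were simulated with a known mean and standard deviation, the estimates can be compared against the values that generated them.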
Statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
Population is the complete set of observations of whatever we are studying, e.g. people, tweets, photographs, etc. (population size = N)
Sample is a subset of the population, ideally random and representative (sample size = n)
When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis
If you generalize and conclude that your entire soup needs salt, that's an inference
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)
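The soup analogy can be simulated in a few lines of R. This is a sketch under made-up assumptions: the pot is "unstirred", so the bottom half is saltier than the top half, and all numbers and names (`pot`, `random_spoon`, `top_spoon`) are hypothetical.

```r
set.seed(1)

# Hypothetical population: saltiness of N = 10,000 "spoonfuls" in an unstirred pot.
N   <- 10000
pot <- c(rnorm(N / 2, mean = 1, sd = 0.2),   # top half: less salty
         rnorm(N / 2, mean = 3, sd = 0.2))   # bottom half: saltier

n <- 100
random_spoon <- sample(pot, n)   # simple random sample from the whole pot
top_spoon    <- pot[1:n]         # convenience sample: top of the pot only

pop_mean    <- mean(pot)            # what we would like to know
random_mean <- mean(random_spoon)   # estimate from a representative sample
top_mean    <- mean(top_spoon)      # estimate from an unrepresentative sample

round(c(population = pop_mean, random = random_mean, top_only = top_mean), 2)
```

Both samples have the same size n, but only the random one draws from the whole pot, so only its mean should land near the population mean.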
Even if we have access to all of FB's or Google's or Twitter's data corpus, any inferences we make from that data should not be extended to draw conclusions about humans beyond those sets of users for any particular day.
Consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.
From: http://blogs.hbr.org/2013/04/the-hidden-biases-in-big-data/
The ultimate goal is to say (infer) something about the population using data from the sample.
Ideally, this is a statistical statement.
But we start data analysis by describing / visualizing what's in the sample, and use these findings to make conjectures about the population, before using formal modeling and inference techniques to make statistical statements about the population.
R/RStudio + ggplot2
# install the package
install.packages("ggplot2")
# load the package
library(ggplot2)
data: mpg
- fuel economy data from 1999 and 2008 for 38 popular models of cars
# see the help file
?mpg
# view data
View(mpg)
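`View()` opens a spreadsheet-style viewer, which is most convenient inside RStudio. When working in a plain console or a script, the data frame can be inspected as text instead. The calls below are standard R functions; which ones to use is a choice made here, not something the slides prescribe.

```r
library(ggplot2)   # provides the mpg data set

dim(mpg)           # number of rows and columns
str(mpg)           # variable names, types, and first few values
head(mpg)          # first six rows
summary(mpg$hwy)   # five-number summary plus mean of highway mpg
```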
qplot(displ, hwy, data = mpg)
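Newer ggplot2 releases mark `qplot()` as deprecated, so it may emit a warning. As a sketch, the same scatterplot can be written with the full `ggplot()` grammar; the object name `p` is arbitrary.

```r
library(ggplot2)

# Same plot as qplot(displ, hwy, data = mpg):
# engine displacement on x, highway mpg on y, one point per car.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

print(p)   # draw the plot
```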
Can display additional variables with
- aesthetics (like shape, colour, size), or
- faceting (small multiples displaying different subsets)
Visual characteristics of plotting characters that can be mapped to data are
- color
- size
- shape
- alpha (transparency)
qplot(displ, hwy, data = mpg, color = class)
aesthetics | discrete | continuous |
---|---|---|
color | rainbow of colors | gradient |
size | discrete steps | linear mapping between radius and value |
shape | different shape for each | shouldn't (and doesn't) work |
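To make the table concrete, here is a sketch that maps a discrete and a continuous variable to color, and a discrete variable to shape. It uses `ggplot()` rather than `qplot()`; the choice of `cty` and `drv` as example variables is an assumption made for illustration.

```r
library(ggplot2)

# Discrete variable -> color: one distinct hue per car class.
p_discrete <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

# Continuous variable -> color: a gradient over city mpg.
p_continuous <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point()

# Discrete variable -> shape: one plotting symbol per drive type.
p_shape <- ggplot(mpg, aes(displ, hwy, shape = drv)) +
  geom_point()

# alpha set to a constant (not mapped to data) to reduce overplotting.
p_alpha <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.3)
```

Printing any of these objects, for example `print(p_continuous)`, draws the plot.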
Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data
qplot(displ, hwy, data = mpg) + facet_grid(. ~ cyl)
qplot(displ, hwy, data = mpg) + facet_grid(drv ~ .)
qplot(displ, hwy, data = mpg) + facet_grid(drv ~ cyl)
qplot(displ, hwy, data = mpg) + facet_wrap(~ class)
facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d
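The difference between the two faceting functions shows up in how many panels they create. The sketch below builds one plot of each kind with `ggplot()` and counts the panels to expect from the data; the object names are arbitrary.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# 2d grid: one row per drive type, one column per cylinder count.
# Every row/column combination gets a panel, even if it holds no cars.
p_grid <- base + facet_grid(drv ~ cyl)

# 1d ribbon: one panel per car class, wrapped onto several rows.
p_wrap <- base + facet_wrap(~ class)

# Panels to expect, computed from the data itself.
n_grid_panels <- length(unique(mpg$drv)) * length(unique(mpg$cyl))
n_wrap_panels <- length(unique(mpg$class))
c(grid = n_grid_panels, wrap = n_wrap_panels)
```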
point as geom
Note: point is the default geom when you have x and y, so we can omit it when plotting a scatterplot.
smooth as geom
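The code for these two geoms is not shown here, so the following is only a sketch of how they might be used, written with `ggplot()`. With `qplot()`, the counterpart would be the `geom` argument, for example `geom = "smooth"`. The loess smoother and the object names are choices made for this illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy))

# point as geom: the plain scatterplot.
p_point <- base + geom_point()

# smooth as geom: a fitted curve with a confidence band, no raw points.
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x)

# Layers can be stacked: points with the smooth drawn on top.
p_both <- base +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)
```

Printing `p_both` shows how the smooth summarizes the trend in the point cloud.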