If you are curious about how raw data from the ACS were cleaned and prepared, see the code that the FiveThirtyEight authors used (be warned: it’s a bit outside of the scope of this course!).
Often times, the first step in using statistics to turn information into knowledge is to summarize and describe the raw information - the data. In this lab we explore data on college majors and earnings, specifically the data behind the FiveThirtyEight story The Economic Guide To Picking A College Major.
These data originally come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series. There are many considerations that go into picking a major; two of them are earnings potential and employment prospects. They are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.
lab02-[github_name]
.lab02-shawnsanto
. Or, simply go to the repository you just created (there is a link provided when you created the repository).Clone or download
button in that repository, and copy the git URL (it should end in .git)Files
pane in RStudio.Remember that we first have to configure RStudio Cloud to talk to GitHub. An easy way to do this is to use the use_git_config()
function from the usethis
package. Type the following lines of code in the Console
in RStudio filling in your name and email address.
Currently your project is called Untitled Project
. Update the name of your project to be “Lab 02 - Data wrangling and visualization”.
In this lab we will work with the tidyverse
and fivethirtyeight
packages. Load the packages into the Console
(they have already been installed for you).
Before we introduce the data, let’s do a few things, which hopefully should be review.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
The data frame we will be working with today is called college_recent_grads
and it’s in the fivethirtyeight
package. To find out more about the dataset, type the following in your Console
: ?college_recent_grads
. A question mark before the name of an object will bring up its help file. This command must be run in the Console
. college_recent_grads
is a tidy data frame, with each row representing an observation and each column representing a variable. Take a quick peek at your data frame and view its dimensions with the glimpse()
function.
The description of the variables, i.e. the codebook, is given below.
Header | Description |
---|---|
rank |
Rank by median earnings |
major_code |
Major code, FO1DP in ACS PUMS |
major |
Major description |
major_category |
Category of major from Carnevale et al |
total |
Total number of people with major |
sample_size |
Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
men |
Male graduates |
women |
Female graduates |
sharewomen |
Women as share of total |
employed |
Number employed (ESR == 1 or 2) |
employed_full_time |
Employed 35 hours or more |
employed_part_time |
Employed less than 35 hours |
employed_full_time_yearround |
Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
unemployed |
Number unemployed (ESR == 3) |
unemployment_rate |
Unemployed / (Unemployed + Employed) |
median |
Median earnings of full-time, year-round workers |
p25th |
25th percentile of earnings |
p75th |
75th percentile of earnings |
college_jobs |
Number with job requiring a college degree |
non_college_jobs |
Number with job not requiring a college degree |
low_wage_jobs |
Number in low-wage service jobs |
The college_recent_grads
data frame is a trove of information.
Let’s think about some questions we might want to answer with these data. - Which major has the lowest unemployment rate? - Which major has the highest percentage of women? - How do the distributions of median income compare across major categories? - Do women tend to choose majors with lower or higher earnings?
In the next section we will answer these questions.
In order to answer this question all we need to do is sort the data. We can use the arrange()
function to do this, and sort it by the unemployment_rate
variable. By default, arrange()
sorts in ascending order, which is what we want here – we’re interested in the major with the lowest unemployment rate.
This gives us what we wanted, but not in an ideal form. First, the name of the major barely fits on the page. Second, some of the variables are not that useful (e.g. major_code
, major_category
) and some we might want front and center are not easily viewed (e.g. unemployment_rate
). We can use the select()
function to choose which variables to display and in which order:
Note how easily we expanded our code with adding another step to our pipeline, with the pipe operator: %>%
.
This is looking better, but do we really need all those decimal places in the unemployment variable? Not really!
There are two ways we can address this problem. One would be to round the unemployment_rate
variable in the dataset or we can change the number of digits displayed, without touching the input data. Below are instructions for how you would do both of these:
mutate()
function. In this case, we’re overwriting the existing unemployment_rate
variable, by round
ing it to 4
decimal places.college_recent_grads %>%
arrange(unemployment_rate) %>%
select(rank, major, unemployment_rate) %>%
mutate(unemployment_rate = round(unemployment_rate, digits = 4))
Note that the digits
in options
is scientific digits, and in round
they are decimal places. If you’re thinking “Wouldn’t it be nice if they were consistent?”…so do we. You don’t need to do both of these, that would be redundant. The next exercise asks you to choose one.
To answer such a question we need to arrange the data in descending order. For example, if earlier we were interested in the major with the highest unemployment rate, we would use the following:
The desc()
function specifies that we want unemployment_rate
in descending order.
college_recent_grads %>%
arrange(desc(unemployment_rate)) %>%
select(rank, major, unemployment_rate)
slice(1:3)
at the end of the pipeline.A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found. (Source: Wikipedia
There are three types of incomes reported in this data frame: p25th
, median
, and p75th
. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major.
In order to answer our question about the distributions of median income across major categories, we need to group the data by major_category
. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions. Let’s create a visualization to see the shapes of these distributions.
Function ggplot()
will allow us to create our plots. The first argument is the data frame, and the next argument gives the mapping of the variables of the data to the aes
thetic elements of the plot.
Let’s start simple and take a look at the distribution of all median incomes, without considering major categories.
Along with the plot, we get a message:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is telling us that we might want to reconsider the binwidth we chose for our histogram – or more accurately, the binwidth we didn’t specify. It’s good practice to always think in the context of the data and try out a few binwidths before settling on your final choice. You might ask yourself: “What would be a meaningful difference in median incomes?” $1 is obviously too little, $10000 might be too high.
geom_histogram()
function. So to specify a binwidth of $1000, you would use geom_histogram(binwidth = 1000)
. We can also calculate summary statistics for this distribution using the summarise()
function:college_recent_grads %>%
summarise(min = min(median),
max = max(median),
mean = mean(median),
med = median(median),
sd = sd(median),
q1 = quantile(median, probs = 0.25),
q3 = quantile(median, probs = 0.75))
Next, we facet the plot by major category.
ggplot(data = college_recent_grads, mapping = aes(x = median)) +
geom_histogram() +
facet_wrap( ~ major_category, ncol = 4)
Plot the distribution of median
income using a histogram, faceted by major_category
. Use the binwidth
you chose in the earlier exercise. Now that we’ve seen the shapes of the distributions of median incomes for each major category, we should have a better idea for which summary statistic to use to quantify the typical median income.
Which major category has the highest typical (you’ll need to decide what this means) median income? Use the partial code below, filling it in with the appropriate statistic and function. Also note that we are looking for the highest statistic, so make sure to arrange your resulting data frame in the correct order.
count()
, which first groups the data and then counts the number of observations in each category (see below). Arrange the results so that the major with the lowest observations is in row 1 of the resulting data frame.One section of the FiveThirtyEight story is titled “All STEM fields aren’t the same”. Let’s see if this is true. First, let’s create a new vector called stem_categories
that lists the major categories that are considered STEM fields. Ask your TA what function c()
does if you are not sure. Also, check the help with ?c
.
stem_categories <- c("Biology & Life Science",
"Computers & Mathematics",
"Engineering",
"Physical Sciences")
We can use this to create a new variable in our data frame indicating whether a major is STEM or not.
college_recent_grads <- college_recent_grads %>%
mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem"))
Let’s unpack this: with mutate()
we create a new variable called major_type
, which is defined as "stem"
if the major_category
is in the vector called stem_categories
we created earlier, and as "not stem"
otherwise. %in%
is a logical operator. Other logical operators that are commonly used are
Operator | Operation |
---|---|
x < y |
less than |
x > y |
greater than |
x <= y |
less than or equal to |
x >= y |
greater than or equal to |
x != y |
not equal to |
x == y |
equal to |
x %in% y |
group membership |
x | y |
or |
x & y |
and |
!x |
not |
Here is a small example showing how the operator %in%
works:
x <- c("red", "white", "blue")
y <- c("green", "red", "red", "orange", "white", "black", "purple", "blue")
y %in% x
#> [1] FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
We can use logical operators to filter()
our data. Here will filter based on the conditions that the major type is STEM and the median salary is less than $36,000.