September 19, 2017
Now that you've had a chance to work on a substantial assignment as a team, it's time to look back and evaluate what worked and what didn't. Pick a 20-minute period within these that you can attend as a team, and let me know at the end of class.
library(tidyverse) # provides read_csv(), the dplyr verbs, and ggplot2

fishing <- read_csv("data/Fishing_industry_by_country.csv")

fishing <- fishing %>%
  # Remove Hong Kong and Other as those don't show up on the map
  filter(!(Country %in% c("Hong Kong", "Other"))) %>%
  # Rename some of the countries so they match how they are spelled on the world map
  mutate(
    Country = case_when(
      Country == "Congo, Democratic Republic of the" ~ "Democratic Republic of the Congo",
      Country == "People's Republic of China" ~ "China",
      Country == "Taiwan (Republic of China)" ~ "Taiwan",
      Country == "Russian Federation" ~ "Russia",
      TRUE ~ Country
    ),
    perc_capture = (Capture / (Capture + Aquaculture)) * 100 # For percentages
  )
country_map <- map_data("world")

ggplot(fishing, aes(map_id = Country)) +
  geom_map(data = country_map, map = country_map, aes(map_id = region),
           fill = "white", color = "grey", lwd = 0.1) +
  geom_map(aes(fill = perc_capture), map = country_map, color = "black", lwd = 0.1) +
  expand_limits(x = country_map$long, y = country_map$lat) +
  # x holds longitude and y holds latitude
  labs(x = "Longitude", y = "Latitude", title = "Percentage of Capture by Country") +
  labs(fill = "Percentage\nof Capture") + # Line break in title
  scale_fill_gradient(low = "pink", high = "maroon")
ggplot(mapping = aes(x = year)) +
  geom_point(aes(y = FTTF)) +
  geom_line(aes(y = FTTF)) +
  geom_point(aes(y = FTTTF)) +
  geom_line(aes(y = FTTTF)) +
  geom_point(aes(y = FTNTTF)) +
  geom_line(aes(y = FTNTTF)) +
  geom_point(aes(y = PTF), color = "red") +
  geom_line(aes(y = PTF), color = "red") +
  geom_point(aes(y = GSE)) +
  geom_line(aes(y = GSE)) +
  labs(x = "Year", y = "Percent of Total Instructional Staff",
       title = "Change in Percentage of Part-Time Faculty over Time, 1975-2011") +
  xlim(c(1970, 2020)) +
  annotate("text", x = 2012, y = c(41, 20, 17, 14, 8),
           label = c("Part-time\nfaculty", "Grad students", "Tenured faculty",
                     "Non-tenure-\ntrack faculty", "Tenure-track\nfaculty"),
           color = c("red", rep("black", 4)), hjust = 0)
Design a study comparing average energy levels of people who do and do not exercise – both as an observational study and as an experiment.
Girls who ate breakfast of any type had a lower average body mass index, a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills.
[…]
The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19.
[…]
As part of the survey, the girls were asked once a year what they had eaten during the previous three days.
[…]
What is the explanatory variable and what is the response variable?
Randall Munroe CC BY-NC 2.5 http://xkcd.com/552/
Bivariate relationship: Fitness -> Heart health
Multivariate relationship: Calories + Age + Fitness -> Heart health
Not considering an important variable when studying a relationship can result in Simpson's paradox, which illustrates the effect that omitting an explanatory variable can have on the measure of association between another explanatory variable and a response variable.
In other words, the inclusion of a third variable in the analysis can change the apparent relationship between the other two variables.
Study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a sex bias in graduate admissions.
The data come from six departments. For confidentiality we'll call them A-F.
We have information on whether the applicant was male or female and whether they were admitted or rejected.
First, we will evaluate whether the percentage of males admitted is indeed higher than that of females, overall. Next, we will calculate the same percentages for each department.
ucb_admit <- read_csv("data/ucb_admit.csv")
ucb_admit
## # A tibble: 4,526 x 3
##    Admit    Gender Dept
##    <chr>    <chr>  <chr>
##  1 Admitted Male   A
##  2 Admitted Male   A
##  3 Admitted Male   A
##  4 Admitted Male   A
##  5 Admitted Male   A
##  6 Admitted Male   A
##  7 Admitted Male   A
##  8 Admitted Male   A
##  9 Admitted Male   A
## 10 Admitted Male   A
## # ... with 4,516 more rows
glimpse(ucb_admit)
## Observations: 4,526
## Variables: 3
## $ Admit  <chr> "Admitted", "Admitted", "Admitted", "Admitted", "Admitt...
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male",...
## $ Dept   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", ...
\(P(A | B)\): Probability of event A given event B
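For reference, this notation rests on the standard definition of conditional probability, which the original slides do not spell out:

\[ P(A | B) = \frac{P(A \text{ and } B)}{P(B)}, \qquad P(B) > 0 \]

With count data, this amounts to dividing the number of observations where both A and B occur by the number of observations where B occurs.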
What can you say about the overall gender distribution? Hint: Calculate the following probabilities: \(P(Admit | Male)\) and \(P(Admit | Female)\).
## # A tibble: 2 x 3
##   Gender Admitted Rejected
## * <chr>     <int>    <int>
## 1 Female      557     1278
## 2 Male       1198     1493
What type of visualization would be appropriate for representing these data?
## # A tibble: 2 x 4
##   Gender Admitted Rejected Perc_Admit
##   <chr>     <int>    <int>      <dbl>
## 1 Female      557     1278       0.30
## 2 Male       1198     1493       0.45
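The slides show this table without the code that produced it. The following is a minimal sketch of one way to compute it, not necessarily the approach used in class. It assumes that R's built-in `UCBAdmissions` table holds the same counts as `data/ucb_admit.csv`, so the example runs without the file. The name `overall` is introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

# Assumption: the built-in UCBAdmissions table holds the same counts as
# data/ucb_admit.csv. Expand it to one row per applicant.
ucb_admit <- as.data.frame(UCBAdmissions) %>%
  uncount(Freq) %>%
  mutate(across(everything(), as.character))

# Count applicants by gender and admission status, spread the status into
# columns, then compute P(Admit | Gender) as admitted / total applicants
overall <- ucb_admit %>%
  count(Gender, Admit) %>%
  pivot_wider(names_from = Admit, values_from = n) %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))

overall
```

Here `Perc_Admit` for the `Male` row estimates \(P(Admit | Male)\), and the `Female` row estimates \(P(Admit | Female)\).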
ggplot(ucb_admit, mapping = aes(x = Gender, fill = Admit)) +
  geom_bar(position = "fill")
What can you say about the by-department gender distribution?
## # A tibble: 12 x 4
##    Dept  Gender Admitted Rejected
##  * <chr> <chr>     <int>    <int>
##  1 A     Female       89       19
##  2 A     Male        512      313
##  3 B     Female       17        8
##  4 B     Male        353      207
##  5 C     Female      202      391
##  6 C     Male        120      205
##  7 D     Female      131      244
##  8 D     Male        138      279
##  9 E     Female       94      299
## 10 E     Male         53      138
## 11 F     Female       24      317
## 12 F     Male         22      351
What type of visualization would be appropriate for representing these data?
## # A tibble: 12 x 5
##    Dept  Gender Admitted Rejected Perc_Admit
##    <chr> <chr>     <int>    <int>      <dbl>
##  1 A     Female       89       19       0.82
##  2 A     Male        512      313       0.62
##  3 B     Female       17        8       0.68
##  4 B     Male        353      207       0.63
##  5 C     Female      202      391       0.34
##  6 C     Male        120      205       0.37
##  7 D     Female      131      244       0.35
##  8 D     Male        138      279       0.33
##  9 E     Female       94      299       0.24
## 10 E     Male         53      138       0.28
## 11 F     Female       24      317       0.07
## 12 F     Male         22      351       0.06
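A table like the one above can be computed by adding `Dept` to the grouping. The sketch below again assumes the built-in `UCBAdmissions` table matches `data/ucb_admit.csv`. The names `by_dept`, `dept_comparison`, and `female_higher` are hypothetical and do not come from the original slides.

```r
library(dplyr)
library(tidyr)

# Assumption: the built-in UCBAdmissions table holds the same counts as
# data/ucb_admit.csv. Expand it to one row per applicant.
ucb_admit <- as.data.frame(UCBAdmissions) %>%
  uncount(Freq) %>%
  mutate(across(everything(), as.character))

# Admission counts and rates for each department and gender
by_dept <- ucb_admit %>%
  count(Dept, Gender, Admit) %>%
  pivot_wider(names_from = Admit, values_from = n) %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))

# Put the two rates side by side to see which is higher in each department
dept_comparison <- by_dept %>%
  select(Dept, Gender, Perc_Admit) %>%
  pivot_wider(names_from = Gender, values_from = Perc_Admit) %>%
  mutate(female_higher = Female > Male)

dept_comparison
```

In these data the female admission rate is higher in four of the six departments (A, B, D, and F), even though the overall admission rate is higher for males. This is the reversal that Simpson's paradox describes.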
ggplot(ucb_admit, mapping = aes(x = Gender, fill = Admit)) +
  geom_bar(position = "fill") +
  facet_grid(. ~ Dept) +
  labs(x = "Gender", y = "", fill = "Admission status")