This lab will not be graded - there is nothing to turn in. Please do sit with your lab group, which may be found on Sakai. Next week will be the first team lab; we will introduce the team workflow and some basic linear regression commands.
Plastic pollution is a major and growing problem, negatively affecting oceans and wildlife health. Our World in Data has a lot of great data at various levels including globally, per country, and over time. For this lab we focus on data from 2010.
Additionally, National Geographic recently ran a data visualization communication contest on plastic waste as seen here.
Learning goals for this lab are
The link to this assignment is at https://classroom.github.com/a/MZhDf2xM.
Ask your TA (or better yet, your lab group!) if you are unsure of how to clone your own private repository for this lab.
You will write your answers in the document
230123-lab.Rmd. Before starting the exercises, be sure to
update the author name in the YAML at the top of the .Rmd file. Knit the
document and make sure the resulting PDF file has your name.
We’ll use the tidyverse package for this analysis. You can run the following code to load this package.
library(tidyverse)
The dataset for this assignment can be found as a csv file in the
data folder of your repository. You can read it in using
the following line of code. We will name it
plastic_waste.
plastic_waste <- read_csv("data/plastic-waste.csv")
The variable descriptions are as follows:
code: 3 Letter country code
entity: Country name
continent: Continent name
year: Year
gdp_per_cap: GDP per capita constant 2011 international $, rate
plastic_waste_per_cap: Amount of plastic waste per capita in kg/day
mismanaged_plastic_waste_per_cap: Amount of mismanaged plastic waste per capita in kg/day
mismanaged_plastic_waste: Tonnes of mismanaged plastic waste
coastal_pop: Number of individuals living on/near coast
total_pop: Total population according to Gapminder

Let’s start by taking a look at the distribution of plastic waste per capita in 2010.
ggplot(data = plastic_waste, aes(x = plastic_waste_per_cap)) +
  geom_histogram(binwidth = 0.2)
One country stands out as an unusual observation at the top of the
distribution. One way of identifying this country is to filter the data
for countries where plastic waste per capita is greater than 3.5
kg per person per day. If you’re unfamiliar with the pipe operator or the
filter() function, check out a brief reference here.
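Note that the pipe operator %>% comes from the magrittr package and is
made available when you load dplyr (or the tidyverse); base R (version 4.1
and later) also has a native pipe operator, |>. Either of them will work
just fine for the code in this lab.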
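As a quick check that the two pipes give the same result for a simple filter, the sketch below applies each one to a small made-up tibble. The `toy` data and its values are hypothetical and are not taken from the lab dataset; the code assumes R 4.1 or later so that `|>` is available.

```r
library(tidyverse)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  entity = c("A", "B", "C"),
  plastic_waste_per_cap = c(0.1, 0.5, 3.6)
)

# magrittr pipe (available once the tidyverse is loaded)
with_magrittr <- toy %>%
  filter(plastic_waste_per_cap > 3.5)

# native pipe (base R, version 4.1 or later)
with_native <- toy |>
  filter(plastic_waste_per_cap > 3.5)

# Compare the two results
identical(with_magrittr, with_native)
```

In the lab itself you would use `plastic_waste` in place of `toy`.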
plastic_waste |>
  filter(plastic_waste_per_cap > 3.5)
## # A tibble: 1 x 10
## code entity continent year gdp_per_cap plastic_waste_p~ mismanaged_plas~
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 TTO Trinidad and Tobago North Am~ 2010 31261. 3.6 0.19
## # ... with 3 more variables: mismanaged_plastic_waste <dbl>, coastal_pop <dbl>,
## # total_pop <dbl>
Did you expect this result? You might consider doing some research on Trinidad and Tobago to see why plastic waste per capita is so high there, or whether this is a data error.
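To put an unusual value in context, it can help to list the few largest values rather than only the rows above a fixed cutoff. The sketch below uses `slice_max()` from dplyr on a made-up tibble; `toy` and its values are hypothetical, and with the lab data you would substitute `plastic_waste`.

```r
library(tidyverse)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  entity = c("A", "B", "C", "D"),
  plastic_waste_per_cap = c(0.1, 0.5, 3.6, 0.7)
)

# Keep the 2 rows with the largest plastic_waste_per_cap, largest first
top_two <- toy |>
  slice_max(plastic_waste_per_cap, n = 2)

top_two
```

Seeing how far the top value sits from the next one gives a rough sense of how extreme it is.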
Another way of visualizing numerical data is using density plots. Follow along using the code below:
ggplot(data = plastic_waste, aes(x = plastic_waste_per_cap)) +
  geom_density()
And compare distributions across continents by coloring density curves by continent.
ggplot(data = plastic_waste,
       mapping = aes(x = plastic_waste_per_cap,
                     color = continent)) +
  geom_density()
The resulting plot may be a little difficult to read, so let’s also fill the curves in with colors as well.
ggplot(data = plastic_waste,
       mapping = aes(x = plastic_waste_per_cap,
                     color = continent,
                     fill = continent)) +
  geom_density()
The overlapping colors make it difficult to tell what’s happening
with the distributions in continents plotted first, and hence covered by
continents plotted over them. We can change the transparency level of
the fill color to help with this. The alpha argument takes
values between 0 and 1: 0 is completely transparent and 1 is completely
opaque. There is no way to tell what value will work best, so it’s best
to try a few.
ggplot(data = plastic_waste,
       mapping = aes(x = plastic_waste_per_cap,
                     color = continent,
                     fill = continent)) +
  geom_density(alpha = 0.7)
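Since the text suggests trying a few alpha values, one possible way to do that without copying the plotting code each time is sketched below. The helper `density_with_alpha()`, the `toy` data, and the particular alpha levels are all hypothetical choices made for illustration; they are not part of the lab materials.

```r
library(tidyverse)

# Hypothetical toy data; simulated values for illustration only
set.seed(1)
toy <- tibble(
  continent = rep(c("X", "Y"), each = 20),
  plastic_waste_per_cap = c(rnorm(20, mean = 0.2, sd = 0.05),
                            rnorm(20, mean = 0.3, sd = 0.05))
)

# Hypothetical helper: build the density plot for a given alpha level
density_with_alpha <- function(data, alpha_level) {
  ggplot(data = data,
         mapping = aes(x = plastic_waste_per_cap,
                       color = continent,
                       fill = continent)) +
    geom_density(alpha = alpha_level) +
    labs(title = paste("alpha =", alpha_level))
}

# Build one plot per alpha level, stored in a list
alpha_levels <- c(0.2, 0.5, 0.8)
plots <- map(alpha_levels, function(a) density_with_alpha(toy, a))

# Display the first plot; index the list to view the others
plots[[1]]
```

Comparing the plots side by side makes it easier to judge which transparency level keeps every curve visible.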
This still doesn’t look great…
Notice that we defined the color and fill
of the curves by mapping aesthetics of the plot, but we defined the
alpha level as a characteristic of the plotting
geom.
Now is a good time to commit and push your changes to GitHub with a short, informative commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Yet another way to visualize this relationship is to use side-by-side box plots.
ggplot(data = plastic_waste,
       mapping = aes(x = continent,
                     y = plastic_waste_per_cap)) +
  geom_boxplot()
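The box in each box plot spans the first to third quartile, with a line at the median. To see those numbers directly, a grouped summary along the following lines may help. This is a sketch on a made-up tibble: `toy`, its values, and the column names `q1`, `med`, and `q3` are hypothetical, and with the lab data you would substitute `plastic_waste`.

```r
library(tidyverse)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  continent = c("X", "X", "X", "Y", "Y", "Y"),
  plastic_waste_per_cap = c(0.1, 0.2, 0.3, 0.2, 0.4, 0.9)
)

# Quartiles and median per continent; na.rm = TRUE skips missing values
continent_summary <- toy |>
  group_by(continent) |>
  summarise(
    q1  = quantile(plastic_waste_per_cap, 0.25, names = FALSE, na.rm = TRUE),
    med = median(plastic_waste_per_cap, na.rm = TRUE),
    q3  = quantile(plastic_waste_per_cap, 0.75, names = FALSE, na.rm = TRUE),
    .groups = "drop"
  )

continent_summary
```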
Now is another good time to commit and push your changes to GitHub with a short, informative commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
plastic_waste <- plastic_waste %>%
  filter(plastic_waste_per_cap < 3)
Hint: The colors are from the viridis color palette.
Take a look at the functions starting with
scale_color_viridis_* in the ggplot2
reference page.
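As a generic illustration of how a viridis scale is added to a plot, the sketch below colors points by a discrete variable using `scale_color_viridis_d()`. It uses the `mpg` dataset bundled with ggplot2 rather than the lab data, so the variables shown (`displ`, `hwy`, `drv`) are stand-ins chosen only for this example.

```r
library(tidyverse)

# mpg ships with ggplot2; drv is a discrete variable with three values
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_viridis_d()  # discrete viridis palette for the color aesthetic

p
```

The `_d` suffix is for discrete variables; the ggplot2 reference also lists `_c` (continuous) and `_b` (binned) variants.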
Commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
This lab was adapted from Data Science in a Box.