We will look at the relationship between budget and revenue for movies made in the United States from 1986 to 2016. The data is from the Internet Movie Database (IMDb).
library(readr)
library(tidyverse)
library(DT)
The movies data set includes basic information about each movie, including budget, genre, movie studio, director, etc. A full list of the variables may be found here.
movies <- read_csv("https://raw.githubusercontent.com/danielgrijalva/movie-stats/master/movies.csv")
movies <- movies %>%
  filter(country == "USA",
         !(genre %in% c("Musical", "War", "Western"))) # remove genres with < 10 movies
movies
## # A tibble: 4,868 x 15
## budget company country director genre gross name rating released
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 8.00e6 Columb… USA Rob Rei… Adve… 5.23e7 Stan… R 1986-08…
## 2 6.00e6 Paramo… USA John Hu… Come… 7.01e7 Ferr… PG-13 1986-06…
## 3 1.50e7 Paramo… USA Tony Sc… Acti… 1.80e8 Top … PG 1986-05…
## 4 1.85e7 Twenti… USA James C… Acti… 8.52e7 Alie… R 1986-07…
## 5 9.00e6 Walt D… USA Randal … Adve… 1.86e7 Flig… PG 1986-08…
## 6 6.00e6 De Lau… USA David L… Drama 8.55e6 Blue… R 1986-10…
## 7 9.00e6 Paramo… USA Howard … Come… 4.05e7 Pret… PG-13 1986-02…
## 8 1.50e7 SLM Pr… USA David C… Drama 4.05e7 The … R 1986-08…
## 9 6.00e6 Twenti… USA David S… Come… 8.20e6 Lucas PG-13 1986-03…
## 10 2.50e7 Twenti… USA John Ca… Acti… 1.11e7 Big … PG-13 1986-07…
## # ... with 4,858 more rows, and 6 more variables: runtime <int>,
## # score <dbl>, star <chr>, votes <int>, writer <chr>, year <int>
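The comment in the filter above says the dropped genres have fewer than 10 movies each. One way such genres could be found is sketched below on a small made-up tibble. The names toy_movies, min_movies, rare_genres and toy_kept are hypothetical, and the cutoff is lowered to 2 so the toy data stays short; this is an illustration, not necessarily how the original list was produced.

```r
library(dplyr)

# Made-up stand-in for the movies tibble (not real data)
toy_movies <- tibble(
  name  = c("A", "B", "C", "D", "E"),
  genre = c("Drama", "Drama", "Drama", "Western", "Comedy")
)

# Hypothetical cutoff; the text above uses 10
min_movies <- 2

# Count movies per genre and keep the names of the rare genres
rare_genres <- toy_movies %>%
  count(genre) %>%
  filter(n < min_movies) %>%
  pull(genre)

# Drop the rare genres, mirroring the !(genre %in% ...) filter above
toy_kept <- toy_movies %>%
  filter(!(genre %in% rare_genres))

toy_kept
```

Running count(genre) on the full movies tibble before filtering would show the number of movies in each genre.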
We begin by looking at how the gross revenue (gross) has changed over time. Since we want to visualize the results, we will choose a few genres of interest for the analysis.
genre_list <- c("Horror", "Drama", "Action", "Animation")
movies %>%
  filter(genre %in% genre_list) %>%
  group_by(genre, year) %>%
  summarise(avg_gross = mean(gross)) %>%
  ggplot(mapping = aes(x = year, y = avg_gross, color = genre)) +
  geom_line() +
  ylab("Average Gross Revenue (in US Dollars)") +
  ggtitle("Gross Revenue Over Time")
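To see what the group_by() and summarise() steps do before the result is passed to ggplot(), here is a minimal sketch on made-up numbers. The tibble toy_movies and its values are hypothetical. The arguments na.rm = TRUE and .groups = "drop" are additions that the code above does not use: the first skips missing revenue values, and the second returns an ungrouped result.

```r
library(dplyr)

# Made-up revenues; one value is missing on purpose
toy_movies <- tibble(
  genre = c("Drama", "Drama", "Horror", "Horror", "Horror"),
  year  = c(1990, 1990, 1990, 1991, 1991),
  gross = c(10, 30, 5, 7, NA)
)

# One row per genre-year combination, holding the mean of gross
avg_by_group <- toy_movies %>%
  group_by(genre, year) %>%
  summarise(avg_gross = mean(gross, na.rm = TRUE), .groups = "drop")

avg_by_group
```

Each genre and year pair collapses to a single row, which is why the line plot has one point per genre per year. Without na.rm = TRUE, any group containing a missing value would have an average of NA.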
Next, let’s see the relationship between a movie’s budget and its gross revenue. Because there is a large range of values for budget and revenue, we will plot the log-transformed version of each variable to more easily visualize the relationship. We will talk more about variable transformations later in the semester.
movies %>%
  filter(genre %in% genre_list, budget > 0) %>%
  ggplot(mapping = aes(x = log(budget), y = log(gross), color = genre)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Log-transformed Budget") +
  ylab("Log-transformed Gross Revenue") +
  facet_wrap(~ genre)
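The short sketch below, using made-up budget values, illustrates why the log transformation helps and why the plot keeps only movies with budget > 0. The vector budgets and the other names are hypothetical. In R, log() is the natural logarithm.

```r
# Made-up budgets spanning several orders of magnitude
budgets <- c(1e4, 1e6, 1e8)

# On the raw scale the largest value is 10,000 times the smallest
raw_ratio <- max(budgets) / min(budgets)

# On the log scale the same values are evenly spaced
log_budgets <- log(budgets)
log_steps <- diff(log_budgets)  # each step equals log(100)

# log(0) is -Inf, so zero budgets cannot be placed on a log axis
log_of_zero <- log(0)

raw_ratio
log_steps
log_of_zero
```

Because each 100-fold increase becomes the same step on the log scale, small and large budgets can be read on one axis.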
Put your name in the author field at the top of the file (in the YAML header; we will discuss what this is at a later date). Knit again.
Change the genre names in parts 1 and 2 to genres that interest you. The spelling and capitalization must match the data exactly, so use the Appendix to check them. Knit again.
You have made your first data visualization!
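If a plot comes out with a genre missing after you change the names, a spelling or capitalization mismatch is a likely cause, because %in% compares strings exactly. Here is a minimal sketch, assuming a hypothetical vector genres_in_data that holds a few of the genre names used earlier:

```r
# A few genre names as they appear in the data (taken from genre_list above)
genres_in_data <- c("Action", "Animation", "Drama", "Horror")

# Hypothetical choices; "horror" has the wrong capitalization
wanted <- c("horror", "Drama")

# Which choices match the data exactly?
matches <- wanted %in% genres_in_data  # FALSE TRUE

# Which choices would be silently dropped by filter()?
unmatched <- setdiff(wanted, genres_in_data)  # "horror"

unmatched
```

A misspelled genre does not cause an error; the filter simply returns no rows for it.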
Below is a list of genres in the data set:
movies %>%
  arrange(genre) %>%
  select(genre) %>%
  distinct() %>%
  datatable()