We will look at the relationship between budget and revenue for movies made in the United States from 1986 to 2016. The data is from the Internet Movie Database (IMDb).

library(readr)
library(tidyverse)
library(DT)

Data

The movies data set includes basic information about each movie, such as budget, genre, movie studio, and director. A full list of the variables may be found in the data source listed under References.

movies <- read_csv("https://raw.githubusercontent.com/danielgrijalva/movie-stats/master/movies.csv")
movies <- movies %>%
  filter(country == "USA",
         !(genre %in% c("Musical", "War", "Western"))) # remove genres with < 10 movies
movies
## # A tibble: 4,868 x 15
##    budget company country director genre  gross name  rating released
##     <dbl> <chr>   <chr>   <chr>    <chr>  <dbl> <chr> <chr>  <chr>   
##  1 8.00e6 Columb… USA     Rob Rei… Adve… 5.23e7 Stan… R      1986-08…
##  2 6.00e6 Paramo… USA     John Hu… Come… 7.01e7 Ferr… PG-13  1986-06…
##  3 1.50e7 Paramo… USA     Tony Sc… Acti… 1.80e8 Top … PG     1986-05…
##  4 1.85e7 Twenti… USA     James C… Acti… 8.52e7 Alie… R      1986-07…
##  5 9.00e6 Walt D… USA     Randal … Adve… 1.86e7 Flig… PG     1986-08…
##  6 6.00e6 De Lau… USA     David L… Drama 8.55e6 Blue… R      1986-10…
##  7 9.00e6 Paramo… USA     Howard … Come… 4.05e7 Pret… PG-13  1986-02…
##  8 1.50e7 SLM Pr… USA     David C… Drama 4.05e7 The … R      1986-08…
##  9 6.00e6 Twenti… USA     David S… Come… 8.20e6 Lucas PG-13  1986-03…
## 10 2.50e7 Twenti… USA     John Ca… Acti… 1.11e7 Big … PG-13  1986-07…
## # ... with 4,858 more rows, and 6 more variables: runtime <int>,
## #   score <dbl>, star <chr>, votes <int>, writer <chr>, year <int>
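
The comment in the filter step says the removed genres have fewer than 10 movies each. One way to check a claim like this is to count the rows per genre before applying the genre filter. The sketch below uses a small made-up tibble (`toy_movies` and its values are hypothetical, not the real data) so that it runs without downloading anything, and it uses a threshold of 2 instead of 10 to suit the toy data.

```r
library(dplyr)

# Hypothetical toy data standing in for the movies data set
toy_movies <- tibble(
  name  = c("A", "B", "C", "D", "E", "F"),
  genre = c("Drama", "Drama", "Drama", "Western", "Action", "Action")
)

# Count the movies in each genre, smallest counts first
genre_counts <- toy_movies %>%
  count(genre) %>%
  arrange(n)

# Genres below the chosen threshold (2 here; the comment above uses 10)
small_genres <- genre_counts %>%
  filter(n < 2) %>%
  pull(genre)

small_genres
```

With the real data, the same `count(genre)` call could be applied to the US movies before the genre filter to see which genres fall below 10.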

Analysis

Part 1

We begin by looking at how the gross revenue (gross) has changed over time. Since we want to visualize the results, we will choose a few genres of interest for the analysis.

genre_list <- c("Horror", "Drama", "Action", "Animation")
movies %>%
  filter(genre %in% genre_list) %>% 
  group_by(genre, year) %>%
  summarise(avg_gross = mean(gross)) %>%
  ggplot(mapping = aes(x = year, y = avg_gross, color = genre)) +
    geom_line() +
    ylab("Average Gross Revenue (in US Dollars)") +
    ggtitle("Gross Revenue Over Time")
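
It can help to look at the summarized table before it is passed to `ggplot()`. The sketch below runs the same `group_by()` and `summarise()` steps on a hypothetical toy tibble (all values are made up). The `.groups = "drop"` argument is an optional addition that is not in the code above; it is available in dplyr 1.0.0 and later and returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data: two genres, two years, made-up revenue values
toy_movies <- tibble(
  genre = c("Drama", "Drama", "Action", "Action", "Action"),
  year  = c(1986, 1986, 1986, 1987, 1987),
  gross = c(10, 30, 50, 20, 40)
)

# One row per genre-year combination, holding the mean revenue
avg_by_year <- toy_movies %>%
  group_by(genre, year) %>%
  summarise(avg_gross = mean(gross), .groups = "drop")

avg_by_year
```

Each row of `avg_by_year` becomes one point on a genre's line in the plot.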

Part 2

Next, let’s examine the relationship between a movie’s budget and its gross revenue. Because budget and revenue each span a wide range of values, we will plot the log-transformed version of each variable to make the relationship easier to see. We will talk more about variable transformations later in the semester.

movies %>%
  filter(genre %in% genre_list, budget > 0) %>%
  ggplot(mapping = aes(x = log(budget), y = log(gross), color = genre)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Log-transformed Budget") +
  ylab("Log-transformed Gross Revenue") +
  facet_wrap(~ genre)
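
To get a feel for what the log transformation does, compare a few values before and after taking logs. The budget values below are made up for illustration. In R, `log()` is the natural logarithm, and `log(0)` is `-Inf`, which is one reason to keep only movies with `budget > 0` before transforming.

```r
# Hypothetical budgets spanning several orders of magnitude (made-up values)
budget <- c(1e4, 1e5, 1e6, 1e7, 1e8)

# On the raw scale, the largest value is 10,000 times the smallest
raw_ratio <- max(budget) / min(budget)

# On the log scale, each tenfold increase is the same step, log(10)
log_budget <- log(budget)
log_steps <- diff(log_budget)

# A zero budget cannot be plotted on the log scale
log_zero <- log(0)

raw_ratio
log_steps
log_zero
```

Because equal ratios become equal distances, points that would be crowded near the origin on the raw scale are spread out on the log scale.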

Next Steps

  1. Put your name in the author field at the top of the file (in the YAML header, which we will discuss at a later date). Knit again.

  2. Change the genre names in Parts 1 and 2 to genres that interest you. The spelling and capitalization must match what’s in the data, so you can use the Appendix to see the correct spelling and capitalization. Knit again.
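
A quick way to catch a spelling or capitalization mismatch in step 2 is to compare your list against the genres present in the data. The sketch below is one possible check; `available_genres` is a hypothetical stand-in for the genre column, and the candidate list contains a deliberate mistake.

```r
# Hypothetical genre values standing in for the genre column of the data
available_genres <- c("Action", "Animation", "Comedy", "Drama", "Horror")

# A candidate list with one capitalization mistake
genre_list <- c("Comedy", "horror")

# Entries of genre_list that do not appear in the data
unmatched <- setdiff(genre_list, available_genres)
unmatched
```

With the real data, `setdiff(genre_list, unique(movies$genre))` would perform the same check; an empty result means every name matches.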

You have made your first data visualization!

Discussion Questions

  1. Consider the plot in Part 1.
    • Describe how movie revenue has changed over time.
    • Suppose we use revenue as a measure of popularity. How has the popularity of each genre changed over time? In other words, are the genres that were most popular in 1986 still the most popular in 2016?
  2. Consider the plot in Part 2.
    • Which genre(s) tend to have the highest budgets?
    • In general, what is the relationship between a movie’s budget and its total revenue? Are there any genres that show a different relationship between budget and revenue?

References

  1. https://github.com/danielgrijalva/movie-stats
  2. Internet Movie Database

Appendix

Below is a list of genres in the data set:

movies %>% 
  arrange(genre) %>% 
  select(genre) %>%
  distinct() %>%
  datatable()