Sta 112

You and your teammates work for Paramount Pictures.

Your bosses have just paid a large amount of money to acquire a data set of 651 randomly sampled movies produced and released before 2016. These data include a number of variables, ranging from audience and critic scores on IMDB and Rotten Tomatoes to runtime and whether or not the cast and/or director have won an Oscar.

After spending all this money, your bosses are interested in learning what attributes make a movie popular, as well as any other interesting insights into what makes a movie successful (either critically or in terms of box office gross). They want you to justify their expenditure by putting together a flashy tool that they can show off at the next board meeting.

They don’t care what exactly you do with the data: exploratory data analysis (EDA), visualization, inference, modeling, and/or prediction are all valid approaches. They just want something useful and/or interesting to come out of these data.

Data

The data are provided in your project repo and can be loaded using:

load("movies.Rdata")
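Because load() restores objects under whatever names they were saved with, it can help to check what actually arrived before starting any analysis. The following is a minimal sketch, assuming movies.Rdata sits in your working directory; inspect_rdata is a hypothetical helper name, and the name of the data frame inside the file is not stated here, so the code looks it up instead of assuming it.

```r
# Hypothetical helper: report which objects an .Rdata file contains.
inspect_rdata <- function(path) {
  if (!file.exists(path)) {
    message("Could not find ", path, "; check getwd().")
    return(invisible(NULL))
  }
  # Load into a scratch environment so nothing in the workspace is overwritten
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the restored object names
  for (nm in obj_names) {
    obj <- get(nm, envir = env)
    cat(nm, ": ", class(obj)[1], " with dim ",
        paste(dim(obj), collapse = " x "), "\n", sep = "")
  }
  invisible(obj_names)
}

# The description above implies a data frame with 651 rows
inspect_rdata("movies.Rdata")
```

Once you know the object's name, str() on it gives the column names and types to compare against the codebook.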

The codebook for these data is as follows:

title: Title of movie
title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
thtr_rel_year: Year the movie is released in theaters
thtr_rel_month: Month the movie is released in theaters
thtr_rel_day: Day of the month the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
dvd_rel_month: Month the movie is released on DVD
dvd_rel_day: Day of the month the movie is released on DVD
imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes (response variable)
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director: Director of the movie
actor1: First main actor/actress in the abridged cast of the movie
actor2: Second main actor/actress in the abridged cast of the movie
actor3: Third main actor/actress in the abridged cast of the movie
actor4: Fourth main actor/actress in the abridged cast of the movie
actor5: Fifth main actor/actress in the abridged cast of the movie
imdb_url: Link to IMDB page for the movie
rt_url: Link to Rotten Tomatoes page for the movie

Analysis and Interactivity

The project is purposefully open-ended: you should create some kind of compelling interactive tool that provides insight into this data. There is no limit on what tools or packages you may use, but we strongly recommend using Shiny as an easy way of introducing interactivity into whatever you produce. Your ultimate goal is to provide insight into the data that someone who works for a motion picture studio would be able to use. The data provide plenty of opportunities, but you should not feel constrained by them. If there is additional information or data you think is relevant, you are welcome to scrape or collect it and add it on. Conversely, if you are interested in exploring only a single genre, you are more than welcome to subset the data in any way you would like.
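To give a sense of how little code a basic interactive tool needs, here is a minimal Shiny sketch, not a template you are expected to follow. It assumes the shiny package is installed and that load() restores a data frame named movies (an assumption; check the object name in your file). The helper name make_app and the choice of inputs are purely illustrative.

```r
library(shiny)

# Hypothetical helper: build a small explorer app from the movie data frame.
make_app <- function(movies) {
  # Numeric codebook variables offered as x-axis choices (illustrative selection)
  x_choices <- c("critics_score", "imdb_rating", "runtime", "imdb_num_votes")
  genres <- c("All", sort(unique(as.character(movies$genre))))

  ui <- fluidPage(
    titlePanel("Audience score explorer"),
    sidebarLayout(
      sidebarPanel(
        selectInput("xvar", "Plot audience score against:", choices = x_choices),
        selectInput("genre", "Genre:", choices = genres)
      ),
      mainPanel(plotOutput("scatter"))
    )
  )

  server <- function(input, output) {
    # Reactive subset: recomputed whenever the selected genre changes
    selected <- reactive({
      if (input$genre == "All") {
        movies
      } else {
        movies[which(movies$genre == input$genre), , drop = FALSE]
      }
    })

    output$scatter <- renderPlot({
      d <- selected()
      plot(d[[input$xvar]], d$audience_score,
           xlab = input$xvar, ylab = "audience_score")
    })
  }

  shinyApp(ui = ui, server = server)
}

# Launch only in an interactive session where the data file is available
if (interactive() && file.exists("movies.Rdata")) {
  load("movies.Rdata")  # assumption: restores a data frame called `movies`
  runApp(make_app(movies))
}
```

Wrapping the app in a function that takes the data as an argument makes it easy to reuse with a subset (a single genre, say) or with extra columns you have collected yourself.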

Visualization and EDA are a very good place to start, as they will give you deeper insight into the data and its internal relationships, but modeling, inference, and/or prediction are also useful tools to base your conclusions on. Whatever results you produce, they must be statistically sound and fully justified.
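As one possible starting point for the modeling route, the sketch below fits a linear model with audience_score (marked as the response variable in the codebook) as the outcome and then draws two standard residual diagnostics. The predictors are an arbitrary illustration, not a recommended model; fit_audience_model is a hypothetical helper name, and the code assumes load() restores a data frame called movies.

```r
# Hypothetical helper: fit an illustrative linear model for audience_score.
fit_audience_model <- function(df) {
  lm(audience_score ~ critics_score + runtime + mpaa_rating, data = df)
}

if (file.exists("movies.Rdata")) {
  load("movies.Rdata")  # assumption: restores a data frame called `movies`
  fit <- fit_audience_model(movies)

  # Coefficient estimates, standard errors, and R-squared
  print(summary(fit))

  # Residuals vs. fitted values and a normal Q-Q plot, to check model assumptions
  par(mfrow = c(1, 2))
  plot(fit, which = 1:2)
}
```

Checking diagnostics like these, and saying what they do and do not support, is part of justifying whatever conclusions you draw from a model.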

Your final product should contain the following pieces:

An interactive analysis / insight / report. This can be a standalone R script, a Shiny app, etc. All code should be clearly written and documented.
A brief write-up documenting your process and describing your conclusions. This should include a summary of what you have learned about the data along with relevant statistical arguments supporting your conclusions. It is also a good idea to include at least a brief critique of your own methods and provide suggestions for improving your analysis.
Presentation materials. You will be presenting your results to the class; this can be a combination of slides and/or a live demo of your results.

Presentation format & length

You will give a ten-minute presentation / demo of your work and answer questions from me and your classmates for up to five minutes. Each team member should speak during this presentation and participate in answering questions. The time limit is firm: you will be asked to stop at the end of ten minutes. This is not a lot of time, so you should decide carefully what you will highlight during your presentation and practice to make sure you can fit everything you want to say within the time limit. You are welcome to use slides or a live demo of your Shiny app (or a combination of both) for your presentation. Any slides used should also be included in your GitHub repository.

Presentations will occur during the scheduled final period for the class, Thursday, December 15, 2:00 PM - 5:00 PM. Any group member who does not attend the presentation will receive a 0 for the entire project.

Grading

Your writeup (and accompanying code) and presentation will be graded out of 100 points.

Grading of the project will take into account:

(~10%) Correctness: Are the procedures and explanations correct?
(~40%) Content/Critical thought: Did you think carefully about the problem?
(~10%) Tidiness: Is your code organized well, documented, and easy to understand?
(~20%) Writeup: Did you clearly document your approach and explain your findings?
(~20%) Presentation: What was the quality of the presentation?

Submission

All work will be submitted via the GitHub project repository. This should include your write-up and all code, as well as any additional slides or other materials used during the presentation.

Teamwork and grading