You and your teammates work for Paramount Pictures.
Your bosses have just paid a large amount of money to acquire a data set of 651 randomly sampled movies produced and released before 2016. These data include a number of variables on everything from audience and critic scores from IMDB and Rotten Tomatoes to runtime and whether or not the cast and/or director have won an Oscar.
After spending all this money, your bosses are interested in learning what attributes make a movie popular, as well as any other interesting insights into what makes a movie successful (either critically or in terms of box office gross). They want you to justify their expenditure by putting together a flashy tool that they can show off at the next board meeting.
They don’t care what exactly you do with the data: exploratory data analysis (EDA), visualization, inference, modeling, and/or prediction are all valid approaches. They just want something useful and/or interesting to come out of these data.
The data are provided in your project repo and can be loaded using:
load("movies.Rdata")
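As a hedged sketch of a slightly more defensive way to load the file: the helper below loads the `.Rdata` file into a private environment and returns the first data frame it finds, so you do not need to know the name of the stored object in advance. The function name `load_first_data_frame` and the variable `movies_df` are hypothetical names chosen for this illustration, not part of the project materials.

```r
# Load an .Rdata file into a private environment and return the first
# data frame it contains. (Hypothetical helper, for illustration only.)
load_first_data_frame <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)  # load() returns the names it restored
  for (name in object_names) {
    obj <- get(name, envir = env)
    if (is.data.frame(obj)) {
      return(obj)
    }
  }
  stop("No data frame found in ", path)
}

# Only runs when the file is present in the working directory.
if (file.exists("movies.Rdata")) {
  movies_df <- load_first_data_frame("movies.Rdata")
  str(movies_df)                    # column names and types, to compare with the codebook
  print(colSums(is.na(movies_df)))  # number of missing values per column
}
```

Checking the structure and the missing-value counts first makes it easier to spot columns that need cleaning before any plotting or modeling.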
The codebook for these data is as follows:
title
: Title of movie
title_type
: Type of movie (Documentary, Feature Film, TV Movie)
genre
: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime
: Runtime of movie (in minutes)
mpaa_rating
: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio
: Studio that produced the movie
thtr_rel_year
: Year the movie is released in theaters
thtr_rel_month
: Month the movie is released in theaters
thtr_rel_day
: Day of the month the movie is released in theaters
dvd_rel_year
: Year the movie is released on DVD
dvd_rel_month
: Month the movie is released on DVD
dvd_rel_day
: Day of the month the movie is released on DVD
imdb_rating
: Rating on IMDB
imdb_num_votes
: Number of votes on IMDB
critics_rating
: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score
: Critics score on Rotten Tomatoes
audience_rating
: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score
: Audience score on Rotten Tomatoes (response variable)
best_pic_nom
: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win
: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win
: Whether or not one of the main actors in the movie ever won an Oscar (no, yes). Note that this is not necessarily whether the actor won an Oscar for their role in the given movie.
best_actress_win
: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes). Note that this is not necessarily whether the actress won an Oscar for her role in the given movie.
best_dir_win
: Whether or not the director of the movie ever won an Oscar (no, yes). Note that this is not necessarily whether the director won an Oscar for the given movie.
top200_box
: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director
: Director of the movie
actor1
: First main actor/actress in the abridged cast of the movie
actor2
: Second main actor/actress in the abridged cast of the movie
actor3
: Third main actor/actress in the abridged cast of the movie
actor4
: Fourth main actor/actress in the abridged cast of the movie
actor5
: Fifth main actor/actress in the abridged cast of the movie
imdb_url
: Link to IMDB page for the movie
rt_url
: Link to Rotten Tomatoes page for the movie

The project is purposefully open ended. You should create some kind of compelling interactive tool that provides insight into these data. There is no limit on what tools or packages you may use, but we strongly recommend using Shiny as an easy way of introducing interactivity into whatever you produce. Your ultimate goal is to provide insight into the data that someone who works for a motion picture studio would be able to use. The data provide plenty of opportunities, but you should not feel constrained by them: if there is additional information or data you think is relevant, you are welcome to scrape or collect it and add it on. Conversely, if you are interested in exploring only a single genre, you are more than welcome to subset the data in any way you would like.
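To make the Shiny recommendation and the idea of subsetting by genre more concrete, here is a minimal sketch under stated assumptions. The data frame `movies_toy` contains made-up rows that only borrow column names from the codebook; it is a stand-in, not the real data. The helper `filter_by_genre` is likewise a hypothetical name. The app is built only if the `shiny` package is installed, and is launched only in an interactive session.

```r
# Toy stand-in for the real data: made-up rows that only borrow the
# codebook's column names. Replace with the data loaded from movies.Rdata.
movies_toy <- data.frame(
  title          = c("Film A", "Film B", "Film C", "Film D"),
  genre          = c("Drama", "Comedy", "Drama", "Horror"),
  critics_score  = c(80, 45, 62, 30),
  audience_score = c(75, 50, 70, 41),
  stringsAsFactors = FALSE
)

# Plain function so the filtering logic can be checked outside the app.
# which() drops rows where genre is NA instead of returning NA rows.
filter_by_genre <- function(df, genre) {
  df[which(df$genre == genre), , drop = FALSE]
}

if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::selectInput("genre", "Genre",
                       choices = sort(unique(movies_toy$genre))),
    shiny::plotOutput("scores")
  )

  server <- function(input, output, session) {
    output$scores <- shiny::renderPlot({
      d <- filter_by_genre(movies_toy, input$genre)
      plot(d$critics_score, d$audience_score,
           xlab = "Critics score", ylab = "Audience score",
           xlim = c(0, 100), ylim = c(0, 100), main = input$genre)
    })
  }

  app <- shiny::shinyApp(ui, server)
  # Launch only in an interactive session so the script can also be sourced.
  if (interactive()) shiny::runApp(app)
}
```

Keeping the data manipulation in an ordinary function, separate from the `ui` and `server` definitions, is one possible design choice: it lets you test and document the analysis code without having to run the app.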
Visualization and EDA are a very good place to start, as they will give you deeper insight into the data and their internal relationships, but modeling, inference, and/or prediction are also useful tools on which to base your conclusions. Whatever results you produce, they must be statistically sound and fully justified.
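As one possible illustration of what "statistically sound and fully justified" could look like for a modeling approach, the sketch below fits a simple linear regression and then checks its residuals. The data are simulated purely for illustration: the variable names follow the codebook, but the values and the assumed relationship between them are invented and say nothing about the real data set.

```r
# Simulated data for illustration only; the coefficients 30 and 0.5 and the
# noise level are arbitrary assumptions, not estimates from the real data.
set.seed(1)
n <- 200
critics_score  <- runif(n, min = 0, max = 100)
audience_score <- 30 + 0.5 * critics_score + rnorm(n, sd = 10)
sim <- data.frame(critics_score, audience_score)

# Fit a simple linear model with audience_score as the response.
fit <- lm(audience_score ~ critics_score, data = sim)

# Report estimates with their uncertainty, not just point values.
print(summary(fit)$coefficients)
print(confint(fit, level = 0.95))

# Basic diagnostics: residuals vs fitted values (constant variance,
# linearity) and a normal Q-Q plot of the residuals.
res <- residuals(fit)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(res)
qqline(res)
```

With the real data, the same pattern applies: state the model, report interval estimates, and show the diagnostic checks that support (or undermine) its assumptions.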
Your final product should contain the following pieces:
An interactive analysis / insight / report. This can be a standalone R script / Shiny app / etc. All code should be clearly written and documented.
A brief write-up documenting your process and describing your conclusions. This should include a summary of what you have learned about the data along with relevant statistical arguments supporting your conclusions. It is also a good idea to include at least a brief critique of your own methods and to provide suggestions for improving your analysis.
Presentation materials. You will be presenting your results to the class; this can be a combination of slides and/or a live demo of your results.
You will give a ten-minute presentation / demo of your work and answer questions from me and your classmates for up to five minutes. Each team member should speak during this presentation and participate in answering questions. The time limit is firm: you will be asked to stop at the end of ten minutes. This is not a lot of time, so you should decide carefully what you will highlight during your presentation and practice to make sure you can fit everything you want to say within the time limit. You are welcome to use slides or a live demo of your Shiny app (or a combination of both) for your presentation. Any slides used should also be included in your GitHub repository.
Presentations will occur during the scheduled final period for the class, Thursday, December 15, 2:00 PM - 5:00 PM. Any group member who does not attend the presentation will receive a 0 for the entire project.
Your write-up (and accompanying code) and presentation will be graded out of 100 points.
Grading of the project will take into account:
All work must be submitted via the GitHub project repository. This should include your write-up and all code, as well as any additional slides or other materials used during the presentation.
Team scores for both the proposal and the poster will be adjusted based on team peer evaluation data to determine each student’s individual grade. You will be asked to fill out a survey where you rate the contribution of each team member. Filling out the survey is a prerequisite for receiving a project score.