Project: Showcase your skillz!

Part 1: Eagles

These data were collected by Knight and Skagen during a field study on the foraging behavior of wintering Bald Eagles in Washington State, USA. The data reflect 160 attempts by one (pirating) Bald Eagle to steal a chum salmon from another (feeding) Bald Eagle.

Data: `eagles`

You can load the dataset (called eagles) using the following command:

load(url("http://www.stat.duke.edu/~cr173/Sta102_Su16/Project/eagles.Rdata"))

Codebook

Below is a description of the variables:

Success: Number of successful attempts.
Total: Total number of attempts.
Pirate_Size: Size of pirating eagle.
Pirate_Age: Age of pirating eagle.
Victim_Size: Size of victim eagle.

Data wrangling

Note that the format of these data is somewhat different from what we are used to: each row is a tabulation of many individual attempts for an identical combination of pirate and victim eagle characteristics. For example, the first row tells us that there were 24 attempts where a Large Adult pirate eagle attempted to steal a salmon from another large eagle (17 of which were successful).

We would like to transform these data so that each row instead represents a single attempt. We would then expect to have three of the original columns (Pirate_Size, Pirate_Age, and Victim_Size) and a new column, Attempt, which contains the values Success or Failure. The Total and Success columns are no longer necessary, as their information will now be encoded in the rows. We accomplish this by duplicating rows using subsetting (first for successes and then for failures), adding the Attempt column, and then joining everything together into eagles_tidy.

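library(dplyr)  # provides %>%, select(), and mutate()

# One row per successful attempt
success <- eagles[rep(seq_len(nrow(eagles)), eagles$Success), ] %>% 
  select(Pirate_Size:Victim_Size) %>% 
  mutate(Attempt = "Success")
# One row per failed attempt
failure <- eagles[rep(seq_len(nrow(eagles)), eagles$Total - eagles$Success), ] %>% 
  select(Pirate_Size:Victim_Size) %>% 
  mutate(Attempt = "Failure")
# Stack successes and failures into a single data frame
eagles_tidy <- rbind(success, failure)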
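
As a quick sanity check on this approach, the following minimal sketch applies the same row-duplication idea to a small stand-in data frame. Only the first row (24 attempts, 17 successes) comes from the description above; the second row and the exact level labels are made up for illustration, and `toy` and `expand_rows` are hypothetical names. It uses base R only, so it runs without dplyr.

```r
# Stand-in for `eagles`: row 1 matches the example in the text,
# row 2 is invented purely for illustration.
toy <- data.frame(
  Success     = c(17, 3),
  Total       = c(24, 5),
  Pirate_Size = c("Large", "Small"),
  Pirate_Age  = c("Adult", "Adult"),
  Victim_Size = c("Large", "Large"),
  stringsAsFactors = FALSE
)

# Repeat each row index `times` times and keep only the descriptive columns.
expand_rows <- function(df, times, label) {
  idx <- rep(seq_len(nrow(df)), times)
  out <- df[idx, c("Pirate_Size", "Pirate_Age", "Victim_Size")]
  out$Attempt <- rep(label, nrow(out))
  rownames(out) <- NULL
  out
}

toy_tidy <- rbind(
  expand_rows(toy, toy$Success, "Success"),
  expand_rows(toy, toy$Total - toy$Success, "Failure")
)

# One row per attempt: the row count should equal the total number of attempts,
# and the number of "Success" rows should equal the total number of successes.
stopifnot(nrow(toy_tidy) == sum(toy$Total))
stopifnot(sum(toy_tidy$Attempt == "Success") == sum(toy$Success))
```

The same two checks can be run on eagles_tidy against the original eagles data frame to confirm that no attempts were lost or duplicated.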

Research question

Using these data, what can we say about this pirating behaviour? Specifically address, using statistical inference methods, how the sizes of the pirating and pirated eagles are associated with the outcome. Make sure to clearly identify your hypotheses, conduct the necessary exploratory data analysis and statistical inference, and then explain in detail what the statistical conclusions mean in the biological context.
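
One possible shape for such an inference step is a comparison of two success proportions. The sketch below is only an illustration under stated assumptions: the counts are invented and are not the eagles data, the group labels are hypothetical, and `prop.test()` from base R's stats package is just one of several methods that could be appropriate here.

```r
# Invented counts, NOT the eagles data: successes and attempts for two
# hypothetical groups of pirating eagles.
successes <- c(Large = 30, Small = 12)
attempts  <- c(Large = 50, Small = 40)

# H0: the success probability is the same in both groups.
# HA: the success probabilities differ.
test <- prop.test(x = successes, n = attempts)

test$estimate   # sample proportions for each group
test$p.value    # p-value for the two-sided test
test$conf.int   # 95% confidence interval for the difference in proportions
```

Whatever method you choose, state the hypotheses before running the test and check that its conditions are reasonable for the data at hand.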

Part 2: Movies

You work for Paramount Pictures.

Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies.

She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants your team to figure it all out.

Data: `movies`

You can load the dataset (called movies) using the following command:

load(url("http://www.stat.duke.edu/~cr173/Sta102_Su16/Project/movies.Rdata"))

The data set comprises 651 randomly sampled movies produced and released before 2016.

You might also choose to omit certain observations or restructure some of the variables to make them suitable for answering your research questions.
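
A minimal sketch of both ideas, omitting observations and restructuring a variable, is shown below. The column names follow the codebook, but the rows are invented, `toy_movies` and `summer_release` are hypothetical names, and treating May through August as "summer" is an arbitrary choice made only for illustration.

```r
# Invented rows; column names follow the codebook, values are made up.
toy_movies <- data.frame(
  title_type     = c("Feature Film", "Documentary", "Feature Film", "TV Movie"),
  runtime        = c(110, 95, NA, 88),
  thtr_rel_month = c(7, 12, 3, 10)
)

# Omit observations: keep feature films with a recorded runtime.
feature <- subset(toy_movies, title_type == "Feature Film" & !is.na(runtime))

# Restructure a variable: collapse release month into a two-level factor.
# The choice of May-August as "summer" is an arbitrary illustration.
toy_movies$summer_release <- factor(
  ifelse(toy_movies$thtr_rel_month %in% 5:8, "yes", "no"),
  levels = c("no", "yes")
)
```

If you drop observations or recode variables, say so in your write up and explain why.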

When you are fitting a model you should also be careful about collinearity, as some of these variables may be dependent on each other.
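
One simple way to look for collinearity among numeric predictors is a correlation matrix. The sketch below uses simulated values, not the real movie data: the variable names are borrowed from the codebook, but the relationship between them is built in by hand so that the pattern is easy to see.

```r
set.seed(102)  # arbitrary seed so the simulation is repeatable

# Simulated stand-ins for two score variables that move together
# and one unrelated variable. These are NOT the real movie data.
n <- 200
critics_score <- runif(n, 0, 100)
imdb_rating   <- 2 + 0.06 * critics_score + rnorm(n, sd = 0.5)
runtime       <- rnorm(n, mean = 105, sd = 15)

sim <- data.frame(critics_score, imdb_rating, runtime)

# Pairwise correlations: values near +1 or -1 flag predictors that carry
# largely the same information.
round(cor(sim), 2)
```

When two predictors are strongly correlated, including both in a model can make their individual coefficients unstable and hard to interpret, so it is often reasonable to keep only one of them.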

Codebook

title: Title of movie
title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
thtr_rel_year: Year the movie is released in theaters
thtr_rel_month: Month the movie is released in theaters
thtr_rel_day: Day of the month the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
dvd_rel_month: Month the movie is released on DVD
dvd_rel_day: Day of the month the movie is released on DVD
imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director: Director of the movie
actor1: First main actor/actress in the abridged cast of the movie
actor2: Second main actor/actress in the abridged cast of the movie
actor3: Third main actor/actress in the abridged cast of the movie
actor4: Fourth main actor/actress in the abridged cast of the movie
actor5: Fifth main actor/actress in the abridged cast of the movie
imdb_url: Link to IMDB page for the movie
rt_url: Link to Rotten Tomatoes page for the movie

Research question

Using these data, what can we say about the relationship between audience scores and at least four of the other variables in this dataset?

Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis. It is up to you to decide which variables are meaningful and which should be omitted. For example, information in the actor1 through actor5 variables was used to determine whether the movie casts an actor or actress who won a best actor or actress Oscar.

Make sure to clearly identify your research question based on the variables you choose to include in your analysis, conduct the necessary exploratory data analysis and statistical inference and/or modeling, and then explain in detail what the statistical conclusions mean in the context of these data.
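
If you take a modeling route, a multiple regression fitted with `lm()` is one option. The sketch below is an illustration on simulated data, not the real movies data set: the column names follow the codebook, the values and effect sizes are invented, and only three predictors are used to keep the example short.

```r
set.seed(2016)  # arbitrary seed so the simulation is repeatable

# Simulated data with codebook-style names; values and effects are invented.
n <- 150
sim <- data.frame(
  runtime       = rnorm(n, mean = 105, sd = 15),
  critics_score = runif(n, 0, 100),
  best_pic_nom  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1)))
)
sim$audience_score <- 30 + 0.5 * sim$critics_score +
  5 * (sim$best_pic_nom == "yes") + rnorm(n, sd = 8)

# Numeric and categorical predictors can be mixed in the same formula.
fit <- lm(audience_score ~ runtime + critics_score + best_pic_nom, data = sim)

summary(fit)   # coefficient estimates, standard errors, p-values, R-squared
confint(fit)   # 95% confidence intervals for the coefficients
```

Interpret each coefficient in context, for example as the expected change in audience score per additional critics score point with the other predictors held constant.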

Note: Formal model selection is not required as part of this project.

Logistical details

Due date and submission

The project is due at 8pm on Friday, June 17.

Submit online on Sakai under Assignments. Submissions will be time stamped, and a late penalty will be applied based on the time stamp. Only one submission per team is required. Each submission should include:

R Markdown file (.Rmd)
HTML output (.html)

We will download your R Markdown file and run your code to confirm reproducibility of your work. Grading will be based on the document we compile, so make sure that your R Markdown file contains everything necessary to compile your entire work.

Format & length

Your project should be written using the R Markdown template, so that all R code, output, and plots will be automatically included in your write up.

Download the template for the project:

download.file("http://www.stat.duke.edu/~cr173/Sta102_Su16/Project/sta102_project.Rmd", 
              destfile = "sta102_project.Rmd")

Your write up for each part should not exceed 4 pages including figures. View a print preview to determine length.

You can hide your code in the output by adding the option echo = FALSE in your R chunks. This will help save space. Note that your code will still be evaluated for grading purposes (we’ll be able to see it in your Rmd file).
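
Besides setting the option chunk by chunk, it can be set once for the whole document. The snippet below is a small sketch of that alternative, assuming it is placed in a setup chunk near the top of your Rmd file; `knitr::opts_chunk$set()` is the knitr function for changing default chunk options.

```r
# Placed in a setup chunk near the top of the .Rmd file.
# Requires the knitr package, which R Markdown already uses to knit.
# Every later chunk then hides its code unless it sets echo = TRUE itself.
knitr::opts_chunk$set(echo = FALSE)
```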

Grading

Each part is worth 50 points.

A general breakdown of grading is as follows:

90%-100% - Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
80%-89% - Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
70%-79% - Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
60%-69% - Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
Below 60% - Student is not making a sufficient effort.

The following will result in deductions:

Late: -20 points for one day late (until 8pm on Saturday, June 18). Work turned in any later will not be accepted.
Each page over limit: -5 points per page (view print preview to confirm length)
Reproducibility issues: if we need to make changes to the R Markdown file in order to knit the document, this will also be penalized.

Honor code

You may not discuss this project in any way with anyone outside your team, besides the professor and TAs. Failure to abide by this policy will result in a 0 for all teams involved.

Tips

This project is an opportunity to apply what you have learned about descriptive statistics, graphical methods, correlation and regression, and hypothesis testing and confidence intervals.

The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show that you are proficient at using R at a basic level and that you are proficient at interpreting and presenting the results.

You might consider critiquing your own method, such as issues pertaining to the reliability of the data and the appropriateness of the statistical analysis you used within the context of this specific data set.
