These data were collected by Knight and Skagen during a field study on the foraging behavior of wintering Bald Eagles in Washington State, USA. The data reflect 160 attempts by one (pirating) Bald Eagle to steal a chum salmon from another (feeding) Bald Eagle.
eagles
You can load the dataset (called eagles) using the following command:
load(url("http://www.stat.duke.edu/~cr173/Sta102_Su16/Project/eagles.Rdata"))
Below is a description of the variables:
Success: Number of successful attempts.
Total: Total number of attempts.
Pirate_Size: Size of pirating eagle.
Pirate_Age: Age of pirating eagle.
Victim_Size: Size of victim eagle.
Note that the format of these data is somewhat different from what we are used to: each row is a tabulation of many individual attempts for an identical combination of pirate and victim eagle characteristics. For example, the first row tells us that there were 24 attempts in which a Large Adult pirate eagle tried to steal a salmon from another large eagle (17 of which were successful).
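To make the tabulated layout concrete, here is a minimal sketch on a small stand-in data frame. Only the first row's counts (17 successes out of 24 attempts) come from the description above; the second row, the name eagles_demo, and the level labels are assumptions made purely for illustration.

```r
# Hypothetical stand-in for the first rows of `eagles`.
# Only the first row's counts (17 of 24) come from the text;
# the second row and the level labels are made up for illustration.
eagles_demo <- data.frame(
  Success     = c(17, 5),
  Total       = c(24, 10),
  Pirate_Size = c("Large", "Small"),
  Pirate_Age  = c("Adult", "Adult"),
  Victim_Size = c("Large", "Large")
)

# Each row summarises many attempts, so its success rate is Success / Total.
eagles_demo$Rate <- eagles_demo$Success / eagles_demo$Total

# Total number of individual attempts represented by the table.
n_attempts <- sum(eagles_demo$Total)

print(eagles_demo)
print(n_attempts)
```

On the real data, sum(eagles$Total) should match the 160 attempts mentioned in the description.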
We would like to transform these data so that each row instead represents a single attempt. The result should keep three of the original columns (Pirate_Size, Pirate_Age, and Victim_Size) and gain a new column, Attempt, which contains the value Success or Failure. The Total and Success columns are no longer necessary, as their information will now be encoded in the rows. We accomplish this by duplicating rows using subsetting (first for successes and then for failures), adding the Attempt column to each piece, and then joining everything together into eagles_tidy.
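library(dplyr)  # provides %>%, select(), and mutate()
# Repeat each of the 8 rows once per successful attempt
success <- eagles[rep(1:8, eagles$Success), ] %>%
  select(Pirate_Size:Victim_Size) %>%
  mutate(Attempt = "Success")
# Repeat each of the 8 rows once per failed attempt
failure <- eagles[rep(1:8, eagles$Total - eagles$Success), ] %>%
  select(Pirate_Size:Victim_Size) %>%
  mutate(Attempt = "Failure")
# Stack both pieces so that each row is a single attempt
eagles_tidy <- rbind(success, failure)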
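One way to convince yourself that the expansion worked is to check the row counts afterwards. The sketch below repeats the same idea in base R on a hypothetical two-row table (eagles_demo and its counts are illustrative, not the real data) and then verifies that nothing was lost.

```r
# Hypothetical two-row table in the same layout as `eagles`.
eagles_demo <- data.frame(
  Success     = c(17, 5),
  Total       = c(24, 10),
  Pirate_Size = c("Large", "Small"),
  Pirate_Age  = c("Adult", "Adult"),
  Victim_Size = c("Large", "Large")
)

idx  <- seq_len(nrow(eagles_demo))
keep <- c("Pirate_Size", "Pirate_Age", "Victim_Size")

# Repeat each row once per success, then once per failure.
success_demo <- eagles_demo[rep(idx, eagles_demo$Success), keep]
success_demo$Attempt <- "Success"
failure_demo <- eagles_demo[rep(idx, eagles_demo$Total - eagles_demo$Success), keep]
failure_demo$Attempt <- "Failure"
tidy_demo <- rbind(success_demo, failure_demo)

# The expanded data should have one row per attempt
# and the same number of successes as the tabulated data.
stopifnot(nrow(tidy_demo) == sum(eagles_demo$Total))
stopifnot(sum(tidy_demo$Attempt == "Success") == sum(eagles_demo$Success))

# Cross-tabulating recovers the original counts.
print(table(tidy_demo$Pirate_Size, tidy_demo$Attempt))
```

The same two stopifnot() checks can be run on eagles and eagles_tidy once the real data are loaded.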
Using these data, what can we say about this pirating behaviour? Specifically, use statistical inference methods to address how the sizes of the pirating and pirated eagles are associated with the outcome. Make sure to clearly identify your hypotheses, conduct the necessary exploratory data analysis and statistical inference, and then explain in detail what the statistical conclusions mean in the biological context.
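As a rough illustration of what stating hypotheses and running a test can look like, the sketch below applies a chi-square test of independence to a 2 x 2 table. The counts are hypothetical and are not taken from the eagles data, and the chi-square test is just one possible choice of method, not a required one.

```r
# Hypothetical counts, NOT taken from the eagles data.
size_by_outcome <- matrix(
  c(30, 10,   # Large pirates: successes, failures
    15, 25),  # Small pirates: successes, failures
  nrow = 2, byrow = TRUE,
  dimnames = list(Pirate_Size = c("Large", "Small"),
                  Attempt     = c("Success", "Failure"))
)

# Exploratory step: observed success proportion for each pirate size.
prop_success <- size_by_outcome[, "Success"] / rowSums(size_by_outcome)

# H0: the outcome of an attempt is independent of pirate size.
# HA: the outcome of an attempt is associated with pirate size.
test_demo <- chisq.test(size_by_outcome)

print(prop_success)
print(test_demo)
```

With the real data, a table of the same shape could be built with table(eagles_tidy$Pirate_Size, eagles_tidy$Attempt); remember to check the conditions for whichever test you choose.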
You work for Paramount Pictures.
Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies.
She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants your team to figure it all out.
movies
You can load the dataset (called movies) using the following command:
load(url("http://www.stat.duke.edu/~cr173/Sta102_Su16/Project/movies.Rdata"))
The data set is comprised of 651 randomly sampled movies produced and released before 2016.
You might also choose to omit certain observations or restructure some of the variables to make them suitable for answering your research questions.
When you are fitting a model you should also be careful about collinearity, as some of these variables may be dependent on each other.
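A quick, informal way to look for collinearity is to inspect pairwise correlations among candidate numeric predictors. The sketch below uses a tiny hypothetical data frame (movies_demo, with made-up values) whose column names mirror variables in the dataset; the 0.8 cutoff is an arbitrary illustrative threshold, not a rule.

```r
# Hypothetical values for three numeric variables; not real movie data.
movies_demo <- data.frame(
  imdb_rating   = c(5.5, 6.1, 7.2, 7.9, 8.4, 6.8),
  critics_score = c(35, 48, 75, 88, 95, 60),
  runtime       = c(95, 120, 101, 140, 110, 98)
)

# Pairwise correlations between the candidate predictors.
cor_demo <- cor(movies_demo)
print(round(cor_demo, 2))

# Flag pairs whose absolute correlation exceeds an (arbitrary) cutoff of 0.8.
high <- which(abs(cor_demo) > 0.8 & upper.tri(cor_demo), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_demo)[high[, "row"]],
  var2 = colnames(cor_demo)[high[, "col"]]
)
print(flagged)
```

If two predictors are flagged, it is usually worth including only one of them in the model, or at least discussing the overlap.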
title: Title of movie
title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
thtr_rel_year: Year the movie is released in theaters
thtr_rel_month: Month the movie is released in theaters
thtr_rel_day: Day of the month the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
dvd_rel_month: Month the movie is released on DVD
dvd_rel_day: Day of the month the movie is released on DVD
imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes). Note that this is not necessarily whether the actor won an Oscar for their role in the given movie.
best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes). Note that this is not necessarily whether the actress won an Oscar for her role in the given movie.
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes). Note that this is not necessarily whether the director won an Oscar for the given movie.
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director: Director of the movie
actor1: First main actor/actress in the abridged cast of the movie
actor2: Second main actor/actress in the abridged cast of the movie
actor3: Third main actor/actress in the abridged cast of the movie
actor4: Fourth main actor/actress in the abridged cast of the movie
actor5: Fifth main actor/actress in the abridged cast of the movie
imdb_url: Link to IMDB page for the movie
rt_url: Link to Rotten Tomatoes page for the movie
Using these data, what can we say about the relationship between audience scores and at least four of the other variables in this dataset?
Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis. It is up to you to decide which variables are meaningful and which should be omitted. For example, information in the actor1 through actor5 variables was used to determine whether the movie casts an actor or actress who won a best actor or actress Oscar.
Make sure to clearly identify your research question based on the variables you choose to include in your analysis, conduct the necessary exploratory data analysis and statistical inference and/or modeling, and then explain in detail what the statistical conclusions mean in the context of these data.
Note: Formal model selection is not required as part of this project.
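For the modeling step, a linear regression with audience_score as the response is one natural starting point. The sketch below fits such a model on a small hypothetical data frame (movies_demo, with made-up values) and uses only two predictors for brevity; your own analysis should use the real movies data and at least four other variables.

```r
# Hypothetical values; not real movie data.
movies_demo <- data.frame(
  audience_score = c(40, 55, 72, 85, 90, 63, 47, 78),
  critics_score  = c(35, 48, 75, 88, 95, 60, 42, 81),
  runtime        = c(95, 120, 101, 140, 110, 98, 105, 130)
)

# Multiple linear regression of audience score on two predictors.
fit_demo <- lm(audience_score ~ critics_score + runtime, data = movies_demo)

# Coefficient estimates, standard errors, and p-values.
print(summary(fit_demo))

# 95% confidence intervals for the coefficients.
print(confint(fit_demo))
```

When you interpret the output, describe each slope as the expected change in audience score for a one-unit change in that predictor, holding the other predictors constant.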
The project is due at 8pm on Friday, June 17.
Submit online on Sakai under Assignments. Submissions will be time stamped, and a late penalty will be applied based on the time stamp. Only one submission per team is required.
We will download your R Markdown file and run your code to confirm reproducibility of your work. Grading will be based on the document we compile, so make sure that your R Markdown file contains everything necessary to compile your entire work.
Your project should be written using the R Markdown template, so that all R code, output, and plots will be automatically included in your write up.
Download the template for the proposal:
download.file("http://www.stat.duke.edu/~cr173/Sta102_Su16/Project/sta102_project.Rmd",
              destfile = "sta102_project.Rmd")
Your write up for each part should not exceed 4 pages including figures. View a print preview to determine length.
You can hide your code in the output by adding the option echo = FALSE in your R chunks. This will help save space. Note that your code will still be evaluated for grading purposes (we’ll be able to see it in your Rmd file).
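If you would rather not repeat the option in every chunk, knitr also lets you set it once for the whole document. The line below is a sketch of that alternative; it assumes you place it in a setup chunk near the top of your Rmd file, which is a suggestion here and not something the template is known to contain.

```r
# Hide code in the compiled output for all subsequent chunks.
# The code is still run and remains visible in the Rmd file itself.
knitr::opts_chunk$set(echo = FALSE)
```

An individual chunk can still override the default by setting echo = TRUE in its own header.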
Each part is worth 50 points.
A general breakdown of grading is as follows:
90%-100% - Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
80%-89% - Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
70%-79% - Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
60%-69% - Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
Below 60% - Student is not making a sufficient effort.
The following will result in deductions:
You may not discuss this project in any way with anyone outside your team, besides the professor and TAs. Failure to abide by this policy will result in a 0 for all teams involved.
This project is an opportunity to apply what you have learned about descriptive statistics, graphical methods, correlation and regression, and hypothesis testing and confidence intervals.
The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show that you are proficient at using R at a basic level and at interpreting and presenting the results.
You might consider critiquing your own method, such as issues pertaining to the reliability of the data and the appropriateness of the statistical analysis you used within the context of this specific data set.