Over the last year, Reddit user Stuck_In_the_Matrix has been actively scraping Reddit and making the data publicly available for researchers and other interested parties. The data is collected at the comment level: each entry contains a user's comment along with relevant metadata (subreddit, date and time, score, etc.). The data is stored as multiple JSON text files, where each line is a separate JSON object. Due to the high volume of traffic on Reddit, each of the monthly files is approximately 30 gigabytes. You will be responsible for performing the following analysis tasks using mapreduce and/or Spark.
Data from January 2015 - March 2015 have been made available to you via the HDFS on gort. You can find the data in `/data/reddit/reddit.json` (the full HDFS address is `hdfs://localhost:8020/data/reddit/reddit.json`). Smaller versions of the data are available as `tiny.json` and `small.json`, which contain 10,000 and 1,000,000 comments respectively.
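For reference, a minimal sketch of connecting to Spark and reading the line-delimited JSON, assuming the sparklyr package (the `master` URL is an assumption and should be adjusted to match gort's configuration):

```r
library(sparklyr)
library(dplyr)

# Connect to Spark (master URL is an assumption; adjust for gort)
sc <- spark_connect(master = "local")

# Each line of the file is a standalone JSON object, which is the
# format spark_read_json() expects
reddit <- spark_read_json(
  sc, name = "reddit",
  path = "hdfs://localhost:8020/data/reddit/reddit.json"
)
```

Swapping in `tiny.json` or `small.json` while developing will keep iteration times reasonable.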
Each comment belongs to a particular subreddit. For the given time period, what were the 25 most popular subreddits? We would also like this broken down by month (January to March): create a Billboard-like table for each month that shows the top 25 subreddits for that month along with the change in rank since the previous month. Comment on any subreddits that show a strong positive or negative trend.
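One possible sparklyr/dplyr approach (a sketch, not the required implementation): count comments per subreddit and month in Spark, then rank locally. The field names `subreddit` and `created_utc` (a Unix timestamp) are assumptions based on the published Reddit dump schema.

```r
library(dplyr)

# Count comments per (month, subreddit) in Spark; month() and
# from_unixtime() are translated to Spark SQL by sparklyr
by_month <- reddit %>%
  mutate(month = month(from_unixtime(created_utc))) %>%
  group_by(month, subreddit) %>%
  summarise(n_comments = n()) %>%
  collect()

# Rank within each month locally and keep the top 25
top25 <- by_month %>%
  group_by(month) %>%
  mutate(rank = min_rank(desc(n_comments))) %>%
  filter(rank <= 25) %>%
  arrange(month, rank)
```

Rank changes can then be computed by joining each month's table against the previous month's on `subreddit`.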
Create plots that show the frequency of Reddit comments over the entire time period (aggregated to an hourly level). Also create plots that show the frequency of comments over the days of the week (data should again be at an hourly level). Comment on any patterns you notice, particularly days with unusually large or small numbers of comments.
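A sketch of the hourly aggregation, again assuming sparklyr with ggplot2 for plotting and `created_utc` as Unix seconds:

```r
library(dplyr)
library(ggplot2)

# Truncate each timestamp to the hour and count comments per hour
hourly <- reddit %>%
  mutate(hour_ts = from_unixtime(created_utc, "yyyy-MM-dd HH:00:00")) %>%
  group_by(hour_ts) %>%
  summarise(n_comments = n()) %>%
  collect() %>%
  mutate(hour_ts = as.POSIXct(hour_ts, tz = "UTC"))

ggplot(hourly, aes(x = hour_ts, y = n_comments)) +
  geom_line() +
  labs(x = "Hour", y = "Comments")
```

The day-of-week plots can follow the same pattern, grouping instead on the day name (e.g. via Spark SQL's `date_format`) and the hour of day.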
Valentine’s Day falls on February 14 every year, and our data contains this day for 2015. Pick two other days and perform a word frequency analysis of these three days to see whether what Redditors say on Valentine’s Day appears to be different from what they say on your control days. This does not need to be a fully quantitative analysis, but do make sure to clean up the data (e.g. strip punctuation, normalize capitalization, remove stop words, etc.).
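For the text cleaning, one possible local sketch after a single day's comments have been filtered in Spark and collected into R. The `body` field name comes from the Reddit dump schema; the data frame `feb14`, the tokenization choices, and the tiny ad-hoc stop-word list are all assumptions (a real list, e.g. from the tidytext or tm packages, would be better):

```r
library(dplyr)

word_freq <- function(comments, stop_words) {
  words <- comments %>%
    tolower() %>%                               # normalize case
    gsub("[[:punct:][:digit:]]+", " ", .) %>%   # strip punctuation/digits
    strsplit("\\s+") %>%
    unlist()
  words <- words[words != "" & !(words %in% stop_words)]
  sort(table(words), decreasing = TRUE)
}

# Placeholder stop-word list; feb14 is a hypothetical collected
# data frame of comments from February 14
stop_words <- c("the", "a", "an", "and", "to", "of", "i", "it",
                "is", "in", "that", "you")
head(word_freq(feb14$body, stop_words), 25)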
- `hw6.Rmd` - a write up detailing the specifics of your implementation (e.g. query approach)
- mapreduce/spark jobs - I recommend having one separate file per task that contains the related mapreduce/spark implementation. These files should perform the query and save the result to a local Rdata file (these should not be committed, but should be used by `hw6.Rmd` so the jobs don't need to be rerun; see the save/load sketch after this list).
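A minimal sketch of that save/load pattern (object and file names here are placeholders):

```r
# In the Spark job script: run the query, then cache the collected result
save(top25, file = "results/task1.Rdata")

# In hw6.Rmd: load the cached result instead of rerunning the job
load("results/task1.Rdata")
```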
This homework is due by 11 pm on Sunday, May 1st. You are to complete the assignment as a group and to keep everything (code, write ups, etc.) in your team's github repository (commit early and often). All team members are expected to contribute equally to the completion of this assignment, and group assessments will be given at its completion - anyone judged not to have sufficiently contributed to the final product will have their grade penalized. While different team members may have different coding backgrounds and abilities, it is the responsibility of every team member to understand how and why all code in the assignment works.