Homework 3 - NYC 311

Background

New York City is at the forefront of the open data movement among local, state and federal governments. They have made publicly available a huge amount of data (NYC Open Data) on everything from street trees, to restaurant inspections, to parking violations. For this homework assignment we will be looking at a database of calls made to New York City’s 311 service which provides government information and non-emergency services.

This will be our first foray in to bigish data as the CSV file containing these data is roughly 6 gigabytes in size and contains 10 million observations over 48 variables. This is not so big that we can’t run our analyses on a moderately powerful laptop, but it is large enough that we need to start paying attention to what we are doing and the performance of our code.

The data contains all 311 requests from all five boroughs of New York City between January 1st, 2010 and October 6th, 2015. This data (nyc_311.csv) and additional supplementary data is available on saxon in my Sta523 data folder under nyc: /home/vis/cr173/Sta523/data/nyc/. We will introduce and discuss the additional supplementary datasets in class.

Task 1 - Geocoding

The 311 data contains a large number of variables that we do not care about for the time being. For your first task you will need to geocode as much of the data as possible using the given variables. Note that this data has had minimal cleaning done, there are a large number of errors, omissions, and related issues. Also, note that there is a very large number of requests made, 10 million over the course of the last 5 years. Even under the most optimistic of circumstances you will not be able to, nor should you, use any of the standard web based geocoding services.

In order to be successful at this task you do not need to geocode every address, or even most addresses. The goal is to geocode as many as possible with as much accuracy as possible to enable you to be successful with the 2nd task. This is a messy, large, and complex data set and at the best of times geocoding is a very difficult problem - go for the low hanging fruit first and then work on the edge cases / exceptions later as needed.

Your write up for this task should include a description of any and all cleaning / subsetting / etc. that was done to the data, as well as a description of your geocoding approach(es) and a discussion of how successful each was.

Task 2 - Recreating NYC’s Boroughs

The primary goal of this assignment is to attempt to reconstruct the boundaries of the 5 city boroughs. The data set contains the column, Borough, that lists the borough in which the request ostensibly originated from. Your goal is to take this data along with the geocoded locations from Task 1 and generate a set of spatial polygons (in the GeoJSON format) that represents the boundary of these borough.

As mentioned before, the data is complex and messy so keep in mind that there is no guarantee that the reported borough is correct, or the street address, or even your geocoding. As such, the goal is not perfection, anything that even remotely resembles the borough map will be considered a success. No single approach for this estimation is likely to work well, and an iterative approach to cleaning and tweaking will definitely be necessary. I would suggest initially focusing on a borough to develop your methods before generalizing to the entirety of New York City.

To make things more interesting I will be offering a extra credit for the team that is best able to recreate the borough map as judged by the smallest total area of discrepancy between your predicted polygons and the true map. In order to win the extra credit you must abide by the rules as detailed below. I will maintain a leader board so that you will be able to judge how well you are doing relative to the other teams in the class.

For this task you are expected to produce a GeoJSON file called boroughs.json, for details on formatting see the hw_example repo. Your write up should include a discussion of your approaches to generating the boundaries and at least a simple visualization of your boundaries on top of the Manhattan borough boundaries.

Task 3 - Visualization

The final task you will need to complete for this assignment is a novel visualization of the data - this will be left wholly open ended. The only constraint is that you are limited to only a single plot and you should make it as interesting and informative as possible (it does not need to involve geospatial data or the borough map). You should also include a brief writeup that describes how the visualization is constructed and why you chose this specific aspect of the data to visualize.

Rules

There will is a hard limit of 2 hours of run time for this assignment, I should be able to run make and have the entire analysis and all output finished within 2 hours on saxon.
If you wish to use any additional data source(s), you must first check it with me and I will approve it or not. Additional data may only be used to improve the quality of your geocoding, you are not allowed to use anything beyond the geocoded 311 data to directly estimate the borough boundaries.
You may not use any existing borough data regardless of source. I am aware that the borough boundaries are only a google search away - avoid the temptation. If I suspect that you have used this data at any point in your analysis your whole team will be disqualified from the extra credit and I also reserve the right to penalize your assignment grade for particularly egregious cases. This applies even to instances where you solely use the data to score yourself.
Scores for your predictions will be provided by wercker - final scores will be calculated by running your code from scratch.
Do not create unnecessary copies of the data, you should not create a local copy of nyc_311.csv in your home directory on saxon. If you absolutely must, you can maintain a copy on your own laptop. If you are saving intermediary files, make sure to remove as much unnecessary data as possible (columns and or rows) and save the file in a binary format (e.g. .Rdata). Be aware of the size of your files and your disk usage and be very careful to not commit any large files to git as this can prevent you from being able to push to github.

Submission and Grading

This homework is due by 2 pm Saturday, October 31st. You are to complete the assignment as a group and to keep everything (code, write ups, etc.) on your team’s github repository (commit early and often). All team members are expected to contribute equally to the completion of this assignment and group assessments will be given at its completion - anyone judged to not have sufficient contributed to the final product will have their grade penalized. While different teams members may have different coding backgrounds and abilities, it is the responsibility of every team member to understand how and why all code in the assignment works.

The final product for this assignment include a single document named hw3.Rmd that contains a write up for your approach to both the geocoding of the violation addresses and the reconstruction of the borough boundaries. Your hw3 directory should also include commented R script(s) that implement both of these tasks as well as an appropriate Makefile which can be used to rerun the analyses, generate the boroughs.json file, and compile hw3.Rmd.