
I recently started playing with Apache Pig.
"Pig is a high-level platform for creating MapReduce programs used with Hadoop." (Wikipedia)
"MapReduce is a programming model for processing large data sets." (Wikipedia)
I'm planning to run everything on my MacBook Air with a 128GB SSD, so "large data set" will be relative.
I decided to search NYC Open Data for an interesting dataset and found the NYC Restaurant Inspection Results (only 90MB).
As a quick recap for non-US residents: restaurants get points for violations. A restaurant with an inspection score between 0 and 13 points receives an A, one with 14 to 27 points a B, and one with 28 or more a C. More info here. As a customer you like A, might tolerate B, and want to stay away from C.
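To make the thresholds concrete, here is a minimal Python sketch of the score-to-grade mapping described above. The function name `grade_for_score` is made up for illustration, and rejecting negative scores is my own assumption; the dataset itself already ships with a grade column, so this is only meant to illustrate the rule.

```python
def grade_for_score(score: int) -> str:
    """Map an inspection score (violation points) to a letter grade.

    Thresholds as described in the text: 0-13 -> A, 14-27 -> B, 28+ -> C.
    """
    if score < 0:
        # Assumption: scores are never negative, so treat this as bad input.
        raise ValueError("inspection scores are expected to be non-negative")
    if score <= 13:
        return "A"
    if score <= 27:
        return "B"
    return "C"


if __name__ == "__main__":
    # Print the grade on both sides of each threshold.
    for points in (0, 13, 14, 27, 28, 50):
        print(points, grade_for_score(points))
```

The boundary values (13/14 and 27/28) are the interesting cases: one extra point is enough to move a restaurant down a grade.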
We have the data, but what information do we want to get out of it?
I figured it would be fun (and controversial!) to find out which cuisines tend to be associated with bad ratings. For example, we might find out that Cuban restaurants tend to get a lot of A ratings, whereas Italian restaurants get more Bs.
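Before getting to Pig, here is roughly what that computation looks like as a plain Python sketch: group the records by cuisine, count the grades, and turn the counts into percentages. This is not the Pig script, just the same idea in a few lines. The function name `grade_shares`, the `(cuisine, grade)` input shape, and the sample rows are all invented for illustration, and skipping any grade other than A, B, or C is my assumption.

```python
from collections import Counter, defaultdict

GRADES = ("A", "B", "C")


def grade_shares(rows):
    """Return {cuisine: {grade: percentage}} for an iterable of (cuisine, grade) pairs."""
    counts = defaultdict(Counter)
    for cuisine, grade in rows:
        # Assumption: anything that is not A, B, or C is ignored.
        if grade in GRADES:
            counts[cuisine][grade] += 1

    shares = {}
    for cuisine, per_grade in counts.items():
        total = sum(per_grade.values())
        shares[cuisine] = {g: 100.0 * per_grade[g] / total for g in GRADES}
    return shares


if __name__ == "__main__":
    # Made-up sample rows, not real inspection data.
    sample = [
        ("Cuban", "A"), ("Cuban", "A"), ("Cuban", "B"),
        ("Italian", "A"), ("Italian", "B"), ("Italian", "B"), ("Italian", "C"),
    ]
    for cuisine, pct in sorted(grade_shares(sample).items()):
        print(cuisine, {g: round(p, 1) for g, p in pct.items()})
```

Using percentages rather than raw counts matters here, because some cuisines have far more restaurants than others.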
Here's the Pig Latin (with verbose comments):
After running this script we end up with this file.
And here are the results, rendered with a short JS script (A - blue, B - green, C - red):
Now, while this exercise allowed me to produce pretty graphics, the question I really wanted answered was:
Which cuisines are 'safe' and which should I stay away from?
I extended the Pig Latin script to produce the necessary result.
(You can look at the raw results here)
Based on these results we can create a top-fifteen list of cuisines per grade:
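The ranking step itself is simple. As a sketch in Python (again, not the Pig script): given a percentage per grade for each cuisine, sort by the share of the grade you care about and keep the first fifteen. The function name `top_cuisines`, the shape of the `shares` dictionary, and the numbers in the demo are hypothetical; breaking ties alphabetically is my own choice.

```python
def top_cuisines(shares, grade, n=15):
    """Return the n cuisines with the highest percentage of the given grade.

    shares: {cuisine: {"A": pct, "B": pct, "C": pct}}
    Result: list of (cuisine, pct) pairs, highest percentage first.
    """
    ranked = sorted(
        shares.items(),
        # Highest share first; ties broken alphabetically by cuisine name.
        key=lambda item: (-item[1][grade], item[0]),
    )
    return [(cuisine, pct[grade]) for cuisine, pct in ranked[:n]]


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    demo = {
        "Cuisine X": {"A": 90.0, "B": 8.0, "C": 2.0},
        "Cuisine Y": {"A": 70.0, "B": 20.0, "C": 10.0},
        "Cuisine Z": {"A": 80.0, "B": 15.0, "C": 5.0},
    }
    for grade in ("A", "B", "C"):
        print(grade, top_cuisines(demo, grade, n=2))
```

One caveat with this kind of ranking: a cuisine with only a handful of restaurants can easily land at the top or bottom of a list, so a minimum-count filter might be worth adding.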
Conclusion: The good ol' street cart seems safe (Hotdogs/Pretzels, Nuts/Confectionary), and so do Armenian and Ethiopian. On the other end, try to stay away from Chinese fusion (Chinese/Cuban, Chinese/Japanese) and Polynesian.
Disclaimer: Obviously, please take all of this with a grain of salt. The purpose of this exercise was to have some fun with Pig (yeah I said it), not to produce a 'proper' analysis of NYC restaurant ratings.
I, for one, enjoyed working with Pig quite a bit. The language is in its early stages but mature enough to be fun to work with. It seems obvious to have an abstraction like this that keeps you from writing many, many lines of boilerplate MapReduce code. That said, as with all abstractions, it's crucial to understand what happens under the hood. Especially when working with large data sets, execution time is super important (a simple group all or dump statement can cause a lot of trouble). For a good introduction to Hadoop and Pig I recommend this and this video.
My next step will be to use a bigger dataset and run the whole thing on a real Hadoop cluster instead of my local machine.