mrjob workshop


Introduction

mapreduce

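Roughly speaking, mapreduce is a programming model for filtering, sorting and aggregating data in a distributed fashion. A map step turns each input record into key-value pairs (dropping the records it does not need), the framework sorts and groups those pairs by key, and a reduce step combines the values for each key into a result.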

mapreduce diagram

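The benefit of mapreduce shows up with large datasets: the map and reduce operations can run in parallel across many machines in a distributed computing environment, so the work is not limited by what a single machine can handle.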
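
To make the phases concrete, here is a minimal single-machine sketch of the classic word count, written as a map step, a shuffle (group by key) step and a reduce step. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are made up for illustration. A real mapreduce framework would run the map and reduce calls on many machines and do the shuffle for you.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle/sort: group all emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return [reduce_phase(key, values) for key, values in shuffle_phase(pairs)]


if __name__ == "__main__":
    # Illustrative input only.
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, count in word_count(sample):
        print(word, count)
```

Each call to map_phase only looks at one line and each call to reduce_phase only looks at one key, which is what lets a framework spread those calls over many machines.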

Some example uses of mapreduce:

http://www.datasalt.com/2012/12/mapreduce-real-use-cases/

http://stevekrenzel.com/finding-friends-with-mapreduce

mrjob

mrjob is a Python library for writing mapreduce jobs and running them in several ways: locally on your own machine, on a Hadoop cluster, or in the cloud on Amazon Elastic MapReduce (EMR).

Why mrjob?

https://pythonhosted.org/mrjob/guides/why-mrjob.html#overview