mrjob workshop


Running on EMR

AWS Elastic MapReduce (EMR) is a hosted, software-as-a-service version of MapReduce and Apache Hadoop. It handles setting up Hadoop and running all of the software needed for MapReduce jobs.

mrjob on EMR performs these steps for you (see the sketch after this list):

  1. setting up an EMR cluster
  2. copying data to an S3 bucket (optional)
  3. running mapreduce jobs
  4. collecting stats
  5. streaming the results to you or into S3
  6. cleaning up or terminating EMR clusters
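
All of these steps happen behind a single runner call. As a rough illustration, here is how a job can be driven from Python instead of the command line; this is only a sketch, assuming a job class such as MRWordCount in the wordcount.py used later in this section and an mrjob version that still provides stream_output():

from wordcount import MRWordCount  # the workshop's word count job (assumed name)

# '-r emr' tells mrjob to handle cluster setup, S3 uploads, the run
# itself, and result collection for us.
job = MRWordCount(args=['-r', 'emr', 'words.txt'])

with job.make_runner() as runner:
    runner.run()                           # set up, upload, and run the job
    for line in runner.stream_output():    # stream the results back
        word, count = job.parse_output_line(line)
        print('%s: %s' % (word, count))
# leaving the with-block triggers cleanup/termination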

For this section, we are going to set up the EMR configuration and run some jobs on EMR.

Setup

First, create a file called ~/.mrjob.conf (in your home directory):

runners:
  emr:
    aws_access_key_id: <key>
    aws_secret_access_key: <secret>

You will need to replace <key> and <secret> with the credentials provided for the workshop.
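
If you want to sanity-check the file, note that it is plain YAML. A quick, hypothetical check, assuming PyYAML is installed (mrjob reads the file as YAML when PyYAML is available):

import os
import yaml  # PyYAML

with open(os.path.expanduser('~/.mrjob.conf')) as f:
    conf = yaml.safe_load(f)

# print which options the emr runner has set, without echoing the secret
print(sorted(conf['runners']['emr'].keys()))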

When you run with the -r emr runner, a new EMR cluster will be created for you by default. That process takes quite a while, so instead let's use one of the most powerful features of mrjob: reusing EMR job flows.
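
The workshop provides a running job flow for you, but for reference: depending on your mrjob version, you can usually create a persistent, idle job flow yourself with a bundled helper tool, something like:

$ python -m mrjob.tools.emr.create_job_flow

which prints the ID of a new job flow to pass to --emr-job-flow-id. Treat this as a sketch and check the docs for your mrjob version for the exact tool.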

Next, let's start one of our earlier jobs.

Replace <CHANGE ME> with the job flow ID provided by the workshop:

$ python wordcount.py -r emr --emr-job-flow-id=<CHANGE ME> words.txt
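
Here, wordcount.py is assumed to be the word count job written earlier in the workshop; a minimal version of such a job looks roughly like this:

from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit one (word, 1) pair per whitespace-separated token
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # total up the occurrences of each word
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()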

If the job starts successfully, navigate to http://localhost:40098/jobdetails.jsp to view the status and statistics.

Exercises