AWS EMR (Elastic MapReduce) is a hosted version of MapReduce and Apache Hadoop. It handles setting up Hadoop and running all the software needed for MapReduce jobs.
mrjob on EMR performs these steps for you:
In this section, we are going to set up the EMR configuration and run some jobs on EMR.
First, create a file called ~/.mrjob.conf (in your home directory):
runners:
  emr:
    aws_access_key_id: <key>
    aws_secret_access_key: <secret>

You will need to replace <key> and <secret> with the workshop-provided ones.
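If you would rather generate the file than type it by hand, the layout above can be sketched in Python. This is only an illustration: the helper names render_mrjob_conf and write_mrjob_conf are hypothetical and not part of mrjob, and the sketch assumes the key and secret need no YAML quoting.

```python
import os


def render_mrjob_conf(access_key, secret_key):
    """Return the text of a minimal mrjob.conf for the emr runner."""
    # Indentation matters: "emr" nests under "runners", the keys under "emr".
    return (
        "runners:\n"
        "  emr:\n"
        f"    aws_access_key_id: {access_key}\n"
        f"    aws_secret_access_key: {secret_key}\n"
    )


def write_mrjob_conf(path, access_key, secret_key):
    """Write the config text to `path` (hypothetical helper)."""
    # A newly created file gets mode 0o600 so other users cannot read the
    # secret; an existing file keeps whatever permissions it already has.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(render_mrjob_conf(access_key, secret_key))


# Preview using the placeholders from the text; nothing is written to disk.
print(render_mrjob_conf("<key>", "<secret>"))
```

Calling write_mrjob_conf(os.path.expanduser("~/.mrjob.conf"), key, secret) with the workshop values would produce the file described above.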
When you run with the -r emr runner, mrjob by default creates a new EMR job flow
(cluster) for you. This takes quite a while, so instead let's use one of the most powerful
features of mrjob: reusing EMR job flows.
Next, let's run one of our earlier jobs.
Replace <CHANGE ME> with the job flow ID provided:
$ python wordcount.py -r emr --emr-job-flow-id=<CHANGE ME> words.txt

If the job starts successfully, navigate to http://localhost:40098/jobdetails.jsp to view the status and statistics.
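If you plan to rerun the job several times, the command line above can be assembled in a small script. The following is a minimal sketch: build_emr_command is a hypothetical helper, "j-EXAMPLE" is a made-up job flow ID, and the only mrjob options used are the ones shown in the command above. Newer mrjob releases are understood to name this option --cluster-id, so check the script's --help output if the flag is rejected.

```python
import shlex


def build_emr_command(script, job_flow_id, input_path):
    """Build the argument list for running an mrjob script on an existing job flow."""
    return [
        "python", script,
        "-r", "emr",                           # select the EMR runner
        f"--emr-job-flow-id={job_flow_id}",    # reuse a running job flow
        input_path,
    ]


# "j-EXAMPLE" is a placeholder; substitute the job flow ID you were given.
cmd = build_emr_command("wordcount.py", "j-EXAMPLE", "words.txt")

# Print the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually launch it: import subprocess; subprocess.run(cmd, check=True)
```

Building the arguments as a list rather than one string avoids shell quoting problems if an input path contains spaces.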