AWS EMR (Elastic MapReduce) is a managed, hosted version of MapReduce and Apache Hadoop. It handles setting up Hadoop and running all the software needed for MapReduce jobs.
mrjob on EMR performs these steps for you:
In this section, we are going to set up the EMR configuration and run some jobs on EMR.
First, create a file called ~/.mrjob.conf (in your home directory):
runners:
  emr:
    aws_access_key_id: <key>
    aws_secret_access_key: <secret>
You will need to replace <key> and <secret> with the ones provided for the workshop.
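If you would rather generate the file than type it by hand, the following is a minimal sketch of one way to do it. The helper names `render_mrjob_conf` and `write_mrjob_conf` are hypothetical (they are not part of mrjob), and `EXAMPLE_KEY` / `EXAMPLE_SECRET` are made-up placeholders. The demo writes to a temporary directory so that it does not overwrite a real ~/.mrjob.conf.

```python
import os
import tempfile


def render_mrjob_conf(access_key, secret_key):
    """Return mrjob.conf text with the EMR runner credentials filled in."""
    # The file is YAML, so nesting is expressed with leading spaces.
    return (
        "runners:\n"
        "  emr:\n"
        "    aws_access_key_id: {0}\n"
        "    aws_secret_access_key: {1}\n"
    ).format(access_key, secret_key)


def write_mrjob_conf(path, access_key, secret_key):
    """Write the rendered config to `path` and restrict it to the current user."""
    with open(path, "w") as f:
        f.write(render_mrjob_conf(access_key, secret_key))
    os.chmod(path, 0o600)  # the file holds a secret key


# Demo against a temporary file; the credentials are placeholders, not real ones.
demo_path = os.path.join(tempfile.mkdtemp(), ".mrjob.conf")
write_mrjob_conf(demo_path, "EXAMPLE_KEY", "EXAMPLE_SECRET")
with open(demo_path) as f:
    print(f.read())
```

To write the real file, you would pass `os.path.expanduser("~/.mrjob.conf")` as the path, together with the workshop credentials.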
When you run with the -r emr runner, an EMR stack is created for you by default. This process takes quite a while, so instead let's use one of the most powerful features of mrjob: reusing EMR job flows, so that the job runs on a cluster that is already up.
Next, let's start one of our old jobs. Replace <CHANGE ME> with the job flow ID provided:
$ python wordcount.py -r emr --emr-job-flow-id=<CHANGE ME> words.txt
If the job starts successfully, navigate to http://localhost:40098/jobdetails.jsp to view the status and statistics.
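If you want to launch the same job from a script, a minimal sketch could look like the one below. It only assembles the command line shown above and passes it to `subprocess`. The helper names `build_emr_command` and `run_on_job_flow` are hypothetical, and `j-EXAMPLE` is a made-up job flow ID, so substitute the one you were given.

```python
import subprocess
import sys


def build_emr_command(script, job_flow_id, input_path):
    """Build the argument list for running an mrjob script on an existing job flow."""
    return [
        sys.executable, script,
        "-r", "emr",
        "--emr-job-flow-id={0}".format(job_flow_id),
        input_path,
    ]


def run_on_job_flow(script, job_flow_id, input_path):
    """Run the job and return its exit code (0 means success)."""
    return subprocess.call(build_emr_command(script, job_flow_id, input_path))


# Print the arguments that would follow the Python interpreter,
# without actually starting the job.
print(" ".join(build_emr_command("wordcount.py", "j-EXAMPLE", "words.txt")[1:]))
```

Keeping the command builder separate from the function that runs it lets you inspect the arguments before anything is submitted to EMR.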