mrjob workshop


Multistep Jobs

Multistep jobs are required when a single map and reduce pass is not enough, for example when the output of one reducer has to be aggregated again by another reducer.

For this section, we are going to find the animal that was seen the most during a scientific observation, counting only animals that have the n-gram "at" in their name (I suck at examples...).

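We will see a few new concepts: defining more than one step with the steps method, and filtering input lines with mapper_pre_filter before they reach the mapper.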

Setup

First, create a file called multistep.py in your current folder with this as the contents:

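from mrjob.job import MRJob
from mrjob.step import MRStep


class MultistepJob(MRJob):

    def mapper(self, _, value):
        # Each input line looks like "<count> <animal>", e.g. "18 bat".
        count, animal = value.split()

        yield animal, int(count)

    def reducer(self, animal, counts):
        # Send every (total, animal) pair to the same key so that a single
        # reducer call in the next step sees all of them.
        yield None, (sum(counts), animal)

    def reducer_max(self, _, values):
        # Pairs compare by total first, so max() picks the most-seen animal.
        count, animal = max(values)
        yield count, animal

    def steps(self):
        return [
            # Step 1: keep only lines containing "at", then total per animal.
            MRStep(mapper_pre_filter='grep at',
                   mapper=self.mapper,
                   reducer=self.reducer),
            # Step 2: pick the animal with the highest total.
            MRStep(reducer=self.reducer_max)
        ]

if __name__ == '__main__':
    MultistepJob.run()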
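As a rough mental model, mapper_pre_filter='grep at' means each input line is piped through the shell command grep at before it reaches the mapper, so lines without the substring "at" never get mapped. The sketch below imitates that in plain Python, without mrjob. The helper name grep_like is made up for this illustration, and it only handles a fixed substring rather than a full grep pattern.

```python
def grep_like(lines, needle):
    """Keep only the lines containing needle (a stand-in for `grep needle`)."""
    return [line for line in lines if needle in line]


# The same observations used in this section.
observations = ["1 dog", "17 goat", "1 cat", "66 dog", "3 cat",
                "8 llama", "14 rat", "18 bat", "2 bat"]

# Only these lines would be handed to the mapper.
kept = grep_like(observations, "at")
print(kept)
```

Note that the filter looks at the whole line, not just the animal name. That makes no difference here because the counts are digits and cannot contain "at".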

Second, create a file called animals.txt in your current folder with this as the contents (these are the observed counts for each animal):

1 dog
17 goat
1 cat
66 dog
3 cat
8 llama
14 rat
18 bat
2 bat

Finally, let's run the job:

$ python multistep.py -r local animals.txt

Check the output. You should get something like:

...
20      "bat"
...
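To see where 20 and "bat" come from, the two steps can be traced by hand in plain Python. This is only an illustrative simulation under simple assumptions: the run_steps function is hypothetical, it does not use mrjob, and it runs everything in one process instead of distributing the work.

```python
from collections import defaultdict


def run_steps(lines):
    # Pre-filter: keep lines containing "at", as `grep at` would.
    filtered = [line for line in lines if "at" in line]

    # Step 1 mapper: turn "<count> <animal>" into (animal, count) pairs.
    pairs = []
    for line in filtered:
        count, animal = line.split()
        pairs.append((animal, int(count)))

    # Shuffle: group the counts by animal between map and reduce.
    grouped = defaultdict(list)
    for animal, count in pairs:
        grouped[animal].append(count)

    # Step 1 reducer: one (total, animal) pair per animal.
    totals = [(sum(counts), animal) for animal, counts in grouped.items()]

    # Step 2 reducer: tuples compare by total first, so max() picks the winner.
    return max(totals)


observations = ["1 dog", "17 goat", "1 cat", "66 dog", "3 cat",
                "8 llama", "14 rat", "18 bat", "2 bat"]

result = run_steps(observations)
print(result)  # (20, 'bat')
```

Dog has the highest raw total (67), but it is dropped by the filter, which leaves goat (17), cat (4), rat (14) and bat (18 + 2 = 20).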

Exercises