Multistep jobs are required when you need to do more mapping and reducing...
For this section, we are going to find the animal that was seen the most
during a scientific observation, counting only animals that have the ngram at
in their name
(I suck at examples...).
We will see a few new concepts:
command
steps
pre_filters
First, create a file called multistep.py in your current folder with
this as the contents:
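from mrjob.job import MRJob


class MultistepJob(MRJob):

    def mapper(self, _, value):
        # Each input line looks like "<count> <animal>".
        count, animal = value.split()
        yield animal, int(count)

    def reducer(self, animal, count):
        # Sum the sightings per animal and emit every total under one key
        # (None) so the next step sees them all together.
        yield None, (sum(count), animal)

    def reducer_max(self, _, values):
        # Pairs compare by count first, so max() picks the most-seen animal.
        yield max(values)

    def steps(self):
        return [
            # The pre-filter pipes each input line through `grep at`
            # before it reaches the mapper.
            self.mr(mapper_pre_filter='grep at',
                    mapper=self.mapper,
                    reducer=self.reducer),
            self.mr(reducer=self.reducer_max)
        ]


if __name__ == '__main__':
    MultistepJob.run()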
And second, create a file called animals.txt in your current folder with
this as the contents (these will be the observational counts of animals that
were seen):
1 dog
17 goat
1 cat
66 dog
3 cat
8 llama
14 rat
18 bat
2 bat
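Before running the job, it can help to see which of these lines the grep at pre-filter would let through. The sketch below is only a plain-Python stand-in, not mrjob code: pre_filter is a hypothetical helper, and it uses a simple substring test, which matches what grep does for a literal pattern like at.

```python
# Plain-Python stand-in for the `grep at` pre-filter (not part of mrjob).
ANIMALS = """1 dog
17 goat
1 cat
66 dog
3 cat
8 llama
14 rat
18 bat
2 bat"""


def pre_filter(lines, pattern="at"):
    """Keep only the lines that contain the pattern, like `grep at` would."""
    return [line for line in lines if pattern in line]


kept = pre_filter(ANIMALS.splitlines())
for line in kept:
    print(line)
```

Note that the test is applied to the whole line, count included, so the dog and llama lines never reach the mapper.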
Finally, let's run the job:
$ python multistep.py -r local animals.txt
Check the output; you should get something like:
...
20 "bat"
...
Why bat and not dog? Because the pre-filter only let through the animals with at
in them.
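To see where 20 "bat" comes from, the two steps can be traced by hand in plain Python. This is only a rough sketch of the data flow, assuming the animals.txt contents above. The names run_step_one and run_step_two are hypothetical and are not part of mrjob, which also handles the sorting and grouping between mapper and reducer for you.

```python
from collections import defaultdict

LINES = [
    "1 dog", "17 goat", "1 cat", "66 dog", "3 cat",
    "8 llama", "14 rat", "18 bat", "2 bat",
]


def run_step_one(lines):
    """Pre-filter, map, and sum: returns (total, animal) pairs."""
    totals = defaultdict(int)
    for line in lines:
        if "at" not in line:          # stand-in for the `grep at` pre-filter
            continue
        count, animal = line.split()  # what mapper() does
        totals[animal] += int(count)  # what reducer() does via sum()
    # reducer() emits every pair under the same key (None),
    # so the second step receives them as one group.
    return [(total, animal) for animal, total in totals.items()]


def run_step_two(pairs):
    """What reducer_max() does: pairs compare by total first."""
    return max(pairs)


pairs = run_step_one(LINES)
best = run_step_two(pairs)
print(best)  # (20, 'bat')
```

The first step leaves four totals (goat 17, cat 4, rat 14, bat 20), and the second step picks the largest of them.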