Running MRJob from an IPython notebook

Date: 2023-09-12

Question

I'm trying to run the mrjob example from an IPython notebook:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

Then I run it with this code:

mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

and I get this error:

TypeError: <module '__main__' (built-in)> is a built-in class

Is there a way to run mrjob from an IPython notebook?

Recommended answer

I haven't found the "perfect way" yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell contents to a file:

%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

Then have mrjob run that file in a later cell:

import wordcount
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

Note that I named the file wordcount.py and import the class MRWordFrequencyCount from the wordcount module; the filename and the module name have to match. Python also caches imported modules, so when you change wordcount.py, IPython will not reload the module but will keep using the old, cached one. That's why I put the reload() call in there.

Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ
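
The module caching described above can be reproduced without mrjob. The following is a minimal sketch, assuming Python 3, where the built-in reload() used in the answer has moved to importlib.reload(). The module name demo_wordcount and the VERSION variable are hypothetical stand-ins for wordcount.py being rewritten by a %%file cell.

```python
import importlib
import os
import sys
import tempfile

# Skip .pyc files so a quick rewrite of the source is never masked by stale bytecode.
sys.dont_write_bytecode = True

workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "demo_wordcount.py")  # hypothetical module name


def write_module(version):
    """Write a one-line module, standing in for a %%file cell being re-run."""
    with open(module_path, "w") as handle:
        handle.write("VERSION = %d\n" % version)


write_module(1)
sys.path.insert(0, workdir)
importlib.invalidate_caches()

import demo_wordcount
first = demo_wordcount.VERSION

write_module(2)                   # the file on disk changes ...
import demo_wordcount             # ... but this import is served from sys.modules
cached = demo_wordcount.VERSION

importlib.reload(demo_wordcount)  # re-executes the file on disk
reloaded = demo_wordcount.VERSION

print(first, cached, reloaded)    # prints: 1 1 2
```

The second import does not pick up the change; only the explicit reload does, which is why the answer calls reload(wordcount) before building the job.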

Update (shorter)
For a shorter second notebook cell, you can run the mrjob script by invoking the shell from within the notebook:

! python mrjob.py shakespeare.txt

Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
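
The ! prefix hands the rest of the line to the system shell, so the script runs as a separate process instead of inside the notebook kernel. A rough equivalent in plain Python can be sketched with subprocess. The script count_lines.py below is a hypothetical stand-in that only counts lines, so the sketch runs without mrjob installed; a real job file would define the MRJob subclass and call its run() class method under the same __main__ guard.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a job script: it counts the lines of the file
# named on the command line. It is NOT an mrjob job.
SCRIPT = '''\
import sys

def main(path):
    with open(path) as handle:
        print("lines", sum(1 for _ in handle))

if __name__ == "__main__":
    main(sys.argv[1])
'''

workdir = tempfile.mkdtemp()
script_path = os.path.join(workdir, "count_lines.py")
input_path = os.path.join(workdir, "input.txt")

with open(script_path, "w") as handle:
    handle.write(SCRIPT)
with open(input_path, "w") as handle:
    handle.write("to be\nor not to be\n")

# Roughly what "! python count_lines.py input.txt" does in a notebook cell.
result = subprocess.run(
    [sys.executable, script_path, input_path],
    capture_output=True,
    text=True,
    check=True,
)
output = result.stdout.strip()
print(output)
```

Two caveats when applying this to a real job: without the __main__ guard, running the file as a command does nothing, and a script literally named mrjob.py would shadow the mrjob package on import, so a different filename is safer.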
