I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I'd like to know what speed gains I could expect if I implemented the same mapper and reducer in Java (or used Pig).
In particular, I'm looking for people's experiences of migrating from streaming to custom jar deployments and/or Pig, and for documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for comparisons between Java and Python in general, but for comparisons between custom jar deployment in Hadoop and Python-based streaming.
My job reads NGram counts from the Google Books Ngram dataset and computes aggregate measures. CPU utilization on the compute nodes seems to be close to 100%. (I would also like to hear your opinions on the differences between a CPU-bound and an IO-bound job.)
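For context, the kind of Python streaming job described here might look roughly like the sketch below. This is not my actual code: it assumes each input line is tab-separated with the ngram, year and match count as its first three fields, and it uses a plain per-ngram sum as a stand-in for the aggregate measures. The function names `map_lines` and `reduce_lines` are made up for illustration.

```python
from itertools import groupby


def map_lines(lines):
    """Emit 'ngram<TAB>match_count' for each well-formed input line.

    Assumed input layout: ngram, year, match_count, ... (tab-separated).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        ngram, match_count = fields[0], fields[2]
        if not match_count.isdigit():
            continue  # skip records with a non-numeric count
        yield f"{ngram}\t{match_count}"


def reduce_lines(lines):
    """Sum the counts per ngram.

    Relies on the input being sorted by key, which Hadoop Streaming
    provides between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for ngram, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{ngram}\t{total}"
```

In a real streaming job each function would sit in its own script that iterates over `sys.stdin` and prints every yielded line. Every record then crosses a pipe between the JVM and the Python process as text, which is part of the overhead I am asking about.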
Thanks!
澳大利亚
Why consider deploying custom jars?
When to use Pig?
When NOT to use Pig?
A note on IO-bound and CPU-bound jobs: