After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
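To illustrate what I mean by keeping many requests open at once, here's a minimal sketch in Node.js (18+, for the built-in fetch); the endpoint is a placeholder, not the real service:

```
// Minimal sketch: all requests go out at once, so total wall-clock time
// is roughly one round trip, not round-trips * number of requests.
// The endpoint below is a placeholder, not the actual 3rd party API.
async function fetchMany(ids) {
  return Promise.all(
    ids.map(id =>
      fetch(`https://api.example.com/items/${id}`).then(res => res.json())
    )
  );
}
```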
Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:
OK, now into the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users they follow. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
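Concretely, a record looks something like this (the field names are illustrative only, not my real schema):

```
// Illustrative record shape only -- field names are made up.
const user = {
  id: 42,
  follows: [7, 19, 23],       // ids of users this user follows
  favorites: [101, 205, 300], // ids of this user's favorite items
  queue: [],                  // items shown when this user loads the page
};
```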
There are two calls I can make: one to get the list of users a given user follows, and one to get a given user's favorite items.
After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks something like the sketch below:
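In code terms (a sketch only; getFollowedUsers, getFavorites, processQueue, and the endpoint paths are hypothetical stand-ins for the real API):

```
// Hypothetical wrappers around the 3rd party API.
const API = 'https://api.example.com';
const getFollowedUsers = id =>
  fetch(`${API}/users/${id}/followed`).then(res => res.json());
const getFavorites = id =>
  fetch(`${API}/users/${id}/favorites`).then(res => res.json());

// Placeholder for the actual queue-crunching step.
function processQueue(userId, favoriteLists) {
  return favoriteLists.flat();
}

async function updateUser(userId) {
  // One call to learn who this user follows...
  const followed = await getFollowedUsers(userId);

  // ...then one call per followed user, all open simultaneously.
  const favoriteLists = await Promise.all(
    followed.map(u => getFavorites(u.id))
  );

  // Only now, with every favorites response in hand,
  // can the queue for the original user be processed.
  return processQueue(userId, favoriteLists);
}
```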
The jobs in this flow include: fetching the followed users for the user being updated, fetching the favorites for each followed user, and processing the queue once all of those favorites have come back.
So, my main question is: should I learn/use MapReduce, or some other type of parallelization, for this task?
Thanks for reading, I'm looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel that MapReduce is better suited to distributing/parallelizing a computing load, whereas I'm really just looking to parallelize HTTP requests.
What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
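Something in this spirit, roughly (the table and column names are invented for illustration, and `db.query` stands in for whatever SQL client ends up being used):

```
// Sketch only: schema is hypothetical. One set-based statement per user,
// rather than crunching rows in application code.
async function rebuildQueue(db, userId) {
  // Clear the old queue, then rebuild it from followed users' favorites.
  await db.query('DELETE FROM queue WHERE user_id = $1', [userId]);
  await db.query(
    `INSERT INTO queue (user_id, item_id)
     SELECT DISTINCT $1, fav.item_id
       FROM follows fo
       JOIN favorites fav ON fav.user_id = fo.followed_id
      WHERE fo.follower_id = $1`,
    [userId]
  );
}
```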
So, I'm leaning towards:
Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stub of the code, and now it's just a matter of filling out the code to hook into the right APIs.
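For the curious, the stub looks roughly like this (getFollowedUsers, getFavorites, processQueue, and userId are placeholders; the chaining follows the .seq/.parEach style from Seq's documentation):

```
var Seq = require('seq');

// getFollowedUsers / getFavorites are placeholder wrappers around the
// 3rd party API; each takes a node-style callback (err, result).
Seq()
    .seq(function () {
        // one call for the follow list; `this` is Seq's continuation
        getFollowedUsers(userId, this);
    })
    .flatten()
    .parEach(function (followed) {
        // one request per followed user, all open at once;
        // Seq waits until every callback has fired before moving on
        getFavorites(followed.id, this.into(followed.id));
    })
    .seq(function () {
        // this.vars now maps followed-user id -> their favorites
        processQueue(userId, this.vars);
        this();
    });
```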
Thanks for the answers, they were a lot of help finding the solution I was looking for.