Is it possible to run another spider from a Scrapy spider?

Date: 2023-05-26


Question


For now I have two spiders. What I would like to do is:

Spider 1 goes to url1 and, if url2 appears, calls spider 2 with url2. It also saves the content of url1 through a pipeline.
Spider 2 goes to url2 and does something.

Due to the complexity of both spiders, I would like to keep them separate.
My attempt, run with scrapy crawl:
def parse(self, response):
    p = multiprocessing.Process(
        target=self.testfunc())
    p.join()
    p.start()

def testfunc(self):
    settings = get_project_settings()
    crawler = CrawlerRunner(settings)
    crawler.crawl(<spidername>, <arguments>)

It does load the settings, but it doesn't crawl:
2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
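Independent of Scrapy, two details of the parse attempt above are worth noting. `target=self.testfunc()` calls the method immediately in the parent process and hands its return value (None) to `Process`, and `p.join()` is called before `p.start()`, which multiprocessing rejects because a process that has not been started cannot be joined. The sketch below shows only the intended ordering. `do_work` and `run_in_child` are hypothetical names, the queue message stands in for whatever the child would really do, and pinning the fork start method is an assumption that holds only on Unix-like systems.

```python
import multiprocessing


def do_work(queue):
    # Runs in the child process; a stand-in for launching the second crawl.
    queue.put("work done in child")


def run_in_child():
    # Assumption: a Unix-like system where the "fork" start method exists.
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    # Pass the callable itself (do_work), not the result of calling it.
    p = ctx.Process(target=do_work, args=(queue,))
    p.start()                        # start the child first ...
    result = queue.get(timeout=30)   # ... read its message ...
    p.join()                         # ... then wait for it to exit
    return result


if __name__ == "__main__":
    print(run_in_child())
```

Fixing the ordering alone does not make this a recommended way to launch a crawl from inside a running spider; it only removes the two errors in the multiprocessing calls.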

The documentation has an example of launching a crawl from a script, but what I'm trying to do is launch another spider while using the scrapy crawl command.
Full code:
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os


def info(title):
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())


class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()


class MyClass(object):

    def start_work(self):
        info('start_work')
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):

        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return

class TestSpider2(scrapy.Spider):

    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return

I hope it would work something like this:
scrapy crawl test1 (for example, when response.status_code is 200)
in test1, call scrapy crawl test2

Recommended answer
I won't go into depth since this question is really old, but I'll go ahead and drop in this snippet from the official Scrapy docs. You are very close!
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

https://doc.scrapy.org/en/latest/topics/practices.html
Then, using callbacks, you can pass items between your spiders and run whatever logic you are talking about.
