I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is why Nutch is eliminated, for example...
Does anybody know whether such a web crawler exists, and if the answer is yes, where can I find it? Tnx...
What you're asking for is two components: a web crawler and a Lucene-based indexer.
First a word of encouragement: been there, done that. I'll tackle both of the components individually from the point of view of making your own, since I don't believe that you could use Lucene to do what you've requested without really understanding what's going on underneath.
So you have a web site/directory you want to "crawl" through to collect specific resources. Assuming that it's any common web server which lists directory contents, making a web crawler is easy: Just point it to the root of the directory and define rules for collecting the actual files, such as "ends with .txt". Very simple stuff, really.
The actual implementation could be something like so: use HttpClient to get the actual web pages/directory listings, then parse them in whatever way you find most efficient, such as using XPath to select all the links from the fetched document, or just parsing it with regex using Java's readily available Pattern and Matcher classes. If you decide to go the XPath route, consider using JDOM for DOM handling and Jaxen for the actual XPath.
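To make that concrete, here is a minimal sketch of the fetch-and-extract step, assuming Apache HttpClient 4.x; the class name and the naive href regex are my own illustration, while the ".txt" rule echoes the example above:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class LinkExtractor {

        // Naive href matcher; fine for simple directory listings, not full HTML parsing.
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

        public static List<String> extractLinks(String url) throws Exception {
            HttpClient client = new DefaultHttpClient();
            HttpResponse response = client.execute(new HttpGet(url));
            String html = EntityUtils.toString(response.getEntity());

            List<String> links = new ArrayList<String>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                // Example rule from above: only collect resources ending with .txt
                if (link.endsWith(".txt")) {
                    links.add(link);
                }
            }
            return links;
        }
    }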
Once you get the actual resources you want, such as a bunch of text files, you need to identify the type of data to be able to know what to index and what you can safely ignore. For simplicity's sake I'm assuming these are plaintext files with no fields or anything and won't go deeper into that, but if you have multiple fields to store, I suggest you make your crawler produce 1..n specialized beans with accessors and mutators (bonus points: make the bean immutable, don't allow accessors to mutate the internal state of the bean, and create a copy constructor for the bean) to be used in the other component.
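A sketch of such a bean, with hypothetical names (the answer doesn't prescribe any): the fields are final, there are no setters, and a copy constructor is provided:

    // Hypothetical bean produced by the crawler; immutable, with a copy constructor.
    public final class CrawledDocument {

        private final String url;
        private final String content;

        public CrawledDocument(String url, String content) {
            this.url = url;
            this.content = content;
        }

        // Copy constructor
        public CrawledDocument(CrawledDocument other) {
            this(other.url, other.content);
        }

        public String getUrl() {
            return url;
        }

        public String getContent() {
            return content;
        }
    }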
In terms of API calls, you should have something like HttpCrawler#getDocuments(String url) which returns a List<YourBean> to use in conjunction with the actual indexer.
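Expressed as an interface, that could look roughly like the following sketch, reusing the hypothetical CrawledDocument bean from above in place of YourBean; the checked IOException is my own choice, not something the answer specifies:

    import java.io.IOException;
    import java.util.List;

    // Sketch of the crawler-side API suggested above.
    public interface HttpCrawler {

        /** Fetches the given URL and returns one bean per resource worth indexing. */
        List<CrawledDocument> getDocuments(String url) throws IOException;
    }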
Beyond the obvious stuff with Lucene, such as setting up a directory and understanding its threading model (only one write operation is allowed at any time; multiple reads can exist even when the index is being updated), you of course want to feed your beans to the index. The five minute tutorial I already linked to basically does exactly that: look into the example addDoc(..) method and just replace the String with YourBean.
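A sketch of what that replacement might look like, assuming Lucene 3.x-era APIs and the hypothetical CrawledDocument bean from above; the field names are made up for the example:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BeanIndexer {

        // Same idea as addDoc(..) in the five minute tutorial, just mapping
        // bean fields instead of a single String.
        public static void addDoc(IndexWriter writer, CrawledDocument bean) throws Exception {
            Document doc = new Document();
            doc.add(new Field("url", bean.getUrl(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("content", bean.getContent(), Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
    }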
Note that Lucene's IndexWriter does have some cleanup methods which are handy to execute in a controlled manner. For example, calling IndexWriter#commit() only after a bunch of documents have been added to the index is good for performance, and then calling IndexWriter#optimize() to make sure the index isn't getting hugely bloated over time is a good idea too. Always remember to close the index too, to avoid unnecessary LockObtainFailedExceptions being thrown; as with all IO in Java, such an operation should of course be done in the finally block.
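Staying with the Lucene 3.x assumption (IndexWriter#optimize() no longer exists in later major versions), the controlled lifecycle described above might be tied together like this; the batch size is an arbitrary illustration value:

    import java.io.File;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexerMain {

        private static final int BATCH_SIZE = 1000; // arbitrary illustration value

        public static void indexAll(File indexDir, List<CrawledDocument> beans) throws Exception {
            Directory directory = FSDirectory.open(indexDir);
            IndexWriterConfig config =
                    new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(directory, config);
            int added = 0;
            try {
                for (CrawledDocument bean : beans) {
                    BeanIndexer.addDoc(writer, bean);
                    if (++added % BATCH_SIZE == 0) {
                        writer.commit();   // commit a whole batch at once, not every document
                    }
                }
                writer.commit();
                writer.optimize();         // keep the index from bloating over time
            } finally {
                writer.close();            // releases the write lock; avoids stray LockObtainFailedExceptions
            }
        }
    }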
One querying pitfall worth knowing about: a range query such as [0 to 5] actually gets transformed into +0 +1 +2 +3 +4 +5, which means the range query dies out very quickly because there's a maximum number of query sub-parts.
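That expansion is easy to reproduce in a sketch, again assuming Lucene 3.x APIs: forcing the term-by-term boolean rewrite of a wide range trips BooleanQuery's default clause limit (1024) and throws TooManyClauses:

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.TermRangeQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class RangeQueryBlowup {

        public static void main(String[] args) throws Exception {
            // Index 2000 distinct terms so a range over them needs more clauses
            // than BooleanQuery allows by default (1024).
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_36, new KeywordAnalyzer()));
            for (int i = 0; i < 2000; i++) {
                Document doc = new Document();
                doc.add(new Field("id", String.format("%05d", i),
                        Field.Store.NO, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();

            TermRangeQuery query = new TermRangeQuery("id", "00000", "01999", true, true);
            // Force the term-by-term BooleanQuery rewrite described above.
            query.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

            IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
            try {
                searcher.search(query, 10);
            } catch (BooleanQuery.TooManyClauses e) {
                System.out.println("Range expanded past "
                        + BooleanQuery.getMaxClauseCount() + " clauses");
            } finally {
                searcher.close();
            }
        }
    }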
With this information I do believe you could make your own special Lucene indexer in less than a day, three if you want to test it rigorously.