I have two questions about the hit highlighter provided with Apache Lucene:
See this function: could you explain what the token stream parameter is used for?
I have several large Lucene documents, each containing many fields, and each field holds some strings. I have found the most relevant document for a particular query. That document was found because several words in the query may have matched words in the document, and I want to find out which words in the query caused the match. For this I plan to use the Lucene hit highlighter. Example: if the query is "skin doctor delhi" and the document titled "dermatologist" contains the words "skin" and "doctor", then after hit highlighting I should be able to pick out "skin" and "doctor" from the query. I have been trying to write the code for this for several weeks now and have not been able to get what I want. Could you help me, please?
Thanks in advance.
Update:
Current approach: I create a query containing all the words in the document.
Field[] field = doc.getFields("description");
String desc = "";
for (int j = 0; j < field.length; ++j) {
desc += field[j].stringValue() + " ";
}
Query q = qp.parse(desc);
QueryScorer scorer = new QueryScorer(q, reader, "description");
Highlighter highlighter = new Highlighter(scorer);
String fragment = highlighter.getBestFragment(analyzer, "description", text);
It works for small documents but fails for large ones, with the following stack trace:
org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:152)
at org.apache.lucene.queryParser.QueryParser.getBooleanQuery(QueryParser.java:891)
at org.apache.lucene.queryParser.QueryParser.getBooleanQuery(QueryParser.java:866)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1213)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1167)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:182)
This approach clearly does not scale to large documents. What should I do to fix this?
By the way, I am using FuzzyQuery matching.
Added some details about explain().
Some general introduction: the Lucene Highlighter is meant to find text snippets in a hit document and to highlight the tokens that match the query.
Explanation expl = searcher.explain(query, docId);
String asText = expl.toString();
String asHtml = expl.toHtml();
docId is the raw document ID from the search results.
You should use the Highlighter only if you actually need the snippets and/or highlights. If you still want to use it, follow Nicholas Hrychan's advice. Beware, though, that he describes the Lucene 2.4.1 API: if you use a later version, you should use "QueryScorer" where he says "SpanScorer".
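To make the goal from the question concrete, here is a minimal plain-Java sketch of the underlying idea: tokenize the query and the field text, then keep the query terms that also occur in the field. It does not use Lucene at all. The class name MatchedQueryTerms, both method names, and the sample description text are hypothetical. The tokenizer is only a rough stand-in for a real Analyzer, and fuzzy matching as done by FuzzyQuery is not handled.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: reports which query words occur in a field's text.
public class MatchedQueryTerms {

    // Lowercase the text and split on anything that is not a letter or digit.
    // Assumption: this only approximates what a real Analyzer would produce.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Return the query terms, in query order and without duplicates,
    // that appear as exact tokens in the field text.
    static List<String> matchedTerms(String query, String fieldText) {
        Set<String> docTerms = new HashSet<>(tokenize(fieldText));
        List<String> matched = new ArrayList<>();
        for (String term : tokenize(query)) {
            if (docTerms.contains(term) && !matched.contains(term)) {
                matched.add(term);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String query = "skin doctor delhi";
        // Made-up sample text standing in for the "description" field.
        String description = "A dermatologist is a doctor who treats skin conditions.";
        System.out.println(matchedTerms(query, description)); // prints [skin, doctor]
    }
}
```

Because this compares only the original query words against one document's text, it never builds a query out of the whole document, so the number of terms it handles stays as small as the query itself.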