Very slow highlight performance in Lucene

Date: 2023-09-29

Problem description

The Lucene (4.6) highlighter performs very slowly when a frequent term is searched. The search itself is fast (100 ms), but highlighting may take more than an hour (!).

Details: a large text corpus was used (1.5 GB of plain text). Performance does not depend on whether the text is split into smaller pieces (tested with 500 MB and 5 MB pieces as well). Positions and offsets are stored. If a very frequent term or pattern is searched, the TopDocs are retrieved quickly (100 ms), but each "searcher.doc(id)" call is expensive (5-50 s), and getBestFragments() is extremely expensive (more than 1 hour), even though the fields are stored and indexed for this purpose. (Hardware: Core i7, 8 GB of memory.)

Broader background: this is meant to support linguistic analysis research. A special stemmer is used that also stores part-of-speech information. For example, if "adj adj adj adj noun" is searched, it returns all occurrences of that pattern in the text, with context.
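To make the intended kind of query concrete, here is a minimal in-memory sketch of a part-of-speech pattern search with context. It is only an illustration under assumptions: the class PosPatternSearch, the TaggedToken type, the tag names, and the sample sentence are all hypothetical, and it does not use Lucene or reproduce the stemmer and index described in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not the asker's stemmer or index layout.
final class PosPatternSearch {

    /** A word paired with its part-of-speech tag (tag names are assumed). */
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /**
     * Returns every occurrence of the tag sequence in the token list.
     * Each hit is rendered with up to `context` words on either side,
     * and the matched words are wrapped in square brackets.
     */
    static List<String> find(List<TaggedToken> tokens, List<String> pattern, int context) {
        List<String> hits = new ArrayList<>();
        if (pattern.isEmpty()) {
            return hits;
        }
        for (int start = 0; start + pattern.size() <= tokens.size(); start++) {
            boolean match = true;
            for (int j = 0; j < pattern.size(); j++) {
                if (!tokens.get(start + j).tag.equals(pattern.get(j))) {
                    match = false;
                    break;
                }
            }
            if (!match) {
                continue;
            }
            int end = start + pattern.size();
            int from = Math.max(0, start - context);
            int to = Math.min(tokens.size(), end + context);
            StringBuilder sb = new StringBuilder();
            for (int k = from; k < to; k++) {
                if (k > from) {
                    sb.append(' ');
                }
                String word = tokens.get(k).word;
                sb.append(k >= start && k < end ? "[" + word + "]" : word);
            }
            hits.add(sb.toString());
        }
        return hits;
    }

    /** A tiny hand-tagged sample sentence (made up for this sketch). */
    static List<TaggedToken> sample() {
        return Arrays.asList(
                new TaggedToken("the", "det"),
                new TaggedToken("big", "adj"),
                new TaggedToken("old", "adj"),
                new TaggedToken("red", "adj"),
                new TaggedToken("wooden", "adj"),
                new TaggedToken("house", "noun"),
                new TaggedToken("stood", "verb"),
                new TaggedToken("empty", "adj"));
    }

    public static void main(String[] args) {
        List<String> pattern = Arrays.asList("adj", "adj", "adj", "adj", "noun");
        for (String hit : find(sample(), pattern, 1)) {
            System.out.println(hit);
        }
    }
}
```

A linear scan like this visits every token, so it only shows the expected output shape (match plus context); the question is about getting the same kind of result from an index over 1.5 GB of text.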

Can I tune its performance, or should I choose another tool?

Code used:

            //indexing
            FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
            offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

            offsetsType.setStored(true);
            offsetsType.setIndexed(true);
            offsetsType.setStoreTermVectors(true);
            offsetsType.setStoreTermVectorOffsets(true);
            offsetsType.setStoreTermVectorPositions(true);
            offsetsType.setStoreTermVectorPayloads(true);


            doc.add(new Field("content", fileContent, offsetsType));


            // querying
            TopDocs results = searcher.search(query, limitStart+limit);

            int endPos = Math.min(results.scoreDocs.length, limitStart+limit);
            int startPos = Math.min(results.scoreDocs.length, limitStart);

            for (int i = startPos; i < endPos; i++) {
                int id = results.scoreDocs[i].doc;

                // bottleneck #1 (5-50s):
                Document doc = searcher.doc(id);

                FastVectorHighlighter h = new FastVectorHighlighter();

                // bottleneck #2 (more than 1 hour):
                // (m is not defined in this snippet; presumably it is the IndexReader)
                String[] hs = h.getBestFragments(h.getFieldQuery(query), m, id, "content", contextSize, 10000);
            }

Related (unanswered) question: https://stackoverflow.com/questions/19416804/very-slow-solr-performance-when-highlighting
