Using a Lucene Analyzer without an index - my approach

Date: 2023-09-29

Problem description

My objective is to leverage some of Lucene's many tokenizers and filters to transform input text, but without the creation of any indexes.

For example, given this (contrived) input string...

" Someone's - [texté] goes here, foo ."

...and a Lucene analyzer like this...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();

I want to get the following output:

someone's texte goes here foo
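
To make the target transformation concrete (split into words, lowercase, strip diacritics), here is a rough JDK-only approximation. It is not Lucene and not equivalent to the ICU tokenizer or `icuFolding`; the class name `FoldingSketch`, the word pattern, and the choice to fold before tokenizing are all assumptions made for illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: approximates "tokenize + lowercase + fold accents"
// with the JDK only. It does NOT reproduce ICU word-break rules.
class FoldingSketch {
    // Assumed word shape: letters/digits, optionally joined by ASCII apostrophes.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'[\\p{L}\\p{N}]+)*");
    // Combining marks left behind after canonical decomposition (NFD).
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String transform(String input) {
        // Decompose accented characters, then drop the combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String folded = MARKS.matcher(decomposed).replaceAll("")
                .toLowerCase(Locale.ROOT);

        // Collect the word-like runs and join them with single spaces.
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(folded);
        while (m.find()) {
            tokens.add(m.group());
        }
        return String.join(" ", tokens);
    }
}
```

With this sketch, `FoldingSketch.transform("Crème Brûlée, PLEASE!")` returns `creme brulee please`. Typographic apostrophes and the many other foldings that ICU performs are not handled, which is one reason to prefer the analyzer chain.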

The below Java method does what I want.

But is there a better (i.e. more typical and/or more concise) way for me to do this?

I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.

The code is as follows:

Lucene 8.3.0 imports:

// JDK imports also needed by the method below
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

My method:

private String transform(String input) throws IOException {

    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);

    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}
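
The reset / incrementToken / end / close sequence used above is the consumer workflow documented for `TokenStream`, so the loop is already close to the typical shape. A slightly tighter variant is sketched below: it takes the analyzer as a parameter so it can be built once and reused, uses the `tokenStream(String, String)` overload instead of a `StringReader`, and lets try-with-resources close the stream. The names `AnalyzerText` and `analyze` are hypothetical, and this is one possible arrangement rather than a definitive answer.

```java
import java.io.IOException;
import java.util.StringJoiner;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical utility class; not part of Lucene.
final class AnalyzerText {
    private AnalyzerText() {}

    // Runs the input through the analyzer and joins the resulting terms
    // with single spaces. The field name only matters for analyzers that
    // behave differently per field.
    static String analyze(Analyzer analyzer, String field, String input)
            throws IOException {
        StringJoiner joined = new StringJoiner(" ");
        // try-with-resources calls ts.close() even if an exception is thrown.
        try (TokenStream ts = analyzer.tokenStream(field, input)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before the first incrementToken()
            while (ts.incrementToken()) {
                joined.add(term.toString()); // copy: the attribute is reused per token
            }
            ts.end();                        // finalizes end-of-stream state
        }
        return joined.toString();
    }
}
```

A caller would build the `CustomAnalyzer` from the question once (for example in a static field) and pass it in, as in `AnalyzerText.analyze(analyzer, "myField", input)`. Using `StringJoiner` removes the trailing-space-then-`trim()` step. The `"icu"` and `"icuFolding"` names assume that the Lucene ICU analysis module is on the classpath alongside the common analyzers module.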
