      Tokenizing and removing stop words with Lucene and Java

        Date: 2023-09-29


                  Question

                  I am trying to tokenize a txt file and remove stop words from it with Lucene. I have this:

                  public String removeStopWords(String string) throws IOException {
                  
                      Set<String> stopWords = new HashSet<String>();
                      stopWords.add("a");
                      stopWords.add("an");
                      stopWords.add("I");
                      stopWords.add("the");
                  
                      TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
                      tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
                  
                      StringBuilder sb = new StringBuilder();
                  
                      CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
                      while (tokenStream.incrementToken()) {
                          if (sb.length() > 0) {
                              sb.append(" ");
                          }
                          sb.append(token.toString());
                          System.out.println(sb);
                      }
                      return sb.toString();
                  }}
                  

                  My main looks like this:

                      String file = "..../datatest.txt";
                  
                      TestFileReader fr = new TestFileReader();
                      fr.imports(file);
                      System.out.println(fr.content);
                  
                      String text = fr.content;
                  
                      Stopwords stopwords = new Stopwords();
                      stopwords.removeStopWords(text);
                      System.out.println(stopwords.removeStopWords(text));
                  

                  This gives me an error, but I can't figure out why.

                  Recommended answer

                  I had the same problem. To remove stop words with Lucene, you can either use its default stop set, available via EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop-word list.

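                  The code below shows a corrected version of your removeStopWords(). Compared with the code in the question, it stores the stop words in a CharArraySet, which is the type the Lucene 4.x StopFilter constructor expects, and it calls tokenStream.reset() before the first incrementToken(), as the TokenStream contract requires: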

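                  import java.io.StringReader;
                  import org.apache.lucene.analysis.TokenStream;
                  import org.apache.lucene.analysis.core.StopFilter;
                  import org.apache.lucene.analysis.en.EnglishAnalyzer;
                  import org.apache.lucene.analysis.standard.StandardTokenizer;
                  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
                  import org.apache.lucene.analysis.util.CharArraySet;
                  import org.apache.lucene.util.Version;

                  public static String removeStopWords(String textFile) throws Exception {
                      // Lucene's built-in English stop-word set
                      CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
                      TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
                      tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

                      StringBuilder sb = new StringBuilder();
                      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
                      tokenStream.reset(); // must be called before the first incrementToken()
                      while (tokenStream.incrementToken()) {
                          sb.append(charTermAttribute.toString()).append(' ');
                      }
                      tokenStream.end();   // finish end-of-stream processing
                      tokenStream.close(); // release the underlying reader
                      return sb.toString();
                  }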
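
                  For readers who want to see the tokenize-then-filter idea without the Lucene dependency, here is a minimal plain-Java sketch. It is not how StandardTokenizer or StopFilter are implemented. The class name StopWordSketch, the two-argument removeStopWords helper, the split-on-non-alphanumerics rule and the case-insensitive comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Dependency-free illustration only; NOT Lucene's tokenizer or stop filter.
final class StopWordSketch {

    // Hypothetical helper: splits the text on runs of characters that are
    // neither letters nor digits, then drops every token whose lowercase
    // form appears in stopWords (stopWords is assumed to be lowercase).
    static String removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator or empty input yields an empty token
            }
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "i", "the"));
        // prints "saw quick brown fox"
        System.out.println(removeStopWords("I saw the quick brown fox", stopWords));
    }
}
```

                  Joining the kept tokens with String.join avoids the trailing space that appending term + " " in a loop leaves behind.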
                  

                  To use a custom list of stop words, use the following:

                  // CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // Lucene's default set
                  final List<String> stop_Words = Arrays.asList("fox", "the");
                  // The third argument (true) is ignoreCase: entries match regardless of case
                  final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
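
                  The third constructor argument above (true) is CharArraySet's ignoreCase flag, so "The" and "FOX" match the entries "the" and "fox". The plain-Java sketch below imitates that lookup behavior with a TreeSet ordered by String.CASE_INSENSITIVE_ORDER. It is an analogy, not Lucene's implementation, and the class name IgnoreCaseStopSetSketch and the helper caseInsensitiveSet are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Analogy for a case-insensitive stop set; NOT Lucene's CharArraySet.
final class IgnoreCaseStopSetSketch {

    // Hypothetical helper: builds a set whose contains() ignores case,
    // because TreeSet membership is decided by the supplied comparator.
    static Set<String> caseInsensitiveSet(String... words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(words));
        return set;
    }

    public static void main(String[] args) {
        Set<String> stopSet = caseInsensitiveSet("fox", "the");
        System.out.println(stopSet.contains("The")); // true
        System.out.println(stopSet.contains("FOX")); // true
        System.out.println(stopSet.contains("dog")); // false
    }
}
```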
                  
