Dear Stack Overflow community,
Given some text, I want to find the 50 most frequent words in it and build a tag cloud from them, so that the gist of the text is shown graphically.
The text is actually a set of about 100 comments per item (each item is a picture), and there are about 120 items. I also want to keep the cloud updated, by keeping the comments indexed and running the cloud generation code each time a new web request comes in.
I settled on using Solr to index the text, and am now wondering how to get the top 50 words out of Solr's TermVectorComponent. Here is an example of the results returned by the term vector component after you turn on term frequency with tv.tf="true":
<lst name="doc-5">
  <str name="uniqueKey">MA147LL/A</str>
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="earbud"><tf>3</tf></lst>
    <lst name="headphon"><tf>10</tf></lst>
    <lst name="usb"><tf>11</tf></lst>
  </lst>
</lst>
<lst name="doc-9">
  <str name="uniqueKey">3007WFP</str>
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="usb"><tf>4</tf></lst>
  </lst>
</lst>
As you can see, I have 2 problems:
Is there a better way? Or can I tell the Solr term vector component to somehow sort the terms and pick only 100 for me? Or is there some other framework I can use? I need to keep new comments indexed as they arrive, so that the tag cloud is always up to date. As for the cloud generator, it takes a dictionary of weighted words and turns it into a nice image.
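If the component cannot sort or limit on the server side, a client-side fallback could be sketched as follows. This is only a rough illustration in Python: it parses the sample response exactly as shown above, sums the tf values per term across documents, and keeps the highest-weighted terms as the dictionary the cloud generator expects. The `top_terms` helper and the `<response>` wrapper are hypothetical, and a real Solr response may nest the tf value differently than this abbreviated sample, so the lookup would need adjusting.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample copied from the response shown above. The <response> wrapper is an
# assumption, added only so the fragment parses as a single XML document.
SAMPLE = """<response>
<lst name="doc-5">
  <str name="uniqueKey">MA147LL/A</str>
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="earbud"><tf>3</tf></lst>
    <lst name="headphon"><tf>10</tf></lst>
    <lst name="usb"><tf>11</tf></lst>
  </lst>
</lst>
<lst name="doc-9">
  <str name="uniqueKey">3007WFP</str>
  <lst name="includes">
    <lst name="cabl"><tf>5</tf></lst>
    <lst name="usb"><tf>4</tf></lst>
  </lst>
</lst>
</response>"""


def top_terms(xml_text, field, limit=50):
    """Sum the <tf> values per term across all documents, highest first."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    # Each document is a <lst name="doc-N">; the field is a nested <lst>.
    for doc in root.findall("lst"):
        for field_node in doc.findall("lst"):
            if field_node.get("name") != field:
                continue
            for term_node in field_node.findall("lst"):
                tf = term_node.findtext("tf")
                if tf is not None:
                    totals[term_node.get("name")] += int(tf)
    return totals.most_common(limit)


# Dictionary of weighted words, ready to hand to a cloud generator.
weights = dict(top_terms(SAMPLE, "includes", limit=3))
print(weights)
```

The obvious drawback is that every term of every document travels over the wire before the cut to the top N happens, which is why a server-side sort and limit would still be preferable.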
This answer does not help.
EDIT - trying out jpountz's and Paige Cook's answers
Here is the result I got for this query:
select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true&facet.field=Post_Content&facet.minCount=1&facet.limit=50
<int name="also">1</int>
<int name="ani">1</int>
<int name="anoth">1</int>
<int name="atleast">1</int>
<int name="base">1</int>
<int name="bcd">1</int>
<int name="becaus">1</int>
<int name="better">1</int>
<int name="bigger">1</int>
<int name="bio">1</int>
<int name="boot">1</int>
<int name="bootabl">1</int>
<int name="bootload">1</int>
<int name="bootscreen">1</int>
I got 50 such elements. @jpountz, thanks for helping me limit the results, but why do all fifty of the individual <int> elements hold the value 1? My guess is that the number 1 is the count of documents matching my query (which can only be one, since I queried by Id:Guid), and does not represent the frequency of the words in Post_Content.
To check this, I removed the Id:GUID clause from the query, and the result was:
<int name="content">33</int>
<int name="can">17</int>
<int name="on">16</int>
<int name="so">16</int>
<int name="some">16</int>
<int name="all">15</int>
<int name="i">15</int>
<int name="do">14</int>
<int name="have">14</int>
<int name="my">14</int>
My problem is how to get the term frequency within a document, not the document frequency of many terms. For example, I know for a fact that "bootable" is a word I used 6 times in Post_Content, so I want sorted pairs like (6, "bootable"), (5, "disc") for a set of documents.
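To make the distinction concrete, here is a small Python sketch with made-up token lists (the `posts` data and both helper functions are hypothetical, not anything Solr provides). Document frequency counts each term at most once per document, which is why a query that matches a single document can only ever report 1, while term frequency counts every occurrence.

```python
from collections import Counter

# Hypothetical, already-analyzed posts: one list of tokens per document.
posts = [
    ["bootable", "disc", "bootable", "usb"],
    ["bootable", "disc"],
    ["usb", "bootable"],
]


def document_frequency(docs):
    """Count, per term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() collapses repeats within a document
    return df


def term_frequency(docs):
    """Count, per term, every occurrence across all documents."""
    tf = Counter()
    for tokens in docs:
        tf.update(tokens)
    return tf


df = document_frequency(posts)
tf = term_frequency(posts)

# Sorted (count, term) pairs, highest count first, as asked for above.
pairs = sorted(((count, term) for term, count in tf.items()), reverse=True)
print(pairs)
```

With a single document, `document_frequency` reports 1 for every term no matter how often it occurs, which matches the faceting result above.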
I have come up with a stopgap solution (I'm calling each Solr document a "post" for the sake of the example):
Solr has a terms component, whose purpose seems to be to expose all the indexed terms of any given field. It is mainly used to implement auto-complete and other features that operate at the term level. By default its output is sorted by frequency, so the more frequently occurring terms in the field come first.
What I have done is create a dynamic field called content_ and index each post-set in its own field, based on category. This means there will be hundreds of instances of the dynamic field, each containing one post-set, and I can use the terms component on that field to get the top terms for that post-set.
As a picture:
content_postSetOne: contains the indexed version of a set of posts
content_postSetTwo: contains the indexed version of another set of posts
content_postSetThree: contains the indexed version of a third set of posts
This solution sort of works for me, and you could just as easily create a field per post if needed. I'm also interested in the implications of using dynamic fields like this: will it be a problem?
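For reference, the per-field terms request described above could be built roughly like this. It is a sketch under assumptions: the base URL, the `/terms` handler path, and the `terms_url` helper are placeholders for whatever the actual Solr setup exposes, while `terms.fl`, `terms.limit`, and `terms.sort` are the terms component's own parameters.

```python
from urllib.parse import urlencode

# Assumed base URL; the /terms handler path below is also an assumption
# and depends on how the request handler is configured.
SOLR_BASE = "http://localhost:8983/solr"


def terms_url(post_set, limit=50, base=SOLR_BASE):
    """Build a terms-component request for one content_<post_set> field."""
    params = {
        "terms": "true",
        "terms.fl": "content_" + post_set,  # the dynamic field to read
        "terms.limit": limit,               # return only the top N terms
        "terms.sort": "count",              # most frequent terms first
    }
    return base + "/terms?" + urlencode(params)


print(terms_url("postSetOne"))
```

It may be worth confirming on a term with a known count whether the number reported for it is an occurrence count or a document count.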
How this differs from Paige's and jpountz's answers: