我正在检查 4 个相同的数据框列中是否有类似的结果(模糊匹配),并且我有以下代码作为示例.当我将它应用到真正的 40.000 行 x 4 列数据集时,它会一直在 eternum 中运行.问题是代码太慢了.例如,如果我将数据集限制为 10 个用户,计算需要 8 分钟,而计算需要 20、19 分钟.有什么我想念的吗?我不知道为什么要花那么长时间.我希望在 2 小时或更短的时间内获得所有结果.任何提示或帮助将不胜感激.
I'm checking if there are similar results (fuzzy match) in 4 same dataframe columns, and I have the following code, as an example. When I apply it to the real 40.000 rows x 4 columns dataset, keeps running in eternum. The issue is that the code is too slow. For example, if I limite the dataset to 10 users, it takes 8 minutes to compute, while for 20, 19 minutes. Is there anything I am missing? I do not know why this take that long. I expect to have all results, maximum in 2 hours or less. Any hint or help would be greatly appreciated.
from fuzzywuzzy import process
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
for match in ratio:
if match[1] != 100:
result.append(match)
break
print (result)
输出:[('asple', 80), ('tab', 80)]
Output: [('asple', 80), ('tab', 80)]
通过编写矢量化操作和避免循环来显着提高速度
Major speed improvements come by writing vectorized operations and avoiding loops
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']
compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']
dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]
def partial_match(x,y):
return(fuzz.ratio(x,y))
partial_match_vector = np.vectorize(partial_match)
combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]
+--------+-----+--------+------+
| Match | Key | compare | score
+--------+-----+--------+------+
| apple | 1 | asple | 80
| tb | 1 | tab | 80
+--------+-----+--------+------+
这篇关于列表性能中的Python模糊匹配字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持html5模板网!
如何在python中的感兴趣区域周围绘制一个矩形How to draw a rectangle around a region of interest in python(如何在python中的感兴趣区域周围绘制一个矩形)
如何使用 OpenCV 检测和跟踪人员?How can I detect and track people using OpenCV?(如何使用 OpenCV 检测和跟踪人员?)
如何在图像的多个矩形边界框中应用阈值?How to apply threshold within multiple rectangular bounding boxes in an image?(如何在图像的多个矩形边界框中应用阈值?)
如何下载 Coco Dataset 的特定部分?How can I download a specific part of Coco Dataset?(如何下载 Coco Dataset 的特定部分?)
根据文本方向检测图像方向角度Detect image orientation angle based on text direction(根据文本方向检测图像方向角度)
使用 Opencv 检测图像中矩形的中心和角度Detect centre and angle of rectangles in an image using Opencv(使用 Opencv 检测图像中矩形的中心和角度)