I am attempting to remove some observations in a pandas DataFrame that are almost, but not quite, 100% identical. See the frame below:
Notice how "John", "Mary", and "Wesley" each have nearly identical observations that differ in a single column. The real data set has 15 columns and 215,000+ observations. In every case I could visually verify, the pattern was the same: out of 15 columns, the other observation matched on up to 14. For the purposes of the project I have decided to remove the repeated observations (and store them in another DataFrame in case my boss asks to see them).
I have of course thought of drop_duplicates(keep='something'), but that would not work since the observations are not ENTIRELY identical. Has anyone ever encountered such an issue? Any ideas for a remedy?
What about a simple loop over subsets of the columns:
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['Mary', 45, 85000, 'DC'],
        ['Wesley', 36, 72500, 'LA'],
        ['Porter', 22, 98750, 'Seattle'],
        ['John', 45, 105500, 'DC'],
        ['Mary', 28, 85000, 'DC'],
        ['Wesley', 36, 72500, 'Boston'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'])

# Every column except 'Name' takes a turn at being ignored.
cols = df.columns.tolist()
cols.remove('Name')

for col in cols:
    # Compare rows on all columns except the one currently ignored.
    observed_cols = df.drop(col, axis=1).columns.tolist()
    # Rows that match an earlier row on those columns are dropped in place.
    df.drop_duplicates(observed_cols, keep='first', inplace=True)

print(df)
Returns:
     Name  Age  Salary     City
0    John   45   85000       DC
1  Netcha   25   48000      NYC
2    Mary   45   85000       DC
3  Wesley   36   72500       LA
4  Porter   22   98750  Seattle
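The question also asks to keep the removed rows in a separate DataFrame, which the in-place drop above discards. A minimal sketch of one way to do that, using the same loop but with `DataFrame.duplicated` to build a mask instead of dropping directly. The function name `split_near_duplicates` and its `key` parameter are made up for this illustration, and it assumes, like the loop above, that the `Name` column is never the one being ignored:

```python
import pandas as pd


def split_near_duplicates(df, key="Name"):
    """Split df into (kept, removed).

    A row goes to `removed` if it matches an earlier surviving row on
    every column except one. The `key` column is never the ignored one
    (an assumption carried over from the loop above).
    """
    kept = df.copy()
    removed_parts = []
    for col in [c for c in df.columns if c != key]:
        # Compare on all columns except the one currently ignored.
        subset = [c for c in kept.columns if c != col]
        mask = kept.duplicated(subset=subset, keep="first")
        removed_parts.append(kept[mask])
        kept = kept[~mask]
    # Restore the original row order among the removed rows.
    removed = pd.concat(removed_parts).sort_index()
    return kept, removed


df = pd.DataFrame(
    [
        ["John", 45, 85000, "DC"],
        ["Netcha", 25, 48000, "NYC"],
        ["Mary", 45, 85000, "DC"],
        ["Wesley", 36, 72500, "LA"],
        ["Porter", 22, 98750, "Seattle"],
        ["John", 45, 105500, "DC"],
        ["Mary", 28, 85000, "DC"],
        ["Wesley", 36, 72500, "Boston"],
    ],
    columns=["Name", "Age", "Salary", "City"],
)

kept, removed = split_near_duplicates(df)
print(kept)
print(removed)
```

Here `kept` holds rows 0 to 4 and `removed` holds rows 5, 6, and 7. Note that `keep="first"` makes the result depend on row order: whichever near-duplicate appears first is the one that survives, so it may be worth sorting the frame beforehand if one version of a row should be preferred.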