I am quite new to Python and to programming in general, but I am trying to run a "sliding window" calculation in Python over a tab-delimited .txt file that contains about 7 million lines. By sliding window I mean that it runs a calculation over, say, 50,000 lines, reports the result, then moves forward by, say, 10,000 lines and performs the same calculation over the next 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well when I test it on a small subset of my data. However, when I try to run the program over my entire data set it is incredibly slow (it has been running for about 40 hours now). The math is quite simple, so I don't think it should take this long.
Right now I am reading my .txt file with csv.DictReader. My code is as follows:
import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, newline='')
# The file is tab-delimited, so split fields on tabs
reader = csv.DictReader((line.replace(' ', '') for line in newfile), delimiter='\t')
I believe this builds a dictionary out of all 7 million lines at once, which I suspect is why it slows down so much on the larger file.
Since I am only interested in running my calculation over one "chunk" or "window" of data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with the next "chunk" or "window" of lines?
A collections.deque is an ordered collection of items that can be given a maximum size. Once it is full, adding an item to one end makes one fall off the other end. This means that to iterate over a "window" on your csv, you just need to keep appending rows to the deque, and it will take care of discarding the oldest ones.
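To make the "falls off the other end" behaviour concrete, here is a minimal sketch with a toy maxlen of 3 standing in for 50,000. The name `window` and the integer "rows" are illustrative only.

```python
from collections import deque

# Toy window size; the real case would use maxlen=50000
window = deque(maxlen=3)

for row in range(5):
    # Once the deque is full, each append silently drops the oldest item
    window.append(row)

print(list(window))  # [2, 3, 4]
```

Only the three most recent items are kept, so memory use is bounded by the window size rather than by the size of the file.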
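import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    reader = csv.DictReader((line.replace(" ", "") for line in csv_file), delimiter="\t")
    # initial fill: load the first full window of 50,000 rows
    for _ in range(50000):
        dq.append(next(reader))
    # repeated compute: process the window, then slide it forward by 10,000 rows
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        # the reader ran out of rows; process whatever the final window holds
        compute(dq)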
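The same idea can be wrapped in a reusable generator so the window size and step are parameters. This is only a sketch under a few assumptions: `sliding_windows` is a hypothetical helper (not part of the standard library), the sample data and its column names `pos` and `value` are made up, and summing a column stands in for the real calculation.

```python
import csv
import io
from collections import deque
from itertools import islice


def sliding_windows(rows, size, step):
    """Yield a deque holding each window of `size` rows, advancing `step` rows at a time.

    The same deque object is yielded every time, so copy it if a window
    must outlive the next iteration.
    """
    rows = iter(rows)
    window = deque(islice(rows, size), maxlen=size)
    if len(window) < size:
        # Fewer rows than one full window: yield what there is, then stop
        if window:
            yield window
        return
    yield window
    while True:
        new_rows = list(islice(rows, step))
        if not new_rows:
            return
        window.extend(new_rows)  # the oldest rows fall off the other end
        yield window
        if len(new_rows) < step:
            # Input ran out part-way through a slide; that was the last window
            return


# Small in-memory tab-delimited sample standing in for the 7-million-line file
sample = io.StringIO("pos\tvalue\n" + "".join(f"{i}\t{i * 10}\n" for i in range(10)))
reader = csv.DictReader(sample, delimiter="\t")

sums = [
    sum(int(row["value"]) for row in window)
    for window in sliding_windows(reader, size=4, step=2)
]
print(sums)  # [60, 140, 220, 300]
```

For the real file, the reader would be built from the opened file object and the call would be `sliding_windows(reader, size=50000, step=10000)`. Unlike the loop above, this version does not recompute an unchanged window when the input ends exactly on a step boundary.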