I want to map a big Fortran record (12 GB) on hard disk to a NumPy array (mapping instead of loading, to save memory).
The data stored in the Fortran record is not contiguous, because it is divided by record markers. The record structure is "marker, data, marker, data, ..., data, marker". The lengths of the data regions and of the markers are known.
The length of the data between markers is not a multiple of 4 bytes; otherwise I could map each data region to its own array.
The first marker can be skipped by setting offset in memmap; is it possible to skip the other markers as well and map the data to an array?
Apologies for any ambiguous wording, and thanks for any solution or suggestion.
Edit (May 15):
These are Fortran unformatted files. The data stored in the record is a (1024^3)*3 float32 array (12 GB).
The record layout of variable-length records larger than 2 gigabytes is shown below:
(For details see here -> the section [Record Types] -> [Variable-Length Records].)
In my case, each subrecord except the last has a length of 2147483639 bytes, and consecutive subrecords are separated by 8 bytes (as you can see in the figure above: the end marker of the previous subrecord plus the begin marker of the following one, 8 bytes in total).
We can see that the first subrecord ends with the first 3 bytes of a certain float, and the second subrecord begins with the remaining 1 byte, since 2147483639 mod 4 = 3.
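One way to handle floats split across subrecord boundaries is to memmap the whole file as raw bytes, gather the payload slices between the markers, and reassemble them. Below is a minimal sketch with made-up sizes: a subrecord length of 10 bytes instead of 2147483639, little-endian 4-byte markers that hold the subrecord length, and a hypothetical file name 'tmp_rec'; the real marker scheme of your compiler may differ.

```python
import numpy as np

SUBREC = 10                                     # payload bytes per subrecord (not a multiple of 4)
payload = np.arange(6, dtype='<f4').tobytes()   # 24 bytes of float32 data

# Write "marker, data, marker, marker, data, ..., data, marker" to disk.
with open('tmp_rec', 'wb') as f:
    for i in range(0, len(payload), SUBREC):
        part = payload[i:i + SUBREC]
        marker = np.array([len(part)], dtype='<i4').tobytes()
        f.write(marker + part + marker)

# Map the whole file as raw bytes; nothing is read into RAM yet.
raw = np.memmap('tmp_rec', dtype=np.uint8, mode='r')

# Collect the payload slices, skipping the 4-byte marker on each side.
pieces, pos = [], 0
while pos < raw.size:
    n = int(raw[pos:pos + 4].view('<i4')[0])    # leading marker = payload length
    pieces.append(raw[pos + 4:pos + 4 + n])
    pos += n + 8                                # payload plus both markers

# Concatenating copies the bytes, so a float split across a subrecord
# boundary is reassembled; the result is a normal in-memory array.
data = np.concatenate(pieces).view('<f4')
```

Note the trade-off: the slices themselves are still memory-mapped, but the final concatenation copies the selected bytes, so this step is no longer a pure map.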
I posted another answer because, for the example given here, numpy.memmap worked:
import numpy as np

offset = 0
data1 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size1,))
offset += size1*byte_size
data2 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size2,))
offset += size2*byte_size
data3 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size3,))
where byte_size = 32/8 = 4 for int32, byte_size = 16/8 = 2 for int16, and so forth.
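Rather than hard-coding the ratio, the element size in bytes can be read from the dtype itself; a small sketch:

```python
import numpy as np

# dtype.itemsize gives the element size in bytes directly,
# equivalent to the byte_size values used above.
byte_size_i32 = np.dtype('int32').itemsize   # 4 bytes
byte_size_i16 = np.dtype('int16').itemsize   # 2 bytes
```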
If the sizes are constant, you can load the data into a 2D array like:
shape = (total_length // size, size)
data = np.memmap('tmp', dtype='i', mode='r+', order='F', shape=shape)
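A tiny runnable sketch of the 2D mapping, using a made-up file 'tmp2d' of six int32 values in place of the real record:

```python
import numpy as np

# Hypothetical small file standing in for the real data.
np.arange(6, dtype='i').tofile('tmp2d')

total_length, size = 6, 3
shape = (total_length // size, size)   # (2, 3)

# With order='F' the file contents fill the array column by column.
data = np.memmap('tmp2d', dtype='i', mode='r', order='F', shape=shape)
```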
You can create and modify memmap objects as much as you want. It is even possible to make arrays that share the same elements; in that case, changes made in one are automatically reflected in the other.
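This sharing can be sketched with two overlapping maps of one small, hypothetical file 'tmp_shared'; both maps cover element index 2, so a write through one is visible through the other:

```python
import numpy as np

# Four int32 zeros as a stand-in file.
np.zeros(4, dtype='i').tofile('tmp_shared')

a = np.memmap('tmp_shared', dtype='i', mode='r+', shape=(3,))                # elements 0..2
b = np.memmap('tmp_shared', dtype='i', mode='r+', offset=2 * 4, shape=(2,))  # elements 2..3

# A write through one shared mapping is visible through the other,
# since both map the same file pages.
a[2] = 7
```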
Other references:
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
numpy.memmap documentation here.