Merging pandas dataframes based on irregular time intervals

Date: 2023-03-24


Question

I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time-stamped data points (the value column).

import pandas as pd
import numpy as np

data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
                     'value':np.random.uniform(-1,1,size=50)})

The other has time interval information (start_time, end_time, and an associated interval_id).

intervals = pd.DataFrame({'interval_id':np.arange(9),
                          'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),    
                          'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})

I'd like to merge these two dataframes more efficiently than the for loop below:

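data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    # Boolean mask of the points that fall inside this interval (bounds inclusive).
    in_interval = ((data['time'] >= ser['start_time']) &
                   (data['time'] <= ser['end_time']))
    # Use .loc rather than chained indexing so the assignment reaches `data` itself.
    data.loc[in_interval, 'interval_id'] = ser['interval_id']

# sort_values replaces DataFrame.sort, which was removed from pandas.
result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)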

I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.

Example result:

     time      value     interval_id  start_time   end_time
0    0.575976  0.022727          NaN         NaN        NaN
1    4.607545  0.222568            0    3.618715   8.294847
2    5.179350  0.438052            0    3.618715   8.294847
3   11.069956  0.641269            1   10.301728  19.870283
4   12.387854  0.344192            1   10.301728  19.870283
5   18.889691  0.582946            1   10.301728  19.870283
6   20.850469 -0.027436          NaN         NaN        NaN
7   23.199618  0.731316            2   21.488868  28.968338
8   26.631284  0.570647            2   21.488868  28.968338
9   26.996397  0.597035            2   21.488868  28.968338
10  28.601867 -0.131712            2   21.488868  28.968338
11  28.660986  0.710856            2   21.488868  28.968338
12  28.875395 -0.355208            2   21.488868  28.968338
13  28.959320 -0.430759            2   21.488868  28.968338
14  29.702800 -0.554742          NaN         NaN        NaN

Any suggestions from time series-savvy people out there would be greatly appreciated.

Update, after Jeff's answer:

The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds long). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.

Recommended answer

You could use np.searchsorted to find, for each value in data['time'], the position at which it would be inserted into intervals['start_time']. Then you could call np.searchsorted again to find the position at which each value in data['time'] would be inserted into intervals['end_time']. Note that using np.searchsorted relies on intervals['start_time'] and intervals['end_time'] being in sorted order.

For each corresponding location in the two index arrays, where the indices are equal (after subtracting 1 from the start index, as in the code below), the data['time'] value lies between intervals['start_time'] and intervals['end_time'] of that interval. Note that this relies on the intervals being disjoint.
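
To make the index comparison concrete, here is a minimal sketch on hypothetical toy arrays (starts, ends, and times are made up for illustration and are not part of the original example). The intervals are deliberately of irregular length. Passing side='right' for the start search is an assumption added here so that a time exactly equal to a start counts as inside the interval.

```python
import numpy as np

# Hypothetical toy data: three sorted, disjoint intervals of irregular length.
starts = np.array([0.0, 10.0, 12.0])
ends = np.array([5.0, 11.0, 100.0])
times = np.array([2.0, 7.0, 10.5, 50.0, 150.0])

# Index of the last interval that starts at or before each time (-1 if none).
start_idx = np.searchsorted(starts, times, side='right') - 1
# Index of the first interval that ends at or after each time (len(ends) if none).
end_idx = np.searchsorted(ends, times, side='left')

# A time lies inside an interval exactly when both searches name the same interval.
inside = start_idx == end_idx
# Interval position for each time, with -1 marking "in no interval".
interval_of = np.where(inside, start_idx, -1)

print(start_idx)    # [0 0 1 2 2]
print(end_idx)      # [0 1 1 2 3]
print(interval_of)  # [ 0 -1  1  2 -1]
```

The time 7.0 falls in the gap between the first two intervals, so the start search says interval 0 while the end search says interval 1, and the mismatch marks it as unmatched.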

Using searchsorted in this way is about 5 times faster than using the for loop:

import pandas as pd
import numpy as np

np.random.seed(1)
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
                     'value':np.random.uniform(-1,1,size=50)})

intervals = pd.DataFrame(
    {'interval_id':np.arange(9),
     'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),    
     'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})

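def using_loop():
    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        # Boolean mask of the points inside this interval (bounds inclusive).
        in_interval = ((data['time'] >= ser['start_time']) &
                       (data['time'] <= ser['end_time']))
        # .loc avoids chained assignment, which may write to a temporary copy.
        data.loc[in_interval, 'interval_id'] = ser['interval_id']

    # sort_values replaces DataFrame.sort, which was removed from pandas.
    result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)
    return result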

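def using_searchsorted():
    times = data['time'].values
    # Index of the last interval whose start_time is <= time (-1 if there is none).
    # side='right' keeps a time equal to a start_time inside, matching >= in the loop.
    start_idx = np.searchsorted(intervals['start_time'].values, times, side='right') - 1
    # Index of the first interval whose end_time is >= time.
    end_idx = np.searchsorted(intervals['end_time'].values, times)
    # A point lies inside an interval exactly when both searches agree.
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    # .loc replaces the removed .ix indexer and avoids chained assignment.
    result.loc[mask, 'interval_id'] = intervals['interval_id'].values[start_idx[mask]]
    result.loc[mask, 'start_time'] = intervals['start_time'].values[start_idx[mask]]
    result.loc[mask, 'end_time'] = intervals['end_time'].values[end_idx[mask]]
    return result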

In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop

In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop

In [256]: 7.74/1.56
Out[256]: 4.961538461538462
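
As a possible alternative that is not part of the original answer, recent pandas versions provide pd.IntervalIndex, whose get_indexer method looks up the interval containing each value and returns -1 where there is none. The sketch below uses hypothetical toy frames with irregular interval lengths, and it assumes the intervals do not overlap (get_indexer raises an error for overlapping intervals). No timing comparison against searchsorted is claimed here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only.
data = pd.DataFrame({'time': [2.0, 7.0, 10.5, 50.0, 150.0],
                     'value': [0.1, 0.2, 0.3, 0.4, 0.5]})
intervals = pd.DataFrame({'interval_id': [0, 1, 2],
                          'start_time': [0.0, 10.0, 12.0],
                          'end_time': [5.0, 11.0, 100.0]})

# closed='both' makes both bounds inclusive, like the >= / <= tests in the loop.
index = pd.IntervalIndex.from_arrays(intervals['start_time'].to_numpy(),
                                     intervals['end_time'].to_numpy(),
                                     closed='both')

# Position of the containing interval for each time, or -1 if there is none.
pos = index.get_indexer(data['time'].to_numpy())
matched = pos >= 0

result = data.copy()
for col in ['interval_id', 'start_time', 'end_time']:
    # Start from NaN and fill in only the rows that fell inside an interval.
    values = np.full(len(data), np.nan)
    values[matched] = intervals[col].to_numpy()[pos[matched]]
    result[col] = values

print(result)
```

Rows whose time falls in a gap or past the last interval keep NaN in the three interval columns, which mirrors the layout of the example result in the question.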

