我有一个数据框,其中为每个 ID 记录了 1 个或多个事件.对于每个事件,都会记录 id、度量 x 和日期.像这样的:
I have a data frame where 1 or more events are recorded for each id. For each event the id, a metric x and a date are recorded. Something like this:
import pandas as pd
import datetime as dt
import numpy as np
x = range(0, 6)
id = ['a', 'a', 'b', 'a', 'b', 'b']
dates = [dt.datetime(2012, 5, 2),dt.datetime(2012, 4, 2),dt.datetime(2012, 6, 2),
dt.datetime(2012, 7, 30),dt.datetime(2012, 4, 1),dt.datetime(2012, 5, 9)]
df =pd.DataFrame(np.column_stack((id,x,dates)), columns = ['id', 'x', 'dates'])
我希望能够设置一个回溯期(即 70 天),并为数据集中的每一行计算该 id 和所需回溯范围内的任何先前事件的 x 的累积总和(不包括 x对于正在执行计算的行).最终应该看起来像:
I'd like to be able to set a lookback period (i.e. 70 days) and calculate, for each row in the dataset, a cumulative sum of x for any preceding event for that id and within the desired lookback (excluding x for the row the calculation is being performed for). Should end up looking like:
id x dates want
0 a 0 2012-05-02 00:00:00 1
1 a 1 2012-04-02 00:00:00 0
2 b 2 2012-06-02 00:00:00 9
3 a 3 2012-07-30 00:00:00 0
4 b 4 2012-04-01 00:00:00 0
5 b 5 2012-05-09 00:00:00 4
嗯,一种方法如下: (1) 使用 'id' 作为分组变量执行 groupby/apply.(2) 在应用内,resample 组到每日时间序列.(3) 然后只需使用 rolling_sum (并移位,因此您不包括当前行的x"值)来计算 70 天回溯期的总和.(4) 将组归约到只有原来的观察:
Well, one approach is the following: (1) do a groupby/apply with 'id' as grouping variable. (2) Within the apply, resample the group to a daily time series. (3) Then just using rolling_sum (and shift so you don't include the current rows 'x' value) to compute the sum of your 70 day lookback periods. (4) Reduce the group back to only the original observations:
In [12]: df = df.sort(['id','dates'])
In [13]: df
Out[13]:
id x dates
1 a 1 2012-04-02
0 a 0 2012-05-02
3 a 3 2012-07-30
4 b 4 2012-04-01
5 b 5 2012-05-09
2 b 2 2012-06-02
您将需要按 ['id','dates'] 对数据进行排序.现在我们可以做 groupby/apply:
You are going to need your data sorted by ['id','dates']. Now we can do the groupby/apply:
In [15]: def past70(g):
g = g.set_index('dates').resample('D','last')
g['want'] = pd.rolling_sum(g['x'],70,0).shift(1)
return g[g.x.notnull()]
In [16]: df = df.groupby('id').apply(past70).drop('id',axis=1)
In [17]: df
Out[17]:
x want
id dates
a 2012-04-02 1 NaN
2012-05-02 0 1
2012-07-30 3 0
b 2012-04-01 4 NaN
2012-05-09 5 4
2012-06-02 2 9
如果您不想要 NaN,那么就这样做:
If you don't want the NaNs then just do:
In [28]: df.fillna(0)
Out[28]:
x want
id dates
a 2012-04-02 1 0
2012-05-02 0 1
2012-07-30 3 0
b 2012-04-01 4 0
2012-05-09 5 4
2012-06-02 2 9
如果您想让回溯窗口成为参数,请执行以下操作:
If you want to make the lookback window a parameter do something like the following:
def past_window(g,win=70):
g = g.set_index('dates').resample('D','last')
g['want'] = pd.rolling_sum(g['x'],win,0).shift(1)
return g[g.x.notnull()]
df = df.groupby('id').apply(past_window,win=10)
print df.fillna(0)
这篇关于Pandas 按组的时间累积总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持html5模板网!
如何在 Python 中将货币字符串转换为浮点数?How do I convert a currency string to a floating point number in Python?(如何在 Python 中将货币字符串转换为浮点数?)
在 Pandas 中解析多索引 Excel 文件Parsing a Multi-Index Excel File in Pandas(在 Pandas 中解析多索引 Excel 文件)
pandas 时间序列 between_datetime 函数?pandas timeseries between_datetime function?( pandas 时间序列 between_datetime 函数?)
pandas 重新采样到每月的特定工作日pandas resample to specific weekday in month( pandas 重新采样到每月的特定工作日)
在 Pandas 中合并/组合两个具有不同频率时间序列Merging/combining two dataframes with different frequency time series indexes in Pandas?(在 Pandas 中合并/组合两个具有不同频率时间序列索
Python - 如何标准化时间序列数据Python - how to normalize time-series data(Python - 如何标准化时间序列数据)