Creating a TimeseriesGenerator with multiple inputs

Date: 2023-03-24

Problem description

I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks. Due to memory limits, I cannot hold everything in memory after converting the data to sequences for the model.



This leads me to using a generator instead, like the TimeseriesGenerator from Keras / TensorFlow. The problem is that if I run the generator on all of my data stacked together, it creates sequences that mix stocks. For example, with a sequence length of 5, Sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2".
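To make the boundary problem concrete, here is a minimal NumPy sketch. It does not call TimeseriesGenerator itself: plain sliding windows over a hypothetical column of stock ids (6 observations per stock, an assumption chosen to match the description above) stand in for the way a generator slices stacked rows.

```python
import numpy as np

seq_len = 5
# Hypothetical toy data: one stock id per row, 6 daily observations per stock
stock_ids = np.array([1] * 6 + [2] * 6)

# Sliding windows over the stacked rows, ignoring stock boundaries
windows = np.lib.stride_tricks.sliding_window_view(stock_ids, seq_len)

# A window is "mixed" when it contains more than one stock id
mixed = [i for i, w in enumerate(windows) if len(set(w.tolist())) > 1]

print(windows[2])  # [1 1 1 1 2] -> the third window crosses into stock 2
print(mixed)       # [2, 3, 4, 5]
```

Every window that starts within the last seq_len - 1 rows of a stock ends up mixed, so the number of bad sequences grows with both the sequence length and the number of stocks.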



Instead, I want every sequence to contain observations from a single stock only.
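One way to get windows that never cross a stock boundary is to build them stock by stock. The sketch below is only an illustration under assumptions: the helper name windows_per_stock, the toy arrays, and the (n_windows, seq_len, n_features) layout are mine, not part of the original question.

```python
import numpy as np

def windows_per_stock(values, stock_ids, seq_len):
    """Build sliding windows separately for each stock (hypothetical helper)."""
    out = []
    for sid in np.unique(stock_ids):
        rows = values[stock_ids == sid]  # rows of one stock, original order kept
        if len(rows) < seq_len:
            continue  # too short to form a single window
        # Windowing along axis 0 gives shape (n_windows, n_features, seq_len)
        w = np.lib.stride_tricks.sliding_window_view(rows, seq_len, axis=0)
        out.append(np.moveaxis(w, -1, 1))  # -> (n_windows, seq_len, n_features)
    return np.concatenate(out, axis=0)

# Toy data: 2 stocks, 6 observations each, 2 features per observation
values = np.arange(24, dtype=np.float32).reshape(12, 2)
stock_ids = np.array([1] * 6 + [2] * 6)

seqs = windows_per_stock(values, stock_ids, seq_len=5)
print(seqs.shape)  # (4, 5, 2): two windows per stock, none of them mixed
```

This still materialises all windows in memory, so with ~4000 stocks it would have to be applied per stock and written to disk rather than concatenated in one go.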



Slightly similar question: Merge or append multiple Keras TimeseriesGenerator objects into one

I explored the option of combining the generators as this SO question suggests (How do I combine two keras generator functions), but that is not ideal with ~4000 generators.

I hope my question makes sense.
Solution
What I ended up doing was to do all the preprocessing manually and save one .npy file per stock containing that stock's preprocessed sequences. Then, using a hand-written generator, I build batches like this:
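import numpy as np
import tensorflow as tf

class seq_generator():

  def __init__(self, list_of_filepaths):
    # Map each .npy path to the list of sequence indices already yielded
    self.usedDict = dict()
    for path in list_of_filepaths:
      self.usedDict[path] = []

  def generate(self):
    # Note: once every sequence of every file has been used,
    # this loop keeps running without yielding anything.
    while True:
      # Pick a random stock file and load its preprocessed sequences
      path = np.random.choice(list(self.usedDict.keys()))
      stock_array = np.load(path)  # shape: (n_sequences, n_timesteps, n_features)
      random_sequence = np.random.randint(stock_array.shape[0])
      # Only yield sequences that have not been used before
      if random_sequence not in self.usedDict[path]:
        self.usedDict[path].append(random_sequence)
        yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

# from_generator expects a callable, so pass the bound method without calling it.
# The generator yields one array per step, so a single dtype and shape are given.
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))

train_dataset = train_dataset.batch(batch_size)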
Here list_of_filepaths is simply a list of paths to the preprocessed .npy files.



This will:

Load a random stock's preprocessed .npy data
Pick a sequence at random
Check whether the index of that sequence has already been recorded in usedDict
If not:
- Append the index of that sequence to usedDict, so that the same data is not fed to the model twice
- Yield the sequence



This means that the generator feeds a single unique sequence from a random stock at each "call", which lets me use the .from_generator() and .batch() methods of TensorFlow's Dataset type.
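As a variation on the same idea (not the code from the answer), the sketch below writes one .npy file per stock and then yields every stored sequence exactly once per pass, in random order, so the generator terminates instead of looping once everything has been used. The names save_stock_sequences and epoch_generator and the toy data are hypothetical. It opens each file with mmap_mode="r" so that only the requested sequence is read, not the whole file.

```python
import os
import tempfile
import numpy as np

def save_stock_sequences(out_dir, sequences_by_stock):
    """Write one .npy file per stock and return the list of file paths."""
    paths = []
    for name, seqs in sequences_by_stock.items():
        path = os.path.join(out_dir, f"{name}.npy")
        np.save(path, seqs.astype(np.float32))
        paths.append(path)
    return paths

def epoch_generator(list_of_filepaths, seed=None):
    """Yield each sequence of each file exactly once, in random order."""
    rng = np.random.default_rng(seed)
    pairs = []
    for p in list_of_filepaths:
        # Memory-mapping gives access to the shape without loading the data
        n = np.load(p, mmap_mode="r").shape[0]
        pairs.extend((p, i) for i in range(n))
    for k in rng.permutation(len(pairs)):
        path, idx = pairs[k]
        # Copy only the selected sequence out of the memory-mapped file
        yield np.array(np.load(path, mmap_mode="r")[idx])

# Toy usage: 2 hypothetical stocks, sequences of shape (5 timesteps, 2 features)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy = {"stock_a": np.zeros((3, 5, 2)), "stock_b": np.ones((2, 5, 2))}
    paths = save_stock_sequences(tmp_dir, toy)
    seqs = list(epoch_generator(paths, seed=0))

print(len(seqs), seqs[0].shape)  # 5 (5, 2)
```

To feed this into tf.data, something like tf.data.Dataset.from_generator(lambda: epoch_generator(paths), output_signature=tf.TensorSpec(shape=(n_timesteps, n_features), dtype=tf.float32)) should work on recent TensorFlow versions. The list of (path, index) pairs is held in memory, which is small compared with the sequences themselves but still grows with the total number of sequences.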
