├── LICENSE
├── README.MD
├── __init__.py
├── datasets
│   ├── file_01.csv
│   └── file_02.csv
├── img
│   ├── lstm_dnn_graph.png
│   └── time_series_LSTM_001.png
├── time_series_lstm.py
└── utils
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-35.pyc
    │   ├── cleaner.cpython-35.pyc
    │   ├── data_loader.cpython-35.pyc
    │   └── data_splitter.cpython-35.pyc
    ├── cleaner.py
    ├── data_loader.py
    └── data_splitter.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2016 Matt Shaffer

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.MD:
--------------------------------------------------------------------------------
## Time Series Prediction with LSTM Neural Network

This is part of a project working with time series to build a predictive model around bitcoin prices. Stock and cryptocurrency data are noisy and likely unpredictable with small amounts of data, so this is more an exercise in experimentation than a serious attempt to gain any advantage in the market.

Ticker data is collected at 15-second intervals and predicted 5 minutes into the future. In addition to ticker data, a second data file contains features collected using the [Bitcoin Developer Websocket API](https://blockchain.info), which provides streaming data on new transactions and blocks. Features are organized into epochs and normalized to zero mean and unit variance.

Several variations of network architecture can be used to model the data with small modifications to the layers. In its current form, the model uses a Long Short-Term Memory (LSTM) network combined with a stacked neural network and dropout. LSTMs can learn long-term dependencies, usually in the context of sequences of speech or text, so they seemed like a logical choice for building a time series predictor.

The graph for this configuration is as follows:

![lstm graph](img/lstm_dnn_graph.png)

This project uses the development branch of TensorFlow, currently version [0.11.0rc0](https://www.tensorflow.org/).

More data is needed to adequately assess results, and more error metrics will be pushed soon. For now, results look something like this after 2000 epochs, omitting the optional DNN:

```
Training Error: 0.85 RMSE
Test Error: 0.82 RMSE

Loss: 0.665656
done in 4.848s.
Overall time: 794.409s
```
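
The RMSE and absolute-error figures above are computed with scikit-learn, roughly along these lines (a minimal sketch mirroring the metrics section of `time_series_lstm.py`; `model`, `X_train`, `y_train`, `X_test`, and `y_test` come from the main script):

```python
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Predictions on the training and test sets (model is the fitted estimator)
y_train_predicted = model.predict(X_train)
y_test_predicted = model.predict(X_test)

# Root mean squared error and mean absolute error for both splits
train_rmse = sqrt(mean_squared_error(y_train, y_train_predicted))
test_rmse = sqrt(mean_squared_error(y_test, y_test_predicted))
train_mae = mean_absolute_error(y_train, y_train_predicted)
test_mae = mean_absolute_error(y_test, y_test_predicted)
```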

![results graph](img/time_series_LSTM_001.png)

Directories and scripts (a usage sketch follows the list):

- `time_series_lstm.py` - main program
- `utils/cleaner.py` - cleans the two datasets and fills in missing data with a redundant data source
- `utils/data_loader.py` - loads and formats raw data into dataframes
- `utils/data_splitter.py` - splits into training, test, and optional validation sets
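
These pieces chain together roughly as follows (a condensed sketch of the data-preparation steps in `time_series_lstm.py`; column names, the 12-step horizon, and the split sizes are the defaults defined there, and the dataset CSVs are assumed to be present):

```python
from utils.data_loader import read_data, loader_data
from utils.data_splitter import split_data_sets
from utils.cleaner import est_nan, replace_nans_noise

X_columns = ['btce-buy', 'btce-volume', 'btce-low', 'btce-sell', 'btce-high']  # subset of the features listed in time_series_lstm.py

data = read_data('./datasets/file_01.csv', './datasets/file_02.csv')

# Fill gaps in the target price from the redundant source, then replace any
# remaining NaN features with noise drawn from each feature's own distribution
data = est_nan(data, 'btce-price', 'cd-price')
data = replace_nans_noise(data, X_columns)

# Shift the target forward by the prediction horizon and normalize the features
X, y = loader_data(source=data, y_column='btce-price', X_columns=X_columns,
                   inputs_per_column={}, inputs_default=0, steps_forward=12)

X_train, y_train, X_test, y_test = split_data_sets(X, y, test_size=0.2, sets=6)
```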

To run:

Install dependencies:

- Tensorflow
- Numpy
- Pandas
- Matplotlib

Then run:

```
python3 time_series_lstm.py
```

Then send donations to:

```
3DrXNQydchoC7aaNHa4F4BSNCL2YyZJAjn
```

## License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/__init__.py
--------------------------------------------------------------------------------
/img/lstm_dnn_graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/img/lstm_dnn_graph.png
--------------------------------------------------------------------------------
/img/time_series_LSTM_001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/img/time_series_LSTM_001.png
--------------------------------------------------------------------------------
/time_series_lstm.py:
--------------------------------------------------------------------------------
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.contrib.layers as layers
import tensorflow.contrib.learn as skflow
from sklearn.metrics import mean_squared_error, mean_absolute_error
import time
from utils.data_loader import read_data, loader_data
from utils.data_splitter import split_data_sets
from utils.cleaner import est_nan, replace_nans_noise

plotting = False
log_dir = 'tmp/timeseries/'  # directory for TensorBoard logging

# Data Files
file_01 = './datasets/file_01.csv'
file_02 = './datasets/file_02.csv'

# Hyperparameters
prediction = 12  # How many steps into the future to predict (12 = 5 min here)
steps_forward = prediction
steps_backward = 0  # value <= 0
inputs_default = 0
hidden = 128
batch_size = 256
n_steps = seq_len = 1  # number of elements in sequence to classify
epochs = 2000
test_sets = 6
test_size = 0.2

dnn_hidden = [12, 24, 48, 24, 12]  # Hidden layers in fully connected layer (if enabled)

# Test and validation set size if using validation set
test_size_val = 0.2
validation_size = 0.2

# LSTM Hyperparameters
learning_rate = 0.05
forget_bias = 0.8
keep_prob = 0.8  # 0.9998

# Inputs and target value
X_columns = ['btce-buy', 'btce-volume', 'btce-low', 'btce-sell', 'btce-high',
             'threshold_setting', 'prev_tx_count_in', 'prev_mass_out',
             'prev_value_in', 'prev_mass_in', 'prev_value_out', 'prev_tx_count_out']
y_column = 'btce-price'
y_column_ref = 'cd-price'  # Fills in any missing target data with data from a redundant data set

# Optional feature shifting
input_range = {}


t0 = time.time()

def lstm_model(X, y):

    X = tf.reshape(X, [-1, n_steps, n_input])  # shape: [batch_size, n_steps, n_input]
    X = tf.transpose(X, [1, 0, 2])             # shape: [n_steps, batch_size, n_input]
    X = tf.reshape(X, [-1, n_input])           # shape: [n_steps*batch_size, n_input]

    # Split data for sequences
    X = tf.split(0, n_steps, X)  # n_steps * (batch_size, n_input)

    init = tf.random_normal_initializer(batch_size, stddev=0.05)  # currently unused
    lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden, forget_bias=forget_bias)

    # Dropout
    lstm_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=keep_prob)

    output, _ = tf.nn.rnn(lstm_cell, X, dtype=tf.float32)

    # Fully connected layer with dropout
    output = tf.nn.dropout(
        layers.stack(output[0],
                     layers.fully_connected,
                     dnn_hidden),
        keep_prob
    )

    regression = skflow.models.linear_regression(output, y)  # Use output[0] if omitting fully connected layer

    return regression

def data_from_csv(file_01, file_02):
    data = read_data(file_01, file_02)

    print('Cleaning NaN values from price graph...')
    data = est_nan(data, y_column, y_column_ref)
    data = replace_nans_noise(data, X_columns)
    print("done in %0.3fs." % (time.time() - t0))
    return data

t1 = time.time()
data = data_from_csv(file_01, file_02)

print('Generating training, test and validation sets...')
X, y = loader_data(source=data, y_column=y_column, X_columns=X_columns,
                   inputs_per_column=input_range, inputs_default=inputs_default,
                   steps_forward=prediction)

X_train, y_train, X_test, y_test = split_data_sets(X, y, test_size, test_sets)

print("done in %0.3fs." % (time.time() - t1))
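
# Optional sanity check (illustrative): with test_size = 0.2 and test_sets = 6,
# roughly 80% of the rows should land in the training arrays and 20% in the test arrays.
print('Train/test shapes: {0} {1} {2} {3}'.format(
    X_train.shape, y_train.shape, X_test.shape, y_test.shape))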


#______________________________________________________________________#
#                                MODELS                                #
#______________________________________________________________________#


t1 = time.time()
print('Building model...')

# number of training steps (integer division keeps steps an int under Python 3)
steps = (X_train.shape[0] // batch_size) * epochs
n_input = X_train.shape[1]
print('Number of features: {0}'.format(n_input))

X_train, y_train = X_train.astype(np.float32).copy(), y_train.astype(np.float32).copy()
X_test, y_test = X_test.astype(np.float32).copy(), y_test.astype(np.float32).copy()
X, y = X.astype(np.float32).copy(), y.astype(np.float32).copy()

model = skflow.TensorFlowEstimator(
    model_fn=lstm_model,
    n_classes=0,  # n_classes = 0 for regression
    verbose=0,
    batch_size=batch_size,
    steps=steps,
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=learning_rate,
        l1_regularization_strength=0.001),
    config=skflow.RunConfig(save_checkpoints_secs=1)
)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:

    # TensorBoard Summary Writer
    summary_writer = tf.train.SummaryWriter(log_dir, sess.graph)

    model.fit(X_train, y_train, logdir=log_dir)

print("done in %0.3fs." % (time.time() - t1))

#______________________________________________________________________#
#                                RESULTS                               #
#______________________________________________________________________#

t1 = time.time()
print('Predicting outputs...')

y_hat = model.predict(X)
y_train_predicted = model.predict(X_train)
train_rmse = sqrt(mean_squared_error(y_train, y_train_predicted))
y_test_predicted = model.predict(X_test)
test_rmse = sqrt(mean_squared_error(y_test, y_test_predicted))

train_err_abs = mean_absolute_error(y_train, y_train_predicted)
test_err_abs = mean_absolute_error(y_test, y_test_predicted)

print('Training Error: %.2f RMSE' % (train_rmse))
print('Test Error: %.2f RMSE' % (test_rmse))

print('Train Error Abs: %.2f Absolute Error' % (train_err_abs))
print('Test Error Abs: %.2f Absolute Error' % (test_err_abs))

loss_ = model.evaluate(X_test, y_test)
print('Loss: {0:f}'.format(loss_[0]['loss']))

print("done in %0.3fs." % (time.time() - t1))
print("Overall time: %0.3fs" % (time.time() - t0))
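
# Optional (illustrative): persist the test-set predictions next to the targets so
# that further error metrics can be computed offline.
np.savetxt('predictions_test.csv',
           np.column_stack((np.ravel(y_test), np.ravel(y_test_predicted))),
           delimiter=',', header='y_true,y_pred', comments='')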

#______________________________________________________________________#
#                               PLOTTING                               #
#______________________________________________________________________#

# Plot results (defined before use so the call below works when plotting is enabled)
def plot_results(X_plot, A_plot, B_plot):
    plt.plot(X_plot, A_plot, 'blue', alpha=0.5, label='actual')
    plt.plot(X_plot, B_plot, 'red', alpha=0.5, label='predicted')

    plt.legend(loc='lower left')
    plt.show()

if plotting:
    plt.clf()
    X_plot = np.ravel(data['epoch'])[:-prediction]
    A_plot = np.ravel(y)
    B_plot = np.ravel(y_hat)

    plot_results(X_plot, A_plot, B_plot)
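
# The summaries written to log_dir can be browsed with TensorBoard (assumes the
# `tensorboard` command installed alongside this TensorFlow build):
print('To inspect the training graph and summaries, run: tensorboard --logdir=' + log_dir)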
--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__init__.py
--------------------------------------------------------------------------------
/utils/__pycache__/__init__.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/__init__.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/__pycache__/cleaner.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/cleaner.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/__pycache__/data_loader.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/data_loader.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/__pycache__/data_splitter.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/data_splitter.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/cleaner.py:
--------------------------------------------------------------------------------
import pandas as pd
import math
import time
import numpy as np
import matplotlib.pyplot as plt  # needed by plot_results below

'''Find and replace NaN values'''
def est_nan(data, target_feature, reference_feature):

    plotting = False  # Show plots for data estimation where missing values were found

    # Max number of values to use for ratio
    tail_n = 100

    # make sure there are values for first and last rows
    if (pd.isnull(data[target_feature].iloc[-1])):
        print('NaN values at end of data with length: ' + str(len(data)))
        trim_at = data[target_feature].iloc[:(len(data) - 1)].last_valid_index()
        row_drop_num = len(data) - trim_at
        print('Dropping %d rows' % row_drop_num)
        data = data.drop(data.index[trim_at: -1])
        print('New length of dataset: ' + str(len(data)))

    if (pd.isnull(data[target_feature].iloc[0])):
        print('NaN values at beginning of data with length: ' + str(len(data)))
        trim_at = data[target_feature].iloc[0:].first_valid_index()
        row_drop_num = trim_at
        print('Dropping %d rows' % row_drop_num)
        data = data.drop(data.index[0: trim_at])
        print('New length of dataset: ' + str(len(data)))

    # find indexes of NaNs in A and B columns and create arrays
    nanindex = data.index[data[target_feature].apply(np.isnan)]
    valIndex = data.index[data[target_feature].apply(np.isfinite)]
    valAIndex = data.index[data[reference_feature].apply(np.isfinite)]
    dualIndex = data.index[data[target_feature].apply(np.isfinite) & data[reference_feature].apply(np.isfinite)]

    df_index = data.index.values.tolist()
    nindex = [df_index.index(i) for i in nanindex]
    # valArray = [df_index.index(i) for i in valIndex]

    # bcRatio set as 1, unless using Coindesk values to fill in NaNs
    try:
        # sum the last 100 values (~2 hours) of ticker data to get the conversion rate
        bcRatio = (
            sum(data[target_feature].ix[dualIndex].tail(tail_n)) / sum(data[reference_feature].ix[dualIndex].tail(tail_n)))
    except:
        bcRatio = 1

    # Find nearest value function
    def find_nearest(array, value):
        idx = np.searchsorted(array, value, side="left")
        if idx > 0 and (idx == len(array) or math.fabs(value - array[idx - 1]) < math.fabs(value - array[idx])):
            return array[idx - 1]
        else:
            return array[idx]

    nanStart = 0
    nanEnd = 0
    prevNanIndex = -1
    for n in range(len(nindex)):

        # Indices of NaN array
        n_i_1t = (nindex[n] - 1)
        n_i_t = nindex[n]
        n_i_t1 = (nindex[n] + 1)

        # Values of NaN Array
        n_v_1t = data.ix[n_i_1t][reference_feature]

        # If the last value in the data array is NaN
        # and the next value is not NaN
        if (prevNanIndex == n_i_1t) & (n_i_t1 not in nindex):

            # The NaN Series ends with the next non NaN index
            nanEnd = n_i_t1
            placeholder = float(data.loc[nanStart, target_feature])

            # The number of NaN values in the series
            nanDiff = nanEnd - (nanStart + 1)

            # The averaged difference in values between start of NaN series and end of NaN Series
            diff = (data.ix[nanEnd][target_feature] - data.ix[nanStart][target_feature]) / (nanDiff + 1)

            # For each NaN in series, replace with scaled value
            for i in range(nanDiff):

                # Local index of NaN series
                r = i + 1
                # Global index of the dataframe
                row_index = nanStart + r

                # Find the nearest value to serve as reference
                nearestA = find_nearest(valAIndex, (row_index))
                nearestB = find_nearest(valIndex, (row_index))
                nnA = abs(nearestA - row_index)
                nnB = abs(nearestB - row_index)

                if (nnB <= nnA):

                    # Increment by the averaged difference
                    increment = r * diff
                    estimated = (placeholder + increment)
                    data.loc[row_index, target_feature] = estimated

                else:
                    # If A is closer use the conversion rate to port over values
                    placeholderA = data.loc[nearestA, reference_feature]
                    estimated = placeholderA * float(bcRatio)
                    data.loc[row_index, target_feature] = estimated

            # Reset Series Variables
            nanStart = 0
            nanEnd = 0
            prevNanIndex = -1

        # If the last value was NaN and so is the next
        elif (prevNanIndex == n_i_1t) & (n_i_t1 in nindex):
            pass

        # If the last value is not NaN, but the next is, mark the start index
        elif (n_i_1t not in nindex) & (n_i_t1 in nindex):
            nanStart = n_i_1t

        # If only one NaN is found isolated, use the preceding and following values to fill it in
        elif (n_i_1t not in nindex) & (n_i_t1 not in nindex):
            nanDiff = n_i_t1 - (n_i_1t + 1)
            placeholder = float(data.loc[n_i_1t, target_feature])
            diff = (data.ix[n_i_t1][target_feature] - data.ix[n_i_1t][target_feature]) / float(nanDiff + 1)
            row_index = n_i_t
            estimated = (data.ix[n_i_1t][target_feature] + diff) * bcRatio
            data.loc[row_index, target_feature] = estimated

            # Reset Series Variables
            nanStart = 0
            nanEnd = 0
            prevNanIndex = -1

        else:
            print("Error matching NaN series")
            nanStart = n_i_1t

        # Set the index of the last NaN to the current index
        prevNanIndex = nindex[n]

    if plotting == True:
        # print(data)
        plot_results(data.index, data[target_feature], data[reference_feature])

    return data
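
# Worked example for est_nan (illustrative): with a gap of three NaNs between observed
# target prices 100.0 and 108.0, nanDiff = 3 and diff = (108.0 - 100.0) / 4 = 2.0, so the
# gap is filled with 102.0, 104.0 and 106.0; if the reference series has a valid
# observation closer to a given row, that reference value scaled by bcRatio is used instead.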

def replace_nans_noise(data, feature_columns):
    for col in range(len(feature_columns)):
        standard_deviation = data[feature_columns[col]].std(axis=0, skipna=True)
        mean_data = data[feature_columns[col]].mean(axis=0, skipna=True)
        data[feature_columns[col]] = [np.random.normal(mean_data, standard_deviation, 1)[0]
                                      if pd.isnull(data[feature_columns[col]].iloc[row])
                                      else data[feature_columns[col]].iloc[row]
                                      for row in range(len(data))]
    return data

# Plot results
def plot_results(X_plot, A_plot, B_plot):
    plt.plot(X_plot, A_plot, 'blue', alpha=0.5)
    plt.plot(X_plot, B_plot, 'red', alpha=0.5)

    plt.legend(loc='lower left')
    plt.show()
--------------------------------------------------------------------------------
/utils/data_loader.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def read_data(file_01, file_02):
    data_01 = pd.read_csv(
        file_01,
        parse_dates={'timeline': ['btce-time_stamp']},
        infer_datetime_format=True)
    data_02 = pd.read_csv(
        file_02,
        parse_dates={'timeline': ['epoch_time_stamp']},
        infer_datetime_format=True)

    data_02 = data_02.drop_duplicates('epoch')
    data_01['timeline'] = data_01['timeline'].astype(float)
    data_02['timeline'] = data_02['timeline'].astype(float)

    data_ = data_02.set_index('timeline').reindex(data_01.set_index('timeline').index, method='nearest').reset_index()
    data = pd.merge(data_01, data_, on='timeline', suffixes=('_', ''))
    return data

def loader_data(source, y_column, X_columns, inputs_per_column=None, inputs_default=3, steps_forward=1):
    # Shift the target by the number of steps specified in the prediction variable
    y = source[y_column].shift(-steps_forward)

    # Normalize data to zero mean and unit variance
    scaler = StandardScaler()
    new_X = pd.DataFrame(scaler.fit_transform(source[X_columns]), columns=X_columns)

    X = pd.DataFrame()

    for column in X_columns:
        inputs = inputs_per_column.get(column, None)
        if inputs:
            inputs_list = range(inputs[0], inputs[1]+1)
        else:
            inputs_list = range(-inputs_default, 1)

        for i in inputs_list:
            col_name = "%s_%s" % (column, i)
            X[col_name] = new_X[column].shift(-i)  # Note: shift direction is inverted

    X = pd.concat([X, y], axis=1)
    X.dropna(inplace=True, axis=0)
    y = X[y_column].values.reshape(X.shape[0], 1)
    X.drop([y_column], axis=1, inplace=True)

    return X.values, y
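
# Usage note (illustrative): inputs_per_column lets individual features contribute a
# window of lagged copies, e.g.
#     X, y = loader_data(source=data, y_column='btce-price', X_columns=['btce-volume'],
#                        inputs_per_column={'btce-volume': (-4, 0)}, inputs_default=0,
#                        steps_forward=12)
# creates columns btce-volume_-4 ... btce-volume_0 (the current value plus the four
# preceding observations); features without an entry fall back to inputs_default lags.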
--------------------------------------------------------------------------------
/utils/data_splitter.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np

# Split the dataset into training and test sets. The series is divided into `sets`
# consecutive blocks; each block contributes its first (1 - test_size) share of rows
# to training and the remainder to test, so test data is spread across the whole
# time range rather than concentrated at the end.
def split_data_sets(X, y, test_size, sets=1):
    set_length = X.shape[0] // sets
    offset = 0
    X_train_lst = []
    y_train_lst = []
    X_test_lst = []
    y_test_lst = []
    for i in range(sets+1):  # the extra pass sweeps any remainder rows into training
        offset = i*set_length
        train_length = int(set_length*(1-test_size))
        train_start = offset
        train_end = offset + train_length
        test_start = train_end
        test_end = (i+1)*set_length
        X_train_lst.append(X[train_start:train_end])
        y_train_lst.append(y[train_start:train_end])
        X_test_lst.append(X[test_start:test_end])
        y_test_lst.append(y[test_start:test_end])
    X_train = np.concatenate(X_train_lst)
    y_train = np.concatenate(y_train_lst)
    X_test = np.concatenate(X_test_lst)
    y_test = np.concatenate(y_test_lst)
    return X_train, y_train, X_test, y_test

# Split into training, test and validation sets
# To do: combine with function above
def split_data_sets_with_validation(X, y, test_size, validation_size, sets=1):
    set_length = X.shape[0] // sets
    offset = 0
    X_train_lst = []
    y_train_lst = []
    X_test_lst = []
    y_test_lst = []
    X_validation_lst = []
    y_validation_lst = []
    for i in range(sets+1):
        offset = i*set_length
        train_length = int(set_length*(1-(test_size + validation_size)))
        train_start = offset
        train_end = offset + train_length
        test_length = int(set_length*test_size)
        test_start = train_end
        test_end = train_end + test_length
        validation_start = test_end
        validation_end = (i+1)*set_length
        X_train_lst.append(X[train_start:train_end])
        y_train_lst.append(y[train_start:train_end])
        X_test_lst.append(X[test_start:test_end])
        y_test_lst.append(y[test_start:test_end])
        X_validation_lst.append(X[validation_start:validation_end])
        y_validation_lst.append(y[validation_start:validation_end])
    X_train = np.concatenate(X_train_lst)
    y_train = np.concatenate(y_train_lst)
    X_test = np.concatenate(X_test_lst)
    y_test = np.concatenate(y_test_lst)
    X_validation = np.concatenate(X_validation_lst)
    y_validation = np.concatenate(y_validation_lst)
    return X_train, y_train, X_test, y_test, X_validation, y_validation
--------------------------------------------------------------------------------