├── LICENSE
├── README.MD
├── __init__.py
├── datasets
│   ├── file_01.csv
│   └── file_02.csv
├── img
│   ├── lstm_dnn_graph.png
│   └── time_series_LSTM_001.png
├── time_series_lstm.py
└── utils
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-35.pyc
    │   ├── cleaner.cpython-35.pyc
    │   ├── data_loader.cpython-35.pyc
    │   └── data_splitter.cpython-35.pyc
    ├── cleaner.py
    ├── data_loader.py
    └── data_splitter.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2016 Matt Shaffer

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.MD:
--------------------------------------------------------------------------------
## Time Series Prediction with LSTM Neural Network

This is part of a project working with time series to build a predictive model around bitcoin prices. Stock and cryptocurrency data are noisy and likely unpredictable with small amounts of data, so this is more an exercise in experimentation than a serious attempt to gain any advantage in the market.

Ticker data is collected at 15-second intervals and predicted 5 minutes into the future. In addition to ticker data, a second data file contains features collected using the [Bitcoin Developer Websocket API](https://blockchain.info), which provides streaming data on new transactions and blocks. Features are organized into epochs and normalized to zero mean and unit variance.

Several variations of network architecture can be used to model the data with small modifications to the layers. In its current form, the model uses a Long Short-Term Memory (LSTM) network combined with a stacked neural network and dropout. LSTMs can learn long-term dependencies, usually in the context of sequences of speech or text, so they seemed like a logical choice for building a time series predictor.

The graph for this configuration is as follows:

![lstm graph](img/lstm_dnn_graph.png)

This project uses the development branch of TensorFlow, currently version [0.11.0rc0](https://www.tensorflow.org/).

More data is needed to adequately assess results, and more error metrics will be pushed soon. For now, results look something like this after 2000 epochs, omitting the optional DNN:

```
Training Error: 0.85 RMSE
Test Error: 0.82 RMSE

Loss: 0.665656
done in 4.848s.
Overall time: 794.409s
```
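
The RMSE and absolute-error figures above are computed with scikit-learn, roughly along these lines (a minimal sketch mirroring the metrics section of `time_series_lstm.py`; `model`, `X_train`, `y_train`, `X_test`, and `y_test` come from the main script):

```python
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Predictions on the training and test sets (model is the fitted estimator)
y_train_predicted = model.predict(X_train)
y_test_predicted = model.predict(X_test)

# Root mean squared error and mean absolute error for both splits
train_rmse = sqrt(mean_squared_error(y_train, y_train_predicted))
test_rmse = sqrt(mean_squared_error(y_test, y_test_predicted))
train_mae = mean_absolute_error(y_train, y_train_predicted)
test_mae = mean_absolute_error(y_test, y_test_predicted)
```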

![results graph](img/time_series_LSTM_001.png)

Directories and scripts (a usage sketch follows the list):

- `time_series_lstm.py` - main program
- `utils/cleaner.py` - cleans the two datasets and fills in missing data with a redundant data source
- `utils/data_loader.py` - loads and formats raw data into dataframes
- `utils/data_splitter.py` - splits into training, test, and optional validation sets
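
These pieces chain together roughly as follows (a condensed sketch of the data-preparation steps in `time_series_lstm.py`; column names, the 12-step horizon, and the split sizes are the defaults defined there, and the dataset CSVs are assumed to be present):

```python
from utils.data_loader import read_data, loader_data
from utils.data_splitter import split_data_sets
from utils.cleaner import est_nan, replace_nans_noise

X_columns = ['btce-buy', 'btce-volume', 'btce-low', 'btce-sell', 'btce-high']  # subset of the features listed in time_series_lstm.py

data = read_data('./datasets/file_01.csv', './datasets/file_02.csv')

# Fill gaps in the target price from the redundant source, then replace any
# remaining NaN features with noise drawn from each feature's own distribution
data = est_nan(data, 'btce-price', 'cd-price')
data = replace_nans_noise(data, X_columns)

# Shift the target forward by the prediction horizon and normalize the features
X, y = loader_data(source=data, y_column='btce-price', X_columns=X_columns,
                   inputs_per_column={}, inputs_default=0, steps_forward=12)

X_train, y_train, X_test, y_test = split_data_sets(X, y, test_size=0.2, sets=6)
```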

To run:

Install dependencies:

- Tensorflow
- Numpy
- Pandas
- Matplotlib

Then run:

```
python3 time_series_lstm.py
```

Then send donations to:

```
3DrXNQydchoC7aaNHa4F4BSNCL2YyZJAjn
```

## License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/__init__.py
--------------------------------------------------------------------------------
/img/lstm_dnn_graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/img/lstm_dnn_graph.png
--------------------------------------------------------------------------------
/img/time_series_LSTM_001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/img/time_series_LSTM_001.png
--------------------------------------------------------------------------------
/time_series_lstm.py:
--------------------------------------------------------------------------------
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.contrib.layers as layers
import tensorflow.contrib.learn as skflow
from sklearn.metrics import mean_squared_error, mean_absolute_error
import time
from utils.data_loader import read_data, loader_data
from utils.data_splitter import split_data_sets
from utils.cleaner import est_nan, replace_nans_noise

plotting = False
log_dir = 'tmp/timeseries/'  # directory for TensorBoard logging

# Data Files
file_01 = './datasets/file_01.csv'
file_02 = './datasets/file_02.csv'

# Hyperparameters
prediction = 12  # How many steps into the future to predict (12 = 5 min here)
steps_forward = prediction
steps_backward = 0  # value <= 0
inputs_default = 0
hidden = 128
batch_size = 256
n_steps = seq_len = 1  # number of elements in sequence to classify
epochs = 2000
test_sets = 6
test_size = 0.2

dnn_hidden = [12, 24, 48, 24, 12]  # Hidden layers in fully connected layer (if enabled)

# Test and validation set size if using validation set
test_size_val = 0.2
validation_size = 0.2

# LSTM Hyperparameters
learning_rate = 0.05
forget_bias = 0.8
keep_prob = 0.8  # 0.9998

# Inputs and target value
X_columns = ['btce-buy', 'btce-volume', 'btce-low', 'btce-sell', 'btce-high',
             'threshold_setting', 'prev_tx_count_in', 'prev_mass_out',
             'prev_value_in', 'prev_mass_in', 'prev_value_out', 'prev_tx_count_out']
y_column = 'btce-price'
y_column_ref = 'cd-price'  # Fills in any missing target data with data from a redundant data set

# Optional feature shifting
input_range = {}


t0 = time.time()

def lstm_model(X, y):

    X = tf.reshape(X, [-1, n_steps, n_input])  # shape: [batch_size, n_steps, n_input]
    X = tf.transpose(X, [1, 0, 2])             # shape: [n_steps, batch_size, n_input]
    X = tf.reshape(X, [-1, n_input])           # shape: [n_steps*batch_size, n_input]

    # Split data for sequences
    X = tf.split(0, n_steps, X)  # n_steps * (batch_size, n_input)

    init = tf.random_normal_initializer(batch_size, stddev=0.05)  # currently unused
    lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden, forget_bias=forget_bias)

    # Dropout
    lstm_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=keep_prob)

    output, _ = tf.nn.rnn(lstm_cell, X, dtype=tf.float32)

    # Fully connected layer with dropout
    output = tf.nn.dropout(
        layers.stack(output[0],
                     layers.fully_connected,
                     dnn_hidden),
        keep_prob
    )

    regression = skflow.models.linear_regression(output, y)  # Use output[0] if omitting fully connected layer

    return regression

def data_from_csv(file_01, file_02):
    data = read_data(file_01, file_02)

    print('Cleaning NaN values from price graph...')
    data = est_nan(data, y_column, y_column_ref)
    data = replace_nans_noise(data, X_columns)
    print("done in %0.3fs." % (time.time() - t0))
    return data

t1 = time.time()
data = data_from_csv(file_01, file_02)

print('Generating training, test and validation sets...')
X, y = loader_data(source=data, y_column=y_column, X_columns=X_columns,
                   inputs_per_column=input_range, inputs_default=inputs_default,
                   steps_forward=prediction)

X_train, y_train, X_test, y_test = split_data_sets(X, y, test_size, test_sets)

print("done in %0.3fs." % (time.time() - t1))
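
# Optional sanity check (illustrative): with test_size = 0.2 and test_sets = 6,
# roughly 80% of the rows should land in the training arrays and 20% in the test arrays.
print('Train/test shapes: {0} {1} {2} {3}'.format(
    X_train.shape, y_train.shape, X_test.shape, y_test.shape))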


#______________________________________________________________________#
#                                MODELS                                #
#______________________________________________________________________#


t1 = time.time()
print('Building model...')

# number of training steps (integer division keeps steps an int under Python 3)
steps = (X_train.shape[0] // batch_size) * epochs
n_input = X_train.shape[1]
print('Number of features: {0}'.format(n_input))

X_train, y_train = X_train.astype(np.float32).copy(), y_train.astype(np.float32).copy()
X_test, y_test = X_test.astype(np.float32).copy(), y_test.astype(np.float32).copy()
X, y = X.astype(np.float32).copy(), y.astype(np.float32).copy()

model = skflow.TensorFlowEstimator(
    model_fn=lstm_model,
    n_classes=0,  # n_classes = 0 for regression
    verbose=0,
    batch_size=batch_size,
    steps=steps,
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=learning_rate,
        l1_regularization_strength=0.001),
    config=skflow.RunConfig(save_checkpoints_secs=1)
)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:

    # TensorBoard Summary Writer
    summary_writer = tf.train.SummaryWriter(log_dir, sess.graph)

    model.fit(X_train, y_train, logdir=log_dir)

print("done in %0.3fs." % (time.time() - t1))

#______________________________________________________________________#
#                                RESULTS                               #
#______________________________________________________________________#

t1 = time.time()
print('Predicting outputs...')

y_hat = model.predict(X)
y_train_predicted = model.predict(X_train)
train_rmse = sqrt(mean_squared_error(y_train, y_train_predicted))
y_test_predicted = model.predict(X_test)
test_rmse = sqrt(mean_squared_error(y_test, y_test_predicted))

train_err_abs = mean_absolute_error(y_train, y_train_predicted)
test_err_abs = mean_absolute_error(y_test, y_test_predicted)

print('Training Error: %.2f RMSE' % (train_rmse))
print('Test Error: %.2f RMSE' % (test_rmse))

print('Train Error Abs: %.2f Absolute Error' % (train_err_abs))
print('Test Error Abs: %.2f Absolute Error' % (test_err_abs))

loss_ = model.evaluate(X_test, y_test)
print('Loss: {0:f}'.format(loss_[0]['loss']))

print("done in %0.3fs." % (time.time() - t1))
print("Overall time: %0.3fs" % (time.time() - t0))
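
# Optional (illustrative): persist the test-set predictions next to the targets so
# that further error metrics can be computed offline.
np.savetxt('predictions_test.csv',
           np.column_stack((np.ravel(y_test), np.ravel(y_test_predicted))),
           delimiter=',', header='y_true,y_pred', comments='')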

#______________________________________________________________________#
#                               PLOTTING                               #
#______________________________________________________________________#

# Plot results (defined before use so the call below works when plotting is enabled)
def plot_results(X_plot, A_plot, B_plot):
    plt.plot(X_plot, A_plot, 'blue', alpha=0.5, label='actual')
    plt.plot(X_plot, B_plot, 'red', alpha=0.5, label='predicted')

    plt.legend(loc='lower left')
    plt.show()

if plotting:
    plt.clf()
    X_plot = np.ravel(data['epoch'])[:-prediction]
    A_plot = np.ravel(y)
    B_plot = np.ravel(y_hat)

    plot_results(X_plot, A_plot, B_plot)
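
# The summaries written to log_dir can be browsed with TensorBoard (assumes the
# `tensorboard` command installed alongside this TensorFlow build):
print('To inspect the training graph and summaries, run: tensorboard --logdir=' + log_dir)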
--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__init__.py
--------------------------------------------------------------------------------
/utils/__pycache__/__init__.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/__init__.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/__pycache__/cleaner.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/cleaner.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/__pycache__/data_loader.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/data_loader.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/__pycache__/data_splitter.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/planetceres/bitcoin-nn/469d351d59c585c561993cca00006eff3ce9ebb7/utils/__pycache__/data_splitter.cpython-35.pyc
--------------------------------------------------------------------------------
/utils/cleaner.py:
--------------------------------------------------------------------------------
import pandas as pd
import math
import time
import numpy as np
import matplotlib.pyplot as plt  # needed by plot_results below

'''Find and replace NaN values'''
def est_nan(data, target_feature, reference_feature):

    plotting = False  # Show plots for data estimation where missing values were found

    # Max number of values to use for ratio
    tail_n = 100

    # make sure there are values for first and last rows
    if (pd.isnull(data[target_feature].iloc[-1])):
        print('NaN values at end of data with length: ' + str(len(data)))
        trim_at = data[target_feature].iloc[:(len(data) - 1)].last_valid_index()
        row_drop_num = len(data) - trim_at
        print('Dropping %d rows' % row_drop_num)
        data = data.drop(data.index[trim_at: -1])
        print('New length of dataset: ' + str(len(data)))

    if (pd.isnull(data[target_feature].iloc[0])):
        print('NaN values at beginning of data with length: ' + str(len(data)))
        trim_at = data[target_feature].iloc[0:].first_valid_index()
        row_drop_num = trim_at
        print('Dropping %d rows' % row_drop_num)
        data = data.drop(data.index[0: trim_at])
        print('New length of dataset: ' + str(len(data)))

    # find indexes of NaNs in A and B columns and create arrays
    nanindex = data.index[data[target_feature].apply(np.isnan)]
    valIndex = data.index[data[target_feature].apply(np.isfinite)]
    valAIndex = data.index[data[reference_feature].apply(np.isfinite)]
    dualIndex = data.index[data[target_feature].apply(np.isfinite) & data[reference_feature].apply(np.isfinite)]

    df_index = data.index.values.tolist()
    nindex = [df_index.index(i) for i in nanindex]
    # valArray = [df_index.index(i) for i in valIndex]

    # bcRatio set as 1, unless using Coindesk values to fill in NaNs
    try:
        # sum the last 100 values (~2 hours) of ticker data to get the conversion rate
        bcRatio = (
            sum(data[target_feature].ix[dualIndex].tail(tail_n)) / sum(data[reference_feature].ix[dualIndex].tail(tail_n)))
    except:
        bcRatio = 1

    # Find nearest value function
    def find_nearest(array, value):
        idx = np.searchsorted(array, value, side="left")
        if idx > 0 and (idx == len(array) or math.fabs(value - array[idx - 1]) < math.fabs(value - array[idx])):
            return array[idx - 1]
        else:
            return array[idx]

    nanStart = 0
    nanEnd = 0
    prevNanIndex = -1
    for n in range(len(nindex)):

        # Indices of NaN array
        n_i_1t = (nindex[n] - 1)
        n_i_t = nindex[n]
        n_i_t1 = (nindex[n] + 1)

        # Values of NaN Array
        n_v_1t = data.ix[n_i_1t][reference_feature]

        # If the last value in the data array is NaN
        # and the next value is not NaN
        if (prevNanIndex == n_i_1t) & (n_i_t1 not in nindex):

            # The NaN Series ends with the next non NaN index
            nanEnd = n_i_t1
            placeholder = float(data.loc[nanStart, target_feature])

            # The number of NaN values in the series
            nanDiff = nanEnd - (nanStart + 1)

            # The averaged difference in values between start of NaN series and end of NaN Series
            diff = (data.ix[nanEnd][target_feature] - data.ix[nanStart][target_feature]) / (nanDiff + 1)

            # For each NaN in series, replace with scaled value
            for i in range(nanDiff):

                # Local index of NaN series
                r = i + 1
                # Global index of the dataframe
                row_index = nanStart + r

                # Find the nearest value to serve as reference
                nearestA = find_nearest(valAIndex, (row_index))
                nearestB = find_nearest(valIndex, (row_index))
                nnA = abs(nearestA - row_index)
                nnB = abs(nearestB - row_index)

                if (nnB <= nnA):

                    # Increment by the averaged difference
                    increment = r * diff
                    estimated = (placeholder + increment)
                    data.loc[row_index, target_feature] = estimated

                else:
                    # If A is closer use the conversion rate to port over values
                    placeholderA = data.loc[nearestA, reference_feature]
                    estimated = placeholderA * float(bcRatio)
                    data.loc[row_index, target_feature] = estimated

            # Reset Series Variables
            nanStart = 0
            nanEnd = 0
            prevNanIndex = -1

        # If the last value was NaN and so is the next
        elif (prevNanIndex == n_i_1t) & (n_i_t1 in nindex):
            pass

        # If the last value is not NaN, but the next is, mark the start index
        elif (n_i_1t not in nindex) & (n_i_t1 in nindex):
            nanStart = n_i_1t

        # If only one NaN is found isolated, use the preceding and following values to fill it in
        elif (n_i_1t not in nindex) & (n_i_t1 not in nindex):
            nanDiff = n_i_t1 - (n_i_1t + 1)
            placeholder = float(data.loc[n_i_1t, target_feature])
            diff = (data.ix[n_i_t1][target_feature] - data.ix[n_i_1t][target_feature]) / float(nanDiff + 1)
            row_index = n_i_t
            estimated = (data.ix[n_i_1t][target_feature] + diff) * bcRatio
            data.loc[row_index, target_feature] = estimated

            # Reset Series Variables
            nanStart = 0
            nanEnd = 0
            prevNanIndex = -1

        else:
            print("Error matching NaN series")
            nanStart = n_i_1t

        # Set the index of the last NaN to the current index
        prevNanIndex = nindex[n]

    if plotting == True:
        # print(data)
        plot_results(data.index, data[target_feature], data[reference_feature])

    return data
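
# Worked example for est_nan (illustrative): with a gap of three NaNs between observed
# target prices 100.0 and 108.0, nanDiff = 3 and diff = (108.0 - 100.0) / 4 = 2.0, so the
# gap is filled with 102.0, 104.0 and 106.0; if the reference series has a valid
# observation closer to a given row, that reference value scaled by bcRatio is used instead.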

def replace_nans_noise(data, feature_columns):
    for col in range(len(feature_columns)):
        standard_deviation = data[feature_columns[col]].std(axis=0, skipna=True)
        mean_data = data[feature_columns[col]].mean(axis=0, skipna=True)
        data[feature_columns[col]] = [np.random.normal(mean_data, standard_deviation, 1)[0]
                                      if pd.isnull(data[feature_columns[col]].iloc[row])
                                      else data[feature_columns[col]].iloc[row]
                                      for row in range(len(data))]
    return data

# Plot results
def plot_results(X_plot, A_plot, B_plot):
    plt.plot(X_plot, A_plot, 'blue', alpha=0.5)
    plt.plot(X_plot, B_plot, 'red', alpha=0.5)

    plt.legend(loc='lower left')
    plt.show()
--------------------------------------------------------------------------------
/utils/data_loader.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def read_data(file_01, file_02):
    data_01 = pd.read_csv(
        file_01,
        parse_dates={'timeline': ['btce-time_stamp']},
        infer_datetime_format=True)
    data_02 = pd.read_csv(
        file_02,
        parse_dates={'timeline': ['epoch_time_stamp']},
        infer_datetime_format=True)

    data_02 = data_02.drop_duplicates('epoch')
    data_01['timeline'] = data_01['timeline'].astype(float)
    data_02['timeline'] = data_02['timeline'].astype(float)

    data_ = data_02.set_index('timeline').reindex(data_01.set_index('timeline').index, method='nearest').reset_index()
    data = pd.merge(data_01, data_, on='timeline', suffixes=('_', ''))
    return data

def loader_data(source, y_column, X_columns, inputs_per_column=None, inputs_default=3, steps_forward=1):
    # Shift the target by the number of steps specified in the prediction variable
    y = source[y_column].shift(-steps_forward)

    # Normalize data to zero mean and unit variance
    scaler = StandardScaler()
    new_X = pd.DataFrame(scaler.fit_transform(source[X_columns]), columns=X_columns)

    X = pd.DataFrame()

    for column in X_columns:
        inputs = inputs_per_column.get(column, None)
        if inputs:
            inputs_list = range(inputs[0], inputs[1]+1)
        else:
            inputs_list = range(-inputs_default, 1)

        for i in inputs_list:
            col_name = "%s_%s" % (column, i)
            X[col_name] = new_X[column].shift(-i)  # Note: shift direction is inverted

    X = pd.concat([X, y], axis=1)
    X.dropna(inplace=True, axis=0)
    y = X[y_column].values.reshape(X.shape[0], 1)
    X.drop([y_column], axis=1, inplace=True)

    return X.values, y
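
# Usage note (illustrative): inputs_per_column lets individual features contribute a
# window of lagged copies, e.g.
#     X, y = loader_data(source=data, y_column='btce-price', X_columns=['btce-volume'],
#                        inputs_per_column={'btce-volume': (-4, 0)}, inputs_default=0,
#                        steps_forward=12)
# creates columns btce-volume_-4 ... btce-volume_0 (the current value plus the four
# preceding observations); features without an entry fall back to inputs_default lags.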
--------------------------------------------------------------------------------
/utils/data_splitter.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np

# Split the dataset into training and test sets. The series is divided into `sets`
# consecutive blocks; each block contributes its first (1 - test_size) share of rows
# to training and the remainder to test, so test data is spread across the whole
# time range rather than concentrated at the end.
def split_data_sets(X, y, test_size, sets=1):
    set_length = X.shape[0] // sets
    offset = 0
    X_train_lst = []
    y_train_lst = []
    X_test_lst = []
    y_test_lst = []
    for i in range(sets+1):  # the extra pass sweeps any remainder rows into training
        offset = i*set_length
        train_length = int(set_length*(1-test_size))
        train_start = offset
        train_end = offset + train_length
        test_start = train_end
        test_end = (i+1)*set_length
        X_train_lst.append(X[train_start:train_end])
        y_train_lst.append(y[train_start:train_end])
        X_test_lst.append(X[test_start:test_end])
        y_test_lst.append(y[test_start:test_end])
    X_train = np.concatenate(X_train_lst)
    y_train = np.concatenate(y_train_lst)
    X_test = np.concatenate(X_test_lst)
    y_test = np.concatenate(y_test_lst)
    return X_train, y_train, X_test, y_test

# Split into training, test and validation sets
# To do: combine with function above
def split_data_sets_with_validation(X, y, test_size, validation_size, sets=1):
    set_length = X.shape[0] // sets
    offset = 0
    X_train_lst = []
    y_train_lst = []
    X_test_lst = []
    y_test_lst = []
    X_validation_lst = []
    y_validation_lst = []
    for i in range(sets+1):
        offset = i*set_length
        train_length = int(set_length*(1-(test_size + validation_size)))
        train_start = offset
        train_end = offset + train_length
        test_length = int(set_length*test_size)
        test_start = train_end
        test_end = train_end + test_length
        validation_start = test_end
        validation_end = (i+1)*set_length
        X_train_lst.append(X[train_start:train_end])
        y_train_lst.append(y[train_start:train_end])
        X_test_lst.append(X[test_start:test_end])
        y_test_lst.append(y[test_start:test_end])
        X_validation_lst.append(X[validation_start:validation_end])
        y_validation_lst.append(y[validation_start:validation_end])
    X_train = np.concatenate(X_train_lst)
    y_train = np.concatenate(y_train_lst)
    X_test = np.concatenate(X_test_lst)
    y_test = np.concatenate(y_test_lst)
    X_validation = np.concatenate(X_validation_lst)
    y_validation = np.concatenate(y_validation_lst)
    return X_train, y_train, X_test, y_test, X_validation, y_validation
--------------------------------------------------------------------------------