├── ExtendedAbstract.pdf
├── README.md
├── Thesis.pdf
├── classes
│   ├── class_DataProcessor.py
│   ├── class_ForecastingTrader.py
│   ├── class_SeriesAnalyser.py
│   └── class_Trader.py
├── code_organization.pdf
├── data
│   └── link_to_data.txt
├── drafts
│   ├── PairsTrading_Examples.ipynb
│   ├── config
│   │   ├── config.json
│   │   ├── config_commodities_2000_2018.json
│   │   ├── config_commodities_2008_2018.json
│   │   ├── config_commodities_2009_2015.json
│   │   ├── config_commodities_2009_2017.json
│   │   ├── config_commodities_2010_2016.json
│   │   ├── config_commodities_2010_2018.json
│   │   ├── config_commodities_2010_2019.json
│   │   ├── config_commodities_2011_2015.json
│   │   ├── config_commodities_2011_2017.json
│   │   ├── config_commodities_2011_2019.json
│   │   ├── config_commodities_2012_2016.json
│   │   ├── config_commodities_2012_2018.json
│   │   ├── config_commodities_2013_2017.json
│   │   ├── config_commodities_2013_2019.json
│   │   ├── config_commodities_2014_2018.json
│   │   ├── config_commodities_2015_2019.json
│   │   └── config_commodities_pr.json
│   ├── draft.py
│   ├── main.py
│   └── mlp_trainer.py
├── notebooks
│   ├── PairsTrading-Benchmark-FixedBeta_2009_2019.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_2015_2019.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_NoClustering_2012_2016.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_NoClustering_2013_2017.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_NoClustering_2014_2018.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_OPTICS_2012_2016.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_OPTICS_2013_2017.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_OPTICS_2014_2018.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_Sector_2012_2016.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_Sector_2013_2017.ipynb
│   ├── PairsTrading-Benchmark-FixedBeta_Sector_2014_2018.ipynb
│   ├── PairsTrading-Clustering.ipynb
│   ├── PairsTrading-DataPreprocessing.ipynb
│   └── PairsTrading-Forecasting_2009_2019.ipynb
└── training
    ├── PairsTrading_DeepLearning.ipynb
    ├── encoder_decoder_trainer.py
    └── rnn_trainer.py

/ExtendedAbstract.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simaomsarmento/PairsTrading/0781877c75673ceca3c61704eee9c9dca9d37b6b/ExtendedAbstract.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # PairsTrading
2 |
3 | ## Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering
4 |
5 | **September 2019**
6 |
7 | **Simão Moraes Sarmento, simao.moraes.sarmento@tecnico.ulisboa.pt**
8 |
9 | ## Thesis Abstract:
10 | Pairs Trading is one of the most valuable market-neutral strategies used by hedge funds. It is particularly interesting as it overcomes the arduous process of valuing securities by focusing on relative pricing. By buying a relatively undervalued security and selling a relatively overvalued one, a profit can be made upon the pair's price convergence. However, with the growing availability of data, it has become increasingly hard to find rewarding pairs. In this work, we address two problems: (i) how to find profitable pairs while constraining the search space, and (ii) how to avoid long decline periods due to prolonged divergent pairs. To manage these difficulties, the application of promising Machine Learning techniques is investigated in detail. We propose the integration of an Unsupervised Learning algorithm, OPTICS, to handle problem (i). The results obtained demonstrate that the suggested technique can outperform common pair-search methods, achieving an average portfolio Sharpe ratio of 3.79, in comparison to 3.58 and 2.59 obtained by standard approaches. For problem (ii), we introduce a forecasting-based trading model capable of reducing the periods of portfolio decline by 75%. Yet, this comes at the expense of decreasing overall profitability. The proposed strategy is tested using an ARMA model, an LSTM and an LSTM Encoder-Decoder. This work's results are simulated during varying periods between January 2009 and December 2018, using 5-minute price data from a group of 208 commodity-linked ETFs, and accounting for transaction costs.
11 |
12 | ## Repository content:
13 |
14 | This repository contains all the code developed to produce the results presented in *Thesis.pdf*.
15 |
16 | A detailed explanation concerning the code organization can be found in *code_organization.pdf*.
17 |
18 |
19 |
20 |
21 |
22 | ## Notes:
23 |
24 | - The files have been organized in folders to make this repo tidier. Nevertheless, the code presented in the notebooks and
25 | in the training files presumes the class files are in the same directory.
26 | - To rerun the notebooks or the training files, the path to the classes must be adapted.
27 |
28 | Data available at: https://www.dropbox.com/sh/0w3vu1eylrfnkch/AABttIlDf64MmVf5CP1Qy-XOa?dl=0
29 |
30 |
31 |
32 | ## Training the Deep Learning models on Google Colab
33 |
34 | 1. Copy all the required files (data folder + classes + training files) to a directory in Google Drive
35 | 2. Run the notebook in the 'training' folder using Google Colab.
36 |
37 |
--------------------------------------------------------------------------------
/Thesis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simaomsarmento/PairsTrading/0781877c75673ceca3c61704eee9c9dca9d37b6b/Thesis.pdf
--------------------------------------------------------------------------------
/classes/class_DataProcessor.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from openpyxl import load_workbook
4 |
5 | # just set the seed for the random number generator
6 | np.random.seed(107)
7 |
8 | class DataProcessor:
9 |     """
10 |     This class contains a set of data processing methods used by the pairs
11 |     trading strategies, along with some auxiliary functions
12 |
13 |     """
14 |
15 |     def read_ticker_excel(self, path=None):
16 |         """
17 |         Assumes the relevant tickers are saved in an Excel file.
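        Example (a minimal sketch; 'etf_tickers.xlsx' is a hypothetical file with a 'Ticker' column):
            dp = DataProcessor()
            df, unique_df, tickers = dp.read_ticker_excel('etf_tickers.xlsx')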
18 | 19 | :param path: path to excel 20 | :return: df with tickers data, list with tickers 21 | """ 22 | 23 | df = pd.read_excel(path) 24 | 25 | # remove duplicated 26 | unique_df = df[~df.duplicated(subset=['Ticker'], keep='first')].sort_values(['Ticker']) 27 | tickers = unique_df.Ticker.unique() 28 | 29 | return df, unique_df, tickers 30 | 31 | def dict_to_df(self, dataset, threshold=None): 32 | """ 33 | Transforms a dictionary into a Dataframe 34 | 35 | :param dataset: dictionary containing tickers as keys and corresponding price series 36 | :param threshold: threshold for number of Nan Values 37 | :return: df with tickers as columns 38 | :return: df_clean with tickers as columns, and columns with null values dropped 39 | """ 40 | 41 | first_count = True 42 | for k in dataset.keys(): 43 | if dataset[k] is not None: 44 | if first_count: 45 | df = dataset[k] 46 | first_count = False 47 | else: 48 | df = pd.concat([df, dataset[k]], axis=1) 49 | 50 | if threshold is not None: 51 | df_clean = self.remove_tickers_with_nan(df, threshold) 52 | else: 53 | df_clean = df 54 | 55 | return df, df_clean 56 | 57 | def remove_tickers_with_nan(self, df, threshold): 58 | """ 59 | Removes columns with more than threshold null values 60 | """ 61 | null_values = df.isnull().sum() 62 | null_values = null_values[null_values > 0] 63 | 64 | to_remove = list(null_values[null_values > threshold].index) 65 | df = df.drop(columns=to_remove) 66 | 67 | return df 68 | 69 | def get_return_series(self, df_prices): 70 | """ 71 | This function calculates the return series of a given price series 72 | 73 | :param prices: time series with prices 74 | :return: return series 75 | """ 76 | df_returns = df_prices.pct_change() 77 | df_returns = df_returns.iloc[1:] 78 | 79 | return df_returns 80 | 81 | def split_data(self, df_prices, training_dates, testing_dates, remove_nan=True): 82 | """ 83 | This function splits a dataframe into training and validation sets 84 | :param df_prices: dataframe containing prices for all dates 85 | :param training_dates: tuple (training initial date, training final date) 86 | :param testing_dates: tuple (testing initial date, testing final date) 87 | :param remove_nan: flag to detect if nan values are to be removed 88 | 89 | :return: df with training prices 90 | :return: df with testing prices 91 | """ 92 | if remove_nan: 93 | dataset_mask = ((df_prices.index >= training_dates[0]) &\ 94 | (df_prices.index <= testing_dates[1])) 95 | df_prices_dataset = df_prices[dataset_mask] 96 | print('Total of {} tickers'.format(df_prices_dataset.shape[1])) 97 | df_prices_dataset_without_nan = self.remove_tickers_with_nan(df_prices_dataset, 0) 98 | print('Total of {} tickers after removing tickers with Nan values'.format( 99 | df_prices_dataset_without_nan.shape[1])) 100 | df_prices = df_prices_dataset_without_nan.copy() 101 | 102 | train_mask = (df_prices.index <= training_dates[1]) 103 | test_mask = (df_prices.index >= testing_dates[0]) 104 | df_prices_train = df_prices[train_mask] 105 | df_prices_test = df_prices[test_mask] 106 | 107 | return df_prices_train, df_prices_test 108 | 109 | def append_df_to_excel(self, filename, df, sheet_name='Sheet1', startrow=None, 110 | truncate_sheet=False, 111 | **to_excel_kwargs): 112 | """ 113 | Source: https://stackoverflow.com/questions/20219254/how-to-write-to-an-existing-excel-file-without-overwriting 114 | -data-using-pandas/47740262#47740262 115 | 116 | Append a DataFrame [df] to existing Excel file [filename] 117 | into [sheet_name] Sheet. 
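        Example (a minimal usage sketch; the path and DataFrame name are hypothetical):
            DataProcessor().append_df_to_excel('results/summary.xlsx', results_df, sheet_name='sharpe_ratios')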
118 | If [filename] doesn't exist, then this function will create it. 119 | 120 | Parameters: 121 | filename : File path or existing ExcelWriter 122 | (Example: '/path/to/file.xlsx') 123 | df : dataframe to save to workbook 124 | sheet_name : Name of sheet which will contain DataFrame. 125 | (default: 'Sheet1') 126 | startrow : upper left cell row to dump data frame. 127 | Per default (startrow=None) calculate the last row 128 | in the existing DF and write to the next row... 129 | truncate_sheet : truncate (remove and recreate) [sheet_name] 130 | before writing DataFrame to Excel file 131 | to_excel_kwargs : arguments which will be passed to `DataFrame.to_excel()` 132 | [can be dictionary] 133 | 134 | Returns: None 135 | """ 136 | 137 | # ignore [engine] parameter if it was passed 138 | if 'engine' in to_excel_kwargs: 139 | to_excel_kwargs.pop('engine') 140 | 141 | writer = pd.ExcelWriter(filename, engine='openpyxl') 142 | 143 | # Python 2.x: define [FileNotFoundError] exception if it doesn't exist 144 | #try: 145 | # FileNotFoundError 146 | #except NameError: 147 | # FileNotFoundError = IOError 148 | 149 | try: 150 | # try to open an existing workbook 151 | writer.book = load_workbook(filename) 152 | 153 | # get the last row in the existing Excel sheet 154 | # if it was not specified explicitly 155 | if startrow is None and sheet_name in writer.book.sheetnames: 156 | startrow = writer.book[sheet_name].max_row 157 | 158 | # truncate sheet 159 | if truncate_sheet and sheet_name in writer.book.sheetnames: 160 | # index of [sheet_name] sheet 161 | idx = writer.book.sheetnames.index(sheet_name) 162 | # remove [sheet_name] 163 | writer.book.remove(writer.book.worksheets[idx]) 164 | # create an empty sheet [sheet_name] using old index 165 | writer.book.create_sheet(sheet_name, idx) 166 | 167 | # copy existing sheets 168 | writer.sheets = {ws.title: ws for ws in writer.book.worksheets} 169 | except FileNotFoundError: 170 | # file does not exist yet, we will create it 171 | pass 172 | 173 | if startrow is None: 174 | startrow = 0 175 | 176 | # write out the new sheet 177 | df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs) 178 | 179 | # save the workbook 180 | writer.save() 181 | -------------------------------------------------------------------------------- /classes/class_ForecastingTrader.py: -------------------------------------------------------------------------------- 1 | # set seeds 2 | import numpy as np 3 | np.random.seed(1) # NumPy 4 | import random 5 | random.seed(3) # Python 6 | import tensorflow as tf 7 | tf.set_random_seed(2) # Tensorflow 8 | session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, 9 | inter_op_parallelism_threads=1) 10 | from keras import backend as K 11 | sess = tf.Session(graph=tf.get_default_graph(), config=session_conf) 12 | K.set_session(sess) 13 | 14 | import pandas as pd 15 | 16 | import matplotlib.pyplot as plt 17 | 18 | from classes import class_Trader 19 | 20 | from sklearn.preprocessing import StandardScaler 21 | 22 | # Import keras 23 | from keras.models import Sequential 24 | from keras.layers import Dense, Dropout, TimeDistributed, CuDNNLSTM 25 | from keras.callbacks import EarlyStopping 26 | from keras.initializers import glorot_normal 27 | from keras.layers import RepeatVector 28 | from keras.utils import plot_model 29 | #from keras_sequential_ascii import keras2ascii 30 | # just set the seed for the random number generator 31 | 32 | import pickle 33 | 34 | 35 | 36 | class ForecastingTrader: 37 | """ 38 | 39 | 40 | """ 41 | 
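    # Typical usage (a sketch only; the parameter values below are hypothetical, the keys are the
    # ones read in prepare_train_data/train_models):
    #   model_config = {'n_in': 24, 'n_out': 1, 'hidden_nodes': [50], 'epochs': 500,
    #                   'optimizer': 'adam', 'loss_fct': 'mse', 'batch_size': 128,
    #                   'train_val_split': '2017-01-01'}
    #   models = ForecastingTrader().train_models(pairs, model_config, model_type='mlp')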
def __init__(self): 42 | """ 43 | :initial elements 44 | """ 45 | 46 | def destandardize(self, predictions, spread_mean, spread_std): 47 | """ 48 | This function transforms the normalized predictions into the original space. 49 | """ 50 | return predictions * spread_std + spread_mean 51 | 52 | def forecast_spread_trading(self, X, Y, spread_test, spread_train, beta, predictions, lag, 53 | low_quantile=0.15, high_quantile=0.85, multistep=0): 54 | """ 55 | This function will set the trading positions based on the forecasted spread. 56 | For each day, the function compares the predicted spread for that day with the 57 | true value of the spread in tha day before, giving the predicted spread pct change. 58 | In case it is larger than the threshold, a position is entered. 59 | Note: because a position entered in day n it is only accounted for on the day after, 60 | we shift the entered positions. 61 | : predictions: predictions should not be standardized, but with regular mean and variance. 62 | """ 63 | # 1. Get predictions pct_change 64 | # we want to see the pct change of the prediction compared 65 | # to the true value but the previous instant time, because we 66 | # are interested in seeing the temporal % change 67 | if multistep == 0: 68 | predictions_1 = predictions 69 | predictions_2 = pd.Series(data=[0]*len(predictions), index=predictions.index) 70 | predictions_pct_change = (((predictions_1 - spread_test.shift(lag)) / 71 | abs(spread_test.shift(lag))) * 100).fillna(0) 72 | # true_change = spread_test.diff().fillna(0) 73 | else: 74 | predictions_1, predictions_2 = predictions['t'], predictions['t+1'] 75 | predictions_pct_change = (((predictions_2 - spread_test.shift(lag)) / 76 | abs(spread_test.shift(lag))) * 100).fillna(0) 77 | # need to add last row and first row correspondingly 78 | predictions_1 = predictions_1.append(pd.Series(data=predictions_2[-1], index=spread_test[-1:].index)) 79 | predictions_2 = pd.concat([pd.Series(data=predictions_1[0], index=predictions_1[:1].index), predictions_2]) 80 | 81 | # 2. Calculate trading thresholds 82 | spread_train_pct_change = ((spread_train - spread_train.shift(lag+multistep)) / 83 | abs(spread_train.shift(lag+multistep))) * 100 84 | positive_changes = spread_train_pct_change[spread_train_pct_change > 0] 85 | negative_changes = spread_train_pct_change[spread_train_pct_change < 0] 86 | long_threshold = positive_changes.quantile(q=high_quantile, interpolation='linear') 87 | #print('Long threshold: {:.2f}'.format(long_threshold)) 88 | short_threshold = negative_changes.quantile(q=low_quantile, interpolation='linear') 89 | #print('Short threshold: {:.2f}'.format(short_threshold)) 90 | 91 | # 3. Define trading timings 92 | # Note: If we want to enter a position at the beginning of day N, 93 | # because of the way pnl is calculated the position is entered 94 | # in the previous day. 95 | # Example: In day 23 the percentage change is 55% (wrt day 22). If we were to enter the 96 | # position in day 23, the following code would not consider the gains during day 23, even if we had 97 | # set the position in the morning. (it conly considers the gains for the next days) 98 | # Thus, as a workaoround, we enter the position at day 22 (at night), and it considers the gains for day 23 99 | numUnits = pd.Series(data=[0.] * len(spread_test), index=spread_test.index, name='numUnits') 100 | longsEntry = predictions_pct_change > long_threshold 101 | longsEntry = longsEntry.shift(-1).fillna(False) 102 | numUnits[longsEntry] = 1. 
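        # Illustrative example (hypothetical numbers): if the 85th percentile of the positive
        # training-set changes is +40% and the spread is forecast to rise +55% relative to the
        # previously observed spread value, a long spread position is flagged; the shift(-1) above
        # stamps it on the preceding bar so that the following day's P&L is attributed to the trade.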
103 | shortsEntry = predictions_pct_change < short_threshold 104 | shortsEntry = shortsEntry.shift(-1).fillna(False) 105 | numUnits[shortsEntry] = -1 106 | 107 | # ffill if applicable 108 | if lag == 1: 109 | pct_change_from_previous = predictions_pct_change 110 | else: 111 | pct_change_from_previous = predictions_pct_change = (((predictions_1 - spread_test.shift(1)) / 112 | abs(spread_test.shift(1))) * 100).fillna(0) 113 | for i in range(1, len(numUnits) - 1): 114 | if numUnits[i] != 0: 115 | continue 116 | else: 117 | if numUnits[i - 1] == 0: 118 | continue 119 | elif numUnits[i - 1] == 1.: 120 | if pct_change_from_previous[i + 1] > 0: 121 | numUnits[i] = 1 122 | continue 123 | elif numUnits[i - 1] == -1.: 124 | if pct_change_from_previous[i + 1] < 0: 125 | numUnits[i] = -1. 126 | continue 127 | 128 | # 4. Calculate P&L and Returns 129 | trader = class_Trader.Trader() 130 | 131 | # concatenate for positions with not enough data to be predicted 132 | lookback = len(Y)-len(spread_test) 133 | numUnits_not_predicted = pd.Series(data=[0.] * lookback, index=Y.index[:lookback]) 134 | numUnits = pd.concat([numUnits_not_predicted, numUnits], axis=0) 135 | numUnits.name = 'numUnits' 136 | # add trade durations 137 | numUnits_df = pd.DataFrame(numUnits, index=Y.index) 138 | numUnits_df = numUnits_df.rename(columns={"positions": "numUnits"}) 139 | trading_durations = trader.add_trading_duration(numUnits_df) 140 | # calculate balance 141 | balance_summary = trader.calculate_balance(Y, X, beta, numUnits.shift(1).fillna(0), trading_durations) 142 | # calculate return per position 143 | position_ret, _, _ = trader.calculate_position_returns(Y, X, beta, numUnits) 144 | df = pd.DataFrame({'position_return':position_ret.values, 145 | 'trading_duration':trading_durations, 146 | 'position_during_day': numUnits.shift(1).fillna(0).values}, 147 | index = position_ret.index) 148 | position_ret_with_costs = trader.add_transaction_costs(df, beta) 149 | balance_summary['position_ret_with_costs']=position_ret_with_costs 150 | 151 | # summarize 152 | ret_with_costs, cum_ret_with_costs = balance_summary.returns, (balance_summary.account_balance-1) 153 | bins = [-np.inf, -0.00000001, 0.00000001, np.inf] 154 | names = ['-1', '0', '1'] 155 | summary = pd.DataFrame(data={'prediction(t)': predictions_1.values, 156 | 'prediction(t+1)': predictions_2.values, 157 | 'spread(t)': spread_test.values, 158 | 'predicted_change(%)': predictions_pct_change, 159 | 'position_during_day': numUnits.shift(1).fillna(0).values[lookback:], 160 | 'position_return': position_ret, 161 | 'position_ret_with_costs': position_ret_with_costs, 162 | 'trading_days': trading_durations[lookback:], 163 | 'ret_with_costs': ret_with_costs[lookback:], 164 | 'predicted_direction': pd.cut(predictions_pct_change, bins, labels=names), 165 | 'true_direction': pd.cut(spread_test.diff(), bins, labels=names) 166 | }, 167 | index=spread_test.index) 168 | 169 | return ret_with_costs, cum_ret_with_costs, summary, balance_summary 170 | 171 | def returns_forecasting_trading(self, Y, X, beta, predictions, test): 172 | """ 173 | This function implements Dunis methodology. 174 | It tracks big changes in returns, and opens a position when the change in the returns is significant. 175 | """ 176 | # track positions for which the expected p&l overweights the transaction costs 177 | numUnits = pd.Series(data=[0.] 
* len(predictions), index=predictions.index, name='numUnits') 178 | long_opportunities = predictions > 0.0056 179 | short_opportunities = predictions < -0.0056 180 | longsEntry = long_opportunities.shift(-1).fillna(False) 181 | numUnits[longsEntry] = 1. 182 | shortsEntry = short_opportunities.shift(-1).fillna(False) 183 | numUnits[shortsEntry] = -1. 184 | 185 | # Calculate P&L and Returns 186 | trader = class_Trader.Trader() 187 | # concatenate positions with not enough data to be predicted 188 | lookback = len(Y) - len(predictions) 189 | numUnits_not_predicted = pd.Series(data=[0.] * lookback, index=Y.index[:lookback]) 190 | numUnits = pd.concat([numUnits_not_predicted, numUnits], axis=0) 191 | numUnits.name = 'numUnits' 192 | # add trade durations 193 | numUnits_df = pd.DataFrame(numUnits, index=Y.index) 194 | numUnits_df = numUnits_df.rename(columns={"positions": "numUnits"}) 195 | trading_durations = trader.add_trading_duration(numUnits_df) 196 | # calculate balance 197 | balance_summary = trader.calculate_balance(Y, X, beta, numUnits.shift(1).fillna(0), trading_durations) 198 | 199 | # summarize 200 | ret_with_costs, cum_ret_with_costs = balance_summary.returns, (balance_summary.account_balance - 1) 201 | summary = pd.DataFrame(data={'predicted_pnl(t)': predictions.values, 202 | 'pnl(t)': test.values, 203 | 'position_during_day': numUnits.shift(1).fillna(0).values[lookback:], 204 | 'Y': Y[lookback:], 205 | 'X': X[lookback:], 206 | 'Y_pct_change': Y.pct_change()[lookback:], 207 | 'X_pct_change': X.pct_change()[lookback:], 208 | 'trading_days': trading_durations[lookback:], 209 | 'ret_with_costs': ret_with_costs[lookback:] 210 | }, 211 | index=test.index) 212 | 213 | return ret_with_costs, cum_ret_with_costs, summary, balance_summary 214 | 215 | def spread_trading(self, X, Y, spread_test, spread_train, beta, predictions, lag, low_quantile=0.10, 216 | high_quantile=0.90, multistep=0): 217 | """ 218 | This function will set the trading positions based on the forecasted spread. 219 | For each day, the function compares the predicted spread for that day with the 220 | true value of the spread in tha day before, giving the predicted spread pct change. 221 | In case it is larger than the threshold, a position is entered. 222 | Note: because a position entered in day n it is only accounted for on the day after, 223 | we shift the entered positions. 224 | : predictions: predictions should not be standardized, but with regular mean and variance. 225 | """ 226 | # 1. Get predictions change 227 | if multistep == 0: 228 | predictions_1 = predictions 229 | predictions_change = predictions.diff().fillna(0) 230 | true_change = spread_test.diff().fillna(0) 231 | else: 232 | predictions_1, predictions_2 = predictions 233 | predictions_change = (predictions_2 - predictions_1.shift(lag)).fillna(0) 234 | # need to add last row 235 | predictions_change = predictions_change.append(pd.Series(data=[0], index=spread_test[-1:].index)) 236 | predictions_1 = predictions_1.append(pd.Series(data=predictions_2[-1], index=spread_test[-1:].index)) 237 | 238 | # 2. 
Calculate trading threshold 239 | spread_train_change = (spread_train - spread_train.shift(lag+multistep)).fillna(0) 240 | positive_changes = spread_train_change[spread_train_change > 0] 241 | negative_changes = spread_train_change[spread_train_change < 0] 242 | long_threshold = positive_changes.quantile(q=high_quantile, interpolation='linear') 243 | print('Long threshold: {:.2f}'.format(long_threshold)) 244 | short_threshold = negative_changes.quantile(q=low_quantile, interpolation='linear') 245 | print('Short threshold: {:.2f}'.format(short_threshold)) 246 | 247 | # 3. Define trading timings 248 | numUnits = pd.Series(data=[0.] * len(spread_test), index=spread_test.index, name='numUnits') 249 | longsEntry = (predictions_change > long_threshold) & (true_change.shift() > 0) 250 | longsEntry = longsEntry.shift(-1).fillna(False) 251 | numUnits[longsEntry] = 1. 252 | shortsEntry = (predictions_change < short_threshold) & (true_change.shift() < 0) 253 | shortsEntry = shortsEntry.shift(-1).fillna(False) 254 | numUnits[shortsEntry] = -1. 255 | 256 | # ffill if applicable 257 | if lag == 1: 258 | change_from_previous = predictions_change 259 | else: 260 | change_from_previous = (predictions_1 - spread_test.shift(1)).fillna(0) 261 | for i in range(1, len(numUnits) - 1): 262 | if numUnits[i] != 0: 263 | continue 264 | else: 265 | if numUnits[i - 1] == 0: 266 | continue 267 | elif numUnits[i - 1] == 1.: 268 | if change_from_previous[i + 1] > 0: 269 | numUnits[i] = 1 270 | continue 271 | elif numUnits[i - 1] == -1.: 272 | if change_from_previous[i + 1] < 0: 273 | numUnits[i] = -1. 274 | continue 275 | 276 | # 4. Calculate P&L and Returns 277 | trader = class_Trader.Trader() 278 | 279 | # concatenate for positions with not enough data to be predicted 280 | lookback = len(Y)-len(spread_test) 281 | numUnits_not_predicted = pd.Series(data=[0.] * lookback, index=Y.index[:lookback]) 282 | numUnits = pd.concat([numUnits_not_predicted, numUnits], axis=0) 283 | numUnits.name = 'numUnits' 284 | # add trade durations 285 | numUnits_df = pd.DataFrame(numUnits, index=Y.index) 286 | numUnits_df = numUnits_df.rename(columns={"positions": "numUnits"}) 287 | trading_durations = trader.add_trading_duration(numUnits_df) 288 | # calculate balance 289 | balance_summary = trader.calculate_balance(Y, X, beta, numUnits.shift(1).fillna(0), trading_durations) 290 | 291 | # summarize 292 | ret_with_costs, cum_ret_with_costs = balance_summary.returns, (balance_summary.account_balance-1) 293 | summary = pd.DataFrame(data={'prediction(t)': predictions_1.values, 294 | 'spread(t)': spread_test.values, 295 | 'predicted_change': predictions_change, 296 | 'true_change': spread_test.diff().fillna(0).values, 297 | 'position_during_day': numUnits.shift(1).fillna(0).values[lookback:], 298 | 'Y': Y[lookback:], 299 | 'X': X[lookback:], 300 | 'trading_days': trading_durations[lookback:], 301 | 'ret_with_costs': ret_with_costs[lookback:] 302 | }, 303 | index=spread_test.index) 304 | print('Accuracy of time series forecasting: {:.2f}%'.format(self.calculate_direction_accuracy(spread_test, 305 | predictions_1))) 306 | 307 | return ret_with_costs, cum_ret_with_costs, summary, balance_summary 308 | 309 | def momentum_trading(self, X, Y, spread_test, spread_train, beta, predictions, lag, low_quantile=0.10, 310 | high_quantile=0.90, multistep=0): 311 | """ 312 | This function will set the trading positions based on the forecasted spread. 
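        (Note: in this momentum variant the entry signal is the gap between the observed spread and
        its forecast, spread_test - prediction, compared against the same quantile-based thresholds.)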
313 | For each day, the function compares the predicted spread for that day with the 314 | true value of the spread in tha day before, giving the predicted spread pct change. 315 | In case it is larger than the threshold, a position is entered. 316 | Note: because a position entered in day n it is only accounted for on the day after, 317 | we shift the entered positions. 318 | : predictions: predictions should not be standardized, but with regular mean and variance. 319 | """ 320 | # 1. Get predictions change 321 | if multistep == 0: 322 | predictions_1 = predictions 323 | predictions_change = spread_test - predictions_1 324 | else: 325 | predictions_1, predictions_2 = predictions 326 | predictions_change = (predictions_2 - predictions_1.shift(lag)).fillna(0) 327 | # need to add last row 328 | predictions_change = predictions_change.append(pd.Series(data=[0], index=spread_test[-1:].index)) 329 | predictions_1 = predictions_1.append(pd.Series(data=predictions_2[-1], index=spread_test[-1:].index)) 330 | 331 | # 2. Calculate trading threshold 332 | spread_train_change = (spread_train - spread_train.shift(lag+multistep)).fillna(0) 333 | positive_changes = spread_train_change[spread_train_change > 0] 334 | negative_changes = spread_train_change[spread_train_change < 0] 335 | long_threshold = positive_changes.quantile(q=high_quantile, interpolation='linear') 336 | print('Long threshold: {:.2f}'.format(long_threshold)) 337 | short_threshold = negative_changes.quantile(q=low_quantile, interpolation='linear') 338 | print('Short threshold: {:.2f}'.format(short_threshold)) 339 | 340 | # 3. Define trading timings 341 | numUnits = pd.Series(data=[0.] * len(spread_test), index=spread_test.index, name='numUnits') 342 | longsEntry = predictions_change > long_threshold 343 | numUnits[longsEntry] = 1. 344 | shortsEntry = predictions_change < short_threshold 345 | numUnits[shortsEntry] = -1. 346 | 347 | # ffill if applicable 348 | if lag == 1: 349 | change_from_previous = predictions_change 350 | else: 351 | change_from_previous = (predictions_1 - spread_test.shift(1)).fillna(0) 352 | for i in range(1, len(numUnits) - 1): 353 | if numUnits[i] != 0: 354 | continue 355 | else: 356 | if numUnits[i - 1] == 0: 357 | continue 358 | elif numUnits[i - 1] == 1.: 359 | if change_from_previous[i] > 0: 360 | numUnits[i] = 1 361 | continue 362 | elif numUnits[i - 1] == -1.: 363 | if change_from_previous[i] < 0: 364 | numUnits[i] = -1. 365 | continue 366 | 367 | # 4. Calculate P&L and Returns 368 | trader = class_Trader.Trader() 369 | 370 | # concatenate for positions with not enough data to be predicted 371 | lookback = len(Y)-len(spread_test) 372 | numUnits_not_predicted = pd.Series(data=[0.] 
* lookback, index=Y.index[:lookback]) 373 | numUnits = pd.concat([numUnits_not_predicted, numUnits], axis=0) 374 | numUnits.name = 'numUnits' 375 | # add trade durations 376 | numUnits_df = pd.DataFrame(numUnits, index=Y.index) 377 | numUnits_df = numUnits_df.rename(columns={"positions": "numUnits"}) 378 | trading_durations = trader.add_trading_duration(numUnits_df) 379 | # calculate balance 380 | balance_summary = trader.calculate_balance(Y, X, beta, numUnits.shift(1).fillna(0), trading_durations) 381 | 382 | # summarize 383 | ret_with_costs, cum_ret_with_costs = balance_summary.returns, (balance_summary.account_balance-1) 384 | summary = pd.DataFrame(data={'prediction(t)': predictions_1.values, 385 | 'spread(t)': spread_test.values, 386 | 'spread_predicted_change': predictions_change.values, 387 | 'position_during_day': numUnits.shift(1).fillna(0).values[lookback:], 388 | '{}'.format(Y.name): Y[lookback:], 389 | '{}'.format(X.name): X[lookback:], 390 | 'trading_days': trading_durations[lookback:], 391 | 'ret_with_costs': ret_with_costs[lookback:] 392 | }, 393 | index=spread_test.index) 394 | print('Accuracy of time series forecasting: {:.2f}%'.format(self.calculate_direction_accuracy(spread_test, 395 | predictions_1))) 396 | 397 | return ret_with_costs, cum_ret_with_costs, summary, balance_summary 398 | 399 | def calculate_direction_accuracy(self, true, predictions): 400 | 401 | bins = [-np.inf, -0.00000001, 0.00000001, np.inf] 402 | names = ['-1', '0', '1'] 403 | predictions_change = predictions.diff().fillna(0) 404 | 405 | predicted_direction = pd.cut(predictions_change, bins, labels=names) 406 | true_direction = pd.cut(true.diff().fillna(0), bins, labels=names) 407 | #accuracy = len(predicted_direction[predicted_direction == true_direction])/len(predicted_direction) * 100 408 | 409 | predicted_direction_subset = predicted_direction[true_direction != '0'] 410 | true_direction_subset = true_direction[true_direction != '0'] 411 | accuracy = len(predicted_direction_subset[predicted_direction_subset == true_direction_subset]) / \ 412 | len(predicted_direction_subset) * 100 413 | 414 | return accuracy 415 | 416 | def series_to_supervised(self, data, index=None, n_in=1, n_out=1, dropnan=True): 417 | """ 418 | Frame a time series as a supervised learning dataset. 419 | Arguments: 420 | data: Sequence of observations as a list or NumPy array. 421 | n_in: Number of lag observations as input (X). 422 | n_out: Number of observations as output (y). 423 | dropnan: Boolean whether or not to drop rows with NaN values. 424 | Returns: 425 | Pandas DataFrame of series framed for supervised learning. 426 | """ 427 | n_vars = 1 if type(data) is list else data.shape[1] 428 | if index is None: 429 | df = pd.DataFrame(data) 430 | else: 431 | df = pd.DataFrame(data, index=index) 432 | cols, names = list(), list() 433 | # input sequence (t-n, ... t-1) 434 | for i in range(n_in, 0, -1): 435 | cols.append(df.shift(i)) 436 | names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)] 437 | # forecast sequence (t, t+1, ... 
t+n) 438 | for i in range(0, n_out): 439 | cols.append(df.shift(-i)) 440 | if i == 0: 441 | names += [('var%d(t)' % (j+1)) for j in range(n_vars)] 442 | else: 443 | names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)] 444 | # put it all together 445 | agg = pd.concat(cols, axis=1) 446 | agg.columns = names 447 | # drop rows with NaN values 448 | if dropnan: 449 | agg.dropna(inplace=True) 450 | return agg 451 | 452 | def plot_loss(self, history): 453 | """ 454 | Function to plot loss function. 455 | Arguments: 456 | history: History object with data from training. 457 | title: Plot title. 458 | """ 459 | 460 | plt.plot(history.history['loss'], label = "training") 461 | plt.plot(history.history['val_loss'], label = "validation") 462 | 463 | def prepare_train_data(self, spread, model_config): 464 | """ 465 | 466 | :param spread: spread of the pair being considered 467 | :param model_config: dictionary with model parameters 468 | :return: 469 | tuple with training data 470 | tuple with validation data 471 | y_series in validation period (to compare with predictions later on) 472 | """ 473 | train_val_split = model_config['train_val_split'] 474 | 475 | scaler = StandardScaler() 476 | spread_norm = scaler.fit_transform(spread.values.reshape(spread.shape[0], 1)) 477 | spread_norm = pd.Series(data=spread_norm.flatten(), index=spread.index) 478 | forecasting_data = self.series_to_supervised(list(spread_norm), spread.index, model_config['n_in'], 479 | model_config['n_out'], dropnan=True) 480 | # define dataset 481 | if model_config['n_out'] == 1: 482 | X_series = forecasting_data.drop(columns='var1(t)') 483 | y_series = forecasting_data['var1(t)'] 484 | elif model_config['n_out'] == 2: 485 | X_series = forecasting_data.drop(columns=['var1(t)', 'var1(t+1)']) 486 | y_series = forecasting_data[['var1(t)', 'var1(t+1)']] 487 | 488 | # split 489 | X_series_train = X_series[:train_val_split] 490 | X_series_val = X_series[train_val_split:] 491 | y_series_train = y_series[:train_val_split] 492 | y_series_val = y_series[train_val_split:] 493 | 494 | X_train = X_series_train.values 495 | X_val = X_series_val.values 496 | y_train = y_series_train.values 497 | y_val = y_series_val.values 498 | 499 | return (X_train, y_train), (X_val, y_val), y_series_val, scaler 500 | 501 | def prepare_test_data(self, spread, model_config, scaler): 502 | """ 503 | """ 504 | # normalize spread 505 | spread_norm = scaler.transform(spread.values.reshape(spread.shape[0], 1)) 506 | spread_norm = pd.Series(data=spread_norm.flatten(), index=spread.index) 507 | forecasting_data = self.series_to_supervised(list(spread_norm), spread.index, model_config['n_in'], 508 | model_config['n_out'], dropnan=True) 509 | # define dataset 510 | if model_config['n_out'] == 1: 511 | X_series_test = forecasting_data.drop(columns='var1(t)') 512 | y_series_test = forecasting_data['var1(t)'] 513 | elif model_config['n_out'] == 2: 514 | X_series_test = forecasting_data.drop(columns=['var1(t)', 'var1(t+1)']) 515 | y_series_test = forecasting_data[['var1(t)', 'var1(t+1)']] 516 | 517 | X_test = X_series_test.values 518 | y_test = y_series_test.values 519 | 520 | return (X_test, y_test), y_series_test 521 | 522 | def destandardize(self, predictions, spread_mean, spread_std): 523 | """ 524 | This function transforms the normalized predictions into the original space. 
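        Example (sketch): a normalized prediction of 0.5 with spread_mean=2.0 and spread_std=0.4
        maps back to 0.5 * 0.4 + 2.0 = 2.2 in spread units.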
525 | """ 526 | return predictions * spread_std + spread_mean 527 | 528 | def train_models(self, pairs, model_config, model_type='mlp'): 529 | """ 530 | This function trains the models for every pair identified. 531 | 532 | :param pairs: list with pairs and corresponding statistics 533 | :param model_config: dictionary with info for the model 534 | :return: all models 535 | """ 536 | 537 | models = [] 538 | for pair in pairs: 539 | 540 | # prepare train data 541 | spread = pair[2]['spread'] 542 | train_data, validation_data, y_series_val, scaler = self.prepare_train_data(spread, model_config) 543 | # prepare test data 544 | spread_test = pair[2]['Y_test']-pair[2]['coint_coef']*pair[2]['X_test'] 545 | test_data, y_series_test = self.prepare_test_data(spread_test, model_config, scaler) 546 | 547 | # train model and get predictions 548 | if model_type == 'mlp': 549 | model, history, score, predictions_train, predictions_val, predictions_test = \ 550 | self.apply_MLP(X=train_data[0], 551 | y=train_data[1], 552 | validation_data=validation_data, 553 | test_data=test_data, 554 | n_in=model_config['n_in'], 555 | hidden_nodes=model_config['hidden_nodes'], 556 | epochs=model_config['epochs'], 557 | optimizer=model_config['optimizer'], 558 | loss_fct=model_config['loss_fct'], 559 | batch_size=model_config['batch_size']) 560 | 561 | elif model_type == 'rnn': 562 | model, history, score, predictions_train, predictions_val, predictions_test = \ 563 | self.apply_RNN(X=train_data[0], 564 | y=train_data[1], 565 | validation_data=validation_data, 566 | test_data=test_data, 567 | hidden_nodes=model_config['hidden_nodes'], 568 | epochs=model_config['epochs'], 569 | optimizer=model_config['optimizer'], 570 | loss_fct=model_config['loss_fct'], 571 | batch_size=model_config['batch_size']) 572 | elif model_type == 'encoder_decoder': 573 | model, history, score, predictions_train, predictions_val, predictions_test = \ 574 | self.apply_encoder_decoder(X=train_data[0], 575 | y=train_data[1], 576 | validation_data=validation_data, 577 | test_data=test_data, 578 | n_in=model_config['n_in'], 579 | n_out=model_config['n_out'], 580 | hidden_nodes=model_config['hidden_nodes'], 581 | epochs=model_config['epochs'], 582 | optimizer=model_config['optimizer'], 583 | loss_fct=model_config['loss_fct'], 584 | batch_size=model_config['batch_size']) 585 | 586 | # validation 587 | predictions_val = pd.DataFrame({'t': predictions_val.reshape(predictions_val.shape[0], 588 | predictions_val.shape[1])[:, 0], 589 | 't+1': predictions_val.reshape(predictions_val.shape[0], 590 | predictions_val.shape[1])[:, 1]}, 591 | index=y_series_val.index) 592 | predictions_val['t'] = scaler.inverse_transform(np.array(predictions_val['t'])) 593 | predictions_val['t+1'] = scaler.inverse_transform(np.array(predictions_val['t+1'])) 594 | 595 | # test 596 | predictions_test = pd.DataFrame({'t': predictions_test.reshape(predictions_test.shape[0], 597 | predictions_test.shape[1])[:, 0], 598 | 't+1': predictions_test.reshape(predictions_test.shape[0], 599 | predictions_test.shape[1])[:, 1]}, 600 | index=y_series_test.index) 601 | predictions_test['t'] = scaler.inverse_transform(np.array(predictions_test['t'])) 602 | predictions_test['t+1'] = scaler.inverse_transform(np.array(predictions_test['t+1'])) 603 | 604 | # train 605 | predictions_train = predictions_val.copy() # not relevant, just to fill up 606 | 607 | # transform predictions to series 608 | if model_type != 'encoder_decoder': 609 | predictions_train = scaler.inverse_transform(predictions_train) 610 
| predictions_val = scaler.inverse_transform(predictions_val) 611 | predictions_test = scaler.inverse_transform(predictions_test) 612 | predictions_train = pd.Series(data=predictions_train.flatten(), 613 | index=spread[model_config['n_in']:-len(y_series_val)].index) 614 | predictions_val = pd.Series(data=predictions_val.flatten(), index=y_series_val.index) 615 | predictions_test = pd.Series(data=predictions_test.flatten(), 616 | index=spread_test[-len(test_data[1]):].index) 617 | 618 | # save all info 619 | # check epochs 620 | if len(history.history['val_loss']) == 500: 621 | epoch_stop = 500 622 | else: 623 | epoch_stop = len(history.history['val_loss']) - 50 # patience=50 624 | 625 | model_info = {'leg1': pair[0], 626 | 'leg2': pair[1], 627 | 'standardization_dict': 'scaler', 628 | 'history': history.history, 629 | 'score': score, 630 | 'epoch_stop': epoch_stop, 631 | 'predictions_train': predictions_train.copy(), 632 | 'predictions_val': predictions_val.copy(), 633 | 'predictions_test': predictions_test.copy() 634 | } 635 | models.append(model_info) 636 | 637 | # append model configuration on last position 638 | models.append(model_config) 639 | 640 | return models 641 | 642 | # ################################### MLP ############################################ 643 | def apply_MLP(self, X, y, validation_data, test_data, n_in, hidden_nodes, epochs, optimizer, loss_fct, 644 | batch_size=128): 645 | 646 | # define validation set 647 | X_val = validation_data[0] 648 | y_val = validation_data[1] 649 | 650 | # define test set 651 | X_test = test_data[0] 652 | y_test = test_data[1] 653 | 654 | model = Sequential() 655 | glorot_init = glorot_normal(seed=None) 656 | for i in range(len(hidden_nodes)): 657 | model.add(Dense(hidden_nodes[i], activation='relu', input_dim=n_in, kernel_initializer=glorot_init)) 658 | #model.add(Dropout(0.1)) 659 | model.add(Dense(1)) 660 | model.compile(optimizer=optimizer, loss=loss_fct, metrics=['mae']) 661 | model.summary() 662 | if len(hidden_nodes)>1: 663 | plot_model(model, to_file='/content/drive/PairsTrading/mlp_models/model_{}-{}_{}.png'.format(str(n_in), 664 | str(hidden_nodes[0]), str(hidden_nodes[1]), show_shapes=True, show_layer_names=False)) 665 | else: 666 | plot_model(model, to_file='/content/drive/PairsTrading/mlp_models/model_{}-{}.png'.format(str(n_in), 667 | str(hidden_nodes[0])), show_shapes=True, show_layer_names=False) 668 | #print(keras2ascii(model)) 669 | 670 | # simple early stopping 671 | es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50) 672 | 673 | history = model.fit(X, y, epochs=epochs, verbose=1, validation_data=validation_data, 674 | shuffle=False, batch_size=batch_size, callbacks=[es]) 675 | 676 | # scores 677 | if len(history.history['loss']) < 500: 678 | train_score = [min(history.history['loss']), min(history.history['mean_absolute_error'])] 679 | val_score = [min(history.history['val_loss']),min(history.history['val_mean_absolute_error'])] 680 | else: 681 | train_score = [history.history['loss'][-1], history.history['mean_absolute_error'][-1]] 682 | val_score = [history.history['val_loss'][-1], history.history['val_mean_absolute_error'][-1]] 683 | score = {'train': train_score, 'val': val_score} 684 | 685 | # predictions 686 | predictions_train = model.predict(X, verbose=1) 687 | predictions_validation = model.predict(X_val, verbose=1) 688 | predictions_test = model.predict(X_test, verbose=1) 689 | 690 | print('------------------------------------------------------------') 691 | print('The mse train loss 
is: ', train_score[0]) 692 | print('The mae train loss is: ', train_score[1]) 693 | print('The mse test loss is: ', val_score[0]) 694 | print('The mae test loss is: ', val_score[1]) 695 | print('------------------------------------------------------------') 696 | 697 | return model, history, score, predictions_train, predictions_validation, predictions_test 698 | 699 | # ################################### RNN ############################################ 700 | def apply_RNN(self, X, y, validation_data, test_data, hidden_nodes, epochs, optimizer, loss_fct, 701 | batch_size=256): 702 | """ 703 | Note: CuDNNLSTM provides a faster implementation on GPU than regular LSTM 704 | :param X: 705 | :param y: 706 | :param validation_data: 707 | :param test_data: 708 | :param hidden_nodes: 709 | :param epochs: 710 | :param optimizer: 711 | :param loss_fct: 712 | :param batch_size: 713 | :return: 714 | """ 715 | # reshape 716 | X = X.reshape((X.shape[0], X.shape[1], 1)) 717 | X_val = validation_data[0].reshape((validation_data[0].shape[0], validation_data[0].shape[1], 1)) 718 | y_val = validation_data[1] 719 | X_test = test_data[0].reshape((test_data[0].shape[0], test_data[0].shape[1], 1)) 720 | y_test = test_data[1] 721 | 722 | # define model 723 | model = Sequential() 724 | glorot_init = glorot_normal(seed=None) 725 | # add GRU layers 726 | if len(hidden_nodes) == 1: 727 | #model.add(LSTM(hidden_nodes[0], activation='relu', input_shape=(X.shape[1], 1), 728 | # kernel_initializer=glorot_init)) 729 | model.add(CuDNNLSTM(hidden_nodes[0], input_shape=(X.shape[1], 1), kernel_initializer=glorot_init)) 730 | else: 731 | for i in range(len(hidden_nodes)-1): 732 | if i == 0: 733 | #model.add(LSTM(hidden_nodes[0], activation='relu', input_shape=(X.shape[1], 1), 734 | # return_sequences=True, kernel_initializer=glorot_init)) 735 | model.add(CuDNNLSTM(hidden_nodes[0], input_shape=(X.shape[1], 1), 736 | return_sequences=True, kernel_initializer=glorot_init)) 737 | else: 738 | #model.add(LSTM(hidden_nodes[i], activation='relu', return_sequences=True, 739 | # kernel_initializer=glorot_init)) 740 | model.add(CuDNNLSTM(hidden_nodes[i], return_sequences=True, 741 | kernel_initializer=glorot_init)) 742 | # add dropout in between 743 | model.add(Dropout(0.1)) 744 | 745 | #model.add(LSTM(hidden_nodes[-1], activation='relu', kernel_initializer=glorot_init)) # last layer does not return sequences 746 | model.add(CuDNNLSTM(hidden_nodes[-1], kernel_initializer=glorot_init))# last layer does not return sequences 747 | # add regularization 748 | #model.add(Dropout(0.1)) 749 | # add dense layer for output 750 | model.add(Dense(1, kernel_initializer=glorot_init)) 751 | model.compile(optimizer=optimizer, loss=loss_fct, metrics=['mae']) 752 | model.summary() 753 | plot_model(model, to_file='/content/drive/PairsTrading/rnn_models/model.png', show_shapes=True, 754 | show_layer_names=False) 755 | 756 | # simple early stopping 757 | es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50) 758 | 759 | # fit model 760 | history = model.fit(X, y, epochs=epochs, verbose=1, validation_data=(X_val, y_val), shuffle=False, 761 | batch_size=batch_size, callbacks=[es]) 762 | 763 | # scores 764 | if len(history.history['loss']) < 500: 765 | train_score = [min(history.history['loss']), min(history.history['mean_absolute_error'])] 766 | val_score = [min(history.history['val_loss']),min(history.history['val_mean_absolute_error'])] 767 | else: 768 | train_score = [history.history['loss'][-1], 
history.history['mean_absolute_error'][-1]] 769 | val_score = [history.history['val_loss'][-1], history.history['val_mean_absolute_error'][-1]] 770 | 771 | score = {'train': train_score, 'val': val_score} 772 | 773 | # removed test score calculation to save time 774 | #test_score = model.evaluate(X_test, y_test, verbose=1) 775 | # , 'test': test_score} 776 | 777 | predictions_train = model.predict(X, verbose=1) 778 | predictions_validation = model.predict(X_val, verbose=1) 779 | predictions_test = model.predict(X_test, verbose=1) 780 | 781 | print('------------------------------------------------------------') 782 | print('The mse train loss is: ', train_score[0]) 783 | print('The mae train loss is: ', train_score[1]) 784 | print('The mse test loss is: ', val_score[0]) 785 | print('The mae test loss is: ', val_score[1]) 786 | print('------------------------------------------------------------') 787 | 788 | return model, history, score, predictions_train, predictions_validation, predictions_test 789 | 790 | # ################################### ENCODER DECODER ############################################ 791 | def apply_encoder_decoder(self, X, y, validation_data, test_data, n_in, n_out, hidden_nodes, 792 | epochs, optimizer, loss_fct, batch_size=512): 793 | 794 | # reshape from [samples, timesteps] into [samples, timesteps, features] 795 | X = X.reshape((X.shape[0], X.shape[1], 1)) 796 | y = y.reshape((y.shape[0], y.shape[1], 1)) 797 | X_val = validation_data[0].reshape((validation_data[0].shape[0], validation_data[0].shape[1], 1)) 798 | y_val = validation_data[1].reshape((validation_data[1].shape[0], validation_data[1].shape[1], 1)) 799 | X_test = test_data[0].reshape((test_data[0].shape[0], test_data[0].shape[1], 1)) 800 | y_test = test_data[1].reshape((test_data[1].shape[0], test_data[1].shape[1], 1)) 801 | 802 | # define model 803 | glorot_init = glorot_normal(seed=None) 804 | model = Sequential() 805 | 806 | # CuDNNLSTM provides a faster implementation on GPU 807 | #model.add(LSTM(hidden_nodes[0], activation='relu', input_shape=(n_in, 1), kernel_initializer=glorot_init)) 808 | model.add(CuDNNLSTM(hidden_nodes[0], input_shape=(n_in, 1), kernel_initializer=glorot_init)) 809 | model.add(RepeatVector(n_out)) 810 | 811 | # CuDNNLSTM provides a faster implementation on GPU 812 | #model.add(LSTM(hidden_nodes[1], activation='relu', return_sequences=True, kernel_initializer=glorot_init)) 813 | model.add(CuDNNLSTM(hidden_nodes[1], return_sequences=True, kernel_initializer=glorot_init)) 814 | 815 | #model.add(Dropout(0.1)) 816 | model.add(TimeDistributed(Dense(1, kernel_initializer=glorot_init))) 817 | model.compile(optimizer=optimizer, loss=loss_fct, metrics=['mae']) 818 | model.summary() 819 | plot_model(model, to_file='/content/drive/PairsTrading/encoder_decoder/model.png', show_shapes=True, 820 | show_layer_names=False) 821 | 822 | # fit model 823 | # simple early stopping 824 | es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50) 825 | 826 | # fit model 827 | history = model.fit(X, y, epochs=epochs, verbose=1, validation_data=(X_val, y_val), shuffle=False, 828 | batch_size=batch_size, callbacks=[es]) 829 | 830 | # scores 831 | if len(history.history['loss']) < 500: 832 | train_score = [min(history.history['loss']), min(history.history['mean_absolute_error'])] 833 | val_score = [min(history.history['val_loss']), min(history.history['val_mean_absolute_error'])] 834 | else: 835 | train_score = [history.history['loss'][-1], history.history['mean_absolute_error'][-1]] 836 | 
val_score = [history.history['val_loss'][-1], history.history['val_mean_absolute_error'][-1]] 837 | 838 | score = {'train': train_score, 'val': val_score} 839 | 840 | predictions_train = model.predict(X, verbose=1) 841 | predictions_validation = model.predict(X_val, verbose=1) 842 | predictions_test = model.predict(X_test, verbose=1) 843 | 844 | print('------------------------------------------------------------') 845 | print('The mse train loss is: ', train_score[0]) 846 | print('The mae train loss is: ', train_score[1]) 847 | print('The mse test loss is: ', val_score[0]) 848 | print('The mae test loss is: ', val_score[1]) 849 | print('------------------------------------------------------------') 850 | 851 | return model, history, score, predictions_train, predictions_validation, predictions_test 852 | 853 | def display_forecasting_score(self, models): 854 | 855 | # initialize storage variables 856 | best_score = 99999 857 | best_model = None 858 | 859 | all_models = models 860 | for configuration in all_models: 861 | config = configuration[-1] 862 | print('\nNEW CONFIGURATION:') 863 | print('Configuration: ', config) 864 | mae_train, mse_train = list(), list() 865 | mae_val, mse_val = list(), list() 866 | for pair_i in range(len(configuration) - 1): 867 | score_train = configuration[pair_i]['score']['train'] 868 | score_val = configuration[pair_i]['score']['val'] 869 | # print('MAE: {:.2f}%'.format(score[1])) 870 | mse_train.append(score_train[0]) 871 | mae_train.append(score_train[1]) 872 | mse_val.append(score_val[0]) 873 | mae_val.append(score_val[1]) 874 | print('\nPair loaded: {}_{}: Epochs: {} Val_MSE: {}'.format(configuration[pair_i]['leg1'], 875 | configuration[pair_i]['leg2'], 876 | configuration[pair_i]['epoch_stop'], 877 | score_val[0] 878 | )) 879 | if (np.mean(mse_val)) < best_score: 880 | best_score = np.mean(mse_val) 881 | best_model = config 882 | 883 | print('\nCONFIGURATION TRAIN MSE ERROR: {:.4f}E-4'.format(np.mean(mse_train) * 10000)) 884 | print('CONFIGURATION TRAIN MAE ERROR: {:.4f}'.format(np.mean(mae_train))) 885 | print('\nCONFIGURATION VAL MSE ERROR: {:.4f}E-4'.format(np.mean(mse_val) * 10000)) 886 | print('CONFIGURATION VAL MAE ERROR: {:.4f}'.format(np.mean(mae_val))) 887 | 888 | return (best_model, best_score) 889 | 890 | def run_specific_model(self, n_in, hidden_nodes, pairs, path='models/', train_val_split='2017-01-01', lag=1, 891 | multistep=0, low_quantile=0.10, high_quantile=0.90): 892 | 893 | nodes_name = str(hidden_nodes[0]) + '_' + str(hidden_nodes[1]) if len(hidden_nodes) > 1 else str(hidden_nodes[0]) 894 | file_name = 'models_n_in-' + str(n_in) + '_hidden_nodes-' + nodes_name + '.pkl' 895 | 896 | with open(path + file_name, 'rb') as f: 897 | model = pickle.load(f) 898 | 899 | model_cumret, model_sharpe_ratio = list(), list() 900 | balance_summaries, summaries = list(), list() 901 | for pair_i in range(len(model) - 1): 902 | #print('\nPair loaded: {}_{}:'.format(model[pair_i]['leg1'], model[pair_i]['leg2'])) 903 | #print('Check pairs: {}_{}.'.format(pairs[pair_i][0], pairs[pair_i][1])) 904 | predictions = model[pair_i]['predictions_val'] 905 | 906 | ret, cumret, summary, balance_summary = self.forecast_spread_trading( 907 | X=pairs[pair_i][2]['X_train'][train_val_split:], 908 | Y=pairs[pair_i][2]['Y_train'][train_val_split:], 909 | spread_test=pairs[pair_i][2]['spread'][train_val_split:], 910 | spread_train=pairs[pair_i][2]['spread'][:train_val_split], 911 | beta=pairs[pair_i][2]['coint_coef'], 912 | predictions=predictions, 913 | lag=lag, 914 | 
low_quantile=low_quantile, 915 | high_quantile=high_quantile, 916 | multistep=multistep) 917 | 918 | #print('Accumulated return: {:.2f}%'.format(cumret[-1] * 100)) 919 | 920 | trader = class_Trader.Trader() 921 | if np.std(ret) != 0: 922 | sharpe_ratio = trader.calculate_sharpe_ratio(1, 252, ret) 923 | else: 924 | sharpe_ratio = 0 925 | #print('Sharpe Ratio:', sharpe_ratio) 926 | 927 | model_cumret.append(cumret[-1] * 100) 928 | model_sharpe_ratio.append(sharpe_ratio) 929 | summaries.append(summary) 930 | balance_summaries.append(balance_summary) 931 | 932 | return model, model_cumret, model_sharpe_ratio, summaries, balance_summaries 933 | 934 | def test_specific_model(self, n_in, hidden_nodes, pairs, path, train_test_split='2018-01-01', lag=1, 935 | low_quantile=0.10, high_quantile=0.90, multistep=0, profitable_pairs_indices=None): 936 | 937 | nodes_name = str(hidden_nodes[0]) + '_' + str(hidden_nodes[1]) if len(hidden_nodes) > 1 else str( 938 | hidden_nodes[0]) 939 | file_name = 'models_n_in-' + str(n_in) + '_hidden_nodes-' + nodes_name + '.pkl' 940 | 941 | with open(path + file_name, 'rb') as f: 942 | model = pickle.load(f) 943 | 944 | model_cumret, model_sharpe_ratio = list(), list() 945 | summaries, balance_summaries = list(), list() 946 | for pair_i in range(len(model) - 1): 947 | if pair_i in profitable_pairs_indices: 948 | #print('\nPair loaded: {}_{}:'.format(model[pair_i]['leg1'], model[pair_i]['leg2'])) 949 | #print('Check pairs: {}_{}.'.format(pairs[pair_i][0], pairs[pair_i][1])) 950 | predictions = model[pair_i]['predictions_test'] 951 | spread_test = pairs[pair_i][2]['Y_test'] - pairs[pair_i][2]['coint_coef'] * pairs[pair_i][2]['X_test'] 952 | 953 | ret, cumret, summary, balance_summary = self.forecast_spread_trading( 954 | X=pairs[pair_i][2]['X_test'], 955 | Y=pairs[pair_i][2]['Y_test'], 956 | spread_test=spread_test[-len(predictions)-multistep:], 957 | spread_train=pairs[pair_i][2]['spread'][:train_test_split], 958 | beta=pairs[pair_i][2]['coint_coef'], 959 | predictions=predictions, 960 | lag=lag, 961 | low_quantile=low_quantile, 962 | high_quantile=high_quantile, 963 | multistep=multistep) 964 | 965 | #print('Accumulated return: {:.2f}%'.format(cumret[-1] * 100)) 966 | 967 | trader = class_Trader.Trader() 968 | if np.std(ret) != 0: 969 | sharpe_ratio = trader.calculate_sharpe_ratio(1, 252, ret) 970 | else: 971 | sharpe_ratio = 0 972 | #print('Sharpe Ratio:', sharpe_ratio) 973 | 974 | model_cumret.append(cumret[-1] * 100) 975 | model_sharpe_ratio.append(sharpe_ratio) 976 | summaries.append(summary) 977 | balance_summaries.append(balance_summary) 978 | 979 | return model, model_cumret, model_sharpe_ratio, summaries, balance_summaries 980 | 981 | 982 | -------------------------------------------------------------------------------- /classes/class_SeriesAnalyser.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import sys 4 | import collections, functools, operator 5 | 6 | import statsmodels.api as sm 7 | from statsmodels.tsa.stattools import coint, adfuller 8 | 9 | from sklearn.cluster import DBSCAN 10 | from sklearn.cluster import OPTICS, cluster_optics_dbscan 11 | from sklearn.decomposition import PCA 12 | from sklearn import preprocessing 13 | from sklearn.metrics import silhouette_score 14 | 15 | # just set the seed for the random number generator 16 | np.random.seed(107) 17 | 18 | 19 | class SeriesAnalyser: 20 | """ 21 | This class contains a set of functions to deal with time series 
analysis. 22 | """ 23 | 24 | def __init__(self): 25 | """ 26 | :initial elements 27 | """ 28 | 29 | def check_for_stationarity(self, X, subsample=0): 30 | """ 31 | H_0 in adfuller is unit root exists (non-stationary). 32 | We must observe significant p-value to convince ourselves that the series is stationary. 33 | 34 | :param X: time series 35 | :param subsample: boolean indicating whether to subsample series 36 | :return: adf results 37 | """ 38 | if subsample != 0: 39 | frequency = round(len(X)/subsample) 40 | subsampled_X = X[0::frequency] 41 | result = adfuller(subsampled_X) 42 | else: 43 | result = adfuller(X) 44 | # result contains: 45 | # 0: t-statistic 46 | # 1: p-value 47 | # others: please see https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html 48 | 49 | return {'t_statistic': result[0], 'p_value': result[1], 'critical_values': result[4]} 50 | 51 | def check_properties(self, train_series, test_series, p_value_threshold, min_half_life=78, max_half_life=20000, 52 | min_zero_crossings=0, hurst_threshold=0.5, subsample=0): 53 | """ 54 | Gets two time series as inputs and provides information concerning cointegration stasttics 55 | Y - b*X : Y is dependent, X is independent 56 | """ 57 | 58 | # for some reason is not giving right results 59 | # t_statistic, p_value, crit_value = coint(X,Y, method='aeg') 60 | 61 | # perform test manually in both directions 62 | X = train_series[0] 63 | Y = train_series[1] 64 | pairs = [(X, Y), (Y, X)] 65 | pair_stats = [0] * 2 66 | criteria_not_verified = 'cointegration' 67 | 68 | # first of all, must verify price series S1 and S2 are I(1) 69 | stats_Y = self.check_for_stationarity(np.asarray(Y), subsample=subsample) 70 | if stats_Y['p_value'] > 0.10: 71 | stats_X = self.check_for_stationarity(np.asarray(X), subsample=subsample) 72 | if stats_X['p_value'] > 0.10: 73 | # conditions to test cointegration verified 74 | 75 | for i, pair in enumerate(pairs): 76 | S1 = np.asarray(pair[0]) 77 | S2 = np.asarray(pair[1]) 78 | S1_c = sm.add_constant(S1) 79 | 80 | # Y = bX + c 81 | # ols: (Y, X) 82 | results = sm.OLS(S2, S1_c).fit() 83 | b = results.params[1] 84 | 85 | if b > 0: 86 | spread = pair[1] - b * pair[0] # as Pandas Series 87 | spread_array = np.asarray(spread) # as array for faster computations 88 | 89 | stats = self.check_for_stationarity(spread_array, subsample=subsample) 90 | if stats['p_value'] < p_value_threshold: # verifies required pvalue 91 | criteria_not_verified = 'hurst_exponent' 92 | 93 | hurst_exponent = self.hurst(spread_array) 94 | if hurst_exponent < hurst_threshold: 95 | criteria_not_verified = 'half_life' 96 | 97 | hl = self.calculate_half_life(spread_array) 98 | if (hl >= min_half_life) and (hl < max_half_life): 99 | criteria_not_verified = 'mean_cross' 100 | 101 | zero_cross = self.zero_crossings(spread_array) 102 | if zero_cross >= min_zero_crossings: 103 | criteria_not_verified = 'None' 104 | 105 | pair_stats[i] = {'t_statistic': stats['t_statistic'], 106 | 'critical_val': stats['critical_values'], 107 | 'p_value': stats['p_value'], 108 | 'coint_coef': b, 109 | 'zero_cross': zero_cross, 110 | 'half_life': int(round(hl)), 111 | 'hurst_exponent': hurst_exponent, 112 | 'spread': spread, 113 | 'Y_train': pair[1], 114 | 'X_train': pair[0] 115 | } 116 | 117 | if pair_stats[0] == 0 and pair_stats[1] == 0: 118 | result = None 119 | return result, criteria_not_verified 120 | 121 | elif pair_stats[0] == 0: 122 | result = 1 123 | elif pair_stats[1] == 0: 124 | result = 0 125 | else: # both combinations are 
possible 126 | # select lowest t-statistic as representative test 127 | if abs(pair_stats[0]['t_statistic']) > abs(pair_stats[1]['t_statistic']): 128 | result = 0 129 | else: 130 | result = 1 131 | 132 | if result == 0: 133 | result = pair_stats[0] 134 | result['X_test'] = test_series[0] 135 | result['Y_test'] = test_series[1] 136 | elif result == 1: 137 | result = pair_stats[1] 138 | result['X_test'] = test_series[1] 139 | result['Y_test'] = test_series[0] 140 | 141 | return result, criteria_not_verified 142 | 143 | def find_pairs(self, data_train, data_test, p_value_threshold, min_half_life=78, max_half_life=20000, 144 | min_zero_crossings=0, hurst_threshold=0.5, subsample=0): 145 | """ 146 | This function receives a df with the different securities as columns, and aims to find tradable 147 | pairs within this world. There is a df containing the training data and another one containing test data 148 | Tradable pairs are those that verify: 149 | - cointegration 150 | - minimium half life 151 | - minimium zero crossings 152 | 153 | :param data_train: df with training prices in columns 154 | :param data_test: df with testing prices in columns 155 | :param p_value_threshold: pvalue threshold for a pair to be cointegrated 156 | :param min_half_life: minimium half life value of the spread to consider the pair 157 | :param min_zero_crossings: minimium number of allowed zero crossings 158 | :param hurst_threshold: mimimium acceptable number for hurst threshold 159 | :return: pairs that passed test 160 | """ 161 | n = data_train.shape[1] 162 | keys = data_train.keys() 163 | pairs_fail_criteria = {'cointegration': 0, 'hurst_exponent': 0, 'half_life': 0, 'mean_cross': 0, 'None': 0} 164 | pairs = [] 165 | for i in range(n): 166 | for j in range(i + 1, n): 167 | S1_train = data_train[keys[i]]; S2_train = data_train[keys[j]] 168 | S1_test = data_test[keys[i]]; S2_test = data_test[keys[j]] 169 | result, criteria_not_verified = self.check_properties((S1_train, S2_train), (S1_test, S2_test), 170 | p_value_threshold, min_half_life, max_half_life, 171 | min_zero_crossings, hurst_threshold, subsample) 172 | pairs_fail_criteria[criteria_not_verified] += 1 173 | if result is not None: 174 | pairs.append((keys[i], keys[j], result)) 175 | 176 | 177 | return pairs, pairs_fail_criteria 178 | 179 | def pairs_overlap(self, pairs, p_value_threshold, min_zero_crossings, min_half_life, hurst_threshold): 180 | """ 181 | This function receives the pairs identified in the training set, and returns a list of the pairs 182 | which are still cointegrated in the test set. 
183 | 184 | :param pairs: list of pairs in the train set for which to verify cointegration in the test set 185 | :param p_value_threshold: p_value to consider cointegration 186 | :param min_zero_crossings: zero crossings to consider cointegration 187 | :param min_half_life: minimum half-life to consider cointegration 188 | :param hurst_threshold: maximum threshold to consider cointegration 189 | 190 | :return: list with pairs overlapped 191 | :return: list with indices from the pairs overlapped 192 | """ 193 | pairs_overlapped = [] 194 | pairs_overlapped_index = [] 195 | 196 | for index, pair in enumerate(pairs): 197 | # get consituents 198 | X = pair[2]['X_test'] 199 | Y = pair[2]['Y_test'] 200 | # check if pairs is valid 201 | series_name = X.name 202 | X = sm.add_constant(X) 203 | results = sm.OLS(Y, X).fit() 204 | X = X[series_name] 205 | b = results.params[X.name] 206 | spread = Y - b * X 207 | stats = self.check_for_stationarity(pd.Series(spread, name='Spread')) 208 | 209 | if stats['p_value'] < p_value_threshold: # verifies required pvalue 210 | hl = self.calculate_half_life(spread) 211 | if hl >= min_half_life: # verifies required half life 212 | zero_cross = self.zero_crossings(spread) 213 | if zero_cross >= min_zero_crossings: # verifies required zero crossings 214 | hurst_exponent = self.hurst(spread) 215 | if hurst_exponent < hurst_threshold: # verifies hurst exponent 216 | pairs_overlapped.append(pair) 217 | pairs_overlapped_index.append(index) 218 | 219 | return pairs_overlapped, pairs_overlapped_index 220 | 221 | def zscore(self, series): 222 | """ 223 | Returns the nromalized time series assuming a normal distribution 224 | """ 225 | return (series-series.mean())/np.std(series) 226 | 227 | def calculate_half_life(self, z_array): 228 | """ 229 | This function calculates the half life parameter of a 230 | mean reversion series 231 | """ 232 | z_lag = np.roll(z_array, 1) 233 | z_lag[0] = 0 234 | z_ret = z_array - z_lag 235 | z_ret[0] = 0 236 | 237 | # adds intercept terms to X variable for regression 238 | z_lag2 = sm.add_constant(z_lag) 239 | 240 | model = sm.OLS(z_ret[1:], z_lag2[1:]) 241 | res = model.fit() 242 | 243 | halflife = -np.log(2) / res.params[1] 244 | 245 | return halflife 246 | 247 | def hurst(self, ts): 248 | """ 249 | Returns the Hurst Exponent of the time series vector ts. 250 | Series vector ts should be a price series. 251 | Source: https://www.quantstart.com/articles/Basics-of-Statistical-Mean-Reversion-Testing""" 252 | # Create the range of lag values 253 | lags = range(2, 100) 254 | 255 | # Calculate the array of the variances of the lagged differences 256 | # Here it calculates the variances, but why it uses 257 | # standard deviation and then make a root of it? 
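        # clarification of the step below: for a series with Hurst exponent H, the standard deviation of
        # the lagged differences scales roughly as lag**H, so tau = sqrt(std) scales as lag**(H/2).
        # The slope of the log(tau)-vs-log(lag) fit is therefore H/2, which is why the function
        # returns poly[0] * 2.0 rather than poly[0] itself.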
258 | tau = [np.sqrt(np.std(np.subtract(ts[lag:], ts[:-lag]))) for lag in lags] 259 | 260 | # Use a linear fit to estimate the Hurst Exponent 261 | poly = np.polyfit(np.log(lags), np.log(tau), 1) 262 | 263 | # Return the Hurst exponent from the polyfit output 264 | return poly[0] * 2.0 265 | 266 | def variance_ratio(self, ts, lag=2): 267 | """ 268 | Returns the variance ratio test result 269 | Source: https://gist.github.com/jcorrius/56b4983ca059e69f2d2df38a3a05e225#file-variance_ratio-py 270 | """ 271 | # make sure we are working with an array, convert if necessary 272 | ts = np.asarray(ts) 273 | 274 | # Apply the formula to calculate the test 275 | n = len(ts) 276 | mu = sum(ts[1:n] - ts[:n - 1]) / n 277 | m = (n - lag + 1) * (1 - lag / n) 278 | b = sum(np.square(ts[1:n] - ts[:n - 1] - mu)) / (n - 1) 279 | t = sum(np.square(ts[lag:n] - ts[:n - lag] - lag * mu)) / m 280 | return t / (lag * b) 281 | 282 | def zero_crossings(self, x): 283 | """ 284 | Function that counts the number of zero crossings of a given signal 285 | :param x: the signal to be analyzed 286 | """ 287 | x = x - x.mean() 288 | zero_crossings = sum(1 for i, _ in enumerate(x) if (i + 1 < len(x)) if ((x[i] * x[i + 1] < 0) or (x[i] == 0))) 289 | 290 | return zero_crossings 291 | 292 | def apply_PCA(self, n_components, df, svd_solver='auto', random_state=0): 293 | """ 294 | This function applies Principal Component Analysis to the df given as 295 | parameter 296 | 297 | :param n_components: number of principal components 298 | :param df: dataframe containing time series for analysis 299 | :param svd_solver: solver for PCA: see PCA documentation 300 | :return: reduced normalized and transposed df 301 | """ 302 | 303 | if not isinstance(n_components, str): 304 | if n_components > df.shape[1]: 305 | print("ERROR: number of components larger than samples...") 306 | exit() 307 | 308 | pca = PCA(n_components=n_components, svd_solver=svd_solver, random_state=random_state) 309 | pca.fit(df) 310 | explained_variance = pca.explained_variance_ 311 | 312 | # standardize 313 | X = preprocessing.StandardScaler().fit_transform(pca.components_.T) 314 | 315 | return X, explained_variance 316 | 317 | def apply_OPTICS(self, X, df_returns, min_samples, max_eps=2, xi=0.05, cluster_method='xi'): 318 | """ 319 | 320 | :param X: 321 | :param df_returns: 322 | :param min_samples: 323 | :param max_eps: 324 | :param xi: 325 | :param eps: 326 | :return: 327 | """ 328 | clf = OPTICS(min_samples=min_samples, max_eps=max_eps, xi=xi, metric='euclidean', cluster_method=cluster_method) 329 | print(clf) 330 | 331 | clf.fit(X) 332 | labels = clf.labels_ 333 | n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) 334 | print("Clusters discovered: %d" % n_clusters_) 335 | 336 | clustered_series_all = pd.Series(index=df_returns.columns, data=labels.flatten()) 337 | clustered_series = clustered_series_all[clustered_series_all != -1] 338 | 339 | counts = clustered_series.value_counts() 340 | print("Pairs to evaluate: %d" % (counts * (counts - 1) / 2).sum()) 341 | 342 | return clustered_series_all, clustered_series, counts, clf 343 | 344 | def apply_DBSCAN(self, eps, min_samples, X, df_returns): 345 | """ 346 | This function applies a DBSCAN clustering algo 347 | 348 | :param eps: min distance for a sample to be within the cluster 349 | :param min_samples: min_samples to consider a cluster 350 | :param X: data 351 | 352 | :return: clustered_series_all: series with all tickers and labels 353 | :return: clustered_series: series with tickers belonging to a cluster 
354 | :return: counts: counts of each cluster 355 | :return: clf object 356 | """ 357 | clf = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean') 358 | #print(clf) 359 | 360 | clf.fit(X) 361 | labels = clf.labels_ 362 | n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) 363 | print("Clusters discovered: %d" % n_clusters_) 364 | 365 | clustered_series_all = pd.Series(index=df_returns.columns, data=labels.flatten()) 366 | clustered_series = clustered_series_all[clustered_series_all != -1] 367 | 368 | counts = clustered_series.value_counts() 369 | print("Pairs to evaluate: %d" % (counts * (counts - 1) / 2).sum()) 370 | 371 | return clustered_series_all, clustered_series, counts, clf 372 | 373 | def clustering_for_optimal_PCA(self, min_components, max_components, returns, clustering_params): 374 | """ 375 | This function experiments different values for the number of PCA components considered. 376 | It returns the values obtained for the number of components which provided the best silhouette 377 | coefficient. 378 | 379 | :param min_components: min number of components to test 380 | :param max_components: max number of components to test 381 | :param returns: series of returns 382 | :param clustering_params: parameters for clustering 383 | 384 | :return: X: PCA reduced dataset 385 | :return: clustered_series_all: cluster labels for all sample 386 | :return: clustered_series: cluster labels for samples belonging to a cluster 387 | :return: counts: counts for each cluster 388 | :return: clf: object returned by DBSCAN 389 | """ 390 | # initialize dictionary to save best performers 391 | best_n_comp = {'n_comp': -1, 392 | 'silhouette': -1, 393 | 'X': None, 394 | 'clustered_series_all': None, 395 | 'clustered_series': None, 396 | 'counts': None, 397 | 'clf': None 398 | } 399 | 400 | for n_comp in range(min_components, max_components): 401 | print('\nNumber of components: ', n_comp) 402 | # Apply PCA on data 403 | print('Returns shape: ', returns.shape) 404 | X, _ = self.apply_PCA(n_comp, returns) 405 | # Apply DBSCAN 406 | clustered_series_all, clustered_series, counts, clf = self.apply_DBSCAN( 407 | clustering_params['epsilon'], 408 | clustering_params['min_samples'], 409 | X, 410 | returns) 411 | # Silhouette score 412 | silhouette = silhouette_score(X, clf.labels_, 'euclidean') 413 | print('Silhouette score ', silhouette) 414 | 415 | # Standard deviation 416 | # std_deviation = counts.std() 417 | # print('Standard deviation: ',std_deviation)) 418 | 419 | if silhouette > best_n_comp['silhouette']: 420 | best_n_comp = {'n_comp': n_comp, 421 | 'silhouette': silhouette, 422 | 'X': X, 423 | 'clustered_series_all': clustered_series_all, 424 | 'clustered_series': clustered_series, 425 | 'counts': counts, 426 | 'clf': clf 427 | } 428 | 429 | print('\nThe best silhouette coefficient was: {} for {} principal components'.format(best_n_comp['silhouette'], 430 | best_n_comp['n_comp'])) 431 | 432 | return best_n_comp['X'], best_n_comp['clustered_series_all'], best_n_comp['clustered_series'], best_n_comp[ 433 | 'counts'], best_n_comp['clf'] 434 | 435 | def get_candidate_pairs(self, clustered_series, pricing_df_train, pricing_df_test, min_half_life=78, 436 | max_half_life=20000, min_zero_crosings=20, p_value_threshold=0.05, hurst_threshold=0.5, 437 | subsample=0): 438 | """ 439 | This function looks for tradable pairs over the clusters formed previously. 
440 | 441 | :param clustered_series: series with cluster label info 442 | :param pricing_df_train: df with price series from train set 443 | :param pricing_df_test: df with price series from test set 444 | :param n_clusters: number of clusters 445 | :param min_half_life: min half life of a time series to be considered as candidate 446 | :param min_zero_crosings: min number of zero crossings (or mean crossings) 447 | :param p_value_threshold: p_value to check during cointegration test 448 | :param hurst_threshold: max hurst exponent value 449 | 450 | :return: list of pairs and its info 451 | :return: list of unique tickers identified in the candidate pairs universe 452 | """ 453 | 454 | total_pairs, total_pairs_fail_criteria = [], [] 455 | n_clusters = len(clustered_series.value_counts()) 456 | for clust in range(n_clusters): 457 | sys.stdout.write("\r"+'Cluster {}/{}'.format(clust+1, n_clusters)) 458 | sys.stdout.flush() 459 | symbols = list(clustered_series[clustered_series == clust].index) 460 | cluster_pricing_train = pricing_df_train[symbols] 461 | cluster_pricing_test = pricing_df_test[symbols] 462 | pairs, pairs_fail_criteria = self.find_pairs(cluster_pricing_train, 463 | cluster_pricing_test, 464 | p_value_threshold, 465 | min_half_life, 466 | max_half_life, 467 | min_zero_crosings, 468 | hurst_threshold, 469 | subsample) 470 | total_pairs.extend(pairs) 471 | total_pairs_fail_criteria.append(pairs_fail_criteria) 472 | 473 | print('Found {} pairs'.format(len(total_pairs))) 474 | unique_tickers = np.unique([(element[0], element[1]) for element in total_pairs]) 475 | print('The pairs contain {} unique tickers'.format(len(unique_tickers))) 476 | 477 | # discarded 478 | review = dict(functools.reduce(operator.add, map(collections.Counter, total_pairs_fail_criteria))) 479 | print('Pairs Selection failed stage: ', review) 480 | 481 | return total_pairs, unique_tickers 482 | -------------------------------------------------------------------------------- /classes/class_Trader.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import sys 4 | import matplotlib.pyplot as plt 5 | from datetime import timedelta 6 | 7 | # just set the seed for the random number generator 8 | np.random.seed(107) 9 | 10 | class Trader: 11 | """ 12 | This class contains a set of pairs trading strategies along 13 | with some auxiliary functions 14 | """ 15 | 16 | def __init__(self): 17 | """ 18 | :initial elements 19 | """ 20 | 21 | def threshold_strategy(self, y, x, beta, entry_level=1.0, exit_level=1.0, stabilizing_threshold=5): 22 | """ 23 | This function implements a threshold filter strategy with a fixed beta, corresponding to the cointegration 24 | ratio. 
25 | :param y: price series of asset y 26 | :param x: price series of asset x 27 | :param entry_level: abs of long and short threshold 28 | :param exit_multiplier: abs of exit threshold 29 | :param stabilizing_threshold: number of initial periods when no positions should be set 30 | """ 31 | 32 | # calculate normalized spread 33 | spread = y - beta * x 34 | norm_spread = (spread - spread.mean()) / np.std(spread) 35 | norm_spread = np.asarray(norm_spread.values) 36 | 37 | # get indices for long and short positions 38 | longs_entry = norm_spread < -entry_level 39 | longs_exit = norm_spread > -exit_level 40 | shorts_entry = norm_spread > entry_level 41 | shorts_exit = norm_spread < exit_level 42 | 43 | num_units_long = pd.Series([np.nan for i in range(len(y))]) 44 | num_units_short = pd.Series([np.nan for i in range(len(y))]) 45 | 46 | # remove trades while the spread is stabilizing 47 | longs_entry[:stabilizing_threshold] = False 48 | longs_exit[:stabilizing_threshold] = False 49 | shorts_entry[:stabilizing_threshold] = False 50 | shorts_exit[:stabilizing_threshold] = False 51 | 52 | # set threshold crossings with corresponding position 53 | num_units_long[longs_entry] = 1. 54 | num_units_long[longs_exit] = 0 55 | num_units_short[shorts_entry] = -1. 56 | num_units_short[shorts_exit] = 0 57 | 58 | # shift to simulate entry delay in real life trading 59 | # please comment if no need to simulate delay 60 | num_units_long = num_units_long.shift(1) 61 | num_units_short = num_units_short.shift(1) 62 | 63 | # initialize market position with zero 64 | num_units_long[0] = 0. 65 | num_units_short[0] = 0. 66 | # finally, fill in between 67 | num_units_long = num_units_long.fillna(method='ffill') 68 | num_units_short = num_units_short.fillna(method='ffill') 69 | num_units = num_units_long + num_units_short 70 | num_units = pd.Series(data=num_units.values, index=y.index, name='numUnits') 71 | 72 | # add position durations 73 | trading_durations = self.add_trading_duration(pd.DataFrame(num_units, index=y.index)) 74 | 75 | # Method 1: calculate return per each position 76 | # This method receives the series with the positions and calculate the return at the end of each position, not 77 | # yet accounting for costs 78 | position_ret, _, ret_summary = self.calculate_position_returns(y, x, beta, num_units) 79 | # Method 2: calculate balance in total 80 | # This method constructs the portfolio during the entire trading session and calculates the returns every 5 min. 81 | # By compounding the returns during a position, we obtain the position return as given in method 1. 
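        # (note: calculate_balance below receives num_units shifted by one bar, i.e. the position
        # actually held during each bar rather than the one entered at the end of it)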
82 | # This method is necessary to obtain the daily returns from which to estimate the Sharpe Ratio 83 | balance_summary = self.calculate_balance(y, x, beta, num_units.shift(1).fillna(0), trading_durations) 84 | 85 | # add transaction costs and gather all info in a single dataframe 86 | series_to_include = [(balance_summary.pnl, 'pnl'), 87 | (balance_summary.pnl_y, 'pnl_y'), 88 | (balance_summary.pnl_x, 'pnl_x'), 89 | (balance_summary.account_balance, 'account_balance'), 90 | (balance_summary.returns, 'returns'), 91 | (position_ret, 'position_return'), 92 | (y, y.name), 93 | (x, x.name), 94 | (pd.Series(norm_spread, index=y.index), 'norm_spread'), 95 | (num_units, 'numUnits'), 96 | (trading_durations, 'trading_duration')] 97 | summary = self.trade_summary(series_to_include, beta) 98 | 99 | # calculate sharpe ratio for each pair separately 100 | ret_w_costs = summary.returns 101 | n_years = round(len(y) / (240 * 78)) 102 | n_days = 252 103 | if np.std(ret_w_costs) == 0: 104 | sharpe_no_costs, sharpe_w_costs = (0, 0) 105 | else: 106 | if np.std(position_ret) == 0: 107 | sharpe_no_costs=0 108 | else: 109 | sharpe_no_costs = self.calculate_sharpe_ratio(n_years, n_days, position_ret) 110 | sharpe_w_costs = self.calculate_sharpe_ratio(n_years, n_days, ret_w_costs) 111 | 112 | return summary, (sharpe_no_costs, sharpe_w_costs), balance_summary 113 | 114 | def apply_trading_strategy(self, pairs, strategy='fixed_beta', entry_multiplier=1, exit_multiplier=0, 115 | test_mode=False, train_val_split='2017-01-01'): 116 | """ 117 | This function implements the standard fixed beta trading strategy. 118 | :param pairs: list with pairs identified in the training set 119 | :param strategy: currently, only fixed_beta is implemented 120 | :param entry_multiplier: threshold that defines where to enter a position 121 | :param exit_multiplier: threshold that defines where to exit a position 122 | :param test_mode: flag to decide whether to apply strategy on the validation set or in the test set 123 | :param train_val_split: split of training and validation data 124 | """ 125 | sharpe_results = [] 126 | cum_returns = [] 127 | sharpe_results_with_costs = [] 128 | cum_returns_with_costs = [] 129 | performance = [] # aux variable to store pairs' record 130 | print(' entry delay turned on.') 131 | for i, pair in enumerate(pairs): 132 | sys.stdout.write("\r"+'Pair: {}/{}'.format(i + 1, len(pairs))) 133 | sys.stdout.flush() 134 | pair_info = pair[2] 135 | 136 | if test_mode: 137 | y = pair_info['Y_test'] 138 | x = pair_info['X_test'] 139 | else: 140 | y = pair_info['Y_train'][train_val_split:] 141 | x = pair_info['X_train'][train_val_split:] 142 | 143 | if strategy == 'fixed_beta': 144 | summary, sharpe, balance_summary = self.threshold_strategy(y=y, x=x, beta=pair_info['coint_coef'], 145 | entry_level=entry_multiplier, 146 | exit_level=exit_multiplier) 147 | # no costs 148 | cum_returns.append((np.cumprod(1 + summary.position_return) - 1).iloc[-1] * 100) 149 | sharpe_results.append(sharpe[0]) 150 | # with costs 151 | # cum_returns_with_costs.append((np.cumprod(1 + summary.position_ret_with_costs) - 1).iloc[-1] * 100) 152 | cum_returns_with_costs.append((summary.account_balance[-1] - 1) * 100) 153 | sharpe_results_with_costs.append(sharpe[1]) 154 | performance.append((pair, summary, balance_summary)) 155 | 156 | else: 157 | print('Only one strategy currently available: \n1.Fixed Beta') 158 | exit() 159 | 160 | return (sharpe_results, cum_returns), (sharpe_results_with_costs, cum_returns_with_costs), performance 161 | 162 
| def trade_summary(self, series, beta=0): 163 | """ 164 | This function receives a set of series containing information from the trade and 165 | returns a DataFrame containing the summary data. 166 | :param series: a list of tuples containing the time series and the corresponding names 167 | :param beta: cointegration ratio. If moving beta, use beta=0. 168 | """ 169 | for attribute, attribute_name in series: 170 | try: 171 | attribute.name = attribute_name 172 | except: 173 | continue 174 | summary = pd.concat([item[0] for item in series], axis=1) 175 | 176 | # change numUnits so that it corresponds to the position for the row's date, 177 | # instead of corresponding to the position entered in the end of that day. 178 | summary['numUnits'] = summary['numUnits'].shift().fillna(0) 179 | summary = summary.rename(columns={"numUnits": "position_during_day"}) 180 | 181 | # add position costs 182 | summary['position_ret_with_costs'] = self.add_transaction_costs(summary, beta) 183 | 184 | return summary 185 | 186 | def add_trading_duration(self, df): 187 | """ 188 | The following function adds a column containing the trading duration in days. 189 | :param df: Dataframe containing column with positions to enter in next day 190 | """ 191 | 192 | df['trading_duration'] = [0] * len(df) 193 | previous_unit = 0. 194 | new_position_counter = 0 195 | day = df.index[0].day 196 | for index, row in df.iterrows(): 197 | if previous_unit == row['numUnits']: 198 | if previous_unit != 0.: 199 | # update counter 200 | if index.day != day: 201 | new_position_counter += 1 202 | day = index.day 203 | # verify if it is last trading day 204 | if index == df.index[-1]: 205 | df.loc[index, 'trading_duration'] = new_position_counter 206 | continue # no change in positions to verify 207 | else: 208 | if previous_unit == 0.: 209 | previous_unit = row['numUnits'] 210 | # begin counter 211 | new_position_counter = 1 212 | day = index.day 213 | continue # simply start the trade 214 | else: 215 | df.loc[index, 'trading_duration'] = new_position_counter 216 | previous_unit = row['numUnits'] 217 | # begin counter 218 | new_position_counter = 1 219 | day = index.day 220 | continue 221 | 222 | return df['trading_duration'] 223 | 224 | def add_transaction_costs(self, summary, beta=0, comission_costs=0.08, market_impact=0.2, short_rental=1): 225 | """ 226 | Function to add transaction costs per position. 227 | :param summary: dataframe containing summary of all transactions 228 | :param beta: cointegration factor, use 0 if moving beta 229 | :param comission_costs: commision costs, in percentage, per security, per trade 230 | :param market_impact: market impact costs, in percentage, per security, per trade 231 | :param short_rental: short rental costs, in annual percentage 232 | """ 233 | fixed_costs_per_trade = (comission_costs + market_impact) / 100 # remove percentage 234 | short_costs_per_day = (short_rental / 252) / 100 # remove percentage 235 | 236 | costs = summary.apply(lambda row: self.apply_costs(row, fixed_costs_per_trade, short_costs_per_day, beta), 237 | axis=1) 238 | 239 | ret_with_costs = summary['position_return'] - costs 240 | 241 | return ret_with_costs 242 | 243 | def apply_costs(self, row, fixed_costs_per_trade, short_costs_per_day, beta=0): 244 | 245 | if beta == 0: 246 | beta = row['beta_position'] 247 | 248 | if row['position_during_day'] == 1. 
and row['trading_duration'] != 0: 249 | if beta >= 1: 250 | return fixed_costs_per_trade * (1 / beta) + fixed_costs_per_trade + short_costs_per_day * \ 251 | row['trading_duration'] 252 | elif beta < 1: 253 | return fixed_costs_per_trade * beta + fixed_costs_per_trade + short_costs_per_day * \ 254 | row['trading_duration'] * beta 255 | 256 | elif row['position_during_day'] == -1. and row['trading_duration'] != 0: 257 | if beta >= 1: 258 | return fixed_costs_per_trade * (1 / beta) + fixed_costs_per_trade + short_costs_per_day * \ 259 | row['trading_duration'] * (1 / beta) 260 | elif beta < 1: 261 | return fixed_costs_per_trade * beta + fixed_costs_per_trade + short_costs_per_day * \ 262 | row['trading_duration'] 263 | else: 264 | return 0 265 | 266 | def calculate_balance(self, y, x, beta, positions, trading_durations): 267 | """ 268 | Function to calculate balance during a trading session. 269 | 270 | :param y: y series 271 | :param x: x series 272 | :param beta: pair's cointegration coefficient 273 | :param positions: position during the current day 274 | :param trading_durations: series with trading duration of each trade 275 | :return: balance dataframe containing summary info 276 | """ 277 | y_returns = y.pct_change().fillna(0) * positions 278 | x_returns = -x.pct_change().fillna(0) * positions 279 | 280 | leg_y = [np.nan] * len(y) # initial balance 281 | leg_x = [np.nan] * len(y) # initial balance 282 | pnl_y = [np.nan] * len(y) 283 | pnl_x = [np.nan] * len(y) 284 | account_balance = [np.nan] * len(y) 285 | 286 | # auxiliary series to indicate beginning and end of position 287 | new_positions_idx = positions.diff()[positions.diff() != 0].index.values 288 | end_positions_idx = trading_durations[trading_durations != 0].index.values 289 | position_trigger = pd.Series([0] * len(y), index=y.index, name='position_trigger') 290 | # 2: new position 291 | # 1: new position which only lasts one day 292 | # -1: end of position that did not start on that day 293 | position_trigger[new_positions_idx] = 2. 294 | position_trigger[end_positions_idx] = position_trigger[end_positions_idx] - 1. 
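        # the multiplication by positions.abs() on the next line keeps triggers only on bars where a
        # position is actually held (positions takes values -1, 0 or 1), zeroing the trigger on flat bars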
295 | position_trigger = position_trigger * positions.abs() 296 | position_trigger.name = 'position_trigger' 297 | 298 | for i in range(len(y)): 299 | if i == 0: 300 | pnl_y[0] = 0 301 | pnl_x[0] = 0 302 | account_balance[0] = 1 303 | if beta > 1: 304 | leg_y[0] = 1 / beta 305 | leg_x[0] = 1 306 | else: 307 | leg_y[0] = 1 308 | leg_x[0] = beta 309 | elif positions[i] == 0: 310 | pnl_y[i] = 0 311 | pnl_x[i] = 0 312 | leg_y[i] = leg_y[i - 1] 313 | leg_x[i] = leg_x[i - 1] 314 | account_balance[i] = account_balance[i-1] 315 | else: 316 | # add costs 317 | if position_trigger[i] == 1: 318 | # every new position invest initial 1$ + acc in X + acc in Y 319 | position_investment = account_balance[i-1] 320 | # if new position, that most legs contain now the overall invested 321 | if beta > 1: 322 | pnl_y[i] = y_returns[i] * position_investment * (1 / beta) 323 | pnl_x[i] = x_returns[i] * position_investment 324 | else: 325 | pnl_y[i] = y_returns[i] * position_investment 326 | pnl_x[i] = x_returns[i] * position_investment * beta 327 | 328 | # update legs 329 | if beta > 1: 330 | if positions[i] == 1: 331 | leg_y[i] = position_investment * (1 / beta) + pnl_y[i] 332 | leg_x[i] = position_investment - pnl_x[i] 333 | else: 334 | leg_y[i] = position_investment * (1 / beta) - pnl_y[i] 335 | leg_x[i] = position_investment + pnl_x[i] 336 | else: 337 | if positions[i] == 1: 338 | leg_y[i] = position_investment + pnl_y[i] 339 | leg_x[i] = position_investment * beta - pnl_x[i] 340 | else: 341 | leg_y[i] = position_investment - pnl_y[i] 342 | leg_x[i] = position_investment * beta + pnl_x[i] 343 | 344 | # commission costs + market impact costs + short rental costs 345 | if beta >= 1: 346 | pnl_y[i] = pnl_y[i] - 0.0028*(1/beta)*position_investment # add commission + bid ask spread 347 | pnl_x[i] = pnl_x[i] - 0.0028*position_investment # add commission + bid ask spread 348 | if positions[i] == 1: 349 | pnl_x[i] = pnl_x[i] - 1 * (0.01 / 252)*position_investment 350 | elif positions[i] == -1: 351 | pnl_y[i] = pnl_y[i] - 1 * (0.01 / 252)*(1/beta)*position_investment 352 | elif beta < 1: 353 | pnl_y[i] = pnl_y[i] - 0.0028 * position_investment # add commission + bid ask spread 354 | pnl_x[i] = pnl_x[i] - 0.0028 * beta * position_investment # add commission + bid ask spread 355 | if positions[i] == 1: 356 | pnl_x[i] = pnl_x[i] - 1 * (0.01 / 252)*beta*position_investment 357 | elif positions[i] == -1: 358 | pnl_y[i] = pnl_y[i] - 1 * (0.01 / 252)*position_investment 359 | # update balance 360 | account_balance[i] = account_balance[i-1] + pnl_x[i] + pnl_y[i] 361 | 362 | elif position_trigger[i] == 2: 363 | # every new position invest initial 1$ + acc in X + acc in Y 364 | position_investment = account_balance[i-1] 365 | # if new position, that most legs contain now the overall invested 366 | if beta > 1: 367 | pnl_y[i] = y_returns[i] * position_investment * (1 / beta) 368 | pnl_x[i] = x_returns[i] * position_investment 369 | else: 370 | pnl_y[i] = y_returns[i] * position_investment 371 | pnl_x[i] = x_returns[i] * position_investment * beta 372 | 373 | # update legs 374 | if beta > 1: 375 | if positions[i] == 1: 376 | leg_y[i] = position_investment * (1 / beta) + pnl_y[i] 377 | leg_x[i] = position_investment - pnl_x[i] 378 | else: 379 | leg_y[i] = position_investment * (1 / beta) - pnl_y[i] 380 | leg_x[i] = position_investment + pnl_x[i] 381 | else: 382 | if positions[i] == 1: 383 | leg_y[i] = position_investment + pnl_y[i] 384 | leg_x[i] = position_investment * beta - pnl_x[i] 385 | else: 386 | leg_y[i] = 
position_investment - pnl_y[i] 387 | leg_x[i] = position_investment * beta + pnl_x[i] 388 | 389 | # commission costs + market impact costs + short rental costs 390 | if beta >= 1: 391 | pnl_y[i] = pnl_y[i] - 0.0028*(1/beta)*position_investment # add commission + bid ask spread 392 | pnl_x[i] = pnl_x[i] - 0.0028*position_investment # add commission + bid ask spread 393 | elif beta < 1: 394 | pnl_y[i] = pnl_y[i] - 0.0028 * position_investment # add commission + bid ask spread 395 | pnl_x[i] = pnl_x[i] - 0.0028 * beta * position_investment # add commission + bid ask spread 396 | # update balance 397 | account_balance[i] = account_balance[i - 1] + pnl_x[i] + pnl_y[i] 398 | 399 | else: 400 | # calculate trade pnl 401 | pnl_y[i] = y_returns[i] * leg_y[i - 1] 402 | pnl_x[i] = x_returns[i] * leg_x[i - 1] 403 | 404 | # update legs 405 | if positions[i] == 1: 406 | leg_y[i] = leg_y[i - 1] + pnl_y[i] 407 | leg_x[i] = leg_x[i - 1] - pnl_x[i] 408 | else: 409 | leg_y[i] = leg_y[i - 1] - pnl_y[i] 410 | leg_x[i] = leg_x[i - 1] + pnl_x[i] 411 | 412 | # add short costs 413 | if position_trigger[i] == -1: 414 | if positions[i]==1: 415 | if beta > 1: 416 | pnl_x[i] = pnl_x[i] - trading_durations[i] * (0.01 / 252) * position_investment 417 | elif beta < 1: 418 | pnl_x[i] = pnl_x[i] - trading_durations[i] * (0.01 / 252)*beta*position_investment 419 | elif positions[i]==-1: 420 | if beta > 1: 421 | pnl_y[i] = pnl_y[i] - trading_durations[i] * (0.01 / 252)*(1/beta)*position_investment 422 | elif beta < 1: 423 | pnl_y[i] = pnl_y[i] - trading_durations[i] * (0.01 / 252) * position_investment 424 | 425 | # update balance 426 | account_balance[i] = account_balance[i - 1] + pnl_x[i] + pnl_y[i] 427 | pnl = [pnl_y[i] + pnl_x[i] for i in range(len(y))] 428 | 429 | # join everything in dataframe 430 | balance = pd.Series(data=account_balance, index=y.index, name='account_balance') 431 | returns = balance.pct_change().fillna(0) 432 | returns.name = 'returns' 433 | pnl = pd.Series(data=pnl, index=y.index, name='pnl') 434 | pnl_y = pd.Series(data=pnl_y, index=y.index, name='pnl_y') 435 | pnl_x = pd.Series(data=pnl_x, index=y.index, name='pnl_x') 436 | leg_y = pd.Series(data=leg_y, index=y.index, name='leg_y') 437 | leg_x = pd.Series(data=leg_x, index=y.index, name='leg_x') 438 | balance_summary = pd.concat( 439 | [balance, pnl, pnl_y, pnl_x, leg_y, leg_x, returns, position_trigger, positions, y, x, 440 | trading_durations], axis=1) 441 | 442 | return balance_summary 443 | 444 | def calculate_sharpe_ratio(self, n_years, n_days, ret): 445 | """ 446 | Calculate sharpe ratio for one asset only. 447 | As an estimate of the expected value use the yearly return. 
448 | :param n_years: number of years being considered 449 | :param n_days: number of trading days per year 450 | :param ret: array containing returns per timestep 451 | """ 452 | rf = {2014: 0.00033, 2015: 0.00053, 2016: 0.0032, 2017: 0.0093, 2018: 0.0194} 453 | time_in_market = n_years * n_days 454 | daily_index = ret.resample('D').last().dropna().index 455 | daily_ret = (ret + 1).resample('D').prod() - 1 456 | # remove added days from resample 457 | daily_ret = daily_ret.loc[daily_index] 458 | 459 | annualized_ret = (np.cumprod(1 + ret) - 1)[-1] 460 | year = ret.index[0].year 461 | if year in rf.keys(): 462 | sharpe_ratio = (annualized_ret-rf[year]) / (np.std(daily_ret)*np.sqrt(time_in_market)) 463 | else: 464 | print('Not considering risk-free rate') 465 | sharpe_ratio = annualized_ret / (np.std(daily_ret)*np.sqrt(time_in_market)) 466 | 467 | return sharpe_ratio 468 | 469 | def calculate_portfolio_sharpe_ratio(self, performance, pairs): 470 | """ 471 | Calculates the sharpe ratio based on the account balance of the total portfolio 472 | 473 | :param performance: df with summary statistics from strategy 474 | :param pairs: list with pairs 475 | """ 476 | # calculate total daily account balance & df with returns 477 | total_account_balance = performance[0][1]['account_balance'].resample('D').last().dropna() 478 | portfolio_returns = total_account_balance.pct_change().fillna(0) 479 | for index in range(1, len(pairs)): 480 | pair_balance = performance[index][1]['account_balance'].resample('D').last().dropna() 481 | total_account_balance = total_account_balance + pair_balance 482 | portfolio_returns = pd.concat([portfolio_returns, pair_balance.pct_change().fillna(0)], axis=1) 483 | 484 | # add first day with initial balance 485 | total_account_balance = pd.Series(data=[len(pairs)], 486 | index=[total_account_balance.index[0] - timedelta(days=1)]).append( 487 | total_account_balance) 488 | 489 | # calculate portfolio volatility 490 | weights = np.array([1 / len(pairs)] * len(pairs)) 491 | vol = np.sqrt(np.dot(weights.T, np.dot(portfolio_returns.cov(), weights))) 492 | 493 | # calculate sharpe ratio 494 | rf = {2014: 0.00033, 2015: 0.00053, 2016: 0.0032, 2017: 0.0093, 2018: 0.0194} 495 | annualized_ret = (total_account_balance[-1]-len(pairs))/len(pairs) 496 | year = total_account_balance.index[-1].year 497 | if year in rf.keys(): 498 | # assuming iid return's distributio, sr may be calculated as: 499 | sharpe_ratio = (annualized_ret - rf[year]) / (vol*np.sqrt(252)) 500 | print('Sharpe Ratio assumming IID returns: ',sharpe_ratio) 501 | print('Autocorrelation: ', total_account_balance.pct_change().fillna(0).autocorr(lag=1)) 502 | # accounting for non-zero autocorrelatio, daily sr should be calculated as: 503 | # the daily sharpe ratio is then multiplied by the annualization factor proposed by the paper: The 504 | # Statistics of Sharpe Ratios by Andrew W Lo 505 | annualized_ret = total_account_balance.pct_change().fillna(0).mean() 506 | rf_daily = (1+rf[year])**(1/252)-1 507 | sharpe_ratio = (annualized_ret-rf_daily) /vol 508 | print('Daily Sharpe Ratio', sharpe_ratio) 509 | else: 510 | print('Not considering risk-free rate') 511 | sharpe_ratio = annualized_ret / (vol*np.sqrt(252)) 512 | 513 | return sharpe_ratio 514 | 515 | def calculate_maximum_drawdown(self, account_balance): 516 | """ 517 | Function to calculate maximum drawdown w.r.t portfolio balance. 
518 | 519 | source: https://stackoverflow.com/questions/22607324/start-end-and-duration-of-maximum-drawdown-in-python 520 | """ 521 | 522 | # first calculate total drawdown period 523 | account_balance_drawdowns = account_balance.resample('D').last().dropna().diff().fillna(0).apply(lambda row: 0 if row >= 0 else 1) 524 | total_dd_duration = account_balance_drawdowns.sum() 525 | print('Total Drawdown Days: {} days'.format(total_dd_duration)) 526 | 527 | xs = np.asarray(account_balance.values) 528 | 529 | i = np.argmax(np.maximum.accumulate(xs) - xs) # end of the period 530 | if i == 0: 531 | plt.plot(xs) 532 | return 0 533 | else: 534 | j = np.argmax(xs[:i]) # start of period 535 | plt.figure(figsize=(10,7)) 536 | plt.grid() 537 | plt.plot(xs, label='Total Account Balance') 538 | dates = account_balance.resample('BMS').first().dropna().index.date 539 | xi = np.arange(0, len(account_balance), len(account_balance)/12) 540 | plt.xticks(xi, dates, rotation=50) 541 | plt.xlim(0, len(account_balance)) 542 | plt.plot([i, j], [xs[i], xs[j]], 'o', color='Red', markersize=10) 543 | plt.xlabel('Date', size=12) 544 | plt.ylabel('Capital($)', size=12) 545 | plt.legend() 546 | 547 | max_dd_period = round((i - j) / 78) 548 | print('Max DD period: {} days'.format(max_dd_period)) 549 | #print('Max DD period: {} days'.format((account_balance.index[i]-account_balance.index[j]).days)) 550 | 551 | return (xs[i]-xs[j])/xs[j] * 100, max_dd_period, total_dd_duration 552 | 553 | def calculate_position_returns(self, y, x, beta, positions): 554 | """ 555 | This method receives the series with the positions and calculate the return at the end of each position, not 556 | yet accounting for costs 557 | 558 | Y: price of ETF Y 559 | X: price of ETF X 560 | beta: cointegration ratio 561 | positions: array indicating position to enter in next day 562 | """ 563 | # get copy of series 564 | y = y.copy() 565 | y.name = 'y' 566 | x = x.copy() 567 | x.name = 'x' 568 | 569 | # positions preceed the day when the position is actually entered! 570 | # get indices before entering position 571 | new_positions = positions.diff()[positions.diff() != 0].index.values 572 | # create variable for signalizing end of position 573 | end_position = pd.Series(data=[0] * len(y), index=y.index, name='end_position') 574 | end_position[new_positions] = 1. 575 | # add end position if trading period is over and position is open 576 | if positions[-1] != 0: 577 | end_position[-1] = 1. 
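        # (a position still open on the last bar is marked as closed there, so its return is realised
        # in the summary instead of being dropped)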
578 | 579 | # get corresponding X and Y 580 | y_entry = pd.Series(data=[np.nan] * len(y), index=y.index, name='y_entry') 581 | x_entry = pd.Series(data=[np.nan] * len(y), index=y.index, name='x_entry') 582 | y_entry[new_positions] = y[new_positions] 583 | x_entry[new_positions] = x[new_positions] 584 | y_entry = y_entry.shift().fillna(method='ffill') 585 | x_entry = x_entry.shift().fillna(method='ffill') 586 | 587 | # name positions series 588 | positions.name = 'positions' 589 | 590 | # apply returns per trade 591 | # each row contain all the parameters to be applied in that position 592 | df = pd.concat([y, x, positions.shift().fillna(0), y_entry, x_entry, end_position], axis=1) 593 | returns = df.apply(lambda row: self.return_per_position(row, beta), axis=1).fillna(0) 594 | cum_returns = np.cumprod(returns + 1) - 1 595 | df['ret'] = returns 596 | returns.name = 'position_return' 597 | 598 | return returns, cum_returns, df 599 | 600 | def return_per_position(self, row, beta=None, sliding=False): 601 | if row['end_position'] != 0: 602 | y_returns = (row['y'] - row['y_entry']) / row['y_entry'] 603 | x_returns = (row['x'] - row['x_entry']) / row['x_entry'] 604 | if sliding: 605 | beta = row['beta_position'] 606 | if beta > 1.: 607 | return ((1 / beta) * y_returns - 1 * x_returns) * row['positions'] 608 | else: 609 | return (y_returns - beta * x_returns) * row['positions'] 610 | else: 611 | return 0 612 | 613 | def calculate_metrics(self, cum_returns, n_years): 614 | """ 615 | Calculate common metrics on average over all pairs. 616 | :param cum_returns: array with cumulative returns of every pair 617 | :param n_years: numbers of yers of the trading strategy 618 | :return: average average total roi 619 | :return: average annual roi 620 | :return: percentage of pairs with positive returns 621 | """ 622 | # use below for fully invested capital: 623 | # cum_returns_filtered = [cum for cum in cum_returns if cum != 0] 624 | # or use below for commited capital: 625 | cum_returns_filtered = cum_returns 626 | 627 | avg_total_roi = np.mean(cum_returns_filtered) 628 | 629 | avg_annual_roi = ((1 + (avg_total_roi / 100)) ** (1 / float(n_years)) - 1) * 100 630 | print('Annual ROI: ', avg_annual_roi) 631 | 632 | cum_returns_filtered = np.asarray(cum_returns_filtered) 633 | positive_pct = len(cum_returns_filtered[cum_returns_filtered > 0]) * 100 / len(cum_returns_filtered) 634 | print('{} % of the pairs had positive returns'.format(positive_pct)) 635 | 636 | return avg_total_roi, avg_annual_roi, positive_pct 637 | 638 | def summarize_results(self, sharpe_results, cum_returns, performance, total_pairs, ticker_segment_dict, n_years): 639 | """ 640 | This function summarizes interesting metrics to include in the final output 641 | :param sharpe_results: array containing sharpe results for each pair 642 | :param cum_returns: array containing cum returns for each pair 643 | :param performance: df containing a summary of each pair's trade 644 | :param total_pairs: list containing all the identified pairs 645 | :param ticker_segment_dict: dict containing segment for each ticker 646 | :param n_years: number of years the strategy is running 647 | :return: dictionary with metrics of interest 648 | """ 649 | 650 | avg_total_roi, avg_annual_roi, positive_pct = self.calculate_metrics(cum_returns, n_years) 651 | 652 | portfolio_sharpe_ratio = self.calculate_portfolio_sharpe_ratio(performance, total_pairs) 653 | 654 | sorted_indices = np.flip(np.argsort(sharpe_results), axis=0) 655 | # print(sorted_indices) 656 | # 
initialize list of lists 657 | data = [] 658 | for index in sorted_indices: 659 | # get number of positive and negative positions 660 | position_returns = performance[index][1]['position_ret_with_costs'] 661 | positive_positions = len(position_returns[position_returns > 0]) 662 | negative_positions = len(position_returns[position_returns < 0]) 663 | data.append([total_pairs[index][0], 664 | ticker_segment_dict[total_pairs[index][0]], 665 | total_pairs[index][1], 666 | ticker_segment_dict[total_pairs[index][1]], 667 | total_pairs[index][2]['t_statistic'], 668 | total_pairs[index][2]['p_value'], 669 | total_pairs[index][2]['zero_cross'], 670 | total_pairs[index][2]['half_life'], 671 | total_pairs[index][2]['hurst_exponent'], 672 | positive_positions, 673 | negative_positions, 674 | sharpe_results[index] 675 | ]) 676 | 677 | # Create the pandas DataFrame 678 | pairs_df = pd.DataFrame(data, columns=['Leg1', 'Leg1_Segmt', 'Leg2', 'Leg2_Segmt', 't_statistic', 'p_value', 679 | 'zero_cross', 'half_life', 'hurst_exponent', 'positive_trades', 680 | 'negative_trades', 'sharpe_result']) 681 | 682 | pairs_df['positive_trades_per_pair_pct'] = (pairs_df['positive_trades']) / \ 683 | (pairs_df['positive_trades'] + pairs_df['negative_trades']) * 100 684 | 685 | print('Total number of trades: ', pairs_df.positive_trades.sum() + pairs_df.negative_trades.sum()) 686 | print('Positive trades: ', pairs_df.positive_trades.sum()) 687 | print('Negative trades: ', pairs_df.negative_trades.sum()) 688 | 689 | avg_positive_trades_per_pair_pct = pairs_df['positive_trades_per_pair_pct'].mean() 690 | 691 | results = {'n_pairs': len(sharpe_results), 692 | 'portfolio_sharpe_ratio': portfolio_sharpe_ratio, 693 | 'avg_total_roi': avg_total_roi, 694 | 'avg_annual_roi': avg_annual_roi, 695 | 'pct_positive_trades_per_pair': avg_positive_trades_per_pair_pct, 696 | 'pct_pairs_with_positive_results': positive_pct, 697 | 'avg_half_life': pairs_df['half_life'].mean(), 698 | 'avg_hurst_exponent': pairs_df['hurst_exponent'].mean()} 699 | 700 | # Drawdown info 701 | total_account_balance = performance[0][1]['account_balance'] 702 | for index in range(1, len(total_pairs)): 703 | total_account_balance = total_account_balance + performance[index][1]['account_balance'] 704 | total_account_balance = total_account_balance.fillna(method='ffill') 705 | max_dd, max_dd_duration, total_dd_duration = self.calculate_maximum_drawdown(total_account_balance) 706 | print('Maximum drawdown of portfolio: {:.2f}%'.format(max_dd)) 707 | 708 | return results, pairs_df 709 | 710 | -------------------------------------------------------------------------------- /code_organization.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simaomsarmento/PairsTrading/0781877c75673ceca3c61704eee9c9dca9d37b6b/code_organization.pdf -------------------------------------------------------------------------------- /data/link_to_data.txt: -------------------------------------------------------------------------------- 1 | https://www.dropbox.com/sh/0w3vu1eylrfnkch/AABttIlDf64MmVf5CP1Qy-XOa?dl=0 -------------------------------------------------------------------------------- /drafts/config/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "ticker_segment_dict": "data/etfs/pickle/ticker_segment_dict.pickle", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2009", 6 | "training_final_date": "31-12-2017", 7 | 
"testing_initial_date": "01-01-2018", 8 | "testing_final_date": "31-12-2018", 9 | "nan_threshold": 0 10 | }, 11 | "PCA": { 12 | "N_COMPONENTS": 3, 13 | }, 14 | "clustering": { 15 | "algo": "DBSCAN", 16 | "epsilon": 0.4, 17 | "min_samples": 2 18 | }, 19 | "pair_restrictions": { 20 | "min_half_life": 5, 21 | "min_zero_crossings": 120, 22 | "p_value_threshold": 0.05, 23 | "hurst_threshold": 0.5 24 | }, 25 | "trading": { 26 | "strategy": "kalman", 27 | "lookback_multiplier": 2, 28 | "entry_multiplier": 2, 29 | "exit_multiplier": 0 30 | }, 31 | "trading_filter": { 32 | "active": 0, 33 | "name": "correlation", 34 | "filter_lookback_multiplier": 2, 35 | "lag": 1, 36 | "diff_threshold": 0 37 | }, 38 | "mlp": { 39 | "n_in": 5, 40 | "n_out": 1, 41 | "epochs": 200, 42 | "hidden_nodes":5, 43 | "loss_fct": "mse", 44 | "optimizer": "adam", 45 | "train_val_split": "2016-01-01" 46 | }, 47 | "output": { 48 | "filename": "summary/results.xlsx" 49 | } 50 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2000_2018.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/commodity_ETFs_long.xlsx", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "02-02-2000", 6 | "training_final_date": "01-01-2015", 7 | "testing_initial_date": "01-01-2015", 8 | "testing_final_date": "01-01-2018", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2008_2018.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/commodity_ETFs_long.xlsx", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "02-02-2008", 6 | "training_final_date": "01-01-2015", 7 | "testing_initial_date": "01-01-2015", 8 | "testing_final_date": "01-01-2018", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2009_2015.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2009", 6 | "training_final_date": "01-01-2013", 7 | "testing_initial_date": "01-01-2013", 8 | "testing_final_date": "01-01-2015", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2009_2017.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2009", 6 | "training_final_date": "01-01-2014", 7 | "testing_initial_date": "01-01-2014", 8 | "testing_final_date": "01-01-2017", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2010_2016.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2010", 6 | "training_final_date": "01-01-2014", 7 | "testing_initial_date": "01-01-2014", 8 | "testing_final_date": "01-01-2016", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } 
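The remaining config_commodities_<train-start>_<test-end>.json files below follow the same schema and differ essentially in the dataset date windows (config_commodities_2010_2019.json is a trimmed-down exception). As a minimal sketch, not part of the repository, the windows of the file just above could be read back like this:

    import json

    # illustrative only: path relative to the repository root
    with open("drafts/config/config_commodities_2010_2016.json") as f:
        cfg = json.load(f)

    ds = cfg["dataset"]
    print(ds["training_initial_date"], ds["training_final_date"])  # 01-01-2010 01-01-2014
    print(ds["testing_initial_date"], ds["testing_final_date"])    # 01-01-2014 01-01-2016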
-------------------------------------------------------------------------------- /drafts/config/config_commodities_2010_2018.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2010", 6 | "training_final_date": "01-01-2015", 7 | "testing_initial_date": "01-01-2015", 8 | "testing_final_date": "01-01-2018", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2010_2019.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "ticker_segment_dict": "../data/etfs/pickle/ticker_segment_dict.pickle", 4 | }, 5 | "pair_restrictions": { 6 | "min_half_life": 5, 7 | "min_zero_crossings": 12, 8 | "p_value_threshold": 0.05, 9 | "hurst_threshold": 0.5 10 | }, 11 | "trading": { 12 | "strategy": "bollinger", 13 | "lookback_multiplier": 2, 14 | "entry_multiplier": 1, 15 | "exit_multiplier": 0 16 | }, 17 | "trading_filter": { 18 | "active": 0, 19 | "name": "correlation", 20 | "filter_lookback_multiplier": 2, 21 | "lag": 1, 22 | "diff_threshold": 0 23 | }, 24 | "output": { 25 | "filename": "summary/results.xlsx" 26 | } 27 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2011_2015.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2011", 6 | "training_final_date": "01-01-2014", 7 | "testing_initial_date": "01-01-2014", 8 | "testing_final_date": "01-01-2015", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2011_2017.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": 
"data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2011", 6 | "training_final_date": "01-01-2015", 7 | "testing_initial_date": "01-01-2015", 8 | "testing_final_date": "01-01-2017", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2011_2019.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2011", 6 | "training_final_date": "01-01-2016", 7 | "testing_initial_date": "01-01-2016", 8 | "testing_final_date": "01-01-2019", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2012_2016.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2012", 6 | "training_final_date": "01-01-2015", 7 | "testing_initial_date": "01-01-2015", 8 | "testing_final_date": "01-01-2016", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2012_2018.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2012", 6 | "training_final_date": "01-01-2016", 7 | "testing_initial_date": "01-01-2016", 8 | "testing_final_date": "01-01-2018", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2013_2017.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2013", 6 | "training_final_date": "01-01-2016", 7 | "testing_initial_date": "01-01-2016", 8 | "testing_final_date": "01-01-2017", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2013_2019.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2013", 6 | "training_final_date": "01-01-2017", 7 | "testing_initial_date": "01-01-2017", 8 | "testing_final_date": "01-01-2019", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } 
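Seen side by side, these configuration files differ essentially only in their date windows: in each of the year-ranged files shown here the test period starts exactly where the training period ends, and the file name encodes the training start year and the test end year. config_commodities_pr.json (further down) is the exception, with a test window that overlaps the training window, and config_commodities_2010_2019.json's dataset block keeps only a ticker_segment_dict entry and, as written, ends it with a trailing comma that standard JSON parsers reject. The date strings appear to be day-first, as the '31-12-2017' split in drafts/mlp_trainer.py suggests. The short sketch below is illustrative only (not part of the repository) and simply lists the windows:

import glob
import json

for path in sorted(glob.glob('drafts/config/config_commodities_*.json')):
    with open(path, 'r') as f:
        try:
            cfg = json.load(f)
        except json.JSONDecodeError:
            # e.g. config_commodities_2010_2019.json, whose trailing comma is not valid JSON
            print(path, 'could not be parsed')
            continue
    ds = cfg.get('dataset', {})
    if 'training_initial_date' in ds:
        print(path,
              'train', ds['training_initial_date'], '->', ds['training_final_date'],
              '| test', ds['testing_initial_date'], '->', ds['testing_final_date'])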
-------------------------------------------------------------------------------- /drafts/config/config_commodities_2014_2018.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-01-2014", 6 | "training_final_date": "01-01-2017", 7 | "testing_initial_date": "01-01-2017", 8 | "testing_final_date": "01-01-2018", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": [10,15] 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | "strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_2015_2019.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/pickle/commodity_ETFs_long_updated", 4 | "ticker_segment_dict": "data/etfs/pickle/ticker_segment_dict.pickle", 5 | "ticker_attribute": "Ticker", 6 | "training_initial_date": "01-01-2015", 7 | "training_final_date": "01-01-2018", 8 | "testing_initial_date": "01-01-2018", 9 | "testing_final_date": "01-01-2019", 10 | "data_source": "yahoo", 11 | "nan_threshold": 0 12 | }, 13 | "PCA": { 14 | "N_COMPONENTS": [10,15] 15 | }, 16 | "clustering": { 17 | "algo": "DBSCAN", 18 | "epsilon": 0.4, 19 | "min_samples": 2 20 | }, 21 | "pair_restrictions": { 22 | "min_half_life": 5, 23 | "min_zero_crossings": 12, 24 | "p_value_threshold": 0.05, 25 | "hurst_threshold": 0.5 26 | }, 27 | "trading": { 28 | "strategy": "kalman", 29 | "lookback_multiplier": 2, 30 | "entry_multiplier": 1, 31 | "exit_multiplier": 0 32 | }, 33 | "trading_filter": { 34 | "active": 0, 35 | "name": "correlation", 36 | "filter_lookback_multiplier": 2, 37 | "lag": 1, 38 | "diff_threshold": 0 39 | }, 40 | "mlp": { 41 | "n_in": 5, 42 | "n_out": 1, 43 | "epochs": 200, 44 | "hidden_nodes":5, 45 | "loss_fct": "mse", 46 | "optimizer": "adam", 47 | "train_val_split": "2016-01-01" 48 | }, 49 | "output": { 50 | "filename": "summary/results.xlsx" 51 | } 52 | } -------------------------------------------------------------------------------- /drafts/config/config_commodities_pr.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset": { 3 | "path": "data/etfs/commodity_ETFs_long.xlsx", 4 | "ticker_attribute": "Ticker", 5 | "training_initial_date": "01-06-2012", 6 | "training_final_date": "01-01-2018", 7 | "testing_initial_date": "01-06-2017", 8 | "testing_final_date": "01-01-2018", 9 | "data_source": "yahoo", 10 | "nan_threshold": 0 11 | }, 12 | "PCA": { 13 | "N_COMPONENTS": 15 14 | }, 15 | "clustering": { 16 | "algo": "DBSCAN", 17 | "epsilon": 0.4, 18 | "min_samples": 2 19 | }, 20 | "pair_restrictions": { 21 | "min_half_life": 5, 22 | "min_zero_crossings": 12, 23 | "p_value_threshold": 0.05, 24 | "hurst_threshold": 0.5 25 | }, 26 | "trading": { 27 | 
"strategy": "kalman", 28 | "lookback_multiplier": 2, 29 | "entry_multiplier": 1, 30 | "exit_multiplier": 0 31 | }, 32 | "trading_filter": { 33 | "active": 0, 34 | "name": "correlation", 35 | "filter_lookback_multiplier": 2, 36 | "lag": 1, 37 | "diff_threshold": 0 38 | }, 39 | "output": { 40 | "filename": "summary/results.xlsx" 41 | } 42 | } -------------------------------------------------------------------------------- /drafts/draft.py: -------------------------------------------------------------------------------- 1 | # This file contains old functions, not being used anymore but might turn out helpful at some point in time 2 | 3 | # trader.py 4 | def bollinger_bands(self, y, x, lookback, entry_multiplier=1, exit_multiplier=0): 5 | """ 6 | This function implements a pairs trading strategy based 7 | on bollinger bands. 8 | Source: Example 3.2 EC's book 9 | : Y & X: time series composing the spread 10 | : lookback : Lookback period 11 | : entry_multiplier : defines the multiple of std deviation used to enter a position 12 | : exit_multiplier: defines the multiple of std deviation used to exit a position 13 | """ 14 | # print("Warning: don't forget lookback (halflife) must be at least 3.") 15 | 16 | entryZscore = entry_multiplier 17 | exitZscore = exit_multiplier 18 | 19 | # obtain zscore 20 | zscore, rolling_beta = self.rolling_zscore(y, x, lookback) 21 | zscore_array = np.asarray(zscore) 22 | 23 | # find long and short indices 24 | numUnitsLong = pd.Series([np.nan for i in range(len(y))]) 25 | numUnitsLong.iloc[0] = 0. 26 | long_entries = self.cross_threshold(zscore_array, -entryZscore, 'down', 'entry') 27 | numUnitsLong[long_entries] = 1.0 28 | long_exits = self.cross_threshold(zscore_array, -exitZscore, 'up') 29 | numUnitsLong[long_exits] = 0.0 30 | numUnitsLong = numUnitsLong.fillna(method='ffill') 31 | numUnitsLong.index = zscore.index 32 | 33 | numUnitsShort = pd.Series([np.nan for i in range(len(y))]) 34 | numUnitsShort.iloc[0] = 0. 
35 | short_entries = self.cross_threshold(zscore_array, entryZscore, 'up', 'entry') 36 | numUnitsShort[short_entries] = -1.0 37 | short_exits = self.cross_threshold(zscore_array, exitZscore, 'down') 38 | numUnitsShort[short_exits] = 0.0 39 | numUnitsShort = numUnitsShort.fillna(method='ffill') 40 | numUnitsShort.index = zscore.index 41 | 42 | # concatenate all positions 43 | numUnits = numUnitsShort + numUnitsLong 44 | numUnits = pd.Series(data=numUnits.values, index=y.index, name='numUnits') 45 | 46 | # position durations 47 | trading_durations = self.add_trading_duration(pd.DataFrame(numUnits, index=y.index)) 48 | 49 | beta = rolling_beta.copy() 50 | position_ret, _, ret_summary = self.calculate_sliding_position_returns(y, x, beta, numUnits) 51 | 52 | # get trade summary 53 | rolling_spread = y - rolling_beta * x 54 | 55 | # All series contain Date as index 56 | series_to_include = [(position_ret, 'position_return'), 57 | (y, y.name), 58 | (x, x.name), 59 | (rolling_beta, 'beta_position'), 60 | (rolling_spread, 'spread'), 61 | (zscore, 'zscore'), 62 | (numUnits, 'numUnits'), 63 | (trading_durations, 'trading_duration')] 64 | summary = self.trade_summary(series_to_include) 65 | 66 | return summary, ret_summary 67 | 68 | def bollinger_bands_ec(self, Y, X, lookback, entry_multiplier=1, exit_multiplier=0): 69 | df = pd.concat([Y, X], axis=1) 70 | df = df.reset_index() 71 | df['hedgeRatio'] = np.nan 72 | for t in range(lookback, len(df)): 73 | x = np.array(X)[t - lookback:t] 74 | x = sm.add_constant(x) 75 | y = np.array(Y)[t - lookback:t] 76 | df.loc[t, 'hedgeRatio'] = sm.OLS(y, x).fit().params[1] 77 | 78 | cols = [X.name, Y.name] 79 | 80 | yport = np.ones(df[cols].shape); 81 | yport[:, 0] = -df['hedgeRatio'] 82 | yport = yport * df[cols] 83 | 84 | yport = yport[X.name] + yport[Y.name] 85 | data_mean = pd.rolling_mean(yport, window=lookback) 86 | data_std = pd.rolling_std(yport, window=lookback) 87 | zScore = (yport - data_mean) / data_std 88 | 89 | entryZscore = entry_multiplier 90 | exitZscore = exit_multiplier 91 | 92 | longsEntry = zScore < -entryZscore 93 | longsExit = zScore > -exitZscore 94 | shortsEntry = zScore > entryZscore 95 | shortsExit = zScore < exitZscore 96 | 97 | numUnitsLong = pd.Series([np.nan for i in range(len(df))]) 98 | numUnitsShort = pd.Series([np.nan for i in range(len(df))]) 99 | numUnitsLong[0] = 0. 100 | numUnitsShort[0] = 0. 101 | 102 | numUnitsLong[longsEntry] = 1.0 103 | numUnitsLong[longsExit] = 0.0 104 | numUnitsLong = numUnitsLong.fillna(method='ffill') 105 | 106 | numUnitsShort[shortsEntry] = -1.0 107 | numUnitsShort[shortsExit] = 0.0 108 | numUnitsShort = numUnitsShort.fillna(method='ffill') 109 | df['numUnits'] = numUnitsShort + numUnitsLong 110 | 111 | tmp1 = np.ones(df[cols].shape) * np.array([df['numUnits']]).T 112 | tmp2 = np.ones(df[cols].shape) 113 | tmp2[:, 0] = -df['hedgeRatio'] 114 | positions = pd.DataFrame(tmp1 * tmp2 * df[cols]).fillna(0) 115 | pnl = positions.shift(1) * (df[cols] - df[cols].shift(1)) / df[cols].shift(1) 116 | pnl = pnl.sum(axis=1) 117 | ret = pnl / np.sum(np.abs(positions.shift(1)), axis=1) 118 | ret = ret.fillna(0) 119 | apr = ((np.prod(1. + ret)) ** (252. / len(ret))) - 1 120 | print('APR', apr) 121 | if np.std(ret) == 0: 122 | sharpe = 0 123 | else: 124 | sharpe = np.sqrt(252.) * np.mean(ret) / np.std(ret) # should the mean include moments of no holding? 
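    # Note: both the APR and the Sharpe ratio above annualize with a factor of 252, i.e. they
    # assume one return observation per trading day. If this draft were run on intraday bars
    # (for example the 78 five-minute bars per day implied by the 240 * 78 divisor used in
    # drafts/mlp_trainer.py), the annualization constant would need to become bars-per-year
    # instead, e.g. np.sqrt(252. * 78) for the Sharpe ratio.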
125 | print('Sharpe', sharpe) 126 | 127 | # checking results 128 | X = X.reset_index(drop=True) 129 | Y = Y.reset_index() 130 | pnl.name = 'pnl'; 131 | rolling_spread = yport 132 | rolling_spread.name = 'spread' 133 | zScore.name = 'zscore' 134 | ret.name = 'ret' 135 | numUnits = df['numUnits']; 136 | numUnits.name = 'position_during_day' 137 | numUnits = numUnits.shift() 138 | summary = pd.concat([pnl, ret, X, Y, rolling_spread, zScore, numUnits], axis=1) 139 | summary.index = summary['Date'] 140 | # new_df = new_df.loc[datetime(2006,7,26):] 141 | summary = summary[36:] 142 | 143 | return pnl, ret, summary, sharpe 144 | 145 | def linear_strategy(self, Y, X, lookback): 146 | """ 147 | This function applies a simple pairs trading strategy based on 148 | Ernie Chan's book: Algoritmic Trading. 149 | 150 | The number of shares for each position is set to be the negative 151 | z-score 152 | """ 153 | 154 | # z-score 155 | zscore = self.rolling_zscore(Y, X, lookback) 156 | numUnits = -zscore 157 | 158 | # Define strategy 159 | # Multiply num positions inversely (-) proportionally to z-score 160 | # ATTENTION: in the book the signals are inverted. The author confirms it here: 161 | # http://epchan.blogspot.com/2013/05/my-new-book-on-algorithmic-trading-is.html 162 | X_positions = numUnits*(-rolling_beta*X) 163 | Y_positions = numUnits*Y 164 | 165 | # P&L:position (-spread value) * percentage of change 166 | # note that pnl is not a percentage. We multiply a position value by a percentage 167 | X_returns = (X - X.shift(periods=-1))/X.shift(periods=-1) 168 | Y_returns = (Y - Y.shift(periods=-1))/Y.shift(periods=-1) 169 | pnl = X_positions.shift(periods=-1)*X_returns + Y_positions.shift(periods=-1)*Y_returns 170 | total_pnl = (X_positions.shift(periods=-1)*(X - X.shift(periods=-1)) + \ 171 | Y_positions.shift(periods=-1)*(Y - Y.shift(periods=-1))).sum() 172 | ret=pnl/(abs(X_positions.shift(periods=-1))+abs(Y_positions).shift(periods=-1)) 173 | 174 | return pnl, total_pnl, ret 175 | 176 | def calculate_returns_no_rebalance(self, y, x, beta, positions): 177 | """ 178 | Y: price of ETF Y 179 | X: price of ETF X 180 | beta: cointegration ratio 181 | positions: array indicating position to enter in next day 182 | """ 183 | 184 | # calculate each leg return 185 | y_returns = y.pct_change().fillna(0); y_returns.name = 'y_returns' 186 | x_returns = x.pct_change().fillna(0); x_returns.name = 'x_returns' 187 | 188 | # positions preceed the day when the position is actually entered! 
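    # (i.e. the position signalled at time t is only held from t+1 onwards, which is why both
    # the betas and the positions are shifted down one row below before returns are applied)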
189 | # get indices before entering position 190 | new_positions = positions.diff()[positions.diff() != 0].index.values 191 | # get corresponding betas 192 | beta_position = pd.Series(data=[np.nan] * len(y), index=y.index, name='beta_position') 193 | beta_position[new_positions] = beta[new_positions] 194 | # fill in between time slots with same beta 195 | beta_position = beta_position.fillna(method='ffill') 196 | # shift betas to match row when position is on 197 | beta_position = beta_position.shift().fillna(0) 198 | # name positions series 199 | positions.name = 'positions' 200 | 201 | # apply returns per trade 202 | # each row contain all the parameters to be applied in that position 203 | df = pd.concat([y_returns, x_returns, beta_position, positions.shift().fillna(0)], axis=1) 204 | returns = df.apply(lambda row: self.return_per_timestep(row), axis=1) 205 | cum_returns = np.cumprod(returns + 1) - 1 206 | 207 | return returns, cum_returns 208 | 209 | def calculate_returns_adapted(self, y, x, beta, positions): 210 | """ 211 | Y: price of ETF Y 212 | X: price of ETF X 213 | beta: cointegration ratio 214 | positions: array indicating when to take a position 215 | """ 216 | # calculate each leg return 217 | y_returns = y.pct_change().fillna(0); 218 | y_returns.name = 'y_returns' 219 | x_returns = x.pct_change().fillna(0); 220 | x_returns.name = 'x_returns' 221 | 222 | # name positions series 223 | positions.name = 'positions' 224 | 225 | # beta must shift from row above 226 | beta_position = beta.shift().fillna(0) 227 | beta_position.name = 'beta_position' 228 | 229 | # apply returns per trade 230 | df = pd.concat([y_returns, x_returns, beta_position, positions], axis=1) 231 | returns = df.apply(lambda row: self.return_per_timestep(row), axis=1) 232 | cum_returns = np.cumprod(returns + 1) - 1 233 | 234 | return returns, cum_returns 235 | 236 | def filter_profitable_pairs(self, sharpe_results, pairs): 237 | """ 238 | This function discards pairs that were not profitable mantaining those for which a positive sharpe ratio was 239 | obtained. 240 | :param sharpe_results: list with sharpe resutls for every pair 241 | :param pairs: list with all pairs and their info 242 | :return: list with profitable pairs and their info 243 | """ 244 | 245 | sharpe_results = np.asarray(sharpe_results) 246 | profitable_pairs_indices = np.argwhere(sharpe_results > 0) 247 | profitable_pairs = [pairs[i] for i in profitable_pairs_indices.flatten()] 248 | 249 | return profitable_pairs 250 | 251 | def rolling_zscore(self, Y, X, lookback): 252 | """ 253 | This function calculates the normalized moving spread 254 | Note that moving average and moving std will have the first 39 values as np.Nan, because 255 | the spread is only valid after 20 points, and the moving averages still need 20 points more 256 | to define its value. 257 | """ 258 | # Calculate moving parameters 259 | # 1.beta: 260 | rolling_beta = self.rolling_regression(Y, X, window=lookback) 261 | # 2.spread: 262 | rolling_spread = Y - rolling_beta * X 263 | # 3.moving average 264 | rolling_avg = rolling_spread.rolling(window=lookback, center=False).mean() 265 | rolling_avg.name = 'spread_' + str(lookback) + 'mavg' 266 | # 4. 
rolling standard deviation 267 | rolling_std = rolling_spread.rolling(window=lookback, center=False).std() 268 | rolling_std.name = 'rolling_std_' + str(lookback) 269 | 270 | # z-score 271 | zscore = (rolling_spread - rolling_avg) / rolling_std 272 | 273 | return zscore, rolling_beta 274 | 275 | def rolling_regression(self, y, x, window): 276 | """ 277 | y and x must be pandas.Series 278 | y is the dependent variable 279 | x is the independent variable 280 | spread: y - b*x 281 | Source: https://stackoverflow.com/questions/37317727/deprecated-rolling-window- 282 | option-in-ols-from-pandas-to-statsmodels/39704930#39704930 283 | """ 284 | # Clean-up 285 | x = x.dropna() 286 | y = y.dropna() 287 | # Trim acc to shortest 288 | if x.index.size > y.index.size: 289 | x = x[y.index] 290 | else: 291 | y = y[x.index] 292 | # Verify enough space 293 | if x.index.size < window: 294 | return None 295 | else: 296 | # Add a constant if needed 297 | X_name = x.name 298 | X = x.to_frame() 299 | X['c'] = 1 300 | # Loop... this can be improved 301 | estimate_data = [] 302 | for i in range(window, len(X)): 303 | X_slice = X.iloc[i - window:i, :] # always index in np as opposed to pandas, much faster 304 | y_slice = y.iloc[i - window:i] 305 | coeff = sm.OLS(y_slice, X_slice).fit() 306 | estimate_data.append(coeff.params[X_name]) 307 | 308 | # Assemble 309 | estimate = pd.Series(data=np.nan, index=x.index[:window]) 310 | # add nan values for first #lookback indices 311 | estimate = estimate.append(pd.Series(data=estimate_data, index=x.index[window:])) 312 | return estimate 313 | 314 | def kalman_filter(self, y, x, entry_multiplier=1.0, exit_multiplier=1.0, stabilizing_threshold=5): 315 | """ 316 | This function implements a Kalman Filter for the estimation of 317 | the moving hedge ratio 318 | :param y: 319 | :param x: 320 | :param entry_multiplier: 321 | :param exit_multiplier: 322 | :param stabilizing_threshold: 323 | :return: 324 | """ 325 | 326 | # store series for late usage 327 | x_series = x.copy() 328 | y_series = y.copy() 329 | 330 | # add constant 331 | x = x.to_frame() 332 | x['intercept'] = 1 333 | 334 | x = np.array(x) 335 | y = np.array(y) 336 | delta = 0.0001 337 | Ve = 0.001 338 | 339 | yhat = np.ones(len(y)) * np.nan 340 | e = np.ones(len(y)) * np.nan 341 | Q = np.ones(len(y)) * np.nan 342 | R = np.zeros((2, 2)) 343 | P = np.zeros((2, 2)) 344 | 345 | beta = np.matrix(np.zeros((2, len(y))) * np.nan) 346 | 347 | Vw = delta / (1 - delta) * np.eye(2) 348 | 349 | beta[:, 0] = 0. 
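    # The loop below is a standard Kalman filter in which the hidden state is the 2-vector
    # [hedge ratio, intercept] and is modelled as a random walk:
    #     state:       beta_t = beta_{t-1} + w_t,    Cov(w_t) = Vw = delta / (1 - delta) * I
    #     observation: y_t    = x_t . beta_t + v_t,  Var(v_t) = Ve
    # Each iteration first predicts (R = P + Vw, yhat = x . beta, Q = x R x' + Ve) and then
    # corrects with the Kalman gain K = R x' / Q. The one-step prediction error e = y - yhat
    # plays the role of the spread, and sqrt(Q) is its estimated standard deviation, which is
    # why the entry/exit rules further down compare e against entry_multiplier * sqrt(Q) and
    # exit_multiplier * sqrt(Q).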
350 | 351 | for t in range(len(y)): 352 | if (t > 0): 353 | beta[:, t] = beta[:, t - 1] 354 | R = P + Vw 355 | 356 | yhat[t] = np.dot(x[t, :], beta[:, t]) 357 | 358 | tmp1 = np.matrix(x[t, :]) 359 | tmp2 = np.matrix(x[t, :]).T 360 | Q[t] = np.dot(np.dot(tmp1, R), tmp2) + Ve 361 | 362 | e[t] = y[t] - yhat[t] # plays spread role 363 | 364 | K = np.dot(R, np.matrix(x[t, :]).T) / Q[t] 365 | 366 | # print R;print x[t, :].T;print Q[t];print 'K',K;print;print 367 | 368 | beta[:, t] = beta[:, t] + np.dot(K, np.matrix(e[t])) 369 | 370 | tmp1 = np.matrix(x[t, :]) 371 | P = R - np.dot(np.dot(K, tmp1), R) 372 | 373 | # if t==2: 374 | # print beta[0, :].T 375 | 376 | # plt.plot(beta[0, :].T) 377 | # plt.savefig('/tmp/beta1.png') 378 | # plt.hold(False) 379 | # plt.plot(beta[1, :].T) 380 | # plt.savefig('/tmp/beta2.png') 381 | # plt.hold(False) 382 | # plt.plot(e[2:], 'r') 383 | # plt.hold(True) 384 | # plt.plot(np.sqrt(Q[2:])) 385 | # plt.savefig('/tmp/Q.png') 386 | 387 | y2 = pd.concat([x_series, y_series], axis=1) 388 | 389 | longsEntry = e < -entry_multiplier * np.sqrt(Q) 390 | longsExit = e > -exit_multiplier * np.sqrt(Q) 391 | 392 | shortsEntry = e > entry_multiplier * np.sqrt(Q) 393 | shortsExit = e < exit_multiplier * np.sqrt(Q) 394 | 395 | numUnitsLong = pd.Series([np.nan for i in range(len(y))]) 396 | numUnitsShort = pd.Series([np.nan for i in range(len(y))]) 397 | # initialize with zero 398 | numUnitsLong[0] = 0. 399 | numUnitsShort[0] = 0. 400 | # remove trades while the spread is stabilizing 401 | longsEntry[:stabilizing_threshold] = False 402 | longsExit[:stabilizing_threshold] = False 403 | shortsEntry[:stabilizing_threshold] = False 404 | shortsExit[:stabilizing_threshold] = False 405 | 406 | numUnitsLong[longsEntry] = 1. 407 | numUnitsLong[longsExit] = 0 408 | numUnitsLong = numUnitsLong.fillna(method='ffill') 409 | 410 | numUnitsShort[shortsEntry] = -1. 
411 | numUnitsShort[shortsExit] = 0 412 | numUnitsShort = numUnitsShort.fillna(method='ffill') 413 | 414 | numUnits = numUnitsLong + numUnitsShort 415 | numUnits = pd.Series(data=numUnits.values, index=y_series.index, name='numUnits') 416 | 417 | # position durations 418 | trading_durations = self.add_trading_duration(pd.DataFrame(numUnits, index=y_series.index)) 419 | 420 | beta = pd.Series(data=np.squeeze(np.asarray(beta[0, :])), index=y_series.index).fillna(0) 421 | position_ret, _, ret_summary = self.calculate_sliding_position_returns(y_series, x_series, beta, numUnits) 422 | 423 | # add transaction costs and gather all info in df 424 | series_to_include = [(position_ret, 'position_return'), 425 | (y_series, y_series.name), 426 | (x_series, x_series.name), 427 | (beta, 'beta_position'), 428 | (pd.Series(e, index=y_series.index), 'e'), 429 | (pd.Series(np.sqrt(Q), index=y_series.index), 'sqrt(Q)'), 430 | (numUnits, 'numUnits'), 431 | (trading_durations, 'trading_duration')] 432 | 433 | summary = self.trade_summary(series_to_include) 434 | 435 | return summary, ret_summary 436 | 437 | def cross_threshold(self, array, threshold, direction='up', position='exit'): 438 | """ 439 | This function returns the indices corresponding to the positions where a given threshold 440 | is crossed 441 | :param array: np.array with time series 442 | :param threshold: threshold to be crossed 443 | :param direction: going up or down 444 | :param mode: auxiliar variable indicating whether we are checking for a position entry or exit 445 | :return: indices where threshold is crossed going in the desired direction 446 | """ 447 | 448 | # add index for first element transitioning from None value, in case its above/below threshold 449 | # only add when checking if position should be entered. 450 | initial_index = [] 451 | first_index, first_element = next((item[0], item[1]) for item in enumerate(array) if not np.isnan(item[1])) 452 | if position == 'entry': 453 | if direction == 'up': 454 | if first_element > threshold: 455 | initial_index.append(first_index) 456 | elif direction == 'down': 457 | if first_element < threshold: 458 | initial_index.append(first_index) 459 | else: 460 | print('The series must be either going "up" or "down", please insert valid direction') 461 | initial_index = np.asarray(initial_index, dtype='int') 462 | 463 | # add small decimal case to consider only strictly larger/smaller 464 | if threshold > 0: 465 | threshold = threshold + 0.000000001 466 | else: 467 | threshold = threshold - 0.000000001 468 | array = array - threshold 469 | 470 | # add other indices 471 | indices = np.where(np.diff(np.sign(array)))[0] + 1 472 | # only consider indices after first element which is not Nan 473 | indices = indices[indices > first_index] 474 | 475 | direction_indices = indices 476 | for index in indices: 477 | if direction == 'up': 478 | if array[index] < array[index - 1]: 479 | direction_indices = direction_indices[direction_indices != index] 480 | elif direction == 'down': 481 | if array[index] > array[index - 1]: 482 | direction_indices = direction_indices[direction_indices != index] 483 | # concatenate 484 | direction_indices = np.concatenate((initial_index, direction_indices), axis=0) 485 | 486 | return direction_indices 487 | 488 | def apply_correlation_filter(self, lookback, lag, threshold, Y, X, units): 489 | """ 490 | This function implements a filter proposed by Dunnis 2005. 
491 | The main idea is tracking how the correlation is varying in a moving period, so that we 492 | are able to identify when the two legs of the spread are moving in opposing directions 493 | by analyzing how the correlation values are varying. 494 | :param lookback: lookback period 495 | :param lag: lag to compare the correlaiton evolution 496 | :param threshold: minimium difference to consider change 497 | :param Y: Y series 498 | :param X: X series 499 | :param units: positions taken 500 | :return: indices for position entry 501 | """ 502 | 503 | # calculate correlation variations 504 | rolling_window = lookback 505 | returns_X = X.pct_change() 506 | returns_Y = Y.pct_change() 507 | correlation = returns_X.rolling(rolling_window).corr(returns_Y) 508 | diff_correlation = correlation.diff(periods=lag).fillna(0) 509 | 510 | # change positions accordingly 511 | diff_correlation.name = 'diff_correlation'; 512 | units.name = 'units' 513 | units.index = diff_correlation.index 514 | df = pd.concat([diff_correlation, units], axis=1) 515 | new_df = self.update_positions(df, 'diff_correlation', threshold) 516 | 517 | units = new_df['units'] 518 | 519 | return units 520 | 521 | def apply_zscorediff_filter(self, lag, threshold, zscore, units): 522 | """ 523 | This function implements a filter which tracks how the zscore has been growing. 524 | The premise is that positions should not be entered while zscore is rising. 525 | :param lookback: lookback period 526 | :param lag: lag to compare the zscore evolution 527 | :param threshold: minimium difference to consider change 528 | :param Y: Y series 529 | :param X: X series 530 | :param units: positions taken 531 | :return: indices for position entry 532 | """ 533 | 534 | # calculate zscore differences 535 | zscore_diff = zscore.diff(periods=lag).fillna(0) 536 | 537 | # change positions accordingly 538 | zscore_diff.name = 'zscore_diff'; 539 | units.name = 'units' 540 | units.index = zscore_diff.index 541 | df = pd.concat([zscore_diff, units], axis=1) 542 | new_df = self.update_positions(df, 'zscore_diff', threshold) 543 | 544 | units = new_df['units'] 545 | 546 | return units 547 | 548 | def calculate_sliding_position_returns(self, y, x, beta, positions): 549 | """ 550 | Y: price of ETF Y 551 | X: price of ETF X 552 | beta: moving cointegration ratio 553 | positions: array indicating position to enter in next day 554 | """ 555 | # get copy of series 556 | y = y.copy() 557 | y.name = 'y' 558 | x = x.copy() 559 | x.name = 'x' 560 | 561 | # positions preceed the day when the position is actually entered! 562 | # get indices before entering position 563 | new_positions = positions.diff()[positions.diff() != 0].index.values 564 | # get corresponding betas 565 | beta_position = pd.Series(data=[np.nan] * len(y), index=y.index, name='beta_position') 566 | beta_position[new_positions] = beta[new_positions] 567 | # fill in between time slots with same beta 568 | beta_position = beta_position.fillna(method='ffill') 569 | # shift betas to match row when position is on 570 | beta_position = beta_position.shift().fillna(0) 571 | 572 | # create variable for signalizing end of position 573 | end_position = pd.Series(data=[0] * len(y), index=y.index, name='end_position') 574 | end_position[new_positions] = 1. 
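    # The block below records the Y and X prices observed when each position is entered
    # (forward-filled and shifted one row, like the betas above), so that every row of the
    # assembled dataframe carries the entry prices needed to value the open position, while
    # end_position flags the rows where a new position is signalled.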
575 | 576 | # get corresponding X and Y 577 | y_entry = pd.Series(data=[np.nan] * len(y), index=y.index, name='y_entry') 578 | x_entry = pd.Series(data=[np.nan] * len(y), index=y.index, name='x_entry') 579 | y_entry[new_positions] = y[new_positions] 580 | x_entry[new_positions] = x[new_positions] 581 | y_entry = y_entry.shift().fillna(method='ffill') 582 | x_entry = x_entry.shift().fillna(method='ffill') 583 | 584 | # name positions series 585 | positions.name = 'positions' 586 | 587 | # apply returns per trade 588 | # each row contain all the parameters to be applied in that position 589 | df = pd.concat([y, x, beta_position, positions.shift().fillna(0), y_entry, x_entry, end_position], axis=1) 590 | returns = df.apply(lambda row: self.return_per_position(row, sliding=True), axis=1).fillna(0) 591 | cum_returns = np.cumprod(returns + 1) - 1 592 | df['ret'] = returns 593 | returns.name = 'position_return' 594 | 595 | return returns, cum_returns, df 596 | 597 | def return_per_timestep(self, row): 598 | if row['beta_position'] > 1.: 599 | return ((1 / row['beta_position']) * row['y_returns'] - 1 * row['x_returns']) * row['positions'] 600 | else: 601 | return (row['y_returns'] - row['beta_position'] * row['x_returns']) * row['positions'] 602 | 603 | def update_positions(self, df, attribute, threshold): 604 | """ 605 | The following function receives a dataframe containing the current positions 606 | along with the attribute column from which condition should be verified. 607 | A new df with positions updated accordingly is returned. 608 | :param df: df containing positions and column with attribute 609 | :param attribute: attribute name 610 | :param threshold: threshold that condition must verify 611 | :return: df with updated positions 612 | """ 613 | previous_unit = 0 614 | for index, row in df.iterrows(): 615 | if previous_unit == row['units']: 616 | continue # no change in positions to verify 617 | else: 618 | if row['units'] == 0: 619 | previous_unit = row['units'] 620 | continue # simply close trade, nothing to verify 621 | else: 622 | if (row[attribute] <= threshold and row['units'] < 0) or \ 623 | (row[attribute] > threshold and row['units'] > 0): # if criteria is met, continue 624 | previous_unit = row['units'] 625 | continue 626 | else: # if criteria is not met, update row 627 | df.loc[index, 'units'] = 0 628 | previous_unit = 0 629 | continue 630 | 631 | return df 632 | 633 | # data_processor.py 634 | def read_tickers_prices(self, tickers, initial_date, final_date, data_source, column='Adj Close'): 635 | """ 636 | This function reads the price series for the requested tickers 637 | 638 | :param tickers: list with tickers from which to retrieve prices 639 | :param initial_date: start date to retrieve price series 640 | :param final_date: end point 641 | :param data_source: data source from where to retrieve data 642 | 643 | :return: dictionary with price series for each ticker 644 | """ 645 | error_counter = 0 646 | dataset = {key: None for key in tickers} 647 | for ticker in tickers: 648 | try: 649 | df = data.DataReader(ticker, data_source, initial_date, final_date) 650 | series = df[column] 651 | series.name = ticker # filter close price only 652 | dataset[ticker] = series.copy() 653 | except: 654 | error_counter = error_counter + 1 655 | print('Not Possible to retrieve information for ' + ticker) 656 | 657 | print('\nUnable to download ' + str(error_counter / len(tickers) * 100) + '% of the ETFs') 658 | 659 | return dataset 660 | 661 | 662 | 663 | # forecasting notebook 664 | 665 
| def apply_ARIMA(series, p, d, q): 666 | # fit model 667 | model = ARIMA(series, order=(p,d,q)) 668 | model_fit = model.fit(disp=0) 669 | print(model_fit.summary()) 670 | # plot residual errors 671 | residuals = pd.DataFrame(model_fit.resid) 672 | residuals.plot() 673 | plt.show() 674 | residuals.plot(kind='kde') 675 | plt.show() 676 | print(residuals.describe()) 677 | 678 | def rolling_ARIMA(series, p, d, q, train_val_split): 679 | # standardize 680 | mean = series.mean() 681 | std = np.std(series) 682 | norm_series = (series - mean) / std 683 | 684 | train, val = norm_series[:train_val_split].values, norm_series[train_val_split:].values 685 | history = np.asarray([x for x in train]) 686 | predictions = list() 687 | for t in range(len(val)): 688 | model = ARIMA(history, order=(p, d, q)) 689 | model_fit = model.fit(transparams=False, trend='nc', tol=0.0001, disp=0) 690 | if t == 0: 691 | print(model_fit.summary()) 692 | print(history[-5:]) 693 | output = model_fit.forecast() 694 | yhat = output[0] 695 | predictions.append(yhat) 696 | obs = val[t] 697 | history = np.append(history, obs) 698 | print('predicted=%f, expected=%f' % (yhat, obs)) 699 | 700 | # destandardize 701 | val = val * std + mean 702 | predictions = np.asarray(predictions); 703 | predictions = predictions * std + mean 704 | error = mean_squared_error(val, predictions) 705 | print('Test MSE: {}'.format(error)) 706 | # plot 707 | # plt.plot(val) 708 | # plt.plot(predictions, color='red') 709 | # plt.show() 710 | 711 | return error, predictions 712 | -------------------------------------------------------------------------------- /drafts/main.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import json 4 | import sys 5 | import pickle 6 | from classes import class_Trader, class_ForecastingTrader, class_DataProcessor, class_SeriesAnalyser 7 | 8 | # just set the seed for the random number generator 9 | np.random.seed(107) 10 | 11 | if __name__ == "__main__": 12 | 13 | # read inout parameters 14 | config_path = sys.argv[1] 15 | pairs_mode = int(sys.argv[2]) 16 | trade_mode = int(sys.argv[3]) # 1. Benchmark 2.ML 3.Both 17 | with open(config_path, 'r') as f: 18 | config = json.load(f) 19 | 20 | ################################################################################################################### 21 | # 1. Upload Dataset 22 | # This code assumes the data preprocessing has been done previously by running the notebook: 23 | # - PairsTrading-DataPreprocessing.ipynb 24 | # Therefore, we simply retrieve the data from a pickle file and select the dates to study 25 | ################################################################################################################### 26 | 27 | # initialize data processor 28 | data_processor = class_DataProcessor.DataProcessor() 29 | 30 | # Read dataset and select dates 31 | dataset_path = config['dataset']['path'] 32 | df_prices = pd.read_pickle(dataset_path) 33 | 34 | # split data in training and test 35 | df_prices_train, df_prices_test = data_processor.split_data(df_prices, 36 | (config['dataset']['training_initial_date'], 37 | config['dataset']['training_final_date']), 38 | (config['dataset']['testing_initial_date'], 39 | config['dataset']['testing_final_date']), 40 | remove_nan=True) 41 | 42 | ################################################################################################################### 43 | # 2. 
Pairs Filtering & Selection 44 | # As this part is very visual, the pairs filtering and selection can be obtained by running the notebook: 45 | # - 'PairsTrading-Clustering.ipynb' 46 | # This section uploads the pairs for each scenario 47 | ################################################################################################################### 48 | # initialize series analyser 49 | series_analyser = class_SeriesAnalyser.SeriesAnalyser() 50 | 51 | if pairs_mode == 1: 52 | with open('data/etfs/pickle/pairs_unfiltered.pickle', 'rb') as handle: 53 | pairs = pickle.load(handle) 54 | elif pairs_mode == 2: 55 | with open('data/etfs/pickle/pairs_category.pickle', 'rb') as handle: 56 | pairs = pickle.load(handle) 57 | elif pairs_mode == 3: 58 | with open('data/etfs/pickle/pairs_unsupervised_learning.pickle', 'rb') as handle: 59 | pairs = pickle.load(handle) 60 | 61 | ################################################################################################################### 62 | # 3. Apply trading 63 | # First apply the strategy to the training data, to discard the pairs that were not profitable not even in the 64 | # training period. 65 | # Secondly, apply the strategy on the test set 66 | ################################################################################################################### 67 | trader = class_Trader.Trader() 68 | 69 | # obtain trading strategy 70 | trading_strategy = config['trading']['strategy'] 71 | 72 | # obtain trading filter info 73 | if config['trading_filter']['active'] == 1: 74 | trading_filter = config['trading_filter'] 75 | else: 76 | trading_filter = None 77 | 78 | # ################################################ BENCHMARK ####################################################### 79 | if (trade_mode == 1) or (trade_mode == 3): 80 | # Run on TRAIN SET 81 | if 'bollinger' in trading_strategy: 82 | sharpe_results, cum_returns, performance = \ 83 | trader.apply_bollinger_strategy(pairs=pairs, 84 | lookback_multiplier=config['trading']['lookback_multiplier'], 85 | entry_multiplier=config['trading']['entry_multiplier'], 86 | exit_multiplier=config['trading']['exit_multiplier'], 87 | trading_filter=trading_filter, 88 | test_mode=False 89 | ) 90 | elif 'kalman' in trading_strategy: 91 | sharpe_results, cum_returns, performance = \ 92 | trader.apply_kalman_strategy(pairs, 93 | entry_multiplier=config['trading']['entry_multiplier'], 94 | exit_multiplier=config['trading']['exit_multiplier'], 95 | trading_filter=trading_filter, 96 | test_mode=False 97 | ) 98 | else: 99 | print('Please insert valid trading strategy: 1. 
"bollinger" or 2."kalman"') 100 | exit() 101 | 102 | # get train metrics 103 | n_years_train = round(len(df_prices_train) / 240) 104 | train_metrics = trader.calculate_metrics(sharpe_results, cum_returns, n_years_train) 105 | 106 | # filter pairs with positive results 107 | profitable_pairs = trader.filter_profitable_pairs(sharpe_results=sharpe_results, pairs=pairs) 108 | 109 | # Run on TEST SET 110 | if 'bollinger' in trading_strategy: 111 | sharpe_results, cum_returns, performance = \ 112 | trader.apply_bollinger_strategy(pairs=profitable_pairs, 113 | lookback_multiplier=config['trading']['lookback_multiplier'], 114 | entry_multiplier=config['trading']['entry_multiplier'], 115 | exit_multiplier=config['trading']['exit_multiplier'], 116 | trading_filter=trading_filter, 117 | test_mode=True 118 | ) 119 | print('Avg sharpe Ratio using Bollinger in test set: ', np.mean(sharpe_results)) 120 | 121 | elif 'kalman' in trading_strategy: 122 | sharpe_results, cum_returns, performance = \ 123 | trader.apply_kalman_strategy(pairs=profitable_pairs, 124 | entry_multiplier=config['trading']['entry_multiplier'], 125 | exit_multiplier=config['trading']['exit_multiplier'], 126 | trading_filter=trading_filter, 127 | test_mode=True 128 | ) 129 | print('Avg sharpe Ratio using kalman in the test set: ', np.mean(sharpe_results)) 130 | 131 | # ################################################ ML BASED ####################################################### 132 | if (trade_mode == 2) or (trade_mode == 3): 133 | 134 | forecasting_trader = class_ForecastingTrader.ForecastingTrader() 135 | 136 | # 1) get pairs spreads and train models 137 | mlp_config = config['mlp'] 138 | mlp_config['train_val_split'] = int(config['mlp']['train_val_split']*len(pairs[0][2]['spread'])) 139 | models = forecasting_trader.train_models(pairs[:2], model_config=mlp_config) # CHANGE LIMITATION OF PAIRS 140 | 141 | # 2) test models on training set and only keep profitable spreads 142 | print('Still under construction') 143 | exit() 144 | 145 | # 3) test spreads on test set 146 | 147 | ################################################################################################################### 148 | # 4. Get results 149 | # Obtain the results in the test set. 
150 | # - writes global pairs results in an excel file 151 | # - stores dataframe with info regarding every pair in pickle file 152 | ################################################################################################################### 153 | with open(config['dataset']['ticker_segment_dict'], 'rb') as handle: 154 | ticker_segment_dict = pickle.load(handle) 155 | 156 | results, pairs_summary = trader.summarize_results(sharpe_results, cum_returns, performance, profitable_pairs, 157 | ticker_segment_dict) 158 | 159 | -------------------------------------------------------------------------------- /drafts/mlp_trainer.py: -------------------------------------------------------------------------------- 1 | from classes import class_ForecastingTrader, class_DataProcessor 2 | import numpy as np 3 | np.random.seed(1) # NumPy 4 | import random 5 | random.seed(3) # Python 6 | import tensorflow as tf 7 | tf.set_random_seed(2) # Tensorflow 8 | session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, 9 | inter_op_parallelism_threads=1) 10 | from keras import backend as K 11 | sess = tf.Session(graph=tf.get_default_graph(), config=session_conf) 12 | K.set_session(sess) 13 | 14 | import pandas as pd 15 | import pickle 16 | import gc 17 | 18 | forecasting_trader = class_ForecastingTrader.ForecastingTrader() 19 | data_processor = class_DataProcessor.DataProcessor() 20 | 21 | ################################# READ PRICES AND PAIRS ################################# 22 | # read prices 23 | df_prices = pd.read_pickle('/content/drive/PairsTrading/2009-2019/commodity_ETFs_intraday_interpolated_screened_no_outliers.pickle') 24 | # split data in training and test 25 | df_prices_train, df_prices_test = data_processor.split_data(df_prices, 26 | ('01-01-2009', 27 | '31-12-2017'), 28 | ('01-01-2018', 29 | '31-12-2018'), 30 | remove_nan=True) 31 | # load pairs 32 | with open('/content/drive/PairsTrading/2009-2019/pairs_unsupervised_learning_optical_intraday.pickle', 'rb') as handle: 33 | pairs = pickle.load(handle) 34 | n_years_train = round(len(df_prices_train) / (240 * 78)) 35 | print('Loaded {} pairs!'.format(len(pairs))) 36 | 37 | 38 | ################################# TRAIN MODELS ################################# 39 | 40 | n_in_set = [6, 12, 24] 41 | hidden_nodes_set = [[10], [20], [30], [10,10]] 42 | hidden_nodes_names = [str(nodes[0])+'*2' if len(nodes) > 1 else str(nodes[0]) for nodes in hidden_nodes_set] 43 | 44 | # WARNING!! 
45 | # pairs = pairs[:2] 46 | 47 | for input_dim in n_in_set: 48 | for i, hidden_nodes in enumerate(hidden_nodes_set): 49 | model_config = {"n_in": input_dim, 50 | "n_out": 1, 51 | "epochs": 500, 52 | "hidden_nodes": hidden_nodes, 53 | "loss_fct": "mse", 54 | "optimizer": "rmsprop", 55 | "batch_size": 256, 56 | "train_val_split": '2017-01-01', 57 | "test_init": '2018-01-01'} 58 | models = forecasting_trader.train_models(pairs, model_config, model_type='mlp') 59 | # save models for this configuration 60 | with open('/content/drive/PairsTrading/mlp_models/models_n_in-'+str(input_dim)+'_hidden_nodes-'+hidden_nodes_names[i]+'.pkl', 'wb') as f: 61 | pickle.dump(models, f) 62 | 63 | gc.collect() -------------------------------------------------------------------------------- /training/PairsTrading_DeepLearning.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"PairsTrading_DeepLearning.ipynb","version":"0.3.2","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"},"accelerator":"GPU"},"cells":[{"cell_type":"code","metadata":{"id":"EBl2Ok7tbGHC","colab_type":"code","outputId":"c4fbb45a-3fc9-4f6d-95b3-d341445f23b8","executionInfo":{"status":"ok","timestamp":1567675837287,"user_tz":-60,"elapsed":4098,"user":{"displayName":"Simao Sarmento","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBWip6a0UyBh_1Dd-LrfHuFavFBPDAae6wiUEky=s64","userId":"16654987200280043400"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["import tensorflow as tf\n","tf.test.gpu_device_name()"],"execution_count":1,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'/device:GPU:0'"]},"metadata":{"tags":[]},"execution_count":1}]},{"cell_type":"markdown","metadata":{"id":"MPzxbLTnNQet","colab_type":"text"},"source":["**Instalation**\n","\n","Must run every time the notebook is closed."]},{"cell_type":"code","metadata":{"id":"PARahqujMTrP","colab_type":"code","outputId":"d3989cdb-a455-43e4-85e8-6f6cec5ae2b5","executionInfo":{"status":"ok","timestamp":1567675959397,"user_tz":-60,"elapsed":118030,"user":{"displayName":"Simao Sarmento","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBWip6a0UyBh_1Dd-LrfHuFavFBPDAae6wiUEky=s64","userId":"16654987200280043400"}},"colab":{"base_uri":"https://localhost:8080/","height":136}},"source":["# Install a Drive FUSE wrapper.\n"," # https://github.com/astrada/google-drive-ocamlfuse\n","\n","!apt-get install -y -qq software-properties-common python-software-properties module-init-tools\n","!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null \n","!apt-get update -qq 2>&1 > /dev/null\n","!apt-get -y install -qq google-drive-ocamlfuse fuse"],"execution_count":2,"outputs":[{"output_type":"stream","text":["E: Package 'python-software-properties' has no installation candidate\n","Selecting previously unselected package google-drive-ocamlfuse.\n","(Reading database ... 
131183 files and directories currently installed.)\n","Preparing to unpack .../google-drive-ocamlfuse_0.7.6-0ubuntu1~ubuntu18.04.1_amd64.deb ...\n","Unpacking google-drive-ocamlfuse (0.7.6-0ubuntu1~ubuntu18.04.1) ...\n","Setting up google-drive-ocamlfuse (0.7.6-0ubuntu1~ubuntu18.04.1) ...\n","Processing triggers for man-db (2.8.3-2ubuntu0.1) ...\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"IuppVB4mMcyH","colab_type":"code","colab":{}},"source":["# Generate auth tokens for Colab\n","\n","from google.colab import auth \n","auth.authenticate_user()"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"H1DhebmJMr3O","colab_type":"code","outputId":"85c60d31-b24b-4717-9383-d7aab5955f4b","executionInfo":{"status":"ok","timestamp":1567676010997,"user_tz":-60,"elapsed":24171,"user":{"displayName":"Simao Sarmento","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBWip6a0UyBh_1Dd-LrfHuFavFBPDAae6wiUEky=s64","userId":"16654987200280043400"}},"colab":{"base_uri":"https://localhost:8080/","height":105}},"source":["# Generate creds for the Drive FUSE library.\n","\n","from oauth2client.client import GoogleCredentials \n","creds = GoogleCredentials.get_application_default()\n","import getpass \n","!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL\n","vcode = getpass.getpass() \n","!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}"],"execution_count":4,"outputs":[{"output_type":"stream","text":["Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force\n","··········\n","Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force\n","Please enter the verification code: Access token retrieved correctly.\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"mID6JP_yM24_","colab_type":"code","colab":{}},"source":["# Create a directory and mount Google Drive using that directory.\n","\n","!mkdir -p drive\n","!google-drive-ocamlfuse drive"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"7NwIXZ-bNGjg","colab_type":"code","outputId":"1a90f65f-f5f4-4acf-e1dd-9f982772ca98","executionInfo":{"status":"ok","timestamp":1567676034775,"user_tz":-60,"elapsed":2435,"user":{"displayName":"Simao Sarmento","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBWip6a0UyBh_1Dd-LrfHuFavFBPDAae6wiUEky=s64","userId":"16654987200280043400"}},"colab":{"base_uri":"https://localhost:8080/","height":119}},"source":["print ('Files in Drive:')\n","!ls /content/drive/PairsTrading"],"execution_count":6,"outputs":[{"output_type":"stream","text":["Files in Drive:\n","2009-2019\t\t encoder_decoder\t\t __pycache__\n","class_DataProcessor.py\t encoder_decoder_trainer.py\t rnn_models\n","class_ForecastingTrader.py mlp_models\t\t\t rnn_trainer.py\n","class_SeriesAnalyser.py mlp_trainer.py\n","class_Trader.py\t\t 
PairsTrading_DeepLearning.ipynb

**Run Python File**

Run the executable here:

    !python3 "/content/drive/PairsTrading/rnn_trainer.py"

The cell's stdout, aside from TensorFlow/CUDA start-up logs and deprecation warnings, reports the data being loaded and the five pair models being trained on a Tesla K80 GPU:

    Total of 59 tickers
    Total of 58 tickers after removing tickers with Nan values
    Loaded 5 pairs!

Each pair is fitted with the same architecture: a CuDNNLSTM layer with 50 units followed by a Dense(1) output (10,651 trainable parameters), trained for 1 epoch on 156,492 samples and validated on 19,506 samples. Reported losses:

| Model        | Train MSE | Train MAE | Test MSE | Test MAE |
|--------------|-----------|-----------|----------|----------|
| sequential_1 | 0.1698    | 0.2601    | 0.0486   | 0.1749   |
| sequential_2 | 0.2911    | 0.2505    | 0.0068   | 0.0590   |
| sequential_3 | 0.1242    | 0.2012    | 0.0019   | 0.0268   |
| sequential_4 | 0.3695    | 0.2716    | 0.0065   | 0.0699   |
| sequential_5 | 0.3712    | 0.3324    | 0.2948   | 0.3062   |
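For anyone rerunning this notebook from a clean Colab session, the cell above presumes Google Drive is already mounted and that the repository files sit under /content/drive/PairsTrading, the path hard-coded in the trainer scripts below. A minimal setup cell along those lines might look as follows; the mount point and folder layout are assumptions to adapt to your own Drive structure, not part of the original notebook:

```python
# Illustrative Colab setup cell (not from the original notebook).
# Assumes the data folder, class files and training scripts were copied to Drive
# so that they are reachable at /content/drive/PairsTrading after mounting.
from google.colab import drive

drive.mount('/content/drive')

# Launch the RNN trainer, exactly as the notebook cell above does.
!python3 "/content/drive/PairsTrading/rnn_trainer.py"
```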
--------------------------------------------------------------------------------
/training/encoder_decoder_trainer.py:
--------------------------------------------------------------------------------
from classes import class_ForecastingTrader, class_DataProcessor
import numpy as np
np.random.seed(1)      # NumPy
import random
random.seed(3)         # Python
import tensorflow as tf
tf.set_random_seed(2)  # Tensorflow
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
from keras import backend as K
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

import pandas as pd
import pickle
import gc

forecasting_trader = class_ForecastingTrader.ForecastingTrader()
data_processor = class_DataProcessor.DataProcessor()

################################# READ PRICES AND PAIRS #################################
# read prices
df_prices = pd.read_pickle('/content/drive/PairsTrading/2009-2019/commodity_ETFs_intraday_interpolated_screened_no_outliers.pickle')
# df_prices = pd.read_pickle('data/etfs/pickle/commodity_ETFs_intraday_interpolated_screened_no_outliers.pickle')
# split data in training and test
df_prices_train, df_prices_test = data_processor.split_data(df_prices,
                                                            ('01-01-2009', '31-12-2017'),
                                                            ('01-01-2018', '31-12-2018'),
                                                            remove_nan=True)
# load pairs
with open('/content/drive/PairsTrading/2009-2019/pairs_unsupervised_learning_optical_intraday.pickle', 'rb') as handle:
# with open('data/etfs/pickle/2009-2019/pairs_unsupervised_learning_optical_intraday.pickle', 'rb') as handle:
    pairs = pickle.load(handle)
n_years_train = round(len(df_prices_train) / (240 * 78))  # ~240 trading days/year x 78 five-minute bars/day
print('Loaded {} pairs!'.format(len(pairs)))

################################# TRAIN MODELS #################################

combinations = [(24, [15, 15])]
hidden_nodes_names = ['15_15_nodes']

for i, configuration in enumerate(combinations):

    model_config = {"n_in": configuration[0],
                    "n_out": 2,
                    "epochs": 500,
                    "hidden_nodes": configuration[1],
                    "loss_fct": "mse",
                    "optimizer": "rmsprop",
                    "batch_size": 512,
                    "train_val_split": '2017-01-01',
                    "test_init": '2018-01-01'}
    models = forecasting_trader.train_models(pairs, model_config, model_type='encoder_decoder')

    # save models for this configuration
    with open('/content/drive/PairsTrading/encoder_decoder/models_n_in-' + str(configuration[0]) + '_hidden_nodes-' +
              hidden_nodes_names[i] + '.pkl', 'wb') as f:
        pickle.dump(models, f)

    gc.collect()
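The model_config dictionaries in this trainer and in rnn_trainer.py below hand n_in and n_out to ForecastingTrader.train_models. Assuming these denote the lengths of the input and output windows of the supervised forecasting problem (24 past observations of each pair's series mapped to the next 2 steps here, and to the next 1 step in the RNN trainer), the windowing can be sketched as below; make_windows is an illustrative helper, not the class's actual API:

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Build (samples, n_in) inputs and (samples, n_out) targets from a 1-D series."""
    X, y = [], []
    for t in range(len(series) - n_in - n_out + 1):
        X.append(series[t:t + n_in])
        y.append(series[t + n_in:t + n_in + n_out])
    return np.array(X), np.array(y)

# Example with the encoder-decoder settings above: 24 inputs, 2 outputs.
X, y = make_windows(np.arange(100, dtype=float), n_in=24, n_out=2)
print(X.shape, y.shape)  # (75, 24) (75, 2)
```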
--------------------------------------------------------------------------------
/training/rnn_trainer.py:
--------------------------------------------------------------------------------
from classes import class_ForecastingTrader, class_DataProcessor
import numpy as np
np.random.seed(1)      # NumPy
import random
random.seed(3)         # Python
import tensorflow as tf
tf.set_random_seed(2)  # Tensorflow
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
from keras import backend as K
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

import pandas as pd
import pickle
import gc

forecasting_trader = class_ForecastingTrader.ForecastingTrader()
data_processor = class_DataProcessor.DataProcessor()

################################# READ PRICES AND PAIRS #################################
# read prices
df_prices = pd.read_pickle('/content/drive/PairsTrading/2009-2019/commodity_ETFs_intraday_interpolated_screened_no_outliers.pickle')
# df_prices = pd.read_pickle('data/etfs/pickle/commodity_ETFs_intraday_interpolated_screened_no_outliers.pickle')
# split data in training and test
df_prices_train, df_prices_test = data_processor.split_data(df_prices,
                                                            ('01-01-2009', '31-12-2017'),
                                                            ('01-01-2018', '31-12-2018'),
                                                            remove_nan=True)
# load pairs
with open('/content/drive/PairsTrading/2009-2019/pairs_unsupervised_learning_optical_intraday.pickle', 'rb') as handle:
# with open('data/etfs/pickle/2009-2019/pairs_unsupervised_learning_optical_intraday.pickle', 'rb') as handle:
    pairs = pickle.load(handle)
n_years_train = round(len(df_prices_train) / (240 * 78))  # ~240 trading days/year x 78 five-minute bars/day
print('Loaded {} pairs!'.format(len(pairs)))

################################# TRAIN MODELS #################################

combinations = [(24, [50])]
hidden_nodes_names = ['50_nodes']

for i, configuration in enumerate(combinations):

    model_config = {"n_in": configuration[0],
                    "n_out": 1,
                    "epochs": 500,
                    "hidden_nodes": configuration[1],
                    "loss_fct": "mse",
                    "optimizer": "rmsprop",
                    "batch_size": 512,
                    "train_val_split": '2017-01-01',
                    "test_init": '2018-01-01'}
    models = forecasting_trader.train_models(pairs, model_config, model_type='rnn')

    # save models for this configuration
    with open('/content/drive/PairsTrading/rnn_models/models_n_in-' + str(configuration[0]) + '_hidden_nodes-' +
              hidden_nodes_names[i] + '.pkl', 'wb') as f:
        pickle.dump(models, f)

    gc.collect()
--------------------------------------------------------------------------------
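Both trainer scripts persist the list of per-pair model artifacts with pickle.dump, using the filename scheme above. A minimal sketch of reloading one of those files for later evaluation, assuming the same Drive layout and that train_models returns one artifact per pair (as the five models in the notebook output suggest):

```python
import pickle

# Filename follows the scheme used in rnn_trainer.py (n_in=24, '50_nodes' label).
models_path = '/content/drive/PairsTrading/rnn_models/models_n_in-24_hidden_nodes-50_nodes.pkl'

with open(models_path, 'rb') as f:
    models = pickle.load(f)  # expected: one entry per trained pair

print('Reloaded {} per-pair model artifacts'.format(len(models)))
```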