├── Images
│   ├── buySell.png
│   ├── FittingWindow.png
│   ├── cointegration.png
│   └── trade_performance.png
├── Report+Poster
│   ├── Pairs_Trading_Doc.pdf
│   └── Pairs_Trading_Poster.pdf
├── Code
│   ├── CointHurst.py
│   ├── PairsTrade.py
│   └── slib.py
└── README.md

/Images/buySell.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/HEAD/Images/buySell.png
--------------------------------------------------------------------------------
/Images/FittingWindow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/HEAD/Images/FittingWindow.png
--------------------------------------------------------------------------------
/Images/cointegration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/HEAD/Images/cointegration.png
--------------------------------------------------------------------------------
/Images/trade_performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/HEAD/Images/trade_performance.png
--------------------------------------------------------------------------------
/Report+Poster/Pairs_Trading_Doc.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/HEAD/Report+Poster/Pairs_Trading_Doc.pdf
--------------------------------------------------------------------------------
/Report+Poster/Pairs_Trading_Poster.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/HEAD/Report+Poster/Pairs_Trading_Poster.pdf
--------------------------------------------------------------------------------
/Code/CointHurst.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Thu Apr 30 14:57:23 2020

@author: alber
"""

import numpy as np
import matplotlib.pyplot as plt

import slib as slb

plt.close('all')
#%% Load data
start_year = 2010; end_year = 2020
names = np.load('names.npy'); symbols = np.load('symbols.npy')
variation = np.load('variation.npy'); close_val = np.load('close_val.npy'); open_val = np.load('open_val.npy')

#%% Cointegration values
# =============================================================================
# scores, pvalues, pairs = slb.find_cointegrated_pairs(close_val)
# # Save data
# np.save('scores', scores)
# np.save('pvalues', pvalues)
# np.save('pairs', pairs)
# =============================================================================
pairs = np.load('pairs.npy'); pvalues = np.load('pvalues.npy'); scores = np.load('scores.npy')
slb.covplot(pvalues, symbols, 'Cointegration')

# Plot the z-scored price ratio of a cointegrated pair; indices 1, 5 and 9 are candidate pairs
pvalue = slb.cointplt(close_val, pairs[1], names, start_year, end_year, plot=True)
#%% Hurst coefficient of the selected pair
S1 = pairs[1, 0]; S2 = pairs[1, 1]
ratio = close_val[S1]/close_val[S2]
H = slb.hurstcoeff(close_val, pairs[1], names, plot=True)
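#%% Hedged addition (not part of the original script): interpret the estimated
# Hurst coefficient. H < 0.5 indicates an anti-persistent, mean-reverting series,
# H ~ 0.5 a random walk and H > 0.5 a trending series; pairs trading relies on
# the pair ratio being mean reverting.
if H < 0.5:
    print(f'H = {H:.2f} < 0.5: the pair ratio mean-reverts -> candidate for pairs trading')
else:
    print(f'H = {H:.2f} >= 0.5: no mean reversion detected in the pair ratio')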
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Pairs-Trading-as-application-of-the-Ornstein-Uhlenbeck-Process
A model simulation shows how pairs trading can be applied to two S&P500-traded stocks, and demonstrates that the strategy is profitable on real data, which is downloaded via the
[pandas-datareader](https://pandas-datareader.readthedocs.io/en/latest/). The [report and poster section](https://github.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/tree/master/Report%2BPoster) provides a thorough analysis of the Ornstein-Uhlenbeck process underlying pairs trading, together with the rigorous mathematical formulations.

Keywords: Stochastic differential equations, Ornstein-Uhlenbeck process, Pairs Trading, S&P500

### Work Flow & Manual:
1. Load stock data from [slib.py](https://github.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/blob/master/Code/slib.py) with **stockvals()**.
2. Determine two stocks suitable for the pairs trading strategy with **find_cointegrated_pairs()** in [CointHurst.py](https://github.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/blob/master/Code/CointHurst.py). Note that the output 'pairs' already suggests pairs of stocks whose cointegration test yields a significance level below 0.05. A full map of p-values for pairwise cointegration can be displayed with **covplot()**.
3. Use **pairstrade()** in [PairsTrade.py](https://github.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/blob/master/Code/PairsTrade.py) to execute the pairs trading strategy on the two stocks (id0, id1) under consideration. The parameters that determine the success of the strategy are defined in the **pairstrade()** input *param: f8[d,buy_signal,sell_signal]* - threshold parameters for the moving-average time window and the buy and sell signals. A graph that shows the buy and sell signals, together with the hypothetical profit of the strategy, can be displayed by setting *plot = True*.

### Getting started
```python
import pandas as pd
import pandas_datareader.data as web
from statsmodels.tsa.stattools import coint
import numpy as np
import matplotlib.pyplot as plt
import slib as slb  # custom library

#%% Load data: df is a dataframe of S&P500 constituents (columns 'Symbol', 'Security', ...);
# start_date and end_date are datetime objects
names, symbols, variation, close_val, open_val = slb.stockvals(df, start_date, end_date)
data = pd.DataFrame.from_records(close_val)

#%% Find suitable stocks for pairs trading
scores, pvalues, pairs = slb.find_cointegrated_pairs(close_val)
slb.covplot(pvalues, symbols, 'Cointegration')

#%% Perform trading strategy
start_year = 2010; end_year = 2020
id0, id1 = pairs[1]  # a cointegrated pair suggested by find_cointegrated_pairs()
d, buy_signal, sell_signal = 252, 0.05, 0.01
param = [d, buy_signal, sell_signal]

money, buy, sell, k, Y_tilde, t, S0, S1 = slb.pairstrade(param, data, id0, id1, names, start_year, end_year, plot=True)
```


### Map of pairwise cointegrated stocks

Cointegration
:-------------------------:
![Map of p-values for pairwise cointegration](Images/cointegration.png)

Map of p-values for pairwise cointegration. The statistical test checks how likely it is that the closing values of two stocks are cointegrated, i.e. that a linear combination of the two price series is mean reverting. The null hypothesis is no cointegration. Hence, low p-values suggest that the stochastic processes under consideration are mean reverting and can be described by an Ornstein-Uhlenbeck process; such pairs are suitable for the pairs trading strategy.
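As a quick, self-contained illustration (not part of the repository code), the snippet below constructs two synthetic price series that share a common random-walk component plus a stationary, Ornstein-Uhlenbeck-like spread, and applies the same `coint` test used by **find_cointegrated_pairs()**. The small p-value rejects the "no cointegration" null hypothesis:

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(0)
n = 2500  # roughly ten years of daily closes

common = 100 + np.cumsum(rng.normal(0, 1, n))  # shared random-walk component
spread = np.zeros(n)                           # stationary OU-like spread
for t in range(1, n):
    spread[t] = spread[t-1] - 0.05*spread[t-1] + rng.normal(0, 0.5)

S1 = common + spread
S2 = 0.8*common + rng.normal(0, 0.1, n)

score, pvalue, _ = coint(S1, S2)
print(f'Cointegration p-value: {pvalue:.4f}')  # small -> the pair is cointegrated
```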
### Buying-Selling Signal & Trading Performance

Buying-Selling Signal | Trade Performance
:-------------------------:|:-------------------------:
![Buy and sell signals](Images/buySell.png) | ![Trade performance](Images/trade_performance.png)

Buy and sell signals for two suitable S&P500 stocks (Adobe Inc. - Cintas Corporation) over a time period of 10 years (2010-2020).
The strategy started with 0 EUR and ended with 3000 EUR of hypothetical profit.
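The signal behind these plots is computed in **pairstrade()**: the pair ratio S0/S1 is compared with its d-day moving average k, and the log-deviation Y = log(S0/(k*S1)) is treated as the state of an Ornstein-Uhlenbeck process. Below is a condensed sketch of the decision rule (mirroring the logic in [slib.py](https://github.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/blob/master/Code/slib.py), with the position bookkeeping and transaction costs omitted):

```python
import numpy as np
import pandas as pd

def ou_signal(S0: pd.Series, S1: pd.Series, d: int = 252,
              buy_signal: float = 0.05, sell_signal: float = 0.01):
    """Condensed sketch of the decision rule used in slib.pairstrade()."""
    k = (S0/S1).rolling(window=d).mean()   # d-day moving average of the pair ratio
    Y = np.log(S0/(k*S1))                  # OU-implied deviation from k
    actions = pd.Series('hold', index=Y.index)
    actions[Y > buy_signal] = 'short the pair'         # S0 overvalued relative to S1
    actions[Y < -buy_signal] = 'long the pair'         # S0 undervalued relative to S1
    actions[Y.abs() < sell_signal] = 'clear position'  # ratio back to normal
    return Y, actions
```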
### How to contribute
Fork from the `Developer` branch and open a pull request to merge back into the original `Developer` branch.
Working updates and improvements will then be merged into the `Master` branch, which always contains the latest working version.

With:
* [Daniel Brus](https://www.linkedin.com/in/daniel-brus)

### Dependencies
[Numpy](https://numpy.org/),
[Matplotlib](https://matplotlib.org/),
[Pandas](https://pandas.pydata.org/), [pandas-datareader](https://pandas-datareader.readthedocs.io/en/latest/), [Statsmodels](https://www.statsmodels.org/stable/index.html),
[slib (custom)](https://github.com/david-alber/Pairs-Trading-as-application-to-the-Ornstein-Uhlenbeck-Process/blob/master/Code/slib.py)

--------------------------------------------------------------------------------
/Code/PairsTrade.py:
--------------------------------------------------------------------------------
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
import slib as slb
plt.close('all')
#%% Load data and split it into a training and a test set
start_year = 2010; end_year = 2020
end_yearTe = end_year
names = np.load('names.npy'); symbols = np.load('symbols.npy')
variation = np.load('variation.npy'); close_val = np.load('close_val.npy'); open_val = np.load('open_val.npy')
pairs = np.load('pairs.npy'); pvalues = np.load('pvalues.npy'); scores = np.load('scores.npy')

data = pd.DataFrame.from_records(close_val)
days = np.shape(data)[1]
train = 0.7
data_train = data.loc[:, 0:int(days*train)]
data_test = data.loc[:, int(days*train):]
end_year = int(start_year + (end_year-start_year)*train)  # redefine the end year according to the training samples
start_yearTe = end_year
id0 = pairs[1, 0]; id1 = pairs[1, 1]

#%%
def returnMaximizer(random_walks, function, n_param, data, id0, id1, start_year, end_year, d):
    """
    random_walks : f8; number of random restarts, i.e. different random initial guesses
    function     : function handle; objective to optimize with sp.optimize.minimize(..., method='Nelder-Mead')
    n_param      : f8; number of free parameters to optimize

    Returns: opt_object: .x -> optimal parameters; .fun -> objective evaluated at the optimal parameters
    """
    # Function-specific definitions
    d = int(d)

    minimum = 9e15
    for i in tqdm(range(0, random_walks), position=0, desc='Return Maximizer'):  # minimize for different initial guesses
        initial_guess = np.zeros(n_param)
        initial_guess[0] = np.random.uniform(0, 3)                 # initial guess for the buy threshold
        initial_guess[1] = np.random.uniform(0, initial_guess[0])  # initial guess for the sell threshold (below the buy threshold)
        optimizer_object = sp.optimize.minimize(function, initial_guess, method='Nelder-Mead')
        y = optimizer_object.fun
        if (y < minimum):  # keep the parameters if the objective is smaller than before
            minimum = y
            opt_object = optimizer_object
    print(f'Parameters that maximize returns: buy, sell = {opt_object.x}')
    return opt_object

def pairstrade_maxi(param):
    # Objective for the optimizer; uses the globals data, id0, id1 and d
    buy_signal, sell_signal = param[0], param[1]
    S0 = data.iloc[id0][:]; S1 = data.iloc[id1][:]

    k = (S0/S1).rolling(window=int(d), center=False).mean()
    Y_tilde = np.log(S0/(k*S1))[d:-1]  # model-implied OU value based on the deviation from the moving-average ratio k

    # Trade the pair based on the OU-implied over-/undervaluation
    t = np.linspace(start_year, end_year, int(len(S0)))
    S0 = S0[d:-1]; S1 = S1[d:-1]; k = k[d:-1]; t = t[d:-1]

    money = np.zeros(len(Y_tilde))  # start trading with no positions
    S0_held = np.zeros(len(Y_tilde))
    S1_held = np.zeros(len(Y_tilde))

    buy = []; sell = []
    for i in range(len(S0)):
        if Y_tilde.iloc[i] > buy_signal:  # short the pair when the OU state is up
            S0_held[i] = S0_held[i-1] - 1
            S1_held[i] = S1_held[i-1] + S0.iloc[i]/S1.iloc[i]
            money[i] = money[i-1] - 0.005 * 2 * S0.iloc[i]  # 50 bps transaction costs per leg

            buy.append((t[i], S1.iloc[i]))
            sell.append((t[i], S0.iloc[i]))

        elif Y_tilde.iloc[i] < -buy_signal:  # long the pair when the OU state is down
            S0_held[i] = S0_held[i-1] + 1
            S1_held[i] = S1_held[i-1] - S0.iloc[i]/S1.iloc[i]
            money[i] = money[i-1] - 0.005 * 2 * S0.iloc[i]

            buy.append((t[i], S0.iloc[i]))
            sell.append((t[i], S1.iloc[i]))

        elif abs(Y_tilde.iloc[i]) < sell_signal and S0_held[i-1] != 0:  # clear the position when the ratio is back to normal
            money[i] = money[i-1] + (S0_held[i-1]*S0.iloc[i] + S1_held[i-1]*S1.iloc[i]) * 0.995
            S0_held[i], S1_held[i] = 0, 0

            if S0_held[i-1] < 0:
                buy.append((t[i], S0.iloc[i]))
                sell.append((t[i], S1.iloc[i]))
            else:
                buy.append((t[i], S1.iloc[i]))
                sell.append((t[i], S0.iloc[i]))

        else:
            money[i] = money[i-1]
            S0_held[i] = S0_held[i-1]
            S1_held[i] = S1_held[i-1]
    ret = 9e10 - (money[-1])
    return ret  # large constant minus the final money: minimizing this maximizes the return

d = 252
n_param = 2
random_walks = 20
opt_object = returnMaximizer(random_walks, pairstrade_maxi, n_param, data, id0, id1, start_year, end_year, d)

#%% Scan the moving-average window length on the training and test data
buy_signal, sell_signal = opt_object.x[0], opt_object.x[1]
buy_signal, sell_signal = 0.05, 0.01  # manual override of the optimized thresholds

windows = np.linspace(10, 700, 100)
train_return = np.zeros(len(windows)); test_return = np.zeros(len(windows))
for i in range(0, len(windows)):
    d = int(windows[i])
    param = np.array([d, buy_signal, sell_signal])  # d, buy_signal, sell_signal
    # training data evaluation
    moneyTr, _, _, _, _, _, _, _ = slb.pairstrade(param, data_train, id0, id1, names, start_year, end_year, plot=False)
    # test data evaluation
    moneyTe, _, _, _, _, _, _, _ = slb.pairstrade(param, data_test, id0, id1, names, start_yearTe, end_yearTe, plot=False)

    train_return[i] = moneyTr[-1]
    test_return[i] = moneyTe[-1]

#%% Plot the window fit
plt.figure('FittingWindow')
plt.plot(windows, train_return, color='g', label='Training data')
plt.plot(windows, test_return, color='r', label='Test data')
plt.title('Window analysis')
plt.xlabel('Window length [days]'); plt.ylabel('Final money [EUR]')
plt.legend(); plt.show()
#%% Run the strategy with the best window found on the training data
buy_signal, sell_signal = opt_object.x[0], opt_object.x[1]
best_window = windows[np.argmax(train_return)]
param = np.array([best_window, buy_signal, sell_signal])
moneybest, _, _, _, _, _, _, _ = slb.pairstrade(param, data_train, id0, id1, names, start_year, end_year, plot=True)
#%% Reference run with the default parameters
best_window = 252
buy_signal, sell_signal = 0.05, 0.01
param = np.array([best_window, buy_signal, sell_signal])
moneybest, _, _, _, _, _, _, _ = slb.pairstrade(param, data_train, id0, id1, names, start_year, end_year, plot=True)
--------------------------------------------------------------------------------
/Code/slib.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 27 18:37:42 2020

@author: alber
"""


import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader.data as web
from matplotlib.collections import LineCollection
from sklearn import metrics
from sklearn.cluster import KMeans
import matplotlib.cm as cm
from tqdm import tqdm  # progress bar

from statsmodels.tsa.stattools import coint

def stockvals(df, start_date, end_date):
    """
    Converts opening/closing values of stocks from a pandas dataframe to numpy
    arrays of floats and converts the names to arrays of strings.
    IN:
    df : pandas dataframe of stocks (S&P500 constituents)
    start_date, end_date : datetime; start and end date
    RETURNS:
    names, symbols : str; stock names and ticker symbols
    variation : f8[n_samples,n_timesteps]; difference between opening and closing price
    close_val, open_val : f8[n_samples,n_timesteps]; closing and opening prices
    """
    # Convert pd dataframes to strings
    symbols, names = df.Symbol, df.Security
    symbols = symbols.to_numpy()
    symbols = symbols.astype(str)
    names = names.to_numpy()
    names = names.astype(str)
    start_date_int = datetime_to_integer(start_date)
    # Stocks under consideration (from the S&P500)
    n_stocks = len(symbols)
    # Opening/closing values of stocks (as float); keep only stocks listed before start_date
    indices = []; open_val = []; close_val = []
    for j in tqdm(range(0, n_stocks), position=0, desc='Loading Stock Data'):
        if j == 91:  # skip the stock at index 91
            continue
        date_string = (df.iloc[j][6]).replace('-', '')  # column 6 holds the date the stock was added to the index
        date_added = int(date_string[:8])
        if (date_added <= start_date_int):
            index = j
            indices = np.append(indices, index)
            quotes = web.DataReader(symbols[j], 'yahoo', start_date, end_date)
            opening = quotes.Open
            closing = quotes.Close
            open_val = np.append(open_val, opening, axis=0)
            close_val = np.append(close_val, closing, axis=0)
    open_val = open_val.reshape(len(indices), -1)
    close_val = close_val.reshape(len(indices), -1)
    variation = open_val - close_val
    return names[indices.astype(int)], symbols[indices.astype(int)], variation, close_val, open_val

def datetime_to_integer(dt_time):
    return 10000*dt_time.year + 100*dt_time.month + dt_time.day
def blocking(tt, vector, blocks):
    """
    Helper to process big data files by the method of block averaging.
    The first argument is a grid vector (e.g. a time grid), the second a data
    vector (e.g. averaged temperature or a stock price) and the third the
    number of blocks. The more blocks, the more data points are taken into
    consideration; the fewer blocks, the more averaging takes place.
    Out: blockvec - block-averaged vector
         blocktt  - timesteps according to the block-averaged data
         error    - statistical error of the mean, estimated from the blocks
         sigmaB   - standard deviation of the block averages
         bdata    - number of data points combined in one block
    """
    blockvec = np.zeros(blocks)
    elements = len(vector)
    rest = elements % blocks
    if rest != 0:  # truncate the vector if the blocks do not evenly divide it
        vector = vector[0:-rest]
        tt = tt[0:-rest]
        elements = len(vector)
    meanA = np.mean(vector)
    bdata = int((elements/blocks))  # how many points per block
    sigBsq = 0
    for k in range(0, blocks):
        blockvec[k] = np.average(vector[k*bdata : (k+1)*bdata])
        sigBsq = sigBsq + (blockvec[k]-meanA)**2
    sigBsq *= 1/(blocks-1)
    sigmaB = np.sqrt(sigBsq)
    error = 1/np.sqrt(blocks)*sigmaB
    blocktt = tt[0:-1:bdata]
    return (blockvec, blocktt, error, sigmaB, bdata)
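def _blocking_example():
    """Hedged usage sketch (not part of the original module): block-average a
    noisy series into 50 blocks and read off the statistical error of the mean."""
    tt = np.arange(10000)
    noisy = np.cumsum(np.random.randn(10000))  # random-walk toy data
    blockvec, blocktt, error, sigmaB, bdata = blocking(tt, noisy, 50)
    print(f'{bdata} points per block, error of the mean: {error:.3f}')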
def covplot(cov, symbols, title):
    """
    Plots a covariance matrix cov as a heatmap. The x/y tick labels are the input symbols.
    cov     : f8[n_samples,n_samples]; covariance matrix
    symbols : str[n_samples]; names of the x and y tick labels
    """
    fig, ax = plt.subplots(figsize=(12, 7))
    im = ax.imshow(cov)
    plt.colorbar(im, spacing='proportional')
    # We want to show all ticks...
    ax.set_xticks(np.arange(len(symbols)))
    ax.set_yticks(np.arange(len(symbols)))
    # ...and label them with the respective list entries
    ax.set_xticklabels(symbols)
    ax.set_yticklabels(symbols)
    # Rotate the tick labels and set their alignment
    plt.setp(ax.get_xticklabels(), rotation=60, ha="right",
             rotation_mode="anchor")

    ax.set_title(title)
    fig.tight_layout()
    plt.show()

def stockplot(X, stock, names, start_year, end_year, ylabel):
    """
    Plots stock values from a portfolio X from start_year to end_year.
    X      : f8[n_timesteps,n_stocks]; portfolio: open value/close value/variation of stocks
    stock  : i4; stock under consideration from the portfolio X
    names  : str[n_stocks]; names of the stocks in the portfolio
    start_year, end_year : f8; start and end year under consideration
    ylabel : str; label of what is plotted (open value/close value/variation)
    """
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    major_ticks = np.arange(start_year, end_year+1, 1)
    ax.set_xticks(major_ticks)
    ax.grid(which='both')
    plt.plot(np.linspace(start_year, end_year, len(X[:, stock])), X[:, stock])
    plt.xlabel('Time [years]'); plt.ylabel(ylabel + ' [USD]')
    plt.title(names[stock])
    plt.show()


def k_means_ana(data, max_n_clusters, plot):
    """
    Analyses the best number of clusters according to the average silhouette score.
    Silhouette coefficients near +1 indicate that the sample is far away from the
    neighboring clusters. A value of 0 indicates that the sample is on or very close
    to the decision boundary between two neighboring clusters, and negative values
    indicate that those samples might have been assigned to the wrong cluster.
    IN: data[n_samples,m_characteristics] = data to analyse
        max_n_clusters = integer - maximum number of clusters
        plot = boolean
    """
    # The range of clusters is 2..max_n_clusters
    range_n_clusters = np.arange(2, max_n_clusters+1)
    range_silhouette_avg = np.zeros(max_n_clusters-1)

    for n_clusters in range_n_clusters:  # loop over different numbers of clusters
        clusterer = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
        cluster_labels = clusterer.fit_predict(data)
        silhouette_avg = metrics.silhouette_score(data, cluster_labels)
        range_silhouette_avg[n_clusters-2] = silhouette_avg

    if plot == True:
        plt.figure()
        plt.plot(range_n_clusters, range_silhouette_avg, '*-')
        plt.title(f'Cluster Analysis for data with {len(data[1]):d} characteristics', fontsize=18)
        plt.xlabel('Number of clusters', fontsize=16)
        plt.ylabel('Average Silhouette Number', fontsize=16)
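def _k_means_ana_example():
    """Hedged usage sketch (not part of the original module): silhouette scan for
    three well-separated 2D blobs; the silhouette average should peak at 3 clusters."""
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
    k_means_ana(X, max_n_clusters=8, plot=True)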
def k_means_trafo(data, symbols, n_clusters, plot):
    """
    Computes the centers of K-Means clusters.
    If m_characteristics is 2 or 3 the clusters can be visualized.
    IN:  data[n_samples,m_characteristics] = data to analyse
         n_clusters = integer - number of clusters to find in the given dataset
    OUT: centers[n_centers,m_characteristics] = cluster centers for the given data
         cluster_labels[n_samples] = cluster labels for the data
    """
    # init: initialization method ('k-means++', 'random'); n_clusters: integer - number of cluster centroids;
    # n_init: integer - number of random restarts executed to optimize the result
    clusterer = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
    cluster_labels = clusterer.fit_predict(data)
    centers = clusterer.cluster_centers_

    if (plot == True and len(data[0]) <= 3):
        silhouette_avg = metrics.silhouette_score(data, cluster_labels)
        # Compute the silhouette scores for each sample
        sample_silhouette_values = metrics.silhouette_samples(data, cluster_labels)

        # Create a subplot with 1 row and 2 columns
        fig = plt.figure()
        ax1 = fig.add_subplot(1, 2, 1)  # plot: silhouette score
        fig.set_size_inches(13, 7)  # format size of subplots
        archColors = cm.plasma(np.arange(n_clusters).astype(float) / (n_clusters))  # shape: (n_clusters,4)
        dataColors = cm.plasma(cluster_labels.astype(float) / n_clusters)  # shape: (n_samples,4)
        # The 1st subplot is the silhouette plot.
        # The silhouette coefficient can range from -1 to 1, but here the range is -0.2 to 1.
        ax1.set_xlim([-0.2, 1])
        # The (n_clusters+1)*10 is for inserting blank space between the silhouette
        # plots of individual clusters, to demarcate them clearly.
        ax1.set_ylim([0, len(data) + (n_clusters + 1) * 10])
        y_lower = 10
        for i in range(n_clusters):
            # Aggregate the silhouette scores for samples belonging to
            # cluster i, and sort them
            ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
            ith_cluster_silhouette_values.sort()
            size_cluster_i = ith_cluster_silhouette_values.shape[0]
            y_upper = y_lower + size_cluster_i

            #color = cm.nipy_spectral(float(i) / n_clusters)
            ax1.fill_betweenx(np.arange(y_lower, y_upper),
                              0, ith_cluster_silhouette_values, facecolors=archColors[i,:], alpha=1)
            # Label the silhouette plots with their cluster numbers at the middle
            ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
            # Compute the new y_lower for the next plot
            y_lower = y_upper + 10  # 10 for the 0 samples

        # Labeling for subplot 1: silhouette scores
        ax1.set_title(f"Silhouette plot for various clusters. Silhouette Average = {silhouette_avg:.3f}"
                      , fontsize=18)
        ax1.set_xlabel("Silhouette coefficient values", fontsize=16)
        ax1.set_ylabel("Cluster label", fontsize=16)
        # The vertical line marks the average silhouette score of all the values
        ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

        # 2nd plot showing the actual clusters formed
        if (len(data[0]) == 2):
            ax2 = fig.add_subplot(1, 2, 2)
            # Plot data
            ax2.scatter(data[:,0], data[:,1], c=dataColors, alpha=0.6, s=20)
            # Plot cluster centers
            #ax2.scatter(centers[:, 0], centers[:, 1], c=archColors, marker='X', edgecolor='k', alpha=1, s=200)
            ax2.set_title("K-Means clusters", fontsize=18)
            ax2.set_xlabel('1st component', fontsize=16)
            ax2.set_ylabel('2nd component', fontsize=16)
            for i, txt in enumerate(symbols):
                ax2.annotate(txt, (data[i,0], data[i,1]))
            plt.tight_layout()
            plt.show()

        elif len(data[0]) == 3:
            ax2 = fig.add_subplot(1, 2, 2, projection='3d')
            # Plot data
            ax2.scatter(data[:,0], data[:,1], data[:,2], c=dataColors, alpha=0.6, s=5)
            # Plot cluster centers
            ax2.scatter(centers[:, 0], centers[:, 1], centers[:, 2], c=archColors, marker='X', edgecolor='k', alpha=1, s=200)
            ax2.set_title("K-Means clusters of PCA-reduced data.", fontsize=18)
            ax2.set_xlabel('1st pca component', fontsize=16)
            ax2.set_ylabel('2nd pca component', fontsize=16)
            ax2.set_zlabel('3rd pca component', fontsize=16)
            plt.tight_layout()
            plt.show()
    return centers, cluster_labels
def nodeplt(embedding, labels, symbols, partial_correlations, threshold):
    """
    Plots a map of the samples. The node positions are given by the 2D embedding
    of the covariance of the samples. The colour indicates the group to which a
    sample belongs; the edges represent the connectivity to the other nodes
    (partial correlation).
    IN:
    embedding : f8[2,n_samples]; a 2D embedding of the covariance matrix
    labels    : labels[n_samples]; cluster labels for the data
    symbols   : str[n_samples]; names of the stocks in the portfolio
    partial_correlations : f8[n_samples,n_samples]; partial correlation between samples (connectivity)
    threshold : f8; cut-off value for the partial correlation
    """
    x = embedding[0,:]; y = embedding[1,:]
    fig, ax = plt.subplots()
    ax.scatter(x, y, c=labels)
    for i, txt in enumerate(symbols):
        ax.annotate(txt, (x[i], y[i]))

    # Display a graph of the partial correlations
    non_zero = (np.abs(np.triu(partial_correlations, k=1)) > threshold)  # upper-triangular entries where the partial correlation exceeds the threshold
    # Plot the edges
    start_idx, end_idx = np.where(non_zero)
    # a sequence of (*line0*, *line1*, *line2*), where::
    #     linen = (x0, y0), (x1, y1), ... (xm, ym)
    segments = [[embedding[:, start], embedding[:, stop]]
                for start, stop in zip(start_idx, end_idx)]
    values = np.abs(partial_correlations[non_zero])
    lc = LineCollection(segments,
                        zorder=0, cmap=plt.cm.hot_r,
                        norm=plt.Normalize(0, .7 * values.max()))
    lc.set_array(values)
    lc.set_linewidths(15 * values)
    ax.add_collection(lc)
    axcb = fig.colorbar(lc)
    axcb.set_label('Partial Correlation')
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    plt.show()

def find_cointegrated_pairs(data):
    """
    Runs a pairwise cointegration test on all rows of data (f8[n_stocks,n_timesteps]).
    Returns the matrices of test scores and p-values, plus the list of index pairs
    with a p-value below 0.05.
    """
    n = data.shape[0]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    pairs = []
    for i in tqdm(range(n), position=0, desc='Cointegration tests'):
        for j in range(i+1, n):
            S1 = data[i]
            S2 = data[j]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((i, j))
    return score_matrix, pvalue_matrix, pairs

def zscore(X):
    return (X - X.mean()) / np.std(X)

def cointplt(data, pair, names, start_year, end_year, plot):
    """
    Plots the z-scored price ratio of a stock pair and reports the cointegration p-value.
    """
    index_s1 = pair[0]; index_s2 = pair[1]

    score, pvalue, _ = coint(data[index_s1], data[index_s2])
    ratio = data[index_s1] / data[index_s2]
    z_ratio = zscore(ratio)
    if plot == True:
        X = z_ratio

        fig = plt.figure()
        ax = fig.add_subplot(1, 1, 1)
        major_ticks = np.arange(start_year, end_year+1, 1)
        ax.set_xticks(major_ticks)
        ax.grid(which='both')

        plt.plot(np.linspace(start_year, end_year, len(X[:])), X[:])
        plt.axhline(z_ratio.mean(), color='black', ls='--')
        plt.axhline(1.0, color='red', ls='--')
        plt.axhline(-1.0, color='green', ls='--')

        plt.xlabel('Time [years]'); plt.ylabel('Ratio')
        plt.suptitle(names[pair[0]] + ' / ' + names[pair[1]])
        plt.title(f'Coint-Pvalue = {pvalue:.3}')
        plt.show()
    return pvalue
def hurstcoeff(close_val, pairs, names, plot):
    """
    Estimates the Hurst coefficient H of the price ratio of a stock pair from the
    scaling of the lagged differences: for a self-similar series the statistic
    tau(lag) computed below grows as lag^(H/2), so H is twice the slope of the
    log-log fit.
    """
    S1 = pairs[0]; S2 = pairs[1]
    ts = close_val[S1]/close_val[S2]
    lags = np.arange(2, 400, 1)
    tauvec = []
    for i in range(0, len(lags)):
        lag = lags[i]
        tau = np.sqrt(np.std(np.subtract(ts[lag:], ts[:-lag])))
        tauvec = np.append(tauvec, tau)

    poly = np.polyfit(np.log(lags), np.log(tauvec), 1)
    H = poly[0]*2  # polyfit orders the coefficients with the highest power first
    if plot == True:
        plt.figure()
        plt.scatter(np.log(lags), np.log(tauvec), color='black', s=2, label=r'$<|ln(S_{t+\tau})-ln(S_t)|^2>$')
        plt.plot(np.log(lags), poly[1]+np.log(lags)*0.5*H, color='green',
                 label=f'Linear fit: '+r'$f(\tau)$= '+f'{poly[1]:.2} + {poly[0]:.2}'+r'$\tau$')
        plt.xlabel(r'$ln(\tau)$')
        plt.suptitle(f'Estimation of the Hurst coeff. H = {H:.2}')
        plt.title(names[pairs[0]] + '/' + names[pairs[1]])
        plt.legend(); plt.show()
    return H
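def _hurst_sanity_check():
    """Hedged sanity check (not part of the original module): for a pure random
    walk the estimator in hurstcoeff() should return H close to 0.5."""
    n = 4000
    rw = np.cumsum(np.random.randn(n)) + 1000  # random-walk "price"
    dummy = np.ones(n)                         # constant second series, so the ratio equals rw
    return hurstcoeff(np.vstack([rw, dummy]), [0, 1], ['random walk', 'dummy'], plot=False)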
def pairstrade(param, data, id0, id1, names, start_year, end_year, plot):
    """
    Executes the pairs trading strategy on the stocks id0 and id1.
    param : f8[d,buy_signal,sell_signal]; moving-average window length and the
            buy/sell threshold parameters
    Returns the money time series and the buy/sell signals, together with the
    moving-average ratio k, the OU state Y_tilde, the time grid t and the two
    price series S0, S1.
    """
    S0 = data.iloc[id0][:]; S1 = data.iloc[id1][:]
    d, buy_signal, sell_signal = param[0], param[1], param[2]
    d = int(d)
    k = (S0/S1).rolling(window=d, center=False).mean()
    Y_tilde = np.log(S0/(k*S1))[d:-1]  # model-implied OU value based on the deviation from the moving-average ratio k

    # Trade the pair based on the OU-implied over-/undervaluation
    t = np.linspace(start_year, end_year, int(len(S0)))
    S0 = S0[d:-1]; S1 = S1[d:-1]; k = k[d:-1]; t = t[d:-1]

    money = np.zeros(len(Y_tilde))  # start trading with no positions
    S0_held = np.zeros(len(Y_tilde))
    S1_held = np.zeros(len(Y_tilde))

    buy = []; sell = []
    for i in range(len(S0)):
        if Y_tilde.iloc[i] > buy_signal:  # short the pair when the OU state is up
            S0_held[i] = S0_held[i-1] - 1
            S1_held[i] = S1_held[i-1] + S0.iloc[i]/S1.iloc[i]
            money[i] = money[i-1] - 0.005 * 2 * S0.iloc[i]  # 50 bps transaction costs per leg

            buy.append((t[i], S1.iloc[i]))
            sell.append((t[i], S0.iloc[i]))

        elif Y_tilde.iloc[i] < -buy_signal:  # long the pair when the OU state is down
            S0_held[i] = S0_held[i-1] + 1
            S1_held[i] = S1_held[i-1] - S0.iloc[i]/S1.iloc[i]
            money[i] = money[i-1] - 0.005 * 2 * S0.iloc[i]

            buy.append((t[i], S0.iloc[i]))
            sell.append((t[i], S1.iloc[i]))

        elif abs(Y_tilde.iloc[i]) < sell_signal and S0_held[i-1] != 0:  # clear the position when the ratio is back to normal
            money[i] = money[i-1] + (S0_held[i-1]*S0.iloc[i] + S1_held[i-1]*S1.iloc[i]) * 0.995
            S0_held[i], S1_held[i] = 0, 0

            if S0_held[i-1] < 0:
                buy.append((t[i], S0.iloc[i]))
                sell.append((t[i], S1.iloc[i]))
            else:
                buy.append((t[i], S1.iloc[i]))
                sell.append((t[i], S0.iloc[i]))

        else:
            money[i] = money[i-1]
            S0_held[i] = S0_held[i-1]
            S1_held[i] = S1_held[i-1]
    buy, sell = np.asarray(buy), np.asarray(sell)
    if plot == True and len(buy) != 0:
        size = (16, 8)
        # Stock pair ratio and its moving average
        fig = plt.figure(figsize=size)
        ax = fig.add_subplot(1, 1, 1)
        major_ticks = np.arange(start_year, end_year+1, 1)
        ax.set_xticks(major_ticks)
        ax.grid(which='both')
        plt.plot(t, S0/S1, 'black', label='current k')
        plt.plot(t, k, 'green', label=f'{int(param[0])}-day moving average')
        plt.title('Stock Pair Ratio')
        plt.xlabel('t [years]'); plt.ylabel('k')
        plt.legend(); plt.show()

        # Buy/sell signals
        fig = plt.figure(figsize=size)
        ax = fig.add_subplot(1, 1, 1)
        major_ticks = np.arange(start_year, end_year+1, 1)
        ax.set_xticks(major_ticks)
        ax.grid(which='both')
        plt.plot(t, S0.iloc[0:], 'black', label=names[id0]); plt.plot(t, S1.iloc[:], 'black', linestyle='dotted', label=names[id1])
        plt.scatter(buy[:,0], buy[:,1], color='g', linestyle='None', marker='^', label='Buy Signal')
        plt.scatter(sell[:,0], sell[:,1], color='r', linestyle='None', marker='^', label='Sell Signal')
        plt.title('Buy / Sell signal'); plt.xlabel('Time [years]'); plt.ylabel('Stock value [EUR]')
        plt.legend(); plt.show()

        # Money
        fig = plt.figure(figsize=size)
        ax = fig.add_subplot(1, 1, 1)
        major_ticks = np.arange(start_year, end_year+1, 1)
        ax.set_xticks(major_ticks)
        ax.grid(which='both')
        plt.plot(t, money, 'black')
        plt.title('Trade performance')
        plt.xlabel('Time [years]'); plt.ylabel('Money [EUR]')
        plt.show()

    return money, buy, sell, k, Y_tilde, t, S0, S1
--------------------------------------------------------------------------------