├── License.md
├── README.md
├── TSLA_2015-01-07_34200000_57600000_message_10_SAMPLE.csv
├── useful.py
├── rf-labels.py
├── TSLA_2015-01-07_34200000_57600000_orderbook_10_SAMPLE.csv
├── rf-calibration.py
└── rf-features.py

/License.md:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2020 Arnaud Amsellem
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Using random forest to model limit order book dynamics
2 | I use the random forest algorithm to forecast mid-price dynamics over a short time horizon, i.e. a few seconds ahead. This is of particular interest to market makers, who can skew their bid/ask spread in the direction of the most favorable outcome. Most if not all of the literature on the topic focuses on applying algorithms straight out of the box to produce a forecast at any point in time. The problem in a real-life environment is different. A market maker can quote a standard bid/ask spread most of the time and, only when she/he has a statistical edge, skew the spread in the direction given by the model. This is what I try to do here: creating a forecast only when a statistical edge exists.
3 | 
4 | I used the Python scikit-learn implementation of random forest. This GitHub repo contains the code, some sample data and the associated explanations (code comments). This repo is also associated with a post on my [blog](https://www.thertrader.com/). I'm happy for anyone to re-use my work as long as proper reference to it is made.
5 | 
6 | I want to thank [LOBSTER](https://lobsterdata.com/) for providing the dataset used here.
7 | 
8 | ## Code structure
9 | 
10 | The code is organised around 4 files. Each file contains a detailed explanation of what is done, along with some commented lines of code.
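A typical run order, assuming the hard-coded LOBSTER data layout under `/home/arno/work/research/lobster/data/` used in the scripts has been adapted to your machine (labels and features must both exist before calibration):

```
python3 rf-labels.py       # build the Y labels (10 and 20 seconds ahead)
python3 rf-features.py     # build the X features
python3 rf-calibration.py  # calibrate and test the random forest
```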
11 | 
12 | **rf-labels.py**: Definition of the labels used in the Random Forest model
13 | 
14 | **rf-features.py**: Definition of the features used in the Random Forest model
15 | 
16 | **rf-calibration.py**: Calibration and test of the Random Forest model
17 | 
18 | **useful.py**: Various useful functions used in the other 3 files (work in progress)
19 | 
20 | 
--------------------------------------------------------------------------------
/TSLA_2015-01-07_34200000_57600000_message_10_SAMPLE.csv:
--------------------------------------------------------------------------------
1 | 34200.007780658,4,8574201,23,2134000,-1
2 | 34200.332976723,1,8770149,100,2126400,1
3 | 34200.492462611,6,-1,19725,2133500,-1
4 | 34200.492462611,1,6756607,1752,2133500,1
5 | 34200.492462611,1,3146453,15,2135000,-1
6 | 34200.492462611,1,7252378,5,2132600,1
7 | 34200.492462611,1,5502065,1,2135000,-1
8 | 34200.492462611,1,6252826,6,2132500,1
9 | 34200.492462611,1,7200188,350,2137000,-1
10 | 34200.492462611,1,4151867,47,2131000,1
11 | 34200.492462611,1,3671133,50,2130000,1
12 | 34200.492462611,1,4015142,2,2130000,1
13 | 34200.492462611,1,4490967,8,2130000,1
14 | 34200.492462611,1,5203157,5,2130000,1
15 | 34200.492462611,1,5539893,60,2130000,1
16 | 34200.492462611,1,6823462,25,2130000,1
17 | 34200.492462611,1,7087997,14,2130000,1
18 | 34200.492462611,1,7323188,200,2130000,1
19 | 34200.492462611,1,2703149,6,2129000,1
20 | 34200.495578064,4,8574201,200,2134000,-1
21 | 34200.496060008,4,8574201,77,2134000,-1
22 | 34200.496149804,1,8814916,100,2133600,1
23 | 34200.496400481,1,8814964,200,2136900,-1
24 | 34200.496950332,1,8815112,200,2133700,1
25 | 34200.497854428,3,8814964,200,2136900,-1
26 | 34200.546253717,3,8814916,100,2133600,1
27 | 34200.559199502,1,8829525,78,2134000,-1
28 | 34200.580826018,4,8829525,78,2134000,-1
29 | 34200.930944949,1,8895561,3,2133700,1
30 | 34201.013532726,1,8908510,100,2135400,-1
31 | 34201.014941014,5,0,100,2133800,1
32 | 34201.149417591,3,8574203,200,2138700,-1
33 | 34201.220091181,5,0,14,2134600,1
34 | 34201.279265571,1,8947077,40,2130000,1
35 | 34201.62417489,5,0,25,2134600,1
36 | 34201.624181025,5,0,25,2134600,1
37 | 34201.624186296,5,0,20,2134600,1
38 | 34201.624446938,4,8815112,100,2133700,1
39 | 34201.624881166,1,8999485,100,2134600,-1
40 | 34201.625069898,3,8815112,100,2133700,1
41 | 34201.897971506,4,8895561,3,2133700,1
42 | 34201.898022274,1,9048906,100,2134400,-1
43 | 34201.898079405,3,8999485,100,2134600,-1
44 | 34201.920290897,4,6756607,900,2133500,1
45 | 34201.920340111,4,6756607,852,2133500,1
46 | 34201.920727407,1,9052854,400,2133500,-1
47 | 34201.948076997,3,9048906,100,2134400,-1
48 | 34201.954050118,1,9058469,100,2135300,-1
49 | 34201.962694145,1,9060027,100,2133300,-1
50 | 34201.96510169,3,9060027,100,2133300,-1
51 | 
--------------------------------------------------------------------------------
/useful.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | 
4 | """
5 | Various useful functions (work in progress)
6 | 
7 | @author: arno
8 | @Date: Mar 2020
9 | """
10 | import numpy as np
11 | 
12 | 
13 | # =============================================================================
14 | # Enhanced division by zero
15 | # Used for the Random Forest study to calculate trade intensity acceleration
16 | # =============================================================================
17 | def divisionByZero(x,y):
18 |     try:
19 |         return x/y
20 |     except ZeroDivisionError:
21 |         return 1
22 | 
23 | 
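# A quick usage sketch (hypothetical numbers): the trade intensity
# acceleration features divide two event counts, and an empty reference
# window would otherwise raise ZeroDivisionError.
#     divisionByZero(12, 4)  ->  3.0
#     divisionByZero(12, 0)  ->  1   (neutral ratio instead of an exception)
# NB: only scalar divisions raise ZeroDivisionError; NumPy arrays return
# inf/nan with a runtime warning instead, so this helper is meant for scalars.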
24 | # =============================================================================
25 | # Extract decision rules from Random Forest
26 | # Adapted from: https://stackoverflow.com/questions/50600290/how-extraction-decision-rules-of-random-forest-in-python
27 | # =============================================================================
28 | def getDecisionRules(rf):
29 |     decisionRules = []
30 |     for tree_idx, est in enumerate(rf.estimators_):
31 |         tree = est.tree_
32 |         assert tree.value.shape[1] == 1 # no support for multi-output
33 | 
34 |         # print('TREE: {}'.format(tree_idx))
35 |         # decisionRules.append('TREE: {}'.format(tree_idx))
36 | 
37 |         iterator = enumerate(zip(tree.children_left, tree.children_right, tree.feature, tree.threshold, tree.value))
38 |         for node_idx, data in iterator:
39 |             left, right, feature, th, value = data
40 | 
41 |             # left: index of left child (if any)
42 |             # right: index of right child (if any)
43 |             # feature: index of the feature to check
44 |             # th: the threshold to compare against
45 |             # value: values associated with classes
46 | 
47 |             # for a classifier, value[0] holds the per-class sample counts, so argmax is the class the node predicts
48 |             class_idx = np.argmax(value[0])
49 | 
50 |             if left == -1 and right == -1:
51 |                 # print('{} LEAF: return class={}'.format(node_idx, class_idx))
52 |                 decisionRules.append('{} LEAF: return class={}'.format(node_idx, class_idx))
53 |             else:
54 |                 # print('{} NODE: if feature[{}] < {} then next={} else next={}'.format(node_idx, feature, th, left, right))
55 |                 decisionRules.append('{} NODE: if feature[{}] < {} then next={} else next={}'.format(node_idx, feature, th, left, right))
56 | 
57 |     return decisionRules
58 | 
59 | 
60 | 
61 | 
62 | 
63 | 
64 | 
65 | 
66 | 
--------------------------------------------------------------------------------
/rf-labels.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | 
4 | """
5 | Definition of the labels used in the Random Forest model
6 | 
7 | Labels are defined as follows:
8 |     -1: Downward movement
9 |      0: Stationary
10 |     +1: Upward movement
11 | 
12 | The labels are defined over 2 time horizons:
13 |     - 10 seconds ahead
14 |     - 20 seconds ahead
15 | 
16 | @author: thertrader@gmail.com
17 | @Date: Mar 2020
18 | """
19 | 
20 | import pandas as pd
21 | import os
22 | import numpy as np
23 | from datetime import datetime
24 | 
25 | 
26 | # =============================================================================
27 | # 1 - BASIC PARAMETERS
28 | # =============================================================================
29 | os.chdir(r'/home/arno/work/research/lobster/data')
30 | 
31 | nlevels = 10
32 | 
33 | col = ['Ask Price ','Ask Size ','Bid Price ','Bid Size ']
34 | 
35 | theNames = []
36 | for i in range(1, nlevels + 1):
37 |     for j in col:
38 |         theNames.append(str(j) + str(i))
39 | 
40 | tickers = ['TSLA']
41 | 
42 | #----- Lists of messages files and order book files
43 | theFiles = []
44 | theOrderBookFiles = []
45 | theMessagesFiles = []
46 | 
47 | for tk in tickers:
48 |     # tk = 'INTC'
49 |     os.chdir(r"/home/arno/work/research/lobster/data/" + tk)
50 |     theFiles.extend(sorted(os.listdir()))
51 |     theOrderBookFiles.extend([sl for sl in theFiles if "_orderbook" in sl])
52 |     theMessagesFiles.extend([sk for sk in theFiles if "_message" in sk])
53 |     theFiles = []
54 | 
55 | 
56 | # =============================================================================
57 | # 2 - Y LABELS FUNCTION
58 | # =============================================================================
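# What labels() below computes, as implemented (h = fwdTimeLength, 10 or 20 sec):
#     midRtn(t) = mid(t) / mid(t + h) - 1
#     label(t)  = sign(midRtn(t)), i.e. +1, -1 or 0
# where mid() is the level-1 mid price, (Ask Price 1 + Bid Price 1) / 2, and
# start[j] indexes the order book row closest to h seconds ahead of row j.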
59 | # outputFileName = '_Y10Sec.csv'
60 | # fwdTimeLength = 10
61 | # f = 'TSLA_2015-01-02_34200000_57600000_orderbook_10.csv'
62 | os.chdir(r'/home/arno/work/research/lobster/data')
63 | def labels(outputFileName, fwdTimeLength):
64 | 
65 |     for f in theOrderBookFiles:
66 |         print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
67 | 
68 |         theName = f[0:15] + outputFileName
69 | 
70 |         if (f[0:4] == 'TSLA'):
71 |             os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
72 | 
73 |         theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
74 | 
75 |         #---- Use NumPy arrays, a loop and a reduced array size to speed things up considerably. This is how it's done:
76 |         mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
77 |         dfTimeStamp = mf['timeStamp']
78 |         timeStampInSeconds = dfTimeStamp.round(0) # used to define the max look-ahead window
79 |         maxLookAhead = sum(timeStampInSeconds.value_counts().iloc[:(fwdTimeLength + 2)]) # max number of rows within fwdTimeLength seconds; the + 2 keeps an extra second as a safety margin
80 |         timeStamp = mf['timeStamp'].to_frame().values
81 | 
82 |         start = []
83 | 
84 |         # start_time = time.time() # For profiling only
85 |         for i in range(len(timeStamp)):
86 |             if i < (len(timeStamp) - maxLookAhead):
87 |                 a = i + maxLookAhead
88 |                 bb = np.column_stack((np.array(dfTimeStamp.iloc[i:a].index),timeStamp[i:a,0]))
89 |                 theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] + fwdTimeLength)).argmin(),0]
90 |                 start.append(int(theIndexValue))
91 | 
92 |             elif i >= (len(timeStamp) - maxLookAhead):
93 |                 bb = np.column_stack((np.array(dfTimeStamp.iloc[i:].index),timeStamp[i:,0]))
94 |                 theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] + fwdTimeLength)).argmin(),0]
95 |                 start.append(int(theIndexValue))
96 | 
97 |         stop = list(range(len(timeStamp)))
98 |         # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
99 | 
100 |         theDataFile = f[0:15] + '_34200000_57600000_orderbook_10.csv'
101 |         df = pd.read_csv(theDataFile, names = theNames)
102 |         dfArray = df.values
103 | 
104 |         midRtn = np.array([(((dfArray[stop[j],0] + dfArray[stop[j],2])/2) / ((dfArray[start[j],0] + dfArray[start[j],2])/2)) - 1 for j in range(len(timeStamp))]) # columns 0 and 2 are Ask Price 1 and Bid Price 1
105 |         midRtn = np.where(midRtn > 0,1,np.where(midRtn < 0,-1,0))
106 |         dfToExport = pd.concat([pd.Series(timeStamp[:,0]),pd.Series(midRtn)],axis=1)
107 |         dfToExport.columns = ['timeStamp', 'label']
108 | 
109 |         dfToExport.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
110 | 
111 |         del dfToExport,midRtn,mf
112 | 
113 | # =============================================================================
114 | # 3 - RUN LABELS FUNCTION
115 | # =============================================================================
116 | labels(outputFileName = '_Y10Sec.csv', fwdTimeLength = 10)
117 | labels(outputFileName = '_Y20Sec.csv', fwdTimeLength = 20)
118 | 
119 | 
120 | 
121 | 
122 | 
123 | 
124 | 
125 | 
126 | 
--------------------------------------------------------------------------------
/TSLA_2015-01-07_34200000_57600000_orderbook_10_SAMPLE.csv:
--------------------------------------------------------------------------------
1 | 2134000,277,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2125300,300,2138800,100,2124800,80,2139400,50,2120000,30,2139900,100,2113200,100,2140000,350,2100000,17
2 | 
2134000,277,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80,2139900,100,2120000,30,2140000,350,2113200,100 3 | 2134000,277,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80,2139900,100,2120000,30,2140000,350,2113200,100 4 | 2134000,277,2133500,1752,2136400,53,2132500,19,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80,2140000,350,2120000,30 5 | 2134000,277,2133500,1752,2135000,15,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80,2139900,100,2120000,30 6 | 2134000,277,2133500,1752,2135000,15,2132600,5,2136400,53,2132500,19,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80 7 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,19,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80 8 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80 9 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80 10 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 11 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,65,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 12 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,67,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 13 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,75,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 14 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,80,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 15 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,140,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 16 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,165,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 17 | 
2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,179,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 18 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,379,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 19 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,379,2138000,1,2129000,6,2138700,200,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 20 | 2134000,77,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,379,2138000,1,2129000,6,2138700,200,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 21 | 2135000,16,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100,2139800,105,2126400,100 22 | 2135000,16,2133600,100,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 23 | 2135000,16,2133600,100,2136400,53,2133500,1752,2136900,250,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 24 | 2135000,16,2133700,200,2136400,53,2133600,100,2136900,250,2133500,1752,2137000,350,2132600,5,2137900,50,2132500,25,2138000,1,2131500,100,2138700,200,2131000,47,2138800,100,2130000,379,2139400,50,2129000,6,2139800,105,2128400,165 25 | 2135000,16,2133700,200,2136400,53,2133600,100,2136900,50,2133500,1752,2137000,350,2132600,5,2137900,50,2132500,25,2138000,1,2131500,100,2138700,200,2131000,47,2138800,100,2130000,379,2139400,50,2129000,6,2139800,105,2128400,165 26 | 2135000,16,2133700,200,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 27 | 2134000,78,2133700,200,2135000,16,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 28 | 2135000,16,2133700,200,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 29 | 2135000,16,2133700,203,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 30 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 31 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 32 | 
2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 33 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 34 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 35 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 36 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 37 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 38 | 2135000,16,2133700,103,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 39 | 2134600,100,2133700,103,2135000,16,2133500,1752,2135400,100,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,419,2138000,1,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 40 | 2134600,100,2133700,3,2135000,16,2133500,1752,2135400,100,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,419,2138000,1,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 41 | 2134600,100,2133500,1752,2135000,16,2132600,5,2135400,100,2132500,25,2136400,53,2131500,100,2136900,50,2131000,47,2137000,350,2130000,419,2137900,50,2129000,6,2138000,1,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 42 | 2134400,100,2133500,1752,2134600,100,2132600,5,2135000,16,2132500,25,2135400,100,2131500,100,2136400,53,2131000,47,2136900,50,2130000,419,2137000,350,2129000,6,2137900,50,2128400,165,2138000,1,2126800,100,2138800,100,2126400,100 43 | 2134400,100,2133500,1752,2135000,16,2132600,5,2135400,100,2132500,25,2136400,53,2131500,100,2136900,50,2131000,47,2137000,350,2130000,419,2137900,50,2129000,6,2138000,1,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 44 | 2134400,100,2133500,852,2135000,16,2132600,5,2135400,100,2132500,25,2136400,53,2131500,100,2136900,50,2131000,47,2137000,350,2130000,419,2137900,50,2129000,6,2138000,1,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 45 | 2134400,100,2132600,5,2135000,16,2132500,25,2135400,100,2131500,100,2136400,53,2131000,47,2136900,50,2130000,419,2137000,350,2129000,6,2137900,50,2128400,165,2138000,1,2126800,100,2138800,100,2126400,100,2139400,50,2124800,80 46 | 2133500,400,2132600,5,2134400,100,2132500,25,2135000,16,2131500,100,2135400,100,2131000,47,2136400,53,2130000,419,2136900,50,2129000,6,2137000,350,2128400,165,2137900,50,2126800,100,2138000,1,2126400,100,2138800,100,2124800,80 47 | 
2133500,400,2132600,5,2135000,16,2132500,25,2135400,100,2131500,100,2136400,53,2131000,47,2136900,50,2130000,419,2137000,350,2129000,6,2137900,50,2128400,165,2138000,1,2126800,100,2138800,100,2126400,100,2139400,50,2124800,80 48 | 2133500,400,2132600,5,2135000,16,2132500,25,2135300,100,2131500,100,2135400,100,2131000,47,2136400,53,2130000,419,2136900,50,2129000,6,2137000,350,2128400,165,2137900,50,2126800,100,2138000,1,2126400,100,2138800,100,2124800,80 49 | 2133300,100,2132600,5,2133500,400,2132500,25,2135000,16,2131500,100,2135300,100,2131000,47,2135400,100,2130000,419,2136400,53,2129000,6,2136900,50,2128400,165,2137000,350,2126800,100,2137900,50,2126400,100,2138000,1,2124800,80 50 | 2133500,400,2132600,5,2135000,16,2132500,25,2135300,100,2131500,100,2135400,100,2131000,47,2136400,53,2130000,419,2136900,50,2129000,6,2137000,350,2128400,165,2137900,50,2126800,100,2138000,1,2126400,100,2138800,100,2124800,80 51 | -------------------------------------------------------------------------------- /rf-calibration.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Calibration and test of the Random Forest model 6 | 7 | 1. Basic parameters 8 | 2. Features pre processing - Run it only once 9 | 3. Calibration & test 10 | 11 | @author: thertrader@gmail.com 12 | @Date: Mar 2020 13 | """ 14 | 15 | import os 16 | import sys 17 | import pandas as pd 18 | import numpy as np 19 | from datetime import datetime 20 | from sklearn.ensemble import RandomForestClassifier 21 | #from sklearn import metrics 22 | 23 | sys.path.append('/home/arno/work/programming/python') # add to the list of python paths 24 | import useful 25 | 26 | 27 | # ============================================================================= 28 | # 1 - Basic parameters 29 | # ============================================================================= 30 | tickers = ['TSLA'] 31 | 32 | #----- Lists of all files 33 | theFiles = [] 34 | for tk in tickers: 35 | os.chdir(r"/home/arno/work/research/lobster/data/" + tk) 36 | theFiles.extend(sorted(os.listdir())) 37 | 38 | #----- Unique dates 39 | theDates = [] 40 | for i in range(0,len(theFiles)): 41 | theDates.append(theFiles[i][5:15]) 42 | 43 | theDates = sorted(list(set(theDates))) 44 | 45 | #--- To limit the effect of Open & Close auctions I removed the first and last 15 minutes. Not sure what the real impact is... 
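# For reference, LOBSTER timestamps are seconds after midnight: the trading day
# runs from 34200 sec (9:30) to 57600 sec (16:00), and the cut-offs below work
# out as 35100 / 3600 = 9.75 -> 9:45 and 56700 / 3600 = 15.75 -> 15:45.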
46 | top = 35100     # 9:45, i.e. 15 min after the open
47 | bottom = 56700  # 15:45, i.e. 15 min before the close
48 | 
49 | 
50 | # =============================================================================
51 | # 2 - Features pre-processing - Run it only once
52 | # =============================================================================
53 | # tk = 'TSLA'
54 | # f = 'TSLA_2015-01-02_34200000_57600000_orderbook_10.csv'
55 | # m = '2015-01-05'
56 | # g = 'TSLA_2015-01-02_Y10Sec.csv'
57 | os.chdir(r'/home/arno/work/research/lobster/data/features/')
58 | for tk in tickers:
59 |     for m in theDates:
60 |         print(m + ' **** ' + datetime.now().strftime("%H:%M:%S")) # for prototyping only
61 |         thisStockFiles = [f for f in os.listdir() if (f[:4] == tk and f[5:15] == m)]
62 |         xFiles = sorted([f for f in thisStockFiles if ((f[16] != 'Y') and (f[16:36] != 'relativeIntensity10s') and (f[16:37] != 'relativeIntensity900s') and (f[16:28] != 'startAndStop'))])
63 |         yFiles = sorted([f for f in thisStockFiles if f[16] == 'Y'])
64 | 
65 |         #--- Define start, stop lines and index to keep
66 |         dd = pd.read_csv(yFiles[0])
67 |         timeStamp = dd['timeStamp']
68 |         t = abs(timeStamp - top).idxmin()
69 |         b = abs(timeStamp - bottom).idxmin()
70 | 
71 |         #--- Y variables
72 |         yVar = pd.DataFrame(dtype = 'int')
73 |         for g in yFiles:
74 |             dt = pd.read_csv(g)
75 |             dt = dt.drop(columns = ['timeStamp'])
76 |             yVar = pd.concat([yVar, dt], axis = 1)
77 |         yVar = yVar.iloc[t:(b+1),:]
78 |         yVar.columns = ['midRtn_10','midRtn_20']
79 |         yVar = yVar.astype(int)
80 | 
81 |         #--- X variables
82 |         xVar = pd.DataFrame()
83 |         for f in xFiles:
84 |             dt = pd.read_csv(f)
85 |             xVar = pd.concat([xVar, dt], axis = 1)
86 | 
87 |         xVar = xVar.drop(columns = ['timeStamp','decEventType_2','levelEventType_2','relIntEventType_2'])
88 |         xVar = xVar[[f for f in xVar.columns if (f[-1] in list('12345') or f[:3] == 'acc' or f[:3] == 'ave')]] # Drop order book levels > 5
89 |         xVar = xVar.iloc[t:(b+1),:] # remove the first and last 15 min
90 |         print(m + ' **** '+ str(len(list(xVar.columns))))
91 | 
92 |         yVar.to_csv(r'/home/arno/work/research/lobster/data/calibration/yVar_' + tk + '_' + m + '.csv', header = True, index = False)
93 |         xVar.to_csv(r'/home/arno/work/research/lobster/data/calibration/xVar_' + tk + '_' + m + '.csv', header = True, index = False)
94 | 
95 |         del xVar,yVar
96 | 
97 | 
98 | # =============================================================================
99 | # 3 - Calibration & test
100 | # =============================================================================
101 | #--- Create RFC
102 | rfc = RandomForestClassifier(
103 |         n_jobs = 4,              # number of processors it may use; -1 means no restriction, 1 means a single processor
104 |         n_estimators = 200,      # the number of trees in the forest
105 |         min_samples_leaf = 200,  # minimum number of observations (i.e. samples) in a terminal leaf
106 |         min_samples_split = 300, # minimum number of samples required to split an internal node
107 |         oob_score = True,        # out-of-bag score: each tree is validated on the samples left out of its bootstrap draw, giving a built-in cross-validation estimate
108 |         max_depth = None,        # the maximum depth of the trees
109 |         verbose = 1,             # to check the progress of the estimation
110 |         max_features = 'sqrt')   # the number of features to consider when looking for the best split; None = no limit
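# Calibration scheme (a summary of the loop below): for each date i, fit the
# forest on the pooled features/labels of the previous lookBackPeriod (4)
# trading days, then test out-of-sample on day i itself, i.e. a walk-forward
# with daily re-estimation.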
111 | 
112 | 
113 | #--- Set parameters to run the model
114 | os.chdir(r'/home/arno/work/research/lobster/data/calibration')
115 | listOfFiles = os.listdir()
116 | 
117 | tk = 'TSLA'
118 | th = 0.50
119 | lookBackPeriod = 4
120 | indepVar = 'midRtn_10'
121 | 
122 | gatherResults = pd.DataFrame(index = range((20-lookBackPeriod)),
123 |                              columns = ['date','IS-HR-U','#obs.1','IS-HR-S','#obs.2','IS-HR-D','#obs.3','OOS-HR-U','#obs.4','OOS-HR-S','#obs.5','OOS-HR-D','#obs.6',
124 |                                         'IS-HR-U55','#obs.155','IS-HR-S55','#obs.255','IS-HR-D55','#obs.355','OOS-HR-U55','#obs.455','OOS-HR-S55','#obs.555','OOS-HR-D55','#obs.655',
125 |                                         'IS-HR-U60','#obs.160','IS-HR-S60','#obs.260','IS-HR-D60','#obs.360','OOS-HR-U60','#obs.460','OOS-HR-S60','#obs.560','OOS-HR-D60','#obs.660',
126 |                                         'overallScore'])
127 | 
128 | #--- Run the model
129 | for i in range(len(theDates)):
130 |     if (i >= lookBackPeriod):
131 |         print(theDates[i] + ' **** ' + datetime.now().strftime("%H:%M:%S")) # for prototyping only
132 |         for tk in tickers:
133 |             outputName = tk + '_' + theDates[i] + '.csv'
134 | 
135 |             #--- Define features & labels calibration dataset (4 days)
136 |             x = pd.DataFrame()
137 |             xFiles = sorted([f for f in listOfFiles if f[5:9] == tk and f[0] == 'x'])
138 |             xFiles = xFiles[(i-lookBackPeriod):i]
139 |             for g in xFiles:
140 |                 dt = pd.read_csv(g)
141 |                 x = pd.concat([x, dt], axis = 0)
142 | 
143 |             y = pd.DataFrame()
144 |             yFiles = sorted([f for f in listOfFiles if f[5:9] == tk and f[0] == 'y'])
145 |             yFiles = yFiles[(i-lookBackPeriod):i]
146 |             for h in yFiles:
147 |                 du = pd.read_csv(h)
148 |                 y = pd.concat([y, du], axis = 0)
149 | 
150 |             #--- Define features & labels test dataset (1 day)
151 |             xTestFile = 'xVar_' + tk + '_' + theDates[i] + '.csv'
152 |             xTest = pd.read_csv(xTestFile)
153 | 
154 |             yTestFile = 'yVar_' + tk + '_' + theDates[i] + '.csv'
155 |             yTest = pd.read_csv(yTestFile)
156 | 
157 |             #--- Random forest estimation
158 |             theFit = rfc.fit(x.values, y[indepVar].values)
159 | 
160 |             #---- Extract decision rules from the in-sample estimation
161 |             decisionsRules = pd.Series(useful.getDecisionRules(theFit))
162 |             decisionsRules.to_csv(r'/home/arno/work/research/lobster/results/decisionsRules_' + outputName, header = False, index = False)
163 | 
164 |             #---- Feature importance ranked by score
165 |             featureImportance = pd.concat([pd.Series(x.columns),pd.Series(theFit.feature_importances_)], axis = 1)
166 |             featureImportance.columns = ['feature','score']
167 |             featureImportance = featureImportance.sort_values(by=['score'], ascending=False)
168 |             featureImportance.to_csv(r'/home/arno/work/research/lobster/results/featuresImportance_' + outputName, header = True, index = False)
169 | 
170 |             #---- Calibration diagnostic
171 |             featureOverallScore = theFit.score(x.values, y[indepVar].values)
172 | 
173 |             #---- Calibration: Hit Ratio per label
174 |             pp = theFit.predict_proba(x) # class probabilities; see the note below on column ordering
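# Note on predict_proba (a reading aid, not new logic): scikit-learn orders the
# probability columns by rfc.classes_, i.e. the sorted labels [-1, 0, 1]. So
# below, column 0 = P(label = -1), column 1 = P(label = 0) and
# column 2 = P(label = +1), which is what the hit-ratio indexing assumes.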
175 | newDt = np.concatenate([pp,y],axis= 1) 176 | newDt = newDt[:,[0,1,2,3]] 177 | 178 | #---- Test: Hit Ratio per label 179 | qq = theFit.predict_proba(xTest) 180 | newDtTest = np.concatenate([qq,yTest],axis= 1) 181 | newDtTest = newDtTest[:,[0,1,2,3]] 182 | 183 | #--- Store results 184 | k = i - lookBackPeriod 185 | gatherResults.iloc[k,0] = int(theDates[i][:4] + theDates[i][5:7] + theDates[i][8:10]) 186 | #--- th = 0.5 187 | gatherResults.iloc[k,1] = np.where((newDt[:,0] > th) & (newDt[:,3] == -1),1,0).sum()/np.where(newDt[:,0] > th,1,0).sum() 188 | gatherResults.iloc[k,2] = np.where(newDt[:,0] > th,1,0).sum() 189 | gatherResults.iloc[k,3] = np.where((newDt[:,1] > th) & (newDt[:,3] == 0),1,0).sum()/np.where(newDt[:,1] > th,1,0).sum() 190 | gatherResults.iloc[k,4] = np.where(newDt[:,1] > th,1,0).sum() 191 | gatherResults.iloc[k,5] = np.where((newDt[:,2] > th) & (newDt[:,3] == 1),1,0).sum()/np.where(newDt[:,2] > th,1,0).sum() 192 | gatherResults.iloc[k,6] = np.where(newDt[:,2] > th,1,0).sum() 193 | 194 | gatherResults.iloc[k,7] = np.where((newDtTest[:,0] > th) & (newDtTest[:,3] == -1),1,0).sum()/np.where(newDtTest[:,0] > th,1,0).sum() 195 | gatherResults.iloc[k,8] = np.where(newDtTest[:,0] > th,1,0).sum() 196 | gatherResults.iloc[k,9] = np.where((newDtTest[:,1] > th) & (newDtTest[:,3] == 0),1,0).sum()/np.where(newDtTest[:,1] > th,1,0).sum() 197 | gatherResults.iloc[k,10] = np.where(newDtTest[:,1] > th,1,0).sum() 198 | gatherResults.iloc[k,11] = np.where((newDtTest[:,2] > th) & (newDtTest[:,3] == 1),1,0).sum()/np.where(newDtTest[:,2] > th,1,0).sum() 199 | gatherResults.iloc[k,12] = np.where(newDtTest[:,2] > th,1,0).sum() 200 | 201 | #--- th = 0.55 202 | gatherResults.iloc[k,13] = np.where((newDt[:,0] > (th+0.05)) & (newDt[:,3] == -1),1,0).sum()/np.where(newDt[:,0] > (th+0.05),1,0).sum() 203 | gatherResults.iloc[k,14] = np.where(newDt[:,0] > (th+0.05),1,0).sum() 204 | gatherResults.iloc[k,15] = np.where((newDt[:,1] > (th+0.05)) & (newDt[:,3] == 0),1,0).sum()/np.where(newDt[:,1] > (th+0.05),1,0).sum() 205 | gatherResults.iloc[k,16] = np.where(newDt[:,1] > (th+0.05),1,0).sum() 206 | gatherResults.iloc[k,17] = np.where((newDt[:,2] > (th+0.05)) & (newDt[:,3] == 1),1,0).sum()/np.where(newDt[:,2] > (th+0.05),1,0).sum() 207 | gatherResults.iloc[k,18] = np.where(newDt[:,2] > (th+0.05),1,0).sum() 208 | 209 | gatherResults.iloc[k,19] = np.where((newDtTest[:,0] > (th+0.05)) & (newDtTest[:,3] == -1),1,0).sum()/np.where(newDtTest[:,0] > (th+0.05),1,0).sum() 210 | gatherResults.iloc[k,20] = np.where(newDtTest[:,0] > (th+0.05),1,0).sum() 211 | gatherResults.iloc[k,21] = np.where((newDtTest[:,1] > (th+0.05)) & (newDtTest[:,3] == 0),1,0).sum()/np.where(newDtTest[:,1] > (th+0.05),1,0).sum() 212 | gatherResults.iloc[k,22] = np.where(newDtTest[:,1] > (th+0.05),1,0).sum() 213 | gatherResults.iloc[k,23] = np.where((newDtTest[:,2] > (th+0.05)) & (newDtTest[:,3] == 1),1,0).sum()/np.where(newDtTest[:,2] > (th+0.05),1,0).sum() 214 | gatherResults.iloc[k,24] = np.where(newDtTest[:,2] > (th+0.05),1,0).sum() 215 | 216 | #--- th = 0.6 217 | gatherResults.iloc[k,25] = np.where((newDt[:,0] > (th+0.1)) & (newDt[:,3] == -1),1,0).sum()/np.where(newDt[:,0] > (th+0.1),1,0).sum() 218 | gatherResults.iloc[k,26] = np.where(newDt[:,0] > (th+0.1),1,0).sum() 219 | gatherResults.iloc[k,27] = np.where((newDt[:,1] > (th+0.1)) & (newDt[:,3] == 0),1,0).sum()/np.where(newDt[:,1] > (th+0.1),1,0).sum() 220 | gatherResults.iloc[k,28] = np.where(newDt[:,1] > (th+0.1),1,0).sum() 221 | gatherResults.iloc[k,29] = np.where((newDt[:,2] > 
(th+0.1)) & (newDt[:,3] == 1),1,0).sum()/np.where(newDt[:,2] > (th+0.1),1,0).sum()
222 |             gatherResults.iloc[k,30] = np.where(newDt[:,2] > (th+0.1),1,0).sum()
223 | 
224 |             gatherResults.iloc[k,31] = np.where((newDtTest[:,0] > (th+0.1)) & (newDtTest[:,3] == -1),1,0).sum()/np.where(newDtTest[:,0] > (th+0.1),1,0).sum()
225 |             gatherResults.iloc[k,32] = np.where(newDtTest[:,0] > (th+0.1),1,0).sum()
226 |             gatherResults.iloc[k,33] = np.where((newDtTest[:,1] > (th+0.1)) & (newDtTest[:,3] == 0),1,0).sum()/np.where(newDtTest[:,1] > (th+0.1),1,0).sum()
227 |             gatherResults.iloc[k,34] = np.where(newDtTest[:,1] > (th+0.1),1,0).sum()
228 |             gatherResults.iloc[k,35] = np.where((newDtTest[:,2] > (th+0.1)) & (newDtTest[:,3] == 1),1,0).sum()/np.where(newDtTest[:,2] > (th+0.1),1,0).sum()
229 |             gatherResults.iloc[k,36] = np.where(newDtTest[:,2] > (th+0.1),1,0).sum()
230 |             gatherResults.iloc[k,37] = featureOverallScore
231 | 
232 | gatherResults.to_csv(r'/home/arno/work/research/lobster/results/results.csv', header = True, index = False)
233 | 
234 | 
235 | 
236 | 
237 | 
238 | 
239 | 
240 | 
241 | 
242 | 
243 | 
244 | 
245 | 
246 | 
247 | 
248 | 
249 | 
250 | 
251 | 
--------------------------------------------------------------------------------
/rf-features.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | 
4 | """
5 | Definition of the features used in the random forest model
6 | 
7 | 1. Basic parameters
8 | 2. Features definition
9 |     - 2.1 START & STOP TIMESTAMPS: Run once per file. For each row, find the timestamp 5 sec back. This is used for the decile definitions.
10 |     - 2.2 IMBALANCE: Absolute levels & decile ranking compared to the last 5 min
11 |     - 2.3 MID PRICE, B/A PRICE SPREAD, B/A VOLUME SPREAD: Absolute levels & decile ranking compared to the last 5 min
12 |     - 2.4 PRICE DIFFERENCES ACROSS LEVELS: Bid & Ask differences between order book levels (1 to 10). Absolute levels & decile ranking compared to the last 5 min
13 |     - 2.5 PRICE AVERAGE & VOLUME AVERAGE: Mean calculated across order book levels. Decile ranking compared to the last 5 min
14 |     - 2.6 ACCUMULATED DIFFERENCES PRICE & VOLUME: Sum of all bids minus sum of all asks across levels. Decile ranking compared to the last 5 min
15 |     - 2.7 BID, ASK, BID SIZE & ASK SIZE DERIVATIVES: Last value compared to 1 sec ago. Decile ranking compared to the last 5 sec
16 |     - 2.8 AVERAGE TRADE INTENSITY: Count of events of type 1,2,3 (see definitions below) over the last second and decile ranking compared to the last 5 min
17 |     - 2.9 RELATIVE TRADE INTENSITY 10s: Count of events of type 1,2,3 over the last 10 sec (intermediary step)
18 |     - 2.10 RELATIVE TRADE INTENSITY 900s: Count of events of type 1,2,3 over the last 900 sec (intermediary step)
19 |     - 2.11 RELATIVE TRADE INTENSITY: Relative count of events 10 sec/900 sec. Decile ranking compared to the last 5 min
20 | 
21 | EVENT TYPE (see the LOBSTER website - [https://lobsterdata.com/](https://lobsterdata.com/)):
22 |     1. New limit order
23 |     2. Partial deletion of a limit order
24 |     3. Total cancellation of a limit order
25 | 
26 | 
27 | @author: thertrader@gmail.com
28 | @Date: Mar 2020
29 | """
30 | 
31 | import os
32 | import pandas as pd
33 | import numpy as np
34 | from scipy import stats
35 | from datetime import datetime
36 | 
37 | 
38 | # =============================================================================
39 | # 1 - Basic parameters
40 | # =============================================================================
41 | os.chdir(r'/home/arno/work/research/lobster/data')
42 | 
43 | tickers = ['TSLA']
44 | 
45 | nlevels = 10
46 | 
47 | col = ['Ask Price ','Ask Size ','Bid Price ','Bid Size ']
48 | 
49 | theNames = []
50 | for i in range(1, nlevels + 1):
51 |     for j in col:
52 |         theNames.append(str(j) + str(i))
53 | 
54 | #----- Lists of messages files and order book files
55 | theFiles = []
56 | theOrderBookFiles = []
57 | theMessagesFiles = []
58 | 
59 | for tk in tickers:
60 |     # tk = 'INTC'
61 |     os.chdir(r"/home/arno/work/research/lobster/data/" + tk)
62 |     theFiles.extend(sorted(os.listdir()))
63 |     theOrderBookFiles.extend([sl for sl in theFiles if "_orderbook" in sl])
64 |     theMessagesFiles.extend([sk for sk in theFiles if "_message" in sk])
65 |     theFiles = []
66 | 
67 | 
68 | # =============================================================================
69 | # 2 - Features definition
70 | # =============================================================================
71 | #----- 2.1 Define Start & Stop timestamps for all files (run it once per file)
72 | for f in theOrderBookFiles:
73 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
74 | 
75 |     theName = f[0:15] + '_startAndStop.csv'
76 | 
77 |     if (f[0:4] == 'INTC'):
78 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
79 | 
80 |     if (f[0:4] == 'TSLA'):
81 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
82 | 
83 |     dt = pd.read_csv(f, names = theNames)
84 |     #dt.head(5)
85 | 
86 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
87 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
88 |     dfTimeStamp = mf['timeStamp']
89 |     theIndex = dfTimeStamp.index.values
90 |     timeStampInSeconds = dfTimeStamp.round(0)
91 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:6])
92 |     timeStamp = dfTimeStamp.to_frame().values
93 | 
94 |     start = []
95 | 
96 |     for j in range(len(timeStamp)):
97 |         if (j == 0):
98 |             start.append(j)
99 | 
100 |         elif (j <= maxLookBack):
101 |             theIndexValue = int(abs(timeStamp[:j] - (timeStamp[j] - 5)).argmin())
102 |             start.append(theIndexValue)
103 | 
104 |         elif (j > maxLookBack):
105 |             a = j - maxLookBack
106 |             bb = np.column_stack((theIndex[a:j],timeStamp[a:j]))
107 |             theIndexValue = int(bb[abs(bb[:,1] - (timeStamp[j] - 5)).argmin(),0])
108 |             start.append(theIndexValue)
109 | 
110 |     stop = list(range(len(timeStamp)))
111 | 
112 |     startAndStop = pd.concat([dfTimeStamp, pd.Series(start), pd.Series(stop)], axis= 1)
113 |     startAndStop.columns = ['timeStamp','start','stop']
114 | 
115 |     startAndStop.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
116 | 
117 | 
118 | 
119 | #----- 2.2 Imbalance Level and Derivative per level
120 | for f in theOrderBookFiles:
121 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
122 | 
123 |     theNameLevel = f[0:15] + '_imbalanceLevel_' + str(nlevels) + '.csv'
124 |     theNameDerivative = f[0:15] + 
'_imbalanceDerivative_' + str(nlevels) + '.csv'
125 | 
126 |     if (f[0:4] == 'INTC'):
127 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
128 | 
129 |     if (f[0:4] == 'TSLA'):
130 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
131 | 
132 |     #--- Open Order Book file
133 |     theOrderBookFile = pd.read_csv(f, names = theNames)
134 | 
135 |     #--- Open Messages file
136 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
137 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
138 |     timeStamp = mf['timeStamp'].to_frame().values
139 | 
140 |     #--- Open Start & Stop file
141 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
142 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
143 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
144 |     start = np.array(startStopFile['start'])
145 |     stop = np.array(startStopFile['stop'])
146 | 
147 |     imbLevel = pd.DataFrame()
148 |     imbDerivative = pd.DataFrame()
149 | 
150 |     for i in range(1,nlevels+1):
151 |         #--- Define Imbalance
152 |         nameOne = 'Bid Size '+ str(i)
153 |         nameTwo = 'Ask Size '+ str(i)
154 |         colName = 'imb '+ str(i)
155 |         lev = (theOrderBookFile[nameTwo] - theOrderBookFile[nameOne])/(theOrderBookFile[nameTwo] + theOrderBookFile[nameOne])
156 |         theLevel = round(10 * (lev - lev.min()) / (lev.max() - lev.min()),0) # 0-10 bucket via min-max scaling over the whole day
157 |         lev = np.array(lev)
158 | 
159 |         #--- Define Imbalance Derivative
160 |         # print('**** ' + datetime.now().strftime("%H:%M:%S"))
161 |         theDerivative = [round(stats.percentileofscore(lev[start[k]:k],lev[k])/10,0) for k in range(len(timeStamp))] # Deciles
162 |         # print('**** ' + datetime.now().strftime("%H:%M:%S"))
163 | 
164 |         imbLevel = pd.concat([imbLevel,theLevel],axis=1)
165 |         imbDerivative = pd.concat([imbDerivative,pd.Series(theDerivative)],axis=1)
166 | 
167 |     imbLevel.columns = ['imbLevel1','imbLevel2','imbLevel3','imbLevel4','imbLevel5','imbLevel6','imbLevel7','imbLevel8','imbLevel9','imbLevel10']
168 |     imbLevel.to_csv(r'/home/arno/work/research/lobster/data/features/' + theNameLevel, header = True, index = False)
169 | 
170 |     imbDerivative.columns = ['imbDer1','imbDer2','imbDer3','imbDer4','imbDer5','imbDer6','imbDer7','imbDer8','imbDer9','imbDer10']
171 |     imbDerivative.to_csv(r'/home/arno/work/research/lobster/data/features/' + theNameDerivative, header = True, index = False)
172 | 
173 |     os.chdir(r'/home/arno/work/research/lobster/data')
174 | 
175 | 
176 | 
177 | #----- 2.3 Mid price, Spread, Volume Spread derivatives
178 | for f in theOrderBookFiles:
179 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
180 |     theName = f[0:15] + '_misc_' + str(nlevels) + '.csv'
181 | 
182 |     if (f[0:4] == 'INTC'):
183 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
184 | 
185 |     if (f[0:4] == 'TSLA'):
186 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
187 | 
188 |     dt = pd.read_csv(f, names = theNames)
189 | 
190 |     #--- Open Start & Stop file
191 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
192 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
193 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
194 |     start = np.array(startStopFile['start'])
195 |     stop = np.array(startStopFile['stop'])
196 | 
197 |     aa = pd.DataFrame()
198 |     newNames = []
199 | 
200 |     for i in range(1,nlevels+1):
201 |         newNames.extend(['midDer'+ str(i),'spreadDer'+ str(i),'volumeDer'+ str(i)])
202 | 
203 |         nameOne = 'Bid Price '+ str(i)
204 |         nameTwo = 'Ask Price '+ str(i)
205 |         nameThree = 'Bid Size '+ str(i)
206 |         nameFour = 'Ask Size '+ str(i)
207 | 
208 |         #--- Define Levels
209 |         mid = np.array((dt[nameTwo] + dt[nameOne])/2)
210 |         spread = np.array(dt[nameTwo] - dt[nameOne])
211 |         volumeSpread = np.array(dt[nameFour] - dt[nameThree])
212 | 
213 |         #--- Define Derivatives
214 |         theMidDerivative = [round(stats.percentileofscore(mid[start[k]:k],mid[k])/10,0) for k in range(len(startStopFile))] # Deciles
215 |         theSpreadDerivative = [round(stats.percentileofscore(spread[start[k]:k],spread[k])/10,0) for k in range(len(startStopFile))] # Deciles
216 |         theVolumeSpreadDerivative = [round(stats.percentileofscore(volumeSpread[start[k]:k],volumeSpread[k])/10,0) for k in range(len(startStopFile))] # Deciles
217 | 
218 |         aa = pd.concat([aa,pd.Series(theMidDerivative),pd.Series(theSpreadDerivative),pd.Series(theVolumeSpreadDerivative)],axis=1)
219 | 
220 |     aa = pd.concat([startStopFile['timeStamp'],aa],axis=1)
221 |     newNames.insert(0,'timeStamp')
222 |     aa.columns = newNames
223 |     aa.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
224 | 
225 |     os.chdir(r'/home/arno/work/research/lobster/data')
226 | 
227 | 
228 | 
229 | #----- 2.4 Price differences
230 | for f in theOrderBookFiles:
231 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
232 |     theName = f[0:15] + '_priceDiff_' + str(nlevels) + '.csv'
233 | 
234 |     if (f[0:4] == 'INTC'):
235 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
236 | 
237 |     if (f[0:4] == 'TSLA'):
238 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
239 | 
240 |     dt = pd.read_csv(f, names = theNames)
241 | 
242 |     #--- Open Start & Stop file
243 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
244 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
245 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
246 |     start = np.array(startStopFile['start'])
247 |     stop = np.array(startStopFile['stop'])
248 | 
249 |     newNames = []
250 |     cc = pd.DataFrame()
251 | 
252 |     for i in range(1,nlevels):
253 |         # i = 1
254 |         nameOne = 'Ask Price '+ str(i)
255 |         nameTwo = 'Ask Price '+ str(i+1)
256 |         aa = np.array(abs(dt[nameTwo] - dt[nameOne])) # absolute price gap between consecutive ask levels
257 |         theAskDiffDerivative = [round(stats.percentileofscore(aa[start[k]:k],aa[k])/10,0) for k in range(len(startStopFile))] # Deciles
258 | 
259 |         nameThree = 'Bid Price '+ str(i)
260 |         nameFour = 'Bid Price '+ str(i+1)
261 |         bb = np.array(abs(dt[nameFour] - dt[nameThree])) # absolute price gap between consecutive bid levels
262 |         theBidDiffDerivative = [round(stats.percentileofscore(bb[start[k]:k],bb[k])/10,0) for k in range(len(startStopFile))] # Deciles
263 | 
264 |         newNames.extend(['AskPriceDiff '+ str(i),'BidPriceDiff '+ str(i)])
265 | 
266 |         cc = pd.concat([cc,pd.Series(theAskDiffDerivative),pd.Series(theBidDiffDerivative)],axis=1)
267 | 
268 |     cc = pd.concat([startStopFile['timeStamp'],cc],axis=1)
269 |     newNames.insert(0,'timeStamp')
270 |     cc.columns = newNames
271 | 
272 |     cc.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
273 | 
274 |     os.chdir(r'/home/arno/work/research/lobster/data')
275 | 
276 | 
277 | 
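# The same trailing-window decile ranking recurs in sections 2.2 to 2.8. A
# compact restatement of the pattern (a sketch for the reader; the sections
# below keep their original inline list comprehensions):
def rollingDecile(x, start):
    # decile of x[k] within the trailing 5-sec window x[start[k]:k]
    return [round(stats.percentileofscore(x[start[k]:k], x[k]) / 10, 0)
            for k in range(len(x))]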
278 | #----- 2.5 Mean Price and Volume
279 | for f in theOrderBookFiles:
280 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
281 |     theName = f[0:15] + '_meanPriceAndVolume.csv'
282 | 
283 |     if (f[0:4] == 'INTC'):
284 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
285 | 
286 |     if (f[0:4] == 'TSLA'):
287 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
288 | 
289 |     dt = pd.read_csv(f, names = theNames)
290 | 
291 |     #--- Open Start & Stop file
292 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
293 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
294 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
295 |     start = np.array(startStopFile['start'])
296 |     stop = np.array(startStopFile['stop'])
297 | 
298 |     for i in range(1,nlevels+1):
299 |         #i = 1
300 |         accAskName = 'Ask Price '+ str(i)
301 |         accBidName = 'Bid Price '+ str(i)
302 |         accAskVolName = 'Ask Size '+ str(i)
303 |         accBidVolName = 'Bid Size '+ str(i)
304 | 
305 |         if i == 1:
306 |             dtAccAsk = dt[accAskName]
307 |             dtAccBid = dt[accBidName]
308 |             dtAccAskVol = dt[accAskVolName]
309 |             dtAccBidVol = dt[accBidVolName]
310 | 
311 |         if i != 1:
312 |             dtAccAsk = dtAccAsk + dt[accAskName]
313 |             dtAccBid = dtAccBid + dt[accBidName]
314 |             dtAccAskVol = dtAccAskVol + dt[accAskVolName]
315 |             dtAccBidVol = dtAccBidVol + dt[accBidVolName]
316 | 
317 |     meanAsk = np.array(dtAccAsk/nlevels)
318 |     meanBid = np.array(dtAccBid/nlevels)
319 |     meanAskSize = np.array(dtAccAskVol/nlevels)
320 |     meanBidSize = np.array(dtAccBidVol/nlevels)
321 | 
322 |     aa = [round(stats.percentileofscore(meanAsk[start[k]:k],meanAsk[k])/10,0) for k in range(len(startStopFile))] # Deciles
323 |     bb = [round(stats.percentileofscore(meanBid[start[k]:k],meanBid[k])/10,0) for k in range(len(startStopFile))] # Deciles
324 |     cc = [round(stats.percentileofscore(meanAskSize[start[k]:k],meanAskSize[k])/10,0) for k in range(len(startStopFile))] # Deciles
325 |     dd = [round(stats.percentileofscore(meanBidSize[start[k]:k],meanBidSize[k])/10,0) for k in range(len(startStopFile))] # Deciles
326 | 
327 |     ee = pd.concat([pd.Series(startStopFile['timeStamp']),pd.Series(aa),pd.Series(bb),pd.Series(cc),pd.Series(dd)],axis=1)
328 |     ee.columns = ['timeStamp','averageAsk','averageBid','averageAskSize','averageBidSize']
329 | 
330 |     ee.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
331 | 
332 |     os.chdir(r'/home/arno/work/research/lobster/data')
333 | 
334 | 
335 | 
336 | #----- 2.6 Accumulated differences Price and Volume
337 | for f in theOrderBookFiles:
338 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
339 |     theName = f[0:15] + '_accDiffPriceAndVolume.csv'
340 | 
341 |     if (f[0:4] == 'INTC'):
342 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
343 | 
344 |     if (f[0:4] == 'TSLA'):
345 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
346 | 
347 |     dt = pd.read_csv(f, names = theNames)
348 | 
349 |     #--- Open Start & Stop file
350 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
351 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
352 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
353 |     start = startStopFile['start']
354 |     stop = startStopFile['stop']
355 | 
356 |     aa = pd.DataFrame()
357 | 
358 |     # i = 1
359 |     aa['AccPriceDiff'] = (dt ['Bid Price 1'] + dt ['Bid Price 2'] + dt ['Bid Price 3'] + dt ['Bid Price 4'] + dt ['Bid Price 5'] +
360 |                           dt ['Bid Price 6'] + dt ['Bid Price 7'] + dt ['Bid Price 8'] + dt ['Bid Price 9'] + dt ['Bid Price 10'] -
361 |                           dt ['Ask Price 1'] - dt ['Ask Price 2'] - dt ['Ask Price 3'] - dt ['Ask Price 4'] - dt ['Ask Price 5'] -
362 |                           dt ['Ask Price 6'] - dt ['Ask Price 7'] - dt ['Ask Price 8'] - dt ['Ask Price 9'] - dt ['Ask Price 10'])
363 | 
364 |     aa['AccSizeDiff'] = (dt ['Bid Size 1'] + dt ['Bid Size 2'] + dt ['Bid Size 3'] + dt ['Bid Size 4'] + dt ['Bid Size 5'] +
365 |                          dt ['Bid Size 6'] + dt ['Bid Size 7'] + dt ['Bid Size 8'] + dt ['Bid Size 9'] + dt ['Bid Size 10'] -
366 |                          dt ['Ask Size 1'] - dt ['Ask Size 2'] - dt ['Ask Size 3'] - dt ['Ask Size 4'] - dt ['Ask Size 5'] -
367 |                          dt ['Ask Size 6'] - dt ['Ask Size 7'] - dt ['Ask Size 8'] - dt ['Ask Size 9'] - dt ['Ask Size 10'])
368 | 
369 |     accPriceDiff = np.array(aa['AccPriceDiff'])
370 |     accSizeDiff = np.array(aa['AccSizeDiff'])
371 | 
372 |     bb = [round(stats.percentileofscore(accPriceDiff[start[k]:k],accPriceDiff[k])/10,0) for k in range(len(startStopFile))] # Deciles
373 |     cc = [round(stats.percentileofscore(accSizeDiff[start[k]:k],accSizeDiff[k])/10,0) for k in range(len(startStopFile))] # Deciles
374 | 
375 |     accDiffPriceAndVolume = pd.concat([pd.Series(startStopFile['timeStamp']),pd.Series(bb),pd.Series(cc)],axis=1)
376 |     accDiffPriceAndVolume.columns = ['timeStamp','accPriceDiff','accSizeDiff']
377 | 
378 |     accDiffPriceAndVolume.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
379 | 
380 |     os.chdir(r'/home/arno/work/research/lobster/data')
381 | 
382 | 
383 | 
384 | #----- 2.7 Price and Volume Derivatives
385 | for f in theOrderBookFiles:
386 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
387 | 
388 |     theName = f[0:15] + '_priceAndVolumeDerivatives.csv'
389 | 
390 |     if (f[0:4] == 'INTC'):
391 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
392 | 
393 |     if (f[0:4] == 'TSLA'):
394 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
395 | 
396 |     #--- The message file
397 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
398 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
399 |     dfTimeStamp = mf['timeStamp']
400 |     timeStampInSeconds = dfTimeStamp.round(0)
401 |     maxLookBack = max(timeStampInSeconds.value_counts()) + 5
402 |     timeStamp = mf['timeStamp'].to_frame().values
403 | 
404 |     #--- The order book file
405 |     theDataFile = f[0:15] + '_34200000_57600000_orderbook_10.csv'
406 |     df = pd.read_csv(theDataFile, names = theNames)
407 |     dfArray = df.values
408 | 
409 |     #--- Open Start & Stop file
410 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
411 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
412 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
413 |     start = np.array(startStopFile['start'])
414 |     stop = np.array(startStopFile['stop'])
415 | 
416 |     begin = []
417 | 
418 |     # start_time = time.time() # used only for profiling
419 |     for i in range(len(timeStamp)):
420 |         # i = 223200
421 |         if i == 0:
422 |             begin.append(0)
423 | 
424 |         elif i < maxLookBack:
425 |             theIndexValue = abs(timeStamp[:i,0] - (timeStamp[i,0] - 1)).argmin()
426 |             begin.append(theIndexValue)
427 | 
428 |         elif i >= maxLookBack:
429 |             a = i - maxLookBack
430 |             bb = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
431 |             theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] - 1)).argmin(),0]
432 |             begin.append(int(theIndexValue))
433 | 
434 |     end = np.array(range(len(timeStamp)))
384 | #----- 2.7 Price and Volume Derivatives
385 | for f in theOrderBookFiles:
386 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
387 | 
388 |     theName = f[0:15] + '_priceAndVolumeDerivatives.csv'
389 | 
390 |     if (f[0:4] == 'INTC'):
391 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
392 | 
393 |     if (f[0:4] == 'TSLA'):
394 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
395 | 
396 |     #--- The message file
397 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
398 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
399 |     dfTimeStamp = mf['timeStamp']
400 |     timeStampInSeconds = dfTimeStamp.round(0)
401 |     maxLookBack = max(timeStampInSeconds.value_counts()) + 5
402 |     timeStamp = mf['timeStamp'].to_frame().values
403 | 
404 |     #--- The order book file
405 |     theDataFile = f[0:15] + '_34200000_57600000_orderbook_10.csv'
406 |     df = pd.read_csv(theDataFile, names = theNames)
407 |     dfArray = df.values
408 | 
409 |     #--- Open Start & Stop file
410 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
411 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
412 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
413 |     start = np.array(startStopFile['start'])
414 |     stop = np.array(startStopFile['stop'])
415 | 
416 |     begin = []
417 | 
418 |     # start_time = time.time() # used only for profiling
419 |     for i in range(len(timeStamp)):
420 |         # i = 223200
421 |         if i == 0:
422 |             begin.append(0)
423 | 
424 |         elif i < maxLookBack:
425 |             theIndexValue = abs(timeStamp[:i,0] - (timeStamp[i,0] - 1)).argmin()
426 |             begin.append(theIndexValue)
427 | 
428 |         elif i >= maxLookBack:
429 |             a = i - maxLookBack
430 |             bb = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
431 |             theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] - 1)).argmin(),0]
432 |             begin.append(int(theIndexValue))
433 | 
434 |     end = np.array(range(len(timeStamp)))
435 |     begin = np.array(begin)
436 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # used only for profiling
437 | 
438 |     theBigArray = np.empty([len(end), df.shape[1]])
439 |     for i in range(df.shape[1]):
440 |         # i = 0
441 |         # j = 0
442 |         aa = np.array([(dfArray[end[j],i]/dfArray[begin[j],i] - 1) for j in range(len(end))]) # 1 sec. derivative of column i
443 |         bb = [round(stats.percentileofscore(aa[start[k]:k],aa[k])/10,0) for k in range(len(startStopFile))] # Deciles of the derivative
444 |         theBigArray[:,i] = np.array(bb)
445 | 
446 |     theBigArray = pd.DataFrame(theBigArray)
447 |     theBigArray.columns = theNames
448 | 
449 |     theBigArray.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
450 | 
451 |     os.chdir(r'/home/arno/work/research/lobster/data')
452 | 
453 | 
454 | 
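# =============================================================================
# The begin-index search in section 2.7 above scans up to maxLookBack rows for
# every message, and the same pattern reappears in section 2.8 below. Because
# the timestamps are sorted, np.searchsorted can locate the window start in
# O(log n) per row instead. A sketch; oneSecondStarts is a hypothetical name,
# and it returns the first row with a timestamp >= t - window rather than the
# row closest to t - window, so it can differ from the loops by one index.
# =============================================================================
import numpy as np

def oneSecondStarts(ts, window = 1):
    # ts: 1-d sorted array of message timestamps (in seconds)
    return np.searchsorted(ts, ts - window, side = 'left')

# e.g. begin = oneSecondStarts(timeStamp[:,0]) would replace the loop above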
455 | #----- 2.8 Average intensity (1 sec.)
456 | #--- Only the following event types are counted (see the LOBSTER page for the exact definitions): 1 (submission of a new limit order), 2 (partial cancellation) and 3 (total deletion of a limit order)
457 | for f in theMessagesFiles:
458 |     #f = 'INTC_2015-01-02_34200000_57600000_message_10.csv'
459 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
460 | 
461 |     theName = f[0:15] + '_tradeIntensity.csv'
462 | 
463 |     if (f[0:4] == 'INTC'):
464 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
465 | 
466 |     if (f[0:4] == 'TSLA'):
467 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
468 | 
469 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
470 | 
471 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
472 |     dfTimeStamp = mf['timeStamp']
473 |     timeStampInSeconds = dfTimeStamp.round(0) # used to define max lookback period
474 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:2]) # used to define max lookback period
475 |     timeStamp = mf['timeStamp'].to_frame().values
476 |     mf = mf.values
477 | 
478 |     #--- Open Start & Stop file
479 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
480 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
481 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
482 |     start = np.array(startStopFile['start'])
483 |     stop = np.array(startStopFile['stop'])
484 | 
485 |     eventsCount = np.empty((len(mf),3))
486 | 
487 |     # start_time = time.time() # For profiling only
488 |     for i in range(len(timeStamp)):
489 |         if i == 0:
490 |             eventsCount[0,0] = 0
491 |             eventsCount[0,1] = 0
492 |             eventsCount[0,2] = 0
493 | 
494 |         elif i < maxLookBack:
495 |             startIndexValue = abs(timeStamp[:i,0] - (timeStamp[i,0] - 1)).argmin()
496 |             mfSmall = mf[startIndexValue:i,:]
497 |             eventsCount[i,0] = np.count_nonzero(mfSmall[:,1] == 1)
498 |             eventsCount[i,1] = np.count_nonzero(mfSmall[:,1] == 2)
499 |             eventsCount[i,2] = np.count_nonzero(mfSmall[:,1] == 3)
500 | 
501 |         elif i >= maxLookBack:
502 |             a = i - maxLookBack
503 |             bb = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
504 |             startIndexValue = int(bb[abs(bb[:,1] - (timeStamp[i,0] - 1)).argmin(),0])
505 |             mfSmall = mf[startIndexValue:i,:]
506 |             eventsCount[i,0] = np.count_nonzero(mfSmall[:,1] == 1)
507 |             eventsCount[i,1] = np.count_nonzero(mfSmall[:,1] == 2)
508 |             eventsCount[i,2] = np.count_nonzero(mfSmall[:,1] == 3)
509 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
510 | 
511 |     cc = [round(stats.percentileofscore(eventsCount[start[k]:k,0],eventsCount[k,0])/10,0) for k in range(len(startStopFile))] # Deciles
512 |     dd = [round(stats.percentileofscore(eventsCount[start[k]:k,1],eventsCount[k,1])/10,0) for k in range(len(startStopFile))] # Deciles
513 |     ee = [round(stats.percentileofscore(eventsCount[start[k]:k,2],eventsCount[k,2])/10,0) for k in range(len(startStopFile))] # Deciles
514 | 
515 |     ff = pd.concat([dfTimeStamp,pd.Series(cc),pd.Series(dd),pd.Series(ee),pd.DataFrame(eventsCount)],axis=1)
516 |     ff.columns = ['timeStamp','decEventType_1','decEventType_2','decEventType_3','levelEventType_1','levelEventType_2','levelEventType_3']
517 | 
518 |     ff.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
519 | 
520 |     os.chdir(r'/home/arno/work/research/lobster/data')
521 | 
522 | 
523 | 
524 | #----- 2.9 Relative intensity - step 1 - (10 sec.)
525 | for f in theMessagesFiles:
526 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
527 | 
528 |     theName = f[0:15] + '_relativeIntensity10s.csv'
529 | 
530 |     if (f[0:4] == 'INTC'):
531 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
532 | 
533 |     if (f[0:4] == 'TSLA'):
534 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
535 | 
536 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
537 | 
538 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
539 |     dfTimeStamp = mf['timeStamp']
540 |     timeStampInSeconds = dfTimeStamp.round(0)
541 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:11])
542 |     timeStampInSeconds = timeStampInSeconds.values
543 |     timeStamp = mf['timeStamp'].to_frame().values
544 |     mf = mf.values
545 | 
546 |     eventsCount10s = np.empty((len(mf),3))
547 | 
548 |     # start_time = time.time() # For profiling only
549 |     for i in range(len(timeStamp)):
550 |         if (i == 0):
551 |             eventsCount10s[i,0] = 0
552 |             eventsCount10s[i,1] = 0
553 |             eventsCount10s[i,2] = 0
554 | 
555 |         elif (i <= maxLookBack): # close to the open: count from the start of the session
556 |             mfSmall = mf[:i,1]
557 |             eventsCount10s[i,0] = np.count_nonzero(mfSmall == 1)
558 |             eventsCount10s[i,1] = np.count_nonzero(mfSmall == 2)
559 |             eventsCount10s[i,2] = np.count_nonzero(mfSmall == 3)
560 | 
561 |         elif (i > maxLookBack):
562 |             a = i - maxLookBack
563 |             aa = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
564 |             startIndexValue = int(aa[abs(aa[:,1] - (timeStamp[i,0] - 10)).argmin(),0])
565 |             mfSmall = mf[startIndexValue:i,1]
566 |             eventsCount10s[i,0] = np.count_nonzero(mfSmall == 1)
567 |             eventsCount10s[i,1] = np.count_nonzero(mfSmall == 2)
568 |             eventsCount10s[i,2] = np.count_nonzero(mfSmall == 3)
569 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
570 | 
571 |     eventsCount10s = pd.DataFrame(eventsCount10s)
572 |     eventsCount10s.columns = ['EventType_1','EventType_2','EventType_3']
573 | 
574 |     # Add time stamp
575 |     eventsCount10s = pd.concat([dfTimeStamp,eventsCount10s],axis=1)
576 |     eventsCount10s.columns = ['timeStamp','EventType_1','EventType_2','EventType_3']
577 |     eventsCount10s.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
578 | 
579 |     os.chdir(r'/home/arno/work/research/lobster/data')
580 | 
581 | 
582 | 
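# =============================================================================
# Section 2.9 above and section 2.10 below are identical up to the window
# length (10 sec. vs 900 sec.). A single parameterised sketch of both steps,
# using the same searchsorted idea as above; countEventsOverWindow is a
# hypothetical name and it assumes ts is sorted.
# =============================================================================
import numpy as np

def countEventsOverWindow(mf, ts, window):
    # mf: message array (column 1 = event type); ts: sorted timestamps;
    # window: lookback in seconds. Per-row counts of event types 1, 2 and 3.
    counts = np.zeros((len(ts), 3))
    starts = np.searchsorted(ts, ts - window, side = 'left')
    for i in range(1, len(ts)):
        small = mf[starts[i]:i, 1]
        for j, eventType in enumerate([1, 2, 3]):
            counts[i, j] = np.count_nonzero(small == eventType)
    return counts

# e.g. eventsCount10s = countEventsOverWindow(mf, timeStamp[:,0], 10)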
583 | #----- 2.10 Relative intensity - step 2 - (900 sec.)
584 | for f in theMessagesFiles:
585 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
586 | 
587 |     theName = f[0:15] + '_relativeIntensity900s.csv'
588 | 
589 |     if (f[0:4] == 'INTC'):
590 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
591 | 
592 |     if (f[0:4] == 'TSLA'):
593 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
594 | 
595 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
596 | 
597 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
598 |     dfTimeStamp = mf['timeStamp']
599 |     timeStampInSeconds = dfTimeStamp.round(0)
600 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:901])
601 |     timeStampInSeconds = timeStampInSeconds.values
602 |     timeStamp = mf['timeStamp'].to_frame().values
603 |     mf = mf.values
604 | 
605 |     eventsCount900s = np.empty((len(mf),3))
606 | 
607 |     # start_time = time.time() # For profiling only
608 |     for i in range(len(timeStamp)):
609 |         if (i == 0):
610 |             eventsCount900s[i,0] = 0
611 |             eventsCount900s[i,1] = 0
612 |             eventsCount900s[i,2] = 0
613 | 
614 |         elif (i <= maxLookBack): # close to the open: count from the start of the session
615 |             mfSmall = mf[:i,1]
616 |             eventsCount900s[i,0] = np.count_nonzero(mfSmall == 1)
617 |             eventsCount900s[i,1] = np.count_nonzero(mfSmall == 2)
618 |             eventsCount900s[i,2] = np.count_nonzero(mfSmall == 3)
619 | 
620 |         elif (i > maxLookBack):
621 |             a = i - maxLookBack
622 |             aa = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
623 |             startIndexValue = int(aa[abs(aa[:,1] - (timeStamp[i,0] - 900)).argmin(),0])
624 |             mfSmall = mf[startIndexValue:i,1]
625 |             eventsCount900s[i,0] = np.count_nonzero(mfSmall == 1)
626 |             eventsCount900s[i,1] = np.count_nonzero(mfSmall == 2)
627 |             eventsCount900s[i,2] = np.count_nonzero(mfSmall == 3)
628 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
629 | 
630 |     eventsCount900s = pd.DataFrame(eventsCount900s)
631 |     eventsCount900s.columns = ['EventType_1','EventType_2','EventType_3']
632 | 
633 |     # Add time stamp
634 |     eventsCount900s = pd.concat([dfTimeStamp,eventsCount900s],axis=1)
635 |     eventsCount900s.columns = ['timeStamp','EventType_1','EventType_2','EventType_3']
636 |     eventsCount900s.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
637 | 
638 |     os.chdir(r'/home/arno/work/research/lobster/data')
639 | 
640 | 
641 | 
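# =============================================================================
# Section 2.11 below divides the 10 sec. counts by the 900 sec. counts. Early
# in the session a 900 sec. count can be zero; numpy turns 0/0 into NaN and
# x/0 into inf instead of raising, so the scalar try/except in useful.py's
# divisionByZero never fires on arrays. An array-safe sketch, defaulting to 1
# as divisionByZero does; safeRatio is a hypothetical name.
# =============================================================================
import numpy as np

def safeRatio(num, den):
    # Element-wise num/den, returning 1 wherever den is zero
    num = np.asarray(num, dtype = float)
    den = np.asarray(den, dtype = float)
    return np.where(den == 0, 1.0, num/np.where(den == 0, 1.0, den))

# e.g. relInt1 = safeRatio(the10sFile['EventType_1'], the900sFile['EventType_1'])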
642 | #----- 2.11 Relative intensity - step 3 - Put it all together
643 | os.chdir(r'/home/arno/work/research/lobster/data/features/')
644 | allFiles = os.listdir()
645 | 
646 | files900s = [sl for sl in allFiles if "relativeIntensity900s" in sl and 'TSLA' in sl] # TSLA files only
647 | files900s.sort()
648 | 
649 | files10s = [sk for sk in allFiles if "relativeIntensity10s" in sk and 'TSLA' in sk]
650 | files10s.sort()
651 | 
652 | for i in range(len(files10s)):
653 |     print(files10s[i] + ' *** ' + datetime.now().strftime("%H:%M:%S"))
654 |     theName = files10s[i][0:15] + '_relativeIntensity.csv'
655 | 
656 |     the10sFile = pd.read_csv(files10s[i], float_precision='round_trip')
657 |     the900sFile = pd.read_csv(files900s[i], float_precision='round_trip')
658 | 
659 |     #--- Open Start & Stop file
660 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
661 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == files10s[i][:15]))]
662 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
663 |     start = np.array(startStopFile['start'])
664 |     stop = np.array(startStopFile['stop'])
665 | 
666 |     relInt1 = np.array(the10sFile['EventType_1']/the900sFile['EventType_1']) # 10 sec. count relative to 900 sec. count; a zero 900 sec. count yields NaN/inf here
667 |     relInt2 = np.array(the10sFile['EventType_2']/the900sFile['EventType_2'])
668 |     relInt3 = np.array(the10sFile['EventType_3']/the900sFile['EventType_3'])
669 | 
670 |     aa = [round(stats.percentileofscore(relInt1[start[k]:k],relInt1[k])/10,0) for k in range(len(startStopFile))] # Deciles
671 |     bb = [round(stats.percentileofscore(relInt2[start[k]:k],relInt2[k])/10,0) for k in range(len(startStopFile))] # Deciles
672 |     cc = [round(stats.percentileofscore(relInt3[start[k]:k],relInt3[k])/10,0) for k in range(len(startStopFile))] # Deciles
673 | 
674 |     relInt = pd.concat([the10sFile['timeStamp'],pd.Series(aa),pd.Series(bb),pd.Series(cc)], axis = 1)
675 |     relInt.columns = ['timeStamp','relIntEventType_1','relIntEventType_2','relIntEventType_3']
676 | 
677 |     relInt.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
678 | 
679 | 
680 | 
--------------------------------------------------------------------------------