├── License.md
├── README.md
├── TSLA_2015-01-07_34200000_57600000_message_10_SAMPLE.csv
├── useful.py
├── rf-labels.py
├── TSLA_2015-01-07_34200000_57600000_orderbook_10_SAMPLE.csv
├── rf-calibration.py
└── rf-features.py

/License.md:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2020 Arnaud Amsellem
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Using random forest to model limit order book dynamics
2 | I use the random forest algorithm to forecast mid-price dynamics over a short time horizon, i.e. a few seconds ahead. This is of particular interest to market makers, who can skew their bid/ask spread in the direction of the most favorable outcome. Most if not all of the literature on the topic focuses on applying algorithms straight out of the box to produce a forecast at any point in time. The problem in a real-life environment is different. A market maker can quote a standard bid/ask spread most of the time and, only when she/he has a statistical edge, skew the spread in the direction given by the model. This is what I try to do here: creating a forecast only when a statistical edge exists.
3 | 
4 | I used the Python scikit-learn implementation of random forest. This GitHub repo contains the code, some sample data and the associated explanations (code comments). This repo is also associated with a post on my [blog](https://www.thertrader.com/). I'm happy for anyone to re-use my work as long as proper reference to it is made.
5 | 
6 | I want to thank [LOBSTER](https://lobsterdata.com/) for providing the dataset used here.
7 | 
8 | ## Code structure
9 | 
10 | The code is organised around 4 files. Each file contains a detailed explanation of what is done, along with some commented lines of code.
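A typical run order, assuming the hard-coded LOBSTER data layout under `/home/arno/work/research/lobster/data/` used in the scripts has been adapted to your machine (labels and features must both exist before calibration):

```
python3 rf-labels.py       # build the Y labels (10 and 20 seconds ahead)
python3 rf-features.py     # build the X features
python3 rf-calibration.py  # calibrate and test the random forest
```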
11 | 
12 | **rf-labels.py**: Definition of the labels used in the Random Forest model
13 | 
14 | **rf-features.py**: Definition of the features used in the Random Forest model
15 | 
16 | **rf-calibration.py**: Calibration and test of the Random Forest model
17 | 
18 | **useful.py**: Various useful functions used in the other 3 files (work in progress)
19 | 
20 | 
--------------------------------------------------------------------------------
/TSLA_2015-01-07_34200000_57600000_message_10_SAMPLE.csv:
--------------------------------------------------------------------------------
1 | 34200.007780658,4,8574201,23,2134000,-1
2 | 34200.332976723,1,8770149,100,2126400,1
3 | 34200.492462611,6,-1,19725,2133500,-1
4 | 34200.492462611,1,6756607,1752,2133500,1
5 | 34200.492462611,1,3146453,15,2135000,-1
6 | 34200.492462611,1,7252378,5,2132600,1
7 | 34200.492462611,1,5502065,1,2135000,-1
8 | 34200.492462611,1,6252826,6,2132500,1
9 | 34200.492462611,1,7200188,350,2137000,-1
10 | 34200.492462611,1,4151867,47,2131000,1
11 | 34200.492462611,1,3671133,50,2130000,1
12 | 34200.492462611,1,4015142,2,2130000,1
13 | 34200.492462611,1,4490967,8,2130000,1
14 | 34200.492462611,1,5203157,5,2130000,1
15 | 34200.492462611,1,5539893,60,2130000,1
16 | 34200.492462611,1,6823462,25,2130000,1
17 | 34200.492462611,1,7087997,14,2130000,1
18 | 34200.492462611,1,7323188,200,2130000,1
19 | 34200.492462611,1,2703149,6,2129000,1
20 | 34200.495578064,4,8574201,200,2134000,-1
21 | 34200.496060008,4,8574201,77,2134000,-1
22 | 34200.496149804,1,8814916,100,2133600,1
23 | 34200.496400481,1,8814964,200,2136900,-1
24 | 34200.496950332,1,8815112,200,2133700,1
25 | 34200.497854428,3,8814964,200,2136900,-1
26 | 34200.546253717,3,8814916,100,2133600,1
27 | 34200.559199502,1,8829525,78,2134000,-1
28 | 34200.580826018,4,8829525,78,2134000,-1
29 | 34200.930944949,1,8895561,3,2133700,1
30 | 34201.013532726,1,8908510,100,2135400,-1
31 | 34201.014941014,5,0,100,2133800,1
32 | 34201.149417591,3,8574203,200,2138700,-1
33 | 34201.220091181,5,0,14,2134600,1
34 | 34201.279265571,1,8947077,40,2130000,1
35 | 34201.62417489,5,0,25,2134600,1
36 | 34201.624181025,5,0,25,2134600,1
37 | 34201.624186296,5,0,20,2134600,1
38 | 34201.624446938,4,8815112,100,2133700,1
39 | 34201.624881166,1,8999485,100,2134600,-1
40 | 34201.625069898,3,8815112,100,2133700,1
41 | 34201.897971506,4,8895561,3,2133700,1
42 | 34201.898022274,1,9048906,100,2134400,-1
43 | 34201.898079405,3,8999485,100,2134600,-1
44 | 34201.920290897,4,6756607,900,2133500,1
45 | 34201.920340111,4,6756607,852,2133500,1
46 | 34201.920727407,1,9052854,400,2133500,-1
47 | 34201.948076997,3,9048906,100,2134400,-1
48 | 34201.954050118,1,9058469,100,2135300,-1
49 | 34201.962694145,1,9060027,100,2133300,-1
50 | 34201.96510169,3,9060027,100,2133300,-1
51 | 
--------------------------------------------------------------------------------
/useful.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | 
4 | """
5 | Various useful functions (work in progress)
6 | 
7 | @author: arno
8 | @Date: Mar 2020
9 | """
10 | import numpy as np
11 | 
12 | 
13 | # =============================================================================
14 | # Enhanced division by zero
15 | # Used for the Random Forest study to calculate trade intensity acceleration
16 | # =============================================================================
17 | def divisionByZero(x,y):
18 |     try:
19 |         return x/y
20 |     except ZeroDivisionError:
21 |         return 1
22 | 
23 | 
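# A quick usage sketch (hypothetical numbers): the trade intensity
# acceleration features divide two event counts, and an empty reference
# window would otherwise raise ZeroDivisionError.
#     divisionByZero(12, 4)  ->  3.0
#     divisionByZero(12, 0)  ->  1   (neutral ratio instead of an exception)
# NB: only scalar divisions raise ZeroDivisionError; NumPy arrays return
# inf/nan with a runtime warning instead, so this helper is meant for scalars.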
24 | # =============================================================================
25 | # Extract decision rules from Random Forest
26 | # Adapted from: https://stackoverflow.com/questions/50600290/how-extraction-decision-rules-of-random-forest-in-python
27 | # =============================================================================
28 | def getDecisionRules(rf):
29 |     decisionRules = []
30 |     for tree_idx, est in enumerate(rf.estimators_):
31 |         tree = est.tree_
32 |         assert tree.value.shape[1] == 1 # no support for multi-output
33 | 
34 |         # print('TREE: {}'.format(tree_idx))
35 |         # decisionRules.append('TREE: {}'.format(tree_idx))
36 | 
37 |         iterator = enumerate(zip(tree.children_left, tree.children_right, tree.feature, tree.threshold, tree.value))
38 |         for node_idx, data in iterator:
39 |             left, right, feature, th, value = data
40 | 
41 |             # left: index of left child (if any)
42 |             # right: index of right child (if any)
43 |             # feature: index of the feature to check
44 |             # th: the threshold to compare against
45 |             # value: values associated with classes
46 | 
47 |             # for a classifier, value[0] holds the per-class sample counts, so argmax is the class the node predicts
48 |             class_idx = np.argmax(value[0])
49 | 
50 |             if left == -1 and right == -1:
51 |                 # print('{} LEAF: return class={}'.format(node_idx, class_idx))
52 |                 decisionRules.append('{} LEAF: return class={}'.format(node_idx, class_idx))
53 |             else:
54 |                 # print('{} NODE: if feature[{}] < {} then next={} else next={}'.format(node_idx, feature, th, left, right))
55 |                 decisionRules.append('{} NODE: if feature[{}] < {} then next={} else next={}'.format(node_idx, feature, th, left, right))
56 | 
57 |     return decisionRules
58 | 
59 | 
60 | 
61 | 
62 | 
63 | 
64 | 
65 | 
66 | 
--------------------------------------------------------------------------------
/rf-labels.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | 
4 | """
5 | Definition of the labels used in the Random Forest model
6 | 
7 | Labels are defined as follows:
8 |     -1: Downward movement
9 |      0: Stationary
10 |     +1: Upward movement
11 | 
12 | The labels are defined over 2 time horizons:
13 |     - 10 seconds ahead
14 |     - 20 seconds ahead
15 | 
16 | @author: thertrader@gmail.com
17 | @Date: Mar 2020
18 | """
19 | 
20 | import pandas as pd
21 | import os
22 | import numpy as np
23 | from datetime import datetime
24 | 
25 | 
26 | # =============================================================================
27 | # 1 - BASIC PARAMETERS
28 | # =============================================================================
29 | os.chdir(r'/home/arno/work/research/lobster/data')
30 | 
31 | nlevels = 10
32 | 
33 | col = ['Ask Price ','Ask Size ','Bid Price ','Bid Size ']
34 | 
35 | theNames = []
36 | for i in range(1, nlevels + 1):
37 |     for j in col:
38 |         theNames.append(str(j) + str(i))
39 | 
40 | tickers = ['TSLA']
41 | 
42 | #----- Lists of messages files and order book files
43 | theFiles = []
44 | theOrderBookFiles = []
45 | theMessagesFiles = []
46 | 
47 | for tk in tickers:
48 |     # tk = 'INTC'
49 |     os.chdir(r"/home/arno/work/research/lobster/data/" + tk)
50 |     theFiles.extend(sorted(os.listdir()))
51 |     theOrderBookFiles.extend([sl for sl in theFiles if "_orderbook" in sl])
52 |     theMessagesFiles.extend([sk for sk in theFiles if "_message" in sk])
53 |     theFiles = []
54 | 
55 | 
56 | # =============================================================================
57 | # 2 - Y LABELS FUNCTION
58 | # =============================================================================
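# What labels() below computes, as implemented (h = fwdTimeLength, 10 or 20 sec):
#     midRtn(t) = mid(t) / mid(t + h) - 1
#     label(t)  = sign(midRtn(t)), i.e. +1, -1 or 0
# where mid() is the level-1 mid price, (Ask Price 1 + Bid Price 1) / 2, and
# start[j] indexes the order book row closest to h seconds ahead of row j.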
59 | # outputFileName = '_Y10Sec.csv'
60 | # fwdTimeLength = 10
61 | # f = 'TSLA_2015-01-02_34200000_57600000_orderbook_10.csv'
62 | os.chdir(r'/home/arno/work/research/lobster/data')
63 | def labels(outputFileName, fwdTimeLength):
64 | 
65 |     for f in theOrderBookFiles:
66 |         print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
67 | 
68 |         theName = f[0:15] + outputFileName
69 | 
70 |         if (f[0:4] == 'TSLA'):
71 |             os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
72 | 
73 |         theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
74 | 
75 |         #---- Use NumPy arrays, a loop and a reduced array size to speed things up considerably. This is how it's done:
76 |         mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
77 |         dfTimeStamp = mf['timeStamp']
78 |         timeStampInSeconds = dfTimeStamp.round(0) # used to define the max look-ahead window
79 |         maxLookAhead = sum(timeStampInSeconds.value_counts().iloc[:(fwdTimeLength + 2)]) # max number of rows within fwdTimeLength seconds; the + 2 keeps an extra second as a safety margin
80 |         timeStamp = mf['timeStamp'].to_frame().values
81 | 
82 |         start = []
83 | 
84 |         # start_time = time.time() # For profiling only
85 |         for i in range(len(timeStamp)):
86 |             if i < (len(timeStamp) - maxLookAhead):
87 |                 a = i + maxLookAhead
88 |                 bb = np.column_stack((np.array(dfTimeStamp.iloc[i:a].index),timeStamp[i:a,0]))
89 |                 theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] + fwdTimeLength)).argmin(),0]
90 |                 start.append(int(theIndexValue))
91 | 
92 |             elif i >= (len(timeStamp) - maxLookAhead):
93 |                 bb = np.column_stack((np.array(dfTimeStamp.iloc[i:].index),timeStamp[i:,0]))
94 |                 theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] + fwdTimeLength)).argmin(),0]
95 |                 start.append(int(theIndexValue))
96 | 
97 |         stop = list(range(len(timeStamp)))
98 |         # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
99 | 
100 |         theDataFile = f[0:15] + '_34200000_57600000_orderbook_10.csv'
101 |         df = pd.read_csv(theDataFile, names = theNames)
102 |         dfArray = df.values
103 | 
104 |         midRtn = np.array([(((dfArray[stop[j],0] + dfArray[stop[j],2])/2) / ((dfArray[start[j],0] + dfArray[start[j],2])/2)) - 1 for j in range(len(timeStamp))]) # columns 0 and 2 are Ask Price 1 and Bid Price 1
105 |         midRtn = np.where(midRtn > 0,1,np.where(midRtn < 0,-1,0))
106 |         dfToExport = pd.concat([pd.Series(timeStamp[:,0]),pd.Series(midRtn)],axis=1)
107 |         dfToExport.columns = ['timeStamp', 'label']
108 | 
109 |         dfToExport.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
110 | 
111 |         del dfToExport,midRtn,mf
112 | 
113 | # =============================================================================
114 | # 3 - RUN LABELS FUNCTION
115 | # =============================================================================
116 | labels(outputFileName = '_Y10Sec.csv', fwdTimeLength = 10)
117 | labels(outputFileName = '_Y20Sec.csv', fwdTimeLength = 20)
118 | 
119 | 
120 | 
121 | 
122 | 
123 | 
124 | 
125 | 
126 | 
--------------------------------------------------------------------------------
/TSLA_2015-01-07_34200000_57600000_orderbook_10_SAMPLE.csv:
--------------------------------------------------------------------------------
1 | 2134000,277,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2125300,300,2138800,100,2124800,80,2139400,50,2120000,30,2139900,100,2113200,100,2140000,350,2100000,17
2 | 
2134000,277,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80,2139900,100,2120000,30,2140000,350,2113200,100 3 | 2134000,277,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80,2139900,100,2120000,30,2140000,350,2113200,100 4 | 2134000,277,2133500,1752,2136400,53,2132500,19,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80,2140000,350,2120000,30 5 | 2134000,277,2133500,1752,2135000,15,2132500,19,2136400,53,2131500,100,2136900,50,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80,2139900,100,2120000,30 6 | 2134000,277,2133500,1752,2135000,15,2132600,5,2136400,53,2132500,19,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80 7 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,19,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80 8 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300,2139900,100,2124800,80 9 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2130000,15,2137900,50,2128400,165,2138000,1,2126800,100,2138700,200,2126400,100,2138800,100,2125300,300,2139400,50,2124800,80 10 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,15,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 11 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,65,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 12 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,67,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 13 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,75,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 14 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,80,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 15 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,140,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 16 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,165,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 17 | 
2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,179,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 18 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,379,2138000,1,2128400,165,2138700,200,2126800,100,2138800,100,2126400,100,2139400,50,2125300,300 19 | 2134000,277,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,379,2138000,1,2129000,6,2138700,200,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 20 | 2134000,77,2133500,1752,2135000,16,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,379,2138000,1,2129000,6,2138700,200,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 21 | 2135000,16,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100,2139800,105,2126400,100 22 | 2135000,16,2133600,100,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 23 | 2135000,16,2133600,100,2136400,53,2133500,1752,2136900,250,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 24 | 2135000,16,2133700,200,2136400,53,2133600,100,2136900,250,2133500,1752,2137000,350,2132600,5,2137900,50,2132500,25,2138000,1,2131500,100,2138700,200,2131000,47,2138800,100,2130000,379,2139400,50,2129000,6,2139800,105,2128400,165 25 | 2135000,16,2133700,200,2136400,53,2133600,100,2136900,50,2133500,1752,2137000,350,2132600,5,2137900,50,2132500,25,2138000,1,2131500,100,2138700,200,2131000,47,2138800,100,2130000,379,2139400,50,2129000,6,2139800,105,2128400,165 26 | 2135000,16,2133700,200,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 27 | 2134000,78,2133700,200,2135000,16,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 28 | 2135000,16,2133700,200,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 29 | 2135000,16,2133700,203,2136400,53,2133500,1752,2136900,50,2132600,5,2137000,350,2132500,25,2137900,50,2131500,100,2138000,1,2131000,47,2138700,200,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 30 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 31 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138700,200,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 32 | 
2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 33 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,379,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 34 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 35 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 36 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 37 | 2135000,16,2133700,203,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 38 | 2135000,16,2133700,103,2135400,100,2133500,1752,2136400,53,2132600,5,2136900,50,2132500,25,2137000,350,2131500,100,2137900,50,2131000,47,2138000,1,2130000,419,2138800,100,2129000,6,2139400,50,2128400,165,2139800,105,2126800,100 39 | 2134600,100,2133700,103,2135000,16,2133500,1752,2135400,100,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,419,2138000,1,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 40 | 2134600,100,2133700,3,2135000,16,2133500,1752,2135400,100,2132600,5,2136400,53,2132500,25,2136900,50,2131500,100,2137000,350,2131000,47,2137900,50,2130000,419,2138000,1,2129000,6,2138800,100,2128400,165,2139400,50,2126800,100 41 | 2134600,100,2133500,1752,2135000,16,2132600,5,2135400,100,2132500,25,2136400,53,2131500,100,2136900,50,2131000,47,2137000,350,2130000,419,2137900,50,2129000,6,2138000,1,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 42 | 2134400,100,2133500,1752,2134600,100,2132600,5,2135000,16,2132500,25,2135400,100,2131500,100,2136400,53,2131000,47,2136900,50,2130000,419,2137000,350,2129000,6,2137900,50,2128400,165,2138000,1,2126800,100,2138800,100,2126400,100 43 | 2134400,100,2133500,1752,2135000,16,2132600,5,2135400,100,2132500,25,2136400,53,2131500,100,2136900,50,2131000,47,2137000,350,2130000,419,2137900,50,2129000,6,2138000,1,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 44 | 2134400,100,2133500,852,2135000,16,2132600,5,2135400,100,2132500,25,2136400,53,2131500,100,2136900,50,2131000,47,2137000,350,2130000,419,2137900,50,2129000,6,2138000,1,2128400,165,2138800,100,2126800,100,2139400,50,2126400,100 45 | 2134400,100,2132600,5,2135000,16,2132500,25,2135400,100,2131500,100,2136400,53,2131000,47,2136900,50,2130000,419,2137000,350,2129000,6,2137900,50,2128400,165,2138000,1,2126800,100,2138800,100,2126400,100,2139400,50,2124800,80 46 | 2133500,400,2132600,5,2134400,100,2132500,25,2135000,16,2131500,100,2135400,100,2131000,47,2136400,53,2130000,419,2136900,50,2129000,6,2137000,350,2128400,165,2137900,50,2126800,100,2138000,1,2126400,100,2138800,100,2124800,80 47 | 
2133500,400,2132600,5,2135000,16,2132500,25,2135400,100,2131500,100,2136400,53,2131000,47,2136900,50,2130000,419,2137000,350,2129000,6,2137900,50,2128400,165,2138000,1,2126800,100,2138800,100,2126400,100,2139400,50,2124800,80 48 | 2133500,400,2132600,5,2135000,16,2132500,25,2135300,100,2131500,100,2135400,100,2131000,47,2136400,53,2130000,419,2136900,50,2129000,6,2137000,350,2128400,165,2137900,50,2126800,100,2138000,1,2126400,100,2138800,100,2124800,80 49 | 2133300,100,2132600,5,2133500,400,2132500,25,2135000,16,2131500,100,2135300,100,2131000,47,2135400,100,2130000,419,2136400,53,2129000,6,2136900,50,2128400,165,2137000,350,2126800,100,2137900,50,2126400,100,2138000,1,2124800,80 50 | 2133500,400,2132600,5,2135000,16,2132500,25,2135300,100,2131500,100,2135400,100,2131000,47,2136400,53,2130000,419,2136900,50,2129000,6,2137000,350,2128400,165,2137900,50,2126800,100,2138000,1,2126400,100,2138800,100,2124800,80 51 | -------------------------------------------------------------------------------- /rf-calibration.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Calibration and test of the Random Forest model 6 | 7 | 1. Basic parameters 8 | 2. Features pre processing - Run it only once 9 | 3. Calibration & test 10 | 11 | @author: thertrader@gmail.com 12 | @Date: Mar 2020 13 | """ 14 | 15 | import os 16 | import sys 17 | import pandas as pd 18 | import numpy as np 19 | from datetime import datetime 20 | from sklearn.ensemble import RandomForestClassifier 21 | #from sklearn import metrics 22 | 23 | sys.path.append('/home/arno/work/programming/python') # add to the list of python paths 24 | import useful 25 | 26 | 27 | # ============================================================================= 28 | # 1 - Basic parameters 29 | # ============================================================================= 30 | tickers = ['TSLA'] 31 | 32 | #----- Lists of all files 33 | theFiles = [] 34 | for tk in tickers: 35 | os.chdir(r"/home/arno/work/research/lobster/data/" + tk) 36 | theFiles.extend(sorted(os.listdir())) 37 | 38 | #----- Unique dates 39 | theDates = [] 40 | for i in range(0,len(theFiles)): 41 | theDates.append(theFiles[i][5:15]) 42 | 43 | theDates = sorted(list(set(theDates))) 44 | 45 | #--- To limit the effect of Open & Close auctions I removed the first and last 15 minutes. Not sure what the real impact is... 
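# For reference, LOBSTER timestamps are seconds after midnight: the trading day
# runs from 34200 sec (9:30) to 57600 sec (16:00), and the cut-offs below work
# out as 35100 / 3600 = 9.75 -> 9:45 and 56700 / 3600 = 15.75 -> 15:45.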
46 | top = 35100     # 9:45, i.e. 15 min after the open
47 | bottom = 56700  # 15:45, i.e. 15 min before the close
48 | 
49 | 
50 | # =============================================================================
51 | # 2 - Features pre-processing - Run it only once
52 | # =============================================================================
53 | # tk = 'TSLA'
54 | # f = 'TSLA_2015-01-02_34200000_57600000_orderbook_10.csv'
55 | # m = '2015-01-05'
56 | # g = 'TSLA_2015-01-02_Y10Sec.csv'
57 | os.chdir(r'/home/arno/work/research/lobster/data/features/')
58 | for tk in tickers:
59 |     for m in theDates:
60 |         print(m + ' **** ' + datetime.now().strftime("%H:%M:%S")) # for prototyping only
61 |         thisStockFiles = [f for f in os.listdir() if (f[:4] == tk and f[5:15] == m)]
62 |         xFiles = sorted([f for f in thisStockFiles if ((f[16] != 'Y') and (f[16:36] != 'relativeIntensity10s') and (f[16:37] != 'relativeIntensity900s') and (f[16:28] != 'startAndStop'))])
63 |         yFiles = sorted([f for f in thisStockFiles if f[16] == 'Y'])
64 | 
65 |         #--- Define start, stop lines and index to keep
66 |         dd = pd.read_csv(yFiles[0])
67 |         timeStamp = dd['timeStamp']
68 |         t = abs(timeStamp - top).idxmin()
69 |         b = abs(timeStamp - bottom).idxmin()
70 | 
71 |         #--- Y variables
72 |         yVar = pd.DataFrame(dtype = 'int')
73 |         for g in yFiles:
74 |             dt = pd.read_csv(g)
75 |             dt = dt.drop(columns = ['timeStamp'])
76 |             yVar = pd.concat([yVar, dt], axis = 1)
77 |         yVar = yVar.iloc[t:(b+1),:]
78 |         yVar.columns = ['midRtn_10','midRtn_20']
79 |         yVar = yVar.astype(int)
80 | 
81 |         #--- X variables
82 |         xVar = pd.DataFrame()
83 |         for f in xFiles:
84 |             dt = pd.read_csv(f)
85 |             xVar = pd.concat([xVar, dt], axis = 1)
86 | 
87 |         xVar = xVar.drop(columns = ['timeStamp','decEventType_2','levelEventType_2','relIntEventType_2'])
88 |         xVar = xVar[[f for f in xVar.columns if (f[-1] in list('12345') or f[:3] == 'acc' or f[:3] == 'ave')]] # Drop order book levels > 5
89 |         xVar = xVar.iloc[t:(b+1),:] # remove the first and last 15 min
90 |         print(m + ' **** '+ str(len(list(xVar.columns))))
91 | 
92 |         yVar.to_csv(r'/home/arno/work/research/lobster/data/calibration/yVar_' + tk + '_' + m + '.csv', header = True, index = False)
93 |         xVar.to_csv(r'/home/arno/work/research/lobster/data/calibration/xVar_' + tk + '_' + m + '.csv', header = True, index = False)
94 | 
95 |         del xVar,yVar
96 | 
97 | 
98 | # =============================================================================
99 | # 3 - Calibration & test
100 | # =============================================================================
101 | #--- Create RFC
102 | rfc = RandomForestClassifier(
103 |         n_jobs = 4,              # number of processors it may use; -1 means no restriction, 1 means a single processor
104 |         n_estimators = 200,      # the number of trees in the forest
105 |         min_samples_leaf = 200,  # minimum number of observations (i.e. samples) in a terminal leaf
106 |         min_samples_split = 300, # minimum number of samples required to split an internal node
107 |         oob_score = True,        # out-of-bag score: each tree is validated on the samples left out of its bootstrap draw, giving a built-in cross-validation estimate
108 |         max_depth = None,        # the maximum depth of the trees
109 |         verbose = 1,             # to check the progress of the estimation
110 |         max_features = 'sqrt')   # the number of features to consider when looking for the best split; None = no limit
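# Calibration scheme (a summary of the loop below): for each date i, fit the
# forest on the pooled features/labels of the previous lookBackPeriod (4)
# trading days, then test out-of-sample on day i itself, i.e. a walk-forward
# with daily re-estimation.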
111 | 
112 | 
113 | #--- Set parameters to run the model
114 | os.chdir(r'/home/arno/work/research/lobster/data/calibration')
115 | listOfFiles = os.listdir()
116 | 
117 | tk = 'TSLA'
118 | th = 0.50
119 | lookBackPeriod = 4
120 | indepVar = 'midRtn_10'
121 | 
122 | gatherResults = pd.DataFrame(index = range((20-lookBackPeriod)),
123 |                              columns = ['date','IS-HR-U','#obs.1','IS-HR-S','#obs.2','IS-HR-D','#obs.3','OOS-HR-U','#obs.4','OOS-HR-S','#obs.5','OOS-HR-D','#obs.6',
124 |                                         'IS-HR-U55','#obs.155','IS-HR-S55','#obs.255','IS-HR-D55','#obs.355','OOS-HR-U55','#obs.455','OOS-HR-S55','#obs.555','OOS-HR-D55','#obs.655',
125 |                                         'IS-HR-U60','#obs.160','IS-HR-S60','#obs.260','IS-HR-D60','#obs.360','OOS-HR-U60','#obs.460','OOS-HR-S60','#obs.560','OOS-HR-D60','#obs.660',
126 |                                         'overallScore'])
127 | 
128 | #--- Run the model
129 | for i in range(len(theDates)):
130 |     if (i >= lookBackPeriod):
131 |         print(theDates[i] + ' **** ' + datetime.now().strftime("%H:%M:%S")) # for prototyping only
132 |         for tk in tickers:
133 |             outputName = tk + '_' + theDates[i] + '.csv'
134 | 
135 |             #--- Define features & labels calibration dataset (4 days)
136 |             x = pd.DataFrame()
137 |             xFiles = sorted([f for f in listOfFiles if f[5:9] == tk and f[0] == 'x'])
138 |             xFiles = xFiles[(i-lookBackPeriod):i]
139 |             for g in xFiles:
140 |                 dt = pd.read_csv(g)
141 |                 x = pd.concat([x, dt], axis = 0)
142 | 
143 |             y = pd.DataFrame()
144 |             yFiles = sorted([f for f in listOfFiles if f[5:9] == tk and f[0] == 'y'])
145 |             yFiles = yFiles[(i-lookBackPeriod):i]
146 |             for h in yFiles:
147 |                 du = pd.read_csv(h)
148 |                 y = pd.concat([y, du], axis = 0)
149 | 
150 |             #--- Define features & labels test dataset (1 day)
151 |             xTestFile = 'xVar_' + tk + '_' + theDates[i] + '.csv'
152 |             xTest = pd.read_csv(xTestFile)
153 | 
154 |             yTestFile = 'yVar_' + tk + '_' + theDates[i] + '.csv'
155 |             yTest = pd.read_csv(yTestFile)
156 | 
157 |             #--- Random forest estimation
158 |             theFit = rfc.fit(x.values, y[indepVar].values)
159 | 
160 |             #---- Extract decision rules from the in-sample estimation
161 |             decisionsRules = pd.Series(useful.getDecisionRules(theFit))
162 |             decisionsRules.to_csv(r'/home/arno/work/research/lobster/results/decisionsRules_' + outputName, header = False, index = False)
163 | 
164 |             #---- Feature importance ranked by score
165 |             featureImportance = pd.concat([pd.Series(x.columns),pd.Series(theFit.feature_importances_)], axis = 1)
166 |             featureImportance.columns = ['feature','score']
167 |             featureImportance = featureImportance.sort_values(by=['score'], ascending=False)
168 |             featureImportance.to_csv(r'/home/arno/work/research/lobster/results/featuresImportance_' + outputName, header = True, index = False)
169 | 
170 |             #---- Calibration diagnostic
171 |             featureOverallScore = theFit.score(x.values, y[indepVar].values)
172 | 
173 |             #---- Calibration: Hit Ratio per label
174 |             pp = theFit.predict_proba(x) # class probabilities; see the note below on column ordering
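# Note on predict_proba (a reading aid, not new logic): scikit-learn orders the
# probability columns by rfc.classes_, i.e. the sorted labels [-1, 0, 1]. So
# below, column 0 = P(label = -1), column 1 = P(label = 0) and
# column 2 = P(label = +1), which is what the hit-ratio indexing assumes.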
175 | newDt = np.concatenate([pp,y],axis= 1) 176 | newDt = newDt[:,[0,1,2,3]] 177 | 178 | #---- Test: Hit Ratio per label 179 | qq = theFit.predict_proba(xTest) 180 | newDtTest = np.concatenate([qq,yTest],axis= 1) 181 | newDtTest = newDtTest[:,[0,1,2,3]] 182 | 183 | #--- Store results 184 | k = i - lookBackPeriod 185 | gatherResults.iloc[k,0] = int(theDates[i][:4] + theDates[i][5:7] + theDates[i][8:10]) 186 | #--- th = 0.5 187 | gatherResults.iloc[k,1] = np.where((newDt[:,0] > th) & (newDt[:,3] == -1),1,0).sum()/np.where(newDt[:,0] > th,1,0).sum() 188 | gatherResults.iloc[k,2] = np.where(newDt[:,0] > th,1,0).sum() 189 | gatherResults.iloc[k,3] = np.where((newDt[:,1] > th) & (newDt[:,3] == 0),1,0).sum()/np.where(newDt[:,1] > th,1,0).sum() 190 | gatherResults.iloc[k,4] = np.where(newDt[:,1] > th,1,0).sum() 191 | gatherResults.iloc[k,5] = np.where((newDt[:,2] > th) & (newDt[:,3] == 1),1,0).sum()/np.where(newDt[:,2] > th,1,0).sum() 192 | gatherResults.iloc[k,6] = np.where(newDt[:,2] > th,1,0).sum() 193 | 194 | gatherResults.iloc[k,7] = np.where((newDtTest[:,0] > th) & (newDtTest[:,3] == -1),1,0).sum()/np.where(newDtTest[:,0] > th,1,0).sum() 195 | gatherResults.iloc[k,8] = np.where(newDtTest[:,0] > th,1,0).sum() 196 | gatherResults.iloc[k,9] = np.where((newDtTest[:,1] > th) & (newDtTest[:,3] == 0),1,0).sum()/np.where(newDtTest[:,1] > th,1,0).sum() 197 | gatherResults.iloc[k,10] = np.where(newDtTest[:,1] > th,1,0).sum() 198 | gatherResults.iloc[k,11] = np.where((newDtTest[:,2] > th) & (newDtTest[:,3] == 1),1,0).sum()/np.where(newDtTest[:,2] > th,1,0).sum() 199 | gatherResults.iloc[k,12] = np.where(newDtTest[:,2] > th,1,0).sum() 200 | 201 | #--- th = 0.55 202 | gatherResults.iloc[k,13] = np.where((newDt[:,0] > (th+0.05)) & (newDt[:,3] == -1),1,0).sum()/np.where(newDt[:,0] > (th+0.05),1,0).sum() 203 | gatherResults.iloc[k,14] = np.where(newDt[:,0] > (th+0.05),1,0).sum() 204 | gatherResults.iloc[k,15] = np.where((newDt[:,1] > (th+0.05)) & (newDt[:,3] == 0),1,0).sum()/np.where(newDt[:,1] > (th+0.05),1,0).sum() 205 | gatherResults.iloc[k,16] = np.where(newDt[:,1] > (th+0.05),1,0).sum() 206 | gatherResults.iloc[k,17] = np.where((newDt[:,2] > (th+0.05)) & (newDt[:,3] == 1),1,0).sum()/np.where(newDt[:,2] > (th+0.05),1,0).sum() 207 | gatherResults.iloc[k,18] = np.where(newDt[:,2] > (th+0.05),1,0).sum() 208 | 209 | gatherResults.iloc[k,19] = np.where((newDtTest[:,0] > (th+0.05)) & (newDtTest[:,3] == -1),1,0).sum()/np.where(newDtTest[:,0] > (th+0.05),1,0).sum() 210 | gatherResults.iloc[k,20] = np.where(newDtTest[:,0] > (th+0.05),1,0).sum() 211 | gatherResults.iloc[k,21] = np.where((newDtTest[:,1] > (th+0.05)) & (newDtTest[:,3] == 0),1,0).sum()/np.where(newDtTest[:,1] > (th+0.05),1,0).sum() 212 | gatherResults.iloc[k,22] = np.where(newDtTest[:,1] > (th+0.05),1,0).sum() 213 | gatherResults.iloc[k,23] = np.where((newDtTest[:,2] > (th+0.05)) & (newDtTest[:,3] == 1),1,0).sum()/np.where(newDtTest[:,2] > (th+0.05),1,0).sum() 214 | gatherResults.iloc[k,24] = np.where(newDtTest[:,2] > (th+0.05),1,0).sum() 215 | 216 | #--- th = 0.6 217 | gatherResults.iloc[k,25] = np.where((newDt[:,0] > (th+0.1)) & (newDt[:,3] == -1),1,0).sum()/np.where(newDt[:,0] > (th+0.1),1,0).sum() 218 | gatherResults.iloc[k,26] = np.where(newDt[:,0] > (th+0.1),1,0).sum() 219 | gatherResults.iloc[k,27] = np.where((newDt[:,1] > (th+0.1)) & (newDt[:,3] == 0),1,0).sum()/np.where(newDt[:,1] > (th+0.1),1,0).sum() 220 | gatherResults.iloc[k,28] = np.where(newDt[:,1] > (th+0.1),1,0).sum() 221 | gatherResults.iloc[k,29] = np.where((newDt[:,2] > 
(th+0.1)) & (newDt[:,3] == 1),1,0).sum()/np.where(newDt[:,2] > (th+0.1),1,0).sum()
222 |             gatherResults.iloc[k,30] = np.where(newDt[:,2] > (th+0.1),1,0).sum()
223 | 
224 |             gatherResults.iloc[k,31] = np.where((newDtTest[:,0] > (th+0.1)) & (newDtTest[:,3] == -1),1,0).sum()/np.where(newDtTest[:,0] > (th+0.1),1,0).sum()
225 |             gatherResults.iloc[k,32] = np.where(newDtTest[:,0] > (th+0.1),1,0).sum()
226 |             gatherResults.iloc[k,33] = np.where((newDtTest[:,1] > (th+0.1)) & (newDtTest[:,3] == 0),1,0).sum()/np.where(newDtTest[:,1] > (th+0.1),1,0).sum()
227 |             gatherResults.iloc[k,34] = np.where(newDtTest[:,1] > (th+0.1),1,0).sum()
228 |             gatherResults.iloc[k,35] = np.where((newDtTest[:,2] > (th+0.1)) & (newDtTest[:,3] == 1),1,0).sum()/np.where(newDtTest[:,2] > (th+0.1),1,0).sum()
229 |             gatherResults.iloc[k,36] = np.where(newDtTest[:,2] > (th+0.1),1,0).sum()
230 |             gatherResults.iloc[k,37] = featureOverallScore
231 | 
232 | gatherResults.to_csv(r'/home/arno/work/research/lobster/results/results.csv', header = True, index = False)
233 | 
234 | 
235 | 
236 | 
237 | 
238 | 
239 | 
240 | 
241 | 
242 | 
243 | 
244 | 
245 | 
246 | 
247 | 
248 | 
249 | 
250 | 
251 | 
--------------------------------------------------------------------------------
/rf-features.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | 
4 | """
5 | Definition of the features used in the random forest model
6 | 
7 | 1. Basic parameters
8 | 2. Features definition
9 |     - 2.1 START & STOP TIMESTAMPS: Run once per file. For each row, find the timestamp 5 sec back. This is used for the decile definitions.
10 |     - 2.2 IMBALANCE: Absolute levels & decile ranking compared to the last 5 min
11 |     - 2.3 MID PRICE, B/A PRICE SPREAD, B/A VOLUME SPREAD: Absolute levels & decile ranking compared to the last 5 min
12 |     - 2.4 PRICE DIFFERENCES ACROSS LEVELS: Bid & Ask differences between order book levels (1 to 10). Absolute levels & decile ranking compared to the last 5 min
13 |     - 2.5 PRICE AVERAGE & VOLUME AVERAGE: Mean calculated across order book levels. Decile ranking compared to the last 5 min
14 |     - 2.6 ACCUMULATED DIFFERENCES PRICE & VOLUME: Sum of all bids minus sum of all asks across levels. Decile ranking compared to the last 5 min
15 |     - 2.7 BID, ASK, BID SIZE & ASK SIZE DERIVATIVES: Last value compared to 1 sec ago. Decile ranking compared to the last 5 sec
16 |     - 2.8 AVERAGE TRADE INTENSITY: Count of events of type 1,2,3 (see definitions below) over the last second and decile ranking compared to the last 5 min
17 |     - 2.9 RELATIVE TRADE INTENSITY 10s: Count of events of type 1,2,3 over the last 10 sec (intermediary step)
18 |     - 2.10 RELATIVE TRADE INTENSITY 900s: Count of events of type 1,2,3 over the last 900 sec (intermediary step)
19 |     - 2.11 RELATIVE TRADE INTENSITY: Relative count of events 10 sec/900 sec. Decile ranking compared to the last 5 min
20 | 
21 | EVENT TYPE (see the LOBSTER website - [https://lobsterdata.com/](https://lobsterdata.com/)):
22 |     1. New limit order
23 |     2. Partial deletion of a limit order
24 |     3. Total cancellation of a limit order
25 | 
26 | 
27 | @author: thertrader@gmail.com
28 | @Date: Mar 2020
29 | """
30 | 
31 | import os
32 | import pandas as pd
33 | import numpy as np
34 | from scipy import stats
35 | from datetime import datetime
36 | 
37 | 
38 | # =============================================================================
39 | # 1 - Basic parameters
40 | # =============================================================================
41 | os.chdir(r'/home/arno/work/research/lobster/data')
42 | 
43 | tickers = ['TSLA']
44 | 
45 | nlevels = 10
46 | 
47 | col = ['Ask Price ','Ask Size ','Bid Price ','Bid Size ']
48 | 
49 | theNames = []
50 | for i in range(1, nlevels + 1):
51 |     for j in col:
52 |         theNames.append(str(j) + str(i))
53 | 
54 | #----- Lists of messages files and order book files
55 | theFiles = []
56 | theOrderBookFiles = []
57 | theMessagesFiles = []
58 | 
59 | for tk in tickers:
60 |     # tk = 'INTC'
61 |     os.chdir(r"/home/arno/work/research/lobster/data/" + tk)
62 |     theFiles.extend(sorted(os.listdir()))
63 |     theOrderBookFiles.extend([sl for sl in theFiles if "_orderbook" in sl])
64 |     theMessagesFiles.extend([sk for sk in theFiles if "_message" in sk])
65 |     theFiles = []
66 | 
67 | 
68 | # =============================================================================
69 | # 2 - Features definition
70 | # =============================================================================
71 | #----- 2.1 Define Start & Stop timestamps for all files (run it once per file)
72 | for f in theOrderBookFiles:
73 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
74 | 
75 |     theName = f[0:15] + '_startAndStop.csv'
76 | 
77 |     if (f[0:4] == 'INTC'):
78 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
79 | 
80 |     if (f[0:4] == 'TSLA'):
81 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
82 | 
83 |     dt = pd.read_csv(f, names = theNames)
84 |     #dt.head(5)
85 | 
86 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
87 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
88 |     dfTimeStamp = mf['timeStamp']
89 |     theIndex = dfTimeStamp.index.values
90 |     timeStampInSeconds = dfTimeStamp.round(0)
91 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:6])
92 |     timeStamp = dfTimeStamp.to_frame().values
93 | 
94 |     start = []
95 | 
96 |     for j in range(len(timeStamp)):
97 |         if (j == 0):
98 |             start.append(j)
99 | 
100 |         elif (j <= maxLookBack):
101 |             theIndexValue = int(abs(timeStamp[:j] - (timeStamp[j] - 5)).argmin())
102 |             start.append(theIndexValue)
103 | 
104 |         elif (j > maxLookBack):
105 |             a = j - maxLookBack
106 |             bb = np.column_stack((theIndex[a:j],timeStamp[a:j]))
107 |             theIndexValue = int(bb[abs(bb[:,1] - (timeStamp[j] - 5)).argmin(),0])
108 |             start.append(theIndexValue)
109 | 
110 |     stop = list(range(len(timeStamp)))
111 | 
112 |     startAndStop = pd.concat([dfTimeStamp, pd.Series(start), pd.Series(stop)], axis= 1)
113 |     startAndStop.columns = ['timeStamp','start','stop']
114 | 
115 |     startAndStop.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
116 | 
117 | 
118 | 
119 | #----- 2.2 Imbalance Level and Derivative per level
120 | for f in theOrderBookFiles:
121 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
122 | 
123 |     theNameLevel = f[0:15] + '_imbalanceLevel_' + str(nlevels) + '.csv'
124 |     theNameDerivative = f[0:15] + 
'_imbalanceDerivative_' + str(nlevels) + '.csv'
125 | 
126 |     if (f[0:4] == 'INTC'):
127 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
128 | 
129 |     if (f[0:4] == 'TSLA'):
130 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
131 | 
132 |     #--- Open Order Book file
133 |     theOrderBookFile = pd.read_csv(f, names = theNames)
134 | 
135 |     #--- Open Messages file
136 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
137 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
138 |     timeStamp = mf['timeStamp'].to_frame().values
139 | 
140 |     #--- Open Start & Stop file
141 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
142 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
143 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
144 |     start = np.array(startStopFile['start'])
145 |     stop = np.array(startStopFile['stop'])
146 | 
147 |     imbLevel = pd.DataFrame()
148 |     imbDerivative = pd.DataFrame()
149 | 
150 |     for i in range(1,nlevels+1):
151 |         #--- Define Imbalance
152 |         nameOne = 'Bid Size '+ str(i)
153 |         nameTwo = 'Ask Size '+ str(i)
154 |         colName = 'imb '+ str(i)
155 |         lev = (theOrderBookFile[nameTwo] - theOrderBookFile[nameOne])/(theOrderBookFile[nameTwo] + theOrderBookFile[nameOne])
156 |         theLevel = round(10 * (lev - lev.min()) / (lev.max() - lev.min()),0) # 0-10 bucket via min-max scaling over the whole day
157 |         lev = np.array(lev)
158 | 
159 |         #--- Define Imbalance Derivative
160 |         # print('**** ' + datetime.now().strftime("%H:%M:%S"))
161 |         theDerivative = [round(stats.percentileofscore(lev[start[k]:k],lev[k])/10,0) for k in range(len(timeStamp))] # Deciles
162 |         # print('**** ' + datetime.now().strftime("%H:%M:%S"))
163 | 
164 |         imbLevel = pd.concat([imbLevel,theLevel],axis=1)
165 |         imbDerivative = pd.concat([imbDerivative,pd.Series(theDerivative)],axis=1)
166 | 
167 |     imbLevel.columns = ['imbLevel1','imbLevel2','imbLevel3','imbLevel4','imbLevel5','imbLevel6','imbLevel7','imbLevel8','imbLevel9','imbLevel10']
168 |     imbLevel.to_csv(r'/home/arno/work/research/lobster/data/features/' + theNameLevel, header = True, index = False)
169 | 
170 |     imbDerivative.columns = ['imbDer1','imbDer2','imbDer3','imbDer4','imbDer5','imbDer6','imbDer7','imbDer8','imbDer9','imbDer10']
171 |     imbDerivative.to_csv(r'/home/arno/work/research/lobster/data/features/' + theNameDerivative, header = True, index = False)
172 | 
173 |     os.chdir(r'/home/arno/work/research/lobster/data')
174 | 
175 | 
176 | 
177 | #----- 2.3 Mid price, Spread, Volume Spread derivatives
178 | for f in theOrderBookFiles:
179 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
180 |     theName = f[0:15] + '_misc_' + str(nlevels) + '.csv'
181 | 
182 |     if (f[0:4] == 'INTC'):
183 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
184 | 
185 |     if (f[0:4] == 'TSLA'):
186 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
187 | 
188 |     dt = pd.read_csv(f, names = theNames)
189 | 
190 |     #--- Open Start & Stop file
191 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
192 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
193 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
194 |     start = np.array(startStopFile['start'])
195 |     stop = np.array(startStopFile['stop'])
196 | 
197 |     aa = pd.DataFrame()
198 |     newNames = []
199 | 
200 |     for i in range(1,nlevels+1):
201 |         newNames.extend(['midDer'+ str(i),'spreadDer'+ str(i),'volumeDer'+ str(i)])
202 | 
203 |         nameOne = 'Bid Price '+ str(i)
204 |         nameTwo = 'Ask Price '+ str(i)
205 |         nameThree = 'Bid Size '+ str(i)
206 |         nameFour = 'Ask Size '+ str(i)
207 | 
208 |         #--- Define Levels
209 |         mid = np.array((dt[nameTwo] + dt[nameOne])/2)
210 |         spread = np.array(dt[nameTwo] - dt[nameOne])
211 |         volumeSpread = np.array(dt[nameFour] - dt[nameThree])
212 | 
213 |         #--- Define Derivatives
214 |         theMidDerivative = [round(stats.percentileofscore(mid[start[k]:k],mid[k])/10,0) for k in range(len(startStopFile))] # Deciles
215 |         theSpreadDerivative = [round(stats.percentileofscore(spread[start[k]:k],spread[k])/10,0) for k in range(len(startStopFile))] # Deciles
216 |         theVolumeSpreadDerivative = [round(stats.percentileofscore(volumeSpread[start[k]:k],volumeSpread[k])/10,0) for k in range(len(startStopFile))] # Deciles
217 | 
218 |         aa = pd.concat([aa,pd.Series(theMidDerivative),pd.Series(theSpreadDerivative),pd.Series(theVolumeSpreadDerivative)],axis=1)
219 | 
220 |     aa = pd.concat([startStopFile['timeStamp'],aa],axis=1)
221 |     newNames.insert(0,'timeStamp')
222 |     aa.columns = newNames
223 |     aa.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
224 | 
225 |     os.chdir(r'/home/arno/work/research/lobster/data')
226 | 
227 | 
228 | 
229 | #----- 2.4 Price differences
230 | for f in theOrderBookFiles:
231 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
232 |     theName = f[0:15] + '_priceDiff_' + str(nlevels) + '.csv'
233 | 
234 |     if (f[0:4] == 'INTC'):
235 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
236 | 
237 |     if (f[0:4] == 'TSLA'):
238 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
239 | 
240 |     dt = pd.read_csv(f, names = theNames)
241 | 
242 |     #--- Open Start & Stop file
243 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
244 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
245 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
246 |     start = np.array(startStopFile['start'])
247 |     stop = np.array(startStopFile['stop'])
248 | 
249 |     newNames = []
250 |     cc = pd.DataFrame()
251 | 
252 |     for i in range(1,nlevels):
253 |         # i = 1
254 |         nameOne = 'Ask Price '+ str(i)
255 |         nameTwo = 'Ask Price '+ str(i+1)
256 |         aa = np.array(abs(dt[nameTwo] - dt[nameOne])) # absolute price gap between consecutive ask levels
257 |         theAskDiffDerivative = [round(stats.percentileofscore(aa[start[k]:k],aa[k])/10,0) for k in range(len(startStopFile))] # Deciles
258 | 
259 |         nameThree = 'Bid Price '+ str(i)
260 |         nameFour = 'Bid Price '+ str(i+1)
261 |         bb = np.array(abs(dt[nameFour] - dt[nameThree])) # absolute price gap between consecutive bid levels
262 |         theBidDiffDerivative = [round(stats.percentileofscore(bb[start[k]:k],bb[k])/10,0) for k in range(len(startStopFile))] # Deciles
263 | 
264 |         newNames.extend(['AskPriceDiff '+ str(i),'BidPriceDiff '+ str(i)])
265 | 
266 |         cc = pd.concat([cc,pd.Series(theAskDiffDerivative),pd.Series(theBidDiffDerivative)],axis=1)
267 | 
268 |     cc = pd.concat([startStopFile['timeStamp'],cc],axis=1)
269 |     newNames.insert(0,'timeStamp')
270 |     cc.columns = newNames
271 | 
272 |     cc.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
273 | 
274 |     os.chdir(r'/home/arno/work/research/lobster/data')
275 | 
276 | 
277 | 
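# The same trailing-window decile ranking recurs in sections 2.2 to 2.8. A
# compact restatement of the pattern (a sketch for the reader; the sections
# below keep their original inline list comprehensions):
def rollingDecile(x, start):
    # decile of x[k] within the trailing 5-sec window x[start[k]:k]
    return [round(stats.percentileofscore(x[start[k]:k], x[k]) / 10, 0)
            for k in range(len(x))]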
278 | #----- 2.5 Mean Price and Volume
279 | for f in theOrderBookFiles:
280 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
281 |     theName = f[0:15] + '_meanPriceAndVolume.csv'
282 | 
283 |     if (f[0:4] == 'INTC'):
284 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
285 | 
286 |     if (f[0:4] == 'TSLA'):
287 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
288 | 
289 |     dt = pd.read_csv(f, names = theNames)
290 | 
291 |     #--- Open Start & Stop file
292 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
293 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
294 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
295 |     start = np.array(startStopFile['start'])
296 |     stop = np.array(startStopFile['stop'])
297 | 
298 |     for i in range(1,nlevels+1):
299 |         #i = 1
300 |         accAskName = 'Ask Price '+ str(i)
301 |         accBidName = 'Bid Price '+ str(i)
302 |         accAskVolName = 'Ask Size '+ str(i)
303 |         accBidVolName = 'Bid Size '+ str(i)
304 | 
305 |         if i == 1:
306 |             dtAccAsk = dt[accAskName]
307 |             dtAccBid = dt[accBidName]
308 |             dtAccAskVol = dt[accAskVolName]
309 |             dtAccBidVol = dt[accBidVolName]
310 | 
311 |         if i != 1:
312 |             dtAccAsk = dtAccAsk + dt[accAskName]
313 |             dtAccBid = dtAccBid + dt[accBidName]
314 |             dtAccAskVol = dtAccAskVol + dt[accAskVolName]
315 |             dtAccBidVol = dtAccBidVol + dt[accBidVolName]
316 | 
317 |     meanAsk = np.array(dtAccAsk/nlevels)
318 |     meanBid = np.array(dtAccBid/nlevels)
319 |     meanAskSize = np.array(dtAccAskVol/nlevels)
320 |     meanBidSize = np.array(dtAccBidVol/nlevels)
321 | 
322 |     aa = [round(stats.percentileofscore(meanAsk[start[k]:k],meanAsk[k])/10,0) for k in range(len(startStopFile))] # Deciles
323 |     bb = [round(stats.percentileofscore(meanBid[start[k]:k],meanBid[k])/10,0) for k in range(len(startStopFile))] # Deciles
324 |     cc = [round(stats.percentileofscore(meanAskSize[start[k]:k],meanAskSize[k])/10,0) for k in range(len(startStopFile))] # Deciles
325 |     dd = [round(stats.percentileofscore(meanBidSize[start[k]:k],meanBidSize[k])/10,0) for k in range(len(startStopFile))] # Deciles
326 | 
327 |     ee = pd.concat([pd.Series(startStopFile['timeStamp']),pd.Series(aa),pd.Series(bb),pd.Series(cc),pd.Series(dd)],axis=1)
328 |     ee.columns = ['timeStamp','averageAsk','averageBid','averageAskSize','averageBidSize']
329 | 
330 |     ee.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
331 | 
332 |     os.chdir(r'/home/arno/work/research/lobster/data')
333 | 
334 | 
335 | 
336 | #----- 2.6 Accumulated differences Price and Volume
337 | for f in theOrderBookFiles:
338 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S")) # used for prototyping only
339 |     theName = f[0:15] + '_accDiffPriceAndVolume.csv'
340 | 
341 |     if (f[0:4] == 'INTC'):
342 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
343 | 
344 |     if (f[0:4] == 'TSLA'):
345 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
346 | 
347 |     dt = pd.read_csv(f, names = theNames)
348 | 
349 |     #--- Open Start & Stop file
350 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
351 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
352 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
353 |     start = startStopFile['start']
354 |     stop = startStopFile['stop']
355 | 
356 |     aa = pd.DataFrame()
357 | 
358 |     # i = 1
359 |     aa['AccPriceDiff'] = (dt ['Bid Price 1'] + dt ['Bid Price 2'] + dt ['Bid Price 3'] + dt ['Bid Price 4'] + dt ['Bid Price 5'] +
360 |                           dt ['Bid Price 6'] + dt ['Bid Price 7'] + dt ['Bid Price 8'] + dt ['Bid Price 9'] + dt ['Bid Price 10'] -
361 |                           dt ['Ask Price 1'] - dt ['Ask Price 2'] - dt ['Ask Price 3'] - dt ['Ask Price 4'] - dt ['Ask Price 5'] -
362 |                           dt ['Ask Price 6'] - dt ['Ask Price 7'] - dt ['Ask Price 8'] - dt ['Ask Price 9'] - dt ['Ask Price 10'])
363 | 
364 |     aa['AccSizeDiff'] = (dt ['Bid Size 1'] + dt ['Bid Size 2'] + dt ['Bid Size 3'] + dt ['Bid Size 4'] + dt ['Bid Size 5'] +
365 |                          dt ['Bid Size 6'] + dt ['Bid Size 7'] + dt ['Bid Size 8'] + dt ['Bid Size 9'] + dt ['Bid Size 10'] -
366 |                          dt ['Ask Size 1'] - dt ['Ask Size 2'] - dt ['Ask Size 3'] - dt ['Ask Size 4'] - dt ['Ask Size 5'] -
367 |                          dt ['Ask Size 6'] - dt ['Ask Size 7'] - dt ['Ask Size 8'] - dt ['Ask Size 9'] - dt ['Ask Size 10'])
368 | 
369 |     accPriceDiff = np.array(aa['AccPriceDiff'])
370 |     accSizeDiff = np.array(aa['AccSizeDiff'])
371 | 
372 |     bb = [round(stats.percentileofscore(accPriceDiff[start[k]:k],accPriceDiff[k])/10,0) for k in range(len(startStopFile))] # Deciles
373 |     cc = [round(stats.percentileofscore(accSizeDiff[start[k]:k],accSizeDiff[k])/10,0) for k in range(len(startStopFile))] # Deciles
374 | 
375 |     accDiffPriceAndVolume = pd.concat([pd.Series(startStopFile['timeStamp']),pd.Series(bb),pd.Series(cc)],axis=1)
376 |     accDiffPriceAndVolume.columns = ['timeStamp','accPriceDiff','accSizeDiff']
377 | 
378 |     accDiffPriceAndVolume.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
379 | 
380 |     os.chdir(r'/home/arno/work/research/lobster/data')
381 | 
382 | 
383 | 
384 | #----- 2.7 Price and Volume Derivatives
385 | for f in theOrderBookFiles:
386 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
387 | 
388 |     theName = f[0:15] + '_priceAndVolumeDerivatives.csv'
389 | 
390 |     if (f[0:4] == 'INTC'):
391 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
392 | 
393 |     if (f[0:4] == 'TSLA'):
394 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
395 | 
396 |     #--- The message file
397 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
398 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
399 |     dfTimeStamp = mf['timeStamp']
400 |     timeStampInSeconds = dfTimeStamp.round(0)
401 |     maxLookBack = max(timeStampInSeconds.value_counts()) + 5
402 |     timeStamp = mf['timeStamp'].to_frame().values
403 | 
404 |     #--- The order book file
405 |     theDataFile = f[0:15] + '_34200000_57600000_orderbook_10.csv'
406 |     df = pd.read_csv(theDataFile, names = theNames)
407 |     dfArray = df.values
408 | 
409 |     #--- Open Start & Stop file
410 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
411 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
412 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
413 |     start = np.array(startStopFile['start'])
414 |     stop = np.array(startStopFile['stop'])
415 | 
416 |     begin = []
417 | 
418 |     # start_time = time.time() # used only for profiling
419 |     for i in range(len(timeStamp)):
420 |         # i = 223200
421 |         if i == 0:
422 |             begin.append(0)
423 | 
424 |         elif i < maxLookBack:
425 |             theIndexValue = abs(timeStamp[:i,0] - (timeStamp[i,0] - 1)).argmin()
426 |             begin.append(theIndexValue)
427 | 
428 |         elif i >= maxLookBack:
429 |             a = i - maxLookBack
430 |             bb = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
431 |             theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] - 1)).argmin(),0]
432 |             begin.append(int(theIndexValue))
433 | 
434 |     end = np.array(range(len(timeStamp)))
384 | #----- 2.7 Price and Volume Derivatives
385 | for f in theOrderBookFiles:
386 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
387 | 
388 |     theName = f[0:15] + '_priceAndVolumeDerivatives.csv'
389 | 
390 |     if (f[0:4] == 'INTC'):
391 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
392 | 
393 |     if (f[0:4] == 'TSLA'):
394 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
395 | 
396 |     #--- The message file
397 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
398 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
399 |     dfTimeStamp = mf['timeStamp']
400 |     timeStampInSeconds = dfTimeStamp.round(0)
401 |     maxLookBack = max(timeStampInSeconds.value_counts()) + 5
402 |     timeStamp = mf['timeStamp'].to_frame().values
403 | 
404 |     #--- The order book file
405 |     theDataFile = f[0:15] + '_34200000_57600000_orderbook_10.csv'
406 |     df = pd.read_csv(theDataFile, names = theNames)
407 |     dfArray = df.values
408 | 
409 |     #--- Open Start & Stop file
410 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
411 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
412 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
413 |     start = np.array(startStopFile['start'])
414 |     stop = np.array(startStopFile['stop'])
415 | 
416 |     begin = []
417 | 
418 |     # start_time = time.time() # used only for profiling
419 |     for i in range(len(timeStamp)):
420 |         # i = 223200
421 |         if i == 0:
422 |             begin.append(0)
423 | 
424 |         elif i < maxLookBack:
425 |             theIndexValue = abs(timeStamp[:i,0] - (timeStamp[i,0] - 1)).argmin()
426 |             begin.append(theIndexValue)
427 | 
428 |         elif i >= maxLookBack:
429 |             a = i - maxLookBack
430 |             bb = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
431 |             theIndexValue = bb[abs(bb[:,1] - (timeStamp[i,0] - 1)).argmin(),0]
432 |             begin.append(int(theIndexValue))
433 | 
434 |     end = np.array(range(len(timeStamp)))
435 |     begin = np.array(begin)
436 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # used only for profiling
437 | 
438 |     theBigArray = np.empty([len(end), df.shape[1]])
439 |     for i in range(df.shape[1]):
440 |         # i = 0
441 |         # j = 0
442 |         aa = np.array([(dfArray[end[j],i]/dfArray[begin[j],i] - 1) for j in range(len(end))]) # 1 sec. derivative of column i
443 |         bb = [round(stats.percentileofscore(aa[start[k]:k],aa[k])/10,0) for k in range(len(startStopFile))] # Deciles of the derivative
444 |         theBigArray[:,i] = np.array(bb)
445 | 
446 |     theBigArray = pd.DataFrame(theBigArray)
447 |     theBigArray.columns = theNames
448 | 
449 |     theBigArray.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
450 | 
451 |     os.chdir(r'/home/arno/work/research/lobster/data')
452 | 
453 | 
454 | 
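# =============================================================================
# The begin-index search in section 2.7 above scans up to maxLookBack rows for
# every message, and the same pattern reappears in section 2.8 below. Because
# the timestamps are sorted, np.searchsorted can locate the window start in
# O(log n) per row instead. A sketch; oneSecondStarts is a hypothetical name,
# and it returns the first row with a timestamp >= t - window rather than the
# row closest to t - window, so it can differ from the loops by one index.
# =============================================================================
import numpy as np

def oneSecondStarts(ts, window = 1):
    # ts: 1-d sorted array of message timestamps (in seconds)
    return np.searchsorted(ts, ts - window, side = 'left')

# e.g. begin = oneSecondStarts(timeStamp[:,0]) would replace the loop above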
455 | #----- 2.8 Average intensity (1 sec.)
456 | #--- Only the following event types are counted (see the LOBSTER page for the exact definitions): 1 (submission of a new limit order), 2 (partial cancellation) and 3 (total deletion of a limit order)
457 | for f in theMessagesFiles:
458 |     #f = 'INTC_2015-01-02_34200000_57600000_message_10.csv'
459 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
460 | 
461 |     theName = f[0:15] + '_tradeIntensity.csv'
462 | 
463 |     if (f[0:4] == 'INTC'):
464 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
465 | 
466 |     if (f[0:4] == 'TSLA'):
467 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
468 | 
469 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
470 | 
471 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
472 |     dfTimeStamp = mf['timeStamp']
473 |     timeStampInSeconds = dfTimeStamp.round(0) # used to define max lookback period
474 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:2]) # used to define max lookback period
475 |     timeStamp = mf['timeStamp'].to_frame().values
476 |     mf = mf.values
477 | 
478 |     #--- Open Start & Stop file
479 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
480 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == f[:15]))]
481 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
482 |     start = np.array(startStopFile['start'])
483 |     stop = np.array(startStopFile['stop'])
484 | 
485 |     eventsCount = np.empty((len(mf),3))
486 | 
487 |     # start_time = time.time() # For profiling only
488 |     for i in range(len(timeStamp)):
489 |         if i == 0:
490 |             eventsCount[0,0] = 0
491 |             eventsCount[0,1] = 0
492 |             eventsCount[0,2] = 0
493 | 
494 |         elif i < maxLookBack:
495 |             startIndexValue = abs(timeStamp[:i,0] - (timeStamp[i,0] - 1)).argmin()
496 |             mfSmall = mf[startIndexValue:i,:]
497 |             eventsCount[i,0] = np.count_nonzero(mfSmall[:,1] == 1)
498 |             eventsCount[i,1] = np.count_nonzero(mfSmall[:,1] == 2)
499 |             eventsCount[i,2] = np.count_nonzero(mfSmall[:,1] == 3)
500 | 
501 |         elif i >= maxLookBack:
502 |             a = i - maxLookBack
503 |             bb = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
504 |             startIndexValue = int(bb[abs(bb[:,1] - (timeStamp[i,0] - 1)).argmin(),0])
505 |             mfSmall = mf[startIndexValue:i,:]
506 |             eventsCount[i,0] = np.count_nonzero(mfSmall[:,1] == 1)
507 |             eventsCount[i,1] = np.count_nonzero(mfSmall[:,1] == 2)
508 |             eventsCount[i,2] = np.count_nonzero(mfSmall[:,1] == 3)
509 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
510 | 
511 |     cc = [round(stats.percentileofscore(eventsCount[start[k]:k,0],eventsCount[k,0])/10,0) for k in range(len(startStopFile))] # Deciles
512 |     dd = [round(stats.percentileofscore(eventsCount[start[k]:k,1],eventsCount[k,1])/10,0) for k in range(len(startStopFile))] # Deciles
513 |     ee = [round(stats.percentileofscore(eventsCount[start[k]:k,2],eventsCount[k,2])/10,0) for k in range(len(startStopFile))] # Deciles
514 | 
515 |     ff = pd.concat([dfTimeStamp,pd.Series(cc),pd.Series(dd),pd.Series(ee),pd.DataFrame(eventsCount)],axis=1)
516 |     ff.columns = ['timeStamp','decEventType_1','decEventType_2','decEventType_3','levelEventType_1','levelEventType_2','levelEventType_3']
517 | 
518 |     ff.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
519 | 
520 |     os.chdir(r'/home/arno/work/research/lobster/data')
521 | 
522 | 
523 | 
524 | #----- 2.9 Relative intensity - step 1 - (10 sec.)
525 | for f in theMessagesFiles:
526 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
527 | 
528 |     theName = f[0:15] + '_relativeIntensity10s.csv'
529 | 
530 |     if (f[0:4] == 'INTC'):
531 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
532 | 
533 |     if (f[0:4] == 'TSLA'):
534 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
535 | 
536 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
537 | 
538 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
539 |     dfTimeStamp = mf['timeStamp']
540 |     timeStampInSeconds = dfTimeStamp.round(0)
541 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:11])
542 |     timeStampInSeconds = timeStampInSeconds.values
543 |     timeStamp = mf['timeStamp'].to_frame().values
544 |     mf = mf.values
545 | 
546 |     eventsCount10s = np.empty((len(mf),3))
547 | 
548 |     # start_time = time.time() # For profiling only
549 |     for i in range(len(timeStamp)):
550 |         if (i == 0):
551 |             eventsCount10s[i,0] = 0
552 |             eventsCount10s[i,1] = 0
553 |             eventsCount10s[i,2] = 0
554 | 
555 |         elif (i <= maxLookBack): # close to the open: count from the start of the session
556 |             mfSmall = mf[:i,1]
557 |             eventsCount10s[i,0] = np.count_nonzero(mfSmall == 1)
558 |             eventsCount10s[i,1] = np.count_nonzero(mfSmall == 2)
559 |             eventsCount10s[i,2] = np.count_nonzero(mfSmall == 3)
560 | 
561 |         elif (i > maxLookBack):
562 |             a = i - maxLookBack
563 |             aa = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
564 |             startIndexValue = int(aa[abs(aa[:,1] - (timeStamp[i,0] - 10)).argmin(),0])
565 |             mfSmall = mf[startIndexValue:i,1]
566 |             eventsCount10s[i,0] = np.count_nonzero(mfSmall == 1)
567 |             eventsCount10s[i,1] = np.count_nonzero(mfSmall == 2)
568 |             eventsCount10s[i,2] = np.count_nonzero(mfSmall == 3)
569 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
570 | 
571 |     eventsCount10s = pd.DataFrame(eventsCount10s)
572 |     eventsCount10s.columns = ['EventType_1','EventType_2','EventType_3']
573 | 
574 |     # Add time stamp
575 |     eventsCount10s = pd.concat([dfTimeStamp,eventsCount10s],axis=1)
576 |     eventsCount10s.columns = ['timeStamp','EventType_1','EventType_2','EventType_3']
577 |     eventsCount10s.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
578 | 
579 |     os.chdir(r'/home/arno/work/research/lobster/data')
580 | 
581 | 
582 | 
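# =============================================================================
# Section 2.9 above and section 2.10 below are identical up to the window
# length (10 sec. vs 900 sec.). A single parameterised sketch of both steps,
# using the same searchsorted idea as above; countEventsOverWindow is a
# hypothetical name and it assumes ts is sorted.
# =============================================================================
import numpy as np

def countEventsOverWindow(mf, ts, window):
    # mf: message array (column 1 = event type); ts: sorted timestamps;
    # window: lookback in seconds. Per-row counts of event types 1, 2 and 3.
    counts = np.zeros((len(ts), 3))
    starts = np.searchsorted(ts, ts - window, side = 'left')
    for i in range(1, len(ts)):
        small = mf[starts[i]:i, 1]
        for j, eventType in enumerate([1, 2, 3]):
            counts[i, j] = np.count_nonzero(small == eventType)
    return counts

# e.g. eventsCount10s = countEventsOverWindow(mf, timeStamp[:,0], 10)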
583 | #----- 2.10 Relative intensity - step 2 - (900 sec.)
584 | for f in theMessagesFiles:
585 |     print(f + ' **** ' + datetime.now().strftime("%H:%M:%S"))
586 | 
587 |     theName = f[0:15] + '_relativeIntensity900s.csv'
588 | 
589 |     if (f[0:4] == 'INTC'):
590 |         os.chdir(r'/home/arno/work/research/lobster/data/INTC')
591 | 
592 |     if (f[0:4] == 'TSLA'):
593 |         os.chdir(r'/home/arno/work/research/lobster/data/TSLA')
594 | 
595 |     theMessageFile = f[0:15] + '_34200000_57600000_message_10.csv'
596 | 
597 |     mf = pd.read_csv(theMessageFile, names = ['timeStamp','EventType','Order ID','Size','Price','Direction'], float_precision='round_trip')
598 |     dfTimeStamp = mf['timeStamp']
599 |     timeStampInSeconds = dfTimeStamp.round(0)
600 |     maxLookBack = sum(timeStampInSeconds.value_counts().iloc[:901])
601 |     timeStampInSeconds = timeStampInSeconds.values
602 |     timeStamp = mf['timeStamp'].to_frame().values
603 |     mf = mf.values
604 | 
605 |     eventsCount900s = np.empty((len(mf),3))
606 | 
607 |     # start_time = time.time() # For profiling only
608 |     for i in range(len(timeStamp)):
609 |         if (i == 0):
610 |             eventsCount900s[i,0] = 0
611 |             eventsCount900s[i,1] = 0
612 |             eventsCount900s[i,2] = 0
613 | 
614 |         elif (i <= maxLookBack): # close to the open: count from the start of the session
615 |             mfSmall = mf[:i,1]
616 |             eventsCount900s[i,0] = np.count_nonzero(mfSmall == 1)
617 |             eventsCount900s[i,1] = np.count_nonzero(mfSmall == 2)
618 |             eventsCount900s[i,2] = np.count_nonzero(mfSmall == 3)
619 | 
620 |         elif (i > maxLookBack):
621 |             a = i - maxLookBack
622 |             aa = np.column_stack((np.array(dfTimeStamp.iloc[a:i].index),timeStamp[a:i,0]))
623 |             startIndexValue = int(aa[abs(aa[:,1] - (timeStamp[i,0] - 900)).argmin(),0])
624 |             mfSmall = mf[startIndexValue:i,1]
625 |             eventsCount900s[i,0] = np.count_nonzero(mfSmall == 1)
626 |             eventsCount900s[i,1] = np.count_nonzero(mfSmall == 2)
627 |             eventsCount900s[i,2] = np.count_nonzero(mfSmall == 3)
628 |     # print("--- %s seconds ---" % round(time.time() - start_time,2)) # For profiling only
629 | 
630 |     eventsCount900s = pd.DataFrame(eventsCount900s)
631 |     eventsCount900s.columns = ['EventType_1','EventType_2','EventType_3']
632 | 
633 |     # Add time stamp
634 |     eventsCount900s = pd.concat([dfTimeStamp,eventsCount900s],axis=1)
635 |     eventsCount900s.columns = ['timeStamp','EventType_1','EventType_2','EventType_3']
636 |     eventsCount900s.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
637 | 
638 |     os.chdir(r'/home/arno/work/research/lobster/data')
639 | 
640 | 
641 | 
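# =============================================================================
# Section 2.11 below divides the 10 sec. counts by the 900 sec. counts. Early
# in the session a 900 sec. count can be zero; numpy turns 0/0 into NaN and
# x/0 into inf instead of raising, so the scalar try/except in useful.py's
# divisionByZero never fires on arrays. An array-safe sketch, defaulting to 1
# as divisionByZero does; safeRatio is a hypothetical name.
# =============================================================================
import numpy as np

def safeRatio(num, den):
    # Element-wise num/den, returning 1 wherever den is zero
    num = np.asarray(num, dtype = float)
    den = np.asarray(den, dtype = float)
    return np.where(den == 0, 1.0, num/np.where(den == 0, 1.0, den))

# e.g. relInt1 = safeRatio(the10sFile['EventType_1'], the900sFile['EventType_1'])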
642 | #----- 2.11 Relative intensity - step 3 - Put it all together
643 | os.chdir(r'/home/arno/work/research/lobster/data/features/')
644 | allFiles = os.listdir()
645 | 
646 | files900s = [sl for sl in allFiles if "relativeIntensity900s" in sl and 'TSLA' in sl] # TSLA files only
647 | files900s.sort()
648 | 
649 | files10s = [sk for sk in allFiles if "relativeIntensity10s" in sk and 'TSLA' in sk]
650 | files10s.sort()
651 | 
652 | for i in range(len(files10s)):
653 |     print(files10s[i] + ' *** ' + datetime.now().strftime("%H:%M:%S"))
654 |     theName = files10s[i][0:15] + '_relativeIntensity.csv'
655 | 
656 |     the10sFile = pd.read_csv(files10s[i], float_precision='round_trip')
657 |     the900sFile = pd.read_csv(files900s[i], float_precision='round_trip')
658 | 
659 |     #--- Open Start & Stop file
660 |     os.chdir(r'/home/arno/work/research/lobster/data/features/')
661 |     sSFile = [g for g in os.listdir() if ((g[-17:] == '_startAndStop.csv') and (g[:15] == files10s[i][:15]))]
662 |     startStopFile = pd.read_csv(sSFile[0], names = ['timeStamp','start','stop'], header = 0)
663 |     start = np.array(startStopFile['start'])
664 |     stop = np.array(startStopFile['stop'])
665 | 
666 |     relInt1 = np.array(the10sFile['EventType_1']/the900sFile['EventType_1']) # 10 sec. count relative to 900 sec. count; a zero 900 sec. count yields NaN/inf here
667 |     relInt2 = np.array(the10sFile['EventType_2']/the900sFile['EventType_2'])
668 |     relInt3 = np.array(the10sFile['EventType_3']/the900sFile['EventType_3'])
669 | 
670 |     aa = [round(stats.percentileofscore(relInt1[start[k]:k],relInt1[k])/10,0) for k in range(len(startStopFile))] # Deciles
671 |     bb = [round(stats.percentileofscore(relInt2[start[k]:k],relInt2[k])/10,0) for k in range(len(startStopFile))] # Deciles
672 |     cc = [round(stats.percentileofscore(relInt3[start[k]:k],relInt3[k])/10,0) for k in range(len(startStopFile))] # Deciles
673 | 
674 |     relInt = pd.concat([the10sFile['timeStamp'],pd.Series(aa),pd.Series(bb),pd.Series(cc)], axis = 1)
675 |     relInt.columns = ['timeStamp','relIntEventType_1','relIntEventType_2','relIntEventType_3']
676 | 
677 |     relInt.to_csv(r'/home/arno/work/research/lobster/data/features/' + theName, header = True, index = False)
678 | 
679 | 
680 | 
--------------------------------------------------------------------------------