├── FIP
│   ├── nnERA.xlsx
│   ├── FIPdata.xlsx
│   ├── TestFIPData.xlsx
│   ├── validation_rows.xlsx
│   ├── FIPGradientDescent.py
│   ├── FIPNN.py
│   ├── FIP-DNN.py
│   └── FIPdata.csv
├── PitcherData.xlsx
├── .idea
│   ├── misc.xml
│   ├── modules.xml
│   ├── SimpleBaseballNN.iml
│   └── workspace.xml
├── README.md
├── .gitignore
└── PitcherData.csv

/FIP/nnERA.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jacobdanovitch/Deep-Neural-Networks-for-Baseball/HEAD/FIP/nnERA.xlsx
--------------------------------------------------------------------------------
/FIP/FIPdata.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jacobdanovitch/Deep-Neural-Networks-for-Baseball/HEAD/FIP/FIPdata.xlsx
--------------------------------------------------------------------------------
/PitcherData.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jacobdanovitch/Deep-Neural-Networks-for-Baseball/HEAD/PitcherData.xlsx
--------------------------------------------------------------------------------
/FIP/TestFIPData.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jacobdanovitch/Deep-Neural-Networks-for-Baseball/HEAD/FIP/TestFIPData.xlsx
--------------------------------------------------------------------------------
/FIP/validation_rows.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jacobdanovitch/Deep-Neural-Networks-for-Baseball/HEAD/FIP/validation_rows.xlsx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Deep Neural Networks for Baseball

On October 11, 2017, I made a note in the repository for my other baseball analysis project that said:
> I regret my resentment towards (this project) for being "just a sports stats project." I think I should really try to do more projects like this one.

New Year's resolution = complete :tada:

While reading [Andrew Trask's "Grokking Deep Learning"](https://www.manning.com/books/grokking-deep-learning), I've tried to hold myself accountable for learning the material by implementing it from scratch in creative ways. To that end, I've started this project, which uses different neural network architectures to predict baseball statistics. I'm starting simple, predicting a pitcher's ERA from a few basic inputs to get my bearings, and I aim to work up to much more sophisticated metrics.
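
For context (added here, not from the book): FIP, Fielding Independent Pitching, which gives the `/FIP` folder its name, is conventionally computed from the three outcomes a pitcher controls most directly. One common simplified form (ignoring hit-by-pitches; the constant is league- and year-dependent, usually around 3.1) is:

```
FIP = (13*HR + 3*BB - 2*K) / IP + constant
```

Rather than assuming those fixed coefficients, the scripts below regress ERA directly on the per-nine-inning rates (`K9`, `BB9`, `HR9`) and let the model learn its own weights.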

## How it works

* `/FIP/FIPGradientDescent.py`: Modelling ERA from basic inputs with simple linear regression fit by gradient descent
* `/FIP/FIPNN.py`: Modelling ERA from basic inputs with a neural network containing one hidden layer
* `/FIP/FIP-DNN.py`: Modelling ERA with an arbitrarily deep neural network

## What I Learned

I learned about different neural network architectures and the theory behind how they learn, and I started to grasp the intuition behind backpropagation as credit assignment.

## Future Plans

* Keep learning new models!
* Use more sophisticated inputs and outputs, and nested models

## Current (Known) Problems

* `/FIP/FIP-DNN.py`: `NeuralNetwork.fit()` didn't return predictions; the first layer's `predict()` call was missing `return_pred=True` (patched in the source below)

### Built With

* **Python** - pandas, NumPy
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
--------------------------------------------------------------------------------
/FIP/FIPGradientDescent.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd

def predict(x, weights):
    return np.dot(x, weights)

def cost(prediction, values):
    # mean squared error, halved so the gradient has no stray factor of 2
    m = len(prediction)
    err = np.square(prediction - values).sum()
    return err / (2*m)

def gradient_descent(x, y, alpha=0.0001, max_iter=10000, random_seed=1234, verbose=False):
    np.random.seed(random_seed)
    weights = np.random.random((len(x[0]), 1))

    m = len(y)
    cost_history = []

    for i in range(max_iter):
        pred = predict(x, weights)
        delta = pred - y

        weight_deltas = np.dot(x.transpose(), delta)
        weights -= alpha*weight_deltas
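        # Added note: for the cost J(w) = (1/(2m)) * ||Xw - y||^2, the exact
        # gradient is (1/m) * X^T (Xw - y). The update above omits the 1/m
        # factor and effectively folds it into the small learning rate alpha,
        # which is why the call below uses alpha = 1e-6.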

        '''
        equivalent per-sample (stochastic) version of the vectorized update above:

        for i in range(len(x)):
            pred = np.dot(x[i], weights)
            delta = (pred - y[i])
            weight_deltas = alpha*np.array([delta*input for input in x[i]])
            weights = weights - weight_deltas
        '''

        cost_history.append(cost(pred, y))

        if i % 100 == 0 and verbose:
            print("Iteration", i)
            print("Cost:", cost_history[-1], end="\n \n")

    print("Iteration", max_iter)
    print("Cost:", cost_history[-1])

    return [i[0] for i in weights]

data = pd.read_csv('TestFIPData.csv')
train_data = data.sample(frac=0.7)
test_data = data.loc[~data.index.isin(train_data.index)]

x_train = train_data[['K9', 'BB9', 'HR9']]
x_test = test_data[['K9', 'BB9', 'HR9']]

y_train = train_data[['ERA']]
y_test = test_data[['ERA']]

weights = gradient_descent(x_train.values, y_train.values, alpha=0.000001)

'''
test_df = pd.concat([x_test, y_test], axis=1)
test_df.loc[:,'gdFIP'] = predict(test_df.iloc[:,:3], weights)
print(test_df.head(10))
'''

data.loc[:, 'gdFIP'] = predict(data[['K9', 'BB9', 'HR9']], weights)

# the context manager writes and closes the file on exit
# (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('validation_rows.xlsx') as writer:
    data.to_excel(writer, sheet_name='gdFIP')
--------------------------------------------------------------------------------
/FIP/FIPNN.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import pprint

def predict(x, weights):
    return np.dot(x, weights)

def cost(prediction, values):
    m = len(prediction)
    err = np.square(prediction - values).sum()
    return err / (2*m)

def relu(x, deriv=False):
    if not deriv:
        return x * (x > 0)  # zeroes the output where the input is negative
    else:
        return x > 0  # derivative: 1 where x is positive, 0 elsewhere

def generate_weights(size):
    return np.random.random(size)

def neural_network(x, y, hidden_layers, alpha=0.0001, max_iter=10000, verbose=False, predict_data=[], random_seed=1234):
    # note: despite its name, hidden_layers is the number of units in the
    # single hidden layer, not a layer count
    np.random.seed(random_seed)
    layer0_weights = generate_weights((len(x[0]), hidden_layers))  # layer 0 -> 1
    layer1_weights = generate_weights((hidden_layers, 1))  # layer 1 -> 2

    m = len(y)
    cost_history = []

    for i in range(max_iter):
        layer0_output = relu(predict(x, layer0_weights))
        layer1_output = predict(layer0_output, layer1_weights)

        layer1_deltas = layer1_output - y
        layer0_deltas = np.dot(layer1_deltas, layer1_weights.transpose()) * relu(layer0_output, deriv=True)

        layer1_weights -= alpha * np.dot(layer0_output.transpose(), layer1_deltas)
        layer0_weights -= alpha * np.dot(x.transpose(), layer0_deltas)

        cost_history.append(cost(layer1_output, y))

        # report progress on a cadence that scales with max_iter
        # (every 100 iterations when max_iter is 10000)
        if i % 10**(len(str(max_iter))-2) == 0 and (i == 0 or i >= 10**(len(str(max_iter))-2)) and verbose:
            print("Iteration", i)
            print("Cost:", cost_history[-1], end="\n \n")

    print("Iteration", max_iter)
    print("Cost:", cost_history[-1])

    if len(predict_data) > 0:
        layer0_output = relu(predict(predict_data, layer0_weights))
        layer1_output = predict(layer0_output, layer1_weights)
        return layer1_output

    model = {
        'L0': layer0_weights.tolist(),
        'L1': layer1_weights.tolist()
    }

    if verbose:
        pprint.pprint(model)

    return model
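
# Added shape walk-through: with m training rows and 3 features,
#   x: (m, 3) @ layer0_weights: (3, hidden_layers) -> hidden output: (m, hidden_layers)
#   hidden: (m, hidden_layers) @ layer1_weights: (hidden_layers, 1) -> prediction: (m, 1)
# Backprop pushes the output error through the transposed weights, and the
# relu(..., deriv=True) mask blocks the gradient for units that were
# inactive on the forward pass.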

def main():
    data = pd.read_csv('TestFIPData.csv')
    train_data = data.sample(frac=0.7)
    test_data = data.loc[~data.index.isin(train_data.index)]

    x_train = train_data[['K9', 'BB9', 'HR9']]
    x_test = test_data[['K9', 'BB9', 'HR9']]

    y_train = train_data[['ERA']]
    y_test = test_data[['ERA']]

    '''
    test_df = pd.concat([x_test, y_test], axis=1)
    test_df.loc[:,'gdFIP'] = predict(test_df.iloc[:,:3], weights)
    print(test_df.head(10))
    '''

    data.loc[:, 'nnERA'] = neural_network(x_train.values, y_train.values, 4, verbose=True,
                                          alpha=0.0000001, predict_data=data[['K9', 'BB9', 'HR9']])

    # context manager instead of the removed ExcelWriter.save()
    with pd.ExcelWriter('nnERA.xlsx') as writer:
        data.to_excel(writer, sheet_name='nnERA')

main()
--------------------------------------------------------------------------------
/FIP/FIP-DNN.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd


class Layer:
    def __init__(self, position, max_position, alpha):
        """
        :param position: this layer's index in the network
        :param max_position: index of the final (output) layer, i.e. network depth - 1
        :param alpha: learning rate
        """
        self.position = position
        self.max_position = max_position
        self.alpha = alpha

        self.weights = []  # weights of layer
        self.prediction = None  # output of layer
        self.delta = None  # delta for backprop weight adjustment
        self.cost = None  # cost (only used for final output layer)

        print("Layer", position, "initialized.\n")

    def generate_weights(self, size):
        self.weights = np.random.random(size)  # randomly initialize weights given a shape tuple
        print("Layer", self.position, "weights:", self.weights, '\n ')  # print for confirmation

    def get_weights(self):
        return self.weights

    def update_weights(self, delta):
        self.weights -= self.alpha * delta

    def relu(self, x, deriv=False):
        if not deriv:
            return x * (x > 0)  # sets output to 0 if input is negative
        else:
            return x > 0  # output is 1 if x is positive

    def predict(self, x, return_pred=False):
        if self.position == self.max_position:
            # output layer prediction; no relu activation (raw regression output)
            self.prediction = np.dot(x, self.weights)
        else:
            # hidden layer prediction; use relu activation
            self.prediction = self.relu(np.dot(x, self.weights))

        if return_pred:
            return self.prediction

    def get_prediction(self):
        return self.prediction

    def calculate_cost(self, prediction, values):
        m = len(prediction)
        err = np.square(prediction - values).sum()
        return err / (2 * m)

    def get_cost(self):
        return self.cost

    def get_delta(self):
        return self.delta

    def calculate_delta(self, y=None, prev_layer=None):
        if self.position == self.max_position:
            # only for the final output layer
            self.delta = self.prediction - y  # subtracts the truth vector from the prediction vector
            self.cost = self.calculate_cost(self.prediction, y.values)  # record cost
        else:
            """
            Delta calculation for backpropagating through hidden layers:
            the dot product of the downstream layer's delta and its
            transposed weights, multiplied by the relu derivative to
            freeze units that were inactive on the forward pass.
            """
            self.delta = np.dot(prev_layer.get_delta(), prev_layer.get_weights().transpose()) * self.relu(self.prediction, deriv=True)
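
# Added note: "prev_layer" above means the layer processed just before this
# one in the backward pass, i.e. the layer *after* it in the network. Each
# hidden delta is the downstream delta pushed back through that layer's
# transposed weights, masked by the relu derivative; units that output 0 on
# the forward pass contributed nothing, so they receive no credit (or blame)
# and therefore no update.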

class NeuralNetwork:
    def __init__(self, num_layers, x, y, max_iter=1000, alpha=0.01, random_seed=1234):
        '''
        HYPERPARAMS
        :param num_layers: total layers in the network
        :param max_iter: iterations for gradient descent
        :param alpha: learning rate (passed to layers)
        :param random_seed: random seed for weight generation (change this if your network sucks!!)
        '''

        self.alpha = alpha

        self.num_layers = num_layers
        self.layers = []  # list containing all Layer objects

        self.max_iter = max_iter
        self.random_seed = random_seed

        '''
        DATA
        :param x: input vector/matrix
        :param features: shape of input / # of features
        :param y: ground truth vector/matrix
        '''

        self.x = x
        self.features = len(x.iloc[0])
        self.y = y

    def init_layers(self):
        np.random.seed(self.random_seed)  # for deterministic weight generation
        max_position = self.num_layers - 1  # index of the output layer

        # construct layers (weights are generated next)
        self.layers = [Layer(i, max_position, alpha=self.alpha) for i in range(self.num_layers)]

        # generate weights for the first layer; its input side is always as
        # wide as the # of features (hidden width is tied to num_layers in this design)
        self.layers[0].generate_weights((self.features, self.num_layers))

        # if the user specifies more than one layer
        if self.num_layers > 1:
            # generate weights for middle hidden layers (won't run for num_layers == 2)
            for layer in self.layers[1:max_position]:
                # TODO: make the number of neurons configurable
                layer.generate_weights((self.num_layers, self.num_layers))

            # generate weights for the output layer (single output for regression)
            self.layers[-1].generate_weights((self.num_layers, 1))

    def forward_prop(self):
        pred = self.layers[0].predict(self.x, return_pred=True)  # initial prediction from the input matrix
        for l, layer in enumerate(self.layers[1:]):
            # chain each layer's prediction into the next
            pred = layer.predict(pred, return_pred=True)

    def back_prop(self):
        for l, layer in reversed(list(enumerate(self.layers))):
            if l == len(self.layers) - 1:
                # final output layer: calculate the initial delta for backprop
                layer.calculate_delta(y=self.y)
            else:
                # hidden layers: backprop the delta from the layer after this one
                layer.calculate_delta(prev_layer=self.layers[l + 1])

    def update_weights(self):
        for l, layer in enumerate(self.layers):
            if l == 0:
                # first layer: input matrix transposed, dotted with this layer's
                # delta (multiplied by alpha inside Layer.update_weights)
                layer.update_weights(np.dot(self.x.transpose(), layer.get_delta()))
            else:
                # other layers: use the previous layer's output
                prev_layer = self.layers[l - 1]
                layer.update_weights(np.dot(prev_layer.get_prediction().transpose(), layer.get_delta()))

    def train(self):
        # initialize layers
        self.init_layers()

        for epoch in range(1, self.max_iter+1):
            # forward, back, update
            self.forward_prop()
            self.back_prop()
            self.update_weights()

            # print every thousand epochs
            if epoch % 1000 == 0:
                self.print_layers(epoch)

    def print_layers(self, epoch=-1):
        if epoch >= 0:
            print("Epoch", epoch, "complete.")

        for l, layer in enumerate(self.layers):
            print("Layer", l)
            print(layer.get_weights())
            print()

        print("Cost:", self.layers[-1].get_cost())

    def fit(self, x):
        # bug fix: this call was missing return_pred=True, so pred was None
        # and every subsequent layer choked on it
        pred = self.layers[0].predict(x, return_pred=True)
        for l, layer in enumerate(self.layers[1:]):
            pred = layer.predict(pred, return_pred=True)
        return pred
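
# Example usage (illustrative sketch; mirrors main() below):
#   nn = NeuralNetwork(3, x_train, y_train, max_iter=20000, alpha=0.0000001)
#   nn.train()
#   predictions = nn.fit(x_test)  # forward pass over held-out rows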

def main():
    """
    This is just project-specific stuff: load the data, train, and predict.
    """
    data = pd.read_csv('TestFIPData.csv')
    train_data = data.sample(frac=0.7)
    test_data = data.loc[~data.index.isin(train_data.index)]

    x_train = train_data[['K9', 'BB9', 'HR9']]
    x_test = test_data[['K9', 'BB9', 'HR9']]

    y_train = train_data[['ERA']]
    y_test = test_data[['ERA']]

    neural_net = NeuralNetwork(3, x_train, y_train, max_iter=20000, alpha=0.0000001)
    neural_net.train()

    # with fit() patched above, this should now return predictions:
    '''
    predictions = neural_net.fit(x_test)
    print(predictions)
    '''

    '''
    boilerplate code for saving to xlsx:

    test_df = pd.concat([x_test, y_test], axis=1)
    test_df.loc[:,'gdFIP'] = predict(test_df.iloc[:,:3], weights)
    print(test_df.head(10))

    data.loc[:,'nnERA'] = neural_network(x_train.values, y_train.values, 4, verbose=True,
                                         alpha=0.0000001, predict_data=data[['K9', 'BB9', 'HR9']])

    with pd.ExcelWriter('nnERA.xlsx') as writer:
        data.to_excel(writer, sheet_name='nnERA')
    '''


# execute
main()
--------------------------------------------------------------------------------