├── README.md
├── cnn-model
│   ├── LICENSE
│   ├── README.md
│   ├── Type Ia Supernova Classifier - Convolutional Neural Network.ipynb
│   ├── Type Ia Supernova Classifier - Convolutional Neural Network.py
│   └── space_utils.py
├── requirements.txt
└── xgboost-baseline
    ├── README.md
    ├── XGBoost Comparison Model.ipynb
    └── XGBoost Comparison Model.py
/README.md:
--------------------------------------------------------------------------------
1 | ## space2vec: Model Code
2 |
3 | Check out the posts here: [space2vec.com](http://space2vec.com)
4 |
5 | The blog posts discuss the project behind the code in detail, but this
6 | is where the cool code stuff happens!
7 |
8 |
9 | ## Data
10 | You can find the feature-engineered CSV on the autoscan project site (under the "Features" heading) here:
11 | [http://portal.nersc.gov/project/dessn/autoscan/](http://portal.nersc.gov/project/dessn/autoscan/)
12 |
13 |
14 | ## Environment
15 | We have supplied a requirements.txt file that you can use to set up the right environment. It was made for Python 3.6,
16 | so if you get errors about missing versions or something similar, try removing everything from the "==" onward for that
17 | library in requirements.txt and run the install again.
18 |
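If several libraries fail to resolve, editing the pins by hand gets tedious. The sketch below shows one way to drop `==<version>` pins from requirements text. The function name `relax_pins` and the sample package versions are illustrative assumptions, not part of this repo, and the sample text is not the actual contents of `requirements.txt`.

```python
import re

# Matches "<name>==<version>" at the start of a line (hypothetical helper pattern).
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*[^;\s]+")


def relax_pins(text, packages=None):
    """Remove '==<version>' pins from requirements text.

    If `packages` is given, only those package names are unpinned;
    otherwise every pinned line is unpinned. Comments, blank lines and
    anything after the version (e.g. environment markers) are kept.
    """
    wanted = None if packages is None else {name.lower() for name in packages}
    relaxed = []
    for line in text.splitlines():
        match = PIN.match(line)
        if match and (wanted is None or match.group(1).lower() in wanted):
            line = match.group(1) + line[match.end():]
        relaxed.append(line)
    return "\n".join(relaxed)


# Placeholder requirements text, for illustration only.
sample = "numpy==1.14.0\n# plotting\nmatplotlib==2.1.2"
print(relax_pins(sample, packages=["numpy"]))
```

To apply this to the real file you would read `requirements.txt`, pass its text through the function, write the result back, and rerun `pip install -r requirements.txt`.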
19 |
20 | ## Posts
21 |
22 | ### Week 2: building a baseline model
23 | See [/xgboost-baseline](https://github.com/pippinlee/space2vec-ml-code/tree/master/building-baseline-model) for code
24 |
25 | We pickled the feature-engineered data for the model above; you can find the data here:
26 | [https://drive.google.com/open?id=1Pa4-imVbK7yfZuCX3mfF-mMae1eyhQqo](https://drive.google.com/open?id=1Pa4-imVbK7yfZuCX3mfF-mMae1eyhQqo)
27 |
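Once the file is downloaded, it can be read back with the standard `pickle` module. This is a minimal sketch: the helper names `load_pickle` and `summarize` are made up for illustration, and the demo round-trips a stand-in dictionary because the layout of the real pickled object is not described here. Unpickling can execute arbitrary code, so only load files you trust.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a pickled object and close the file afterwards."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Return a short description: the type name, plus the length if it has one."""
    name = type(obj).__name__
    return "{} (len={})".format(name, len(obj)) if hasattr(obj, "__len__") else name


# Round-trip a stand-in object; replace demo_path with the path of the downloaded file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    with open(demo_path, "wb") as handle:
        pickle.dump({"features": [1.0, 2.0, 3.0]}, handle)
    loaded = load_pickle(demo_path)

print(summarize(loaded))
```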
28 |
29 |
30 | ###### Maintained by Pippin Lee (p.lee@dessa.com) and Cole Clifford (c.clifford@dessa.com)
31 |
--------------------------------------------------------------------------------
/cnn-model/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) Pippin Lee, Jinnah Ali-Clarke, Cole Clifford.
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 |
--------------------------------------------------------------------------------
/cnn-model/README.md:
--------------------------------------------------------------------------------
1 | ## CNN Model
2 |
3 | We have provided 3 files:
4 |
5 | 1. The IPython/Jupyter notebook file (`Type Ia Supernova Classifier - Convolutional Neural Network.ipynb`)
6 | 2. The .py file exported from IPython/Jupyter (`Type Ia Supernova Classifier - Convolutional Neural Network.py`)
7 | 3. Functions that are used in the other 2 files (`space_utils.py`)
8 |
9 | For this specific model, we strongly recommend the IPython/Jupyter notebook file. The code
10 | explanation is a lot nicer in the notebook interface, so it will be easier to follow what is going on!
11 |
12 | There are 2 main data files that are used in the code:
13 |
14 | 1. all_object_data_in_dictionary_format.pkl
15 | 2. normalized_image_object_data_in_numpy_format.pkl
16 |
17 | The description of what each one does is in the code.
18 |
19 | Each file comes in 3 different sizes, with the links below:
20 |
21 | | Filename | S3 Link | File Size |
22 | |--------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|-----------|
23 | | all_object_data_in_dictionary_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/all_object_data_in_dictionary_format.pkl | 6.7GB |
24 | | normalized_image_object_data_in_numpy_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/normalized_image_object_data_in_numpy_format.pkl | 13.0GB |
25 | | small_all_object_data_in_dictionary_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/small_all_object_data_in_dictionary_format.pkl | 772.0MB |
26 | | small_normalized_image_object_data_in_numpy_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/small_normalized_image_object_data_in_numpy_format.pkl | 1.5GB |
27 | | extra_small_all_object_data_in_dictionary_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/extra_small_all_object_data_in_dictionary_format.pkl | 386.0MB |
28 | | extra_small_normalized_image_object_data_in_numpy_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/extra_small_normalized_image_object_data_in_numpy_format.pkl | 744.2MB |
29 |
30 | You can pick any of the links from that table and use `wget <link>` to download the data.
31 |
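If you would rather fetch the files from Python than with `wget`, something like the sketch below could work. The base URL and file names come from the table above; the size labels (`"full"`, `"small"`, `"extra_small"`) and the function names `data_urls` and `download` are assumptions made for this example.

```python
import os
import urllib.request

BASE_URL = "https://s3.amazonaws.com/space2vec-public/post3/"
DATA_FILES = (
    "all_object_data_in_dictionary_format.pkl",
    "normalized_image_object_data_in_numpy_format.pkl",
)
# Hypothetical size labels mapped to the file name prefixes used in the table.
SIZE_PREFIXES = {"full": "", "small": "small_", "extra_small": "extra_small_"}


def data_urls(size="extra_small"):
    """Return the two download URLs for the requested dataset size."""
    prefix = SIZE_PREFIXES[size]
    return [BASE_URL + prefix + name for name in DATA_FILES]


def download(url, target_dir="."):
    """Download url into target_dir unless the file is already there."""
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(destination):
        urllib.request.urlretrieve(url, destination)
    return destination


# Only prints the URLs; call download(url, "data") on each one to fetch it.
for url in data_urls("extra_small"):
    print(url)
```

Starting with the extra-small pair keeps the download under about 1.2GB in total, going by the sizes listed in the table.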
32 | ## License
33 |
34 | `space2vec-ml-code` is available under the MIT license. See the LICENSE file for more info.
35 |
36 | Copyright 2018 Pippin Lee, Jinnah Ali-Clarke, Cole Clifford.
37 |
--------------------------------------------------------------------------------
/cnn-model/Type Ia Supernova Classifier - Convolutional Neural Network.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "from sklearn.model_selection import StratifiedShuffleSplit\n",
10 | "from keras.callbacks import ModelCheckpoint, EarlyStopping\n",
11 | "from keras.layers.normalization import BatchNormalization\n",
12 | "from keras.layers import MaxPooling2D, Flatten, Conv2D\n",
13 | "from keras.layers import Dense, Dropout, Activation\n",
14 | "from keras.models import Sequential\n",
15 | "from matplotlib import pyplot as plt\n",
16 | "from slackclient import SlackClient\n",
17 | "from keras.models import load_model\n",
18 | "from keras.optimizers import Adam\n",
19 | "from space_utils import *\n",
20 | "\n",
21 | "from keras import regularizers\n",
22 | "from time import process_time\n",
23 | "from shutil import copyfile\n",
24 | "\n",
25 | "import pandas as pd\n",
26 | "import numpy as np\n",
27 | "\n",
28 | "import pickle\n",
29 | "import random\n",
30 | "import os\n",
31 | "\n",
32 | "pd.options.display.max_columns = 45"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "## Introduction\n",
40 | "---\n",
41 | "Hi and hello! Welcome to the step-by-step guide on how to train a model to detect supernovae.\n",
42 | "\n",
43 | "Throughout this guide you will learn about the data that we used, the building of a model in Keras, and how we went about record keeping for our experiments.\n",
44 | "\n",
45 | "There is a separate file called space_utils.py that holds any functions that we wrote for our project."
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Constants\n",
53 | "---\n",
54 | "We find it best to define a set of constants at the beginning of the notebook for clarity."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "HOME_PATH = \"/home/ubuntu\"\n",
64 | "DATA_PATH = \"/home/ubuntu/data/\"\n",
65 | "MODEL_PATH = \"/home/ubuntu/model/\"\n",
66 | "RESULTS_PATH = \"/home/ubuntu/results/\"\n",
67 | "\n",
68 | "ALL_DATA_FILE = \"extra_small_all_object_data_in_dictionary_format.pkl\"\n",
69 | "NORMALIZED_IMAGE_DATA_FILE = \"extra_small_normalized_image_object_data_in_numpy_format.pkl\"\n",
70 | "\n",
71 | "MODEL_LOGGING_FILE = \"model_results.csv\""
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "## Data Loading\n",
79 | "---\n",
80 | "We first have to load in the data to be used for model training.\n",
81 | "\n",
82 | "This consists of 2 main data files stored in the variables ALL_DATA_FILE and NORMALIZED_IMAGE_DATA_FILE.\n",
83 | "\n",
84 | "ALL_DATA_FILE: This holds any information that is relevant to an object observation. It is a dictionary\n",
85 | "with 4 keys -- images, targets, file_paths, observation_numbers -- where each key holds a Numpy array. The indices of\n",
86 | "each array are all properly aligned according to their respective objects (explained in the table).\n",
87 | "\n",
88 | "| X_normalized | X | Y | file_path | observation_number |\n",
89 | "|---------------------|----------|----------|------------------|---------------------------|\n",
90 | "| obj_0_X_normalized | obj_0_X | obj_0_Y | obj_0_file_path | obj_0_observation_number |\n",
91 | "| obj_42_X_normalized | obj_42_X | obj_42_Y | obj_42_file_path | obj_42_observation_number |\n",
92 | "\n",
93 | "NORMALIZED_IMAGE_DATA_FILE: This is simply a Numpy array of photos ready to be fed into a model. They are normalized and the channels -- search image, template image, difference image -- are organized properly. The preparation of this data is in >>>FILL IN<<<."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "all_data = pickle.load(open(DATA_PATH + ALL_DATA_FILE, \"rb\"))\n",
103 | "all_images_normalized = pickle.load(open(DATA_PATH + NORMALIZED_IMAGE_DATA_FILE, \"rb\"))"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "## Data Splitting\n",
111 | "---\n",
112 | "We have to split the data into 3 different sets: training, validation, and testing. Using the *split_space_data*\n",
113 | "function we imported from *space_utils.py*, this is pretty straightforward.\n",
114 | "\n",
115 | "P.S. Sorry that each line is so long... We tried multiple ways of making this easier on the eyes but this makes\n",
116 | "the most sense!"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {},
123 | "outputs": [],
124 | "source": [
125 | "(X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_test, X_test_normal, Y_test, file_path_test, observation_number_test) = split_space_data(\n",
126 | " all_images_normalized, \n",
127 | " all_data[\"images\"],\n",
128 | " all_data[\"targets\"], \n",
129 | " all_data[\"file_paths\"], \n",
130 | " all_data[\"observation_numbers\"], \n",
131 | " 0.1\n",
132 | ")"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "(X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_valid, X_valid_normal, Y_valid, file_path_valid, observation_number_valid) = split_space_data(\n",
142 | " X_train,\n",
143 | " X_train_normal,\n",
144 | " Y_train,\n",
145 | " file_path_train,\n",
146 | " observation_number_train,\n",
147 | " 0.2\n",
148 | ")"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "## Model Definition\n",
156 | "---\n",
157 | "We define the model in a function just to keep things separated nicely. Feel free to change the model however\n",
158 | " you like! Try things out :D "
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "def build_model(X, Y, params):\n",
168 | "    \n",
169 | "    # Figure out the data shape\n",
170 | "    input_shape = (X.shape[1], X.shape[2], X.shape[3])\n",
171 | "    \n",
172 | "    # Define the model object to append layers to\n",
173 | "    model = Sequential()\n",
174 | "    \n",
175 | "    # First layer: two 3x3 convolutions, the second with stride 2 to downsample\n",
176 | "    model.add(Conv2D(\n",
177 | "        filters=params[\"NUMBER_OF_FILTERS_1\"],\n",
178 | "        kernel_size=(3,3),\n",
179 | "        strides=(1,1),\n",
180 | "        padding='same',\n",
181 | "        data_format='channels_first',\n",
182 | "        input_shape=input_shape\n",
183 | "    ))\n",
184 | "    model.add(Activation('relu'))\n",
185 | "    model.add(Conv2D(\n",
186 | "        filters=params[\"NUMBER_OF_FILTERS_1\"],\n",
187 | "        kernel_size=(3,3),\n",
188 | "        strides=(2,2),\n",
189 | "        padding='same',\n",
190 | "        data_format='channels_first',\n",
191 | "        input_shape=input_shape\n",
192 | "    ))\n",
193 | "    model.add(BatchNormalization(axis=1))\n",
194 | "    model.add(Activation('relu'))\n",
195 | "    \n",
196 | "    # Second layer\n",
197 | "    model.add(Conv2D(\n",
198 | "        filters=params[\"NUMBER_OF_FILTERS_2\"],\n",
199 | "        strides=(1,1),\n",
200 | "        kernel_size=(3,3),\n",
201 | "        padding='same',\n",
202 | "        data_format='channels_first',\n",
203 | "    ))\n",
204 | "    model.add(Activation('relu'))\n",
205 | "    model.add(Conv2D(\n",
206 | "        filters=params[\"NUMBER_OF_FILTERS_2\"],\n",
207 | "        strides=(2,2),\n",
208 | "        kernel_size=(3,3),\n",
209 | "        padding='same',\n",
210 | "        data_format='channels_first',\n",
211 | "    ))\n",
212 | "    model.add(BatchNormalization(axis=1))\n",
213 | "    model.add(Activation('relu'))\n",
214 | "    \n",
215 | "    # Third layer\n",
216 | "    model.add(Conv2D(\n",
217 | "        filters=params[\"NUMBER_OF_FILTERS_3\"],\n",
218 | "        strides=(1,1),\n",
219 | "        kernel_size=(3,3),\n",
220 | "        padding='same',\n",
221 | "        data_format='channels_first',\n",
222 | "    ))\n",
223 | "    model.add(Activation('relu'))\n",
224 | "    model.add(Conv2D(\n",
225 | "        filters=params[\"NUMBER_OF_FILTERS_3\"],\n",
226 | "        strides=(2,2),\n",
227 | "        kernel_size=(3,3),\n",
228 | "        padding='same',\n",
229 | "        data_format='channels_first',\n",
230 | "    ))\n",
231 | "    model.add(BatchNormalization(axis=1))\n",
232 | "    model.add(Activation('relu'))\n",
233 | "    \n",
234 | "    # Fourth layer\n",
235 | "    model.add(Conv2D(\n",
236 | "        filters=params[\"NUMBER_OF_FILTERS_4\"],\n",
237 | "        strides=(1,1),\n",
238 | "        kernel_size=(3,3),\n",
239 | "        padding='same',\n",
240 | "        data_format='channels_first',\n",
241 | "    ))\n",
242 | "    model.add(Activation('relu'))\n",
243 | "    model.add(Conv2D(\n",
244 | "        filters=params[\"NUMBER_OF_FILTERS_4\"],\n",
245 | "        strides=(2,2),\n",
246 | "        kernel_size=(3,3),\n",
247 | "        padding='same',\n",
248 | "        data_format='channels_first',\n",
249 | "    ))\n",
250 | "    model.add(BatchNormalization(axis=1))\n",
251 | "    model.add(Activation('relu'))\n",
252 | "    \n",
253 | "    # Fifth layer: a single 3x3 convolution that reuses NUMBER_OF_FILTERS_4\n",
254 | "    model.add(Conv2D(\n",
255 | "        filters=params[\"NUMBER_OF_FILTERS_4\"],\n",
256 | "        strides=(1,1),\n",
257 | "        kernel_size=(3,3),\n",
258 | "        padding='same',\n",
259 | "        data_format='channels_first',\n",
260 | "    ))\n",
261 | "    model.add(Activation('relu'))\n",
262 | "    \n",
263 | "    # Output layers: flatten, dense, dropout, then a single sigmoid unit\n",
264 | "    model.add(Flatten())\n",
265 | "    model.add(Dense(128))\n",
266 | "    model.add(Dropout(params[\"DROPOUT_PERCENT\"]))\n",
267 | "    model.add(Dense(1))\n",
268 | "    model.add(Activation(\"sigmoid\"))\n",
269 | "    \n",
270 | "    return model"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "## Model Parameters\n",
278 | "---\n",
279 | "We have separated parameters into 2 buckets with the following definitions:\n",
280 | "- user_params: Information *about* the model for record keeping\n",
281 | "- model_params: Information *for* the model to consume"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": null,
287 | "metadata": {},
288 | "outputs": [],
289 | "source": [
290 | "user_params = {\n",
291 | " \"INITIALS\": \"cc\",\n",
292 | " \"MODEL_DESCRIPTION\": \"My first public model!\",\n",
293 | " \"VERSION\": \"1\"\n",
294 | "}\n",
295 | "\n",
296 | "model_params = {\n",
297 | " \"LEARNING_RATE\": 0.00014148226882681195,\n",
298 | " \"BATCH_SIZE\": 368,\n",
299 | " \"DROPOUT_PERCENT\": 0.4488113054975806,\n",
300 | " \"NUMBER_OF_FILTERS_1\": 25,\n",
301 | " \"NUMBER_OF_FILTERS_2\": 63,\n",
302 | " \"NUMBER_OF_FILTERS_3\": 119,\n",
303 | " \"NUMBER_OF_FILTERS_4\": 210, \n",
304 | " \"NUMBER_OF_EPOCHS\": 40,\n",
305 | "}"
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {},
311 | "source": [
312 | "## Model Experimentation\n",
313 | "---\n"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "MODEL_AMOUNT = 1\n",
323 | "\n",
324 | "for current_model_number in range(MODEL_AMOUNT):\n",
325 | "    \n",
326 | "    # Indicate and log model start\n",
327 | "    print(\"START MODEL SEARCH (model {} of {})\".format(current_model_number + 1, MODEL_AMOUNT))\n",
328 | "    start = process_time()\n",
329 | "    \n",
330 | "    # Randomize specific parameters if we are doing a search\n",
331 | "    # Feel free to add or change the current parameters\n",
332 | "    if MODEL_AMOUNT > 1:\n",
333 | "        model_params[\"LEARNING_RATE\"] = 10 ** np.random.uniform(-4, -2)\n",
334 | "        model_params[\"BATCH_SIZE\"] = 16 * np.random.randint(1, 96)\n",
335 | "        model_params[\"DROPOUT_PERCENT\"] = np.random.uniform(0.0, 0.6)\n",
336 | "        model_params[\"NUMBER_OF_FILTERS_1\"] = np.random.randint(4, 32)\n",
337 | "        model_params[\"NUMBER_OF_FILTERS_2\"] = np.random.randint(16, 64)\n",
338 | "        model_params[\"NUMBER_OF_FILTERS_3\"] = np.random.randint(32, 128)\n",
339 | "        model_params[\"NUMBER_OF_FILTERS_4\"] = np.random.randint(64, 256)\n",
340 | "    \n",
341 | "    # Build the model and catch if the model architecture is not valid\n",
342 | "    try:\n",
343 | "        model = build_model(X_train, Y_train, model_params)\n",
344 | "    except Exception as e:\n",
345 | "        print(\"That didn't work!\")\n",
346 | "        print(e)\n",
347 | "        continue\n",
348 | "    \n",
349 | "    # Create the specific model name\n",
350 | "    model_name = user_params[\"INITIALS\"] + \"_convolutional_\" + str(user_params[\"VERSION\"]) + str(current_model_number)\n",
351 | "    user_params[\"VERSION\"] = user_params[\"VERSION\"] + str(1)\n",
352 | "    \n",
353 | "    # Define an optimizer for the model\n",
354 | "    adam_optimizer = Adam(\n",
355 | "        lr=model_params[\"LEARNING_RATE\"], \n",
356 | "        beta_1=0.9, \n",
357 | "        beta_2=0.999, \n",
358 | "        epsilon=None, \n",
359 | "        decay=0.0\n",
360 | "    )\n",
361 | "    \n",
362 | "    # Compile the model\n",
363 | "    model.compile(\n",
364 | "        loss=\"binary_crossentropy\", \n",
365 | "        optimizer=adam_optimizer,\n",
366 | "        metrics=['accuracy']\n",
367 | "    )\n",
368 | "    \n",
369 | "    # Figure out where to save the model checkpoints\n",
370 | "    checkpoint_file = MODEL_PATH + \"mdl.hdf5\"\n",
371 | "    checkpointer = ModelCheckpoint(filepath=checkpoint_file, verbose=2, save_best_only=True)\n",
372 | "    \n",
373 | "    # Create an early stopping callback\n",
374 | "    early_stopping_callback = EarlyStopping(patience=5, min_delta=0.0005, verbose=2)\n",
375 | "    \n",
376 | "    # Actually train the model\n",
377 | "    print(model_params)\n",
378 | "    history = model.fit(\n",
379 | "        X_train,\n",
380 | "        Y_train,\n",
381 | "        batch_size=model_params[\"BATCH_SIZE\"],\n",
382 | "        epochs=model_params[\"NUMBER_OF_EPOCHS\"],\n",
383 | "        verbose=1,\n",
384 | "        validation_data=(X_valid, Y_valid),\n",
385 | "        callbacks=[checkpointer, early_stopping_callback]\n",
386 | "    )\n",
387 | "    \n",
388 | "    # Reload the best model\n",
389 | "    model = load_model(checkpoint_file)\n",
390 | "    \n",
391 | "    # Get final predictions for the model and write to a file\n",
392 | "    predictions = model.predict(X_test).flatten()\n",
393 | "    model_metrics = get_metrics(predictions, Y_test)\n",
394 | "    create_result_csv(user_params, model_params, model_metrics, file_name=RESULTS_PATH + MODEL_LOGGING_FILE)\n",
395 | "    \n",
396 | "    # Save the model to a unique location if the Pippin metric is better than the paper's\n",
397 | "    if model_metrics[\"PIPPIN_METRIC\"] < 0.202:\n",
398 | "        copyfile(checkpoint_file, MODEL_PATH + \"{}.hdf5\".format(model_name))\n",
399 | "    \n",
400 | "    # Plot the model history\n",
401 | "    plt.plot(history.history['loss'])\n",
402 | "    plt.plot(history.history['val_loss'])\n",
403 | "    plt.title('Training History')\n",
404 | "    plt.ylabel('Binary Cross Entropy Loss')\n",
405 | "    plt.xlabel('Epoch')\n",
406 | "    plt.xlim([0, len(history.history['loss'])])\n",
407 | "    plt.legend(['Training set', 'Validation set'], loc='upper right')\n",
408 | "    plt.show()\n",
409 | "    \n",
410 | "    # Reset plot to clean up extra lines\n",
411 | "    plt.clf()\n",
412 | "    \n",
413 | "    # Get some indication of process length\n",
414 | "    final = process_time()\n",
415 | "    print('FINISHED MODEL SEARCH. {} SECONDS.'.format(str(final-start)))"
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": null,
421 | "metadata": {},
422 | "outputs": [],
423 | "source": []
424 | }
425 | ],
426 | "metadata": {
427 | "kernelspec": {
428 | "display_name": "Environment (conda_tensorflow_p36)",
429 | "language": "python",
430 | "name": "conda_tensorflow_p36"
431 | },
432 | "language_info": {
433 | "codemirror_mode": {
434 | "name": "ipython",
435 | "version": 3
436 | },
437 | "file_extension": ".py",
438 | "mimetype": "text/x-python",
439 | "name": "python",
440 | "nbconvert_exporter": "python",
441 | "pygments_lexer": "ipython3",
442 | "version": "3.6.6"
443 | }
444 | },
445 | "nbformat": 4,
446 | "nbformat_minor": 2
447 | }
448 |
--------------------------------------------------------------------------------
/cnn-model/Type Ia Supernova Classifier - Convolutional Neural Network.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[ ]:
5 |
6 |
7 | from sklearn.model_selection import StratifiedShuffleSplit
8 | from keras.callbacks import ModelCheckpoint, EarlyStopping
9 | from keras.layers.normalization import BatchNormalization
10 | from keras.layers import MaxPooling2D, Flatten, Conv2D
11 | from keras.layers import Dense, Dropout, Activation
12 | from keras.models import Sequential
13 | from matplotlib import pyplot as plt
14 | from slackclient import SlackClient
15 | from keras.models import load_model
16 | from keras.optimizers import Adam
17 | from space_utils import *
18 |
19 | from keras import regularizers
20 | from time import process_time
21 | from shutil import copyfile
22 |
23 | import pandas as pd
24 | import numpy as np
25 |
26 | import pickle
27 | import random
28 | import os
29 |
30 | pd.options.display.max_columns = 45
31 |
32 |
33 | # ## Introduction
34 | # ---
35 | # Hi and hello! Welcome to the step-by-step guide on how to train a model to detect supernovae.
36 | #
37 | # Throughout this guide you will learn about the data that we used, the building of a model in Keras, and how we went about record keeping for our experiments.
38 | #
39 | # There is a separate file called space_utils.py that holds any functions that we wrote for our project.
40 |
41 | # ## Constants
42 | # ---
43 | # We find it best to define a set of constants at the beginning of the notebook for clarity.
44 |
45 | # In[ ]:
46 |
47 |
48 | HOME_PATH = "/home/ubuntu"
49 | DATA_PATH = "/home/ubuntu/data/"
50 | MODEL_PATH = "/home/ubuntu/model/"
51 | RESULTS_PATH = "/home/ubuntu/results/"
52 |
53 | ALL_DATA_FILE = "extra_small_all_object_data_in_dictionary_format.pkl"
54 | NORMALIZED_IMAGE_DATA_FILE = "extra_small_normalized_image_object_data_in_numpy_format.pkl"
55 |
56 | MODEL_LOGGING_FILE = "model_results.csv"
57 |
58 |
59 | # ## Data Loading
60 | # ---
61 | # We first have to load in the data to be used for model training.
62 | #
63 | # This consists of 2 main data files stored in the variables ALL_DATA_FILE and NORMALIZED_IMAGE_DATA_FILE.
64 | #
65 | # ALL_DATA_FILE: This holds any information that is relevant to an object observation. It is a dictionary
66 | # with 4 keys -- images, targets, file_paths, observation_numbers -- where each key holds a Numpy array. The indices of
67 | # each array are all properly aligned according to their respective objects (explained in the table).
68 | #
69 | # | X_normalized | X | Y | file_path | observation_number |
70 | # |---------------------|----------|----------|------------------|---------------------------|
71 | # | obj_0_X_normalized | obj_0_X | obj_0_Y | obj_0_file_path | obj_0_observation_number |
72 | # | obj_42_X_normalized | obj_42_X | obj_42_Y | obj_42_file_path | obj_42_observation_number |
73 | #
74 | # NORMALIZED_IMAGE_DATA_FILE: This is simply a Numpy array of photos ready to be fed into a model. They are normalized and the channels -- search image, template image, difference image -- are organized properly. The preparation of this data is in >>>FILL IN<<<.
75 |
76 | # In[ ]:
77 |
78 |
79 | all_data = pickle.load(open(DATA_PATH + ALL_DATA_FILE, "rb"))
80 | all_images_normalized = pickle.load(open(DATA_PATH + NORMALIZED_IMAGE_DATA_FILE, "rb"))
81 |
82 |
83 | # ## Data Splitting
84 | # ---
85 | # We have to split the data into 3 different sets: training, validation, and testing. Using the *split_space_data*
86 | # function we imported from *space_utils.py*, this is pretty straightforward.
87 | #
88 | # P.S. Sorry that each line is so long... We tried multiple ways of making this easier on the eyes but this makes
89 | # the most sense!
90 |
91 | # In[ ]:
92 |
93 |
94 | (X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_test, X_test_normal, Y_test, file_path_test, observation_number_test) = split_space_data(
95 | all_images_normalized,
96 | all_data["images"],
97 | all_data["targets"],
98 | all_data["file_paths"],
99 | all_data["observation_numbers"],
100 | 0.1
101 | )
102 |
103 |
104 | # In[ ]:
105 |
106 |
107 | (X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_valid, X_valid_normal, Y_valid, file_path_valid, observation_number_valid) = split_space_data(
108 | X_train,
109 | X_train_normal,
110 | Y_train,
111 | file_path_train,
112 | observation_number_train,
113 | 0.2
114 | )
115 |
116 |
117 | # ## Model Definition
118 | # ---
119 | # We define the model in a function just to keep things separated nicely. Feel free to change the model however
120 | # you like! Try things out :D
121 |
122 | # In[ ]:
123 |
124 |
125 | def build_model(X, Y, params):
126 |
127 |     # Figure out the data shape
128 |     input_shape = (X.shape[1], X.shape[2], X.shape[3])
129 |
130 |     # Define the model object to append layers to
131 |     model = Sequential()
132 |
133 |     # First layer: two 3x3 convolutions, the second with stride 2 to downsample
134 |     model.add(Conv2D(
135 |         filters=params["NUMBER_OF_FILTERS_1"],
136 |         kernel_size=(3,3),
137 |         strides=(1,1),
138 |         padding='same',
139 |         data_format='channels_first',
140 |         input_shape=input_shape
141 |     ))
142 |     model.add(Activation('relu'))
143 |     model.add(Conv2D(
144 |         filters=params["NUMBER_OF_FILTERS_1"],
145 |         kernel_size=(3,3),
146 |         strides=(2,2),
147 |         padding='same',
148 |         data_format='channels_first',
149 |         input_shape=input_shape
150 |     ))
151 |     model.add(BatchNormalization(axis=1))
152 |     model.add(Activation('relu'))
153 |
154 |     # Second layer
155 |     model.add(Conv2D(
156 |         filters=params["NUMBER_OF_FILTERS_2"],
157 |         strides=(1,1),
158 |         kernel_size=(3,3),
159 |         padding='same',
160 |         data_format='channels_first',
161 |     ))
162 |     model.add(Activation('relu'))
163 |     model.add(Conv2D(
164 |         filters=params["NUMBER_OF_FILTERS_2"],
165 |         strides=(2,2),
166 |         kernel_size=(3,3),
167 |         padding='same',
168 |         data_format='channels_first',
169 |     ))
170 |     model.add(BatchNormalization(axis=1))
171 |     model.add(Activation('relu'))
172 |
173 |     # Third layer
174 |     model.add(Conv2D(
175 |         filters=params["NUMBER_OF_FILTERS_3"],
176 |         strides=(1,1),
177 |         kernel_size=(3,3),
178 |         padding='same',
179 |         data_format='channels_first',
180 |     ))
181 |     model.add(Activation('relu'))
182 |     model.add(Conv2D(
183 |         filters=params["NUMBER_OF_FILTERS_3"],
184 |         strides=(2,2),
185 |         kernel_size=(3,3),
186 |         padding='same',
187 |         data_format='channels_first',
188 |     ))
189 |     model.add(BatchNormalization(axis=1))
190 |     model.add(Activation('relu'))
191 |
192 |     # Fourth layer
193 |     model.add(Conv2D(
194 |         filters=params["NUMBER_OF_FILTERS_4"],
195 |         strides=(1,1),
196 |         kernel_size=(3,3),
197 |         padding='same',
198 |         data_format='channels_first',
199 |     ))
200 |     model.add(Activation('relu'))
201 |     model.add(Conv2D(
202 |         filters=params["NUMBER_OF_FILTERS_4"],
203 |         strides=(2,2),
204 |         kernel_size=(3,3),
205 |         padding='same',
206 |         data_format='channels_first',
207 |     ))
208 |     model.add(BatchNormalization(axis=1))
209 |     model.add(Activation('relu'))
210 |
211 |     # Fifth layer: a single 3x3 convolution that reuses NUMBER_OF_FILTERS_4
212 |     model.add(Conv2D(
213 |         filters=params["NUMBER_OF_FILTERS_4"],
214 |         strides=(1,1),
215 |         kernel_size=(3,3),
216 |         padding='same',
217 |         data_format='channels_first',
218 |     ))
219 |     model.add(Activation('relu'))
220 |
221 |     # Output layers: flatten, dense, dropout, then a single sigmoid unit
222 |     model.add(Flatten())
223 |     model.add(Dense(128))
224 |     model.add(Dropout(params["DROPOUT_PERCENT"]))
225 |     model.add(Dense(1))
226 |     model.add(Activation("sigmoid"))
227 |
228 |     return model
229 |
230 |
231 | # ## Model Parameters
232 | # ---
233 | # We have separated parameters into 2 buckets with the following definitions:
234 | # - user_params: Information *about* the model for record keeping
235 | # - model_params: Information *for* the model to consume
236 |
237 | # In[ ]:
238 |
239 |
240 | user_params = {
241 | "INITIALS": "cc",
242 | "MODEL_DESCRIPTION": "My first public model!",
243 | "VERSION": "1"
244 | }
245 |
246 | model_params = {
247 | "LEARNING_RATE": 0.00014148226882681195,
248 | "BATCH_SIZE": 368,
249 | "DROPOUT_PERCENT": 0.4488113054975806,
250 | "NUMBER_OF_FILTERS_1": 25,
251 | "NUMBER_OF_FILTERS_2": 63,
252 | "NUMBER_OF_FILTERS_3": 119,
253 | "NUMBER_OF_FILTERS_4": 210,
254 | "NUMBER_OF_EPOCHS": 40,
255 | }
256 |
257 |
258 | # ## Model Experimentation
259 | # ---
260 | #
261 |
262 | # In[ ]:
263 |
264 |
265 | MODEL_AMOUNT = 1
266 |
267 | for current_model_number in range(MODEL_AMOUNT):
268 |
269 |     # Indicate and log model start
270 |     print("START MODEL SEARCH (model {} of {})".format(current_model_number + 1, MODEL_AMOUNT))
271 |     start = process_time()
272 |
273 |     # Randomize specific parameters if we are doing a search
274 |     # Feel free to add or change the current parameters
275 |     if MODEL_AMOUNT > 1:
276 |         model_params["LEARNING_RATE"] = 10 ** np.random.uniform(-4, -2)
277 |         model_params["BATCH_SIZE"] = 16 * np.random.randint(1, 96)
278 |         model_params["DROPOUT_PERCENT"] = np.random.uniform(0.0, 0.6)
279 |         model_params["NUMBER_OF_FILTERS_1"] = np.random.randint(4, 32)
280 |         model_params["NUMBER_OF_FILTERS_2"] = np.random.randint(16, 64)
281 |         model_params["NUMBER_OF_FILTERS_3"] = np.random.randint(32, 128)
282 |         model_params["NUMBER_OF_FILTERS_4"] = np.random.randint(64, 256)
283 |
284 |     # Build the model and catch if the model architecture is not valid
285 |     try:
286 |         model = build_model(X_train, Y_train, model_params)
287 |     except Exception as e:
288 |         print("That didn't work!")
289 |         print(e)
290 |         continue
291 |
292 |     # Create the specific model name
293 |     model_name = user_params["INITIALS"] + "_convolutional_" + str(user_params["VERSION"]) + str(current_model_number)
294 |     user_params["VERSION"] = user_params["VERSION"] + str(1)
295 |
296 |     # Define an optimizer for the model
297 |     adam_optimizer = Adam(
298 |         lr=model_params["LEARNING_RATE"],
299 |         beta_1=0.9,
300 |         beta_2=0.999,
301 |         epsilon=None,
302 |         decay=0.0
303 |     )
304 |
305 |     # Compile the model
306 |     model.compile(
307 |         loss="binary_crossentropy",
308 |         optimizer=adam_optimizer,
309 |         metrics=['accuracy']
310 |     )
311 |
312 |     # Figure out where to save the model checkpoints
313 |     checkpoint_file = MODEL_PATH + "mdl.hdf5"
314 |     checkpointer = ModelCheckpoint(filepath=checkpoint_file, verbose=2, save_best_only=True)
315 |
316 |     # Create an early stopping callback
317 |     early_stopping_callback = EarlyStopping(patience=5, min_delta=0.0005, verbose=2)
318 |
319 |     # Actually train the model
320 |     print(model_params)
321 |     history = model.fit(
322 |         X_train,
323 |         Y_train,
324 |         batch_size=model_params["BATCH_SIZE"],
325 |         epochs=model_params["NUMBER_OF_EPOCHS"],
326 |         verbose=1,
327 |         validation_data=(X_valid, Y_valid),
328 |         callbacks=[checkpointer, early_stopping_callback]
329 |     )
330 |
331 |     # Reload the best model
332 |     model = load_model(checkpoint_file)
333 |
334 |     # Get final predictions for the model and write to a file
335 |     predictions = model.predict(X_test).flatten()
336 |     model_metrics = get_metrics(predictions, Y_test)
337 |     create_result_csv(user_params, model_params, model_metrics, file_name=RESULTS_PATH + MODEL_LOGGING_FILE)
338 |
339 |     # Save the model to a unique location if the Pippin metric is better than the paper's
340 |     if model_metrics["PIPPIN_METRIC"] < 0.202:
341 |         copyfile(checkpoint_file, MODEL_PATH + "{}.hdf5".format(model_name))
342 |
343 |     # Plot the model history
344 |     plt.plot(history.history['loss'])
345 |     plt.plot(history.history['val_loss'])
346 |     plt.title('Training History')
347 |     plt.ylabel('Binary Cross Entropy Loss')
348 |     plt.xlabel('Epoch')
349 |     plt.xlim([0, len(history.history['loss'])])
350 |     plt.legend(['Training set', 'Validation set'], loc='upper right')
351 |     plt.show()
352 |
353 |     # Reset plot to clean up extra lines
354 |     plt.clf()
355 |
356 |     # Get some indication of process length
357 |     final = process_time()
358 |     print('FINISHED MODEL SEARCH. {} SECONDS.'.format(str(final-start)))
359 |
360 |
--------------------------------------------------------------------------------
/cnn-model/space_utils.py:
--------------------------------------------------------------------------------
1 | def split_space_data(
2 |     X_normalized,
3 |     X,
4 |     Y,
5 |     file_path,
6 |     observation_number,
7 |     test_size
8 | ):
9 |     '''Separate the data in a stratified way.
10 |
11 |     The function takes in a few different datasets, where the indices of each are aligned so that the
12 |     same index refers to the same object.
13 |
14 |     | X_normalized        | X        | Y        | file_path        | observation_number        |
15 |     |---------------------|----------|----------|------------------|---------------------------|
16 |     | obj_0_X_normalized  | obj_0_X  | obj_0_Y  | obj_0_file_path  | obj_0_observation_number  |
17 |     | obj_42_X_normalized | obj_42_X | obj_42_Y | obj_42_file_path | obj_42_observation_number |
18 |
19 |     It is important to make sure that the split data is stratified. Stratification means that if there
20 |     are multiple classes in our dataset, then when we split our data the classes keep a similar
21 |     balance to the one they had in the full dataset.
22 |
23 |     An example is: Our full dataset is 60% dogs and 40% cats. When we split our data into a training set
24 |     and a test set, each set is still made of 60% dogs and 40% cats (or as close to this split as possible).
25 |     '''
26 |     from sklearn.model_selection import StratifiedShuffleSplit
27 |
28 |     # Create the helper object
29 |     sss = StratifiedShuffleSplit(
30 |         n_splits=1,
31 |         test_size=test_size
32 |     )
33 |
34 |     # Generate the indices
35 |     train_index, test_index = next(sss.split(X_normalized, Y))
36 |
37 |     # Shuffle and split the data
38 |     X_normalized_train, X_normalized_test = X_normalized[train_index], X_normalized[test_index]
39 |     X_train, X_test = X[train_index], X[test_index]
40 |     Y_train, Y_test = Y[train_index], Y[test_index]
41 |     file_path_train, file_path_test = file_path[train_index], file_path[test_index]
42 |     observation_number_train, observation_number_test = observation_number[train_index], observation_number[test_index]
43 |
44 |     return (
45 |         X_normalized_train,
46 |         X_train,
47 |         Y_train,
48 |         file_path_train,
49 |         observation_number_train
50 |     ), (
51 |         X_normalized_test,
52 |         X_test,
53 |         Y_test,
54 |         file_path_test,
55 |         observation_number_test
56 |     )
57 |
58 |
59 | def metrics(outputs, labels, threshold=0.5):
60 |     '''Gets all metrics that we need for model comparison.
61 |
62 |     Throughout the paper they talk about 2 main metrics: False Positive Rate (FPR) and Missed Detection
63 |     Rate (MDR). We get these by counting the following (class 0 is the positive, "supernova" class here):
64 |
65 |     True Positive: The number of times we said a supernova EXISTS and it DID
66 |     False Positive: The number of times we said a supernova EXISTS and it DID NOT
67 |     True Negative: The number of times we said a supernova DID NOT EXIST and it DID NOT
68 |     False Negative: The number of times we said a supernova DID NOT EXIST and it DID
69 |     '''
70 |
71 |     # Set the predictions to either 0 or 1; class 0 is the positive class, so the cut sits at 1 - threshold
72 |     predictions = outputs >= (1 - threshold)
73 |
74 |     # Set the indices to either 0 or 1 based on the metric we are checking
75 |     true_positive_indices = (predictions == 0.) * (labels == 0)
76 |     false_positive_indices = (predictions == 0.) * (labels == 1)
77 |     true_negative_indices = (predictions == 1.) * (labels == 1)
78 |     false_negative_indices = (predictions == 1.) * (labels == 0)
79 |
80 |     # Get the total count for each metric we are checking
81 |     true_positive_count = true_positive_indices.sum()
82 |     false_positive_count = false_positive_indices.sum()
83 |     true_negative_count = true_negative_indices.sum()
84 |     false_negative_count = false_negative_indices.sum()
85 |
86 |     # Calculate and store the FPR and MDR in a dictionary for convenience
87 |     fpr_and_mdr = {
88 |         'MDR': false_negative_count / (true_positive_count + false_negative_count),
89 |         'FPR': false_positive_count / (true_negative_count + false_positive_count)
90 |     }
91 |
92 |     return fpr_and_mdr
93 |
94 |
95 | def get_metrics(outputs, labels, with_acc=True):
96 |     '''Get all metrics for all interesting thresholds.
97 |
98 |     In the paper there is a focus on 3 main thresholds -- 0.4, 0.5, and 0.6. We check all three.
99 |
100 |     To make sure we are all on the same page, a threshold is a boundary that dictates what decision the
101 |     model is making. A model's output is a float between 0 and 1, so if a model outputs 0.42 we have to
102 |     decide what that actually means. With a plain cut at 0.4, a 0.42 would be pushed to a 1, whereas a cut
103 |     at 0.5 would push it to a 0. Note that `metrics` applies the cut at (1 - threshold).
104 |     '''
105 |     import numpy as np
106 |
107 |     all_metrics = {}
108 |
109 |     # FPR and MDR 0.4
110 |     temp = metrics(outputs, labels, threshold=0.4)
111 |     all_metrics["FALSE_POSITIVE_RATE_4"] = temp["FPR"]
112 |     all_metrics["MISSED_DETECTION_RATE_4"] = temp["MDR"]
113 |
114 |     # FPR and MDR 0.5
115 |     temp = metrics(outputs, labels, threshold=0.5)
116 |     all_metrics["FALSE_POSITIVE_RATE_5"] = temp["FPR"]
117 |     all_metrics["MISSED_DETECTION_RATE_5"] = temp["MDR"]
118 |
119 |     # FPR and MDR 0.6
120 |     temp = metrics(outputs, labels, threshold=0.6)
121 |     all_metrics["FALSE_POSITIVE_RATE_6"] = temp["FPR"]
122 |     all_metrics["MISSED_DETECTION_RATE_6"] = temp["MDR"]
123 |
124 |     # Summed FPR and MDR
125 |     all_metrics["FALSE_POSITIVE_RATE"] = all_metrics["FALSE_POSITIVE_RATE_4"] + all_metrics["FALSE_POSITIVE_RATE_5"] + all_metrics["FALSE_POSITIVE_RATE_6"]
126 |     all_metrics["MISSED_DETECTION_RATE"] = all_metrics["MISSED_DETECTION_RATE_4"] + all_metrics["MISSED_DETECTION_RATE_5"] + all_metrics["MISSED_DETECTION_RATE_6"]
127 |
128 |     # The true sum
129 |     all_metrics["PIPPIN_METRIC"] = all_metrics["FALSE_POSITIVE_RATE"] + all_metrics["MISSED_DETECTION_RATE"]
130 |
131 |     # Accuracy
132 |     if with_acc:
133 |         predictions = np.around(outputs).astype(int)
134 |         all_metrics["ACCURACY"] = (predictions == labels).sum() / len(labels)
135 |
136 |     return all_metrics
137 |
138 |
139 | def create_result_csv(user_params, model_params, metrics, extra_dict=None, file_name="results.csv"):
140 |     '''Format information to be stored and write to a CSV on disk.
141 |
142 |     This function is used for record keeping of model experiments that have been run. The headers listed in
143 |     csv_header_order are the pieces of information that we care about when comparing models. Each row of the
144 |     CSV becomes a record for a specific experiment, which we can then sort and compare using Pandas DataFrames
145 |     or Excel.
146 |     '''
147 |     import pandas as pd
148 |     import os
149 |
150 |     # Define results file
151 |     results_file = file_name
152 |
153 |     # Dictionary to be turned into a CSV
154 |     csv_dict = {}
155 |
156 |     # Set the important columns in a set order
157 |     csv_header_order = [
158 |         "INITIALS",
159 |         "MODEL_DESCRIPTION",
160 |         "VERSION",
161 |         "FALSE_POSITIVE_RATE_4",
162 |         "MISSED_DETECTION_RATE_4",
163 |         "FALSE_POSITIVE_RATE_5",
164 |         "MISSED_DETECTION_RATE_5",
165 |         "FALSE_POSITIVE_RATE_6",
166 |         "MISSED_DETECTION_RATE_6",
167 |         "FALSE_POSITIVE_RATE",
168 |         "MISSED_DETECTION_RATE",
169 |         "PIPPIN_METRIC",
170 |         "ACCURACY",
171 |         "NUMBER_OF_FILTERS_1",
172 |         "NUMBER_OF_FILTERS_2",
173 |         "NUMBER_OF_FILTERS_3",
174 |         "NUMBER_OF_FILTERS_4",
175 |         "LEARNING_RATE",
176 |         "BATCH_SIZE",
177 |         "NUMBER_OF_EPOCHS",
178 |         "NUMBER_OF_FILTERS",
179 |         "POOL_SIZE",
180 |         "KERNAL_SIZE",
181 |         "NUMBER_OF_LAYERS",
182 |         "DROPOUT_PERCENT",
183 |         "FPR_ALPHA",
184 |         "MDR_ALPHA",
185 |         "DENSE_LAYER_SHAPES"
186 |     ]
187 |
188 |     # Loop through headers and create the dictionary
189 |     for header in csv_header_order:
190 |         # Check which dictionary holds the header (metrics first, then user params, then model params)
191 |         if header in metrics.keys():
192 |             csv_dict[header] = str(metrics[header])
193 |         elif header in user_params.keys():
194 |             csv_dict[header] = str(user_params[header])
195 |         elif header in model_params.keys():
196 |             csv_dict[header] = str(model_params[header])
197 |
198 |     # Turn the current data into a DataFrame
199 |     updated_df = pd.DataFrame(csv_dict, index=[0])
200 |
201 |     # Check if a CSV already exists so we can add the current experiment to the previously logged ones
202 |     if os.path.isfile(results_file):
203 |         df = pd.read_csv(results_file)
204 |         updated_df = pd.concat([df, updated_df])
205 |
206 |     # Write the CSV to disk
207 |     updated_df.to_csv(results_file, index=False)
208 |
209 |     return updated_df
210 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | asn1crypto==0.22.0
2 | astropy==2.0.3
3 | attrs==17.4.0
4 | bleach==1.5.0
5 | certifi==2017.11.5
6 | cffi==1.11.2
7 | chardet==3.0.4
8 | cryptography==2.1.4
9 | cycler==0.10.0
10 | decorator==4.2.1
11 | h5py==2.7.1
12 | html5lib==0.9999999
13 | idna==2.6
14 | ipykernel==4.8.0
15 | ipython==6.2.1
16 | ipython-genutils==0.2.0
17 | jedi==0.11.1
18 | jupyter-client==5.2.2
19 | jupyter-core==4.4.0
20 | Keras==2.1.3
21 | Markdown==2.6.9
22 | matplotlib==2.1.2
23 | numpy==1.13.3
24 | pandas==0.22.0
25 | parso==0.1.1
26 | pexpect==4.3.1
27 | pickleshare==0.7.4
28 | pluggy==0.6.0
29 | prompt-toolkit==1.0.15
30 | ptyprocess==0.5.2
31 | py==1.5.2
32 | pycparser==2.18
33 | Pygments==2.2.0
34 | pyOpenSSL==17.4.0
35 | pyparsing==2.2.0
36 | PySocks==1.6.8
37 | pytest==3.3.2
38 | python-dateutil==2.6.1
39 | pytz==2017.3
40 | PyYAML==3.12
41 | pyzmq==16.0.3
42 | requests==2.18.4
43 | scikit-learn==0.19.1
44 | scipy==1.0.0
45 | tensorflow==1.4.1
46 | xgboost==0.7.post3
--------------------------------------------------------------------------------
/xgboost-baseline/README.md:
--------------------------------------------------------------------------------
1 | ## XGBoost Comparison Model
2 |
3 | We have provided 2 files:
4 |
5 | 1. The IPython/Jupyter notebook file
6 | 2. The .py file exported from IPython/Jupyter
7 |
8 | This will give you a few options to play with. We recommend IPython/Jupyter notebooks as they are
9 | more user friendly than the terminal... but this is a README, we don't have control over you!
10 |
11 | Everything else should be explained in the code/notebook!
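
One thing worth seeing on a toy example is the label convention used by the notebook's `metrics` function: class 0 is treated as the positive class, and the cut is applied at `1 - threshold`. The sketch below re-implements that convention with plain NumPy. The function name `fpr_and_mdr` and the scores and labels are made up for illustration; they are not part of the notebook or the dataset.

```python
import numpy as np


def fpr_and_mdr(outputs, labels, threshold=0.5):
    """Toy version of the notebook's convention: label 0 is the positive class."""
    outputs = np.asarray(outputs, dtype=float)
    labels = np.asarray(labels)

    # Scores at or above (1 - threshold) become class 1, everything else class 0
    predictions = (outputs >= (1 - threshold)).astype(int)

    true_positive = np.sum((predictions == 0) & (labels == 0))
    false_positive = np.sum((predictions == 0) & (labels == 1))
    true_negative = np.sum((predictions == 1) & (labels == 1))
    false_negative = np.sum((predictions == 1) & (labels == 0))

    return {
        "MDR": false_negative / (true_positive + false_negative),
        "FPR": false_positive / (true_negative + false_positive),
    }


# Hypothetical model outputs and true labels (illustration only)
scores = np.array([0.10, 0.45, 0.55, 0.70, 0.30, 0.90])
labels = np.array([0, 0, 0, 1, 1, 1])

# Evaluate the same three thresholds the notebook uses
results = {t: fpr_and_mdr(scores, labels, threshold=t) for t in (0.4, 0.5, 0.6)}

for t, r in results.items():
    print("threshold {}: MDR={:.3f} FPR={:.3f}".format(t, r["MDR"], r["FPR"]))
```

With these made-up scores, the 0.4 threshold misses none of the class-0 objects (MDR of 0), while the 0.6 threshold misses two of the three. The FPR stays at one in three throughout, because the single misclassified class-1 score (0.30) sits below every cut.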
--------------------------------------------------------------------------------
/xgboost-baseline/XGBoost Comparison Model.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "from sklearn.model_selection import train_test_split\n",
10 | "\n",
11 | "import xgboost as xgb\n",
12 | "import pandas as pd\n",
13 | "import numpy as np\n",
14 | "\n",
15 | "import pickle\n",
16 | "import random\n",
17 | "\n",
18 | "pd.set_option(\"display.max_columns\", 999)\n",
19 | "\n",
20 | "np.random.seed(1)"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "## Let's get started!\n",
28 | "\n",
29 | "First we have to load in the data, this is the feature engineered data right from the paper. We have actually taken the extra step of formatting it really nicely for Python.\n",
30 | "\n",
31 | "Make sure to change the path to where you downloaded the data!"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 2,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "path_to_data = \"/Users/clifford-laptop/Documents/space2vec/data/engineered-data.pkl\"\n",
41 | "\n",
42 | "data = pickle.load(open(path_to_data, 'rb'))"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "## Next the column types\n",
50 | "\n",
51 | "Not all of this is necessary but we wanted to make sure that we explicitly state what each column type is. That way we can be sure that we don't include columns that shouldn't be in the training data."
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 3,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "targets = [\n",
61 | " \"OBJECT_TYPE\",\n",
62 | "]\n",
63 | "\n",
64 | "ids = [\n",
65 | " \"ID\",\n",
66 | "]\n",
67 | "\n",
68 | "continuous = [\n",
69 | " \"AMP\",\n",
70 | " \"A_IMAGE\",\n",
71 | " \"A_REF\",\n",
72 | " \"B_IMAGE\",\n",
73 | " \"B_REF\",\n",
74 | " \"COLMEDS\",\n",
75 | " \"DIFFSUMRN\",\n",
76 | " \"ELLIPTICITY\",\n",
77 | " \"FLUX_RATIO\",\n",
78 | " \"GAUSS\",\n",
79 | " \"GFLUX\",\n",
80 | " \"L1\",\n",
81 | " \"LACOSMIC\",\n",
82 | " \"MAG\",\n",
83 | " \"MAGDIFF\",\n",
84 | " \"MAG_FROM_LIMIT\",\n",
85 | " \"MAG_REF\",\n",
86 | " \"MAG_REF_ERR\",\n",
87 | " \"MASKFRAC\",\n",
88 | " \"MIN_DISTANCE_TO_EDGE_IN_NEW\",\n",
89 | " \"NN_DIST_RENORM\",\n",
90 | " \"SCALE\",\n",
91 | " \"SNR\",\n",
92 | " \"SPREADERR_MODEL\",\n",
93 | " \"SPREAD_MODEL\",\n",
94 | "]\n",
95 | "\n",
96 | "categorical = [\n",
97 | " \"BAND\",\n",
98 | " \"CCDID\",\n",
99 | " \"FLAGS\",\n",
100 | "]\n",
101 | "\n",
102 | "ordinal = [\n",
103 | " \"N2SIG3\",\n",
104 | " \"N2SIG3SHIFT\",\n",
105 | " \"N2SIG5\",\n",
106 | " \"N2SIG5SHIFT\",\n",
107 | " \"N3SIG3\",\n",
108 | " \"N3SIG3SHIFT\",\n",
109 | " \"N3SIG5\",\n",
110 | " \"N3SIG5SHIFT\",\n",
111 | " \"NUMNEGRN\",\n",
112 | "]\n",
113 | "\n",
114 | "booleans = [\n",
115 | " \"MAGLIM\",\n",
116 | "]"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "## One hot encode any categorical columns\n",
124 | "\n",
125 | "Here we do something called one hot encoding (https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).\n",
126 | "\n",
127 | "This is to turn any categorical columns into something that a machine learning model can understand. Let's say we have a column, maybe we call it BAND, and this column might have 4 different possible values:\n",
128 | "\n",
129 | "g, i, r, or z\n",
130 | "\n",
131 | "Well, we can't really shove these into our model, so we hit it with the \"one hot\"! The BAND column becomes 5 different columns:\n",
132 | "\n",
133 | "BAND_g, BAND_i, BAND_r, BAND_z, and BAND_nan\n",
134 | "\n",
135 | "Now, instead of a letter value, we have a binary representation with a 1 in its corresponding column and a 0 in the rest.\n",
136 | "\n",
137 | "The function is a bit interesting but it does exactly what we need!"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 4,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "data = pd.get_dummies(\n",
147 | " data, \n",
148 | " prefix = categorical, \n",
149 | " prefix_sep = '_',\n",
150 | " dummy_na = True, \n",
151 | " columns = categorical, \n",
152 | " sparse = False, \n",
153 | " drop_first = False\n",
154 | ")"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "## Split the inputs from the targets\n",
162 | "\n",
163 | "This is super important!\n",
164 | "\n",
165 | "We have to make sure we physically separate the targets (aka labels) from our model input. This is to give us peace of mind as we train.\n",
166 | "\n",
167 | "Obviously, the model should never train on our targets... That's like giving a student the exam answer sheet to study before the exam!"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 5,
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "target = data[targets]\n",
177 | "inputs = data.drop(columns = ids + targets)"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "## Shuffle and split the data\n",
185 | "\n",
186 | "Now we split the data again, this time into a training set and a validation set.\n",
187 | "\n",
188 | "This is comparable to having a bunch of practice questions before a test (the training set) and quiz questions (the validation set).\n",
189 | "\n",
190 | "**It's important to note that the model should never learn on the validation set!**\n",
191 | "\n",
192 | "We also shuffle the data to make sure we remove any possible patterns that could be happening within the data (not very likely to happen in this dataset but it doesn't hurt).\n",
193 | "\n",
194 | "Another **really** important point here is \"stratification\". That sounds fancy but it basically means that when we split the data, the distribution of the populations should be the same in the training and validation set as it was originally... That didn't help did it?\n",
195 | "\n",
196 | "Let's say that in the total dataset 50.5% of the population is supernova and the other 49.5% is not supernova. When we split the data into two subsets in a stratified way, both subsets should keep a very similar ratio of supernova to not-supernova (50.5% to 49.5%).\n",
197 | "\n",
198 | "This is getting way too long... Lastly I'll point out the **test_size = 0.2**. This simply means that 20% of the data is put into a validation set (leaving the other 80% as training data)."
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": 9,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "x_train, x_valid, y_train, y_valid = train_test_split(\n",
208 | " inputs, \n",
209 | " target, \n",
210 | " test_size = 0.2, \n",
211 | " random_state = 42,\n",
212 | "    stratify = target.values\n",
213 | ")"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "## Parameters!\n",
221 | "\n",
222 | "Alright, we won't get too into the specifics here but you can definitely check out the documentation (http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier).\n",
223 | "\n",
224 | "We just toyed around with the parameters to see what seemed to work the best.\n",
225 | "\n",
226 | "Once we get to the Convolutional Neural Network (CNN), the model we will more than likely use in the end, we will automate this parameter search.\n",
227 | "\n",
228 | "**The joy of this whole notebook thing is that you can run all of this! Try changing the parameters and see what happens!**"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 26,
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "params = {\n",
238 | " 'max_depth': 6,\n",
239 | " 'learning_rate': 0.1,\n",
240 | " 'silent': 1,\n",
241 | " 'objective': 'binary:logistic',\n",
243 | " 'n_estimators': 40,\n",
244 | " \"gamma\": 0,\n",
245 | " \"min_child_weight\": 1,\n",
246 | " \"max_delta_step\": 0, \n",
247 | " \"subsample\": 0.9, \n",
248 | " \"colsample_bytree\": 0.8, \n",
249 | " \"colsample_bylevel\": 0.9, \n",
250 | " \"reg_alpha\": 0, \n",
251 | " \"reg_lambda\": 1, \n",
252 | " \"scale_pos_weight\": 1, \n",
253 | " \"base_score\": 0.5, \n",
254 | " \"seed\": 23, \n",
255 | " \"nthread\": 4\n",
256 | "}"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "## *Rocky training montage*\n",
264 | "\n",
265 | "Now for the part where Rocky runs through the streets training for the big fight!\n",
266 | "\n",
267 | "Ahaha, oh the joys of modern programming! All we need to do is define the XGBClassifier and `.fit()`!\n",
268 | "\n",
269 | "As long as we pass in the data and the metrics that we want to define then we are good to go."
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": 27,
275 | "metadata": {},
276 | "outputs": [
277 | {
278 | "name": "stderr",
279 | "output_type": "stream",
280 | "text": [
281 | "/Users/clifford-laptop/anaconda2/envs/space2vec/lib/python3.6/site-packages/sklearn/preprocessing/label.py:95: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
282 | " y = column_or_1d(y, warn=True)\n",
283 | "/Users/clifford-laptop/anaconda2/envs/space2vec/lib/python3.6/site-packages/sklearn/preprocessing/label.py:128: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
284 | " y = column_or_1d(y, warn=True)\n"
285 | ]
286 | },
287 | {
288 | "name": "stdout",
289 | "output_type": "stream",
290 | "text": [
291 | "[0]\tvalidation_0-auc:0.967291\tvalidation_1-auc:0.966996\n",
292 | "[1]\tvalidation_0-auc:0.974637\tvalidation_1-auc:0.974235\n",
293 | "[2]\tvalidation_0-auc:0.982206\tvalidation_1-auc:0.981863\n",
294 | "[3]\tvalidation_0-auc:0.982742\tvalidation_1-auc:0.982427\n",
295 | "[4]\tvalidation_0-auc:0.98447\tvalidation_1-auc:0.984314\n",
296 | "[5]\tvalidation_0-auc:0.985039\tvalidation_1-auc:0.984842\n",
297 | "[6]\tvalidation_0-auc:0.985353\tvalidation_1-auc:0.985177\n",
298 | "[7]\tvalidation_0-auc:0.985621\tvalidation_1-auc:0.985435\n",
299 | "[8]\tvalidation_0-auc:0.985995\tvalidation_1-auc:0.985823\n",
300 | "[9]\tvalidation_0-auc:0.986418\tvalidation_1-auc:0.986239\n",
301 | "[10]\tvalidation_0-auc:0.986688\tvalidation_1-auc:0.986519\n",
302 | "[11]\tvalidation_0-auc:0.986884\tvalidation_1-auc:0.986712\n",
303 | "[12]\tvalidation_0-auc:0.987164\tvalidation_1-auc:0.986975\n",
304 | "[13]\tvalidation_0-auc:0.987417\tvalidation_1-auc:0.987218\n",
305 | "[14]\tvalidation_0-auc:0.987586\tvalidation_1-auc:0.987418\n",
306 | "[15]\tvalidation_0-auc:0.987908\tvalidation_1-auc:0.987705\n",
307 | "[16]\tvalidation_0-auc:0.988169\tvalidation_1-auc:0.987992\n",
308 | "[17]\tvalidation_0-auc:0.988351\tvalidation_1-auc:0.988176\n",
309 | "[18]\tvalidation_0-auc:0.988474\tvalidation_1-auc:0.988304\n",
310 | "[19]\tvalidation_0-auc:0.988711\tvalidation_1-auc:0.988529\n",
311 | "[20]\tvalidation_0-auc:0.988923\tvalidation_1-auc:0.988739\n",
312 | "[21]\tvalidation_0-auc:0.989098\tvalidation_1-auc:0.988922\n",
313 | "[22]\tvalidation_0-auc:0.989229\tvalidation_1-auc:0.989033\n",
314 | "[23]\tvalidation_0-auc:0.989479\tvalidation_1-auc:0.989271\n",
315 | "[24]\tvalidation_0-auc:0.989585\tvalidation_1-auc:0.989385\n",
316 | "[25]\tvalidation_0-auc:0.989726\tvalidation_1-auc:0.989511\n",
317 | "[26]\tvalidation_0-auc:0.98986\tvalidation_1-auc:0.98965\n",
318 | "[27]\tvalidation_0-auc:0.990075\tvalidation_1-auc:0.989816\n",
319 | "[28]\tvalidation_0-auc:0.990221\tvalidation_1-auc:0.989966\n",
320 | "[29]\tvalidation_0-auc:0.990338\tvalidation_1-auc:0.990079\n",
321 | "[30]\tvalidation_0-auc:0.990426\tvalidation_1-auc:0.990162\n",
322 | "[31]\tvalidation_0-auc:0.990536\tvalidation_1-auc:0.990268\n",
323 | "[32]\tvalidation_0-auc:0.990654\tvalidation_1-auc:0.990391\n",
324 | "[33]\tvalidation_0-auc:0.990745\tvalidation_1-auc:0.990473\n",
325 | "[34]\tvalidation_0-auc:0.990834\tvalidation_1-auc:0.99055\n",
326 | "[35]\tvalidation_0-auc:0.990964\tvalidation_1-auc:0.990658\n",
327 | "[36]\tvalidation_0-auc:0.99106\tvalidation_1-auc:0.990747\n",
328 | "[37]\tvalidation_0-auc:0.991139\tvalidation_1-auc:0.990819\n",
329 | "[38]\tvalidation_0-auc:0.991254\tvalidation_1-auc:0.99093\n",
330 | "[39]\tvalidation_0-auc:0.991371\tvalidation_1-auc:0.991034\n"
331 | ]
332 | },
333 | {
334 | "data": {
335 | "text/plain": [
336 | "XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.9,\n",
337 | " colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_delta_step=0,\n",
338 | " max_depth=6, min_child_weight=1, missing=None, n_estimators=40,\n",
339 | " n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,\n",
340 | " reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=23, silent=1,\n",
341 | " subsample=0.9)"
342 | ]
343 | },
344 | "execution_count": 27,
345 | "metadata": {},
346 | "output_type": "execute_result"
347 | }
348 | ],
349 | "source": [
350 | "bst = xgb.XGBClassifier(**params)\n",
351 | "\n",
352 | "bst.fit(\n",
353 | " x_train, \n",
354 | " y_train, \n",
355 | " eval_set = [(x_train, y_train), (x_valid, y_valid)], \n",
356 | " eval_metric = ['auc'], \n",
357 | " verbose = True\n",
358 | ")"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "## Define the rules of the ring\n",
366 | "\n",
367 | "The rules of the big finale were described within the paper: the Missed Detection Rate (MDR) and the False Positive Rate (FPR). We won't dive in here as they are covered in depth in our blog post, but the following is the coded version of the metrics."
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 31,
373 | "metadata": {},
374 | "outputs": [],
375 | "source": [
376 | "def metrics(outputs, labels, threshold=0.5):\n",
377 | "    predictions = outputs >= (1 - threshold)\n",
378 | "    true_positive_indices = (predictions == 0) * (labels == 0)\n",
379 | "    false_positive_indices = (predictions == 0) * (labels == 1)\n",
380 | "    true_negative_indices = (predictions == 1) * (labels == 1)\n",
381 | "    false_negative_indices = (predictions == 1) * (labels == 0)\n",
382 | "\n",
383 | "    true_positive_count = true_positive_indices.sum()\n",
384 | "    false_positive_count = false_positive_indices.sum()\n",
385 | "    true_negative_count = true_negative_indices.sum()\n",
386 | "    false_negative_count = false_negative_indices.sum()\n",
387 | "    \n",
388 | "    return {\n",
389 | "        # Missed detection rate\n",
390 | "        'MDR': false_negative_count / (true_positive_count + false_negative_count),\n",
391 | "        # False positive rate\n",
392 | "        'FPR': false_positive_count / (true_negative_count + false_positive_count)\n",
393 | "    }"
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "## Hiring the referee\n",
401 | "\n",
402 | "Great, now we have the rules for the big fight. But we also need someone (or something... or just a function) to take action on the rules.\n",
403 | "\n",
404 | "This is just a function that will run MDR and FPR on all 3 thresholds (0.4, 0.5, 0.6) and a few extras explained below:\n",
405 | "\n",
406 | "**FALSE_POSITIVE_RATE:** Is the sum of the FPR from all three thresholds, this helps us see how the models compare on a large scale.\n",
407 | "\n",
408 | "**MISSED_DETECTION_RATE:** Is the sum of the MDR from all three thresholds, this helps us see how the models compare on a large scale.\n",
409 | "\n",
410 | "**PIPPIN_METRIC:** Named after team member Pippin Lee, this is just **FALSE_POSITIVE_RATE** and **MISSED_DETECTION_RATE** summed to give us an even larger-scale view of how the models compare.\n",
411 | "\n",
412 | "**ACCURACY:** Simply the percentage of guesses that we got right."
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": 30,
418 | "metadata": {},
419 | "outputs": [],
420 | "source": [
421 | "def get_metrics(outputs, labels, with_acc=True):\n",
422 | "    \n",
423 | "    all_metrics = {}\n",
424 | "    \n",
425 | "    # FPR and MDR 0.4\n",
426 | "    temp = metrics(outputs, labels, threshold=0.4)\n",
427 | "    all_metrics[\"FALSE_POSITIVE_RATE_4\"] = temp[\"FPR\"]\n",
428 | "    all_metrics[\"MISSED_DETECTION_RATE_4\"] = temp[\"MDR\"]\n",
429 | "    \n",
430 | "    # FPR and MDR 0.5\n",
431 | "    temp = metrics(outputs, labels, threshold=0.5)\n",
432 | "    all_metrics[\"FALSE_POSITIVE_RATE_5\"] = temp[\"FPR\"]\n",
433 | "    all_metrics[\"MISSED_DETECTION_RATE_5\"] = temp[\"MDR\"]\n",
434 | "    \n",
435 | "    # FPR and MDR 0.6\n",
436 | "    temp = metrics(outputs, labels, threshold=0.6)\n",
437 | "    all_metrics[\"FALSE_POSITIVE_RATE_6\"] = temp[\"FPR\"]\n",
438 | "    all_metrics[\"MISSED_DETECTION_RATE_6\"] = temp[\"MDR\"]\n",
439 | "    \n",
440 | "    # Summed FPR and MDR\n",
441 | "    all_metrics[\"FALSE_POSITIVE_RATE\"] = all_metrics[\"FALSE_POSITIVE_RATE_4\"] + all_metrics[\"FALSE_POSITIVE_RATE_5\"] + all_metrics[\"FALSE_POSITIVE_RATE_6\"] \n",
442 | "    all_metrics[\"MISSED_DETECTION_RATE\"] = all_metrics[\"MISSED_DETECTION_RATE_4\"] + all_metrics[\"MISSED_DETECTION_RATE_5\"] + all_metrics[\"MISSED_DETECTION_RATE_6\"]\n",
443 | "    \n",
444 | "    # The true sum\n",
445 | "    all_metrics[\"PIPPIN_METRIC\"] = all_metrics[\"FALSE_POSITIVE_RATE\"] + all_metrics[\"MISSED_DETECTION_RATE\"]\n",
446 | "    \n",
447 | "    # Accuracy\n",
448 | "    if with_acc:\n",
449 | "        predictions = np.around(outputs).astype(int)\n",
450 | "        all_metrics[\"ACCURACY\"] = (predictions == labels).sum() / len(labels)\n",
451 | "    \n",
452 | "    return all_metrics"
453 | ]
454 | },
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {},
458 | "source": [
459 | "## The big fight!\n",
460 | "\n",
461 | "Our model has trained up in the modern day version of a classic cinematic training montage!\n",
462 | "\n",
463 | "We can finally give it the final challenge... which just happens to be feeding it more data rather than fighting its own inner demons in the manifestation of a boxer."
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": 36,
469 | "metadata": {},
470 | "outputs": [],
471 | "source": [
472 | "y_predictions = bst.predict_proba(x_valid)[:, 1:]"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "## To the judges!\n",
480 | "\n",
481 | "Our model has fought well and forced the match to decision. Only the judges can give us the final results!\n",
482 | "\n",
483 | "You can see that we use the metric functions defined above, passing in what the model guessed and what the actual results **should be**. We then do the math and see how our fighter did.\n",
484 | "\n",
485 | "We won't go through the comparison here since the article covers it in depth.\n",
486 | "\n",
487 | "(Teaser: it lost but actually did fairly well for how simple it is!)"
488 | ]
489 | },
490 | {
491 | "cell_type": "code",
492 | "execution_count": null,
493 | "metadata": {},
494 | "outputs": [],
495 | "source": [
496 | "all_metrics = get_metrics(y_predictions, y_valid)\n",
497 | "\n",
498 | "print(\"FPR (0.4): \" + str(all_metrics[\"FALSE_POSITIVE_RATE_4\"][0]))\n",
499 | "print(\"FPR (0.5): \" + str(all_metrics[\"FALSE_POSITIVE_RATE_5\"][0]))\n",
500 | "print(\"FPR (0.6): \" + str(all_metrics[\"FALSE_POSITIVE_RATE_6\"][0]))\n",
501 | "print(\"\")\n",
502 | "print(\"MDR (0.4): \" + str(all_metrics[\"MISSED_DETECTION_RATE_4\"][0]))\n",
503 | "print(\"MDR (0.5): \" + str(all_metrics[\"MISSED_DETECTION_RATE_5\"][0]))\n",
504 | "print(\"MDR (0.6): \" + str(all_metrics[\"MISSED_DETECTION_RATE_6\"][0]))\n",
505 | "print(\"\")\n",
506 | "print(\"SUMMED FPR: \" + str(all_metrics[\"FALSE_POSITIVE_RATE\"][0]))\n",
507 | "print(\"SUMMED MDR: \" + str(all_metrics[\"MISSED_DETECTION_RATE\"][0]))\n",
508 | "print(\"TOTAL SUM: \" + str(all_metrics[\"PIPPIN_METRIC\"][0]))\n",
509 | "print(\"\")\n",
510 | "print(\"ACCURACY: \" + str(all_metrics[\"ACCURACY\"][0]))"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": null,
516 | "metadata": {},
517 | "outputs": [],
518 | "source": []
519 | }
520 | ],
521 | "metadata": {
522 | "kernelspec": {
523 | "display_name": "Python [conda env:space2vec]",
524 | "language": "python",
525 | "name": "conda-env-space2vec-py"
526 | },
527 | "language_info": {
528 | "codemirror_mode": {
529 | "name": "ipython",
530 | "version": 3
531 | },
532 | "file_extension": ".py",
533 | "mimetype": "text/x-python",
534 | "name": "python",
535 | "nbconvert_exporter": "python",
536 | "pygments_lexer": "ipython3",
537 | "version": "3.6.4"
538 | }
539 | },
540 | "nbformat": 4,
541 | "nbformat_minor": 2
542 | }
543 |
--------------------------------------------------------------------------------
/xgboost-baseline/XGBoost Comparison Model.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # In[1]:
5 |
6 |
7 | from sklearn.model_selection import train_test_split
8 |
9 | import xgboost as xgb
10 | import pandas as pd
11 | import numpy as np
12 |
13 | import pickle
14 | import random
15 |
16 | pd.set_option("display.max_columns", 999)
17 |
18 | np.random.seed(1)
19 |
20 |
21 | # ## Let's get started!
22 | #
23 | # First we have to load in the data. This is the feature engineered data right from the paper, and we have taken the extra step of formatting it nicely for Python.
24 | #
25 | # Make sure to change the path to where you downloaded the data!
26 |
27 | # In[2]:
28 |
29 |
30 | path_to_data = "/Users/clifford-laptop/Documents/space2vec/data/engineered-data.pkl"
31 |
32 | data = pickle.load(open(path_to_data, 'rb'))
33 |
34 |
35 | # ## Next the column types
36 | #
37 | # Not all of this is necessary but we wanted to make sure that we explicitly state what each column type is. That way we can be sure that we don't include columns that shouldn't be in the training data.
38 |
39 | # In[3]:
40 |
41 |
42 | targets = [
43 |     "OBJECT_TYPE",
44 | ]
45 |
46 | ids = [
47 |     "ID",
48 | ]
49 |
50 | continuous = [
51 |     "AMP",
52 |     "A_IMAGE",
53 |     "A_REF",
54 |     "B_IMAGE",
55 |     "B_REF",
56 |     "COLMEDS",
57 |     "DIFFSUMRN",
58 |     "ELLIPTICITY",
59 |     "FLUX_RATIO",
60 |     "GAUSS",
61 |     "GFLUX",
62 |     "L1",
63 |     "LACOSMIC",
64 |     "MAG",
65 |     "MAGDIFF",
66 |     "MAG_FROM_LIMIT",
67 |     "MAG_REF",
68 |     "MAG_REF_ERR",
69 |     "MASKFRAC",
70 |     "MIN_DISTANCE_TO_EDGE_IN_NEW",
71 |     "NN_DIST_RENORM",
72 |     "SCALE",
73 |     "SNR",
74 |     "SPREADERR_MODEL",
75 |     "SPREAD_MODEL",
76 | ]
77 |
78 | categorical = [
79 |     "BAND",
80 |     "CCDID",
81 |     "FLAGS",
82 | ]
83 |
84 | ordinal = [
85 |     "N2SIG3",
86 |     "N2SIG3SHIFT",
87 |     "N2SIG5",
88 |     "N2SIG5SHIFT",
89 |     "N3SIG3",
90 |     "N3SIG3SHIFT",
91 |     "N3SIG5",
92 |     "N3SIG5SHIFT",
93 |     "NUMNEGRN",
94 | ]
95 |
96 | booleans = [
97 |     "MAGLIM",
98 | ]
99 |
100 |
101 | # ## One hot encode any categorical columns
102 | #
103 | # Here we do something called one hot encoding (https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).
104 | #
105 | # This turns any categorical columns into something that a machine learning model can understand. Let's say we have a column called BAND, and this column has 4 possible values:
106 | #
107 | # g, i, r, or z
108 | #
109 | # Well, we can't really shove letters into our model, so we hit it with the "one hot"! The BAND column becomes 5 different columns (one per value, plus one for missing values because we pass dummy_na = True):
110 | #
111 | # BAND_g, BAND_i, BAND_r, BAND_z, and BAND_nan
112 | #
113 | # Now, instead of a letter value, each row has a binary representation with a 1 in its corresponding column and a 0 in the rest.
114 | #
115 | # The function is a bit interesting but it does exactly what we need!
116 |
117 | # In[4]:
118 |
119 |
120 | data = pd.get_dummies(
121 |     data,
122 |     prefix = categorical,
123 |     prefix_sep = '_',
124 |     dummy_na = True,
125 |     columns = categorical,
126 |     sparse = False,
127 |     drop_first = False
128 | )
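
# A minimal sketch of what the call above does, run on a made-up five-row frame. The `toy_bands` and `toy_encoded` names and their values are illustrative assumptions, not rows from the autoscan data. With `dummy_na = True`, the four letters plus the missing value should become five indicator columns.

# In[ ]:


import numpy as np
import pandas as pd

# Hypothetical stand-in for the real BAND column, including one missing value.
toy_bands = pd.DataFrame({"BAND": ["g", "i", "r", "z", np.nan]})

toy_encoded = pd.get_dummies(
    toy_bands,
    prefix = ["BAND"],
    prefix_sep = '_',
    dummy_na = True,
    columns = ["BAND"]
)

# One column per letter plus BAND_nan; each row has exactly one "hot" entry.
print(list(toy_encoded.columns))
print(toy_encoded.astype(int).sum(axis = 1).tolist())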
129 |
130 |
131 | # ## Split the inputs from the targets
132 | #
133 | # This is super important!
134 | #
135 | # We have to make sure we physically separate the targets (aka labels) from our model input. This gives us peace of mind as we train.
136 | #
137 | # Obviously, the model should never train on our targets... That's like giving a student the exam answer sheet to study before the exam!
138 |
139 | # In[5]:
140 |
141 |
142 | target = data[targets]
143 | inputs = data.drop(columns = ids + targets)
144 |
145 |
146 | # ## Shuffle and split the data
147 | #
148 | # Now we split the data again, this time into a training set and a validation set.
149 | #
150 | # This is comparable to having a bunch of practice questions before a test (the training set) and quiz questions (the validation set).
151 | #
152 | # **It's important to note that the model should never learn on the validation set!**
153 | #
154 | # We also shuffle the data to make sure we remove any possible patterns that could be happening within the data (not very likely to happen in this dataset but it doesn't hurt).
155 | #
156 | # Another **really** important point here is "stratification". That sounds fancy but it basically means that when we split the data, the distribution of the populations should be the same in the training and validation set as it was originally... That didn't help did it?
157 | #
158 | # Let's say that in the total dataset 50.5% of the population is supernova and the other 49.5% is not. When we split the data into two subsets in a stratified way, both subsets should keep a very similar ratio of supernova to not-supernova (50.5% to 49.5%).
159 | #
160 | # This is getting way too long... Lastly I'll point out the **test_size = 0.2**. This simply means that 20% of the data is put into a validation set (leaving the other 80% as training data).
161 |
162 | # In[9]:
163 |
164 |
165 | x_train, x_valid, y_train, y_valid = train_test_split(
166 |     inputs,
167 |     target,
168 |     test_size = 0.2,
169 |     random_state = 42,
170 |     stratify = target.values  # DataFrame.as_matrix() was removed in pandas 1.0
171 | )
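
# To see stratification at work, here is a small sketch on synthetic labels. The `toy_` names and the 70/30 class mix are assumptions made up for illustration, not the real dataset. Both sides of the split should keep roughly the original class ratio.

# In[ ]:


import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 700 of class 0 and 300 of class 1.
toy_labels = np.array([0] * 700 + [1] * 300)
toy_inputs = np.arange(len(toy_labels)).reshape(-1, 1)

toy_x_train, toy_x_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_inputs,
    toy_labels,
    test_size = 0.2,
    random_state = 42,
    stratify = toy_labels
)

# Fraction of class 1 in the full set, the training set, and the validation set.
print(toy_labels.mean(), toy_y_train.mean(), toy_y_valid.mean())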
172 |
173 |
174 | # ## Parameters!
175 | #
176 | # Alright, we won't get too into the specifics here but you can definitely check out the documentation (http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier).
177 | #
178 | # We just toyed around with the parameters to see what seemed to work the best.
179 | #
180 | # Once we get to the Convolutional Neural Network (CNN), the model we will more than likely use in the end, we will automate this parameter search.
181 | #
182 | # **The joy of this whole notebook thing is that you can run all of this! Try changing the parameters and see what happens!**
183 |
184 | # In[26]:
185 |
186 |
187 | params = {
188 |     'max_depth': 6,
189 |     'learning_rate': 0.1,
190 |     'silent': 1,
191 |     'objective': 'binary:logistic',
193 |     'n_estimators': 40,
194 |     "gamma": 0,
195 |     "min_child_weight": 1,
196 |     "max_delta_step": 0,
197 |     "subsample": 0.9,
198 |     "colsample_bytree": 0.8,
199 |     "colsample_bylevel": 0.9,
200 |     "reg_alpha": 0,
201 |     "reg_lambda": 1,
202 |     "scale_pos_weight": 1,
203 |     "base_score": 0.5,
204 |     "seed": 23,
205 |     "nthread": 4
206 | }
207 |
208 |
209 | # ## *Rocky training montage*
210 | #
211 | # Now for the part where Rocky runs through the streets training for the big fight!
212 | #
213 | # Ahaha, oh the joys of modern programming! All we need to do is define the XGBClassifier and `.fit()`!
214 | #
215 | # As long as we pass in the data and the metrics that we want to track, we are good to go.
216 |
217 | # In[27]:
218 |
219 |
220 | bst = xgb.XGBClassifier(**params)
221 |
222 | bst.fit(
223 |     x_train,
224 |     y_train,
225 |     eval_set = [(x_train, y_train), (x_valid, y_valid)],
226 |     eval_metric = ['auc'],
227 |     verbose = True
228 | )
229 |
230 |
231 | # ## Define the rules of the ring
232 | #
233 | # The rules of the big finale were described within the paper: the Missed Detection Rate (MDR) and the False Positive Rate (FPR). We won't dive in here as they are covered in depth in our blog post, but the following is the coded version of the metrics.
234 |
235 | # In[31]:
236 |
237 |
238 | def metrics(outputs, labels, threshold=0.5):
239 |     predictions = outputs >= (1 - threshold)  # True = predicted class 1; label 0 is treated as the positive class
240 |     true_positive_indices = (predictions == 0) * (labels == 0)
241 |     false_positive_indices = (predictions == 0) * (labels == 1)
242 |     true_negative_indices = (predictions == 1) * (labels == 1)
243 |     false_negative_indices = (predictions == 1) * (labels == 0)
244 |
245 |     true_positive_count = true_positive_indices.sum()
246 |     false_positive_count = false_positive_indices.sum()
247 |     true_negative_count = true_negative_indices.sum()
248 |     false_negative_count = false_negative_indices.sum()
249 |
250 |     return {
251 |         # Missed detection rate
252 |         'MDR': false_negative_count / (true_positive_count + false_negative_count),
253 |         # False positive rate
254 |         'FPR': false_positive_count / (true_negative_count + false_positive_count)
255 |     }
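
# A worked example can make the convention above concrete. This is a sketch on five made-up scores (the `toy_` names and values are illustrative assumptions). It restates the same rule in plain NumPy: label 0 is treated as the positive class, and a score counts as a positive prediction when it falls below `1 - threshold`.

# In[ ]:


import numpy as np

toy_outputs = np.array([0.1, 0.3, 0.8, 0.9, 0.2])  # model score for class 1
toy_labels = np.array([0, 0, 0, 1, 1])             # 0 is the positive class here

toy_threshold = 0.5
toy_predicted_positive = toy_outputs < (1 - toy_threshold)

# Count each cell of the confusion matrix.
toy_tp = np.sum(toy_predicted_positive & (toy_labels == 0))
toy_fn = np.sum(~toy_predicted_positive & (toy_labels == 0))
toy_fp = np.sum(toy_predicted_positive & (toy_labels == 1))
toy_tn = np.sum(~toy_predicted_positive & (toy_labels == 1))

# MDR = missed positives / all positives; FPR = false alarms / all negatives.
toy_mdr = toy_fn / (toy_tp + toy_fn)
toy_fpr = toy_fp / (toy_fp + toy_tn)
print("MDR:", toy_mdr, "FPR:", toy_fpr)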
256 |
257 |
258 | # ## Hiring the referee
259 | #
260 | # Great, now we have the rules for the big fight. But we also need someone (or something... or just a function) to take action on the rules.
261 | #
262 | # This is just a function that will run MDR and FPR on all 3 thresholds (0.4, 0.5, 0.6) and a few extras explained below:
263 | #
264 | # **FALSE_POSITIVE_RATE:** The sum of the FPR from all three thresholds. This helps us see how the models compare across thresholds.
265 | #
266 | # **MISSED_DETECTION_RATE:** The sum of the MDR from all three thresholds. This helps us see how the models compare across thresholds.
267 | #
268 | # **PIPPIN_METRIC:** Named after team member Pippin Lee, this is just **FALSE_POSITIVE_RATE** and **MISSED_DETECTION_RATE** summed to give us a single number for comparing models (lower is better).
269 | #
270 | # **ACCURACY:** Simply the fraction of guesses that we got right.
271 |
272 | # In[30]:
273 |
274 |
275 | def get_metrics(outputs, labels, with_acc=True):
276 |
277 |     all_metrics = {}
278 |
279 |     # FPR and MDR 0.4
280 |     temp = metrics(outputs, labels, threshold=0.4)
281 |     all_metrics["FALSE_POSITIVE_RATE_4"] = temp["FPR"]
282 |     all_metrics["MISSED_DETECTION_RATE_4"] = temp["MDR"]
283 |
284 |     # FPR and MDR 0.5
285 |     temp = metrics(outputs, labels, threshold=0.5)
286 |     all_metrics["FALSE_POSITIVE_RATE_5"] = temp["FPR"]
287 |     all_metrics["MISSED_DETECTION_RATE_5"] = temp["MDR"]
288 |
289 |     # FPR and MDR 0.6
290 |     temp = metrics(outputs, labels, threshold=0.6)
291 |     all_metrics["FALSE_POSITIVE_RATE_6"] = temp["FPR"]
292 |     all_metrics["MISSED_DETECTION_RATE_6"] = temp["MDR"]
293 |
294 |     # Summed FPR and MDR
295 |     all_metrics["FALSE_POSITIVE_RATE"] = all_metrics["FALSE_POSITIVE_RATE_4"] + all_metrics["FALSE_POSITIVE_RATE_5"] + all_metrics["FALSE_POSITIVE_RATE_6"]
296 |     all_metrics["MISSED_DETECTION_RATE"] = all_metrics["MISSED_DETECTION_RATE_4"] + all_metrics["MISSED_DETECTION_RATE_5"] + all_metrics["MISSED_DETECTION_RATE_6"]
297 |
298 |     # The true sum
299 |     all_metrics["PIPPIN_METRIC"] = all_metrics["FALSE_POSITIVE_RATE"] + all_metrics["MISSED_DETECTION_RATE"]
300 |
301 |     # Accuracy
302 |     if with_acc:
303 |         predictions = np.around(outputs).astype(int)
304 |         all_metrics["ACCURACY"] = (predictions == labels).sum() / len(labels)
305 |
306 |     return all_metrics
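
# The summed scores can be sketched on toy numbers as well. Everything named `toy_` below is an assumption for illustration, and `toy_rates` is a hypothetical helper that restates the thresholding rule in plain NumPy (label 0 is the positive class). It is not a replacement for the functions above.

# In[ ]:


import numpy as np

def toy_rates(outputs, labels, threshold):
    # A score below (1 - threshold) counts as a positive prediction.
    predicted_positive = outputs < (1 - threshold)
    positives = labels == 0
    mdr = np.sum(~predicted_positive & positives) / np.sum(positives)
    fpr = np.sum(predicted_positive & ~positives) / np.sum(~positives)
    return mdr, fpr

toy_outputs = np.array([0.1, 0.45, 0.55, 0.9, 0.35, 0.7])  # model score for class 1
toy_labels = np.array([0, 0, 0, 1, 1, 1])

toy_summed_mdr = 0.0
toy_summed_fpr = 0.0
for toy_threshold in (0.4, 0.5, 0.6):
    toy_mdr, toy_fpr = toy_rates(toy_outputs, toy_labels, toy_threshold)
    toy_summed_mdr += toy_mdr
    toy_summed_fpr += toy_fpr

# The final number is the two summed error rates added together.
toy_total = toy_summed_mdr + toy_summed_fpr
print(toy_summed_mdr, toy_summed_fpr, toy_total)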
307 |
308 |
309 | # ## The big fight!
310 | #
311 | # Our model has trained up in the modern day version of a classic cinematic training montage!
312 | #
313 | # We can finally give it the final challenge... which just happens to be feeding it data it has never seen rather than fighting its own inner demons in the form of a boxer.
314 |
315 | # In[36]:
316 |
317 |
318 | y_predictions = bst.predict_proba(x_valid)[:, 1:]
319 |
320 |
321 | # ## To the judges!
322 | #
323 | # Our model has fought well and forced the match to decision. Only the judges can give us the final results!
324 | #
325 | # You can see that we use the metric functions defined above, passing in what the model guessed and what the actual results **should be**. We then do the math and see how our fighter did.
326 | #
327 | # We won't dig into the comparison here since we cover it in depth in the article.
328 | #
329 | # (Teaser: it lost but actually did fairly well for how simple it is!)
330 |
331 | # In[ ]:
332 |
333 |
334 | all_metrics = get_metrics(y_predictions, y_valid)
335 |
336 | print("FPR (0.4): " + str(all_metrics["FALSE_POSITIVE_RATE_4"][0]))
337 | print("FPR (0.5): " + str(all_metrics["FALSE_POSITIVE_RATE_5"][0]))
338 | print("FPR (0.6): " + str(all_metrics["FALSE_POSITIVE_RATE_6"][0]))
339 | print("")
340 | print("MDR (0.4): " + str(all_metrics["MISSED_DETECTION_RATE_4"][0]))
341 | print("MDR (0.5): " + str(all_metrics["MISSED_DETECTION_RATE_5"][0]))
342 | print("MDR (0.6): " + str(all_metrics["MISSED_DETECTION_RATE_6"][0]))
343 | print("")
344 | print("SUMMED FPR: " + str(all_metrics["FALSE_POSITIVE_RATE"][0]))
345 | print("SUMMED MDR: " + str(all_metrics["MISSED_DETECTION_RATE"][0]))
346 | print("TOTAL SUM: " + str(all_metrics["PIPPIN_METRIC"][0]))
347 | print("")
348 | print("ACCURACY: " + str(all_metrics["ACCURACY"][0]))
349 |
350 |
--------------------------------------------------------------------------------