├── README.md ├── cnn-model ├── LICENSE ├── README.md ├── Type Ia Supernova Classifier - Convolutional Neural Network.ipynb ├── Type Ia Supernova Classifier - Convolutional Neural Network.py └── space_utils.py ├── requirements.txt └── xgboost-baseline ├── README.md ├── XGBoost Comparison Model.ipynb └── XGBoost Comparison Model.py /README.md: -------------------------------------------------------------------------------- 1 | ## space2vec: Model Code 2 | 3 | Check out the posts here: [space2vec.com](http://space2vec.com) 4 | 5 | The project behind the code is discussed in detail throughout the blog posts. But this 6 | is where the cool code stuff happens! 7 | 8 | 9 | ## Data 10 | You can find the feature-engineered CSV on the autoscan project site (under the "Features" heading) here: 11 | [http://portal.nersc.gov/project/dessn/autoscan/](http://portal.nersc.gov/project/dessn/autoscan/) 12 | 13 | 14 | ## Environment 15 | We have supplied a requirements.txt file which you can use to set up the right environment. This was made for Python 3.6, 16 | so if you are getting errors about missing versions or something similar, try removing the "==" and everything after it for that 17 | library in the requirements.txt and running the install again. 
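If only one or two libraries fail to install, the unpinning step described above can be scripted instead of done by hand. The following is a minimal sketch and not part of this repository: `strip_version_pins` is a hypothetical helper, and the package names and versions in the example are made up for illustration rather than taken from the actual requirements.txt.

```python
def strip_version_pins(requirements_text, packages=None):
    """Drop '==<version>' pins from requirements.txt-style text.

    packages: optional collection of lowercase package names to unpin;
    when None, every pinned line is unpinned.
    """
    unpinned_lines = []
    for line in requirements_text.splitlines():
        stripped = line.strip()
        # Leave blank lines and comments untouched
        if not stripped or stripped.startswith("#"):
            unpinned_lines.append(line)
            continue
        name, separator, _version = stripped.partition("==")
        if separator and (packages is None or name.strip().lower() in packages):
            # Keep only the package name so pip picks an available version
            unpinned_lines.append(name.strip())
        else:
            unpinned_lines.append(line)
    return "\n".join(unpinned_lines)


# Hypothetical contents; the real requirements.txt may list different packages
example = "numpy==1.14.0\nkeras==2.1.5\n# plotting\nmatplotlib==2.2.2"
print(strip_version_pins(example, packages={"keras"}))
```

Unpinning only the libraries that actually fail keeps the rest of the environment as close as possible to the one the code was written against. Write the result back to requirements.txt and rerun the install afterwards.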
18 | 19 | 20 | ## Posts 21 | 22 | ### Week 2: building a baseline model 23 | See [/xgboost-baseline](https://github.com/pippinlee/space2vec-ml-code/tree/master/building-baseline-model) for code 24 | 25 | We pickled the feature engineered data for our above model, you can find the data here: 26 | [https://drive.google.com/open?id=1Pa4-imVbK7yfZuCX3mfF-mMae1eyhQqo](https://drive.google.com/open?id=1Pa4-imVbK7yfZuCX3mfF-mMae1eyhQqo) 27 | 28 | 29 | 30 | ###### Maintained by Pippin Lee (p.lee@dessa.com) and Cole Clifford (c.clifford@dessa.com) 31 | -------------------------------------------------------------------------------- /cnn-model/LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) Pippin Lee, Jinnah Ali-Clarke, Cole Clifford. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
21 | -------------------------------------------------------------------------------- /cnn-model/README.md: -------------------------------------------------------------------------------- 1 | ## CNN Model 2 | 3 | We have provided 3 files: 4 | 5 | 1. The iPython/Jupyter notebook file (`Type Ia Supernova Classifier - Convolutional Neural Network.ipynb`) 6 | 2. The .py file exported from iPython/Jupyter (`Type Ia Supernova Classifier - Convolutional Neural Network.py`) 7 | 3. Functions that are used in the other 2 files (`space_utils.py`) 8 | 9 | For this specific model, we strongly recommend the iPython/Jupyter notebook file. The code 10 | explanation is a lot nicer in the notebook interface, so it will be easier to learn what is going on! 11 | 12 | There are 2 main data files that are used in the code: 13 | 14 | 1. all_object_data_in_dictionary_format.pkl 15 | 2. normalized_image_object_data_in_numpy_format.pkl 16 | 17 | The description of what each one does is in the code. 18 | 19 | Each file comes in 3 different sizes, with the links below: 20 | 21 | | Filename | S3 Link | File Size | 22 | |--------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|-----------| 23 | | all_object_data_in_dictionary_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/all_object_data_in_dictionary_format.pkl | 6.7GB | 24 | | normalized_image_object_data_in_numpy_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/normalized_image_object_data_in_numpy_format.pkl | 13.0GB | 25 | | small_all_object_data_in_dictionary_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/small_all_object_data_in_dictionary_format.pkl | 772.0MB | 26 | | small_normalized_image_object_data_in_numpy_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/small_normalized_image_object_data_in_numpy_format.pkl | 1.5GB | 27 | | 
extra_small_all_object_data_in_dictionary_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/extra_small_all_object_data_in_dictionary_format.pkl | 386.0MB | 28 | | extra_small_normalized_image_object_data_in_numpy_format.pkl | https://s3.amazonaws.com/space2vec-public/post3/extra_small_normalized_image_object_data_in_numpy_format.pkl | 744.2MB | 29 | 30 | You can pick any of the links from that table and use `wget ` to download the data. 31 | 32 | ## License 33 | 34 | `space2vec-ml-code` is available under the MIT license. See the LICENSE file for more info. 35 | 36 | Copyright 2018 Pippin Lee, Jinnah Ali-Clarke, Cole Clifford. 37 | -------------------------------------------------------------------------------- /cnn-model/Type Ia Supernova Classifier - Convolutional Neural Network.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from sklearn.model_selection import StratifiedShuffleSplit\n", 10 | "from keras.callbacks import ModelCheckpoint, EarlyStopping\n", 11 | "from keras.layers.normalization import BatchNormalization\n", 12 | "from keras.layers import MaxPooling2D, Flatten, Conv2D\n", 13 | "from keras.layers import Dense, Dropout, Activation\n", 14 | "from keras.models import Sequential\n", 15 | "from matplotlib import pyplot as plt\n", 16 | "from slackclient import SlackClient\n", 17 | "from keras.models import load_model\n", 18 | "from keras.optimizers import Adam\n", 19 | "from space_utils import *\n", 20 | "\n", 21 | "from keras import regularizers\n", 22 | "from time import process_time\n", 23 | "from shutil import copyfile\n", 24 | "\n", 25 | "import pandas as pd\n", 26 | "import numpy as np\n", 27 | "\n", 28 | "import pickle\n", 29 | "import random\n", 30 | "import os\n", 31 | "\n", 32 | "pd.options.display.max_columns = 45" 33 | ] 34 | }, 35 | { 36 | "cell_type": 
"markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Introduction\n", 40 | "---\n", 41 | "Hi and hello! Welcome to the step-by-step guide of how to train a model to detect supernova.\n", 42 | "\n", 43 | "Throughout this guide you will learn about the data that we used, the building of a model in Keras, and how we went about record keeping for our experiments.\n", 44 | "\n", 45 | "There is a seperate file called utils.py that holds any functions that we wrote for our project." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Constants\n", 53 | "---\n", 54 | "We find it best to define a set of constants at the beginning of the notebook for clarity." 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "HOME_PATH = \"/home/ubuntu\"\n", 64 | "DATA_PATH = \"/home/ubuntu/data/\"\n", 65 | "MODEL_PATH = \"/home/ubuntu/model/\"\n", 66 | "RESULTS_PATH = \"/home/ubuntu/results/\"\n", 67 | "\n", 68 | "ALL_DATA_FILE = \"extra_small_all_object_data_in_dictionary_format.pkl\"\n", 69 | "NORMALIZED_IMAGE_DATA_FILE = \"extra_small_normalized_image_object_data_in_numpy_format.pkl\"\n", 70 | "\n", 71 | "MODEL_LOGGING_FILE = \"model_results.csv\"" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "## Data Loading\n", 79 | "---\n", 80 | "We first have to load in the data to be used for model training.\n", 81 | "\n", 82 | "This consists of 2 main data files stored in the variables ALL_DATA_FILE and NORMALIZED_IMAGE_DATA_FILE.\n", 83 | "\n", 84 | "ALL_DATA_FILE: We have any information that will be relevent to an object observation in here. This is a dictionary\n", 85 | "with 4 keys -- images, targets, file_paths, observation_numbers -- where each key holds a Numpy array. 
The indices of\n", 86 | "each array are all properly aligned according to their respective objects (explained in the table).\n", 87 | "\n", 88 | "| X_normalized | X | Y | file_path | observation_number |\n", 89 | "|---------------------|----------|----------|------------------|---------------------------|\n", 90 | "| obj_0_X_normalized | obj_0_X | obj_0_Y | obj_0_file_path | obj_0_observation_number |\n", 91 | "| obj_42_X_normalized | obj_42_X | obj_42_Y | obj_42_file_path | obj_42_observation_number |\n", 92 | "\n", 93 | "NORMALIZED_IMAGE_DATA_FILE: This is simply a NumPy array of images ready to be fed into a model. They are normalized and the channels -- search image, template image, difference image -- are organized properly. The preparation of this data is in >>>FILL IN<<<." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "all_data = pickle.load(open(DATA_PATH + ALL_DATA_FILE, \"rb\"))\n", 103 | "all_images_normalized = pickle.load(open(DATA_PATH + NORMALIZED_IMAGE_DATA_FILE, \"rb\"))" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "## Data Splitting\n", 111 | "---\n", 112 | "We have to split the data into 3 different sets: training, validation, and testing. Using the *split_space_data*\n", 113 | "function we imported from *space_utils.py*, this is pretty straightforward.\n", 114 | "\n", 115 | "P.S. Sorry that each line is so long... We tried multiple ways of making this easier on the eyes but this makes\n", 116 | "the most sense!" 
117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "(X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_test, X_test_normal, Y_test, file_path_test, observation_number_test) = split_space_data(\n", 126 | " all_images_normalized, \n", 127 | " all_data[\"images\"],\n", 128 | " all_data[\"targets\"], \n", 129 | " all_data[\"file_paths\"], \n", 130 | " all_data[\"observation_numbers\"], \n", 131 | " 0.1\n", 132 | ")" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "(X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_valid, X_valid_normal, Y_valid, file_path_valid, observation_number_valid) = split_space_data(\n", 142 | " X_train,\n", 143 | " X_train_normal,\n", 144 | " Y_train,\n", 145 | " file_path_train,\n", 146 | " observation_number_train,\n", 147 | " 0.2\n", 148 | ")" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "## Model Definition\n", 156 | "---\n", 157 | "We define the model in a function just to keep things separated nicely. Feel free to change the model however\n", 158 | " you like! 
Try things out :D " 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "def build_model(X, Y, params):\n", 168 | " \n", 169 | " # Figure out the data shape\n", 170 | " input_shape = (X.shape[1], X.shape[2], X.shape[3])\n", 171 | " \n", 172 | " # Define the model object to append layers to\n", 173 | " model = Sequential()\n", 174 | " \n", 175 | " # Add first layer\n", 176 | " model.add(Conv2D(\n", 177 | " filters=params[\"NUMBER_OF_FILTERS_1\"],\n", 178 | " kernel_size=(3,3),\n", 179 | " strides=(1,1),\n", 180 | " border_mode='same',\n", 181 | " data_format='channels_first',\n", 182 | " input_shape=input_shape\n", 183 | " ))\n", 184 | " model.add(Activation('relu'))\n", 185 | " model.add(Conv2D(\n", 186 | " filters=params[\"NUMBER_OF_FILTERS_1\"],\n", 187 | " kernel_size=(3,3),\n", 188 | " strides=(2,2),\n", 189 | " border_mode='same',\n", 190 | " data_format='channels_first',\n", 191 | " input_shape=input_shape\n", 192 | " ))\n", 193 | " model.add(BatchNormalization(axis=1))\n", 194 | " model.add(Activation('relu'))\n", 195 | " \n", 196 | " # Second layer\n", 197 | " model.add(Conv2D(\n", 198 | " filters=params[\"NUMBER_OF_FILTERS_2\"],\n", 199 | " strides=(1,1),\n", 200 | " kernel_size=(3,3),\n", 201 | " border_mode='same',\n", 202 | " data_format='channels_first',\n", 203 | " ))\n", 204 | " model.add(Activation('relu'))\n", 205 | " model.add(Conv2D(\n", 206 | " filters=params[\"NUMBER_OF_FILTERS_2\"],\n", 207 | " strides=(2,2),\n", 208 | " kernel_size=(3,3),\n", 209 | " border_mode='same',\n", 210 | " data_format='channels_first',\n", 211 | " ))\n", 212 | " model.add(BatchNormalization(axis=1))\n", 213 | " model.add(Activation('relu'))\n", 214 | " \n", 215 | " # Third layer\n", 216 | " model.add(Conv2D(\n", 217 | " filters=params[\"NUMBER_OF_FILTERS_3\"],\n", 218 | " strides=(1,1),\n", 219 | " kernel_size=(3,3),\n", 220 | " border_mode='same',\n", 221 | " 
 data_format='channels_first',\n", 222 | " ))\n", 223 | " model.add(Activation('relu'))\n", 224 | " model.add(Conv2D(\n", 225 | " filters=params[\"NUMBER_OF_FILTERS_3\"],\n", 226 | " strides=(2,2),\n", 227 | " kernel_size=(3,3),\n", 228 | " border_mode='same',\n", 229 | " data_format='channels_first',\n", 230 | " ))\n", 231 | " model.add(BatchNormalization(axis=1))\n", 232 | " model.add(Activation('relu'))\n", 233 | " \n", 234 | " # Fourth layer\n", 235 | " model.add(Conv2D(\n", 236 | " filters=params[\"NUMBER_OF_FILTERS_4\"],\n", 237 | " strides=(1,1),\n", 238 | " kernel_size=(3,3),\n", 239 | " border_mode='same',\n", 240 | " data_format='channels_first',\n", 241 | " ))\n", 242 | " model.add(Activation('relu'))\n", 243 | " model.add(Conv2D(\n", 244 | " filters=params[\"NUMBER_OF_FILTERS_4\"],\n", 245 | " strides=(2,2),\n", 246 | " kernel_size=(3,3),\n", 247 | " border_mode='same',\n", 248 | " data_format='channels_first',\n", 249 | " ))\n", 250 | " model.add(BatchNormalization(axis=1))\n", 251 | " model.add(Activation('relu'))\n", 252 | " \n", 253 | " # Fifth layer\n", 254 | " model.add(Conv2D(\n", 255 | " filters=params[\"NUMBER_OF_FILTERS_4\"],\n", 256 | " strides=(1,1),\n", 257 | " kernel_size=(3,3),\n", 258 | " border_mode='same',\n", 259 | " data_format='channels_first',\n", 260 | " ))\n", 261 | " model.add(Activation('relu'))\n", 262 | " \n", 263 | " # Output layers\n", 264 | " model.add(Flatten())\n", 265 | " model.add(Dense(128))\n", 266 | " model.add(Dropout(params[\"DROPOUT_PERCENT\"]))\n", 267 | " model.add(Dense(1))\n", 268 | " model.add(Activation(\"sigmoid\"))\n", 269 | " \n", 270 | " return model" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "## Model Parameters\n", 278 | "---\n", 279 | "We have separated parameters into 2 buckets with the following definitions:\n", 280 | "- user_params: Information *about* the model for record keeping\n", 281 | "- model_params: Information *for* the model to 
consume" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "user_params = {\n", 291 | " \"INITIALS\": \"cc\",\n", 292 | " \"MODEL_DESCRIPTION\": \"My first public model!\",\n", 293 | " \"VERSION\": \"1\"\n", 294 | "}\n", 295 | "\n", 296 | "model_params = {\n", 297 | " \"LEARNING_RATE\": 0.00014148226882681195,\n", 298 | " \"BATCH_SIZE\": 368,\n", 299 | " \"DROPOUT_PERCENT\": 0.4488113054975806,\n", 300 | " \"NUMBER_OF_FILTERS_1\": 25,\n", 301 | " \"NUMBER_OF_FILTERS_2\": 63,\n", 302 | " \"NUMBER_OF_FILTERS_3\": 119,\n", 303 | " \"NUMBER_OF_FILTERS_4\": 210, \n", 304 | " \"NUMBER_OF_EPOCHS\": 40,\n", 305 | "}" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "## Model Experimentation\n", 313 | "---\n" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "MODEL_AMOUNT = 1\n", 323 | "\n", 324 | "for current_model_number in range(MODEL_AMOUNT):\n", 325 | " \n", 326 | " # Indicate and log model start\n", 327 | " print(\"START MODEL SEARCH (model {} of {})\".format(current_model_number, MODEL_AMOUNT))\n", 328 | " start = process_time()\n", 329 | " \n", 330 | " # Randomize specific parameters if we are doing a search\n", 331 | " # Feel free to add or change the current parameters\n", 332 | " if MODEL_AMOUNT > 1:\n", 333 | " model_params[\"LEARNING_RATE\"] = 10 ** np.random.uniform(-4, -2)\n", 334 | " model_params[\"BATCH_SIZE\"] = 16 * np.random.randint(1, 96)\n", 335 | " model_params[\"DROPOUT_PERCENT\"] = np.random.uniform(0.0, 0.6)\n", 336 | " model_params[\"NUMBER_OF_FILTERS_1\"] = np.random.randint(4, 32)\n", 337 | " model_params[\"NUMBER_OF_FILTERS_2\"] = np.random.randint(16, 64)\n", 338 | " model_params[\"NUMBER_OF_FILTERS_3\"] = np.random.randint(32, 128)\n", 339 | " model_params[\"NUMBER_OF_FILTERS_4\"] = np.random.randint(64, 256) \n", 340 | " \n", 341 | " 
# Build the model and catch if the model architecture is not valid\n", 342 | " try:\n", 343 | " model = build_model(X_train, Y_train, model_params)\n", 344 | " except Exception as e:\n", 345 | " print(\"That didn't work!\")\n", 346 | " print(e)\n", 347 | " continue\n", 348 | " \n", 349 | " # Create the specific model name\n", 350 | " model_name = user_params[\"INITIALS\"] + \"_convolutional_\" + str(user_params[\"VERSION\"]) + str(current_model_number)\n", 351 | " user_params[\"VERSION\"] = user_params[\"VERSION\"] + str(1)\n", 352 | " \n", 353 | " # Define an optimizer for the model\n", 354 | " adam_optimizer = Adam(\n", 355 | " lr=model_params[\"LEARNING_RATE\"], \n", 356 | " beta_1=0.9, \n", 357 | " beta_2=0.999, \n", 358 | " epsilon=None, \n", 359 | " decay=0.0\n", 360 | " )\n", 361 | " \n", 362 | " # Compile the model\n", 363 | " model.compile(\n", 364 | " loss=\"binary_crossentropy\", \n", 365 | " optimizer=adam_optimizer,\n", 366 | " metrics=['accuracy']\n", 367 | " )\n", 368 | " \n", 369 | " # Figure out where to save the model checkpoints\n", 370 | " checkpoint_file = MODEL_PATH + \"mdl.hdf5\"\n", 371 | " checkpointer = ModelCheckpoint(filepath=checkpoint_file, verbose=2, save_best_only=True)\n", 372 | " \n", 373 | " # Create an early stopping callback\n", 374 | " early_stopping_callback = EarlyStopping(patience=5, min_delta=0.0005, verbose=2)\n", 375 | " \n", 376 | " # Actually train the model\n", 377 | " print(model_params)\n", 378 | " history = model.fit(\n", 379 | " X_train,\n", 380 | " Y_train,\n", 381 | " batch_size=model_params[\"BATCH_SIZE\"],\n", 382 | " nb_epoch=model_params[\"NUMBER_OF_EPOCHS\"],\n", 383 | " verbose=1,\n", 384 | " validation_data=(X_valid, Y_valid),\n", 385 | " callbacks=[checkpointer, early_stopping_callback]\n", 386 | " )\n", 387 | " \n", 388 | " # Reload the best model\n", 389 | " model = load_model(checkpoint_file)\n", 390 | " \n", 391 | " # Get final predictions for the model and write to a file\n", 392 | " predictions = 
model.predict(X_test).flatten()\n", 393 | " model_metrics = get_metrics(predictions, Y_test)\n", 394 | " create_result_csv(user_params, model_params, model_metrics, file_name=RESULTS_PATH + MODEL_LOGGING_FILE)\n", 395 | " \n", 396 | " # Save the model to a unique location if the Pippin metric is better than the paper's\n", 397 | " if model_metrics[\"PIPPIN_METRIC\"] < 0.202:\n", 398 | " copyfile(checkpoint_file, MODEL_PATH + \"{}.hdf5\".format(model_name))\n", 399 | " \n", 400 | " # Plot the model history\n", 401 | " plt.plot(history.history['loss'])\n", 402 | " plt.plot(history.history['val_loss'])\n", 403 | " plt.title('Training History')\n", 404 | " plt.ylabel('Binary Cross Entropy Loss')\n", 405 | " plt.xlabel('Epoch')\n", 406 | " plt.xlim([0, len(history.history['loss'])])\n", 407 | " plt.legend(['Training set', 'Validation set'], loc='upper right')\n", 408 | " plt.show()\n", 409 | " \n", 410 | " # Reset plot to clean up extra lines\n", 411 | " plt.clf()\n", 412 | " \n", 413 | " # Get some indication of process length\n", 414 | " final = process_time()\n", 415 | " print('FINISHED MODEL SEARCH. 
{} SECONDS.'.format(str(final-start)))" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": {}, 422 | "outputs": [], 423 | "source": [] 424 | } 425 | ], 426 | "metadata": { 427 | "kernelspec": { 428 | "display_name": "Environment (conda_tensorflow_p36)", 429 | "language": "python", 430 | "name": "conda_tensorflow_p36" 431 | }, 432 | "language_info": { 433 | "codemirror_mode": { 434 | "name": "ipython", 435 | "version": 3 436 | }, 437 | "file_extension": ".py", 438 | "mimetype": "text/x-python", 439 | "name": "python", 440 | "nbconvert_exporter": "python", 441 | "pygments_lexer": "ipython3", 442 | "version": "3.6.6" 443 | } 444 | }, 445 | "nbformat": 4, 446 | "nbformat_minor": 2 447 | } 448 | -------------------------------------------------------------------------------- /cnn-model/Type Ia Supernova Classifier - Convolutional Neural Network.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # In[ ]: 5 | 6 | 7 | from sklearn.model_selection import StratifiedShuffleSplit 8 | from keras.callbacks import ModelCheckpoint, EarlyStopping 9 | from keras.layers.normalization import BatchNormalization 10 | from keras.layers import MaxPooling2D, Flatten, Conv2D 11 | from keras.layers import Dense, Dropout, Activation 12 | from keras.models import Sequential 13 | from matplotlib import pyplot as plt 14 | from slackclient import SlackClient 15 | from keras.models import load_model 16 | from keras.optimizers import Adam 17 | from space_utils import * 18 | 19 | from keras import regularizers 20 | from time import process_time 21 | from shutil import copyfile 22 | 23 | import pandas as pd 24 | import numpy as np 25 | 26 | import pickle 27 | import random 28 | import os 29 | 30 | pd.options.display.max_columns = 45 31 | 32 | 33 | # ## Introduction 34 | # --- 35 | # Hi and hello! Welcome to the step-by-step guide of how to train a model to detect supernova. 
36 | # 37 | # Throughout this guide you will learn about the data that we used, the building of a model in Keras, and how we went about record keeping for our experiments. 38 | # 39 | # There is a separate file called space_utils.py that holds any functions that we wrote for our project. 40 | 41 | # ## Constants 42 | # --- 43 | # We find it best to define a set of constants at the beginning of the notebook for clarity. 44 | 45 | # In[ ]: 46 | 47 | 48 | HOME_PATH = "/home/ubuntu" 49 | DATA_PATH = "/home/ubuntu/data/" 50 | MODEL_PATH = "/home/ubuntu/model/" 51 | RESULTS_PATH = "/home/ubuntu/results/" 52 | 53 | ALL_DATA_FILE = "extra_small_all_object_data_in_dictionary_format.pkl" 54 | NORMALIZED_IMAGE_DATA_FILE = "extra_small_normalized_image_object_data_in_numpy_format.pkl" 55 | 56 | MODEL_LOGGING_FILE = "model_results.csv" 57 | 58 | 59 | # ## Data Loading 60 | # --- 61 | # We first have to load in the data to be used for model training. 62 | # 63 | # This consists of 2 main data files stored in the variables ALL_DATA_FILE and NORMALIZED_IMAGE_DATA_FILE. 64 | # 65 | # ALL_DATA_FILE: This holds any information that is relevant to an object observation. This is a dictionary 66 | # with 4 keys -- images, targets, file_paths, observation_numbers -- where each key holds a NumPy array. The indices of 67 | # each array are all properly aligned according to their respective objects (explained in the table). 68 | # 69 | # | X_normalized | X | Y | file_path | observation_number | 70 | # |---------------------|----------|----------|------------------|---------------------------| 71 | # | obj_0_X_normalized | obj_0_X | obj_0_Y | obj_0_file_path | obj_0_observation_number | 72 | # | obj_42_X_normalized | obj_42_X | obj_42_Y | obj_42_file_path | obj_42_observation_number | 73 | # 74 | # NORMALIZED_IMAGE_DATA_FILE: This is simply a NumPy array of images ready to be fed into a model. 
They are normalized and the channels -- search image, template image, difference image -- are organized properly. The preparation of this data is in >>>FILL IN<<<. 75 | 76 | # In[ ]: 77 | 78 | 79 | all_data = pickle.load(open(DATA_PATH + ALL_DATA_FILE, "rb")) 80 | all_images_normalized = pickle.load(open(DATA_PATH + NORMALIZED_IMAGE_DATA_FILE, "rb")) 81 | 82 | 83 | # ## Data Splitting 84 | # --- 85 | # We have to split the data into 3 different sets: training, validation, and testing. Using the *split_space_data* 86 | # function we imported from *space_utils.py*, this is pretty straightforward. 87 | # 88 | # P.S. Sorry that each line is so long... We tried multiple ways of making this easier on the eyes but this makes 89 | # the most sense! 90 | 91 | # In[ ]: 92 | 93 | 94 | (X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_test, X_test_normal, Y_test, file_path_test, observation_number_test) = split_space_data( 95 | all_images_normalized, 96 | all_data["images"], 97 | all_data["targets"], 98 | all_data["file_paths"], 99 | all_data["observation_numbers"], 100 | 0.1 101 | ) 102 | 103 | 104 | # In[ ]: 105 | 106 | 107 | (X_train, X_train_normal, Y_train, file_path_train, observation_number_train), (X_valid, X_valid_normal, Y_valid, file_path_valid, observation_number_valid) = split_space_data( 108 | X_train, 109 | X_train_normal, 110 | Y_train, 111 | file_path_train, 112 | observation_number_train, 113 | 0.2 114 | ) 115 | 116 | 117 | # ## Model Definition 118 | # --- 119 | # We define the model in a function just to keep things separated nicely. Feel free to change the model however 120 | # you like! 
Try things out :D 121 | 122 | # In[ ]: 123 | 124 | 125 | def build_model(X, Y, params): 126 | 127 | # Figure out the data shape 128 | input_shape = (X.shape[1], X.shape[2], X.shape[3]) 129 | 130 | # Define the model object to append layers to 131 | model = Sequential() 132 | 133 | # Add first layer 134 | model.add(Conv2D( 135 | filters=params["NUMBER_OF_FILTERS_1"], 136 | kernel_size=(3,3), 137 | strides=(1,1), 138 | border_mode='same', 139 | data_format='channels_first', 140 | input_shape=input_shape 141 | )) 142 | model.add(Activation('relu')) 143 | model.add(Conv2D( 144 | filters=params["NUMBER_OF_FILTERS_1"], 145 | kernel_size=(3,3), 146 | strides=(2,2), 147 | border_mode='same', 148 | data_format='channels_first', 149 | input_shape=input_shape 150 | )) 151 | model.add(BatchNormalization(axis=1)) 152 | model.add(Activation('relu')) 153 | 154 | # Second layer 155 | model.add(Conv2D( 156 | filters=params["NUMBER_OF_FILTERS_2"], 157 | strides=(1,1), 158 | kernel_size=(3,3), 159 | border_mode='same', 160 | data_format='channels_first', 161 | )) 162 | model.add(Activation('relu')) 163 | model.add(Conv2D( 164 | filters=params["NUMBER_OF_FILTERS_2"], 165 | strides=(2,2), 166 | kernel_size=(3,3), 167 | border_mode='same', 168 | data_format='channels_first', 169 | )) 170 | model.add(BatchNormalization(axis=1)) 171 | model.add(Activation('relu')) 172 | 173 | # Third layer 174 | model.add(Conv2D( 175 | filters=params["NUMBER_OF_FILTERS_3"], 176 | strides=(1,1), 177 | kernel_size=(3,3), 178 | border_mode='same', 179 | data_format='channels_first', 180 | )) 181 | model.add(Activation('relu')) 182 | model.add(Conv2D( 183 | filters=params["NUMBER_OF_FILTERS_3"], 184 | strides=(2,2), 185 | kernel_size=(3,3), 186 | border_mode='same', 187 | data_format='channels_first', 188 | )) 189 | model.add(BatchNormalization(axis=1)) 190 | model.add(Activation('relu')) 191 | 192 | # Fourth layer 193 | model.add(Conv2D( 194 | filters=params["NUMBER_OF_FILTERS_4"], 195 | strides=(1,1), 196 | 
 kernel_size=(3,3), 197 | border_mode='same', 198 | data_format='channels_first', 199 | )) 200 | model.add(Activation('relu')) 201 | model.add(Conv2D( 202 | filters=params["NUMBER_OF_FILTERS_4"], 203 | strides=(2,2), 204 | kernel_size=(3,3), 205 | border_mode='same', 206 | data_format='channels_first', 207 | )) 208 | model.add(BatchNormalization(axis=1)) 209 | model.add(Activation('relu')) 210 | 211 | # Fifth layer 212 | model.add(Conv2D( 213 | filters=params["NUMBER_OF_FILTERS_4"], 214 | strides=(1,1), 215 | kernel_size=(3,3), 216 | border_mode='same', 217 | data_format='channels_first', 218 | )) 219 | model.add(Activation('relu')) 220 | 221 | # Output layers 222 | model.add(Flatten()) 223 | model.add(Dense(128)) 224 | model.add(Dropout(params["DROPOUT_PERCENT"])) 225 | model.add(Dense(1)) 226 | model.add(Activation("sigmoid")) 227 | 228 | return model 229 | 230 | 231 | # ## Model Parameters 232 | # --- 233 | # We have separated parameters into 2 buckets with the following definitions: 234 | # - user_params: Information *about* the model for record keeping 235 | # - model_params: Information *for* the model to consume 236 | 237 | # In[ ]: 238 | 239 | 240 | user_params = { 241 | "INITIALS": "cc", 242 | "MODEL_DESCRIPTION": "My first public model!", 243 | "VERSION": "1" 244 | } 245 | 246 | model_params = { 247 | "LEARNING_RATE": 0.00014148226882681195, 248 | "BATCH_SIZE": 368, 249 | "DROPOUT_PERCENT": 0.4488113054975806, 250 | "NUMBER_OF_FILTERS_1": 25, 251 | "NUMBER_OF_FILTERS_2": 63, 252 | "NUMBER_OF_FILTERS_3": 119, 253 | "NUMBER_OF_FILTERS_4": 210, 254 | "NUMBER_OF_EPOCHS": 40, 255 | } 256 | 257 | 258 | # ## Model Experimentation 259 | # --- 260 | # 261 | 262 | # In[ ]: 263 | 264 | 265 | MODEL_AMOUNT = 1 266 | 267 | for current_model_number in range(MODEL_AMOUNT): 268 | 269 | # Indicate and log model start 270 | print("START MODEL SEARCH (model {} of {})".format(current_model_number, MODEL_AMOUNT)) 271 | start = process_time() 272 | 273 | # Randomize specific 
parameters if we are doing a search 274 | # Feel free to add or change the current parameters 275 | if MODEL_AMOUNT > 1: 276 | model_params["LEARNING_RATE"] = 10 ** np.random.uniform(-4, -2) 277 | model_params["BATCH_SIZE"] = 16 * np.random.randint(1, 96) 278 | model_params["DROPOUT_PERCENT"] = np.random.uniform(0.0, 0.6) 279 | model_params["NUMBER_OF_FILTERS_1"] = np.random.randint(4, 32) 280 | model_params["NUMBER_OF_FILTERS_2"] = np.random.randint(16, 64) 281 | model_params["NUMBER_OF_FILTERS_3"] = np.random.randint(32, 128) 282 | model_params["NUMBER_OF_FILTERS_4"] = np.random.randint(64, 256) 283 | 284 | # Build the model and catch if the model architecture is not valid 285 | try: 286 | model = build_model(X_train, Y_train, model_params) 287 | except Exception as e: 288 | print("That didn't work!") 289 | print(e) 290 | continue 291 | 292 | # Create the specific model name 293 | model_name = user_params["INITIALS"] + "_convolutional_" + str(user_params["VERSION"]) + str(current_model_number) 294 | user_params["VERSION"] = user_params["VERSION"] + str(1) 295 | 296 | # Define an optimizer for the model 297 | adam_optimizer = Adam( 298 | lr=model_params["LEARNING_RATE"], 299 | beta_1=0.9, 300 | beta_2=0.999, 301 | epsilon=None, 302 | decay=0.0 303 | ) 304 | 305 | # Compile the model 306 | model.compile( 307 | loss="binary_crossentropy", 308 | optimizer=adam_optimizer, 309 | metrics=['accuracy'] 310 | ) 311 | 312 | # Figure out where to save the model checkpoints 313 | checkpoint_file = MODEL_PATH + "mdl.hdf5" 314 | checkpointer = ModelCheckpoint(filepath=checkpoint_file, verbose=2, save_best_only=True) 315 | 316 | # Create an early stopping callback 317 | early_stopping_callback = EarlyStopping(patience=5, min_delta=0.0005, verbose=2) 318 | 319 | # Actually train the model 320 | print(model_params) 321 | history = model.fit( 322 | X_train, 323 | Y_train, 324 | batch_size=model_params["BATCH_SIZE"], 325 | nb_epoch=model_params["NUMBER_OF_EPOCHS"], 326 | verbose=1, 327 | validation_data=(X_valid, Y_valid), 
328 | callbacks=[checkpointer, early_stopping_callback] 329 | ) 330 | 331 | # Reload the best model 332 | model = load_model(checkpoint_file) 333 | 334 | # Get final predictions for the model and write to a file 335 | predictions = model.predict(X_test).flatten() 336 | model_metrics = get_metrics(predictions, Y_test) 337 | create_result_csv(user_params, model_params, model_metrics, file_name=RESULTS_PATH + MODEL_LOGGING_FILE) 338 | 339 | # Save the model to a unique location if the Pippin metric is better than the paper's 340 | if model_metrics["PIPPIN_METRIC"] < 0.202: 341 | copyfile(checkpoint_file, checkpoint_filepath + "{}.hdf5".format(model_name)) 342 | 343 | # Plot the model history 344 | plt.plot(history.history['loss']) 345 | plt.plot(history.history['val_loss']) 346 | plt.title('Training History') 347 | plt.ylabel('Binary Cross Entropy Loss') 348 | plt.xlabel('Epoch') 349 | plt.xlim([0, len(history.history['loss'])]) 350 | plt.legend(['Training set', 'Validation set'], loc='upper right') 351 | plt.show() 352 | 353 | # Reset plot to clean up extra lines 354 | plt.clf() 355 | 356 | # Get some indication of process length 357 | final = process_time() 358 | print('FINISHED MODEL SEARCH. {} SECONDS.'.format(str(final-start))) 359 | 360 | -------------------------------------------------------------------------------- /cnn-model/space_utils.py: -------------------------------------------------------------------------------- 1 | def split_space_data( 2 | X_normalized, 3 | X, 4 | Y, 5 | file_path, 6 | observation_number, 7 | test_size 8 | ): 9 | '''Separate the data in a stratified way. 10 | 11 | The function takes in a few different datasets, where the indices of each are aligned to refer to the 12 | same object.
13 | 14 | | X_normalized | X | Y | file_path | observation_number | 15 | |---------------------|----------|----------|------------------|---------------------------| 16 | | obj_0_X_normalized | obj_0_X | obj_0_Y | obj_0_file_path | obj_0_observation_number | 17 | | obj_42_X_normalized | obj_42_X | obj_42_Y | obj_42_file_path | obj_42_observation_number | 18 | 19 | It is important to make sure that the split data is stratified. Stratification means that if there 20 | are multiple classes in our dataset, then when we split our data the classes make up a similar 21 | balance to the one they had in the full dataset. 22 | 23 | An example is: Our full dataset is 60% dogs and 40% cats. When we split our data into a training set 24 | and a test set, each set is still made of 60% dogs and 40% cats (or as close to this split as possible). 25 | ''' 26 | from sklearn.model_selection import StratifiedShuffleSplit 27 | 28 | # Create the helper object 29 | sss = StratifiedShuffleSplit( 30 | n_splits=1, 31 | test_size=test_size 32 | ) 33 | 34 | # Generate the indices 35 | train_index, test_index = next(sss.split(X_normalized, Y)) 36 | 37 | # Shuffle and split the data 38 | X_normalized_train, X_normalized_test = X_normalized[train_index], X_normalized[test_index] 39 | X_train, X_test = X[train_index], X[test_index] 40 | Y_train, Y_test = Y[train_index], Y[test_index] 41 | file_path_train, file_path_test = file_path[train_index], file_path[test_index] 42 | observation_number_train, observation_number_test = observation_number[train_index], observation_number[test_index] 43 | 44 | return ( 45 | X_normalized_train, 46 | X_train, 47 | Y_train, 48 | file_path_train, 49 | observation_number_train 50 | ), ( 51 | X_normalized_test, 52 | X_test, 53 | Y_test, 54 | file_path_test, 55 | observation_number_test 56 | ) 57 | 58 | 59 | def metrics(outputs, labels, threshold=0.5): 60 | '''Gets all metrics that we need for model comparison.
61 | 62 | Throughout the paper they talk about 2 main metrics: False Positive Rate (FPR) and Missed Detection 63 | Rate (MDR). We get these by calculating: 64 | 65 | True Positive: The number of times we said a supernova EXISTS and it DID 66 | False Positive: The number of times we said a supernova EXISTS and it DID NOT 67 | True Negative: The number of times we said a supernova DID NOT EXIST and it DID NOT 68 | False Negative: The number of times we said a supernova DID NOT EXIST and it DID 69 | ''' 70 | 71 | # Set the predictions to either 0 or 1 based on the given threshold 72 | predictions = outputs >= (1 - threshold) 73 | 74 | # Set the indices to either 0 or 1 based on the metric we are checking (label 0 is treated as the positive class) 75 | true_positive_indices = (predictions == 0.) * (labels == 0) 76 | false_positive_indices = (predictions == 0.) * (labels == 1) 77 | true_negative_indices = (predictions == 1.) * (labels == 1) 78 | false_negative_indices = (predictions == 1.) * (labels == 0) 79 | 80 | # Get the total count for each metric we are checking 81 | true_positive_count = true_positive_indices.sum() 82 | false_positive_count = false_positive_indices.sum() 83 | true_negative_count = true_negative_indices.sum() 84 | false_negative_count = false_negative_indices.sum() 85 | 86 | # Calculate and store the FPR and MDR in a dictionary for convenience 87 | fpr_and_mdr = { 88 | 'MDR': false_negative_count / (true_positive_count + false_negative_count), 89 | 'FPR': false_positive_count / (true_negative_count + false_positive_count) 90 | } 91 | 92 | return fpr_and_mdr 93 | 94 | 95 | def get_metrics(outputs, labels, with_acc=True): 96 | '''Get all metrics for all interesting thresholds. 97 | 98 | In the paper there is focus on 3 main thresholds -- 0.4, 0.5, and 0.6. We check all three. 99 | 100 | To make sure we are all on the same page, a threshold is basically a boundary that dictates what decision 101 | the model is making. This happens because a model's output is a float between 0 and 1.
If a model outputs 102 | 0.42 we have to decide what that actually means. With a threshold of 0.4, a 0.42 would be pushed to a 1, whereas 103 | a threshold of 0.5 would push it to a 0. (Note that metrics() applies its cut at 1 - threshold.) 104 | ''' 105 | import numpy as np 106 | 107 | all_metrics = {} 108 | 109 | # FPR and MDR 0.4 110 | temp = metrics(outputs, labels, threshold=0.4) 111 | all_metrics["FALSE_POSITIVE_RATE_4"] = temp["FPR"] 112 | all_metrics["MISSED_DETECTION_RATE_4"] = temp["MDR"] 113 | 114 | # FPR and MDR 0.5 115 | temp = metrics(outputs, labels, threshold=0.5) 116 | all_metrics["FALSE_POSITIVE_RATE_5"] = temp["FPR"] 117 | all_metrics["MISSED_DETECTION_RATE_5"] = temp["MDR"] 118 | 119 | # FPR and MDR 0.6 120 | temp = metrics(outputs, labels, threshold=0.6) 121 | all_metrics["FALSE_POSITIVE_RATE_6"] = temp["FPR"] 122 | all_metrics["MISSED_DETECTION_RATE_6"] = temp["MDR"] 123 | 124 | # Summed FPR and MDR 125 | all_metrics["FALSE_POSITIVE_RATE"] = all_metrics["FALSE_POSITIVE_RATE_4"] + all_metrics["FALSE_POSITIVE_RATE_5"] + all_metrics["FALSE_POSITIVE_RATE_6"] 126 | all_metrics["MISSED_DETECTION_RATE"] = all_metrics["MISSED_DETECTION_RATE_4"] + all_metrics["MISSED_DETECTION_RATE_5"] + all_metrics["MISSED_DETECTION_RATE_6"] 127 | 128 | # The true sum 129 | all_metrics["PIPPIN_METRIC"] = all_metrics["FALSE_POSITIVE_RATE"] + all_metrics["MISSED_DETECTION_RATE"] 130 | 131 | # Accuracy 132 | if with_acc: 133 | predictions = np.around(outputs).astype(int) 134 | all_metrics["ACCURACY"] = (predictions == labels).sum() / len(labels) 135 | 136 | return all_metrics 137 | 138 | 139 | def create_result_csv(user_params, model_params, metrics, extra_dict=None, file_name="results.csv"): 140 | '''Format information to be stored and write to a CSV on disk. 141 | 142 | This function is used for record keeping of model experiments that have been run. Each header listed in 143 | csv_header_order is a piece of information that we care about when comparing models.
Each row of the 144 | CSV becomes a record for a specific experiment, which we can then sort and 145 | compare using pandas DataFrames or Excel. 146 | ''' 147 | import pandas as pd 148 | import os 149 | 150 | # Define results file 151 | results_file = file_name 152 | 153 | # Dictionary to be turned into a CSV 154 | csv_dict = {} 155 | 156 | # Set the important columns in a set order 157 | csv_header_order = [ 158 | "INITIALS", 159 | "MODEL_DESCRIPTION", 160 | "VERSION", 161 | "FALSE_POSITIVE_RATE_4", 162 | "MISSED_DETECTION_RATE_4", 163 | "FALSE_POSITIVE_RATE_5", 164 | "MISSED_DETECTION_RATE_5", 165 | "FALSE_POSITIVE_RATE_6", 166 | "MISSED_DETECTION_RATE_6", 167 | "FALSE_POSITIVE_RATE", 168 | "MISSED_DETECTION_RATE", 169 | "PIPPIN_METRIC", 170 | "ACCURACY", 171 | "NUMBER_OF_FILTERS_1", 172 | "NUMBER_OF_FILTERS_2", 173 | "NUMBER_OF_FILTERS_3", 174 | "NUMBER_OF_FILTERS_4", 175 | "LEARNING_RATE", 176 | "BATCH_SIZE", 177 | "NUMBER_OF_EPOCHS", 178 | "NUMBER_OF_FILTERS", 179 | "POOL_SIZE", 180 | "KERNAL_SIZE", 181 | "NUMBER_OF_LAYERS", 182 | "DROPOUT_PERCENT", 183 | "FPR_ALPHA", 184 | "MDR_ALPHA", 185 | "DENSE_LAYER_SHAPES" 186 | ] 187 | 188 | # Loop through headers and create the dictionary 189 | for header in csv_header_order: 190 | # Check where the header is 191 | if header in metrics.keys(): 192 | csv_dict[header] = str(metrics[header]) 193 | elif header in user_params.keys(): 194 | csv_dict[header] = str(user_params[header]) 195 | elif header in model_params.keys(): 196 | csv_dict[header] = str(model_params[header]) 197 | 198 | # Turn the current data into a DataFrame 199 | updated_df = pd.DataFrame(csv_dict, index=[0]) 200 | 201 | # Check if a CSV already exists so we can add the current experiment to the previously logged ones 202 | if os.path.isfile(results_file): 203 | df = pd.read_csv(results_file) 204 | updated_df = pd.concat([df, updated_df]) 205 | 206 | # Write the CSV to disk 207 | updated_df.to_csv(results_file, index=False) 208 | 209 | return
updated_df 210 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | asn1crypto==0.22.0 2 | astropy==2.0.3 3 | attrs==17.4.0 4 | bleach==1.5.0 5 | certifi==2017.11.5 6 | cffi==1.11.2 7 | chardet==3.0.4 8 | cryptography==2.1.4 9 | cycler==0.10.0 10 | decorator==4.2.1 11 | h5py==2.7.1 12 | html5lib==0.9999999 13 | idna==2.6 14 | ipykernel==4.8.0 15 | ipython==6.2.1 16 | ipython-genutils==0.2.0 17 | jedi==0.11.1 18 | jupyter-client==5.2.2 19 | jupyter-core==4.4.0 20 | Keras==2.1.3 21 | Markdown==2.6.9 22 | matplotlib==2.1.2 23 | numpy==1.13.3 24 | pandas==0.22.0 25 | parso==0.1.1 26 | pexpect==4.3.1 27 | pickleshare==0.7.4 28 | pluggy==0.6.0 29 | prompt-toolkit==1.0.15 30 | ptyprocess==0.5.2 31 | py==1.5.2 32 | pycparser==2.18 33 | Pygments==2.2.0 34 | pyOpenSSL==17.4.0 35 | pyparsing==2.2.0 36 | PySocks==1.6.8 37 | pytest==3.3.2 38 | python-dateutil==2.6.1 39 | pytz==2017.3 40 | PyYAML==3.12 41 | pyzmq==16.0.3 42 | requests==2.18.4 43 | scikit-learn==0.19.1 44 | scipy==1.0.0 45 | tensorflow==1.4.1 46 | xgboost==0.7.post3 -------------------------------------------------------------------------------- /xgboost-baseline/README.md: -------------------------------------------------------------------------------- 1 | ## XGBoost Comparison Model 2 | 3 | We have given 2 files: 4 | 5 | 1. The iPython/Jupyter notebook file 6 | 2. The .py file outputted from iPython/Jupyter 7 | 8 | This will give you a few options to play with. We recommend iPython/Jupyter notebooks as they are 9 | more user friendly than terminal... but this is a README, we don't have control over you! 10 | 11 | Everything else should be explained in the code/notebook! 
-------------------------------------------------------------------------------- /xgboost-baseline/XGBoost Comparison Model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from sklearn.model_selection import train_test_split\n", 10 | "\n", 11 | "import xgboost as xgb\n", 12 | "import pandas as pd\n", 13 | "import numpy as np\n", 14 | "\n", 15 | "import pickle\n", 16 | "import random\n", 17 | "\n", 18 | "pd.set_option(\"max_columns\", 999)\n", 19 | "\n", 20 | "np.random.seed(1)" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Let's get started!\n", 28 | "\n", 29 | "First we have to load in the data, this is the feature engineered data right from the paper. We have actually taken the extra step of formatting it really nicely for Python.\n", 30 | "\n", 31 | "Make sure to change the path to where you downloaded the data!" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "path_to_data = \"/Users/clifford-laptop/Documents/space2vec/data/engineered-data.pkl\"\n", 41 | "\n", 42 | "data = pickle.load(open(path_to_data, 'rb'))" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Next the column types\n", 50 | "\n", 51 | "Not all of this is necessary but we wanted to make sure that we explicitly state what each column type is. That way we can be sure that we don't include columns that shouldn't be in the training data." 
52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 3, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "targets = [\n", 61 | " \"OBJECT_TYPE\",\n", 62 | "]\n", 63 | "\n", 64 | "ids = [\n", 65 | " \"ID\",\n", 66 | "]\n", 67 | "\n", 68 | "continuous = [\n", 69 | " \"AMP\",\n", 70 | " \"A_IMAGE\",\n", 71 | " \"A_REF\",\n", 72 | " \"B_IMAGE\",\n", 73 | " \"B_REF\",\n", 74 | " \"COLMEDS\",\n", 75 | " \"DIFFSUMRN\",\n", 76 | " \"ELLIPTICITY\",\n", 77 | " \"FLUX_RATIO\",\n", 78 | " \"GAUSS\",\n", 79 | " \"GFLUX\",\n", 80 | " \"L1\",\n", 81 | " \"LACOSMIC\",\n", 82 | " \"MAG\",\n", 83 | " \"MAGDIFF\",\n", 84 | " \"MAG_FROM_LIMIT\",\n", 85 | " \"MAG_REF\",\n", 86 | " \"MAG_REF_ERR\",\n", 87 | " \"MASKFRAC\",\n", 88 | " \"MIN_DISTANCE_TO_EDGE_IN_NEW\",\n", 89 | " \"NN_DIST_RENORM\",\n", 90 | " \"SCALE\",\n", 91 | " \"SNR\",\n", 92 | " \"SPREADERR_MODEL\",\n", 93 | " \"SPREAD_MODEL\",\n", 94 | "]\n", 95 | "\n", 96 | "categorical = [\n", 97 | " \"BAND\",\n", 98 | " \"CCDID\",\n", 99 | " \"FLAGS\",\n", 100 | "]\n", 101 | "\n", 102 | "ordinal = [\n", 103 | " \"N2SIG3\",\n", 104 | " \"N2SIG3SHIFT\",\n", 105 | " \"N2SIG5\",\n", 106 | " \"N2SIG5SHIFT\",\n", 107 | " \"N3SIG3\",\n", 108 | " \"N3SIG3SHIFT\",\n", 109 | " \"N3SIG5\",\n", 110 | " \"N3SIG5SHIFT\",\n", 111 | " \"NUMNEGRN\",\n", 112 | "]\n", 113 | "\n", 114 | "booleans = [\n", 115 | " \"MAGLIM\",\n", 116 | "]" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "## One hot encode any categorical columns\n", 124 | "\n", 125 | "Here we do something called one hot encoding (https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).\n", 126 | "\n", 127 | "This is to turn any categorical columns into something that a machine learning model can understand. 
Let's say we have a column, maybe we call it BAND, and this column might have 4 different possible values:\n", 128 | "\n", 129 | "g, i, r, or z\n", 130 | "\n", 131 | "Well we can't really shove these into our model so we hit it with the \"one hot\"! The BAND column becomes 5 different columns:\n", 132 | "\n", 133 | "BAND_g, BAND_i, BAND_r, BAND_z, and BAND_nan\n", 134 | "\n", 135 | "Now, instead of a letter value, we have a binary representation with a 1 in its corresponding column and a zero in the rest.\n", 136 | "\n", 137 | "The function is a bit interesting but it does exactly what we need!" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "data = pd.get_dummies(\n", 147 | " data, \n", 148 | " prefix = categorical, \n", 149 | " prefix_sep = '_',\n", 150 | " dummy_na = True, \n", 151 | " columns = categorical, \n", 152 | " sparse = False, \n", 153 | " drop_first = False\n", 154 | ")" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "## Split the inputs from the targets\n", 162 | "\n", 163 | "This is super important!\n", 164 | "\n", 165 | "We have to make sure we physically separate the targets (aka labels) from our model input. This is to give us peace of mind as we train.\n", 166 | "\n", 167 | "Obviously, the model should never train on our targets... That's like giving a student the exam answer sheet to study before the exam!"
168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 5, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "target = data[targets]\n", 177 | "inputs = data.drop(columns = ids + targets)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "## Shuffle and split the data\n", 185 | "\n", 186 | "Now we split the data again, this time into a training set and a validation set.\n", 187 | "\n", 188 | "This is comparable to having a bunch of practice questions before a test (the training set) and quiz questions (the validation set).\n", 189 | "\n", 190 | "**It's important to note that the model should never learn on the validation set!**\n", 191 | "\n", 192 | "We also shuffle the data to make sure we remove any possible patterns that could be happening within the data (not very likely to happen in this dataset but it doesn't hurt).\n", 193 | "\n", 194 | "Another **really** important point here is \"stratification\". That sounds fancy but it basically means that when we split the data, the distribution of the populations should be the same in the training and validation sets as it was originally... That didn't help did it?\n", 195 | "\n", 196 | "Let's say that in the total dataset we have 50.5% of the population as supernova and the other 49.5% of the population being not a supernova. When we split the data into two subsets, in a stratified way, both subsets should keep a very similar ratio of supernova to not-supernova (50.5% to 49.5%).\n", 197 | "\n", 198 | "This is getting way too long... Lastly I'll point out the **test_size = 0.2**. This simply means that 20% of the data is put into a validation set (leaving the other 80% as training data)."
199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 9, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "x_train, x_valid, y_train, y_valid = train_test_split(\n", 208 | " inputs, \n", 209 | " target, \n", 210 | " test_size = 0.2, \n", 211 | " random_state = 42,\n", 212 | " stratify = target.as_matrix()\n", 213 | ")" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "## Parameters!\n", 221 | "\n", 222 | "Alright, we won't get too into the specifics here but you can definitely check out the documentation (http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier).\n", 223 | "\n", 224 | "We just toyed around with the parameters to see what seemed to work the best.\n", 225 | "\n", 226 | "Once we get to the Convolutional Neural Network (CNN), the model we will more than likely use in the end, we will automate this parameter search.\n", 227 | "\n", 228 | "**The joys of this whole notebook thing is that you can run all of this! 
Try changing them and see what happens!**" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 26, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "params = {\n", 238 | " 'max_depth': 6,\n", 239 | " 'learning_rate': 0.1,\n", 240 | " 'silent': 1,\n", 241 | " 'objective': 'binary:logistic',\n", 242 | " 'scale_pos_weight': 0.5,\n", 243 | " 'n_estimators': 40,\n", 244 | " \"gamma\": 0,\n", 245 | " \"min_child_weight\": 1,\n", 246 | " \"max_delta_step\": 0, \n", 247 | " \"subsample\": 0.9, \n", 248 | " \"colsample_bytree\": 0.8, \n", 249 | " \"colsample_bylevel\": 0.9, \n", 250 | " \"reg_alpha\": 0, \n", 251 | " \"reg_lambda\": 1, \n", 252 | " \"scale_pos_weight\": 1, \n", 253 | " \"base_score\": 0.5, \n", 254 | " \"seed\": 23, \n", 255 | " \"nthread\": 4\n", 256 | "}" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "## *Rocky training montage*\n", 264 | "\n", 265 | "Now for the part where Rocky runs through the streets training for the big fight!\n", 266 | "\n", 267 | "Ahaha, oh the joys of modern programming! All we need to do is define the XGBClassifier and `.fit()`!\n", 268 | "\n", 269 | "As long as we pass in the data and the metrics that we want to define then we are good to go." 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 27, 275 | "metadata": {}, 276 | "outputs": [ 277 | { 278 | "name": "stderr", 279 | "output_type": "stream", 280 | "text": [ 281 | "/Users/clifford-laptop/anaconda2/envs/space2vec/lib/python3.6/site-packages/sklearn/preprocessing/label.py:95: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n", 282 | " y = column_or_1d(y, warn=True)\n", 283 | "/Users/clifford-laptop/anaconda2/envs/space2vec/lib/python3.6/site-packages/sklearn/preprocessing/label.py:128: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", 284 | " y = column_or_1d(y, warn=True)\n" 285 | ] 286 | }, 287 | { 288 | "name": "stdout", 289 | "output_type": "stream", 290 | "text": [ 291 | "[0]\tvalidation_0-auc:0.967291\tvalidation_1-auc:0.966996\n", 292 | "[1]\tvalidation_0-auc:0.974637\tvalidation_1-auc:0.974235\n", 293 | "[2]\tvalidation_0-auc:0.982206\tvalidation_1-auc:0.981863\n", 294 | "[3]\tvalidation_0-auc:0.982742\tvalidation_1-auc:0.982427\n", 295 | "[4]\tvalidation_0-auc:0.98447\tvalidation_1-auc:0.984314\n", 296 | "[5]\tvalidation_0-auc:0.985039\tvalidation_1-auc:0.984842\n", 297 | "[6]\tvalidation_0-auc:0.985353\tvalidation_1-auc:0.985177\n", 298 | "[7]\tvalidation_0-auc:0.985621\tvalidation_1-auc:0.985435\n", 299 | "[8]\tvalidation_0-auc:0.985995\tvalidation_1-auc:0.985823\n", 300 | "[9]\tvalidation_0-auc:0.986418\tvalidation_1-auc:0.986239\n", 301 | "[10]\tvalidation_0-auc:0.986688\tvalidation_1-auc:0.986519\n", 302 | "[11]\tvalidation_0-auc:0.986884\tvalidation_1-auc:0.986712\n", 303 | "[12]\tvalidation_0-auc:0.987164\tvalidation_1-auc:0.986975\n", 304 | "[13]\tvalidation_0-auc:0.987417\tvalidation_1-auc:0.987218\n", 305 | "[14]\tvalidation_0-auc:0.987586\tvalidation_1-auc:0.987418\n", 306 | "[15]\tvalidation_0-auc:0.987908\tvalidation_1-auc:0.987705\n", 307 | "[16]\tvalidation_0-auc:0.988169\tvalidation_1-auc:0.987992\n", 308 | "[17]\tvalidation_0-auc:0.988351\tvalidation_1-auc:0.988176\n", 309 | "[18]\tvalidation_0-auc:0.988474\tvalidation_1-auc:0.988304\n", 310 | "[19]\tvalidation_0-auc:0.988711\tvalidation_1-auc:0.988529\n", 311 | "[20]\tvalidation_0-auc:0.988923\tvalidation_1-auc:0.988739\n", 
312 | "[21]\tvalidation_0-auc:0.989098\tvalidation_1-auc:0.988922\n", 313 | "[22]\tvalidation_0-auc:0.989229\tvalidation_1-auc:0.989033\n", 314 | "[23]\tvalidation_0-auc:0.989479\tvalidation_1-auc:0.989271\n", 315 | "[24]\tvalidation_0-auc:0.989585\tvalidation_1-auc:0.989385\n", 316 | "[25]\tvalidation_0-auc:0.989726\tvalidation_1-auc:0.989511\n", 317 | "[26]\tvalidation_0-auc:0.98986\tvalidation_1-auc:0.98965\n", 318 | "[27]\tvalidation_0-auc:0.990075\tvalidation_1-auc:0.989816\n", 319 | "[28]\tvalidation_0-auc:0.990221\tvalidation_1-auc:0.989966\n", 320 | "[29]\tvalidation_0-auc:0.990338\tvalidation_1-auc:0.990079\n", 321 | "[30]\tvalidation_0-auc:0.990426\tvalidation_1-auc:0.990162\n", 322 | "[31]\tvalidation_0-auc:0.990536\tvalidation_1-auc:0.990268\n", 323 | "[32]\tvalidation_0-auc:0.990654\tvalidation_1-auc:0.990391\n", 324 | "[33]\tvalidation_0-auc:0.990745\tvalidation_1-auc:0.990473\n", 325 | "[34]\tvalidation_0-auc:0.990834\tvalidation_1-auc:0.99055\n", 326 | "[35]\tvalidation_0-auc:0.990964\tvalidation_1-auc:0.990658\n", 327 | "[36]\tvalidation_0-auc:0.99106\tvalidation_1-auc:0.990747\n", 328 | "[37]\tvalidation_0-auc:0.991139\tvalidation_1-auc:0.990819\n", 329 | "[38]\tvalidation_0-auc:0.991254\tvalidation_1-auc:0.99093\n", 330 | "[39]\tvalidation_0-auc:0.991371\tvalidation_1-auc:0.991034\n" 331 | ] 332 | }, 333 | { 334 | "data": { 335 | "text/plain": [ 336 | "XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.9,\n", 337 | " colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_delta_step=0,\n", 338 | " max_depth=6, min_child_weight=1, missing=None, n_estimators=40,\n", 339 | " n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,\n", 340 | " reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=23, silent=1,\n", 341 | " subsample=0.9)" 342 | ] 343 | }, 344 | "execution_count": 27, 345 | "metadata": {}, 346 | "output_type": "execute_result" 347 | } 348 | ], 349 | "source": [ 350 | "bst = xgb.XGBClassifier(**params)\n", 351 | 
"\n", 352 | "bst.fit(\n", 353 | " x_train, \n", 354 | " y_train, \n", 355 | " eval_set = [(x_train, y_train), (x_valid, y_valid)], \n", 356 | " eval_metric = ['auc'], \n", 357 | " verbose = True\n", 358 | ")" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "## Define the rules of the ring\n", 366 | "\n", 367 | "The rules of the big finale were described within the paper: these are the Missed Detection Rate (MDR) and the False Positive Rate (FPR). We won't dive in here as they are mentioned in depth in our blog post, but the following is the coded version of the metrics." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 31, 373 | "metadata": {}, 374 | "outputs": [], 375 | "source": [ 376 | "def metrics(outputs, labels, threshold=0.5):\n", 377 | " predictions = outputs >= (1 - threshold)\n", 378 | " true_positive_indices = (predictions == 0) * (labels == 0)\n", 379 | " false_positive_indices = (predictions == 0) * (labels == 1)\n", 380 | " true_negative_indices = (predictions == 1) * (labels == 1)\n", 381 | " false_negative_indices = (predictions == 1) * (labels == 0)\n", 382 | "\n", 383 | " true_positive_count = true_positive_indices.sum()\n", 384 | " false_positive_count = false_positive_indices.sum()\n", 385 | " true_negative_count = true_negative_indices.sum()\n", 386 | " false_negative_count = false_negative_indices.sum()\n", 387 | " \n", 388 | " return {\n", 389 | " # Missed detection rate\n", 390 | " 'MDR': false_negative_count / (true_positive_count + false_negative_count),\n", 391 | " # False positive rate\n", 392 | " 'FPR': false_positive_count / (true_negative_count + false_positive_count)\n", 393 | " }" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "## Hiring the referee\n", 401 | "\n", 402 | "Great, now we have the rules for the big fight. But we also need someone (or something...
or just a function) to take action on the rules.\n", 403 | "\n", 404 | "This is just a function that will run MDR and FPR on all 3 thresholds (0.4, 0.5, 0.6) and a few extras explained below:\n", 405 | "\n", 406 | "**FALSE_POSITIVE_RATE:** The sum of the FPR from all three thresholds; this helps us see how the models compare on a larger scale.\n", 407 | "\n", 408 | "**MISSED_DETECTION_RATE:** The sum of the MDR from all three thresholds; this helps us see how the models compare on a larger scale.\n", 409 | "\n", 410 | "**PIPPIN_METRIC:** Named after team member Pippin Lee, this is just **FALSE_POSITIVE_RATE** and **MISSED_DETECTION_RATE** summed to give us an even larger-scale view of how the models compare.\n", 411 | "\n", 412 | "**ACCURACY:** Simply the percentage of guesses that we got right." 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 30, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "def get_metrics(outputs, labels, with_acc=True):\n", 422 | " \n", 423 | " all_metrics = {}\n", 424 | " \n", 425 | " # FPR and MDR 0.4\n", 426 | " temp = metrics(outputs, labels, threshold=0.4)\n", 427 | " all_metrics[\"FALSE_POSITIVE_RATE_4\"] = temp[\"FPR\"]\n", 428 | " all_metrics[\"MISSED_DETECTION_RATE_4\"] = temp[\"MDR\"]\n", 429 | " \n", 430 | " # FPR and MDR 0.5\n", 431 | " temp = metrics(outputs, labels, threshold=0.5)\n", 432 | " all_metrics[\"FALSE_POSITIVE_RATE_5\"] = temp[\"FPR\"]\n", 433 | " all_metrics[\"MISSED_DETECTION_RATE_5\"] = temp[\"MDR\"]\n", 434 | " \n", 435 | " # FPR and MDR 0.6\n", 436 | " temp = metrics(outputs, labels, threshold=0.6)\n", 437 | " all_metrics[\"FALSE_POSITIVE_RATE_6\"] = temp[\"FPR\"]\n", 438 | " all_metrics[\"MISSED_DETECTION_RATE_6\"] = temp[\"MDR\"]\n", 439 | " \n", 440 | " # Summed FPR and MDR\n", 441 | " all_metrics[\"FALSE_POSITIVE_RATE\"] = all_metrics[\"FALSE_POSITIVE_RATE_4\"] + all_metrics[\"FALSE_POSITIVE_RATE_5\"] + all_metrics[\"FALSE_POSITIVE_RATE_6\"] \n", 442 | "
all_metrics[\"MISSED_DETECTION_RATE\"] = all_metrics[\"MISSED_DETECTION_RATE_4\"] + all_metrics[\"MISSED_DETECTION_RATE_5\"] + all_metrics[\"MISSED_DETECTION_RATE_6\"]\n", 443 | " \n", 444 | " # The true sum\n", 445 | " all_metrics[\"PIPPIN_METRIC\"] = all_metrics[\"FALSE_POSITIVE_RATE\"] + all_metrics[\"MISSED_DETECTION_RATE\"]\n", 446 | " \n", 447 | " # Accuracy\n", 448 | " if with_acc:\n", 449 | " predictions = np.around(outputs).astype(int)\n", 450 | " all_metrics[\"ACCURACY\"] = (predictions == labels).sum() / len(labels)\n", 451 | " \n", 452 | " return all_metrics" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "## The big fight!\n", 460 | "\n", 461 | "Our model has trained up in the modern day version of a classic cinematic training montage!\n", 462 | "\n", 463 | "We can finally give it the final challenge... this challenge just happens to be feeding it more data rather than fighting its own inner demons in the manifestation of a boxer." 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": 36, 469 | "metadata": {}, 470 | "outputs": [], 471 | "source": [ 472 | "y_predictions = bst.predict_proba(x_valid)[:, 1:]" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": {}, 478 | "source": [ 479 | "## To the judges!\n", 480 | "\n", 481 | "Our model has fought well and forced the match to a decision. Only the judges can give us the final results!\n", 482 | "\n", 483 | "You can see that we use the metric functions defined above, passing in what the model guessed and what the actual results **should be**. We then do the math and see how our fighter did.\n", 484 | "\n", 485 | "We won't go into the comparison here since we cover it in depth in the article.
\n", 486 | "\n", 487 | "(Teaser: it lost but actually did fairly well for how simple it is!)" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [ 496 | "all_metrics = get_metrics(y_predictions, y_valid)\n", 497 | "\n", 498 | "print(\"FPR (0.4): \" + str(all_metrics[\"FALSE_POSITIVE_RATE_4\"][0]))\n", 499 | "print(\"FPR (0.5): \" + str(all_metrics[\"FALSE_POSITIVE_RATE_5\"][0]))\n", 500 | "print(\"FPR (0.6): \" + str(all_metrics[\"FALSE_POSITIVE_RATE_6\"][0]))\n", 501 | "print(\"\")\n", 502 | "print(\"MDR (0.4): \" + str(all_metrics[\"MISSED_DETECTION_RATE_4\"][0]))\n", 503 | "print(\"MDR (0.5): \" + str(all_metrics[\"MISSED_DETECTION_RATE_5\"][0]))\n", 504 | "print(\"MDR (0.6): \" + str(all_metrics[\"MISSED_DETECTION_RATE_6\"][0]))\n", 505 | "print(\"\")\n", 506 | "print(\"SUMMED FPR: \" + str(all_metrics[\"FALSE_POSITIVE_RATE\"][0]))\n", 507 | "print(\"SUMMED MDR: \" + str(all_metrics[\"MISSED_DETECTION_RATE\"][0]))\n", 508 | "print(\"TOTAL SUM: \" + str(all_metrics[\"PIPPIN_METRIC\"][0]))\n", 509 | "print(\"\")\n", 510 | "print(\"ACCURACY: \" + str(all_metrics[\"ACCURACY\"][0]))" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "metadata": {}, 517 | "outputs": [], 518 | "source": [] 519 | } 520 | ], 521 | "metadata": { 522 | "kernelspec": { 523 | "display_name": "Python [conda env:space2vec]", 524 | "language": "python", 525 | "name": "conda-env-space2vec-py" 526 | }, 527 | "language_info": { 528 | "codemirror_mode": { 529 | "name": "ipython", 530 | "version": 3 531 | }, 532 | "file_extension": ".py", 533 | "mimetype": "text/x-python", 534 | "name": "python", 535 | "nbconvert_exporter": "python", 536 | "pygments_lexer": "ipython3", 537 | "version": "3.6.4" 538 | } 539 | }, 540 | "nbformat": 4, 541 | "nbformat_minor": 2 542 | } 543 | -------------------------------------------------------------------------------- 
/xgboost-baseline/XGBoost Comparison Model.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # In[1]: 5 | 6 | 7 | from sklearn.model_selection import train_test_split 8 | 9 | import xgboost as xgb 10 | import pandas as pd 11 | import numpy as np 12 | 13 | import pickle 14 | import random 15 | 16 | pd.set_option("display.max_columns", 999) 17 | 18 | np.random.seed(1) 19 | 20 | 21 | # ## Let's get started! 22 | # 23 | # First we have to load in the data. This is the feature-engineered data straight from the paper, which we have taken the extra step of formatting nicely for Python. 24 | # 25 | # Make sure to change the path to where you downloaded the data! 26 | 27 | # In[2]: 28 | 29 | 30 | path_to_data = "/Users/clifford-laptop/Documents/space2vec/data/engineered-data.pkl" 31 | 32 | data = pickle.load(open(path_to_data, 'rb')) 33 | 34 | 35 | # ## Next the column types 36 | # 37 | # Not all of this is strictly necessary, but we wanted to state explicitly what type each column is. That way we can be sure that we don't include columns that shouldn't be in the training data.
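One way to make that guarantee checkable is to compare the declared column groups against the DataFrame itself. The following is only a minimal sketch: the toy frame, the `column_groups` dict, and the variable names are hypothetical and not part of the original script, which simply declares the lists by hand.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has many more columns.
toy = pd.DataFrame({
    "ID": [1, 2],
    "OBJECT_TYPE": [0, 1],
    "SNR": [5.2, 11.0],
    "BAND": ["g", "r"],
})

# Hypothetical grouping, mirroring the idea of one list per column type.
column_groups = {
    "targets": ["OBJECT_TYPE"],
    "ids": ["ID"],
    "continuous": ["SNR"],
    "categorical": ["BAND"],
}

# Flatten every group into a single list of declared column names.
declared = [name for group in column_groups.values() for name in group]

# Columns present in the data but not declared in any group.
undeclared = sorted(set(toy.columns) - set(declared))
# Columns declared in a group but absent from the data.
missing = sorted(set(declared) - set(toy.columns))
# Columns declared in more than one group (or twice in the same group).
duplicated = sorted({name for name in declared if declared.count(name) > 1})

print("undeclared:", undeclared)
print("missing:", missing)
print("duplicated:", duplicated)
```

If any of the three lists is non-empty, a column is either about to slip into the training data unnoticed or has been mistyped in a group.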
38 | 39 | # In[3]: 40 | 41 | 42 | targets = [ 43 | "OBJECT_TYPE", 44 | ] 45 | 46 | ids = [ 47 | "ID", 48 | ] 49 | 50 | continuous = [ 51 | "AMP", 52 | "A_IMAGE", 53 | "A_REF", 54 | "B_IMAGE", 55 | "B_REF", 56 | "COLMEDS", 57 | "DIFFSUMRN", 58 | "ELLIPTICITY", 59 | "FLUX_RATIO", 60 | "GAUSS", 61 | "GFLUX", 62 | "L1", 63 | "LACOSMIC", 64 | "MAG", 65 | "MAGDIFF", 66 | "MAG_FROM_LIMIT", 67 | "MAG_REF", 68 | "MAG_REF_ERR", 69 | "MASKFRAC", 70 | "MIN_DISTANCE_TO_EDGE_IN_NEW", 71 | "NN_DIST_RENORM", 72 | "SCALE", 73 | "SNR", 74 | "SPREADERR_MODEL", 75 | "SPREAD_MODEL", 76 | ] 77 | 78 | categorical = [ 79 | "BAND", 80 | "CCDID", 81 | "FLAGS", 82 | ] 83 | 84 | ordinal = [ 85 | "N2SIG3", 86 | "N2SIG3SHIFT", 87 | "N2SIG5", 88 | "N2SIG5SHIFT", 89 | "N3SIG3", 90 | "N3SIG3SHIFT", 91 | "N3SIG5", 92 | "N3SIG5SHIFT", 93 | "NUMNEGRN", 94 | ] 95 | 96 | booleans = [ 97 | "MAGLIM", 98 | ] 99 | 100 | 101 | # ## One hot encode any categorical columns 102 | # 103 | # Here we do something called one hot encoding (https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f). 104 | # 105 | # This turns any categorical columns into something that a machine learning model can understand. Let's say we have a column, maybe we call it BAND, and this column might have 4 different possible values: 106 | # 107 | # g, i, r, or z 108 | # 109 | # Well, we can't really shove these letters into our model, so we hit it with the "one hot"! The BAND column becomes 5 different columns (the fifth, BAND_nan, flags rows where BAND is missing): 110 | # 111 | # BAND_g, BAND_i, BAND_r, BAND_z, and BAND_nan 112 | # 113 | # Now, instead of a letter value, we have a binary representation with a 1 in its corresponding column and a 0 in the rest. 114 | # 115 | # The `pd.get_dummies` call looks a bit involved, but it does exactly what we need!
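To make the BAND example concrete, here is a minimal sketch on a toy DataFrame. The frame `toy` and its five rows are made up for illustration; only the `pd.get_dummies` arguments mirror the ones used in this script.

```python
import pandas as pd

# Hypothetical toy data: four known bands plus one missing value (None).
toy = pd.DataFrame({"BAND": ["g", "i", "r", "z", None]})

encoded = pd.get_dummies(
    toy,
    columns=["BAND"],   # which columns to expand
    prefix=["BAND"],    # new columns are named BAND_<value>
    prefix_sep="_",
    dummy_na=True,      # adds the extra BAND_nan column for missing values
)

# Each row now has exactly one "hot" entry across the five new columns.
print(list(encoded.columns))
print(encoded.sum(axis=1).tolist())
```

Note that with `dummy_na=True` the `BAND_nan` column is created even when a column has no missing values, which is why BAND turns into five columns rather than four.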
116 | 117 | # In[4]: 118 | 119 | 120 | data = pd.get_dummies( 121 | data, 122 | prefix = categorical, 123 | prefix_sep = '_', 124 | dummy_na = True, 125 | columns = categorical, 126 | sparse = False, 127 | drop_first = False 128 | ) 129 | 130 | 131 | # ## Split the inputs from the targets 132 | # 133 | # This is super important! 134 | # 135 | # We have to make sure we physically separate the targets (aka labels) from our model input. This gives us peace of mind as we train. 136 | # 137 | # Obviously, the model should never see the targets as part of its input... That's like giving a student the exam answer sheet to study before the exam! 138 | 139 | # In[5]: 140 | 141 | 142 | target = data[targets] 143 | inputs = data.drop(columns = ids + targets) 144 | 145 | 146 | # ## Shuffle and split the data 147 | # 148 | # Now we split the data again, this time into a training set and a validation set. 149 | # 150 | # This is comparable to having a bunch of practice questions before a test (the training set) and quiz questions (the validation set). 151 | # 152 | # **It's important to note that the model should never learn on the validation set!** 153 | # 154 | # We also shuffle the data to make sure we remove any possible patterns that could be happening within the data (not very likely to happen in this dataset but it doesn't hurt). 155 | # 156 | # Another **really** important point here is "stratification". That sounds fancy but it basically means that when we split the data, the distribution of the populations should be the same in the training and validation sets as it was originally... That didn't help, did it? 157 | # 158 | # Let's say that in the total dataset 50.5% of the population is supernova and the other 49.5% is not. When we split the data into two subsets in a stratified way, both subsets should keep a very similar ratio of supernova to not-supernova (50.5% to 49.5%). 159 | # 160 | # This is getting way too long...
Lastly I'll point out the **test_size = 0.2**. This simply means that 20% of the data is put into a validation set (leaving the other 80% as training data). 161 | 162 | # In[9]: 163 | 164 | 165 | x_train, x_valid, y_train, y_valid = train_test_split( 166 | inputs, 167 | target, 168 | test_size = 0.2, 169 | random_state = 42, 170 | stratify = target.values 171 | ) 172 | 173 | 174 | # ## Parameters! 175 | # 176 | # Alright, we won't get too into the specifics here but you can definitely check out the documentation (http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier). 177 | # 178 | # We just toyed around with the parameters to see what seemed to work the best. 179 | # 180 | # Once we get to the Convolutional Neural Network (CNN), the model we will more than likely use in the end, we will automate this parameter search. 181 | # 182 | # **The joy of this whole notebook thing is that you can run all of this! Try changing the parameters and see what happens!** 183 | 184 | # In[26]: 185 | 186 | 187 | params = { 188 | 'max_depth': 6, 189 | 'learning_rate': 0.1, 190 | 'silent': 1, 191 | 'objective': 'binary:logistic', 192 | # scale_pos_weight is set further down 193 | 'n_estimators': 40, 194 | "gamma": 0, 195 | "min_child_weight": 1, 196 | "max_delta_step": 0, 197 | "subsample": 0.9, 198 | "colsample_bytree": 0.8, 199 | "colsample_bylevel": 0.9, 200 | "reg_alpha": 0, 201 | "reg_lambda": 1, 202 | "scale_pos_weight": 1, 203 | "base_score": 0.5, 204 | "seed": 23, 205 | "nthread": 4 206 | } 207 | 208 | 209 | # ## *Rocky training montage* 210 | # 211 | # Now for the part where Rocky runs through the streets training for the big fight! 212 | # 213 | # Ahaha, oh the joys of modern programming! All we need to do is define the XGBClassifier and `.fit()`! 214 | # 215 | # As long as we pass in the data and the evaluation metric we want to track, we are good to go.
216 | 217 | # In[27]: 218 | 219 | 220 | bst = xgb.XGBClassifier(**params) 221 | 222 | bst.fit( 223 | x_train, 224 | y_train, 225 | eval_set = [(x_train, y_train), (x_valid, y_valid)], 226 | eval_metric = ['auc'], 227 | verbose = True 228 | ) 229 | 230 | 231 | # ## Define the rules of the ring 232 | # 233 | # The rules of the big finale were described in the paper: the Missed Detection Rate (MDR) and the False Positive Rate (FPR). We won't dive in here as they are covered in depth in our blog post, but the following is the coded version of the metrics. 234 | 235 | # In[31]: 236 | 237 | 238 | def metrics(outputs, labels, threshold=0.5): 239 | predictions = outputs >= (1 - threshold) 240 | true_positive_indices = (predictions == 0) * (labels == 0) 241 | false_positive_indices = (predictions == 0) * (labels == 1) 242 | true_negative_indices = (predictions == 1) * (labels == 1) 243 | false_negative_indices = (predictions == 1) * (labels == 0) 244 | 245 | true_positive_count = true_positive_indices.sum() 246 | false_positive_count = false_positive_indices.sum() 247 | true_negative_count = true_negative_indices.sum() 248 | false_negative_count = false_negative_indices.sum() 249 | 250 | return { 251 | # Missed detection rate 252 | 'MDR': false_negative_count / (true_positive_count + false_negative_count), 253 | # False positive rate 254 | 'FPR': false_positive_count / (true_negative_count + false_positive_count) 255 | } 256 | 257 | 258 | # ## Hiring the referee 259 | # 260 | # Great, now we have the rules for the big fight. But we also need someone (or something... or just a function) to take action on the rules. 261 | # 262 | # This is just a function that will run MDR and FPR on all 3 thresholds (0.4, 0.5, 0.6) and a few extras explained below: 263 | # 264 | # **FALSE_POSITIVE_RATE:** The sum of the FPR from all three thresholds; this helps us see how the models compare on a larger scale.
265 | # 266 | # **MISSED_DETECTION_RATE:** The sum of the MDR from all three thresholds; this helps us see how the models compare on a larger scale. 267 | # 268 | # **PIPPIN_METRIC:** Named after team member Pippin Lee, this is just **FALSE_POSITIVE_RATE** and **MISSED_DETECTION_RATE** summed to give us an even larger-scale view of how the models compare. 269 | # 270 | # **ACCURACY:** Simply the fraction of guesses that we got right. 271 | 272 | # In[30]: 273 | 274 | 275 | def get_metrics(outputs, labels, with_acc=True): 276 | 277 | all_metrics = {} 278 | 279 | # FPR and MDR 0.4 280 | temp = metrics(outputs, labels, threshold=0.4) 281 | all_metrics["FALSE_POSITIVE_RATE_4"] = temp["FPR"] 282 | all_metrics["MISSED_DETECTION_RATE_4"] = temp["MDR"] 283 | 284 | # FPR and MDR 0.5 285 | temp = metrics(outputs, labels, threshold=0.5) 286 | all_metrics["FALSE_POSITIVE_RATE_5"] = temp["FPR"] 287 | all_metrics["MISSED_DETECTION_RATE_5"] = temp["MDR"] 288 | 289 | # FPR and MDR 0.6 290 | temp = metrics(outputs, labels, threshold=0.6) 291 | all_metrics["FALSE_POSITIVE_RATE_6"] = temp["FPR"] 292 | all_metrics["MISSED_DETECTION_RATE_6"] = temp["MDR"] 293 | 294 | # Summed FPR and MDR 295 | all_metrics["FALSE_POSITIVE_RATE"] = all_metrics["FALSE_POSITIVE_RATE_4"] + all_metrics["FALSE_POSITIVE_RATE_5"] + all_metrics["FALSE_POSITIVE_RATE_6"] 296 | all_metrics["MISSED_DETECTION_RATE"] = all_metrics["MISSED_DETECTION_RATE_4"] + all_metrics["MISSED_DETECTION_RATE_5"] + all_metrics["MISSED_DETECTION_RATE_6"] 297 | 298 | # The true sum 299 | all_metrics["PIPPIN_METRIC"] = all_metrics["FALSE_POSITIVE_RATE"] + all_metrics["MISSED_DETECTION_RATE"] 300 | 301 | # Accuracy 302 | if with_acc: 303 | predictions = np.around(outputs).astype(int) 304 | all_metrics["ACCURACY"] = (predictions == labels).sum() / len(labels) 305 | 306 | return all_metrics 307 | 308 | 309 | # ## The big fight! 310 | # 311 | # Our model has trained up in the modern-day version of a classic cinematic training montage!
312 | # 313 | # We can finally give it the final challenge... this challenge just happens to be feeding it more data rather than fighting its own inner demons in the manifestation of a boxer. 314 | 315 | # In[36]: 316 | 317 | 318 | y_predictions = bst.predict_proba(x_valid)[:, 1:] 319 | 320 | 321 | # ## To the judges! 322 | # 323 | # Our model has fought well and forced the match to a decision. Only the judges can give us the final results! 324 | # 325 | # You can see that we use the metric functions defined above, passing in what the model guessed and what the actual results **should be**. We then do the math and see how our fighter did. 326 | # 327 | # We won't dig into the comparison here since the article covers it in depth. 328 | # 329 | # (Teaser: it lost but actually did fairly well for how simple it is!) 330 | 331 | # In[ ]: 332 | 333 | 334 | all_metrics = get_metrics(y_predictions, y_valid) 335 | 336 | print("FPR (0.4): " + str(all_metrics["FALSE_POSITIVE_RATE_4"][0])) 337 | print("FPR (0.5): " + str(all_metrics["FALSE_POSITIVE_RATE_5"][0])) 338 | print("FPR (0.6): " + str(all_metrics["FALSE_POSITIVE_RATE_6"][0])) 339 | print("") 340 | print("MDR (0.4): " + str(all_metrics["MISSED_DETECTION_RATE_4"][0])) 341 | print("MDR (0.5): " + str(all_metrics["MISSED_DETECTION_RATE_5"][0])) 342 | print("MDR (0.6): " + str(all_metrics["MISSED_DETECTION_RATE_6"][0])) 343 | print("") 344 | print("SUMMED FPR: " + str(all_metrics["FALSE_POSITIVE_RATE"][0])) 345 | print("SUMMED MDR: " + str(all_metrics["MISSED_DETECTION_RATE"][0])) 346 | print("TOTAL SUM: " + str(all_metrics["PIPPIN_METRIC"][0])) 347 | print("") 348 | print("ACCURACY: " + str(all_metrics["ACCURACY"][0])) 349 | 350 | --------------------------------------------------------------------------------
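The MDR and FPR definitions in the `metrics` function above can be checked by hand on a tiny example. The sketch below is not the authors' code: `mdr_fpr`, `p_artifact`, and the eight toy scores are hypothetical. It assumes, as the `metrics` function implies, that label 0 is the positive class (a real detection), label 1 is an artifact, and the model output is the probability of class 1.

```python
import numpy as np

def mdr_fpr(p_artifact, labels, threshold=0.5):
    """Hypothetical re-statement of MDR/FPR with label 0 as the positive class."""
    # Flatten so (n, 1) arrays and plain lists behave the same way.
    p_artifact = np.asarray(p_artifact, dtype=float).ravel()
    labels = np.asarray(labels).ravel()

    # Same decision rule as metrics(): artifact if p >= 1 - threshold, else real.
    predicted_real = p_artifact < (1 - threshold)
    is_real = labels == 0

    tp = np.sum(predicted_real & is_real)     # real, called real
    fn = np.sum(~predicted_real & is_real)    # real, called artifact (missed)
    fp = np.sum(predicted_real & ~is_real)    # artifact, called real
    tn = np.sum(~predicted_real & ~is_real)   # artifact, called artifact

    return {"MDR": fn / (tp + fn), "FPR": fp / (fp + tn)}

# Toy data: four real objects (label 0) followed by four artifacts (label 1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_artifact = [0.1, 0.2, 0.45, 0.7, 0.9, 0.8, 0.55, 0.3]

results = {t: mdr_fpr(p_artifact, labels, threshold=t) for t in (0.4, 0.5, 0.6)}
for t, r in results.items():
    print(t, r)
```

On this toy data, raising the threshold trades false positives for missed detections: at 0.4 the MDR is 0.25 and the FPR is 0.5, while at 0.6 the MDR is 0.5 and the FPR is 0.25. That trade-off is why the summed metrics look at all three thresholds together.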