├── Image_Captioning.ipynb ├── PCA with scikit learn.ipynb ├── README.md ├── ROC_curve_comparison_.ipynb ├── SVM-detail-analysis.ipynb ├── _config.yml ├── datasets.tar.gz ├── detail_analysis _of_various_hospital_facttors.ipynb └── exploring_principal_component_analysis.ipynb /Image_Captioning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Training a caption generator\n", 8 | "This notebook implements the Show and Tell caption generation model described in our corresponding article. The key portions of this notebook are loading the data with `get_data`, processing the text data with `preProBuildWordVocab`, building the `Caption_Generator` in `train` and tracking our progress.\n", 9 | "\n", 10 | "*Note:* create a directory to save your tensorflow models and assign this directory path to the `model_path` variable." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 62, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import math\n", 20 | "import os\n", 21 | "import tensorflow as tf\n", 22 | "import numpy as np\n", 23 | "import pandas as pd\n", 24 | "import pickle\n", 25 | "import cv2\n", 26 | "import skimage\n", 27 | "import pickle as pkl\n", 28 | "\n", 29 | "import tensorflow.python.platform\n", 30 | "from keras.preprocessing import sequence\n", 31 | "from collections import Counter" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "# Downloading Data\n", 39 | "As mentioned in the README, in order to run this notebook, you will need VGG-16 image embeddings for the Flickr-30K dataset. These image embeddings are available from our [Google Drive](https://drive.google.com/file/d/0B5o40yxdA9PqTnJuWGVkcFlqcG8/view?usp=sharing).\n", 40 | "\n", 41 | "Additionally, you will need the corresponding captions for these images (`results_20130124.token`), which can also be downloaded from our [Google Drive](https://drive.google.com/file/d/0B2vTU3h54lTydXFjSVM5T2t4WmM/view?usp=sharing).\n", 42 | "\n", 43 | "Place all of these downloads in the `./data/` folder.\n", 44 | "\n", 45 | "The feature embeddings will be in `./data/feats.npy` and the embeddings' corresponding captions will be saved to `./data/results_20130124.token` ." 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 63, 51 | "metadata": { 52 | "collapsed": true 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "os.getcwd()\n", 57 | "path = \"/home/niraj/Documents/artificial_intelligence/projects/oreilly image captioning/oreilly-captions-master/data\"" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 64, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "'/home/niraj/Documents/artificial_intelligence/projects/oreilly image captioning/oreilly-captions-master'" 69 | ] 70 | }, 71 | "execution_count": 64, 72 | "metadata": {}, 73 | "output_type": "execute_result" 74 | } 75 | ], 76 | "source": [ 77 | "os.chdir = path\n", 78 | "os.getcwd()" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 65, 84 | "metadata": { 85 | "collapsed": true 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "model_path = '/home/niraj/Documents/artificial_intelligence/projects/oreilly image captioning/oreilly-captions-master'\n", 90 | "model_path_transfer = '/home/niraj/Documents/artificial_intelligence/projects/oreilly image captioning/oreilly-captions-master/models/tf_final'\n", 91 | "feature_path = '/home/niraj/Documents/artificial_intelligence/projects/oreilly image captioning/oreilly-captions-master/data/feats.npy'\n", 92 | "annotation_path = '/home/niraj/Documents/artificial_intelligence/projects/oreilly image captioning/oreilly-captions-master/data/results_20130124.token'" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "## Loading data\n", 100 | "Parse the image embedding features from the Flickr30k dataset `./data/feats.npy`, and load the caption data via `pandas` from `./data/results_20130124.token`" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 66, 106 | "metadata": { 107 | "collapsed": true 108 | }, 109 | "outputs": [], 110 | "source": [ 111 | "def get_data(annotation_path, feature_path):\n", 112 | " annotations = pd.read_table(annotation_path, sep='\\t', header=None, names=['image', 'caption'])\n", 113 | " return np.load(feature_path,'r'), annotations['caption'].values" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 67, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "feats, captions = get_data(annotation_path, feature_path)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 68, 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "(158915, 4096)\n", 137 | "(158915,)\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "print(feats.shape)\n", 143 | "print(captions.shape)\n", 144 | "#print(captions_images.shape)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 69, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "[ 0. 0. 0. ..., 0. 0.21412706\n", 157 | " 0.51223457]\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "print(feats[0])" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 70, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "Two young guys with shaggy hair look at their hands while hanging out in the yard .\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "print(captions[0])" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 71, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "Two young guys with shaggy hair look at their hands while hanging out in the yard .\n" 192 | ] 193 | } 194 | ], 195 | "source": [ 196 | "print(captions[0])" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 72, 202 | "metadata": { 203 | "collapsed": true 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "def preProBuildWordVocab(sentence_iterator, word_count_threshold=30): # function from Andre Karpathy's NeuralTalk\n", 208 | " print('preprocessing %d word vocab' % (word_count_threshold, ))\n", 209 | " word_counts = {}\n", 210 | " nsents = 0\n", 211 | " for sent in sentence_iterator:\n", 212 | " nsents += 1\n", 213 | " for w in sent.lower().split(' '):\n", 214 | " word_counts[w] = word_counts.get(w, 0) + 1\n", 215 | " vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]\n", 216 | " print('preprocessed words %d -> %d' % (len(word_counts), len(vocab)))\n", 217 | "\n", 218 | "# Giving every word a unique number and reversing the dictionary so every number will be key and word will be value \n", 219 | " ixtoword = {}\n", 220 | " ixtoword[0] = '.' \n", 221 | " wordtoix = {}\n", 222 | " wordtoix['#START#'] = 0 \n", 223 | " ix = 1\n", 224 | " for w in vocab:\n", 225 | " wordtoix[w] = ix\n", 226 | " ixtoword[ix] = w\n", 227 | " ix += 1\n", 228 | "\n", 229 | " word_counts['.'] = nsents\n", 230 | " bias_init_vector = np.array([1.0*word_counts[ixtoword[i]] for i in ixtoword])\n", 231 | " bias_init_vector /= np.sum(bias_init_vector) \n", 232 | " bias_init_vector = np.log(bias_init_vector)\n", 233 | " bias_init_vector -= np.max(bias_init_vector) \n", 234 | " return wordtoix, ixtoword, bias_init_vector.astype(np.float32)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 76, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "class Caption_Generator():\n", 244 | " def __init__(self, dim_in, dim_embed, dim_hidden, batch_size, n_lstm_steps, n_words, init_b):\n", 245 | "\n", 246 | " self.dim_in = dim_in\n", 247 | " self.dim_embed = dim_embed\n", 248 | " self.dim_hidden = dim_hidden\n", 249 | " self.batch_size = batch_size\n", 250 | " self.n_lstm_steps = n_lstm_steps\n", 251 | " self.n_words = n_words\n", 252 | " \n", 253 | " # declare the variables to be used for our word embeddings\n", 254 | " with tf.device(\"/cpu:0\"):\n", 255 | " self.word_embedding = tf.get_variable(\"word_embedding\",initializer= tf.random_uniform([self.n_words, self.dim_embed], -0.1, 0.1))\n", 256 | "\n", 257 | " self.embedding_bias = tf.get_variable(\"embedding_bias\",initializer= tf.zeros([dim_embed]))\n", 258 | " \n", 259 | " # declare the LSTM itself\n", 260 | " self.lstm = tf.contrib.rnn.BasicLSTMCell(dim_hidden)\n", 261 | " \n", 262 | " # declare the variables to be used to embed the image feature embedding to the word embedding space\n", 263 | " self.img_embedding = tf.get_variable(\"img_embedding\",initializer= tf.random_uniform([dim_in, dim_hidden], -0.1, 0.1))\n", 264 | " self.img_embedding_bias = tf.get_variable(\"img_embedding_bias\",initializer= tf.zeros([dim_hidden]))\n", 265 | "\n", 266 | " # declare the variables to go from an LSTM output to a word encoding output\n", 267 | " self.word_encoding = tf.get_variable(\"word_encoding\",initializer= tf.random_uniform([dim_hidden, n_words], -0.1, 0.1))\n", 268 | " # initialize this bias variable from the preProBuildWordVocab output\n", 269 | " self.word_encoding_bias = tf.get_variable(\"word_encoding_bias\",initializer= init_b)\n", 270 | "\n", 271 | " def build_model(self):\n", 272 | " # declaring the placeholders for our extracted image feature vectors, our caption, and our mask\n", 273 | " # (describes how long our caption is with an array of 0/1 values of length `maxlen` \n", 274 | " img = tf.placeholder(tf.float32, [self.batch_size, self.dim_in])\n", 275 | " caption_placeholder = tf.placeholder(tf.int32, [self.batch_size, self.n_lstm_steps])\n", 276 | " mask = tf.placeholder(tf.float32, [self.batch_size, self.n_lstm_steps])\n", 277 | " \n", 278 | " # getting an initial LSTM embedding from our image_imbedding\n", 279 | " image_embedding = tf.matmul(img, self.img_embedding) + self.img_embedding_bias\n", 280 | " \n", 281 | " # setting initial state of our LSTM\n", 282 | " state = self.lstm.zero_state(self.batch_size, dtype=tf.float32)\n", 283 | "\n", 284 | " total_loss = 0.0\n", 285 | " with tf.variable_scope(\"RNN\"):\n", 286 | " for i in range(self.n_lstm_steps): \n", 287 | " if i > 0:\n", 288 | " #if this isn’t the first iteration of our LSTM we need to get the word_embedding corresponding\n", 289 | " # to the (i-1)th word in our caption \n", 290 | " with tf.device(\"/cpu:0\"):\n", 291 | " current_embedding = tf.nn.embedding_lookup(self.word_embedding, caption_placeholder[:,i-1]) + self.embedding_bias\n", 292 | " else:\n", 293 | " #if this is the first iteration of our LSTM we utilize the embedded image as our input \n", 294 | " current_embedding = image_embedding\n", 295 | " if i > 0: \n", 296 | " # allows us to reuse the LSTM tensor variable on each iteration\n", 297 | " tf.get_variable_scope().reuse_variables()\n", 298 | "\n", 299 | " out, state = self.lstm(current_embedding, state)\n", 300 | " #out, state = self.tf.nn.dynamic_rnn(current_embedding, state)\n", 301 | "\n", 302 | " \n", 303 | " if i > 0:\n", 304 | " #get the one-hot representation of the next word in our caption \n", 305 | " labels = tf.expand_dims(caption_placeholder[:, i], 1)\n", 306 | " ix_range=tf.range(0, self.batch_size, 1)\n", 307 | " ixs = tf.expand_dims(ix_range, 1)\n", 308 | " concat = tf.concat([ixs, labels],1)\n", 309 | " onehot = tf.sparse_to_dense(\n", 310 | " concat, tf.stack([self.batch_size, self.n_words]), 1.0, 0.0)\n", 311 | "\n", 312 | "\n", 313 | " #perform a softmax classification to generate the next word in the caption\n", 314 | " logit = tf.matmul(out, self.word_encoding) + self.word_encoding_bias\n", 315 | " xentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logit, labels=onehot)\n", 316 | " xentropy = xentropy * mask[:,i]\n", 317 | "\n", 318 | " loss = tf.reduce_sum(xentropy)\n", 319 | " total_loss += loss\n", 320 | "\n", 321 | " total_loss = total_loss / tf.reduce_sum(mask[:,1:])\n", 322 | " return total_loss, img, caption_placeholder, mask\n" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 77, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "### Parameters ###\n", 332 | "dim_embed = 256\n", 333 | "dim_hidden = 256\n", 334 | "dim_in = 4096\n", 335 | "batch_size = 128\n", 336 | "momentum = 0.9\n", 337 | "n_epochs = 150\n", 338 | "\n", 339 | "def train(learning_rate=0.001, continue_training=False, transfer=True):\n", 340 | " \n", 341 | " tf.reset_default_graph()\n", 342 | "\n", 343 | " feats, captions = get_data(annotation_path, feature_path)\n", 344 | " wordtoix, ixtoword, init_b = preProBuildWordVocab(captions)\n", 345 | "\n", 346 | " np.save('data/ixtoword', ixtoword)\n", 347 | "\n", 348 | " index = (np.arange(len(feats)).astype(int))\n", 349 | " np.random.shuffle(index)\n", 350 | "\n", 351 | "\n", 352 | " sess = tf.InteractiveSession()\n", 353 | " n_words = len(wordtoix)\n", 354 | " maxlen = np.max( [x for x in map(lambda x: len(x.split(' ')), captions) ] )\n", 355 | " caption_generator = Caption_Generator(dim_in, dim_hidden, dim_embed, batch_size, maxlen+2, n_words, init_b)\n", 356 | "\n", 357 | " loss, image, sentence, mask = caption_generator.build_model()\n", 358 | "\n", 359 | " saver = tf.train.Saver(max_to_keep=100)\n", 360 | " global_step=tf.Variable(0,trainable=False)\n", 361 | " learning_rate = tf.train.exponential_decay(learning_rate, global_step,\n", 362 | " int(len(index)/batch_size), 0.95)\n", 363 | " train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)\n", 364 | " tf.global_variables_initializer().run()\n", 365 | "\n", 366 | " if continue_training:\n", 367 | " if not transfer:\n", 368 | " saver.restore(sess,tf.train.latest_checkpoint(model_path))\n", 369 | " else:\n", 370 | " saver.restore(sess,tf.train.latest_checkpoint(model_path_transfer))\n", 371 | " losses=[]\n", 372 | " for epoch in range(n_epochs):\n", 373 | " for start, end in zip( range(0, len(index), batch_size), range(batch_size, len(index), batch_size)):\n", 374 | "\n", 375 | " current_feats = feats[index[start:end]]\n", 376 | " current_captions = captions[index[start:end]]\n", 377 | " current_caption_ind = [x for x in map(lambda cap: [wordtoix[word] for word in cap.lower().split(' ')[:-1] if word in wordtoix], current_captions)]\n", 378 | "\n", 379 | " current_caption_matrix = sequence.pad_sequences(current_caption_ind, padding='post', maxlen=maxlen+1)\n", 380 | " current_caption_matrix = np.hstack( [np.full( (len(current_caption_matrix),1), 0), current_caption_matrix] )\n", 381 | "\n", 382 | " current_mask_matrix = np.zeros((current_caption_matrix.shape[0], current_caption_matrix.shape[1]))\n", 383 | " nonzeros = np.array([x for x in map(lambda x: (x != 0).sum()+2, current_caption_matrix )])\n", 384 | "\n", 385 | " for ind, row in enumerate(current_mask_matrix):\n", 386 | " row[:nonzeros[ind]] = 1\n", 387 | "\n", 388 | " _, loss_value = sess.run([train_op, loss], feed_dict={\n", 389 | " image: current_feats.astype(np.float32),\n", 390 | " sentence : current_caption_matrix.astype(np.int32),\n", 391 | " mask : current_mask_matrix.astype(np.float32)\n", 392 | " })\n", 393 | "\n", 394 | " print(\"Current Cost: \", loss_value, \"\\t Epoch {}/{}\".format(epoch, n_epochs), \"\\t Iter {}/{}\".format(start,len(feats)))\n", 395 | " print(\"Saving the model from epoch: \", epoch)\n", 396 | " saver.save(sess, os.path.join(model_path, 'model'), global_step=epoch)" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 78, 402 | "metadata": { 403 | "scrolled": true 404 | }, 405 | "outputs": [ 406 | { 407 | "name": "stdout", 408 | "output_type": "stream", 409 | "text": [ 410 | "preprocessing 30 word vocab\n", 411 | "preprocessed words 20326 -> 2942\n" 412 | ] 413 | }, 414 | { 415 | "ename": "ValueError", 416 | "evalue": "Variable RNN/basic_lstm_cell/kernel does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?", 417 | "output_type": "error", 418 | "traceback": [ 419 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 420 | "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", 421 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m#train(.001,False,False) #train from scratch\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mtrain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m.001\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mTrue\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m#continue training from pretrained weights @epoch500\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;31m#train(.001) #train from previously saved weights\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyboardInterrupt\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 422 | "\u001b[0;32m\u001b[0m in \u001b[0;36mtrain\u001b[0;34m(learning_rate, continue_training, transfer)\u001b[0m\n\u001b[1;32m 25\u001b[0m \u001b[0mcaption_generator\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mCaption_Generator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim_in\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdim_hidden\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdim_embed\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmaxlen\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn_words\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minit_b\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 26\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 27\u001b[0;31m \u001b[0mloss\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mimage\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msentence\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmask\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcaption_generator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuild_model\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 28\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 29\u001b[0m \u001b[0msaver\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSaver\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmax_to_keep\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 423 | "\u001b[0;32m\u001b[0m in \u001b[0;36mbuild_model\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 55\u001b[0m \u001b[0mtf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_variable_scope\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreuse_variables\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 56\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 57\u001b[0;31m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstate\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlstm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcurrent_embedding\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 58\u001b[0m \u001b[0;31m#out, state = self.tf.nn.dynamic_rnn(current_embedding, state)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 59\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 424 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.pyc\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs, state, scope)\u001b[0m\n\u001b[1;32m 178\u001b[0m with vs.variable_scope(vs.get_variable_scope(),\n\u001b[1;32m 179\u001b[0m custom_getter=self._rnn_get_variable):\n\u001b[0;32m--> 180\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mRNNCell\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 181\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 182\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_rnn_get_variable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgetter\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 425 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/layers/base.pyc\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs, *args, **kwargs)\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[0;31m# Check input assumptions set after layer building, e.g. input shape.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 449\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_assert_input_compatibility\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 450\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 451\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 452\u001b[0m \u001b[0;31m# Apply activity regularization.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 426 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.pyc\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, inputs, state)\u001b[0m\n\u001b[1;32m 399\u001b[0m \u001b[0mc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mh\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0marray_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstate\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnum_or_size_splits\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 400\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 401\u001b[0;31m \u001b[0mconcat\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_linear\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mh\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m4\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_num_units\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 402\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 403\u001b[0m \u001b[0;31m# i = input_gate, j = new_input, f = forget_gate, o = output_gate\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 427 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.pyc\u001b[0m in \u001b[0;36m_linear\u001b[0;34m(args, output_size, bias, bias_initializer, kernel_initializer)\u001b[0m\n\u001b[1;32m 1037\u001b[0m \u001b[0m_WEIGHTS_VARIABLE_NAME\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mtotal_arg_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutput_size\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1038\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1039\u001b[0;31m initializer=kernel_initializer)\n\u001b[0m\u001b[1;32m 1040\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1041\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmatmul\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mweights\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 428 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.pyc\u001b[0m in \u001b[0;36mget_variable\u001b[0;34m(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter)\u001b[0m\n\u001b[1;32m 1063\u001b[0m \u001b[0mcollections\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcollections\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcaching_device\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcaching_device\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1064\u001b[0m \u001b[0mpartitioner\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mpartitioner\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate_shape\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidate_shape\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1065\u001b[0;31m use_resource=use_resource, custom_getter=custom_getter)\n\u001b[0m\u001b[1;32m 1066\u001b[0m get_variable_or_local_docstring = (\n\u001b[1;32m 1067\u001b[0m \"\"\"%s\n", 429 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.pyc\u001b[0m in \u001b[0;36mget_variable\u001b[0;34m(self, var_store, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter)\u001b[0m\n\u001b[1;32m 960\u001b[0m \u001b[0mcollections\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcollections\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcaching_device\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcaching_device\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 961\u001b[0m \u001b[0mpartitioner\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mpartitioner\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate_shape\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidate_shape\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 962\u001b[0;31m use_resource=use_resource, custom_getter=custom_getter)\n\u001b[0m\u001b[1;32m 963\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 964\u001b[0m def _get_partitioned_variable(self,\n", 430 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.pyc\u001b[0m in \u001b[0;36mget_variable\u001b[0;34m(self, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter)\u001b[0m\n\u001b[1;32m 358\u001b[0m \u001b[0mreuse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreuse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrainable\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtrainable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcollections\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcollections\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0mcaching_device\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcaching_device\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpartitioner\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mpartitioner\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 360\u001b[0;31m validate_shape=validate_shape, use_resource=use_resource)\n\u001b[0m\u001b[1;32m 361\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 362\u001b[0m return _true_getter(\n", 431 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/rnn_cell_impl.pyc\u001b[0m in \u001b[0;36m_rnn_get_variable\u001b[0;34m(self, getter, *args, **kwargs)\u001b[0m\n\u001b[1;32m 181\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 182\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_rnn_get_variable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgetter\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 183\u001b[0;31m \u001b[0mvariable\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgetter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 184\u001b[0m trainable = (variable in tf_variables.trainable_variables() or\n\u001b[1;32m 185\u001b[0m (isinstance(variable, tf_variables.PartitionedVariable) and\n", 432 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.pyc\u001b[0m in \u001b[0;36m_true_getter\u001b[0;34m(name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource)\u001b[0m\n\u001b[1;32m 350\u001b[0m \u001b[0mtrainable\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtrainable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcollections\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcollections\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 351\u001b[0m \u001b[0mcaching_device\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcaching_device\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidate_shape\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidate_shape\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 352\u001b[0;31m use_resource=use_resource)\n\u001b[0m\u001b[1;32m 353\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcustom_getter\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 433 | "\u001b[0;32m/home/niraj/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.pyc\u001b[0m in \u001b[0;36m_get_single_variable\u001b[0;34m(self, name, shape, dtype, initializer, regularizer, partition_info, reuse, trainable, collections, caching_device, validate_shape, use_resource)\u001b[0m\n\u001b[1;32m 680\u001b[0m raise ValueError(\"Variable %s does not exist, or was not created with \"\n\u001b[1;32m 681\u001b[0m \u001b[0;34m\"tf.get_variable(). Did you mean to set reuse=None in \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 682\u001b[0;31m \"VarScope?\" % name)\n\u001b[0m\u001b[1;32m 683\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mshape\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_fully_defined\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0minitializing_from_value\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 684\u001b[0m raise ValueError(\"Shape of a new variable (%s) must be fully defined, \"\n", 434 | "\u001b[0;31mValueError\u001b[0m: Variable RNN/basic_lstm_cell/kernel does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?" 435 | ] 436 | } 437 | ], 438 | "source": [ 439 | "try:\n", 440 | " #train(.001,False,False) #train from scratch\n", 441 | " train(.001,True,True) #continue training from pretrained weights @epoch500\n", 442 | " #train(.001) #train from previously saved weights \n", 443 | "except KeyboardInterrupt:\n", 444 | " print('Exiting Training')" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": { 451 | "collapsed": true 452 | }, 453 | "outputs": [], 454 | "source": [] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": { 460 | "collapsed": true 461 | }, 462 | "outputs": [], 463 | "source": [] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": { 469 | "collapsed": true 470 | }, 471 | "outputs": [], 472 | "source": [] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": { 478 | "collapsed": true 479 | }, 480 | "outputs": [], 481 | "source": [] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "metadata": { 487 | "collapsed": true 488 | }, 489 | "outputs": [], 490 | "source": [] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": null, 495 | "metadata": { 496 | "collapsed": true 497 | }, 498 | "outputs": [], 499 | "source": [] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": null, 504 | "metadata": { 505 | "collapsed": true 506 | }, 507 | "outputs": [], 508 | "source": [] 509 | } 510 | ], 511 | "metadata": { 512 | "anaconda-cloud": {}, 513 | "kernelspec": { 514 | "display_name": "Python 2", 515 | "language": "python", 516 | "name": "python2" 517 | }, 518 | "language_info": { 519 | "codemirror_mode": { 520 | "name": "ipython", 521 | "version": 2 522 | }, 523 | "file_extension": ".py", 524 | "mimetype": "text/x-python", 525 | "name": "python", 526 | "nbconvert_exporter": "python", 527 | "pygments_lexer": "ipython2", 528 | "version": "2.7.13" 529 | } 530 | }, 531 | "nbformat": 4, 532 | "nbformat_minor": 1 533 | } 534 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | This repository contains source code corresponding to the various kernels I ran on Kaggle( https://www.kaggle.com/nirajvermafcb )and some of my deep learning pet projects. 2 | The dataset used for various kernels can be found on "datasets" folder and some links mentioned below. 3 | 4 | # Image Captioning using Tensorflow 5 | 6 | Image Embedding containing 4096 dimensionl feature vector from VGG-16 model was used for training 7 | the model using Transfer learning. 8 | Another embedding layer was utilised to map 4096 dimensional image 9 | features into the space of 256 dimensional textual features. 10 | Multi-layer Long Short Term Memory model 11 | was built . 12 | Masking technique was used to handle variable length input. 13 | 14 | ### Additional Downloads: 15 | 1) you will need VGG-16 image embeddings for the Flickr-30K dataset. These image embeddings are available on Google drive 16 | ( https://drive.google.com/file/d/0B5o40yxdA9PqTnJuWGVkcFlqcG8/view ). 17 | 18 | 2)Additionally, you will need the corresponding captions for these images (results_20130124.token), which can also be downloaded Google Drive. 19 | ( https://drive.google.com/file/d/0B2vTU3h54lTydXFjSVM5T2t4WmM/view ) 20 | 21 | Mention proper path pointing to the datasets at the start of running the model. 22 | 23 | # Comparison and Analysis of various Supervised classification ML models 24 | Performance comparison of various data science models like Logistic Regression, SVM, Random Forest, Decision Trees, Neural Network (MLP Classifier) & Gaussian Naïve Bayes on the basis of Precision, Recall, F-1-Score, ROC-AUC curve using Mushrooms classification dataset. THe dataset can be found on datasets folder named "Mushroom Claasification" 25 | 26 | # Exploring Principal Component Analysis 27 | Detailed step-by-step study of PCA without using Scikit-learn using dataset of Human Resources Analytics. Basic concepts such as covariance matrix, Eigen values and Eigen vectors was analysed.The dataset used was "Human Resources Analytics" which can be found on datasets folder. 28 | 29 | # Applying Principal component analysis with Scikit Learn 30 | 31 | This notebook contains the application of Principal component analysis on the given dataset using Scikit-learn and the dimensions(also known as components) with maximum variance(where the data is spread out)was found out.Features with little variance in the data are then projected into new lower dimension. Then the models are trained on transformed dataset to apply machine learning models.Then I have applied Random forest Regressor on old and the transformed datasets and compared them. The dataset used was "crowdness at the campus gym" which can be found in dataset folder 32 | 33 | # Detail-analysis-of-Support-Vector-Machine 34 | Detail study of SVM using dataset of Gender Recognition by voice, by comparing the default model with further tuned model. Various hyper-parameters such as kernel, C & gamma were tuned. The dataset used was "VoiceGender" which can be found on datasets folder. 35 | 36 | # Data Visualisation of IPL statistics 37 | Visualization of important stats were produced using Matplotlib and Seaborn library to analyse the trend and generate insights. Important stats included highest run getters, Toss-Win factor, total matches played-won factor, Most wins by big margin(Greater than 50 runs or more than 8 wickets)by teams etc. THe dataset used is no longer available.You can find the notebook with details of the dataset here: ( https://www.kaggle.com/nirajvermafcb/data-visualisation-for-ipl-datasets-1 ) 38 | 39 | # Detail Analysis of various Hospital Factors 40 | Detail study on hospital dataset was done and interesting insights is done by means of visualisation. The dataset used was "Hospital General Information" which can be found in dataset folder 41 | 42 | -------------------------------------------------------------------------------- /ROC_curve_comparison_.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "_cell_guid": "d89eb051-b775-94aa-7d61-baf5144aff04", 7 | "_uuid": "786568a75df6ac1ce698f0d6b0cd490684eed9b4" 8 | }, 9 | "source": [ 10 | "I am going to apply 6 Supervised machine learning models on the given dataset.The strategy is to apply default model first with no tuning of the hyperparameter and then tuned them with different hyperparameter values and then I'll plot ROC curve to select the best machine learning model.The models used are as follows:\n", 11 | "**1) Principal Component Analysis\n", 12 | "**2)Logistic Regression**\n", 13 | " **3)Gaussian Naive Bayes**\n", 14 | " **4)Support Vector Machine**\n", 15 | " **5)Random Forest Classifier** \n", 16 | " **6)Decision trees**\n", 17 | " **7)Simple neural network**" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": { 24 | "_cell_guid": "e5bb6859-47c5-7612-d53b-1f6bd39cf320", 25 | "_uuid": "16365c74dd530e17edd2a0df7fcfa7472e377095" 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 30 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", 31 | "# For example, here's several helpful packages to load in \n", 32 | "# Input data files are available in the \"../input/\" directory.\n", 33 | "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", 34 | "\n", 35 | "from subprocess import check_output\n", 36 | "print(check_output([\"ls\", \"../input\"]).decode(\"utf8\"))\n", 37 | "\n", 38 | "# Any results you write to the current directory are saved as output." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": { 44 | "_cell_guid": "dd3cebd3-cd87-dd42-fffa-c5f3aefde314", 45 | "_uuid": "f11ecccd5f04b6ec68765e8ffde8a56a5d87042e" 46 | }, 47 | "source": [ 48 | "### Importing all the libraries" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "_cell_guid": "bb4b600a-c486-9f3c-3a1c-26567f217391", 56 | "_uuid": "78d06754047df004d5776da9d66efe49ce4a6518" 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "import pandas as pd\n", 61 | "import numpy as np\n", 62 | "from matplotlib import pyplot as plt\n", 63 | "import seaborn as sns" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": { 69 | "_cell_guid": "080d387b-dd78-e7f0-36ac-dcb831910082", 70 | "_uuid": "56c011020f2eeffb91f3423642f16de37eb9666d" 71 | }, 72 | "source": [ 73 | "### Reading the file " 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": { 80 | "_cell_guid": "fe287208-001e-07e5-1a9b-dbf8d5a40c6a", 81 | "_uuid": "f4a36a4d4dd5a8a83e9332f77e7a78c3f36a6cac" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "data = pd.read_csv(\"../input/mushrooms.csv\")\n", 86 | "data.head(6)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": { 92 | "_cell_guid": "f9455820-2893-f798-b7ba-3f9090cfeac1", 93 | "_uuid": "72e8424a13408772cafa9792b9eb10df22fa6892" 94 | }, 95 | "source": [ 96 | "### Let us check if there is any null values" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "_cell_guid": "5499a675-fbba-d400-ecd3-57fc1a0f3915", 104 | "_uuid": "4d13cfe1b929b92fc17f3093a84f71c09a896474" 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "data.isnull().sum()" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": { 115 | "_cell_guid": "516a10c6-a887-10a1-af23-8846c6517cb2", 116 | "_uuid": "02811ac4e594a285c90deb527cad083088b205f1" 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "data['class'].unique()" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": { 126 | "_cell_guid": "2392cf2f-5d0b-0fbc-25d7-6ac8f6f9aa91", 127 | "_uuid": "d550638ac2c8411be39972a8097092514f4cd9fe" 128 | }, 129 | "source": [ 130 | "**Thus we have two claasification. Either the mushroom is poisonous or edible**" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": { 137 | "_cell_guid": "abf7629e-1970-5b51-bd93-594b84eb30cf", 138 | "_uuid": "bc497236c5adf11c58d7bac0f050e97057627960" 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "data.shape" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": { 148 | "_cell_guid": "8198f725-e1d9-13d3-e29b-01aa7942f93a", 149 | "_uuid": "833b4d14df12086f7a3a2497b50cbe7066ca4a01" 150 | }, 151 | "source": [ 152 | "**Thus we have 22 features(1st one is label) and 8124 instances.Now let us check which features constitutes maximum information.** " 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": { 158 | "_cell_guid": "b6e97737-4258-b9ae-62ab-dd6faebfd0a2", 159 | "_uuid": "d4ba71813bfd2c188673cbf8ba454f15c917f6a5" 160 | }, 161 | "source": [ 162 | "**We can see that the dataset has values in strings.We need to convert all the unique values to integers. Thus we perform label encoding on the data.**" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "_cell_guid": "af0ee2d3-add4-2b0c-140a-6815347c4d3c", 170 | "_uuid": "ab284b2d43e925e292d688a27a8635d1e2605665" 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "from sklearn.preprocessing import LabelEncoder\n", 175 | "labelencoder=LabelEncoder()\n", 176 | "for col in data.columns:\n", 177 | " data[col] = labelencoder.fit_transform(data[col])\n", 178 | " \n", 179 | "data.head()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": { 185 | "_cell_guid": "b49e7feb-8261-9056-d345-9215e0d384bd", 186 | "_uuid": "b4fe437adb0432a8a2e88f90b555103fc1ae6f46" 187 | }, 188 | "source": [ 189 | "### Checking the encoded values" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": { 196 | "_cell_guid": "262c2fe1-8e68-f5fe-a879-3bd9400b469c", 197 | "_uuid": "65a4999e0a6d9a2da74e055dc90745f672c5571a" 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "data['stalk-color-above-ring'].unique()" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "_cell_guid": "9ced0f59-a411-46a8-92a4-f9ef9fe47a0d", 209 | "_uuid": "315f7e1554e43c1de380f24c83c22679c2f280bf" 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "print(data.groupby('class').size())" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": { 219 | "_cell_guid": "8b40bb73-5d26-48f3-e97f-072b2467d0bc", 220 | "_uuid": "31d625b63e374172aafcae73e30d337193472db8" 221 | }, 222 | "source": [ 223 | "### Plotting boxplot to see the distribution of the data" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": { 230 | "_cell_guid": "3c0414fe-dac5-12f7-80c5-517f13236eec", 231 | "_uuid": "f13fbc158b36ef385fdc609a4a00ff831281cfd5" 232 | }, 233 | "outputs": [], 234 | "source": [ 235 | "'''\n", 236 | "# Create a figure instance\n", 237 | "fig, axes = plt.subplots(nrows=2 ,ncols=2 ,figsize=(9, 9))\n", 238 | "\n", 239 | "# Create an axes instance and the boxplot\n", 240 | "bp1 = axes[0,0].boxplot(data['stalk-color-above-ring'],patch_artist=True)\n", 241 | "\n", 242 | "bp2 = axes[0,1].boxplot(data['stalk-color-below-ring'],patch_artist=True)\n", 243 | "\n", 244 | "bp3 = axes[1,0].boxplot(data['stalk-surface-below-ring'],patch_artist=True)\n", 245 | "\n", 246 | "bp4 = axes[1,1].boxplot(data['stalk-surface-above-ring'],patch_artist=True)\n", 247 | "'''\n", 248 | "ax = sns.boxplot(x='class', y='stalk-color-above-ring', \n", 249 | " data=data)\n", 250 | "ax = sns.stripplot(x=\"class\", y='stalk-color-above-ring',\n", 251 | " data=data, jitter=True,\n", 252 | " edgecolor=\"gray\")\n", 253 | "sns.plt.title(\"Class w.r.t stalkcolor above ring\",fontsize=12)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": { 259 | "_cell_guid": "3594a475-f8a8-3599-f2b8-c0209e87aeec", 260 | "_uuid": "9018af8e638a03e93193ae9fb187f74a894717d1" 261 | }, 262 | "source": [ 263 | "**Separating features and label**" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "_cell_guid": "f15dc721-d721-9320-3964-66cc19b72a18", 271 | "_uuid": "cb784bf004a1a9fa96baa327c4fac13092747b1e" 272 | }, 273 | "outputs": [], 274 | "source": [ 275 | "X = data.iloc[:,1:23] # all rows, all the features and no labels\n", 276 | "y = data.iloc[:, 0] # all rows, label only\n", 277 | "X.head()\n", 278 | "y.head()" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "_cell_guid": "19a00b00-b010-6918-11cb-1fa7b18aaa17", 286 | "_uuid": "dac81faac1dbb2bb8abf0058e73cd9186b214b44" 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "X.describe()" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": { 297 | "_cell_guid": "9f8ebf80-0488-93ea-9a10-d128c8519bf6", 298 | "_uuid": "daf21399b2443eb751ed0f5fe1d21e53b59b487d" 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "y.head()" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "_cell_guid": "29c744c8-7da3-bcbd-d33d-e01d55833f07", 310 | "_uuid": "7b80dffdfaad294c40746ea97949d01fa4cea3f2" 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "data.corr()" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": { 320 | "_cell_guid": "5c52b254-e464-3262-05f8-523bec33fe44", 321 | "_uuid": "cf2cb9df27464a86437a02ba876d7f0862893647" 322 | }, 323 | "source": [ 324 | "# Standardising the data" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": null, 330 | "metadata": { 331 | "_cell_guid": "54185e06-71a1-3738-e41b-cbc0ae4ca614", 332 | "_uuid": "ca49c4ef0a91b22f32d71145f03ad1dbb2d9ddb2" 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "# Scale the data to be between -1 and 1\n", 337 | "from sklearn.preprocessing import StandardScaler\n", 338 | "scaler = StandardScaler()\n", 339 | "X=scaler.fit_transform(X)\n", 340 | "X" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": { 346 | "_cell_guid": "949d806b-30f9-2d26-0dde-65377690bef6", 347 | "_uuid": "252a822b8dcc636fb690697eb0fca5e491d717fd" 348 | }, 349 | "source": [ 350 | "**Note**: We can avoid PCA here since the dataset is very small." 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": { 356 | "_cell_guid": "80eb90a3-f197-cf7e-13ad-126911b51064", 357 | "_uuid": "994f70bc40134a4cce266472a8ff2876bcdd3a00" 358 | }, 359 | "source": [ 360 | "# Principal Component Analysis" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": { 367 | "_cell_guid": "0e3cc9bb-7d54-6133-8863-d10b6924ef5d", 368 | "_uuid": "286a68a1cf7a8711f48900837c474dce2549a16c" 369 | }, 370 | "outputs": [], 371 | "source": [ 372 | "from sklearn.decomposition import PCA\n", 373 | "pca = PCA()\n", 374 | "pca.fit_transform(X)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "_cell_guid": "0904dbb8-7a4f-7cff-f978-37c84371b440", 382 | "_uuid": "45110aaafa26fe801c265c6a2eccc4d05fa57ed5" 383 | }, 384 | "outputs": [], 385 | "source": [ 386 | "covariance=pca.get_covariance()\n", 387 | "#covariance" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": null, 393 | "metadata": { 394 | "_cell_guid": "a936fc92-793d-a887-bdb6-ca72905ef330", 395 | "_uuid": "0b8b7e48edc985f05f637bdd6fa1137ec08eedd1" 396 | }, 397 | "outputs": [], 398 | "source": [ 399 | "explained_variance=pca.explained_variance_\n", 400 | "explained_variance" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "_cell_guid": "19dd9eca-ee82-bd65-002b-bdd11590956c", 408 | "_uuid": "42a057446a7afa8bc91c1ba5bc281c887e51b24b" 409 | }, 410 | "outputs": [], 411 | "source": [ 412 | "with plt.style.context('dark_background'):\n", 413 | " plt.figure(figsize=(6, 4))\n", 414 | " \n", 415 | " plt.bar(range(22), explained_variance, alpha=0.5, align='center',\n", 416 | " label='individual explained variance')\n", 417 | " plt.ylabel('Explained variance ratio')\n", 418 | " plt.xlabel('Principal components')\n", 419 | " plt.legend(loc='best')\n", 420 | " plt.tight_layout()" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": { 426 | "_cell_guid": "a2e6f6ff-bb71-a216-33ba-05a43a61cb83", 427 | "_uuid": "ec3e3ea21ff565682085f0c34df897b3247f503e" 428 | }, 429 | "source": [ 430 | "**We can see that the last 4 components has less amount of variance of the data.The 1st 17 components retains more than 90% of the data.**" 431 | ] 432 | }, 433 | { 434 | "cell_type": "markdown", 435 | "metadata": { 436 | "_cell_guid": "cbcac3ce-62cf-5ac7-946e-a5de3571052a", 437 | "_uuid": "29e126263ce7126a6480cce546eba440be656885" 438 | }, 439 | "source": [ 440 | "### Let us take only first two principal components and visualise it using K-means clustering" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": null, 446 | "metadata": { 447 | "_cell_guid": "12199b6a-ec63-32ca-4efb-2f930fcf004f", 448 | "_uuid": "a8b53b53a63697ad43b5dcb769cf573ed02d259f" 449 | }, 450 | "outputs": [], 451 | "source": [ 452 | "N=data.values\n", 453 | "pca = PCA(n_components=2)\n", 454 | "x = pca.fit_transform(N)\n", 455 | "plt.figure(figsize = (5,5))\n", 456 | "plt.scatter(x[:,0],x[:,1])\n", 457 | "plt.show()" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": { 464 | "_cell_guid": "b504681c-c3b9-e808-fb8c-3b0ff036f8b7", 465 | "_uuid": "aae2d1c67f12ce9253ab106373ecdd158a2a6fb0" 466 | }, 467 | "outputs": [], 468 | "source": [ 469 | "from sklearn.cluster import KMeans\n", 470 | "kmeans = KMeans(n_clusters=2, random_state=5)\n", 471 | "X_clustered = kmeans.fit_predict(N)\n", 472 | "\n", 473 | "LABEL_COLOR_MAP = {0 : 'g',\n", 474 | " 1 : 'y'\n", 475 | " }\n", 476 | "\n", 477 | "label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]\n", 478 | "plt.figure(figsize = (5,5))\n", 479 | "plt.scatter(x[:,0],x[:,1], c= label_color)\n", 480 | "plt.show()" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": { 486 | "_cell_guid": "0a9f3343-5f46-0572-27f0-8269ef162a56", 487 | "_uuid": "e21ed30d38472557d356edc6fa8556e263eb9e68" 488 | }, 489 | "source": [ 490 | "### Thus using K-means we are able segregate 2 classes well using the first two components with maximum variance." 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": { 496 | "_cell_guid": "90fa31fa-cac2-a8a5-d69a-9bd53fc52920", 497 | "_uuid": "aa6bcb8164fe4168a62c0604a612de1927c67252" 498 | }, 499 | "source": [ 500 | "# Performing PCA by taking 17 components with maximum Variance" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "metadata": { 507 | "_cell_guid": "d4f93ff7-0bf8-1ecf-e14f-ddb56a1b4bb4", 508 | "_uuid": "8cc9f256ee82aeaa287552b84c19adcc54e84961" 509 | }, 510 | "outputs": [], 511 | "source": [ 512 | "pca_modified=PCA(n_components=17)\n", 513 | "pca_modified.fit_transform(X)" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": { 519 | "_cell_guid": "a44859a3-c852-332c-3d29-ad48f8e090df", 520 | "_uuid": "d1587a4373a82994a3e38dfbc9223719e084d5d7" 521 | }, 522 | "source": [ 523 | "### Splitting the data into training and testing dataset" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "metadata": { 530 | "_cell_guid": "200dc29e-9986-069a-7a59-2374dbf57b35", 531 | "_uuid": "75ec9e83cfd85cf463ecbe35d10879988051e9c5" 532 | }, 533 | "outputs": [], 534 | "source": [ 535 | "from sklearn.model_selection import train_test_split\n", 536 | "X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=4)" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": { 542 | "_cell_guid": "801b54d9-d7d1-6ed2-722c-58fa17bb28f7", 543 | "_uuid": "5464234586b6bbc1c992a5bd69d24ee091417b03" 544 | }, 545 | "source": [ 546 | "# Default Logistic Regression" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": null, 552 | "metadata": { 553 | "_cell_guid": "47b309b1-8b11-26d5-112c-3838427e7ae6", 554 | "_uuid": "cc2e8f2d5cd0101ae76b2a6b1919242e1c645167" 555 | }, 556 | "outputs": [], 557 | "source": [ 558 | "from sklearn.linear_model import LogisticRegression\n", 559 | "from sklearn.model_selection import cross_val_score\n", 560 | "from sklearn import metrics\n", 561 | "\n", 562 | "model_LR= LogisticRegression()" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": { 569 | "_cell_guid": "f1c6e97f-6207-1829-d4e2-4e5a5229c2a0", 570 | "_uuid": "1742429c5cd485ece083e6565abdc62a5b873ee8" 571 | }, 572 | "outputs": [], 573 | "source": [ 574 | "model_LR.fit(X_train,y_train)" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "metadata": { 581 | "_cell_guid": "24aa6045-a438-f648-5d8c-88e45e1102a0", 582 | "_uuid": "f434b6250096175cb22979c6b43da8d7369acbc1" 583 | }, 584 | "outputs": [], 585 | "source": [ 586 | "y_prob = model_LR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 587 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 588 | "model_LR.score(X_test, y_pred)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": null, 594 | "metadata": { 595 | "_cell_guid": "31c2969e-7fd6-f6a9-04f5-16ffd8fd6396", 596 | "_uuid": "2dc131abd61792b16d6adf23780fe35dca442242" 597 | }, 598 | "outputs": [], 599 | "source": [ 600 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 601 | "confusion_matrix" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "metadata": { 608 | "_cell_guid": "48f43519-85e2-fe26-9174-b8a8b1899812", 609 | "_uuid": "95b3fb6103fab9b16fb31cb0f9f8a08c66fa4eed" 610 | }, 611 | "outputs": [], 612 | "source": [ 613 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 614 | "auc_roc" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": null, 620 | "metadata": { 621 | "_cell_guid": "84dbc1dc-d945-2277-e1b7-e976769becd3", 622 | "_uuid": "131a8116d32238aa4b5addd2c1c6a319b6069749" 623 | }, 624 | "outputs": [], 625 | "source": [ 626 | "from sklearn.metrics import roc_curve, auc\n", 627 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 628 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 629 | "roc_auc" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": null, 635 | "metadata": { 636 | "_cell_guid": "5ee904b5-7db8-e2ff-9953-02f34d868f25", 637 | "_uuid": "afb001763bd52c12baf41f75c8ba7aed70bf58da" 638 | }, 639 | "outputs": [], 640 | "source": [ 641 | "import matplotlib.pyplot as plt\n", 642 | "plt.figure(figsize=(10,10))\n", 643 | "plt.title('Receiver Operating Characteristic')\n", 644 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 645 | "plt.legend(loc = 'lower right')\n", 646 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 647 | "plt.axis('tight')\n", 648 | "plt.ylabel('True Positive Rate')\n", 649 | "plt.xlabel('False Positive Rate')" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": { 655 | "_cell_guid": "10ae42f6-254e-6d33-a064-7751ba99e6a5", 656 | "_uuid": "30e0a3db82f30aec719add091c501baf30a69aee" 657 | }, 658 | "source": [ 659 | "# Logistic Regression(Tuned model)" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "metadata": { 666 | "_cell_guid": "e0858772-da3a-e0df-ae26-5be570167287", 667 | "_uuid": "8fb58f073c199a36e9fdb0ac888a17c7a06f079a" 668 | }, 669 | "outputs": [], 670 | "source": [ 671 | "from sklearn.linear_model import LogisticRegression\n", 672 | "from sklearn.model_selection import cross_val_score\n", 673 | "from sklearn import metrics\n", 674 | "\n", 675 | "LR_model= LogisticRegression()\n", 676 | "\n", 677 | "tuned_parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] ,\n", 678 | " 'penalty':['l1','l2']\n", 679 | " }" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": { 685 | "_cell_guid": "8140075f-2f17-91d3-a86f-c7d765cac500", 686 | "_uuid": "5ae0f30542def4c7db067dd5238528c3dae1dd70" 687 | }, 688 | "source": [ 689 | "**L1 and L2 are regularization parameters.They're used to avoid overfiting.Both L1 and L2 regularization prevents overfitting by shrinking (imposing a penalty) on the coefficients.** \n", 690 | " **L1 is the first moment norm |x1-x2| (|w| for regularization case) that is simply the absolute dıstance between two points where L2 is second moment norm corresponding to Eucledian Distance that is |x1-x2|^2 (|w|^2 for regularization case).** \n", 691 | " **In simple words,L2 (Ridge) shrinks all the coefficient by the same proportions but eliminates none, while L1 (Lasso) can shrink some coefficients to zero, performing variable selection.\n", 692 | "If all the features are correlated with the label, ridge outperforms lasso, as the coefficients are never zero in ridge. If only a subset of features are correlated with the label, lasso outperforms ridge as in lasso model some coefficient can be shrunken to zero.**" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": { 698 | "_cell_guid": "f802a8e6-9471-cd39-7fa5-8112e1bddf5c", 699 | "_uuid": "27a8b413d62cc10e48dc271656c35570ff54d6cd" 700 | }, 701 | "source": [ 702 | "### Taking a look at the correlation " 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": null, 708 | "metadata": { 709 | "_cell_guid": "be35cd37-d265-af2a-3c36-a699cca3ab12", 710 | "_uuid": "44be985ce957d155133f9b70417a2f837d1db3bc" 711 | }, 712 | "outputs": [], 713 | "source": [ 714 | "data.corr()" 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": { 720 | "_cell_guid": "30f38234-8a5b-4b16-b015-8f103d8eea97", 721 | "_uuid": "c0685973ec51c859681407c026e203521dea70ba" 722 | }, 723 | "source": [ 724 | "**The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the tuned_parameter.The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.**" 725 | ] 726 | }, 727 | { 728 | "cell_type": "code", 729 | "execution_count": null, 730 | "metadata": { 731 | "_cell_guid": "a076d8d6-4c09-6768-12c9-fad9cc09497b", 732 | "_uuid": "58050d7b167bc53244c9177d34a87be5694fb130" 733 | }, 734 | "outputs": [], 735 | "source": [ 736 | "from sklearn.model_selection import GridSearchCV\n", 737 | "\n", 738 | "LR= GridSearchCV(LR_model, tuned_parameters,cv=10)" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": { 745 | "_cell_guid": "15772de5-6f72-5a9c-3732-a5abea9db2d5", 746 | "_uuid": "63b3952d6831a8ad3275af2bfc10d0a28a2acba2" 747 | }, 748 | "outputs": [], 749 | "source": [ 750 | "LR.fit(X_train,y_train)" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": { 757 | "_cell_guid": "7863a821-1ac5-5376-2d91-d4e1969af94f", 758 | "_uuid": "fb92efeaeeb32fc4eb3a0467ef28263700ee409f" 759 | }, 760 | "outputs": [], 761 | "source": [ 762 | "print(LR.best_params_)" 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": null, 768 | "metadata": { 769 | "_cell_guid": "2242b89a-8495-1c07-db6c-c2f167a4be22", 770 | "_uuid": "1765707e72f09fe74c14a733cecaf5a1af48782c" 771 | }, 772 | "outputs": [], 773 | "source": [ 774 | "y_prob = LR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 775 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 776 | "LR.score(X_test, y_pred)" 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "metadata": { 783 | "_cell_guid": "60a50942-de06-9dc9-5118-d584bfa0b2eb", 784 | "_uuid": "2ea38ff65c47c6ceb8f81b57b8bd30746ab826c1" 785 | }, 786 | "outputs": [], 787 | "source": [ 788 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 789 | "confusion_matrix" 790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "execution_count": null, 795 | "metadata": { 796 | "_cell_guid": "5e486e4c-bde6-e849-bdfe-af9f9982185a", 797 | "_uuid": "a284298d724652f15cfb3c04ff12deff88afa766" 798 | }, 799 | "outputs": [], 800 | "source": [ 801 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 802 | "auc_roc" 803 | ] 804 | }, 805 | { 806 | "cell_type": "code", 807 | "execution_count": null, 808 | "metadata": { 809 | "_cell_guid": "f69d035c-5c51-fded-43a6-ec9c322a835c", 810 | "_uuid": "6e3585746cb60ba999290ee25c70afce10a94194" 811 | }, 812 | "outputs": [], 813 | "source": [ 814 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 815 | "auc_roc" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": { 822 | "_cell_guid": "8a9ab5d3-e42f-973d-927f-b0fd68b0ab38", 823 | "_uuid": "222756c0a785cd47345af6093d79624a55f6f440" 824 | }, 825 | "outputs": [], 826 | "source": [ 827 | "from sklearn.metrics import roc_curve, auc\n", 828 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 829 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 830 | "roc_auc" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": null, 836 | "metadata": { 837 | "_cell_guid": "500e2710-55e4-548e-34e2-43cad4375cf2", 838 | "_uuid": "17ddd6650e4cebe4ba37e7e14d535d127bf82af2" 839 | }, 840 | "outputs": [], 841 | "source": [ 842 | "\n", 843 | "import matplotlib.pyplot as plt\n", 844 | "plt.figure(figsize=(10,10))\n", 845 | "plt.title('Receiver Operating Characteristic')\n", 846 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 847 | "plt.legend(loc = 'lower right')\n", 848 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 849 | "plt.axis('tight')\n", 850 | "plt.ylabel('True Positive Rate')\n", 851 | "plt.xlabel('False Positive Rate')" 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": null, 857 | "metadata": { 858 | "_cell_guid": "3def21ce-7500-f16c-4283-607e07d851b2", 859 | "_uuid": "ab84b04b75cccb577a33226ebdb85d66ab74c434" 860 | }, 861 | "outputs": [], 862 | "source": [ 863 | "LR_ridge= LogisticRegression(penalty='l2')\n", 864 | "LR_ridge.fit(X_train,y_train)" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": null, 870 | "metadata": { 871 | "_cell_guid": "ac4b80d3-c4cb-1c6d-3755-dcc7af4f4172", 872 | "_uuid": "635c3c886ebd644d3464baeb169e2d3ed50d7bb8" 873 | }, 874 | "outputs": [], 875 | "source": [ 876 | "y_prob = LR_ridge.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 877 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 878 | "LR_ridge.score(X_test, y_pred)" 879 | ] 880 | }, 881 | { 882 | "cell_type": "code", 883 | "execution_count": null, 884 | "metadata": { 885 | "_cell_guid": "7bf05fdb-e10c-387f-edf8-99a87877e96a", 886 | "_uuid": "49a6834825de9fbb0ff622494dbeaf11bf0fc4cf" 887 | }, 888 | "outputs": [], 889 | "source": [ 890 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 891 | "confusion_matrix" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": null, 897 | "metadata": { 898 | "_cell_guid": "fc4f82d2-4bf5-0709-3b9a-6beef4f731ad", 899 | "_uuid": "b4bb30639e15eea4ee10e76ca70f850b95d4779b" 900 | }, 901 | "outputs": [], 902 | "source": [ 903 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 904 | "auc_roc" 905 | ] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "execution_count": null, 910 | "metadata": { 911 | "_cell_guid": "f69d29c0-d88f-e5a7-c995-5004e033b31b", 912 | "_uuid": "de9f23b9b331c979bdd12806614a916378b954cc" 913 | }, 914 | "outputs": [], 915 | "source": [ 916 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 917 | "auc_roc" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": null, 923 | "metadata": { 924 | "_cell_guid": "549cea57-3c1a-5636-f8ee-dd161cd05764", 925 | "_uuid": "89350b1d366b4f697f17382bfa25cfc7d9bff6a0" 926 | }, 927 | "outputs": [], 928 | "source": [ 929 | "\n", 930 | "from sklearn.metrics import roc_curve, auc\n", 931 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 932 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 933 | "roc_auc" 934 | ] 935 | }, 936 | { 937 | "cell_type": "code", 938 | "execution_count": null, 939 | "metadata": { 940 | "_cell_guid": "ab06322c-ffe0-5fba-8498-cd60bde01be5", 941 | "_uuid": "5f8a35b343dea66ac1a20914f11ed539cb75458a" 942 | }, 943 | "outputs": [], 944 | "source": [ 945 | "\n", 946 | "import matplotlib.pyplot as plt\n", 947 | "plt.figure(figsize=(10,10))\n", 948 | "plt.title('Receiver Operating Characteristic')\n", 949 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 950 | "plt.legend(loc = 'lower right')\n", 951 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 952 | "plt.axis('tight')\n", 953 | "plt.ylabel('True Positive Rate')\n", 954 | "plt.xlabel('False Positive Rate')" 955 | ] 956 | }, 957 | { 958 | "cell_type": "markdown", 959 | "metadata": { 960 | "_cell_guid": "8159a67f-f410-9007-06cd-f59e00dd106a", 961 | "_uuid": "d00fcfaa0c8253f191eda817cf28f2f9b406e949" 962 | }, 963 | "source": [ 964 | "# Gaussian Naive Bayes" 965 | ] 966 | }, 967 | { 968 | "cell_type": "code", 969 | "execution_count": null, 970 | "metadata": { 971 | "_cell_guid": "59acd771-44d1-c91f-c0ec-5402446db5c1", 972 | "_uuid": "dd03ab44e86b12c0a6801cf64ae6224645d2f84a" 973 | }, 974 | "outputs": [], 975 | "source": [ 976 | "from sklearn.naive_bayes import GaussianNB\n", 977 | "model_naive = GaussianNB()\n", 978 | "model_naive.fit(X_train, y_train)" 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": null, 984 | "metadata": { 985 | "_cell_guid": "454e7a22-1323-fa6a-d451-20e2aca84f23", 986 | "_uuid": "ad4ab023fa4413b9228a54cc075ee071467cf247" 987 | }, 988 | "outputs": [], 989 | "source": [ 990 | "y_prob = model_naive.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 991 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 992 | "model_naive.score(X_test, y_pred)" 993 | ] 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": null, 998 | "metadata": { 999 | "_cell_guid": "7c9facdd-4f7a-e378-d382-2bd64b4890ec", 1000 | "_uuid": "993ea36ffb05d942668705101de4ef2d9b511d3f" 1001 | }, 1002 | "outputs": [], 1003 | "source": [ 1004 | "print(\"Number of mislabeled points from %d points : %d\"\n", 1005 | " % (X_test.shape[0],(y_test!= y_pred).sum()))" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": null, 1011 | "metadata": { 1012 | "_cell_guid": "84675635-78a3-27e5-604e-f6da06ca06dc", 1013 | "_uuid": "e622518aea2380e8205b526da906ea7c37721c55" 1014 | }, 1015 | "outputs": [], 1016 | "source": [ 1017 | "scores = cross_val_score(model_naive, X, y, cv=10, scoring='accuracy')\n", 1018 | "print(scores)" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "code", 1023 | "execution_count": null, 1024 | "metadata": { 1025 | "_cell_guid": "7bbb2136-5c8d-ea94-30eb-8efe1c865c75", 1026 | "_uuid": "0607bd34f114a0d019f6acf390aeb30c7ce27d7b" 1027 | }, 1028 | "outputs": [], 1029 | "source": [ 1030 | "scores.mean()" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "code", 1035 | "execution_count": null, 1036 | "metadata": { 1037 | "_cell_guid": "624934f3-6f27-93f8-786c-6132112ddc20", 1038 | "_uuid": "b3e7a091ec12ded2d5b5eadc49a22e66da0c6406" 1039 | }, 1040 | "outputs": [], 1041 | "source": [ 1042 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 1043 | "confusion_matrix" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "code", 1048 | "execution_count": null, 1049 | "metadata": { 1050 | "_cell_guid": "4618ed82-cbf8-d1a6-d7a2-befe113077f0", 1051 | "_uuid": "1709d1a9cf0ee3e2ab18863fdf713a4fdaa5597c" 1052 | }, 1053 | "outputs": [], 1054 | "source": [ 1055 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 1056 | "auc_roc" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "code", 1061 | "execution_count": null, 1062 | "metadata": { 1063 | "_cell_guid": "91bdc65e-a302-6ab3-ff47-37d0ee726a53", 1064 | "_uuid": "604cfaa8076c60a4e675edaa0ca58e31c009c32d" 1065 | }, 1066 | "outputs": [], 1067 | "source": [ 1068 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 1069 | "auc_roc" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "code", 1074 | "execution_count": null, 1075 | "metadata": { 1076 | "_cell_guid": "3c6b655c-826f-50eb-4d4f-79c1525708b9", 1077 | "_uuid": "c6ea7ee6381d87ce613994d7e4670a21e27a2156" 1078 | }, 1079 | "outputs": [], 1080 | "source": [ 1081 | "from sklearn.metrics import roc_curve, auc\n", 1082 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 1083 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 1084 | "roc_auc" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": null, 1090 | "metadata": { 1091 | "_cell_guid": "cb33dfcd-950a-6372-3d96-2fc6fef3ec85", 1092 | "_uuid": "16e6b8d90415d94f14e834c8e702e2535b4484d8" 1093 | }, 1094 | "outputs": [], 1095 | "source": [ 1096 | "import matplotlib.pyplot as plt\n", 1097 | "plt.figure(figsize=(10,10))\n", 1098 | "plt.title('Receiver Operating Characteristic')\n", 1099 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 1100 | "plt.legend(loc = 'lower right')\n", 1101 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 1102 | "plt.axis('tight')\n", 1103 | "plt.ylabel('True Positive Rate')\n", 1104 | "plt.xlabel('False Positive Rate')" 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "markdown", 1109 | "metadata": { 1110 | "_cell_guid": "458e57b8-6935-176b-9211-83fce5d23b1a", 1111 | "_uuid": "4f0b1f9ad9c027e191cf22bbff957923c6199b05" 1112 | }, 1113 | "source": [ 1114 | "# Support Vector Machine" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "code", 1119 | "execution_count": null, 1120 | "metadata": { 1121 | "_cell_guid": "7625c838-94db-c5c3-3fe2-6ac22512f8c5", 1122 | "_uuid": "5a01be1060a9c92732777850c23c0382abbdf5a0" 1123 | }, 1124 | "outputs": [], 1125 | "source": [ 1126 | "from sklearn.svm import SVC\n", 1127 | "svm_model= SVC()" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "markdown", 1132 | "metadata": { 1133 | "_cell_guid": "1dff13e1-61f3-e9b5-a0d2-c275c208aa8e", 1134 | "_uuid": "44ebb631538f51956d4213469d7775aba707cb79" 1135 | }, 1136 | "source": [ 1137 | "The **gamma** parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The **gamma** parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.\n", 1138 | "\n", 1139 | "The **C** parameter trades off misclassification of training examples against simplicity of the decision surface. A low **C** makes the decision surface smooth, while a high **C** aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors." 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "markdown", 1144 | "metadata": { 1145 | "_cell_guid": "f4335b19-f669-2529-e9c8-3cd9d8f2f8ef", 1146 | "_uuid": "a9e8d5d99fd2e85c06b7684f1add399de1040324" 1147 | }, 1148 | "source": [ 1149 | "# Support Vector Machine without polynomial kernel" 1150 | ] 1151 | }, 1152 | { 1153 | "cell_type": "code", 1154 | "execution_count": null, 1155 | "metadata": { 1156 | "_cell_guid": "1acfd379-4698-8609-8d34-75568232c606", 1157 | "_uuid": "9df67297271abdc255bfb91ebdb666f2be296a22" 1158 | }, 1159 | "outputs": [], 1160 | "source": [ 1161 | "tuned_parameters = {\n", 1162 | " 'C': [1, 10, 100,500, 1000], 'kernel': ['linear','rbf'],\n", 1163 | " 'C': [1, 10, 100,500, 1000], 'gamma': [1,0.1,0.01,0.001, 0.0001], 'kernel': ['rbf'],\n", 1164 | " #'degree': [2,3,4,5,6] , 'C':[1,10,100,500,1000] , 'kernel':['poly']\n", 1165 | " }" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "markdown", 1170 | "metadata": { 1171 | "_cell_guid": "d8a28afe-9c3a-9d18-cbe1-4610bc6cc00b", 1172 | "_uuid": "306ed0241e2f157829b96a61ca70b0078018eed3" 1173 | }, 1174 | "source": [ 1175 | "The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the tuned_parameter**.The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.\n", 1176 | "But it is proving computationally expensive here.So I am opting for RandomizedSearchCV.\n", 1177 | "\n", 1178 | "RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:\n", 1179 | "1)A budget can be chosen independent of the number of parameters and possible values.\n", 1180 | "2)Adding parameters that do not influence the performance does not decrease efficiency." 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "code", 1185 | "execution_count": null, 1186 | "metadata": { 1187 | "_cell_guid": "41a288c2-3707-a9ab-6d35-5ba9878777bb", 1188 | "_uuid": "ffeefa854f40f1746ed61cdc71cecc697065ee46" 1189 | }, 1190 | "outputs": [], 1191 | "source": [ 1192 | "from sklearn.grid_search import RandomizedSearchCV\n", 1193 | "\n", 1194 | "model_svm = RandomizedSearchCV(svm_model, tuned_parameters,cv=10,scoring='accuracy',n_iter=20)" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": null, 1200 | "metadata": { 1201 | "_cell_guid": "6cf1ce3a-6b6d-083e-9b48-c24557b2e5ff", 1202 | "_uuid": "72195cd4b9d87a1035c7acb33434bc23a67d48b2" 1203 | }, 1204 | "outputs": [], 1205 | "source": [ 1206 | "model_svm.fit(X_train, y_train)\n", 1207 | "print(model_svm.best_score_)" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "code", 1212 | "execution_count": null, 1213 | "metadata": { 1214 | "_cell_guid": "fbab9fd3-65fe-3c00-27a5-0e9849b1a24d", 1215 | "_uuid": "4c3b0eb06708c2869aa4494d81a07ff78812075b" 1216 | }, 1217 | "outputs": [], 1218 | "source": [ 1219 | "print(model_svm.grid_scores_)" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": null, 1225 | "metadata": { 1226 | "_cell_guid": "c28fdf03-1aaf-081c-1c19-d7daae4f2169", 1227 | "_uuid": "cbf88dac5fb5d2d3c7e0d51a8a9f3e2785f1a99a" 1228 | }, 1229 | "outputs": [], 1230 | "source": [ 1231 | "print(model_svm.best_params_)" 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "code", 1236 | "execution_count": null, 1237 | "metadata": { 1238 | "_cell_guid": "e0ce400d-9a7c-9134-7bcc-30bfa23dbd72", 1239 | "_uuid": "65c9c8e5c6fdeafcb7457ae43f39c317acde6817" 1240 | }, 1241 | "outputs": [], 1242 | "source": [ 1243 | "\n", 1244 | "y_pred= model_svm.predict(X_test)\n", 1245 | "print(metrics.accuracy_score(y_pred,y_test))" 1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "code", 1250 | "execution_count": null, 1251 | "metadata": { 1252 | "_cell_guid": "a42b9916-f91b-2f90-dfae-64e876aeb6c5", 1253 | "_uuid": "7d0a0f60133afa53195b35529c8db150b945032c" 1254 | }, 1255 | "outputs": [], 1256 | "source": [ 1257 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 1258 | "confusion_matrix" 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "code", 1263 | "execution_count": null, 1264 | "metadata": { 1265 | "_cell_guid": "e5b968ae-af2a-d18c-8a0b-8aaa11bc3cd8", 1266 | "_uuid": "29b67753a55864e73baa55b1bad76c3c39367e01" 1267 | }, 1268 | "outputs": [], 1269 | "source": [ 1270 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 1271 | "auc_roc" 1272 | ] 1273 | }, 1274 | { 1275 | "cell_type": "code", 1276 | "execution_count": null, 1277 | "metadata": { 1278 | "_cell_guid": "89963240-7f86-3a73-5f18-86bc6dcc813a", 1279 | "_uuid": "243251bc506ec17358c2ae5ed4349572685165e1" 1280 | }, 1281 | "outputs": [], 1282 | "source": [ 1283 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 1284 | "auc_roc" 1285 | ] 1286 | }, 1287 | { 1288 | "cell_type": "code", 1289 | "execution_count": null, 1290 | "metadata": { 1291 | "_cell_guid": "39d255a4-4527-378d-6218-1cdd548d90e9", 1292 | "_uuid": "a0d55a17d593742b13a43ce90a6c8dbac71a48aa" 1293 | }, 1294 | "outputs": [], 1295 | "source": [ 1296 | "from sklearn.metrics import roc_curve, auc\n", 1297 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)\n", 1298 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 1299 | "roc_auc" 1300 | ] 1301 | }, 1302 | { 1303 | "cell_type": "code", 1304 | "execution_count": null, 1305 | "metadata": { 1306 | "_cell_guid": "05af8089-0adc-50fc-4140-6e842e6e8a66", 1307 | "_uuid": "a23dae7bd8cc44a628421d1b71722ff5a9d99ce6" 1308 | }, 1309 | "outputs": [], 1310 | "source": [ 1311 | "import matplotlib.pyplot as plt\n", 1312 | "plt.figure(figsize=(10,10))\n", 1313 | "plt.title('Receiver Operating Characteristic')\n", 1314 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 1315 | "plt.legend(loc = 'lower right')\n", 1316 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 1317 | "plt.axis('tight')\n", 1318 | "plt.ylabel('True Positive Rate')\n", 1319 | "plt.xlabel('False Positive Rate')" 1320 | ] 1321 | }, 1322 | { 1323 | "cell_type": "markdown", 1324 | "metadata": { 1325 | "_cell_guid": "c8453261-f76c-7134-64c8-e444aedda8f7", 1326 | "_uuid": "c467e0bb1484ae9994484bb1800f345aaf4937d7" 1327 | }, 1328 | "source": [ 1329 | "# Support Vector machine with polynomial Kernel" 1330 | ] 1331 | }, 1332 | { 1333 | "cell_type": "code", 1334 | "execution_count": null, 1335 | "metadata": { 1336 | "_cell_guid": "d38b7986-8e96-d201-0c0d-c4ea48488e59", 1337 | "_uuid": "4f7ab1a9e1d7e9f079cf1a39b62d48459e75456c" 1338 | }, 1339 | "outputs": [], 1340 | "source": [ 1341 | "tuned_parameters = {\n", 1342 | " 'C': [1, 10, 100,500, 1000], 'kernel': ['linear','rbf'],\n", 1343 | " 'C': [1, 10, 100,500, 1000], 'gamma': [1,0.1,0.01,0.001, 0.0001], 'kernel': ['rbf'],\n", 1344 | " 'degree': [2,3,4,5,6] , 'C':[1,10,100,500,1000] , 'kernel':['poly']\n", 1345 | " }" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "code", 1350 | "execution_count": null, 1351 | "metadata": { 1352 | "_cell_guid": "5e58725a-ec33-9f5b-060c-93e5116233dc", 1353 | "_uuid": "28ecbd118c80c5f1e3b6dc85e0e074615540f3e2" 1354 | }, 1355 | "outputs": [], 1356 | "source": [ 1357 | "from sklearn.grid_search import RandomizedSearchCV\n", 1358 | "\n", 1359 | "model_svm = RandomizedSearchCV(svm_model, tuned_parameters,cv=10,scoring='accuracy',n_iter=20)" 1360 | ] 1361 | }, 1362 | { 1363 | "cell_type": "code", 1364 | "execution_count": null, 1365 | "metadata": { 1366 | "_cell_guid": "8d83fa49-8da2-f561-f1e8-197fbe6fffa3", 1367 | "_uuid": "76ff20443055998b7435b4c7d0985b3b3b4a03b9" 1368 | }, 1369 | "outputs": [], 1370 | "source": [ 1371 | "model_svm.fit(X_train, y_train)\n", 1372 | "print(model_svm.best_score_)" 1373 | ] 1374 | }, 1375 | { 1376 | "cell_type": "code", 1377 | "execution_count": null, 1378 | "metadata": { 1379 | "_cell_guid": "81ff3083-2ccb-cc2f-df29-25a196593f0b", 1380 | "_uuid": "e0109029bcd549930e483fc1da5f93600cbfe217" 1381 | }, 1382 | "outputs": [], 1383 | "source": [ 1384 | "print(model_svm.grid_scores_)" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": null, 1390 | "metadata": { 1391 | "_cell_guid": "ac8aefd3-1164-f84f-94d2-f2e7ba8deb11", 1392 | "_uuid": "ba7320f66da892adca87815fdb15a094e155bf45" 1393 | }, 1394 | "outputs": [], 1395 | "source": [ 1396 | "print(model_svm.best_params_)" 1397 | ] 1398 | }, 1399 | { 1400 | "cell_type": "code", 1401 | "execution_count": null, 1402 | "metadata": { 1403 | "_cell_guid": "7e2b0071-81cd-73fd-28da-25cd70516611", 1404 | "_uuid": "4f953dc87b6595b8730ffb4d00861da672a77fd6" 1405 | }, 1406 | "outputs": [], 1407 | "source": [ 1408 | "y_pred= model_svm.predict(X_test)\n", 1409 | "print(metrics.accuracy_score(y_pred,y_test))" 1410 | ] 1411 | }, 1412 | { 1413 | "cell_type": "code", 1414 | "execution_count": null, 1415 | "metadata": { 1416 | "_cell_guid": "3ad7298e-af70-067e-7d5d-00a144d84a45", 1417 | "_uuid": "15991f769ca68a6ca005f5ba491ca0562b4e16e2" 1418 | }, 1419 | "outputs": [], 1420 | "source": [ 1421 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 1422 | "confusion_matrix" 1423 | ] 1424 | }, 1425 | { 1426 | "cell_type": "code", 1427 | "execution_count": null, 1428 | "metadata": { 1429 | "_cell_guid": "90cb8fc6-59e8-caa1-daa3-0e125ec25248", 1430 | "_uuid": "a96fa0e0b53e4436c8cc311bbfeb424e2e64e0ab" 1431 | }, 1432 | "outputs": [], 1433 | "source": [ 1434 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 1435 | "auc_roc" 1436 | ] 1437 | }, 1438 | { 1439 | "cell_type": "code", 1440 | "execution_count": null, 1441 | "metadata": { 1442 | "_cell_guid": "34497db1-b6ac-d7e5-1a04-1dc33e55cc47", 1443 | "_uuid": "e1c5fa87833a4cf1a8a5b6ab85afa7dd16a5027e" 1444 | }, 1445 | "outputs": [], 1446 | "source": [ 1447 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 1448 | "auc_roc" 1449 | ] 1450 | }, 1451 | { 1452 | "cell_type": "code", 1453 | "execution_count": null, 1454 | "metadata": { 1455 | "_cell_guid": "9b7af233-30d6-1bb0-ce31-dda8a44aeddf", 1456 | "_uuid": "6099ba4128128c15ecaee1194a01dbe5215d9336" 1457 | }, 1458 | "outputs": [], 1459 | "source": [ 1460 | "from sklearn.metrics import roc_curve, auc\n", 1461 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)\n", 1462 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 1463 | "roc_auc" 1464 | ] 1465 | }, 1466 | { 1467 | "cell_type": "code", 1468 | "execution_count": null, 1469 | "metadata": { 1470 | "_cell_guid": "25277b51-305a-4370-45b2-85ecf437e68b", 1471 | "_uuid": "24b83cec037ccfeb99378fc544153765a1e9cce8" 1472 | }, 1473 | "outputs": [], 1474 | "source": [ 1475 | "import matplotlib.pyplot as plt\n", 1476 | "plt.figure(figsize=(10,10))\n", 1477 | "plt.title('Receiver Operating Characteristic')\n", 1478 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 1479 | "plt.legend(loc = 'lower right')\n", 1480 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 1481 | "plt.axis('tight')\n", 1482 | "plt.ylabel('True Positive Rate')\n", 1483 | "plt.xlabel('False Positive Rate')" 1484 | ] 1485 | }, 1486 | { 1487 | "cell_type": "markdown", 1488 | "metadata": { 1489 | "_cell_guid": "16016745-71f6-9ce1-0595-912a4bc784db", 1490 | "_uuid": "48e4deeb6633eee4d5dce5d0bfd4f482babb419c" 1491 | }, 1492 | "source": [ 1493 | "### Trying default model" 1494 | ] 1495 | }, 1496 | { 1497 | "cell_type": "code", 1498 | "execution_count": null, 1499 | "metadata": { 1500 | "_cell_guid": "0456a449-6b1b-1208-bc8b-d0873517e327", 1501 | "_uuid": "1a4e730e75b0d397ba5ec17d2a2d8081c2f2679a" 1502 | }, 1503 | "outputs": [], 1504 | "source": [ 1505 | "from sklearn.ensemble import RandomForestClassifier\n", 1506 | "\n", 1507 | "model_RR=RandomForestClassifier()\n", 1508 | "\n", 1509 | "#tuned_parameters = {'min_samples_leaf': range(5,10,5), 'n_estimators' : range(50,200,50),\n", 1510 | " #'max_depth': range(5,15,5), 'max_features':range(5,20,5)\n", 1511 | " #}\n", 1512 | " " 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "code", 1517 | "execution_count": null, 1518 | "metadata": { 1519 | "_cell_guid": "d2e24231-f8a6-d34c-6edb-609c1ffaf4ec", 1520 | "_uuid": "c841e5b9234e07bd2d26a7b1031bac0742b4a569" 1521 | }, 1522 | "outputs": [], 1523 | "source": [ 1524 | "model_RR.fit(X_train,y_train)" 1525 | ] 1526 | }, 1527 | { 1528 | "cell_type": "code", 1529 | "execution_count": null, 1530 | "metadata": { 1531 | "_cell_guid": "3ef4d341-8f38-a8fa-d8a6-27bd44b6ec00", 1532 | "_uuid": "0bfb00c62d64ccd891028be7c73d1e5c440ce228" 1533 | }, 1534 | "outputs": [], 1535 | "source": [ 1536 | "y_prob = model_RR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 1537 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 1538 | "model_RR.score(X_test, y_pred)" 1539 | ] 1540 | }, 1541 | { 1542 | "cell_type": "code", 1543 | "execution_count": null, 1544 | "metadata": { 1545 | "_cell_guid": "125dc2ea-b909-fc61-114b-21e9f9f23625", 1546 | "_uuid": "464f1579069470508797410a0c16e2205918be2a" 1547 | }, 1548 | "outputs": [], 1549 | "source": [ 1550 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 1551 | "confusion_matrix" 1552 | ] 1553 | }, 1554 | { 1555 | "cell_type": "code", 1556 | "execution_count": null, 1557 | "metadata": { 1558 | "_cell_guid": "14f1721a-17db-0aa2-53d0-abd93e142c31", 1559 | "_uuid": "633c594998df978d4be1ca24eb575b7f84b2698e" 1560 | }, 1561 | "outputs": [], 1562 | "source": [ 1563 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 1564 | "auc_roc" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "code", 1569 | "execution_count": null, 1570 | "metadata": { 1571 | "_cell_guid": "a8248452-b6e5-6479-e666-537927eb89dd", 1572 | "_uuid": "75b9780eaa4b90a977520893384a85ae8111cadf" 1573 | }, 1574 | "outputs": [], 1575 | "source": [ 1576 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 1577 | "auc_roc" 1578 | ] 1579 | }, 1580 | { 1581 | "cell_type": "code", 1582 | "execution_count": null, 1583 | "metadata": { 1584 | "_cell_guid": "c7a51e02-cf46-e6a9-80bd-96a4eaefaa42", 1585 | "_uuid": "3d8cc940f92a5d68e99485285379d761e74b9805" 1586 | }, 1587 | "outputs": [], 1588 | "source": [ 1589 | "from sklearn.metrics import roc_curve, auc\n", 1590 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 1591 | "roc_auc = auc(false_positive_rate, true_positive_rate)" 1592 | ] 1593 | }, 1594 | { 1595 | "cell_type": "code", 1596 | "execution_count": null, 1597 | "metadata": { 1598 | "_cell_guid": "76f4605b-4aa7-3689-3878-76f64d987e52", 1599 | "_uuid": "fda943769c7aeceb4b09653f0c148ab9966d14cd" 1600 | }, 1601 | "outputs": [], 1602 | "source": [ 1603 | "import matplotlib.pyplot as plt\n", 1604 | "plt.figure(figsize=(10,10))\n", 1605 | "plt.title('Receiver Operating Characteristic')\n", 1606 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 1607 | "plt.legend(loc = 'lower right')\n", 1608 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 1609 | "plt.axis('tight')\n", 1610 | "plt.ylabel('True Positive Rate')\n", 1611 | "plt.xlabel('False Positive Rate')" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "markdown", 1616 | "metadata": { 1617 | "_cell_guid": "ad487047-3130-5932-0ce5-2f7a47851a4c", 1618 | "_uuid": "fafdbf6eac43627a640a5f84cdf059bdf7eef3a8" 1619 | }, 1620 | "source": [ 1621 | "### Thus default Random forest model is giving us best accuracy." 1622 | ] 1623 | }, 1624 | { 1625 | "cell_type": "markdown", 1626 | "metadata": { 1627 | "_cell_guid": "ae95e479-db31-0877-2143-4a8467e78fba", 1628 | "_uuid": "a3fa2c65f7cc0b7e1c9071b1def76c06290e212b" 1629 | }, 1630 | "source": [ 1631 | "### Let us tuned the parameters of Random Forest just for the purpose of knowledge" 1632 | ] 1633 | }, 1634 | { 1635 | "cell_type": "markdown", 1636 | "metadata": { 1637 | "_cell_guid": "fe230638-8d8d-e978-a623-705568d8c0eb", 1638 | "_uuid": "95a514e40a04566d2dfb1e197b2ca7993458565e" 1639 | }, 1640 | "source": [ 1641 | "**There are 3 features which can be tuned to improve the performance of Random Forest** \n", 1642 | "\n", 1643 | "**1) max_features 2) n_estimators 3) min_sample_leaf**" 1644 | ] 1645 | }, 1646 | { 1647 | "cell_type": "markdown", 1648 | "metadata": { 1649 | "_cell_guid": "f4e6558a-16b5-b198-c83f-dd834b52c7e8", 1650 | "_uuid": "7dbdbada2b5bf46d12a66fb38418c334b4d163ad" 1651 | }, 1652 | "source": [ 1653 | " **A)max_features**: These are the maximum number of features Random Forest is allowed to try in individual tree.\n", 1654 | "**1)Auto** : This will simply take all the features which make sense in every tree.Here we simply do not put any restrictions on the individual tree.\n", 1655 | "**2)sqrt** : This option will take square root of the total number of features in individual run. For instance, if the total number of variables are 100, we can only take 10 of them in individual tree.\n", 1656 | "**3)log2**:It is another option which takes log to the base 2 of the features input.\n", 1657 | "\n", 1658 | "**Increasing max_features generally improves the performance of the model as at each node now we have a higher number of options to be considered.But, for sure, you decrease the speed of algorithm by increasing the max_features. Hence, you need to strike the right balance and choose the optimal max_features.**" 1659 | ] 1660 | }, 1661 | { 1662 | "cell_type": "markdown", 1663 | "metadata": { 1664 | "_cell_guid": "14de4e9c-4a3e-1b1e-d036-f023e690d95c", 1665 | "_uuid": "a3c8c75fd16c1c83bb2f069da829e11fd603da5c" 1666 | }, 1667 | "source": [ 1668 | "**B) n_estimators** : This is the number of trees you want to build before taking the maximum voting or averages of predictions. Higher number of trees give you better performance but makes your code slower. You should choose as high value as your processor can handle because this makes your predictions stronger and more stable." 1669 | ] 1670 | }, 1671 | { 1672 | "cell_type": "markdown", 1673 | "metadata": { 1674 | "_cell_guid": "96f0f477-afe7-fc86-beab-75e85ecab3f3", 1675 | "_uuid": "564cf0e86c26b6050b919d484967d3adcd88f0a5" 1676 | }, 1677 | "source": [ 1678 | "**C)min_sample_leaf**: Leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in train data. Hence it is important to try different values to get good estimate." 1679 | ] 1680 | }, 1681 | { 1682 | "cell_type": "code", 1683 | "execution_count": null, 1684 | "metadata": { 1685 | "_cell_guid": "60ef6e9a-dcf4-e604-2d81-ef6a0c61492d", 1686 | "_uuid": "468e0ce62a833fb2f016e9bf065196f1f0458fdd" 1687 | }, 1688 | "outputs": [], 1689 | "source": [ 1690 | "from sklearn.ensemble import RandomForestClassifier\n", 1691 | "\n", 1692 | "model_RR=RandomForestClassifier()\n", 1693 | "\n", 1694 | "tuned_parameters = {'min_samples_leaf': range(10,100,10), 'n_estimators' : range(10,100,10),\n", 1695 | " 'max_features':['auto','sqrt','log2']\n", 1696 | " }\n", 1697 | " " 1698 | ] 1699 | }, 1700 | { 1701 | "cell_type": "markdown", 1702 | "metadata": { 1703 | "_cell_guid": "07e5baa5-9656-30b8-5b42-49894aa80442", 1704 | "_uuid": "04d170809ba881c8d2ce0882144cbaed755e2b8b" 1705 | }, 1706 | "source": [ 1707 | "### n_jobs\n", 1708 | "**This parameter tells the engine how many processors is it allowed to use. A value of “-1” means there is no restriction whereas a value of “1” means it can only use one processor.**" 1709 | ] 1710 | }, 1711 | { 1712 | "cell_type": "code", 1713 | "execution_count": null, 1714 | "metadata": { 1715 | "_cell_guid": "b333b5c3-785c-3fee-b089-b75ec777e5c2", 1716 | "_uuid": "88cef3ae9e85794e6d81ab8c97c478b8f3c488b6" 1717 | }, 1718 | "outputs": [], 1719 | "source": [ 1720 | "from sklearn.grid_search import RandomizedSearchCV\n", 1721 | "\n", 1722 | "RR_model= RandomizedSearchCV(model_RR, tuned_parameters,cv=10,scoring='accuracy',n_iter=20,n_jobs= -1)" 1723 | ] 1724 | }, 1725 | { 1726 | "cell_type": "code", 1727 | "execution_count": null, 1728 | "metadata": { 1729 | "_cell_guid": "f7221d0c-d309-cc8e-31e0-e872438733b1", 1730 | "_uuid": "fb2c22904cfff4d4a100a5033647636e79a744e6" 1731 | }, 1732 | "outputs": [], 1733 | "source": [ 1734 | "RR_model.fit(X_train,y_train)" 1735 | ] 1736 | }, 1737 | { 1738 | "cell_type": "code", 1739 | "execution_count": null, 1740 | "metadata": { 1741 | "_cell_guid": "5bed2aca-9482-1dc9-bd11-25ff1ce05fb1", 1742 | "_uuid": "368e6a3dbcabc4fb27a692d06fe522953f8a1b7b" 1743 | }, 1744 | "outputs": [], 1745 | "source": [ 1746 | "print(RR_model.grid_scores_)" 1747 | ] 1748 | }, 1749 | { 1750 | "cell_type": "code", 1751 | "execution_count": null, 1752 | "metadata": { 1753 | "_cell_guid": "a4553b00-8a25-09b1-fa6a-f8a6e8dbd177", 1754 | "_uuid": "502ac030a68e29757464f9757342baafb041e0df" 1755 | }, 1756 | "outputs": [], 1757 | "source": [ 1758 | "print(RR_model.best_score_)" 1759 | ] 1760 | }, 1761 | { 1762 | "cell_type": "code", 1763 | "execution_count": null, 1764 | "metadata": { 1765 | "_cell_guid": "02255028-9468-efdb-1da8-3d9458e7b9a7", 1766 | "_uuid": "79568934ac1eebe10074fccdcd22b7d319bb20e4" 1767 | }, 1768 | "outputs": [], 1769 | "source": [ 1770 | "print(RR_model.best_params_)" 1771 | ] 1772 | }, 1773 | { 1774 | "cell_type": "code", 1775 | "execution_count": null, 1776 | "metadata": { 1777 | "_cell_guid": "5feb4edc-6c89-b982-ffce-92ee75577da4", 1778 | "_uuid": "4bbb82cdfe737ed6c90c0d2851ea0435c59ecd94" 1779 | }, 1780 | "outputs": [], 1781 | "source": [ 1782 | "y_prob = RR_model.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 1783 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 1784 | "RR_model.score(X_test, y_pred)" 1785 | ] 1786 | }, 1787 | { 1788 | "cell_type": "code", 1789 | "execution_count": null, 1790 | "metadata": { 1791 | "_cell_guid": "f4e72487-3075-97a9-3ec3-173d5fe2f887", 1792 | "_uuid": "e9e8337b6b9295ca3d437cb3f4035eccad5518e3" 1793 | }, 1794 | "outputs": [], 1795 | "source": [ 1796 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 1797 | "confusion_matrix" 1798 | ] 1799 | }, 1800 | { 1801 | "cell_type": "code", 1802 | "execution_count": null, 1803 | "metadata": { 1804 | "_cell_guid": "582e0f14-8f68-c8ab-e2c9-f3fc6f51744b", 1805 | "_uuid": "945275fba35d2cd528051c2550e8e119233a9021" 1806 | }, 1807 | "outputs": [], 1808 | "source": [ 1809 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 1810 | "auc_roc" 1811 | ] 1812 | }, 1813 | { 1814 | "cell_type": "code", 1815 | "execution_count": null, 1816 | "metadata": { 1817 | "_cell_guid": "14df7ec6-e138-d5b1-a9f5-0ec73eeb46f3", 1818 | "_uuid": "7a9a8a023a5243e2231a3408308b2adb6ceae275" 1819 | }, 1820 | "outputs": [], 1821 | "source": [ 1822 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 1823 | "auc_roc" 1824 | ] 1825 | }, 1826 | { 1827 | "cell_type": "code", 1828 | "execution_count": null, 1829 | "metadata": { 1830 | "_cell_guid": "5230b9e7-2f66-d169-5991-0823af6c3028", 1831 | "_uuid": "7c5fbcf1fb7cc5251ed2a22606a0fb8a94f541d8" 1832 | }, 1833 | "outputs": [], 1834 | "source": [ 1835 | "\n", 1836 | "from sklearn.metrics import roc_curve, auc\n", 1837 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 1838 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 1839 | "roc_auc" 1840 | ] 1841 | }, 1842 | { 1843 | "cell_type": "code", 1844 | "execution_count": null, 1845 | "metadata": { 1846 | "_cell_guid": "e16d56b7-0ae4-8991-cd58-5a3772d6d7f1", 1847 | "_uuid": "c1faa2317af2af3a8a2c47b35cde6a9e8efe9bb2" 1848 | }, 1849 | "outputs": [], 1850 | "source": [ 1851 | "import matplotlib.pyplot as plt\n", 1852 | "plt.figure(figsize=(10,10))\n", 1853 | "plt.title('Receiver Operating Characteristic')\n", 1854 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 1855 | "plt.legend(loc = 'lower right')\n", 1856 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 1857 | "plt.axis('tight')\n", 1858 | "plt.ylabel('True Positive Rate')\n", 1859 | "plt.xlabel('False Positive Rate')" 1860 | ] 1861 | }, 1862 | { 1863 | "cell_type": "markdown", 1864 | "metadata": { 1865 | "_cell_guid": "5973819d-4c75-576e-59a3-732fcd22157f", 1866 | "_uuid": "28d122390fc2042015157cdc68b8648be3d25528" 1867 | }, 1868 | "source": [ 1869 | "### Default Decision Tree model" 1870 | ] 1871 | }, 1872 | { 1873 | "cell_type": "code", 1874 | "execution_count": null, 1875 | "metadata": { 1876 | "_cell_guid": "2ebd85f1-6ad9-21be-ab6a-05905df317b0", 1877 | "_uuid": "fe961b76b2aaf55e5e71d8507a9cd265aa6d84e5" 1878 | }, 1879 | "outputs": [], 1880 | "source": [ 1881 | "from sklearn.tree import DecisionTreeClassifier\n", 1882 | "\n", 1883 | "model_tree = DecisionTreeClassifier()" 1884 | ] 1885 | }, 1886 | { 1887 | "cell_type": "code", 1888 | "execution_count": null, 1889 | "metadata": { 1890 | "_cell_guid": "1446f642-8c09-76e2-18f9-871e88643083", 1891 | "_uuid": "b4c51d484ef474a6d654663840dc25708033b6e3" 1892 | }, 1893 | "outputs": [], 1894 | "source": [ 1895 | "model_tree.fit(X_train, y_train)" 1896 | ] 1897 | }, 1898 | { 1899 | "cell_type": "code", 1900 | "execution_count": null, 1901 | "metadata": { 1902 | "_cell_guid": "b1b91a2c-296f-f301-14b4-9d2648592a87", 1903 | "_uuid": "51019150d4f4d1e492056b04bbc92cbfb497df56" 1904 | }, 1905 | "outputs": [], 1906 | "source": [ 1907 | "y_prob = model_tree.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 1908 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 1909 | "model_tree.score(X_test, y_pred)" 1910 | ] 1911 | }, 1912 | { 1913 | "cell_type": "code", 1914 | "execution_count": null, 1915 | "metadata": { 1916 | "_cell_guid": "aa6a00dc-7e3c-319a-5dcd-3ac083e79479", 1917 | "_uuid": "f7a06378b5f114f1c300bbb9da207c3cbca6be3f" 1918 | }, 1919 | "outputs": [], 1920 | "source": [ 1921 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 1922 | "confusion_matrix" 1923 | ] 1924 | }, 1925 | { 1926 | "cell_type": "code", 1927 | "execution_count": null, 1928 | "metadata": { 1929 | "_cell_guid": "ff010249-5dae-c9b4-197f-6c843ae97903", 1930 | "_uuid": "46e7f42cf5b5635a5475ff75e7ec77eda53baf50" 1931 | }, 1932 | "outputs": [], 1933 | "source": [ 1934 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 1935 | "auc_roc" 1936 | ] 1937 | }, 1938 | { 1939 | "cell_type": "code", 1940 | "execution_count": null, 1941 | "metadata": { 1942 | "_cell_guid": "05adf0ac-898e-1aa3-c569-00b8dec3e263", 1943 | "_uuid": "88998a15264664fefc9eeb2e013504ba88d9a5ca" 1944 | }, 1945 | "outputs": [], 1946 | "source": [ 1947 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 1948 | "auc_roc" 1949 | ] 1950 | }, 1951 | { 1952 | "cell_type": "code", 1953 | "execution_count": null, 1954 | "metadata": { 1955 | "_cell_guid": "db4b8464-8e5c-4eb3-06f8-f72382b6941f", 1956 | "_uuid": "dc033e1e4eabf2ebd60174dffc42a38b6e43da11" 1957 | }, 1958 | "outputs": [], 1959 | "source": [ 1960 | "from sklearn.metrics import roc_curve, auc\n", 1961 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 1962 | "roc_auc = auc(false_positive_rate, true_positive_rate)" 1963 | ] 1964 | }, 1965 | { 1966 | "cell_type": "code", 1967 | "execution_count": null, 1968 | "metadata": { 1969 | "_cell_guid": "f6860830-c215-377f-c34c-5df3ad1142da", 1970 | "_uuid": "ee3ca580b17682ca24d77f07cea1fd72764aa0bc" 1971 | }, 1972 | "outputs": [], 1973 | "source": [ 1974 | "import matplotlib.pyplot as plt\n", 1975 | "plt.figure(figsize=(10,10))\n", 1976 | "plt.title('Receiver Operating Characteristic')\n", 1977 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 1978 | "plt.legend(loc = 'lower right')\n", 1979 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 1980 | "plt.axis('tight')\n", 1981 | "plt.ylabel('True Positive Rate')\n", 1982 | "plt.xlabel('False Positive Rate')" 1983 | ] 1984 | }, 1985 | { 1986 | "cell_type": "markdown", 1987 | "metadata": { 1988 | "_cell_guid": "ee50bd8b-735c-d65d-71c9-bac21bed725a", 1989 | "_uuid": "340128e0928ea69089c7d3c74b7ee562435d5e33" 1990 | }, 1991 | "source": [ 1992 | "### Thus default decision tree model is giving us best accuracy score" 1993 | ] 1994 | }, 1995 | { 1996 | "cell_type": "markdown", 1997 | "metadata": { 1998 | "_cell_guid": "c2dff12c-8a12-33a5-16a3-b57490d0fbf4", 1999 | "_uuid": "89a9443d54e4b9f4b3a13109a2b6639a6929003c" 2000 | }, 2001 | "source": [ 2002 | "### Let us tune the hyperparameters of the Decision tree model" 2003 | ] 2004 | }, 2005 | { 2006 | "cell_type": "markdown", 2007 | "metadata": { 2008 | "_cell_guid": "bb16436e-ad87-2a0f-3d68-fbed40f5ed82", 2009 | "_uuid": "557a9535cc660cf2d1a44135057736842204faca" 2010 | }, 2011 | "source": [ 2012 | "**1)Criterion:** Decision trees use multiple algorithms to decide to split a node in two or more sub-nodes.Decision tree splits the nodes on all available variables and then selects the split which results in most homogeneous sub-nodes. The details of Gini and entropy needs detail explanation.\n", 2013 | "\n", 2014 | "2)**max_depth(Maximum depth of tree (vertical depth)):**\n", 2015 | "Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.\n", 2016 | "\n", 2017 | "**max_features** and **min_samples_leaf** is same as Random Forest classifier" 2018 | ] 2019 | }, 2020 | { 2021 | "cell_type": "code", 2022 | "execution_count": null, 2023 | "metadata": { 2024 | "_cell_guid": "471fc8d9-35a3-e59a-7e3a-791f56b15381", 2025 | "_uuid": "b0fe4275c53542068e5b11efd4fc0f72d2e565dd" 2026 | }, 2027 | "outputs": [], 2028 | "source": [ 2029 | "from sklearn.tree import DecisionTreeClassifier\n", 2030 | "\n", 2031 | "model_DD = DecisionTreeClassifier()\n", 2032 | "\n", 2033 | "\n", 2034 | "tuned_parameters= {'criterion': ['gini','entropy'], 'max_features': [\"auto\",\"sqrt\",\"log2\"],\n", 2035 | " 'min_samples_leaf': range(1,100,1) , 'max_depth': range(1,50,1)\n", 2036 | " }\n", 2037 | " " 2038 | ] 2039 | }, 2040 | { 2041 | "cell_type": "code", 2042 | "execution_count": null, 2043 | "metadata": { 2044 | "_cell_guid": "6e292088-c9c9-32b2-050b-d871ee7e15b2", 2045 | "_uuid": "7366a2a44ba819ab45438d1a0cb7b41bc7206976" 2046 | }, 2047 | "outputs": [], 2048 | "source": [ 2049 | "from sklearn.grid_search import RandomizedSearchCV\n", 2050 | "DD_model= RandomizedSearchCV(model_DD, tuned_parameters,cv=10,scoring='accuracy',n_iter=20,n_jobs= -1,random_state=5)" 2051 | ] 2052 | }, 2053 | { 2054 | "cell_type": "code", 2055 | "execution_count": null, 2056 | "metadata": { 2057 | "_cell_guid": "60b59643-c853-6607-6c9e-c58b16fc4c76", 2058 | "_uuid": "efa25fecff80965d008ae7c69251b4b4e4aea6de" 2059 | }, 2060 | "outputs": [], 2061 | "source": [ 2062 | "DD_model.fit(X_train, y_train)" 2063 | ] 2064 | }, 2065 | { 2066 | "cell_type": "code", 2067 | "execution_count": null, 2068 | "metadata": { 2069 | "_cell_guid": "d348db12-1afd-9c30-83b2-b01f94469d9c", 2070 | "_uuid": "a3cb5925f02d219e3e9e078c5f17e6f9ff640349" 2071 | }, 2072 | "outputs": [], 2073 | "source": [ 2074 | "print(DD_model.grid_scores_)" 2075 | ] 2076 | }, 2077 | { 2078 | "cell_type": "code", 2079 | "execution_count": null, 2080 | "metadata": { 2081 | "_cell_guid": "97affd45-ad62-915f-38e8-6b41149729dc", 2082 | "_uuid": "f374f78dd8813abb236605848d0cdfe2bf55b066" 2083 | }, 2084 | "outputs": [], 2085 | "source": [ 2086 | "print(DD_model.best_score_)" 2087 | ] 2088 | }, 2089 | { 2090 | "cell_type": "code", 2091 | "execution_count": null, 2092 | "metadata": { 2093 | "_cell_guid": "896cb38d-4b2f-bad4-631e-e399be807817", 2094 | "_uuid": "7c02757987a25750cc41a6a91c6c8b94f180705a" 2095 | }, 2096 | "outputs": [], 2097 | "source": [ 2098 | "print(DD_model.best_params_)" 2099 | ] 2100 | }, 2101 | { 2102 | "cell_type": "code", 2103 | "execution_count": null, 2104 | "metadata": { 2105 | "_cell_guid": "47b9dfa0-e529-abbd-95fb-80cebc41b81b", 2106 | "_uuid": "0d52e5be564ed3b638e29e4487160980e3efd00b" 2107 | }, 2108 | "outputs": [], 2109 | "source": [ 2110 | "y_prob = DD_model.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 2111 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 2112 | "DD_model.score(X_test, y_pred)" 2113 | ] 2114 | }, 2115 | { 2116 | "cell_type": "code", 2117 | "execution_count": null, 2118 | "metadata": { 2119 | "_cell_guid": "515e2157-6408-6e85-399a-e6b56fa82157", 2120 | "_uuid": "0052924c4ef71dde902ce6815f6c80e182e70f5d" 2121 | }, 2122 | "outputs": [], 2123 | "source": [ 2124 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 2125 | "confusion_matrix" 2126 | ] 2127 | }, 2128 | { 2129 | "cell_type": "code", 2130 | "execution_count": null, 2131 | "metadata": { 2132 | "_cell_guid": "07d47b4c-a626-0fb7-b96f-07f8ddb8cd95", 2133 | "_uuid": "db421b92da8a36fae039ebe48cd2e05e04f74f24" 2134 | }, 2135 | "outputs": [], 2136 | "source": [ 2137 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 2138 | "auc_roc" 2139 | ] 2140 | }, 2141 | { 2142 | "cell_type": "code", 2143 | "execution_count": null, 2144 | "metadata": { 2145 | "_cell_guid": "5fb54b29-2502-3e1a-f8a8-78a3ba5bad06", 2146 | "_uuid": "5f78ba5fbda830aaa39f7452d2676a91a1871bb3" 2147 | }, 2148 | "outputs": [], 2149 | "source": [ 2150 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 2151 | "auc_roc" 2152 | ] 2153 | }, 2154 | { 2155 | "cell_type": "code", 2156 | "execution_count": null, 2157 | "metadata": { 2158 | "_cell_guid": "950dc38c-cfcb-b0a0-3b6c-e46cac880b7f", 2159 | "_uuid": "8891efe2cc1f786f90e153ae8145e360c3119df0" 2160 | }, 2161 | "outputs": [], 2162 | "source": [ 2163 | "from sklearn.metrics import roc_curve, auc\n", 2164 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 2165 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 2166 | "roc_auc" 2167 | ] 2168 | }, 2169 | { 2170 | "cell_type": "code", 2171 | "execution_count": null, 2172 | "metadata": { 2173 | "_cell_guid": "f895a8c3-1c2e-0805-8764-3e57a260651c", 2174 | "_uuid": "f613e8d18b5daef554c615e03beee52a18f207f1" 2175 | }, 2176 | "outputs": [], 2177 | "source": [ 2178 | "import matplotlib.pyplot as plt\n", 2179 | "plt.figure(figsize=(10,10))\n", 2180 | "plt.title('Receiver Operating Characteristic')\n", 2181 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 2182 | "plt.legend(loc = 'lower right')\n", 2183 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 2184 | "plt.axis('tight')\n", 2185 | "plt.ylabel('True Positive Rate')\n", 2186 | "plt.xlabel('False Positive Rate')" 2187 | ] 2188 | }, 2189 | { 2190 | "cell_type": "markdown", 2191 | "metadata": { 2192 | "_cell_guid": "410f17b3-d60a-10e8-df53-19cc869894e4", 2193 | "_uuid": "cc3261a2eeb01c0fe53814338eee7d9e4f2ee7f0" 2194 | }, 2195 | "source": [ 2196 | "## Neural Network" 2197 | ] 2198 | }, 2199 | { 2200 | "cell_type": "markdown", 2201 | "metadata": { 2202 | "_cell_guid": "e4e497d7-78e2-1f59-4236-d1d543ddb632", 2203 | "_uuid": "2d155cb1762988d9849054d4615e87ee32113336" 2204 | }, 2205 | "source": [ 2206 | "### Applying default Neural Network model" 2207 | ] 2208 | }, 2209 | { 2210 | "cell_type": "code", 2211 | "execution_count": null, 2212 | "metadata": { 2213 | "_cell_guid": "b50b16ac-705b-523e-0125-1f8280087879", 2214 | "_uuid": "e0dd38ce8b8f7631192b3b8b2f715e9c3977b8a9" 2215 | }, 2216 | "outputs": [], 2217 | "source": [ 2218 | "from sklearn.neural_network import MLPClassifier" 2219 | ] 2220 | }, 2221 | { 2222 | "cell_type": "code", 2223 | "execution_count": null, 2224 | "metadata": { 2225 | "_cell_guid": "488e2203-595a-35f1-342f-7ff9a32a864e", 2226 | "_uuid": "301f9cb41f2c140e1485d53bf0c8279550db5afa" 2227 | }, 2228 | "outputs": [], 2229 | "source": [ 2230 | "mlp = MLPClassifier()\n", 2231 | "mlp.fit(X_train,y_train)" 2232 | ] 2233 | }, 2234 | { 2235 | "cell_type": "code", 2236 | "execution_count": null, 2237 | "metadata": { 2238 | "_cell_guid": "3eba1830-b130-2afc-45d6-7f9e182d3bf5", 2239 | "_uuid": "b14ff9d1d56a63095ab452e0fae45dbb8bd3a081" 2240 | }, 2241 | "outputs": [], 2242 | "source": [ 2243 | "y_prob = mlp.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 2244 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 2245 | "mlp.score(X_test, y_pred)" 2246 | ] 2247 | }, 2248 | { 2249 | "cell_type": "code", 2250 | "execution_count": null, 2251 | "metadata": { 2252 | "_cell_guid": "df1fdb7e-4d0f-e42d-4046-878d8420699d", 2253 | "_uuid": "9d1d9b2ea0b11104f9dd71f867f576508f1b07b1" 2254 | }, 2255 | "outputs": [], 2256 | "source": [ 2257 | "confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 2258 | "confusion_matrix" 2259 | ] 2260 | }, 2261 | { 2262 | "cell_type": "code", 2263 | "execution_count": null, 2264 | "metadata": { 2265 | "_cell_guid": "b4f20678-d020-9c78-a794-b1b924c882bb", 2266 | "_uuid": "0f0859e946de3caed1173e28b152527b304080f2" 2267 | }, 2268 | "outputs": [], 2269 | "source": [ 2270 | "auc_roc=metrics.classification_report(y_test,y_pred)\n", 2271 | "auc_roc" 2272 | ] 2273 | }, 2274 | { 2275 | "cell_type": "code", 2276 | "execution_count": null, 2277 | "metadata": { 2278 | "_cell_guid": "6c5392b2-8de2-8686-809c-d545d9c318a4", 2279 | "_uuid": "9b125c1bab05d6bb25fcdaeaa8e5871c36df2515" 2280 | }, 2281 | "outputs": [], 2282 | "source": [ 2283 | "auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 2284 | "auc_roc" 2285 | ] 2286 | }, 2287 | { 2288 | "cell_type": "code", 2289 | "execution_count": null, 2290 | "metadata": { 2291 | "_cell_guid": "ec742bf2-a75f-6482-f9a0-e2156c67a04a", 2292 | "_uuid": "3c1981637a6e4604bbfc763dc2ab410c66a3f30b" 2293 | }, 2294 | "outputs": [], 2295 | "source": [ 2296 | "from sklearn.metrics import roc_curve, auc\n", 2297 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 2298 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 2299 | "roc_auc" 2300 | ] 2301 | }, 2302 | { 2303 | "cell_type": "code", 2304 | "execution_count": null, 2305 | "metadata": { 2306 | "_cell_guid": "a20af212-174c-bd5a-8465-6a980ca1dd32", 2307 | "_uuid": "a250099cc6ea45e6976f09e955b0698f9ea31cc1" 2308 | }, 2309 | "outputs": [], 2310 | "source": [ 2311 | "import matplotlib.pyplot as plt\n", 2312 | "plt.figure(figsize=(10,10))\n", 2313 | "plt.title('Receiver Operating Characteristic')\n", 2314 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 2315 | "plt.legend(loc = 'lower right')\n", 2316 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 2317 | "plt.axis('tight')\n", 2318 | "plt.ylabel('True Positive Rate')\n", 2319 | "plt.xlabel('False Positive Rate')" 2320 | ] 2321 | }, 2322 | { 2323 | "cell_type": "markdown", 2324 | "metadata": { 2325 | "_cell_guid": "71b42e1b-5e31-e576-1459-fcad4ac140d7", 2326 | "_uuid": "b6984b27c54df935ff0ca2154e77455c8fe4b508" 2327 | }, 2328 | "source": [ 2329 | "### Tuning the hyperparameters of the neural network" 2330 | ] 2331 | }, 2332 | { 2333 | "cell_type": "markdown", 2334 | "metadata": { 2335 | "_cell_guid": "2447b4bd-fd97-ef55-1e68-454e686a4889", 2336 | "_uuid": "a2e3389e57e4d78c075211dfb3a11e78ef50816e" 2337 | }, 2338 | "source": [ 2339 | "It is turning out to be computationally expensive for me with tuned model. Hence I am not running this. Also any suggestion to improvise it is welcome. :)" 2340 | ] 2341 | }, 2342 | { 2343 | "cell_type": "markdown", 2344 | "metadata": { 2345 | "_cell_guid": "bd096438-f5d3-1f62-41a7-03b4f42be394", 2346 | "_uuid": "b8e8107b5b7f27a18297a48ec08151a7ba756c7a" 2347 | }, 2348 | "source": [ 2349 | "**1) hidden_layer_sizes** : Number of hidden layers in the network.(default is 100).Large number may overfit the data.\n", 2350 | "\n", 2351 | "**2)activation**: Activation function for the hidden layer.\n", 2352 | "A)‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).\n", 2353 | "B)‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).\n", 2354 | "C)‘relu’, the rectified linear unit function, returns f(x) = max(0, x)\n", 2355 | "\n", 2356 | "**3)alpha:** L2 penalty (regularization term) parameter.(default 0.0001)\n", 2357 | "\n", 2358 | "**4)max_iter:** Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations.(default 200)" 2359 | ] 2360 | }, 2361 | { 2362 | "cell_type": "code", 2363 | "execution_count": null, 2364 | "metadata": { 2365 | "_cell_guid": "c134b79f-031c-9604-7003-d3bd25a5251c", 2366 | "_uuid": "d6ee5d0d1cb7254446252839351303c3a39b7308" 2367 | }, 2368 | "outputs": [], 2369 | "source": [ 2370 | "'''\n", 2371 | "from sklearn.neural_network import MLPClassifier\n", 2372 | "\n", 2373 | "mlp = MLPClassifier()\n", 2374 | "\n", 2375 | "tuned_parameters={'hidden_layer_sizes': range(1,200,10) , 'activation': ['tanh','logistic','relu'],\n", 2376 | " 'alpha':[0.0001,0.001,0.01,0.1,1,10], 'max_iter': range(50,200,50)\n", 2377 | " \n", 2378 | "}\n", 2379 | "'''" 2380 | ] 2381 | }, 2382 | { 2383 | "cell_type": "code", 2384 | "execution_count": null, 2385 | "metadata": { 2386 | "_cell_guid": "1897166d-0941-2348-5e96-3b1cc2c2d4f2", 2387 | "_uuid": "57c9c3b090be1cf6589d92da2df5c818c86a4626" 2388 | }, 2389 | "outputs": [], 2390 | "source": [ 2391 | "#from sklearn.grid_search import RandomizedSearchCV\n", 2392 | "#model_mlp= RandomizedSearchCV(mlp_model, tuned_parameters,cv=10,scoring='accuracy',n_iter=5,n_jobs= -1,random_state=5)" 2393 | ] 2394 | }, 2395 | { 2396 | "cell_type": "code", 2397 | "execution_count": null, 2398 | "metadata": { 2399 | "_cell_guid": "7272dbfc-550a-5a44-bbf5-98e041b815ee", 2400 | "_uuid": "1a76d9d4fdc0e6cae53e7ca3afc2f966731c701b" 2401 | }, 2402 | "outputs": [], 2403 | "source": [ 2404 | "#model_mlp.fit(X_train, y_train)" 2405 | ] 2406 | }, 2407 | { 2408 | "cell_type": "code", 2409 | "execution_count": null, 2410 | "metadata": { 2411 | "_cell_guid": "83de0f83-1245-04a2-a11a-01dc73b6d763", 2412 | "_uuid": "7a40de29ebb4214389f9aff7a6b1dbcb805669f6" 2413 | }, 2414 | "outputs": [], 2415 | "source": [ 2416 | "#print(model_mlp.grid_scores_)" 2417 | ] 2418 | }, 2419 | { 2420 | "cell_type": "code", 2421 | "execution_count": null, 2422 | "metadata": { 2423 | "_cell_guid": "ed41d08a-ccb7-41d3-31b3-f4a3d262a81e", 2424 | "_uuid": "bdebe2650463c7e26a81f750ed7d5c53bb04fe70" 2425 | }, 2426 | "outputs": [], 2427 | "source": [ 2428 | "#print(model_mlp.best_score_)" 2429 | ] 2430 | }, 2431 | { 2432 | "cell_type": "code", 2433 | "execution_count": null, 2434 | "metadata": { 2435 | "_cell_guid": "80c49de3-4ab4-ce7f-ad03-62fbca353380", 2436 | "_uuid": "5f20fb36d30d2bc194b28cddf2f412a57214808b" 2437 | }, 2438 | "outputs": [], 2439 | "source": [ 2440 | "#print(model_mlp.best_params_)" 2441 | ] 2442 | }, 2443 | { 2444 | "cell_type": "code", 2445 | "execution_count": null, 2446 | "metadata": { 2447 | "_cell_guid": "d5679052-0545-d3ce-e3ae-0f2246c9c833", 2448 | "_uuid": "967cc1a0327c3a29376163ba3452694d9f3df171" 2449 | }, 2450 | "outputs": [], 2451 | "source": [ 2452 | "'''\n", 2453 | "y_prob = model_LR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities \n", 2454 | "y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.\n", 2455 | "model_LR.score(X_test, y_pred)\n", 2456 | "'''" 2457 | ] 2458 | }, 2459 | { 2460 | "cell_type": "code", 2461 | "execution_count": null, 2462 | "metadata": { 2463 | "_cell_guid": "65d57d14-8915-2d98-7c2b-81f64ab33e3c", 2464 | "_uuid": "9493183d2bcedc2661cbf0a375478c366bca10a5" 2465 | }, 2466 | "outputs": [], 2467 | "source": [ 2468 | "#confusion_matrix=metrics.confusion_matrix(y_test,y_pred)\n", 2469 | "#confusion_matrix" 2470 | ] 2471 | }, 2472 | { 2473 | "cell_type": "code", 2474 | "execution_count": null, 2475 | "metadata": { 2476 | "_cell_guid": "e2c734a0-e57f-3db2-4db7-b8b350518f98", 2477 | "_uuid": "582311345e15e261c7fcd203ebef2cb009031941" 2478 | }, 2479 | "outputs": [], 2480 | "source": [ 2481 | "#auc_roc=metrics.classification_report(y_test,y_pred)\n", 2482 | "#auc_roc" 2483 | ] 2484 | }, 2485 | { 2486 | "cell_type": "code", 2487 | "execution_count": null, 2488 | "metadata": { 2489 | "_cell_guid": "f5a50044-422f-6440-47aa-8c7a15a3b0fe", 2490 | "_uuid": "edcf42de7f06c2bc68a8fc507cd3093d7ff427e2" 2491 | }, 2492 | "outputs": [], 2493 | "source": [ 2494 | "#auc_roc=metrics.roc_auc_score(y_test,y_pred)\n", 2495 | "#auc_roc" 2496 | ] 2497 | }, 2498 | { 2499 | "cell_type": "code", 2500 | "execution_count": null, 2501 | "metadata": { 2502 | "_cell_guid": "b858023f-fa46-9912-f647-888741dbdbd2", 2503 | "_uuid": "a22c3f8ec94103e2ca63bcb86382e900a1aa9d62" 2504 | }, 2505 | "outputs": [], 2506 | "source": [ 2507 | "'''\n", 2508 | "from sklearn.metrics import roc_curve, auc\n", 2509 | "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)\n", 2510 | "roc_auc = auc(false_positive_rate, true_positive_rate)\n", 2511 | "'''" 2512 | ] 2513 | }, 2514 | { 2515 | "cell_type": "code", 2516 | "execution_count": null, 2517 | "metadata": { 2518 | "_cell_guid": "1c96335c-abf0-9dfc-b081-3f8b7525c525", 2519 | "_uuid": "7981abee7ef22575c23651e067d169e2d1524e42" 2520 | }, 2521 | "outputs": [], 2522 | "source": [ 2523 | "'''\n", 2524 | "import matplotlib.pyplot as plt\n", 2525 | "plt.figure(figsize=(10,10))\n", 2526 | "plt.title('Receiver Operating Characteristic')\n", 2527 | "plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)\n", 2528 | "plt.legend(loc = 'lower right')\n", 2529 | "plt.plot([0, 1], [0, 1],linestyle='--')\n", 2530 | "plt.axis('tight')\n", 2531 | "plt.ylabel('True Positive Rate')\n", 2532 | "plt.xlabel('False Positive Rate')\n", 2533 | "'''" 2534 | ] 2535 | } 2536 | ], 2537 | "metadata": { 2538 | "_change_revision": 0, 2539 | "_is_fork": false, 2540 | "kernelspec": { 2541 | "display_name": "Python 3", 2542 | "language": "python", 2543 | "name": "python3" 2544 | }, 2545 | "language_info": { 2546 | "codemirror_mode": { 2547 | "name": "ipython", 2548 | "version": 3 2549 | }, 2550 | "file_extension": ".py", 2551 | "mimetype": "text/x-python", 2552 | "name": "python", 2553 | "nbconvert_exporter": "python", 2554 | "pygments_lexer": "ipython3", 2555 | "version": "3.5.2" 2556 | } 2557 | }, 2558 | "nbformat": 4, 2559 | "nbformat_minor": 0 2560 | } -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-cayman -------------------------------------------------------------------------------- /datasets.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nirajvermafcb/Data-Science-with-python/a5bad008365e2a941ee27e8f63e8a151972072d1/datasets.tar.gz -------------------------------------------------------------------------------- /detail_analysis _of_various_hospital_facttors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "_cell_guid": "366487c0-d891-f945-c30d-26fb004b4693", 7 | "_uuid": "b68768d4adedae7fb4b3ad5c736f05c4b98416b2" 8 | }, 9 | "source": [ 10 | "This notebook provides detail analysis of the various factors by means of visualisation." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": { 17 | "_cell_guid": "0d582f0e-ade1-4c77-f505-ab7ea419d2fb", 18 | "_uuid": "2c7402f360dbb673d1afc613839fed3534e44beb" 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 23 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", 24 | "# For example, here's several helpful packages to load in \n", 25 | "\n", 26 | "import numpy as np # linear algebra\n", 27 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 28 | "\n", 29 | "# Input data files are available in the \"../input/\" directory.\n", 30 | "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", 31 | "\n", 32 | "from subprocess import check_output\n", 33 | "print(check_output([\"ls\", \"../input\"]).decode(\"utf8\"))\n", 34 | "\n", 35 | "# Any results you write to the current directory are saved as output." 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "_cell_guid": "83988554-0785-973d-2423-37ae5124c6ab", 43 | "_uuid": "df0a93619c78ecb04803dbe3293c7b247ab278cd" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "#Loading all the necessary libraries\n", 48 | "\n", 49 | "import numpy as np # linear algebra\n", 50 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 51 | "import matplotlib.pyplot as plt #for visualisation\n", 52 | "import seaborn as sns #for visualisation\n", 53 | "%matplotlib inline " 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": { 60 | "_cell_guid": "919a9c59-1ba2-0b43-323d-71e3f187554f", 61 | "_uuid": "363e17300c3130fb3bdbffc820c6d9fb5902d215" 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "hospital_data=pd.read_csv(\"../input/HospInfo.csv\")\n", 66 | "hospital_data.head()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "_cell_guid": "744bd933-9898-fbfe-c1ba-a69ac7793d10", 74 | "_uuid": "cdb9108f9708b5c04306d8799b8c5a2bb6d3896d" 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "hospital_data.info()" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": { 84 | "_cell_guid": "5d282d4f-3eea-6ecf-9664-c14067386256", 85 | "_uuid": "295bba324145de915c4805b5d324de489fc79298" 86 | }, 87 | "source": [ 88 | "We can see that there are few columns which has missing values" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": { 94 | "_cell_guid": "75bbace2-c6fe-eb7b-a106-221cbe97343e", 95 | "_uuid": "24c8256bea7fa58d791b89401f873049f07d37d2" 96 | }, 97 | "source": [ 98 | "### Let us make it more clear by calculating number of missing values in a column" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "_cell_guid": "90a39713-3196-93c4-64b9-e3a1586a7011", 106 | "_uuid": "3be8bc92df9d6d17d48de537fa17b763d1e6c020" 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "def num_missing(x):\n", 111 | " return sum(x.isnull())\n", 112 | "\n", 113 | "#Applying per column:\n", 114 | "print (\"Missing values per column:\")\n", 115 | "print (hospital_data.apply(num_missing, axis=0) )#axis=0 defines that function is to be applied on each column" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": { 121 | "_cell_guid": "f935a47c-f571-714e-ccad-36eeb402c40f", 122 | "_uuid": "69284f12b822d11d232d09e1033b743b9859fbbf" 123 | }, 124 | "source": [ 125 | "### Above we can see that the location has all the values misssing except the 1st row.This column is of no use.Let us drop location column." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "_cell_guid": "2ed5f8cf-2ed7-df4d-8071-ece5afd877eb", 133 | "_uuid": "2533604d437323978cdd6e1cd0cf25fc0bc8821a" 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "hospital_data.drop('Location',axis=1,inplace='True')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": { 144 | "_cell_guid": "01833701-42c5-979f-d31b-23dd258f3777", 145 | "_uuid": "79bc76cb18b87f1ccb0f45af1531d10efb5deb6c" 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "hospital_data.shape" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": { 156 | "_cell_guid": "53b6e422-f5ef-aa74-3e67-a7e6a8844471", 157 | "_uuid": "a89514781349db49ef7e550b7861c326c50b1951" 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "hospital_data.describe()" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "_cell_guid": "359bcbaf-1d28-683b-c419-f13cc66c7dc8", 169 | "_uuid": "17c19654bb195ee01da6fe9976c127e45098b18e" 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "hospital_data.columns.tolist()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": { 179 | "_cell_guid": "9fa11cb1-4de5-ced5-9759-b4be0fd49ef4", 180 | "_uuid": "3830ae1a32f23d353b78f3d55107a246b89291e3" 181 | }, 182 | "source": [ 183 | "# Checking Ownership of the hospitals\n", 184 | "Let us check how many hospitals are owned by a particular individual or government and others." 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": { 191 | "_cell_guid": "68716789-dad7-6f9e-90ba-f56e13c282a4", 192 | "_uuid": "8490c0dc72458407aeab8aa9201fb744cee4c287" 193 | }, 194 | "outputs": [], 195 | "source": [ 196 | "unique_hospital_ownership=hospital_data['Hospital Ownership'].unique()\n", 197 | "unique_hospital_ownership" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "_cell_guid": "d13d89bd-5784-f4e9-2670-cdf61ab4b4ce", 205 | "_uuid": "5eda72c1c17d194cb35a8990ecba03ece35470e2" 206 | }, 207 | "outputs": [], 208 | "source": [ 209 | "dummy_data=pd.get_dummies(hospital_data['Hospital Ownership'])\n", 210 | "dummy_data.head()\n", 211 | "#dummy_data.info()" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": { 218 | "_cell_guid": "2191f6f9-fb43-aa96-3cc8-40f747c94da2", 219 | "_uuid": "a598cc19a3aa9b6cb912e11d8f32d3152e4d3836" 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "a=dummy_data['Government - Federal'].sum()\n", 224 | "b=dummy_data['Government - Hospital District or Authority'].sum()\n", 225 | "c=dummy_data['Government - Local'].sum()\n", 226 | "d=dummy_data['Government - State'].sum()\n", 227 | "e=dummy_data['Physician'].sum()\n", 228 | "f=dummy_data['Proprietary'].sum()\n", 229 | "g=dummy_data['Tribal'].sum()\n", 230 | "h=dummy_data['Voluntary non-profit - Church'].sum()\n", 231 | "i=dummy_data['Voluntary non-profit - Other'].sum()\n", 232 | "j=dummy_data['Voluntary non-profit - Private'].sum()\n", 233 | "list=[a,b,c,d,e,f,g,h,i,j]\n", 234 | "list" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": { 240 | "_cell_guid": "ad59e8ff-00c0-21fe-6985-d490e5915ec4", 241 | "_uuid": "a721228af82d403ca2a193ef4ba6cfd58aee2bca" 242 | }, 243 | "source": [ 244 | "**Here we got the total count of ownership of the hospitals by different groups**" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": { 251 | "_cell_guid": "524de526-298f-6194-5d5b-0665434ddef1", 252 | "_uuid": "6c225db3b1e16659b9ccc9ba596f5c291a2ddc81" 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "ax=sns.barplot(y=unique_hospital_ownership,x=list,data=hospital_data)\n", 257 | "ax.set(xlabel='Number of hospitals', ylabel='Ownership')" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": { 263 | "_cell_guid": "63d5da64-2e53-c695-3ee6-da882482320a", 264 | "_uuid": "06694f36b3bba45193bebeae66aadd47d4cd7e6b" 265 | }, 266 | "source": [ 267 | "**We can see that most of the hospitals are owned by Physician.Also the hospitals under Church which is non-profit organisation are very few**" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": { 274 | "_cell_guid": "d051e672-c710-48e8-fd25-2006c7b398e4", 275 | "_uuid": "5e82e53f5a7f4f1a2038c313508b5c9ee330280b" 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "a= pd.pivot_table(hospital_data,values=['Hospital overall rating'],index=['Hospital Ownership'],columns=['Hospital Type'],aggfunc='count',margins=False)\n", 280 | "\n", 281 | "plt.figure(figsize=(10,10))\n", 282 | "sns.heatmap(a['Hospital overall rating'],linewidths=.5,annot=True,vmin=0.01,cmap='YlGnBu')\n", 283 | "plt.title('Total rating of the types of hospitals under the ownership of various community')" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": { 289 | "_cell_guid": "097056d2-360a-92d3-6e58-d94ff61f72e2", 290 | "_uuid": "2a75f34fdedd8fcc2ed0bd17da6871146c5be877" 291 | }, 292 | "source": [ 293 | "#Categorising Hospitals w.r.t to their ratings" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": { 300 | "_cell_guid": "71194c99-6936-88ae-c06f-58117d10c58d", 301 | "_uuid": "88ede442319b2593988ff774cbbd8423fa887ab6" 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "hospital_data['Hospital overall rating'].unique()" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": { 311 | "_cell_guid": "02a755cc-3b80-6a85-6cf7-06cf9e8a56b1", 312 | "_uuid": "2fa299f6fbf94512e5c31034b582441e1bfe308b" 313 | }, 314 | "source": [ 315 | "**Let us drop those rows where Hospital overall Rating==Not Available**" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": { 322 | "_cell_guid": "c5b387fe-f1e4-1617-19fb-2918bda4bf56", 323 | "_uuid": "f7d5828d31ef33d6a72ee5fea93ac079f12b6049" 324 | }, 325 | "outputs": [], 326 | "source": [ 327 | "AvailableRating_data=hospital_data.drop(hospital_data[hospital_data['Hospital overall rating']=='Not Available'].index)\n", 328 | "#AvailableRating_data.info()" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": { 334 | "_cell_guid": "54e12473-9e90-f414-b393-dfb572065bb6", 335 | "_uuid": "34716627aa271dcd4de4aa9615860e8b68ebac09" 336 | }, 337 | "source": [ 338 | "### Sorting the values in Descending order as per the overall rating of the hospitals" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": { 345 | "_cell_guid": "180d516b-f485-bc6f-4188-411aa5f5e73b", 346 | "_uuid": "18b8b2be06c78ca4fc6940e72bd69e4960134809" 347 | }, 348 | "outputs": [], 349 | "source": [ 350 | "sorted_rating=AvailableRating_data.sort_values(['Hospital overall rating'], ascending=False)\n", 351 | "sorted_rating['Hospital overall rating'].head()\n", 352 | "sorted_rating[['Hospital Name','Hospital overall rating']].head()" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": { 359 | "_cell_guid": "17a30722-b65b-6371-0550-da8be1b674c3", 360 | "_uuid": "dda047de3568fd7b8889f52664ad97fb8ae7c703" 361 | }, 362 | "outputs": [], 363 | "source": [ 364 | "Unique_sorted_rating=sorted_rating['Hospital overall rating'].unique()\n", 365 | "Unique_sorted_rating" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": { 371 | "_cell_guid": "ca4a5314-e987-b6fc-cc90-75d39eea7ac0", 372 | "_uuid": "4c8d219bae998179b9a75b4253e847622589969b" 373 | }, 374 | "source": [ 375 | "### Finding all the rows with rating 5,4,3,2,1 and separating them and keeping a count of those rows which belongs to that particular rating category" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": { 382 | "_cell_guid": "d3f042f5-b86a-fffc-5a85-3c1f472ecc69", 383 | "_uuid": "45aec288fa3a711a79847cbf439457eb944520e9" 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "rating_with_5=sorted_rating.loc[sorted_rating['Hospital overall rating'] =='5']\n", 388 | "Rating_5=rating_with_5['Provider ID'].count()\n", 389 | "#rating_with_5[['Hospital Name','Hospital overall rating']].head()\n", 390 | "rating_with_4=sorted_rating.loc[sorted_rating['Hospital overall rating'] =='4']\n", 391 | "Rating_4=rating_with_4['Provider ID'].count()\n", 392 | "rating_with_3=sorted_rating.loc[sorted_rating['Hospital overall rating'] =='3']\n", 393 | "Rating_3=rating_with_3['Provider ID'].count()\n", 394 | "rating_with_2=sorted_rating.loc[sorted_rating['Hospital overall rating'] =='2']\n", 395 | "Rating_2=rating_with_2['Provider ID'].count()\n", 396 | "rating_with_1=sorted_rating.loc[sorted_rating['Hospital overall rating'] =='1']\n", 397 | "Rating_1=rating_with_1['Provider ID'].count()\n", 398 | "#Rating_5\n", 399 | "#Rating_4\n", 400 | "#Rating_3\n", 401 | "#Rating_2\n", 402 | "#Rating_1\n", 403 | "list=[Rating_5,Rating_4,Rating_3,Rating_2,Rating_1]\n", 404 | "list\n", 405 | "print(Rating_5,Rating_4,Rating_3,Rating_2,Rating_1)" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": { 412 | "_cell_guid": "318b3103-b7c6-2b0d-345f-4785a63d0c97", 413 | "_uuid": "283bcbc84de859c7641d112e7fe1187522a39493" 414 | }, 415 | "outputs": [], 416 | "source": [ 417 | "ax=sns.barplot(x=Unique_sorted_rating,y=list,data=hospital_data,palette='pastel')\n", 418 | "ax.set(xlabel='Rating out of 5', ylabel='Number of hospitals')" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": { 424 | "_cell_guid": "de19c0ce-1247-150a-27a8-b891ba388b40", 425 | "_uuid": "94591d8215633743d4a9c4b1f18848be3a8a6de5" 426 | }, 427 | "source": [ 428 | "**Thus we can see that most of the hospitals are given the rating of 3.Hospitals with very high rating(5) and very low rating(1) are very few.**" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": { 434 | "_cell_guid": "3088f1c6-25db-7885-9c64-8cb6b5b27a94", 435 | "_uuid": "867b2b55853d7c6f9754c07cf0192bde704ebba3" 436 | }, 437 | "source": [ 438 | "# Which states has maximum number of 5 star rating hospitals?" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "metadata": { 445 | "_cell_guid": "b59bcece-d7c5-91dd-8f20-65bf72d6c104", 446 | "_uuid": "9171f1b61853d9fe9f384c7d82fe8f02229df741" 447 | }, 448 | "outputs": [], 449 | "source": [ 450 | "hospital_data['Hospital Type'].unique()" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": { 456 | "_cell_guid": "7f35c12b-1c07-88d5-a5d8-a28008f20c67", 457 | "_uuid": "8c5e729df634b12a39238d45c7c80f719e7563b8" 458 | }, 459 | "source": [ 460 | "# Acute care hospitals with 5 star rating." 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": { 467 | "_cell_guid": "56b7017b-257f-9cc7-4606-7fc461a4f8a0", 468 | "_uuid": "62745ebe6260a9c32d9832c9e08d84239738ec28" 469 | }, 470 | "outputs": [], 471 | "source": [ 472 | "State_acute_5=hospital_data.loc[(hospital_data[\"Hospital Type\"]==\"Acute Care Hospitals\") & (hospital_data[\"Hospital overall rating\"]==\"5\"),[\"State\"]]\n", 473 | "State_acute_5.head()\n", 474 | "#State_acute_5['State'].unique()" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": { 481 | "_cell_guid": "0ffddc95-6bea-75f7-2aa8-4ad2b4477213", 482 | "_uuid": "00e83c72d41985f40d3737ceb1ddca438391bb5d" 483 | }, 484 | "outputs": [], 485 | "source": [ 486 | "S_A_5=State_acute_5['State'].value_counts()\n", 487 | "index=S_A_5.index\n", 488 | "values=S_A_5.values\n", 489 | "values" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": null, 495 | "metadata": { 496 | "_cell_guid": "291e0bd1-40dd-536e-9b10-60dc2e81a94b", 497 | "_uuid": "7a863c66aadd7de732a503634399174310f9c779" 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "dims = (8, 10)\n", 502 | "fig, ax = plt.subplots(figsize=dims)\n", 503 | "\n", 504 | "ax=sns.barplot(y=index,x=values,palette='GnBu_d')\n", 505 | "ax.set(xlabel='Total number of Acute Care hospitals with 5 rating', ylabel='States')" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": { 511 | "_cell_guid": "d0a5b258-1845-7fa5-8172-8ef7c94c67c5", 512 | "_uuid": "88c4ee67eeaf2d17e9d3ec01ca6ff862ce9c9154" 513 | }, 514 | "source": [ 515 | "**Thus Texas leads with Acute care hospitals with 5 star rating**" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": { 521 | "_cell_guid": "64c9f0c1-7a56-f846-0b27-365b5db022cb", 522 | "_uuid": "70c5fdf7ff6d08069e8f3e6ae7891b1662b91770" 523 | }, 524 | "source": [ 525 | "# Critical Access Hospitals with 5 star rating" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": null, 531 | "metadata": { 532 | "_cell_guid": "c8f54117-13bf-8eb3-d575-7ca121773ceb", 533 | "_uuid": "4b9aa4b8370f24d6616abc3eff8a909cfb666b69" 534 | }, 535 | "outputs": [], 536 | "source": [ 537 | "Critical_access_5=hospital_data.loc[(hospital_data[\"Hospital Type\"]==\"Critical Access Hospitals\") & (hospital_data[\"Hospital overall rating\"]==\"5\"),[\"State\"]]\n", 538 | "C_A_5=Critical_access_5['State'].value_counts()\n", 539 | "C_A_5\n", 540 | "index=C_A_5.index\n", 541 | "values=C_A_5.values\n", 542 | "values" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "metadata": { 549 | "_cell_guid": "7173a0aa-125a-2262-3340-cbf518738901", 550 | "_uuid": "d09a325098040d240746e9082c7c86bd24504421" 551 | }, 552 | "outputs": [], 553 | "source": [ 554 | "dims = (8, 2)\n", 555 | "fig, ax = plt.subplots(figsize=dims)\n", 556 | "\n", 557 | "ax=sns.barplot(y=index,x=values,palette='YlOrBr')\n", 558 | "ax.set(xlabel='Total number of Critical Care hospitals with 5 rating', ylabel='States')" 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": { 564 | "_cell_guid": "57e69462-9a0b-33bf-f413-adb8e5aa02df", 565 | "_uuid": "c166f89edfc4117df1faeba13032379399fafbd2" 566 | }, 567 | "source": [ 568 | "**Thus there are only two states with Critical Acess hospitals each with rating as 5**" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": { 574 | "_cell_guid": "2f4aeee0-51f0-1e32-09b9-4f56976f5e8c", 575 | "_uuid": "c649c35fd65f89fc6a8bcc440358486f20201e5d" 576 | }, 577 | "source": [ 578 | "# Childrens Hospitals with 1 star rating" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": null, 584 | "metadata": { 585 | "_cell_guid": "8cd86017-357c-70f3-8c28-5aa086c6334a", 586 | "_uuid": "007dd5f6bbb81342f1ea187d3958281ad12c3e6b" 587 | }, 588 | "outputs": [], 589 | "source": [ 590 | "Chidrens_5=hospital_data.loc[(hospital_data[\"Hospital Type\"]==\"Childrens\") & (hospital_data[\"Hospital overall rating\"]==\"5\"),[\"State\"]]\n", 591 | "C_5=Chidrens_5['State'].value_counts()\n", 592 | "C_5\n", 593 | "index=C_5.index\n", 594 | "values=C_5.values\n", 595 | "values\n", 596 | "index" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": { 602 | "_cell_guid": "c69acd45-b678-fb88-bd1c-d633a7918a2b", 603 | "_uuid": "6b0ad140db4436e7ef0916377ebef5304a2c0a1f" 604 | }, 605 | "source": [ 606 | "**Thus there no hospitals for childrens with 5 star rating**" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": { 612 | "_cell_guid": "0b2675ad-7e51-c2bb-c812-8980928447d4", 613 | "_uuid": "9f1c9b835db02fb49e93e14a03394eeaa6ba9432" 614 | }, 615 | "source": [ 616 | "# Which states has maximum number of 1 star rating hospitals?" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": { 622 | "_cell_guid": "80619b39-a65b-0194-db96-2e4b06dd8112", 623 | "_uuid": "6bac6937a9f8dc9b4c065c77f0d1fa5f47d87785" 624 | }, 625 | "source": [ 626 | "# Acute care hospitals with 1 star rating." 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "metadata": { 633 | "_cell_guid": "6c78fdd2-e23f-3312-36b5-c9574c079d34", 634 | "_uuid": "d93487d15c10284b2bdb9dcf04cbd342aa722e49" 635 | }, 636 | "outputs": [], 637 | "source": [ 638 | "State_acute_1=hospital_data.loc[(hospital_data[\"Hospital Type\"]==\"Acute Care Hospitals\") & (hospital_data[\"Hospital overall rating\"]==\"1\"),[\"State\"]]\n", 639 | "State_acute_1.head()\n", 640 | "#State_acute_1['State'].unique()\n", 641 | "S_A_1=State_acute_1['State'].value_counts()\n", 642 | "index=S_A_1.index\n", 643 | "values=S_A_1.values\n", 644 | "values" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": { 651 | "_cell_guid": "b1aefaef-7136-57a7-3710-b6c0754aa535", 652 | "_uuid": "99c55122522876524cc511ee5047aa3f7f818ef5" 653 | }, 654 | "outputs": [], 655 | "source": [ 656 | "dims = (8, 10)\n", 657 | "fig, ax = plt.subplots(figsize=dims)\n", 658 | "\n", 659 | "ax=sns.barplot(y=index,x=values,palette='cubehelix')\n", 660 | "ax.set(xlabel='Total number of Acute Care hospitals with 1 rating', ylabel='States')" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": { 666 | "_cell_guid": "1da8520c-ee66-4736-d8f4-f6dc005b49c2", 667 | "_uuid": "91375da9a3f1fdd8a909fc7b19bcd5a73ecb6654" 668 | }, 669 | "source": [ 670 | "**New york has maximum number of Acute care hospitals with 1 star rating**" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": { 676 | "_cell_guid": "f8bfebe3-3def-31f6-0249-1c038caac9ac", 677 | "_uuid": "7f501c54e15515cbcdbe8b1898ae672f162f2844" 678 | }, 679 | "source": [ 680 | "# Critical Access Hospitals with 1 star rating" 681 | ] 682 | }, 683 | { 684 | "cell_type": "code", 685 | "execution_count": null, 686 | "metadata": { 687 | "_cell_guid": "9363365f-1372-2967-a1ae-3800df945eda", 688 | "_uuid": "984700a85a5ac3dcc9fdecf0fb17279d137f58ef" 689 | }, 690 | "outputs": [], 691 | "source": [ 692 | "Critical_access_1=hospital_data.loc[(hospital_data[\"Hospital Type\"]==\"Critical Access Hospitals\") & (hospital_data[\"Hospital overall rating\"]==\"1\"),[\"State\"]]\n", 693 | "C_A_1=Critical_access_1['State'].value_counts()\n", 694 | "C_A_1\n", 695 | "index=C_A_1.index\n", 696 | "values=C_A_1.values\n", 697 | "values" 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": null, 703 | "metadata": { 704 | "_cell_guid": "bb005d7d-967b-d72e-d888-afc31c8d2658", 705 | "_uuid": "ebc8131e0b9e4d4340420fd3aec8c250a0b09aba" 706 | }, 707 | "outputs": [], 708 | "source": [ 709 | "dims = (8, 1)\n", 710 | "fig, ax = plt.subplots(figsize=dims)\n", 711 | "\n", 712 | "ax=sns.barplot(y=index,x=values,palette='Spectral')\n", 713 | "ax.set(xlabel='Total number of Critical Acess hospitals with 1 rating', ylabel='States')" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": { 719 | "_cell_guid": "142b5fb4-58fd-ff3f-81ac-d85e9f392d98", 720 | "_uuid": "7ce8ebbe6229577e6deab7bb646f86cb209d3626" 721 | }, 722 | "source": [ 723 | "**Thus there is only one Critical Acess hospital in USA with 1 star rating which is Kentucky.**" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": { 729 | "_cell_guid": "9d9e6687-dd1e-b1e7-febd-8792c196a958", 730 | "_uuid": "3a348ad13b6f01bcb524e397050f0faa13a87571" 731 | }, 732 | "source": [ 733 | "# Chidrens Hospitals with 1 star rating" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": null, 739 | "metadata": { 740 | "_cell_guid": "0605b73e-cb1b-9c17-190a-8e4f7adaec76", 741 | "_uuid": "e3403c7d2201bdfa080b62cd019908da445db31c" 742 | }, 743 | "outputs": [], 744 | "source": [ 745 | "Chidrens_1=hospital_data.loc[(hospital_data[\"Hospital Type\"]==\"Childrens\") & (hospital_data[\"Hospital overall rating\"]==\"1\"),[\"State\"]]\n", 746 | "C_1=Chidrens_1['State'].value_counts()\n", 747 | "C_1\n", 748 | "index=C_1.index\n", 749 | "values=C_1.values\n", 750 | "values\n", 751 | "index" 752 | ] 753 | }, 754 | { 755 | "cell_type": "markdown", 756 | "metadata": { 757 | "_cell_guid": "87795f37-e10d-1217-efd1-334604edb78e", 758 | "_uuid": "7dcdfd4637c1d4dc333f8bad471b0f55ece12d48" 759 | }, 760 | "source": [ 761 | "**Thus there no hospitals for childrens with 5 star rating**" 762 | ] 763 | }, 764 | { 765 | "cell_type": "markdown", 766 | "metadata": { 767 | "_cell_guid": "b53edbdd-f04a-f8a8-f7ca-558b6bc5af7a", 768 | "_uuid": "aa248871b78746978d69013cd35c21e6da95dcc8" 769 | }, 770 | "source": [ 771 | "### Checking which hospital types are more common" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | "metadata": { 778 | "_cell_guid": "010b6058-8e79-d81c-548b-76bf4c6ac2df", 779 | "_uuid": "1604f20bbc1c0ca834123a56c0c9cf1327d0bffd" 780 | }, 781 | "outputs": [], 782 | "source": [ 783 | "unique_hospital_type=hospital_data['Hospital Type'].unique()\n", 784 | "#hospital_data['Hospital Type'].count()" 785 | ] 786 | }, 787 | { 788 | "cell_type": "code", 789 | "execution_count": null, 790 | "metadata": { 791 | "_cell_guid": "db9b9cc8-52a7-a442-8b52-0e910b66ea87", 792 | "_uuid": "4ae4756fe712db031ec90135f7df5d4feee170d9" 793 | }, 794 | "outputs": [], 795 | "source": [ 796 | "hospital_type=hospital_data.loc[hospital_data['Hospital Type']=='Acute Care Hospitals']\n", 797 | "Acute_care=hospital_type['Hospital Type'].count()\n", 798 | "\n", 799 | "hospital_type=hospital_data.loc[hospital_data['Hospital Type']=='Critical Access Hospitals']\n", 800 | "Critical_Acess=hospital_type['Hospital Type'].count()\n", 801 | "\n", 802 | "hospital_type=hospital_data.loc[hospital_data['Hospital Type']=='Childrens']\n", 803 | "Childrens=hospital_type['Hospital Type'].count()\n", 804 | "list=[Acute_care,Critical_Acess,Childrens]\n", 805 | "list" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": null, 811 | "metadata": { 812 | "_cell_guid": "077bf8be-af70-eaf2-b875-f6c7a8ff3873", 813 | "_uuid": "88514f17c66545e6726a56e4123b07b040c7afc4" 814 | }, 815 | "outputs": [], 816 | "source": [ 817 | "ax=sns.barplot(x=unique_hospital_type,y=list,data=hospital_data,palette='colorblind')\n", 818 | "ax.set(xlabel='Types of hospitals', ylabel='Number of hospitals')" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": { 824 | "_cell_guid": "edb7516d-ca1d-fd2d-8a38-868e00714565", 825 | "_uuid": "87e702d114afb782dd7f91eda24d7a7a5fea24ac" 826 | }, 827 | "source": [ 828 | "###Thus there are large number of Acute Care Hospitals followed by Critical Acess Hospitals.Childrens hospitals are very rare." 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": { 834 | "_cell_guid": "e3566adc-6083-b9e8-183f-3863bef1ada7", 835 | "_uuid": "4aa229234135834a9424c4fa206ae4fe0463f99d" 836 | }, 837 | "source": [ 838 | "# The average hospital rating, by state" 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": null, 844 | "metadata": { 845 | "_cell_guid": "77afa962-3be4-8eb7-6847-d3cd9f389215", 846 | "_uuid": "4b65e877fb0e068c622dc19c323dae02f2db3a46" 847 | }, 848 | "outputs": [], 849 | "source": [ 850 | "hospital_data['Hospital overall rating'].unique()" 851 | ] 852 | }, 853 | { 854 | "cell_type": "code", 855 | "execution_count": null, 856 | "metadata": { 857 | "_cell_guid": "3a7d23c7-c9d7-0cf6-0d19-76e4484c42a9", 858 | "_uuid": "384cd4b27079fbb739b87c7adeec0cd8dab90c35" 859 | }, 860 | "outputs": [], 861 | "source": [ 862 | "clean_hospital_data=hospital_data.drop(hospital_data[hospital_data['Hospital overall rating']=='Not Available'].index)\n", 863 | "#clean_hospital_data['Hospital overall rating'].astype(float)\n", 864 | "clean_hospital_data['Hospital overall rating'].unique()" 865 | ] 866 | }, 867 | { 868 | "cell_type": "markdown", 869 | "metadata": { 870 | "_cell_guid": "f488d1c3-ca83-302a-ea36-06878ba2cfaa", 871 | "_uuid": "9faa3a076f6c945fc6ed99658636d6ccad2279e3" 872 | }, 873 | "source": [ 874 | "### Converting it to float data type for calculation" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": null, 880 | "metadata": { 881 | "_cell_guid": "73bc21de-37de-961a-3376-68d3f707e537", 882 | "_uuid": "7fc626263c77d4a8ce7e54677a74a7fb62927040" 883 | }, 884 | "outputs": [], 885 | "source": [ 886 | "clean_hospital_data['Hospital overall rating']=clean_hospital_data['Hospital overall rating'].astype(float)" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": null, 892 | "metadata": { 893 | "_cell_guid": "bd5bee2c-2a30-70fd-4e0a-bb9c6a39f8f6", 894 | "_uuid": "bee96f2185f8e3c50f13cda125007900b9fdbb16" 895 | }, 896 | "outputs": [], 897 | "source": [ 898 | "clean_hospital_data['Hospital overall rating'].mean()\n", 899 | "clean_hospital_data['Hospital overall rating'].count()" 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": null, 905 | "metadata": { 906 | "_cell_guid": "82bd30fd-dbe5-fce4-638d-d4649e0cfee4", 907 | "_uuid": "f3db0b459f30a49873e0ede10a1085aad5bc9a1d" 908 | }, 909 | "outputs": [], 910 | "source": [ 911 | "Statewise_avarage_rating=clean_hospital_data.groupby('State')['Hospital overall rating'].mean()\n", 912 | "#Statewise_avarage_rating.sort_values(ascending=False)" 913 | ] 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "metadata": { 918 | "_cell_guid": "5ca91327-1ac8-4e67-290f-c38d28786841", 919 | "_uuid": "61c261389b3276c2bf0244676b95ee5c79f2e1d6" 920 | }, 921 | "source": [ 922 | "### Separating index and values" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": null, 928 | "metadata": { 929 | "_cell_guid": "75cf6aa8-9183-ac63-c375-5fdb732beb97", 930 | "_uuid": "233877f42f03a456d632a1979db3c71761ac69ec" 931 | }, 932 | "outputs": [], 933 | "source": [ 934 | "index=Statewise_avarage_rating.sort_values(ascending=False).index\n", 935 | "values=Statewise_avarage_rating.sort_values(ascending=False).values\n", 936 | "#index\n", 937 | "#values" 938 | ] 939 | }, 940 | { 941 | "cell_type": "code", 942 | "execution_count": null, 943 | "metadata": { 944 | "_cell_guid": "76f7710b-9c8e-c883-2ee3-c8cdbef1b13e", 945 | "_uuid": "a58626ac1cdfb88b1c6ecadadea2e1e0298b5d90" 946 | }, 947 | "outputs": [], 948 | "source": [ 949 | "a4_dims = (8, 10)\n", 950 | "fig, ax = plt.subplots(figsize=a4_dims)\n", 951 | "\n", 952 | "ax=sns.barplot(y=index,x=values)\n", 953 | "ax.set(xlabel='Average rating of the hospitals', ylabel='State')" 954 | ] 955 | }, 956 | { 957 | "cell_type": "markdown", 958 | "metadata": { 959 | "_cell_guid": "9dd29f15-7c51-6907-f292-e1e5c472cec8", 960 | "_uuid": "4e76bebf8296ae956cb7a47945a4f10d1033903e" 961 | }, 962 | "source": [ 963 | "**Thus South Dacota has the best average rating of the hospitals.District of columbia has the worst average rating.**" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": { 969 | "_cell_guid": "9607c0d7-c6b7-700e-5546-eab5002f132a", 970 | "_uuid": "4fd24bc141ce32009a303d51675403e080423499" 971 | }, 972 | "source": [ 973 | "# Let us check which types of hospitals are more likely to have not submitted proper data" 974 | ] 975 | }, 976 | { 977 | "cell_type": "markdown", 978 | "metadata": { 979 | "_cell_guid": "a4e589bd-1613-c7b3-0175-69cd94983c88", 980 | "_uuid": "f7b805c29816df61d06b306ec735e29758e0071b" 981 | }, 982 | "source": [ 983 | "### Which type of hospitals has highest Non-availabilty of Mortality comparison data?" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": null, 989 | "metadata": { 990 | "_cell_guid": "ebe59249-c8a3-022c-e1ba-f99ebc2a9d05", 991 | "_uuid": "4f4581e9764aa030b45107b20c2477a5a1fcd619" 992 | }, 993 | "outputs": [], 994 | "source": [ 995 | "Mortality_NotAvailable=hospital_data.loc[hospital_data['Mortality national comparison']=='Not Available']\n", 996 | "Mortality_NotAvailable['Mortality national comparison'].count()" 997 | ] 998 | }, 999 | { 1000 | "cell_type": "code", 1001 | "execution_count": null, 1002 | "metadata": { 1003 | "_cell_guid": "5283a712-85fd-5d44-a43f-93b2c84c3cfb", 1004 | "_uuid": "224b1f640e177ce2f17b6d01acd2cde120c39a83" 1005 | }, 1006 | "outputs": [], 1007 | "source": [ 1008 | "Non_available_data=Mortality_NotAvailable.groupby('Hospital Type')['Mortality national comparison'].count()\n", 1009 | "#Non_available_data\n", 1010 | "Non_available_data.sort_values(ascending=False)" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "code", 1015 | "execution_count": null, 1016 | "metadata": { 1017 | "_cell_guid": "02d5b52e-ae0c-c1cb-d687-b60a29bea9f9", 1018 | "_uuid": "f3b135ac6c48fa37616754ed73c651316026639c" 1019 | }, 1020 | "outputs": [], 1021 | "source": [ 1022 | "index=Non_available_data.sort_values(ascending=False).index\n", 1023 | "values=Non_available_data.sort_values(ascending=False).values\n", 1024 | "#index\n", 1025 | "#values" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "code", 1030 | "execution_count": null, 1031 | "metadata": { 1032 | "_cell_guid": "05216b03-eb33-7d81-34ab-282c0e4ba92d", 1033 | "_uuid": "971eba1fbeab835189df495c506b991684345668" 1034 | }, 1035 | "outputs": [], 1036 | "source": [ 1037 | "dims = (6, 6)\n", 1038 | "fig, ax = plt.subplots(figsize=dims)\n", 1039 | "\n", 1040 | "ax=sns.barplot(y=values,x=index,palette='PiYG')\n", 1041 | "ax.set(xlabel='Hospitals types', ylabel='Count of Mortality data Non-Availabilty') " 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "metadata": { 1047 | "_cell_guid": "abccca6f-f810-55aa-6db1-2cb642ecf7b3", 1048 | "_uuid": "844b4265d92ca61598b627f6acbf73564cea6ef3" 1049 | }, 1050 | "source": [ 1051 | "**Thus Critical Acess hospitals has highest Non-availabilty of mortality comparison of the data and chidrens hospitals has minimum.**" 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "markdown", 1056 | "metadata": { 1057 | "_cell_guid": "6bfe6803-6012-607b-2b85-0fb837f7ca24", 1058 | "_uuid": "ca911c716e58bd6534ded5b57418b4298893733f" 1059 | }, 1060 | "source": [ 1061 | "# Which type of hospitals has highest Non-availabilty of Safety of Care data?" 1062 | ] 1063 | }, 1064 | { 1065 | "cell_type": "code", 1066 | "execution_count": null, 1067 | "metadata": { 1068 | "_cell_guid": "0cdb39be-d109-112f-959e-bf8402215341", 1069 | "_uuid": "118635742855dd54203f88464e0c152017db9e4a" 1070 | }, 1071 | "outputs": [], 1072 | "source": [ 1073 | "SafetyOfCare_NotAvailable=hospital_data.loc[hospital_data['Safety of care national comparison']=='Not Available']\n", 1074 | "SafetyOfCare_NotAvailable['Safety of care national comparison'].count()" 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "code", 1079 | "execution_count": null, 1080 | "metadata": { 1081 | "_cell_guid": "57efc21b-f1cc-c9dd-c19e-5059dd1bccb8", 1082 | "_uuid": "95146ee63d05c3fdcb23045cf4375869ce84c55d" 1083 | }, 1084 | "outputs": [], 1085 | "source": [ 1086 | "SafetyOfCare_NotAvailable=hospital_data.loc[hospital_data['Safety of care national comparison']=='Not Available']\n", 1087 | "SafetyOfCare_NotAvailable['Safety of care national comparison'].count()\n", 1088 | "Non_available_data=SafetyOfCare_NotAvailable.groupby('Hospital Type')['Safety of care national comparison'].count()\n", 1089 | "#Non_available_data\n", 1090 | "Non_available_data.sort_values(ascending=False)\n", 1091 | "index=Non_available_data.sort_values(ascending=False).index\n", 1092 | "values=Non_available_data.sort_values(ascending=False).values" 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "code", 1097 | "execution_count": null, 1098 | "metadata": { 1099 | "_cell_guid": "32bc3e84-ab29-e820-29f9-e1aa83d53ca9", 1100 | "_uuid": "1b57bb408ef82ed1c473be74d326be56c55d39ab" 1101 | }, 1102 | "outputs": [], 1103 | "source": [ 1104 | "dims = (6, 6)\n", 1105 | "fig, ax = plt.subplots(figsize=dims)\n", 1106 | "\n", 1107 | "ax=sns.barplot(y=values,x=index,palette='BrBG')\n", 1108 | "ax.set(xlabel='Hospital Types ', ylabel='Count of Safety of care data Non-Availabilty')" 1109 | ] 1110 | }, 1111 | { 1112 | "cell_type": "markdown", 1113 | "metadata": { 1114 | "_cell_guid": "6e98ea92-ea7e-71dd-343d-1fd86f702f7a", 1115 | "_uuid": "095e5bf544972718b549c52f616150923b43fdff" 1116 | }, 1117 | "source": [ 1118 | "# Which type of hospitals has highest Non-availabilty of Readmission national comparison data?" 1119 | ] 1120 | }, 1121 | { 1122 | "cell_type": "code", 1123 | "execution_count": null, 1124 | "metadata": { 1125 | "_cell_guid": "30f93838-2ecb-434c-7fa1-92f57637653b", 1126 | "_uuid": "530d36f1692e60e87558b9992eb5abcec2e21d9a" 1127 | }, 1128 | "outputs": [], 1129 | "source": [ 1130 | "Readmission_NotAvailable=hospital_data.loc[hospital_data['Readmission national comparison']=='Not Available']\n", 1131 | "Readmission_NotAvailable['Readmission national comparison'].count()\n", 1132 | "Non_available_data=Readmission_NotAvailable.groupby('Hospital Type')['Readmission national comparison'].count()\n", 1133 | "#Non_available_data\n", 1134 | "Non_available_data.sort_values(ascending=False)\n", 1135 | "index=Non_available_data.sort_values(ascending=False).index\n", 1136 | "values=Non_available_data.sort_values(ascending=False).values\n", 1137 | "#index\n", 1138 | "#values" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "code", 1143 | "execution_count": null, 1144 | "metadata": { 1145 | "_cell_guid": "5f919bc7-2f8c-ae85-f997-32222e7aa99d", 1146 | "_uuid": "b9371379555bd7b44b0f54eed9d2798e2d154573" 1147 | }, 1148 | "outputs": [], 1149 | "source": [ 1150 | "dims = (6, 7)\n", 1151 | "fig, ax = plt.subplots(figsize=dims)\n", 1152 | "\n", 1153 | "ax=sns.barplot(y=values,x=index,palette='RdYlGn')\n", 1154 | "ax.set(xlabel='Hospital Types ', ylabel='Count of Readmission data Non-Availabilty')" 1155 | ] 1156 | }, 1157 | { 1158 | "cell_type": "markdown", 1159 | "metadata": { 1160 | "_cell_guid": "9cc260c0-8a00-2021-ba9c-9412d30fd2c4", 1161 | "_uuid": "af929dfcba3cd9c2fe68bc240c548cb2007faf15" 1162 | }, 1163 | "source": [ 1164 | "**Similarly there are few more columns which we can take into consideration**" 1165 | ] 1166 | }, 1167 | { 1168 | "cell_type": "code", 1169 | "execution_count": null, 1170 | "metadata": { 1171 | "_cell_guid": "b932d3b0-9224-b5c4-1be4-e70f173beb2d", 1172 | "_uuid": "770f92bcc24640e7ef26479a404de9f5c2714b28" 1173 | }, 1174 | "outputs": [], 1175 | "source": [ 1176 | "#Still Working" 1177 | ] 1178 | } 1179 | ], 1180 | "metadata": { 1181 | "_change_revision": 0, 1182 | "_is_fork": false, 1183 | "kernelspec": { 1184 | "display_name": "Python 3", 1185 | "language": "python", 1186 | "name": "python3" 1187 | }, 1188 | "language_info": { 1189 | "codemirror_mode": { 1190 | "name": "ipython", 1191 | "version": 3 1192 | }, 1193 | "file_extension": ".py", 1194 | "mimetype": "text/x-python", 1195 | "name": "python", 1196 | "nbconvert_exporter": "python", 1197 | "pygments_lexer": "ipython3", 1198 | "version": "3.6.0" 1199 | } 1200 | }, 1201 | "nbformat": 4, 1202 | "nbformat_minor": 0 1203 | } -------------------------------------------------------------------------------- /exploring_principal_component_analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "kernelspec": { 4 | "name": "python" 5 | }, 6 | "language_info": { 7 | "name": "python", 8 | "version": "3.5.1" 9 | } 10 | }, 11 | "nbformat": 4, 12 | "nbformat_minor": 0, 13 | "cells": [ 14 | { 15 | "cell_type": "markdown", 16 | "source": "**I'll explain the steps involved in PCA with codes without implemeting scikit-learn.In the end we'll see the shortcut(alternative) way to apply PCA using Scikit-learn.The main aim of this tutorial is to explain what actually happens in background when you apply PCA algorithm.**", 17 | "metadata": {} 18 | }, 19 | { 20 | "cell_type": "code", 21 | "source": "# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load in \n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n\nfrom subprocess import check_output\nprint(check_output([\"ls\", \"../input\"]).decode(\"utf8\"))\n\n# Any results you write to the current directory are saved as output.", 22 | "execution_count": null, 23 | "outputs": [], 24 | "metadata": {} 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "source": "# 1)Let us first import all the necessary libraries", 29 | "metadata": {} 30 | }, 31 | { 32 | "cell_type": "code", 33 | "source": "import numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\nimport matplotlib as mpl\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n%matplotlib inline", 34 | "execution_count": null, 35 | "outputs": [], 36 | "metadata": {} 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "source": "# 2)Loading the dataset\nTo import the dataset we will use Pandas library.It is the best Python library to play with the dataset and has a lot of functionalities. ", 41 | "metadata": {} 42 | }, 43 | { 44 | "cell_type": "code", 45 | "source": "df = pd.read_csv('../input/HR_comma_sep.csv')", 46 | "execution_count": null, 47 | "outputs": [], 48 | "metadata": {} 49 | }, 50 | { 51 | "cell_type": "code", 52 | "source": "columns_names=df.columns.tolist()\nprint(\"Columns names:\")\nprint(columns_names)", 53 | "execution_count": null, 54 | "outputs": [], 55 | "metadata": {} 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "source": "df.columns.tolist() fetches all the columns and then convert it into list type.This step is just to check out all the column names in our data.Columns are also called as features of our datasets.", 60 | "metadata": {} 61 | }, 62 | { 63 | "cell_type": "code", 64 | "source": "df.head()", 65 | "execution_count": null, 66 | "outputs": [], 67 | "metadata": {} 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "source": "df.head() displays first five rows of our datasets.", 72 | "metadata": {} 73 | }, 74 | { 75 | "cell_type": "code", 76 | "source": "df.corr()", 77 | "execution_count": null, 78 | "outputs": [], 79 | "metadata": {} 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "source": "**df.corr()** compute pairwise correlation of columns.Correlation shows how the two variables are related to each other.Positive values shows as one variable increases other variable increases as well. Negative values shows as one variable increases other variable decreases.Bigger the values,more strongly two varibles are correlated and viceversa.", 84 | "metadata": {} 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "source": "**Visualising correlation using Seaborn library**", 89 | "metadata": {} 90 | }, 91 | { 92 | "cell_type": "code", 93 | "source": "correlation = df.corr()\nplt.figure(figsize=(10,10))\nsns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')\n\nplt.title('Correlation between different fearures')", 94 | "execution_count": null, 95 | "outputs": [], 96 | "metadata": {} 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "source": "**Doing some visualisation before moving onto PCA**", 101 | "metadata": {} 102 | }, 103 | { 104 | "cell_type": "code", 105 | "source": "df['sales'].unique()", 106 | "execution_count": null, 107 | "outputs": [], 108 | "metadata": {} 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "source": "Here we are printing all the unique values in **sales** columns", 113 | "metadata": {} 114 | }, 115 | { 116 | "cell_type": "code", 117 | "source": "sales=df.groupby('sales').sum()\nsales", 118 | "execution_count": null, 119 | "outputs": [], 120 | "metadata": {} 121 | }, 122 | { 123 | "cell_type": "code", 124 | "source": "df['sales'].unique()", 125 | "execution_count": null, 126 | "outputs": [], 127 | "metadata": {} 128 | }, 129 | { 130 | "cell_type": "code", 131 | "source": "groupby_sales=df.groupby('sales').mean()\ngroupby_sales", 132 | "execution_count": null, 133 | "outputs": [], 134 | "metadata": {} 135 | }, 136 | { 137 | "cell_type": "code", 138 | "source": "IT=groupby_sales['satisfaction_level'].IT\nRandD=groupby_sales['satisfaction_level'].RandD\naccounting=groupby_sales['satisfaction_level'].accounting\nhr=groupby_sales['satisfaction_level'].hr\nmanagement=groupby_sales['satisfaction_level'].management\nmarketing=groupby_sales['satisfaction_level'].marketing\nproduct_mng=groupby_sales['satisfaction_level'].product_mng\nsales=groupby_sales['satisfaction_level'].sales\nsupport=groupby_sales['satisfaction_level'].support\ntechnical=groupby_sales['satisfaction_level'].technical\ntechnical", 139 | "execution_count": null, 140 | "outputs": [], 141 | "metadata": {} 142 | }, 143 | { 144 | "cell_type": "code", 145 | "source": "\ndepartment_name=('sales', 'accounting', 'hr', 'technical', 'support', 'management',\n 'IT', 'product_mng', 'marketing', 'RandD')\ndepartment=(sales, accounting, hr, technical, support, management,\n IT, product_mng, marketing, RandD)\ny_pos = np.arange(len(department))\nx=np.arange(0,1,0.1)\n\nplt.barh(y_pos, department, align='center', alpha=0.8)\nplt.yticks(y_pos,department_name )\nplt.xlabel('Satisfaction level')\nplt.title('Mean Satisfaction Level of each department')", 146 | "execution_count": null, 147 | "outputs": [], 148 | "metadata": {} 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "source": "# Principal Component Analysis", 153 | "metadata": {} 154 | }, 155 | { 156 | "cell_type": "code", 157 | "source": "df.head()", 158 | "execution_count": null, 159 | "outputs": [], 160 | "metadata": {} 161 | }, 162 | { 163 | "cell_type": "code", 164 | "source": "df_drop=df.drop(labels=['sales','salary'],axis=1)\ndf_drop.head()", 165 | "execution_count": null, 166 | "outputs": [], 167 | "metadata": {} 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "source": "**df.drop()** is the method to drop the columns in our dataframe", 172 | "metadata": {} 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "source": "Now we need to bring \"left\" column to the front as it is the label and not the feature.", 177 | "metadata": {} 178 | }, 179 | { 180 | "cell_type": "code", 181 | "source": "cols = df_drop.columns.tolist()\ncols", 182 | "execution_count": null, 183 | "outputs": [], 184 | "metadata": {} 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "source": "Here we are converting columns of the dataframe to list so it would be easier for us to reshuffle the columns.We are going to use cols.insert method", 189 | "metadata": {} 190 | }, 191 | { 192 | "cell_type": "code", 193 | "source": "cols.insert(0, cols.pop(cols.index('left')))", 194 | "execution_count": null, 195 | "outputs": [], 196 | "metadata": {} 197 | }, 198 | { 199 | "cell_type": "code", 200 | "source": "cols", 201 | "execution_count": null, 202 | "outputs": [], 203 | "metadata": {} 204 | }, 205 | { 206 | "cell_type": "code", 207 | "source": "df_drop = df_drop.reindex(columns= cols)", 208 | "execution_count": null, 209 | "outputs": [], 210 | "metadata": {} 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "source": "By using df_drop.reindex(columns= cols) we are converting list to columns again", 215 | "metadata": {} 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "source": "Now we are separating features of our dataframe from the labels.", 220 | "metadata": {} 221 | }, 222 | { 223 | "cell_type": "code", 224 | "source": "X = df_drop.iloc[:,1:8].values\ny = df_drop.iloc[:,0].values\nX", 225 | "execution_count": null, 226 | "outputs": [], 227 | "metadata": {} 228 | }, 229 | { 230 | "cell_type": "code", 231 | "source": "y", 232 | "execution_count": null, 233 | "outputs": [], 234 | "metadata": {} 235 | }, 236 | { 237 | "cell_type": "code", 238 | "source": "np.shape(X)", 239 | "execution_count": null, 240 | "outputs": [], 241 | "metadata": {} 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "source": "Thus X is now matrix with 14999 rows and 7 columns", 246 | "metadata": {} 247 | }, 248 | { 249 | "cell_type": "code", 250 | "source": "np.shape(y)", 251 | "execution_count": null, 252 | "outputs": [], 253 | "metadata": {} 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "source": "y is now matrix with 14999 rows and 1 column", 258 | "metadata": {} 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "source": "# 4) Data Standardisation\nStandardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). It is useful to standardize attributes for a model.\nStandardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data ", 263 | "metadata": {} 264 | }, 265 | { 266 | "cell_type": "code", 267 | "source": "from sklearn.preprocessing import StandardScaler\nX_std = StandardScaler().fit_transform(X)", 268 | "execution_count": null, 269 | "outputs": [], 270 | "metadata": {} 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "source": "# 5) Computing Eigenvectors and Eigenvalues:\nBefore computing Eigen vectors and values we need to calculate covariance matrix.", 275 | "metadata": {} 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "source": "## Covariance matrix", 280 | "metadata": {} 281 | }, 282 | { 283 | "cell_type": "code", 284 | "source": "mean_vec = np.mean(X_std, axis=0)\ncov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)\nprint('Covariance matrix \\n%s' %cov_mat)", 285 | "execution_count": null, 286 | "outputs": [], 287 | "metadata": {} 288 | }, 289 | { 290 | "cell_type": "code", 291 | "source": "print('NumPy covariance matrix: \\n%s' %np.cov(X_std.T))", 292 | "execution_count": null, 293 | "outputs": [], 294 | "metadata": {} 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "source": "Equivalently we could have used Numpy np.cov to calculate covariance matrix", 299 | "metadata": {} 300 | }, 301 | { 302 | "cell_type": "code", 303 | "source": "plt.figure(figsize=(8,8))\nsns.heatmap(cov_mat, vmax=1, square=True,annot=True,cmap='cubehelix')\n\nplt.title('Correlation between different features')", 304 | "execution_count": null, 305 | "outputs": [], 306 | "metadata": {} 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "source": "# Eigen decomposition of the covariance matrix", 311 | "metadata": {} 312 | }, 313 | { 314 | "cell_type": "code", 315 | "source": "eig_vals, eig_vecs = np.linalg.eig(cov_mat)\n\nprint('Eigenvectors \\n%s' %eig_vecs)\nprint('\\nEigenvalues \\n%s' %eig_vals)", 316 | "execution_count": null, 317 | "outputs": [], 318 | "metadata": {} 319 | }, 320 | { 321 | "cell_type": "code", 322 | "source": "# 6) Selecting Principal Components¶", 323 | "execution_count": null, 324 | "outputs": [], 325 | "metadata": {} 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "source": "# 6) Selecting Principal Components\n\nT\nIn order to decide which eigenvector(s) can dropped without losing too much information for the construction of lower-dimensional subspace, we need to inspect the corresponding eigenvalues: The eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data; those are the ones can be dropped.", 330 | "metadata": {} 331 | }, 332 | { 333 | "cell_type": "code", 334 | "source": "# Make a list of (eigenvalue, eigenvector) tuples\neig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]\n\n# Sort the (eigenvalue, eigenvector) tuples from high to low\neig_pairs.sort(key=lambda x: x[0], reverse=True)\n\n# Visually confirm that the list is correctly sorted by decreasing eigenvalues\nprint('Eigenvalues in descending order:')\nfor i in eig_pairs:\n print(i[0])", 335 | "execution_count": null, 336 | "outputs": [], 337 | "metadata": {} 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "source": "**Explained Variance**\nAfter sorting the eigenpairs, the next question is \"how many principal components are we going to choose for our new feature subspace?\" A useful measure is the so-called \"explained variance,\" which can be calculated from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to each of the principal components.", 342 | "metadata": {} 343 | }, 344 | { 345 | "cell_type": "code", 346 | "source": "tot = sum(eig_vals)\nvar_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]", 347 | "execution_count": null, 348 | "outputs": [], 349 | "metadata": {} 350 | }, 351 | { 352 | "cell_type": "code", 353 | "source": "with plt.style.context('dark_background'):\n plt.figure(figsize=(6, 4))\n\n plt.bar(range(7), var_exp, alpha=0.5, align='center',\n label='individual explained variance')\n plt.ylabel('Explained variance ratio')\n plt.xlabel('Principal components')\n plt.legend(loc='best')\n plt.tight_layout()", 354 | "execution_count": null, 355 | "outputs": [], 356 | "metadata": {} 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "source": "The plot above clearly shows that maximum variance (somewhere around 26%) can be explained by the first principal component alone. The second,third,fourth and fifth principal component share almost equal amount of information.Comparatively 6th and 7th components share less amount of information as compared to the rest of the Principal components.But those information cannot be ignored since they both contribute almost 17% of the data.But we can drop the last component as it has less than 10% of the variance", 361 | "metadata": {} 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "source": "**Projection Matrix**", 366 | "metadata": {} 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "source": "The construction of the projection matrix that will be used to transform the Human resouces analytics data onto the new feature subspace. Suppose only 1st and 2nd principal component shares the maximum amount of information say around 90%.Hence we can drop other components. Here, we are reducing the 7-dimensional feature space to a 2-dimensional feature subspace, by choosing the “top 2” eigenvectors with the highest eigenvalues to construct our d×k-dimensional eigenvector matrix W", 371 | "metadata": {} 372 | }, 373 | { 374 | "cell_type": "code", 375 | "source": "matrix_w = np.hstack((eig_pairs[0][1].reshape(7,1), \n eig_pairs[1][1].reshape(7,1)\n ))\nprint('Matrix W:\\n', matrix_w)", 376 | "execution_count": null, 377 | "outputs": [], 378 | "metadata": {} 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "source": "**Projection Onto the New Feature Space**\nIn this last step we will use the 7×2-dimensional projection matrix W to transform our samples onto the new subspace via the equation\n**Y=X×W**", 383 | "metadata": {} 384 | }, 385 | { 386 | "cell_type": "code", 387 | "source": "Y = X_std.dot(matrix_w)\nY", 388 | "execution_count": null, 389 | "outputs": [], 390 | "metadata": {} 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "source": "# PCA in scikit-learn", 395 | "metadata": {} 396 | }, 397 | { 398 | "cell_type": "code", 399 | "source": "from sklearn.decomposition import PCA\npca = PCA().fit(X_std)\nplt.plot(np.cumsum(pca.explained_variance_ratio_))\nplt.xlim(0,7,1)\nplt.xlabel('Number of components')\nplt.ylabel('Cumulative explained variance')", 400 | "execution_count": null, 401 | "outputs": [], 402 | "metadata": {} 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "source": "The above plot shows almost 90% variance by the first 6 components. Therfore we can drop 7th component.", 407 | "metadata": {} 408 | }, 409 | { 410 | "cell_type": "code", 411 | "source": "from sklearn.decomposition import PCA \nsklearn_pca = PCA(n_components=6)\nY_sklearn = sklearn_pca.fit_transform(X_std)", 412 | "execution_count": null, 413 | "outputs": [], 414 | "metadata": {} 415 | }, 416 | { 417 | "cell_type": "code", 418 | "source": "print(Y_sklearn)", 419 | "execution_count": null, 420 | "outputs": [], 421 | "metadata": {} 422 | }, 423 | { 424 | "cell_type": "code", 425 | "source": "Y_sklearn.shape", 426 | "execution_count": null, 427 | "outputs": [], 428 | "metadata": {} 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "source": "Thus Principal Component Analysis is used to remove the redundant features from the datasets without losing much information.These features are low dimensional in nature.The first component has the highest variance followed by second, third and so on.PCA works best on data set having 3 or higher dimensions. Because, with higher dimensions, it becomes increasingly difficult to make interpretations from the resultant cloud of data.", 433 | "metadata": {} 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "source": "You can find my notebook on Github: \n(\"https://github.com/nirajvermafcb/Principal-component-analysis-PCA-/blob/master/Principal%2Bcomponent%2Banalysis.ipynb\")", 438 | "metadata": {} 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "source": "Here is my notebook for Principal Component Analysis with Scikit-learn:\n(https://www.kaggle.com/nirajvermafcb/d/nsrose7224/crowdedness-at-the-campus-gym/principal-component-analysis-with-scikit-learn)", 443 | "metadata": {} 444 | }, 445 | { 446 | "cell_type": "code", 447 | "source": null, 448 | "execution_count": null, 449 | "outputs": [], 450 | "metadata": {} 451 | } 452 | ] 453 | } --------------------------------------------------------------------------------