A fully configured DagsHub environment that suits YOUR needs.\n",
24 | "\n",
25 | "---\n",
26 | "\n",
27 | "With this notebook, you can easily pull all of your project’s components from DagsHub to the Colab runtime, train the model, log the experiments, version the changes, and push them to DagsHub remotes.\n",
1149 | " "
1150 | ]
1151 | },
1152 | "metadata": {},
1153 | "execution_count": 36
1154 | }
1155 | ]
1156 | },
1157 | {
1158 | "cell_type": "code",
1159 | "source": [],
1160 | "metadata": {
1161 | "id": "JqYo0nIIKlp4"
1162 | },
1163 | "execution_count": null,
1164 | "outputs": []
1165 | },
1166 | {
1167 | "cell_type": "code",
1168 | "source": [
1169 | "# Using Logistic Regression for the second submission, without parameter tuning\n",
1170 | "\n",
1171 | "logreg = LogisticRegression()\n",
1172 | "\n",
1173 | "logreg.fit(X, y)"
1174 | ],
1175 | "metadata": {
1176 | "colab": {
1177 | "base_uri": "https://localhost:8080/"
1178 | },
1179 | "id": "Z3d4abnbLOVA",
1180 | "outputId": "793441b2-fc00-42b9-d9c4-62c1673918be"
1181 | },
1182 | "execution_count": 48,
1183 | "outputs": [
1184 | {
1185 | "output_type": "execute_result",
1186 | "data": {
1187 | "text/plain": [
1188 | "LogisticRegression()"
1189 | ]
1190 | },
1191 | "metadata": {},
1192 | "execution_count": 48
1193 | }
1194 | ]
1195 | },
1196 | {
1197 | "cell_type": "code",
1198 | "source": [
1199 | "predictions_2 = logreg.predict(X_test)\n",
1200 | "predictions_2.shape"
1201 | ],
1202 | "metadata": {
1203 | "colab": {
1204 | "base_uri": "https://localhost:8080/"
1205 | },
1206 | "id": "6XhIZazMLZpX",
1207 | "outputId": "0a5e7a70-d538-4a3c-9ace-a8de21b45133"
1208 | },
1209 | "execution_count": 49,
1210 | "outputs": [
1211 | {
1212 | "output_type": "execute_result",
1213 | "data": {
1214 | "text/plain": [
1215 | "(418,)"
1216 | ]
1217 | },
1218 | "metadata": {},
1219 | "execution_count": 49
1220 | }
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "code",
1225 | "source": [
1226 | "output = pd.DataFrame({'PassengerId': titanic_test.PassengerId, 'Survived': predictions_2})\n",
1227 | "output.to_csv('submission_2.csv', index=False)\n",
1228 | "print(\"Your submission was successfully saved!\")"
1229 | ],
1230 | "metadata": {
1231 | "colab": {
1232 | "base_uri": "https://localhost:8080/"
1233 | },
1234 | "id": "AqlpwthdL8FI",
1235 | "outputId": "650108c6-80a0-441f-a280-df188cd1dd06"
1236 | },
1237 | "execution_count": 50,
1238 | "outputs": [
1239 | {
1240 | "output_type": "stream",
1241 | "name": "stdout",
1242 | "text": [
1243 | "Your submission was successfully saved!\n"
1244 | ]
1245 | }
1246 | ]
1247 | },
1248 | {
1249 | "cell_type": "code",
1250 | "source": [],
1251 | "metadata": {
1252 | "id": "vu9T6D4UMJkn"
1253 | },
1254 | "execution_count": null,
1255 | "outputs": []
1256 | }
1257 | ]
1258 | }
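The logistic-regression cells above depend on `X`, `y`, `X_test`, and `titanic_test` defined in earlier cells that are not shown here. As a minimal, self-contained sketch of the same fit → predict → save workflow, with synthetic stand-ins for the Titanic features and placeholder PassengerId values (both are assumptions, not the notebook's real data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: the real X, y, X_test and titanic_test come from
# earlier (elided) cells of the notebook.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # stand-in training features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # stand-in binary "Survived" labels
X_test = rng.normal(size=(418, 4))            # 418 test rows, matching the shape above

# Fit without any hyperparameter tuning, as in the notebook
logreg = LogisticRegression()
logreg.fit(X, y)
predictions_2 = logreg.predict(X_test)

# Kaggle expects a two-column CSV: PassengerId, Survived
passenger_ids = np.arange(1, len(predictions_2) + 1)  # placeholder IDs
output = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions_2})
output.to_csv("submission_2.csv", index=False)
print("Your submission was successfully saved!")
```

The `(418,)` shape check in the notebook is a quick sanity test that one prediction was produced per test passenger before writing the submission file.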
--------------------------------------------------------------------------------
/decision-tree-notes.ipynb:
--------------------------------------------------------------------------------
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat_minor":4,"nbformat":4,"cells":[{"source":"","metadata":{},"cell_type":"markdown","outputs":[],"execution_count":0},{"cell_type":"markdown","source":"## Decision Tree\n\nThe decision tree algorithm is one of the most versatile algorithms in machine learning; it can perform both classification and regression analysis. It is powerful, works well with complex datasets, and is easy to understand and read, which makes it popular. When coupled with ensemble techniques – which we will learn about soon – it performs even better.\nAs the name suggests, this algorithm works by dividing the whole dataset into a tree-like structure based on some rules and conditions, and then gives predictions based on those conditions.\nLet’s understand the decision tree approach with a basic scenario. \nSuppose it’s Friday night and you are not able to decide if you should go out or stay at home. Let the decision tree decide it for you.\n\n\n\n \nAlthough we may or may not use a decision tree for such decisions, this was a basic example to help you understand how a decision tree makes a decision.\nSo how did it work?\n*\tIt selects a root node based on a given condition, e.g. our root node was chosen as time >10 pm.\n*\tThen, the root node was split into child nodes based on the given condition. The right child node in the above figure fulfilled the condition, so no more questions were asked.\n*\tThe left child node didn’t fulfil the condition, so again it was split based on a new condition.\n*\tThis process continues until all the conditions are met or until you reach the predefined depth of your tree, e.g. 
the depth of our tree is 3, and it reached there when all the conditions were exhausted.\n\nLet’s see how the parent nodes and splitting conditions are chosen.\n\n#### Decision Tree for Regression\nWhen performing regression with a decision tree, we try to divide the given values of X into distinct and non-overlapping regions, e.g. for a set of possible values X1, X2,..., Xp; we will try to divide them into J distinct and non-overlapping regions R1, R2, . . . , RJ.\nFor a given observation falling into the region Rj, the prediction is equal to the mean of the response (y) values of the training observations (x) in the region Rj. \nThe regions R1, R2, . . . , RJ are selected in a way that minimizes the following sum of squared residuals:\n\n RSS = Σ (j=1..J) Σ (i ∈ Rj) (yi – yRj)^2 \n \nWhere yRj (the second term) is the mean of all the response values in the region ‘j’.\n\n\n\n#### Recursive binary splitting (greedy approach)\nAs mentioned above, we try to divide the X values into J regions, but it is computationally very expensive to consider every possible partition of the X values into J regions. Thus, the decision tree opts for a top-down greedy approach in which a node is divided into two regions based on a given condition, i.e. not every node is split, but the ones which satisfy the condition are split into two branches. It is called greedy because it makes the best split at the current step rather than looking ahead for a split that would produce a better tree in later steps. It picks a feature Xj and a threshold value (say s) that divide the observations into the two regions R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj >= s}, so as to minimize:\n\n Σ (i: xi ∈ R1(j,s)) (yi – yR1)^2 + Σ (i: xi ∈ R2(j,s)) (yi – yR2)^2 \n \nHere, j and s are found such that this expression has the minimum value, and the regions R1, R2 are selected based on those values of j and s.\nSimilarly, more regions are split out of the regions created above based on some condition with the same logic. 
This continues until a (predefined) stopping criterion is reached.\nOnce all the regions are split, the prediction for a region is the mean of the observations in that region.\n\nThe process described above has a high chance of overfitting the training data, as the resulting tree can be very complex. \n\n\n### Classification Trees\n\nRegression trees are used for quantitative data. In the case of qualitative or categorical data, we use classification trees. In regression trees, we split the nodes based on the RSS criterion, but in classification it is done using the classification error rate, Gini impurity and entropy.\nLet’s understand these terms in detail.\n\n#### Entropy\nEntropy is the measure of randomness in the data. In other words, it gives the impurity present in the dataset:\n\n E = -Σ pi*log2(pi) \n \nwhere pi is the proportion of observations in class i.\nWhen we split a node into two regions and put different observations in the two regions, the main goal is to reduce the entropy, i.e. reduce the randomness in the region and divide our data more cleanly than in the previous node. If splitting the node doesn’t lead to a reduction in entropy, we try to split based on a different condition, or we stop. 
\nA region is clean (low entropy) when it contains data with the same labels, and random when there is a mixture of labels present (high entropy).\nLet’s suppose there are ‘m’ observations and we need to classify them into categories 1 and 2.\nLet’s say that category 1 has ‘n’ observations and category 2 has ‘m-n’ observations.\n\np = n/m and q = (m-n)/m = 1-p\n\nThen, the entropy for the given set is:\n\n\n E = -p*log2(p) – q*log2(q) \n \n \nWhen all the observations belong to category 1, then p = 1; when all observations belong to category 2, then p = 0. In both cases E = 0, as there is no randomness in the categories.\nIf half of the observations are in category 1 and the other half in category 2, then p = 1/2 and q = 1/2, and the entropy is at its maximum, E = 1.\n\n\n\n \n\n#### Information Gain\nInformation gain calculates the decrease in entropy after splitting a node. It is the difference between the entropies before and after the split. The more the information gain, the more entropy is removed:\n\n Gain(T, X) = E(T) – Σ (|Tv|/|T|) * E(Tv) \n \nWhere T is the parent node before the split, X is the attribute used to split T, and the sum runs over the child nodes Tv produced by the split.\n\nA tree split on the basis of entropy and information gain looks like:\n\n\n\n#### Gini Impurity\nAccording to Wikipedia, ‘Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it was randomly labelled according to the distribution of labels in the subset.’\nIt is calculated by summing, over all classes, the probability of picking an observation of a given class multiplied by the probability of misclassifying it.\nLet’s suppose there are k classes; then the Gini impurity is given as:\n\n G = Σ (i=1..k) pi*(1 – pi) = 1 – Σ pi^2 \n \nThe Gini impurity value lies between 0 and 1, with 0 meaning no impurity and 1 denoting a random distribution.\nThe node for which the Gini impurity is lowest is selected as the root node to split.\n\n\nA tree split on the basis of the Gini impurity value looks like:\n\n\n\n\n\n","metadata":{"id":"XDcUAXjpGqVJ"}},{"cell_type":"markdown","source":"### Maths behind Decision Tree Classifier\nBefore we see the Python implementation of the decision tree, let's first understand the math behind decision tree classification. We will see how all the terms mentioned above are used for splitting.\n\nWe will use a simple dataset which contains information about students of different classes and genders, and see whether they stay in the school's hostel or not.","metadata":{"id":"wGKGzVu0GqVS"}},{"cell_type":"markdown","source":"This is how our dataset looks:\n\n\n","metadata":{"id":"kHVbisn7GqVT"}},{"cell_type":"markdown","source":"Let's try and understand how the root node is selected by calculating the Gini impurity. We will use the data mentioned above.\n\nWe have two features which we can use for nodes: \"Class\" and \"Gender\".\nWe will calculate the Gini impurity for each of the features and then select the feature with the least Gini impurity.\n\nLet's review the formula for calculating Gini impurity:\n\n\n\nLet's start with \"Class\"; we will calculate the Gini impurity for all the different values in \"Class\". 
\n\nThis is how our decision tree node is selected, by calculating the Gini impurity for each node individually.\nIf the number of features increases, we just need to repeat the same steps after the selection of the root node.","metadata":{"id":"XEAo25FiGqVT"}},{"cell_type":"markdown","source":"We will try and find the root node for the same dataset by calculating entropy and information gain.\n\nDataSet:\n\n\n\nWe have two features, and we will try to choose the root node by calculating the information gain obtained by splitting on each feature.\n\nLet's review the formulas for entropy and information gain:\n\n\n\nLet's start with the feature \"class\":\n\n\n\nLet's see the information gain from the feature \"gender\":\n\n\n\n","metadata":{"id":"IWlyaD_4GqVU"}},{"cell_type":"markdown","source":"### Different Algorithms for Decision Tree\n\n\n* ID3 (Iterative Dichotomiser): It is one of the algorithms used to construct decision trees for classification. It uses information gain as the criterion for finding the root nodes and splitting them. It only accepts categorical attributes.\n\n* C4.5: It is an extension of the ID3 algorithm, and better than ID3 as it handles both continuous and discrete values. It is also used for classification purposes.\n\n\n* Classification and Regression Trees (CART): It is the most popular algorithm used for constructing decision trees. It uses Gini impurity as the default criterion for selecting root nodes; however, one can use \"entropy\" as the criterion as well. This algorithm works on both regression and classification problems. We will use this algorithm in our Python implementation. \n\n\nEntropy and Gini impurity can be used interchangeably; it doesn't affect the result much. However, Gini is easier to compute than entropy, since entropy involves a log calculation. That's why the CART algorithm uses Gini as the default criterion.\n\nIf we plot Gini vs. entropy, we can see there is not much difference between them:\n\n\n\n","metadata":{"id":"NGF42IqYGqVU"}},{"cell_type":"markdown","source":"##### Advantages of Decision Tree:\n\n * It can be used for both Regression and Classification problems.\n * Decision Trees are very easy to grasp, as the splitting rules are clearly stated.\n * Complex decision tree models become simple when visualized; they can be understood just by looking at the tree.\n * Scaling and normalization are not needed.\n\n\n##### Disadvantages of Decision Tree:\n\n\n * A small change in the data can cause instability in the model because of the greedy approach.\n * The probability of overfitting is very high for Decision Trees.\n * It takes more time to train a decision tree model than other classification algorithms.","metadata":{"id":"vPxt4lUfGqVV"}},{"cell_type":"markdown","source":"## Business Case: based on the given features, we need to find whether an employee will leave the company or not.","metadata":{"id":"81MWsN13GqVV"}},{"cell_type":"code","source":"## Importing the libraries\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n%matplotlib inline","metadata":{"executionInfo":{"elapsed":1748,"status":"ok","timestamp":1619774881813,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"GTvp_TcrGqVW"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Target variable:-","metadata":{"id":"h9OlcUPLGqVW"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Loading the 
data\ndata=pd.read_csv('HR-Employee-Attrition.csv')","metadata":{"id":"Tt_FHAtHGqVX"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()\npd.set_option('display.max_columns',None)","metadata":{"id":"pHCvnVhtGqVX","outputId":"14dfc600-ab01-4f18-cc40-38c872c064c2"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Getting some rows\ndata.columns","metadata":{"id":"3hvXDP0MGqVY","outputId":"deccd908-f355-4142-db22-a46da4485ff0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"pd.set_option('display.max_columns',None)","metadata":{"id":"XrfN1JniGqVY"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Basic Checks","metadata":{"id":"SifTu1HxGqVZ"}},{"cell_type":"code","source":"data.tail()","metadata":{"id":"VaotZpJ3GqVZ","outputId":"f7c60733-2736-484d-8805-09fe9bd236a0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.describe()","metadata":{"id":"DQJs1mkVGqVZ","outputId":"7c6c572f-821a-4257-b1cf-a26980f5507f"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.describe(include=['O'])","metadata":{"id":"nHxP8RSpGqVa","outputId":"685d7fea-c131-4bf3-d522-1f99123f89de"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.info()","metadata":{"id":"2grCr8khGqVa"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Exploratory Data Analysis","metadata":{"id":"lgxJSrcjGqVa"}},{"cell_type":"code","source":"## Univariate Analysis\n!pip install sweetviz","metadata":{"executionInfo":{"elapsed":6423,"status":"ok","timestamp":1619775151488,"user":{"displayName":"DHANUSH 
Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"4ptW3IqPGqVa","outputId":"bb95af17-7530-441b-c08b-5d2bf46de589"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import sweetviz as sv\nmy_report = sv.analyze(data)\nmy_report.show_html() # Default arguments will generate to \"SWEETVIZ_REPORT.html\"","metadata":{"executionInfo":{"elapsed":6980,"status":"error","timestamp":1619775152057,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"IYqXdgGvGqVb","outputId":"361846ee-10bc-4fa4-aad0-1faf54824ef5"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Insights from univariate analysis-->task","metadata":{"executionInfo":{"elapsed":6968,"status":"aborted","timestamp":1619775152048,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"UGde0x-iGqVb"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Bivaraite Analysis checking relationship of all variables with respect to ","metadata":{"executionInfo":{"elapsed":6967,"status":"aborted","timestamp":1619775152050,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"t-ljpiZyGqVb"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"categorical_col = []\nfor column in data.columns:\n if data[column].dtype == object and len(data[column].unique()) <= 50:\n categorical_col.append(column)\n print(f\"{column} : {data[column].unique()}\")\n 
print(\"====================================\")","metadata":{"executionInfo":{"elapsed":6961,"status":"aborted","timestamp":1619775152051,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"bhAo8-QaGqVc"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Categorical Data","metadata":{"executionInfo":{"elapsed":6955,"status":"aborted","timestamp":1619775152052,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"rQ_KEiy5GqVc"}},{"cell_type":"code","source":"## Create a new dataframe with categorical variables only\ndata1=data[['Attrition',\n 'BusinessTravel',\n 'Department',\n 'EducationField',\n 'Gender',\n 'JobRole',\n 'MaritalStatus',\n 'Over18',\n 'OverTime']]","metadata":{"executionInfo":{"elapsed":6952,"status":"aborted","timestamp":1619775152053,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"FX0KQNHSGqVc"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data1","metadata":{"executionInfo":{"elapsed":6947,"status":"aborted","timestamp":1619775152054,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"97QP0vXTGqVc"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Plotting how every categorical feature correlate with the \"target\"\nplt.figure(figsize=(25,25), facecolor='white')\nplotnumber = 1\n\nfor column in data1:\n if plotnumber<=16 :\n ax = plt.subplot(4,4,plotnumber)\n sns.countplot(x=data1[column].dropna(axis=0)\n 
,hue=data.Attrition)\n plt.xlabel(column,fontsize=20)\n plt.ylabel('Attrition',fontsize=20)\n plotnumber+=1\nplt.tight_layout()","metadata":{"executionInfo":{"elapsed":6942,"status":"aborted","timestamp":1619775152055,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"K11mCBysGqVd"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"plt.figure(figsize=(40,25), facecolor='white')\nsns.countplot(x='JobRole',hue='Attrition',data=data1)\nplt.xlabel('JobRole',fontsize=40)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"numerical_col = []\nfor column in data.columns:\n if data[column].dtype == int and len(data[column].unique()) >= 10:\n numerical_col.append(column)\n ","metadata":{"executionInfo":{"elapsed":6940,"status":"aborted","timestamp":1619775152056,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"4aBbGbVoGqVd"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.info()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"numerical_col","metadata":{"scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Discrete data","metadata":{"id":"xMnzw7AxGqVd"}},{"cell_type":"code","source":"discrete_col = []\nfor column in data.columns:\n if data[column].dtype == int and len(data[column].unique()) <= 10:\n discrete_col.append(column)\n ","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"discrete_col","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data3=data[['Education',\n 'EmployeeCount',\n 'EnvironmentSatisfaction',\n 'JobInvolvement',\n 'JobLevel',\n 'JobSatisfaction',\n 
'NumCompaniesWorked',\n 'PerformanceRating',\n 'RelationshipSatisfaction',\n 'StandardHours',\n 'StockOptionLevel',\n 'TrainingTimesLastYear',\n 'WorkLifeBalance']]","metadata":{"id":"NTWZl0nTGqVe"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Plotting how every discrete feature correlate with the \"target\"\nplt.figure(figsize=(20,25), facecolor='white')\nplotnumber = 1\n\nfor column in data3:\n if plotnumber<=16 :\n ax = plt.subplot(4,4,plotnumber)\n sns.countplot(x=data3[column].dropna(axis=0)\n ,hue=data.Attrition)\n plt.xlabel(column,fontsize=20)\n plt.ylabel('Attrition',fontsize=20)\n plotnumber+=1\nplt.tight_layout()","metadata":{"id":"HNzZGjgTGqVe","outputId":"a4453bd4-53e0-4037-c7aa-eae0dcc3d216"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"numerical_col","metadata":{"id":"0_mHNEfJGqVe","outputId":"32e15048-df45-4877-80ef-2be800b38ef5"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data2=data[['Age',\n 'DailyRate',\n 'DistanceFromHome',\n 'EmployeeNumber',\n 'HourlyRate',\n 'MonthlyIncome',\n 'MonthlyRate',\n 'NumCompaniesWorked',\n 'PercentSalaryHike',\n 'TotalWorkingYears',\n 'YearsAtCompany',\n 'YearsInCurrentRole',\n 'YearsSinceLastPromotion',\n 'YearsWithCurrManager']]","metadata":{"id":"5DwOl-OhGqVf"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Plotting how every numerical feature correlate with the \"target\"\nplt.figure(figsize=(20,25), facecolor='white')\nplotnumber = 1\n\nfor column in data2:\n if plotnumber<=16 :\n ax = plt.subplot(4,4,plotnumber)\n sns.histplot(x=data2[column].dropna(axis=0)\n ,hue=data.Attrition)\n plt.xlabel(column,fontsize=20)\n plt.ylabel('Attrition',fontsize=20)\n 
plotnumber+=1\nplt.tight_layout()","metadata":{"id":"R82tDyP2GqVf","outputId":"aad28ca6-7152-4015-a2f5-24faa0767ffb"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"","metadata":{"id":"ZPq5HV-GGqVf"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Final conclusions\n# BusinessTravel : Workers who travel a lot are more likely to quit than other employees.\n\n# Department : Workers in Research & Development are more likely to stay than workers in other departments.\n\n# EducationField : Workers with a Human Resources or Technical Degree background are more likely to quit than employees from other fields of education.\n\n# Gender : Males are more likely to quit.\n\n# JobRole : Workers in the Laboratory Technician, Sales Representative, and Human Resources roles are more likely to quit than workers in other positions.\n\n# MaritalStatus : Workers with Single marital status are more likely to quit than the Married and Divorced.\n\n# OverTime : The attrition rate is almost equal","metadata":{"id":"NnQIsJzMGqVf"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Data Preprocessing","metadata":{"id":"Q9k1w33zGqVg"}},{"cell_type":"code","source":"## Checking missing values\ndata.isnull().sum()","metadata":{"id":"Sa-TlTe8GqVg","outputId":"5f8c541e-50a0-4cb2-b110-93fc9a19cdbe"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Categorical data conversion\ndata1.head()","metadata":{"id":"_J_u9FdIGqVg"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Manual encoding of the Attrition feature\ndata.Attrition=data.Attrition.map({'Yes':1,'No':0})","metadata":{"id":"g8jvPI4xGqVg"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Attrition.unique()","metadata":{"id":"n2ddHOLsGqVh","outputId":"ebe1af6b-c561-429d-a9cc-73f1e05e0034"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding BusinessTravel; this feature suggests that workers who travelled 
frequently are more likely to quit, so let's do the\n## manual encoding\ndata.BusinessTravel=data.BusinessTravel.map({'Travel_Frequently':2,'Travel_Rarely':1,'Non-Travel':0})\n","metadata":{"id":"V2APHYAdGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{"id":"mVUQ5MzcGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Department=data.Department.map({'Research & Development':2,'Sales':1,'Human Resources':0})\n","metadata":{"id":"YHE92tgcGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Department\n","metadata":{"id":"6E51sFceGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.EducationField=data.EducationField.map({'Life Sciences':5,'Medical':4,'Marketing':3,'Technical Degree':2,'Other':1,'Human Resources':0 })\n \n ","metadata":{"id":"450XCLRIGqVi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{"id":"5qLTareTGqVi","outputId":"66d31fd4-0b43-4d17-e27e-b9631de4c502","collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding Gender\ndata.Gender=pd.get_dummies(data.Gender,drop_first=True)","metadata":{"id":"VKyEmGfDGqVi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Gender.value_counts()","metadata":{"id":"APyj9DcRGqVi","outputId":"27a620e0-cb69-4dfb-c2a6-264142d2bfd1","scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Gender","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding JobRole\ndata.JobRole=data.JobRole.map({'Laboratory Technician':8,'Sales Executive':7,'Research Scientist':6,'Sales Representative':5,\n 'Human Resources':4,'Manufacturing Director':3,'Healthcare Representative':2,'Manager':1,'Research Director':0 })\n \n \n 
","metadata":{"id":"9FVbyyqkGqVi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.JobRole","metadata":{"id":"ehJ6tv84GqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding MaritalStatus\n\nfrom sklearn.preprocessing import LabelEncoder\n\nlabel = LabelEncoder()\ndata.MaritalStatus=label.fit_transform(data.MaritalStatus)","metadata":{"id":"8Lkw25rOGqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.MaritalStatus","metadata":{"id":"bswiSdTYGqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding OverTime\ndata.OverTime=label.fit_transform(data.OverTime)","metadata":{"id":"9fhBprtnGqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{"id":"xhSE2-g4GqVk"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Feature Selection","metadata":{"id":"eY7sHt4EGqVk"}},{"cell_type":"code","source":"## Checking correlation\n\nplt.figure(figsize=(30, 30))\nsns.heatmap(data2.corr(), annot=True, cmap=\"RdYlGn\", annot_kws={\"size\":15})","metadata":{"id":"eplBC4upGqVk","outputId":"9261a36a-9cf0-44fd-afec-ba6625b76d99","collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Removing constant features\ndata.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis=\"columns\", inplace=True)","metadata":{"id":"0IZ7NazaGqVk"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.describe()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Model Creation","metadata":{"id":"TNccjAhwGqVl"}},{"cell_type":"code","source":"## Creating independent and dependent variable\nX = data.drop('Attrition', axis=1)\ny = 
data.Attrition","metadata":{"id":"hstSuG5WGqVl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Balancing the data\nfrom collections import Counter\nfrom imblearn.over_sampling import SMOTE\nsm=SMOTE()\nprint(Counter(y))\nX_sm,y_sm=sm.fit_resample(X,y)\nprint(Counter(y_sm))","metadata":{"id":"2-vfEK3qGqVl","outputId":"8cc6621c-c747-45d0-b4dc-cd7fd0e6f512"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Preparing training and testing data\nfrom sklearn.model_selection import train_test_split\n\n\n\nX_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.25, random_state=42)","metadata":{"id":"ENFDdjN6GqVl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn.tree import DecisionTreeClassifier\ndt=DecisionTreeClassifier()\ndt.fit(X_train,y_train)\ny_hat=dt.predict(X_test)","metadata":{"id":"bO4z0WfgGqVl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Evaluating the model\nfrom sklearn.metrics import accuracy_score,classification_report,f1_score\n## Training score\ntrain_predict=dt.predict(X_train)\ncc_train=accuracy_score(y_train,train_predict)\ncc_train","metadata":{"id":"oLebPUzFGqVm"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_train,train_predict))","metadata":{"id":"gcZKrmmXGqVm","outputId":"6f8978e5-59ae-45a2-bcec-c8859417599f"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"pd.crosstab(y_train,train_predict)","metadata":{"id":"_nWcvr9AGqVm","outputId":"c6724240-4648-4bba-ac63-86934197e889"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## test 
acc\ntest_acc=accuracy_score(y_test,y_hat)\ntest_acc","metadata":{"id":"SHIIwd0tGqVm","outputId":"de310265-7fce-4677-c0b0-f0c5d67864a8","scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"","metadata":{"id":"OH29Hq0GGqVn"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## test score\ntest_f1=f1_score(y_test,y_hat)\ntest_f1","metadata":{"id":"ndTtoD1dGqVn","outputId":"a345da1f-30bf-4ead-f594-1622fb401d72"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_test,y_hat))","metadata":{"id":"uzKcjjRBGqVn","outputId":"a3f5c57c-c04d-4cfa-89c4-6a474a050da7"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"pd.crosstab(y_test,y_hat)","metadata":{"id":"YGdOtp55GqVn","outputId":"1a17d8de-7e1c-4beb-9672-c00a1bd3b542"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Hyperparameters of DecisionTree","metadata":{"id":"xHp4KJoMGqVn"}},{"cell_type":"markdown","source":"* criterion: The function to measure the quality of a split. Supported criteria are \"gini\" for the Gini impurity and \"entropy\" for the information gain.\n\n\n* splitter: The strategy used to choose the split at each node. 
Supported strategies are \"best\" to choose the best split and \"random\" to choose the best random split.\n\n* max_depth: The maximum depth of the tree, i.e. how deep the decision tree can grow. The deeper the tree, the more splits it has and the more information it captures from the data. In general, a decision tree overfits for large depth values: it fits the training data perfectly but fails to generalize to the test data.\n\n* min_samples_split: The minimum number of samples required to split an internal node. A common range is 2 to 40.\n\n* min_samples_leaf: The minimum number of samples required to be at a leaf node. Similar to min_samples_split, but applied at the leaves, the base of the tree. A common range is 1 to 20.\n","metadata":{"id":"3sXgNhQKGqVo"}},{"cell_type":"code","source":"# Reference: https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn.model_selection import GridSearchCV","metadata":{"id":"hhVF_1VgGqVo"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"params = {\n \"criterion\": (\"gini\", \"entropy\"),\n \"splitter\": (\"best\", \"random\"),\n \"max_depth\": list(range(1, 20)),\n \"min_samples_split\": [2, 3, 4],\n \"min_samples_leaf\": list(range(1, 20)),\n}\n\ntree_clf = DecisionTreeClassifier(random_state=3)\ntree_cv = GridSearchCV(tree_clf, params, scoring=\"f1\", n_jobs=-1, verbose=1, cv=3)\ntree_cv.fit(X_train,y_train)\nbest_params = tree_cv.best_params_\nprint(f\"Best parameters: {best_params}\")","metadata":{"id":"CI6CDPDyGqVo","outputId":"232b8759-cdc3-4fc9-c0ef-f3ef47931b60","scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Output:\n# Fitting 3 folds for each of 4332 candidates, totalling 12996 fits\n# Best parameters: {'criterion': 'entropy', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 
'random'})\n","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"tree_cv.best_params_","metadata":{"id":"G02O7SWoGqVo"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"tree_cv.best_score_","metadata":{"id":"NtJLO8QRGqVo","outputId":"a2ed2226-83ba-4c4f-9db8-1f1aa7124661"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"dt1=DecisionTreeClassifier(criterion='gini',\n max_depth=13,min_samples_leaf=1,\n min_samples_split=2,splitter='random')","metadata":{"id":"BsjmSDTHGqVp"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"dt1.fit(X_train,y_train)","metadata":{"id":"QIpcP_d4GqVp","outputId":"25114b13-8270-4d29-acf8-2145a7581ad8"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"y_hat1=dt1.predict(X_test)","metadata":{"id":"hgxF9WufGqVp"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"test_accuracy=accuracy_score(y_test,y_hat1)\ntest_accuracy","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"test_f1=f1_score(y_test,y_hat1)\ntest_f1","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_test,y_hat1))","metadata":{"id":"SNcxd2n4GqVp","outputId":"4f32d1e2-421c-4fb5-848f-0a9ca1217d73"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## RandomForest Implementation","metadata":{"id":"eOhvjBIxGqVp"}},{"cell_type":"code","source":"from sklearn.ensemble import RandomForestClassifier\n\nrf_clf = 
RandomForestClassifier(n_estimators=100)\nrf_clf.fit(X_train,y_train)","metadata":{"id":"CREynTLyGqVq","outputId":"0dccb0c3-84b6-4106-f441-494ea08fe631"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"y_predict=rf_clf.predict(X_test)","metadata":{"id":"K6yeNrbLGqVq"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_test,y_predict))","metadata":{"id":"2PxP7zz7GqVq","outputId":"7e5549f1-4429-4e55-99d9-c27ce91f2f40"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"f_Score=f1_score(y_test,y_predict)\nf_Score","metadata":{"id":"Q1693iiVGqVq","outputId":"844968d4-550b-4540-9d34-7267d5bac380"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Hyperparameter Tuning","metadata":{"id":"GGQzgT2GGqVq"}},{"cell_type":"markdown","source":"* n_estimators = number of trees in the forest\n* max_features = max number of features considered at each split\n* max_depth = max number of levels in each decision tree\n* min_samples_split = min number of data points placed in a node before the node is split\n* min_samples_leaf = min number of data points allowed in a leaf node\n* bootstrap = method for sampling data points (with or without replacement)","metadata":{"id":"J2NDxMIKGqVq"}},{"cell_type":"code","source":"from sklearn.model_selection import RandomizedSearchCV\n\nn_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]\nmax_features = ['sqrt', 'log2']  # 'auto' was removed in newer scikit-learn versions\nmax_depth = [int(x) for x in np.linspace(10, 110, num=11)]\nmax_depth.append(None)\nmin_samples_split = [2, 5, 10]\nmin_samples_leaf = [1, 2, 4]\nbootstrap = [True, False]\n\nrandom_grid = {'n_estimators': n_estimators, 'max_features': max_features,\n 'max_depth': max_depth, 'min_samples_split': min_samples_split,\n 'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}\n\nrf_clf1 = RandomForestClassifier(random_state=42)\n\nrf_cv = RandomizedSearchCV(estimator=rf_clf1, 
scoring='f1',param_distributions=random_grid, n_iter=100, cv=3, \n verbose=3, random_state=42, n_jobs=-1)\n\nrf_cv.fit(X_train, y_train)\nrf_best_params = rf_cv.best_params_\nprint(f\"Best parameters: {rf_best_params}\")","metadata":{"id":"9xUOaCSsGqVr","outputId":"29e21a0f-eadc-4d00-9c96-183a9563a346"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"rf_clf2 = RandomForestClassifier(**rf_best_params)\nrf_clf2.fit(X_train, y_train)\ny_predict=rf_clf2.predict(X_test)\n# Use a distinct name so the f1_score function is not shadowed\nrf_best_f1=f1_score(y_test,y_predict)","metadata":{"id":"HXDZsw6OGqVr"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"rf_best_f1","metadata":{"id":"e9vcwyZiGqVr","outputId":"38f9e72a-9ba9-4526-a1d5-566414c653af"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Decision Tree Visualization","metadata":{}},{"cell_type":"code","source":"import matplotlib.pyplot as plt","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn import tree\nplt.figure(figsize=(15,10))\ntree.plot_tree(dt,filled=True)","metadata":{"collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"! pip install graphviz","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import graphviz\n# DOT data\ndot_data = tree.export_graphviz(dt, \n feature_names=X.columns, \n class_names=[str(c) for c in dt.classes_],\n filled=True)\n\n# Draw graph\ngraph = graphviz.Source(dot_data, format=\"png\") \ngraph","metadata":{"collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"","metadata":{},"execution_count":null,"outputs":[]}]}
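The notebook above balances the attrition labels with imblearn's SMOTE before splitting and fitting. As a dependency-free sketch of the balancing idea (plain Python only; `random_oversample` is a hypothetical helper, not the imblearn API), random oversampling simply duplicates minority-class rows until the class counts match — note that real SMOTE instead synthesizes new minority points by interpolating between nearest neighbours:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Balance a binary dataset by randomly duplicating minority-class rows.

    Simplified stand-in for SMOTE, which synthesizes new points by
    interpolating between minority-class neighbours instead of duplicating.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    majority, minority = sorted(counts, key=counts.get, reverse=True)
    deficit = counts[majority] - counts[minority]
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    X_bal, y_bal = list(X), list(y)
    for _ in range(deficit):
        X_bal.append(rng.choice(minority_rows))
        y_bal.append(minority)
    return X_bal, y_bal

# Imbalanced toy data: 6 "No" vs 2 "Yes"
X = [[0], [1], [2], [3], [4], [5], [10], [11]]
y = ["No"] * 6 + ["Yes"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 6 samples
```

As with SMOTE, balancing should be applied only to training data, never to the held-out test set.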
--------------------------------------------------------------------------------
/random-forest-social-network.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"source":"","metadata":{},"cell_type":"markdown","outputs":[],"execution_count":0},{"cell_type":"markdown","id":"b8c80528","metadata":{"execution":{"iopub.execute_input":"2023-01-10T16:31:59.598789Z","iopub.status.busy":"2023-01-10T16:31:59.598342Z","iopub.status.idle":"2023-01-10T16:31:59.62834Z","shell.execute_reply":"2023-01-10T16:31:59.6266Z","shell.execute_reply.started":"2023-01-10T16:31:59.598756Z"},"papermill":{"duration":0.008579,"end_time":"2023-01-12T13:50:23.814162","exception":false,"start_time":"2023-01-12T13:50:23.805583","status":"completed"},"tags":[]},"source":["
\n"," In this project, I am using the Random Forest algorithm on the Social Network Ads data to predict whether a sale was successful or not.\n"," \n","
\n","
\n"," Some things to note:\n","
The Random Forest classifier builds a set of decision trees (DTs), each trained on a randomly selected subset of the training set, and then collects the votes from the individual trees to decide the final prediction.\n","
"]},{"cell_type":"code","execution_count":1,"id":"8a1ed862","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:23.873133Z","iopub.status.busy":"2023-01-12T13:50:23.872328Z","iopub.status.idle":"2023-01-12T13:50:25.58106Z","shell.execute_reply":"2023-01-12T13:50:25.5802Z"},"papermill":{"duration":1.720153,"end_time":"2023-01-12T13:50:25.583717","exception":false,"start_time":"2023-01-12T13:50:23.863564","status":"completed"},"tags":[]},"outputs":[],"source":["import pandas as pd\n","import numpy as np\n","import seaborn as sns\n","import matplotlib.pyplot as plt\n","import statsmodels.formula.api as smf\n","from sklearn.model_selection import train_test_split\n","from sklearn.preprocessing import StandardScaler\n","from sklearn.tree import DecisionTreeClassifier\n","from sklearn.ensemble import RandomForestClassifier\n","from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, classification_report"]},{"cell_type":"markdown","id":"fd356bd4","metadata":{"papermill":{"duration":0.008211,"end_time":"2023-01-12T13:50:25.599087","exception":false,"start_time":"2023-01-12T13:50:25.590876","status":"completed"},"tags":[]},"source":["
\n"," Loading the data\n","
"]},{"cell_type":"code","execution_count":2,"id":"941634e5","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.615353Z","iopub.status.busy":"2023-01-12T13:50:25.614339Z","iopub.status.idle":"2023-01-12T13:50:25.633752Z","shell.execute_reply":"2023-01-12T13:50:25.632982Z"},"papermill":{"duration":0.030404,"end_time":"2023-01-12T13:50:25.636371","exception":false,"start_time":"2023-01-12T13:50:25.605967","status":"completed"},"tags":[]},"outputs":[],"source":["#Reading the data file\n","social_data = pd.read_csv('/kaggle/input/social-network/Social_Network_Ads.csv')"]},{"cell_type":"code","execution_count":3,"id":"06b49c59","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.652417Z","iopub.status.busy":"2023-01-12T13:50:25.652052Z","iopub.status.idle":"2023-01-12T13:50:25.672265Z","shell.execute_reply":"2023-01-12T13:50:25.671056Z"},"papermill":{"duration":0.031072,"end_time":"2023-01-12T13:50:25.6747","exception":false,"start_time":"2023-01-12T13:50:25.643628","status":"completed"},"tags":[]},"outputs":[{"data":{"text/html":["
"],"text/plain":[" User ID Gender Age EstimatedSalary Purchased\n","0 15624510 Male 19 19000 0\n","1 15810944 Male 35 20000 0\n","2 15668575 Female 26 43000 0\n","3 15603246 Female 27 57000 0\n","4 15804002 Male 19 76000 0\n","5 15728773 Male 27 58000 0\n","6 15598044 Female 27 84000 0\n","7 15694829 Female 32 150000 1\n","8 15600575 Male 25 33000 0\n","9 15727311 Female 35 65000 0\n","10 15570769 Female 26 80000 0"]},"execution_count":3,"metadata":{},"output_type":"execute_result"}],"source":["#Printing the head for data\n","social_data.head(11)"]},{"cell_type":"code","execution_count":4,"id":"2c3d07a5","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.691067Z","iopub.status.busy":"2023-01-12T13:50:25.690639Z","iopub.status.idle":"2023-01-12T13:50:25.714479Z","shell.execute_reply":"2023-01-12T13:50:25.712914Z"},"papermill":{"duration":0.035015,"end_time":"2023-01-12T13:50:25.717108","exception":false,"start_time":"2023-01-12T13:50:25.682093","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["\n","RangeIndex: 400 entries, 0 to 399\n","Data columns (total 5 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 User ID 400 non-null int64 \n"," 1 Gender 400 non-null object\n"," 2 Age 400 non-null int64 \n"," 3 EstimatedSalary 400 non-null int64 \n"," 4 Purchased 400 non-null int64 \n","dtypes: int64(4), object(1)\n","memory usage: 15.8+ KB\n"]}],"source":["#Checking info for the data\n","social_data.info()"]},{"cell_type":"markdown","id":"03cf9d87","metadata":{"papermill":{"duration":0.007016,"end_time":"2023-01-12T13:50:25.731691","exception":false,"start_time":"2023-01-12T13:50:25.724675","status":"completed"},"tags":[]},"source":["
\n"," Selecting data using iloc - Here we select all rows but only the required columns\n","
"]},{"cell_type":"code","execution_count":5,"id":"b4729237","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.747892Z","iopub.status.busy":"2023-01-12T13:50:25.747463Z","iopub.status.idle":"2023-01-12T13:50:25.753799Z","shell.execute_reply":"2023-01-12T13:50:25.75267Z"},"papermill":{"duration":0.016979,"end_time":"2023-01-12T13:50:25.755995","exception":false,"start_time":"2023-01-12T13:50:25.739016","status":"completed"},"tags":[]},"outputs":[],"source":["#Separating features and target variable\n","X = social_data.iloc[:, [2,3]].values\n","y = social_data.iloc[:,4].values"]},{"cell_type":"code","execution_count":6,"id":"585dc6a8","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.772257Z","iopub.status.busy":"2023-01-12T13:50:25.771914Z","iopub.status.idle":"2023-01-12T13:50:25.776359Z","shell.execute_reply":"2023-01-12T13:50:25.775345Z"},"papermill":{"duration":0.01506,"end_time":"2023-01-12T13:50:25.778527","exception":false,"start_time":"2023-01-12T13:50:25.763467","status":"completed"},"tags":[]},"outputs":[],"source":["#Checking the columns\n","#print(X), print(y)"]},{"cell_type":"markdown","id":"27bf42c5","metadata":{"papermill":{"duration":0.006953,"end_time":"2023-01-12T13:50:25.792902","exception":false,"start_time":"2023-01-12T13:50:25.785949","status":"completed"},"tags":[]},"source":["
\n"," Splitting the dataset into the Training set and Test set\n","
\n"," Building the DT Model using the Training data\n","
Here we create a DT classifier object\n","
"]},{"cell_type":"code","execution_count":12,"id":"7c8f99a6","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.978255Z","iopub.status.busy":"2023-01-12T13:50:25.977587Z","iopub.status.idle":"2023-01-12T13:50:25.981513Z","shell.execute_reply":"2023-01-12T13:50:25.980718Z"},"papermill":{"duration":0.015287,"end_time":"2023-01-12T13:50:25.983487","exception":false,"start_time":"2023-01-12T13:50:25.9682","status":"completed"},"tags":[]},"outputs":[],"source":["#Fitting the model on training data\n","#classifier.fit(X_train, y_train)"]},{"cell_type":"code","execution_count":13,"id":"fafac011","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:26.00157Z","iopub.status.busy":"2023-01-12T13:50:26.000933Z","iopub.status.idle":"2023-01-12T13:50:26.004564Z","shell.execute_reply":"2023-01-12T13:50:26.003829Z"},"papermill":{"duration":0.014905,"end_time":"2023-01-12T13:50:26.006506","exception":false,"start_time":"2023-01-12T13:50:25.991601","status":"completed"},"tags":[]},"outputs":[],"source":["#Creating the instance\n","#classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)"]},{"cell_type":"markdown","id":"61d9f4f3","metadata":{"papermill":{"duration":0.007868,"end_time":"2023-01-12T13:50:26.02249","exception":false,"start_time":"2023-01-12T13:50:26.014622","status":"completed"},"tags":[]},"source":["
\n"," Building the model - Random Forest Classifier\n","
\n"," The model can be tuned further; keep an eye out for an updated version of this notebook!\n","
"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.12"},"papermill":{"default_parameters":{},"duration":11.826605,"end_time":"2023-01-12T13:50:27.035887","environment_variables":{},"exception":null,"input_path":"__notebook__.ipynb","output_path":"__notebook__.ipynb","parameters":{},"start_time":"2023-01-12T13:50:15.209282","version":"2.3.4"}},"nbformat":4,"nbformat_minor":5}
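The notebook above describes Random Forest as a collection of decision trees, each fit on a random subset of the training set, whose votes are combined into the final prediction. A minimal pure-Python sketch of that voting scheme (toy one-feature "stumps" on made-up ad-spend data; names like `train_stump` and `random_forest` are illustrative, not any library API):

```python
import random

def majority_vote(votes):
    """Final ensemble prediction: the label most trees voted for."""
    return max(set(votes), key=votes.count)

def train_stump(sample):
    """Fit a one-split 'tree' (a stump) on (feature, label) pairs."""
    threshold = sum(x for x, _ in sample) / len(sample)
    # Predict the majority label on each side of the threshold.
    right = [lbl for x, lbl in sample if x >= threshold]
    left = [lbl for x, lbl in sample if x < threshold]
    right_lbl = max(set(right), key=right.count) if right else 0
    left_lbl = max(set(left), key=left.count) if left else 0
    return lambda x: right_lbl if x >= threshold else left_lbl

def random_forest(data, n_trees=25, seed=42):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    stumps = [train_stump([rng.choice(data) for _ in data])
              for _ in range(n_trees)]
    return lambda x: majority_vote([s(x) for s in stumps])

# Toy data: low ad spend -> no purchase (0), high ad spend -> purchase (1)
data = [(10, 0), (12, 0), (14, 0), (80, 1), (85, 1), (90, 1)]
predict = random_forest(data)
print(predict(5), predict(95))
```

A real `RandomForestClassifier` adds two key refinements on top of this: full-depth trees instead of stumps, and a random subset of features (`max_features`) considered at each split.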
--------------------------------------------------------------------------------
/titanic-competition-submission.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"source":"","metadata":{},"cell_type":"markdown","outputs":[],"execution_count":0},{"cell_type":"markdown","id":"6bb1e547","metadata":{"papermill":{"duration":0.004232,"end_time":"2022-12-01T18:35:22.824111","exception":false,"start_time":"2022-12-01T18:35:22.819879","status":"completed"},"tags":[]},"source":["Titanic_competition_submission"]},{"cell_type":"code","execution_count":1,"id":"021a3c18","metadata":{"_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","execution":{"iopub.execute_input":"2022-12-01T18:35:22.832906Z","iopub.status.busy":"2022-12-01T18:35:22.832402Z","iopub.status.idle":"2022-12-01T18:35:22.847605Z","shell.execute_reply":"2022-12-01T18:35:22.846287Z"},"papermill":{"duration":0.023231,"end_time":"2022-12-01T18:35:22.850783","exception":false,"start_time":"2022-12-01T18:35:22.827552","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["/kaggle/input/titanic/train.csv\n","/kaggle/input/titanic/test.csv\n","/kaggle/input/titanic/gender_submission.csv\n"]}],"source":["# This Python 3 environment comes with many helpful analytics libraries installed\n","# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n","# For example, here's several helpful packages to load\n","\n","import numpy as np # linear algebra\n","import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n","\n","# Input data files are available in the read-only \"../input/\" directory\n","# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n","\n","import os\n","for dirname, _, filenames in os.walk('/kaggle/input'):\n"," for filename in filenames:\n"," print(os.path.join(dirname, filename))\n","\n","# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n","# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session"]},{"cell_type":"code","execution_count":2,"id":"c870dd3b","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:22.859571Z","iopub.status.busy":"2022-12-01T18:35:22.859159Z","iopub.status.idle":"2022-12-01T18:35:25.09791Z","shell.execute_reply":"2022-12-01T18:35:25.096837Z"},"papermill":{"duration":2.245887,"end_time":"2022-12-01T18:35:25.100641","exception":false,"start_time":"2022-12-01T18:35:22.854754","status":"completed"},"tags":[]},"outputs":[],"source":["# Data manipulation imports\n","import numpy as np\n","import pandas as pd\n","\n","# Visualization imports\n","import matplotlib.pyplot as plt\n","import plotly.express as px\n","\n","# Modeling imports\n","from sklearn.model_selection import train_test_split\n","from sklearn.impute import SimpleImputer\n","from sklearn.preprocessing import OneHotEncoder, StandardScaler\n","from sklearn.compose import ColumnTransformer\n","from sklearn.pipeline import Pipeline\n","from sklearn.neighbors import KNeighborsClassifier\n","from sklearn.metrics import accuracy_score, 
ConfusionMatrixDisplay"]},{"cell_type":"code","execution_count":3,"id":"f45b0f58","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.108876Z","iopub.status.busy":"2022-12-01T18:35:25.108461Z","iopub.status.idle":"2022-12-01T18:35:25.148511Z","shell.execute_reply":"2022-12-01T18:35:25.147343Z"},"papermill":{"duration":0.047113,"end_time":"2022-12-01T18:35:25.151031","exception":false,"start_time":"2022-12-01T18:35:25.103918","status":"completed"},"tags":[]},"outputs":[{"data":{"text/html":["
"],"text/plain":[" PassengerId Survived Pclass \\\n","0 1 0 3 \n","1 2 1 1 \n","2 3 1 3 \n","3 4 1 1 \n","4 5 0 3 \n","5 6 0 3 \n","6 7 0 1 \n","7 8 0 3 \n","8 9 1 3 \n","9 10 1 2 \n","10 11 1 3 \n","\n"," Name Sex Age SibSp \\\n","0 Braund, Mr. Owen Harris male 22.0 1 \n","1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n","2 Heikkinen, Miss. Laina female 26.0 0 \n","3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n","4 Allen, Mr. William Henry male 35.0 0 \n","5 Moran, Mr. James male NaN 0 \n","6 McCarthy, Mr. Timothy J male 54.0 0 \n","7 Palsson, Master. Gosta Leonard male 2.0 3 \n","8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n","9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n","10 Sandstrom, Miss. Marguerite Rut female 4.0 1 \n","\n"," Parch Ticket Fare Cabin Embarked \n","0 0 A/5 21171 7.2500 NaN S \n","1 0 PC 17599 71.2833 C85 C \n","2 0 STON/O2. 3101282 7.9250 NaN S \n","3 0 113803 53.1000 C123 S \n","4 0 373450 8.0500 NaN S \n","5 0 330877 8.4583 NaN Q \n","6 0 17463 51.8625 E46 S \n","7 1 349909 21.0750 NaN S \n","8 2 347742 11.1333 NaN S \n","9 0 237736 30.0708 NaN C \n","10 1 PP 9549 16.7000 G6 S "]},"execution_count":3,"metadata":{},"output_type":"execute_result"}],"source":["#loading train_data\n","titanic_data = pd.read_csv(\"/kaggle/input/titanic/train.csv\")\n","titanic_data.head(11)"]},{"cell_type":"code","execution_count":4,"id":"17b6198f","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.159704Z","iopub.status.busy":"2022-12-01T18:35:25.159297Z","iopub.status.idle":"2022-12-01T18:35:25.182487Z","shell.execute_reply":"2022-12-01T18:35:25.181426Z"},"papermill":{"duration":0.030802,"end_time":"2022-12-01T18:35:25.185425","exception":false,"start_time":"2022-12-01T18:35:25.154623","status":"completed"},"tags":[]},"outputs":[{"data":{"text/plain":["( PassengerId Pclass Name Sex \\\n"," 0 892 3 Kelly, Mr. James male \n"," 1 893 3 Wilkes, Mrs. 
James (Ellen Needs) female \n"," 2 894 2 Myles, Mr. Thomas Francis male \n"," 3 895 3 Wirz, Mr. Albert male \n"," 4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n"," \n"," Age SibSp Parch Ticket Fare Cabin Embarked \n"," 0 34.5 0 0 330911 7.8292 NaN Q \n"," 1 47.0 1 0 363272 7.0000 NaN S \n"," 2 62.0 0 0 240276 9.6875 NaN Q \n"," 3 27.0 0 0 315154 8.6625 NaN S \n"," 4 22.0 1 1 3101298 12.2875 NaN S ,\n"," (418, 11))"]},"execution_count":4,"metadata":{},"output_type":"execute_result"}],"source":["#loading test_data\n","titanic_test = pd.read_csv(\"/kaggle/input/titanic/test.csv\")\n","titanic_test.head(), titanic_test.shape"]},{"cell_type":"code","execution_count":5,"id":"f8b36fab","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.194472Z","iopub.status.busy":"2022-12-01T18:35:25.194064Z","iopub.status.idle":"2022-12-01T18:35:25.206425Z","shell.execute_reply":"2022-12-01T18:35:25.205321Z"},"papermill":{"duration":0.020541,"end_time":"2022-12-01T18:35:25.209739","exception":false,"start_time":"2022-12-01T18:35:25.189198","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["% of women who survived: 0.7420382165605095\n"]}],"source":["#Checking pattern for women\n","women = titanic_data.loc[titanic_data.Sex == 'female'][\"Survived\"]\n","rate_women = sum(women)/len(women)\n","\n","print(\"% of women who survived:\", rate_women)"]},{"cell_type":"code","execution_count":6,"id":"218d65df","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.219923Z","iopub.status.busy":"2022-12-01T18:35:25.219493Z","iopub.status.idle":"2022-12-01T18:35:25.227441Z","shell.execute_reply":"2022-12-01T18:35:25.226154Z"},"papermill":{"duration":0.015976,"end_time":"2022-12-01T18:35:25.230242","exception":false,"start_time":"2022-12-01T18:35:25.214266","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["% of men who survived: 
0.18890814558058924\n"]}],"source":["#Checking pattern for men\n","men = titanic_data.loc[titanic_data.Sex == 'male'][\"Survived\"]\n","rate_men = sum(men)/len(men)\n","\n","print(\"% of men who survived:\", rate_men)"]},{"cell_type":"code","execution_count":7,"id":"6a6da08e","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.239736Z","iopub.status.busy":"2022-12-01T18:35:25.239303Z","iopub.status.idle":"2022-12-01T18:35:25.52152Z","shell.execute_reply":"2022-12-01T18:35:25.520193Z"},"papermill":{"duration":0.289828,"end_time":"2022-12-01T18:35:25.523961","exception":false,"start_time":"2022-12-01T18:35:25.234133","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["Your submission was successfully saved!\n"]}],"source":["from sklearn.ensemble import RandomForestClassifier\n","\n","y = titanic_data[\"Survived\"]\n","\n","features = [\"Pclass\", \"Sex\", \"SibSp\", \"Parch\"]\n","X = pd.get_dummies(titanic_data[features])\n","X_test = pd.get_dummies(titanic_test[features])\n","\n","model = RandomForestClassifier(n_estimators=101, max_depth=5, random_state=123)\n","model.fit(X, y)\n","predictions = model.predict(X_test)\n","\n","output = pd.DataFrame({'PassengerId': titanic_test.PassengerId, 'Survived': predictions})\n","output.to_csv('submission.csv', index=False)\n","print(\"Your submission was successfully saved!\")"]},{"cell_type":"code","execution_count":null,"id":"223bb683","metadata":{"papermill":{"duration":0.003424,"end_time":"2022-12-01T18:35:25.531277","exception":false,"start_time":"2022-12-01T18:35:25.527853","status":"completed"},"tags":[]},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"Python 
3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.12"},"papermill":{"default_parameters":{},"duration":13.313314,"end_time":"2022-12-01T18:35:26.359542","environment_variables":{},"exception":null,"input_path":"__notebook__.ipynb","output_path":"__notebook__.ipynb","parameters":{},"start_time":"2022-12-01T18:35:13.046228","version":"2.3.4"}},"nbformat":4,"nbformat_minor":5}
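The gender baseline in the notebook above filters with pandas `.loc` and divides `sum` by `len`. The same per-group survival-rate computation can be sketched without pandas (the records below are made-up toy rows, not the Kaggle data):

```python
# Hypothetical miniature of the Titanic data: one dict per passenger.
records = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
    {"Sex": "male", "Survived": 0},
]

def survival_rate(rows, sex):
    """Fraction of passengers of the given sex who survived."""
    outcomes = [r["Survived"] for r in rows if r["Sex"] == sex]
    return sum(outcomes) / len(outcomes)

print("% of women who survived:", survival_rate(records, "female"))
print("% of men who survived:", survival_rate(records, "male"))
```

Because `Survived` is 0/1, the mean of the column is exactly the survival rate, which is all the pandas version computes as well.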
--------------------------------------------------------------------------------