├── .gitattributes
├── .gitignore
├── 1.1-imdb-datasets.ipynb
├── 1.2-scraping-imdb.ipynb
├── 1.3-data-merge-clean-encode.ipynb
├── 2.1-eda.ipynb
├── 2.2-data-preprocessing.ipynb
├── 3.1-modeling.ipynb
├── 3.2-best-model.ipynb
├── 3.3-wordclouds.ipynb
├── data
    ├── clean_df.tsv
    ├── encoded_genres.tsv
    └── imdb_movie_list.csv
├── demo
    ├── models
    │   ├── my_best_model.pkl
    │   ├── my_best_scaler.pkl
    │   └── my_best_tfidf.pkl
    ├── predict.py
    ├── static
    │   ├── css
    │   │   └── styles.css
    │   └── images
    │   │   ├── Action.png
    │   │   ├── Adventure.png
    │   │   ├── Animation.png
    │   │   ├── Biography.png
    │   │   ├── Comedy.png
    │   │   ├── Crime.png
    │   │   ├── Documentary.png
    │   │   ├── Drama.png
    │   │   ├── Family.png
    │   │   ├── Fantasy.png
    │   │   ├── Film-noir.png
    │   │   ├── History.png
    │   │   ├── Horror.png
    │   │   ├── Music.png
    │   │   ├── Musical.png
    │   │   ├── Mystery.png
    │   │   ├── Romance.png
    │   │   ├── Sci-fi.png
    │   │   ├── Sport.png
    │   │   ├── Thriller.png
    │   │   ├── War.png
    │   │   ├── Western.png
    │   │   └── magic-lamp.png
    ├── templates
    │   └── predict.html
    └── train_medians.csv
├── images
    ├── app.png
    ├── app2.png
    ├── genre-counts-graph.png
    ├── imdb-bottom.png
    ├── imdb-top.png
    ├── results-graph.png
    └── wc_img.png
├── models
    ├── my_1vr_logreg_0.01.pkl
    ├── my_1vr_logreg_default.pkl
    ├── my_best_model.pkl
    ├── my_best_scaler.pkl
    ├── my_best_tfidf.pkl
    ├── my_minmax_scaler.pkl
    ├── my_standard_scaler.pkl
    └── my_tfidf_min20.pkl
└── readme.md


/.gitattributes:
--------------------------------------------------------------------------------
1 | *.tsv filter=lfs diff=lfs merge=lfs -text


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .tsv


--------------------------------------------------------------------------------
/1.1-imdb-datasets.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "## Genre Genie - Multi-label Classification with NLP\n",
   8 |     "### Part 1.1: IMDb dataset\n",
   9 |     "\n",
  10 |     "#### Tom Keith\n",
  11 |     "\n",
  12 |     "---\n",
  13 |     "\n",
  14 |     "**Goal:** Explore IMDb datasets."
  15 |    ]
  16 |   },
  17 |   {
  18 |    "cell_type": "markdown",
  19 |    "metadata": {},
  20 |    "source": [
  21 |     "IMDb offers datasets with loads of information. I explored these sets to see what I could use while still figuring out what direction to go with this project.\n",
  22 |     "\n",
  23 |     "More information on these datasets and the features they have can be found here: https://www.imdb.com/interfaces/\n",
  24 |     "\n",
  25 |     "I will be using the direct links for the compressed `.tsv` files: https://datasets.imdbws.com/\n",
  26 |     "\n",
  27 |     "---"
  28 |    ]
  29 |   },
  30 |   {
  31 |    "cell_type": "code",
  32 |    "execution_count": 48,
  33 |    "metadata": {},
  34 |    "outputs": [],
  35 |    "source": [
  36 |     "import pandas as pd\n",
  37 |     "import numpy as np\n",
  38 |     "import matplotlib.pyplot as plt\n",
  39 |     "from PIL import Image\n",
  40 |     "\n",
  41 |     "pd.set_option('display.max_rows', 200)"
  42 |    ]
  43 |   },
  44 |   {
  45 |    "cell_type": "markdown",
  46 |    "metadata": {},
  47 |    "source": [
  48 |     "---\n",
  49 |     "**Exploring IMDb Datasets**\n",
  50 |     "\n",
  51 |     "Loop through to peek at what each dataset looks like. While these files are updated daily, the data used throughout this project was fetched on February 2, 2020.\n",
  52 |     "\n",
  53 |     "An important note is that `NULL` values are represented as `\\N` in these sets."
  54 |    ]
  55 |   },
  56 |   {
  57 |    "cell_type": "code",
  58 |    "execution_count": 47,
  59 |    "metadata": {
  60 |     "scrolled": true
  61 |    },
  62 |    "outputs": [
  63 |     {
  64 |      "data": {
  65 |       "text/html": [
  66 |        "<div>\n",
  67 |        "<style scoped>\n",
  68 |        "    .dataframe tbody tr th:only-of-type {\n",
  69 |        "        vertical-align: middle;\n",
  70 |        "    }\n",
  71 |        "\n",
  72 |        "    .dataframe tbody tr th {\n",
  73 |        "        vertical-align: top;\n",
  74 |        "    }\n",
  75 |        "\n",
  76 |        "    .dataframe thead th {\n",
  77 |        "        text-align: right;\n",
  78 |        "    }\n",
  79 |        "</style>\n",
  80 |        "<table border=\"1\" class=\"dataframe\">\n",
  81 |        "  <thead>\n",
  82 |        "    <tr style=\"text-align: right;\">\n",
  83 |        "      <th></th>\n",
  84 |        "      <th>tconst</th>\n",
  85 |        "      <th>titleType</th>\n",
  86 |        "      <th>primaryTitle</th>\n",
  87 |        "      <th>originalTitle</th>\n",
  88 |        "      <th>isAdult</th>\n",
  89 |        "      <th>startYear</th>\n",
  90 |        "      <th>endYear</th>\n",
  91 |        "      <th>runtimeMinutes</th>\n",
  92 |        "      <th>genres</th>\n",
  93 |        "    </tr>\n",
  94 |        "  </thead>\n",
  95 |        "  <tbody>\n",
  96 |        "    <tr>\n",
  97 |        "      <td>0</td>\n",
  98 |        "      <td>tt0000001</td>\n",
  99 |        "      <td>short</td>\n",
 100 |        "      <td>Carmencita</td>\n",
 101 |        "      <td>Carmencita</td>\n",
 102 |        "      <td>0</td>\n",
 103 |        "      <td>1894</td>\n",
 104 |        "      <td>\\N</td>\n",
 105 |        "      <td>1</td>\n",
 106 |        "      <td>Documentary,Short</td>\n",
 107 |        "    </tr>\n",
 108 |        "    <tr>\n",
 109 |        "      <td>1</td>\n",
 110 |        "      <td>tt0000002</td>\n",
 111 |        "      <td>short</td>\n",
 112 |        "      <td>Le clown et ses chiens</td>\n",
 113 |        "      <td>Le clown et ses chiens</td>\n",
 114 |        "      <td>0</td>\n",
 115 |        "      <td>1892</td>\n",
 116 |        "      <td>\\N</td>\n",
 117 |        "      <td>5</td>\n",
 118 |        "      <td>Animation,Short</td>\n",
 119 |        "    </tr>\n",
 120 |        "    <tr>\n",
 121 |        "      <td>2</td>\n",
 122 |        "      <td>tt0000003</td>\n",
 123 |        "      <td>short</td>\n",
 124 |        "      <td>Pauvre Pierrot</td>\n",
 125 |        "      <td>Pauvre Pierrot</td>\n",
 126 |        "      <td>0</td>\n",
 127 |        "      <td>1892</td>\n",
 128 |        "      <td>\\N</td>\n",
 129 |        "      <td>4</td>\n",
 130 |        "      <td>Animation,Comedy,Romance</td>\n",
 131 |        "    </tr>\n",
 132 |        "    <tr>\n",
 133 |        "      <td>3</td>\n",
 134 |        "      <td>tt0000004</td>\n",
 135 |        "      <td>short</td>\n",
 136 |        "      <td>Un bon bock</td>\n",
 137 |        "      <td>Un bon bock</td>\n",
 138 |        "      <td>0</td>\n",
 139 |        "      <td>1892</td>\n",
 140 |        "      <td>\\N</td>\n",
 141 |        "      <td>12</td>\n",
 142 |        "      <td>Animation,Short</td>\n",
 143 |        "    </tr>\n",
 144 |        "    <tr>\n",
 145 |        "      <td>4</td>\n",
 146 |        "      <td>tt0000005</td>\n",
 147 |        "      <td>short</td>\n",
 148 |        "      <td>Blacksmith Scene</td>\n",
 149 |        "      <td>Blacksmith Scene</td>\n",
 150 |        "      <td>0</td>\n",
 151 |        "      <td>1893</td>\n",
 152 |        "      <td>\\N</td>\n",
 153 |        "      <td>1</td>\n",
 154 |        "      <td>Comedy,Short</td>\n",
 155 |        "    </tr>\n",
 156 |        "    <tr>\n",
 157 |        "      <td>...</td>\n",
 158 |        "      <td>...</td>\n",
 159 |        "      <td>...</td>\n",
 160 |        "      <td>...</td>\n",
 161 |        "      <td>...</td>\n",
 162 |        "      <td>...</td>\n",
 163 |        "      <td>...</td>\n",
 164 |        "      <td>...</td>\n",
 165 |        "      <td>...</td>\n",
 166 |        "      <td>...</td>\n",
 167 |        "    </tr>\n",
 168 |        "    <tr>\n",
 169 |        "      <td>6672689</td>\n",
 170 |        "      <td>tt9916848</td>\n",
 171 |        "      <td>tvEpisode</td>\n",
 172 |        "      <td>Episode #3.17</td>\n",
 173 |        "      <td>Episode #3.17</td>\n",
 174 |        "      <td>0</td>\n",
 175 |        "      <td>2010</td>\n",
 176 |        "      <td>\\N</td>\n",
 177 |        "      <td>\\N</td>\n",
 178 |        "      <td>Action,Drama,Family</td>\n",
 179 |        "    </tr>\n",
 180 |        "    <tr>\n",
 181 |        "      <td>6672690</td>\n",
 182 |        "      <td>tt9916850</td>\n",
 183 |        "      <td>tvEpisode</td>\n",
 184 |        "      <td>Episode #3.19</td>\n",
 185 |        "      <td>Episode #3.19</td>\n",
 186 |        "      <td>0</td>\n",
 187 |        "      <td>2010</td>\n",
 188 |        "      <td>\\N</td>\n",
 189 |        "      <td>\\N</td>\n",
 190 |        "      <td>Action,Drama,Family</td>\n",
 191 |        "    </tr>\n",
 192 |        "    <tr>\n",
 193 |        "      <td>6672691</td>\n",
 194 |        "      <td>tt9916852</td>\n",
 195 |        "      <td>tvEpisode</td>\n",
 196 |        "      <td>Episode #3.20</td>\n",
 197 |        "      <td>Episode #3.20</td>\n",
 198 |        "      <td>0</td>\n",
 199 |        "      <td>2010</td>\n",
 200 |        "      <td>\\N</td>\n",
 201 |        "      <td>\\N</td>\n",
 202 |        "      <td>Action,Drama,Family</td>\n",
 203 |        "    </tr>\n",
 204 |        "    <tr>\n",
 205 |        "      <td>6672692</td>\n",
 206 |        "      <td>tt9916856</td>\n",
 207 |        "      <td>short</td>\n",
 208 |        "      <td>The Wind</td>\n",
 209 |        "      <td>The Wind</td>\n",
 210 |        "      <td>0</td>\n",
 211 |        "      <td>2015</td>\n",
 212 |        "      <td>\\N</td>\n",
 213 |        "      <td>27</td>\n",
 214 |        "      <td>Short</td>\n",
 215 |        "    </tr>\n",
 216 |        "    <tr>\n",
 217 |        "      <td>6672693</td>\n",
 218 |        "      <td>tt9916880</td>\n",
 219 |        "      <td>tvEpisode</td>\n",
 220 |        "      <td>Horrid Henry Knows It All</td>\n",
 221 |        "      <td>Horrid Henry Knows It All</td>\n",
 222 |        "      <td>0</td>\n",
 223 |        "      <td>2014</td>\n",
 224 |        "      <td>\\N</td>\n",
 225 |        "      <td>10</td>\n",
 226 |        "      <td>Animation,Comedy,Family</td>\n",
 227 |        "    </tr>\n",
 228 |        "  </tbody>\n",
 229 |        "</table>\n",
 230 |        "<p>6672694 rows × 9 columns</p>\n",
 231 |        "</div>"
 232 |       ],
 233 |       "text/plain": [
 234 |        "            tconst  titleType               primaryTitle  \\\n",
 235 |        "0        tt0000001      short                 Carmencita   \n",
 236 |        "1        tt0000002      short     Le clown et ses chiens   \n",
 237 |        "2        tt0000003      short             Pauvre Pierrot   \n",
 238 |        "3        tt0000004      short                Un bon bock   \n",
 239 |        "4        tt0000005      short           Blacksmith Scene   \n",
 240 |        "...            ...        ...                        ...   \n",
 241 |        "6672689  tt9916848  tvEpisode              Episode #3.17   \n",
 242 |        "6672690  tt9916850  tvEpisode              Episode #3.19   \n",
 243 |        "6672691  tt9916852  tvEpisode              Episode #3.20   \n",
 244 |        "6672692  tt9916856      short                   The Wind   \n",
 245 |        "6672693  tt9916880  tvEpisode  Horrid Henry Knows It All   \n",
 246 |        "\n",
 247 |        "                     originalTitle  isAdult startYear endYear runtimeMinutes  \\\n",
 248 |        "0                       Carmencita        0      1894      \\N              1   \n",
 249 |        "1           Le clown et ses chiens        0      1892      \\N              5   \n",
 250 |        "2                   Pauvre Pierrot        0      1892      \\N              4   \n",
 251 |        "3                      Un bon bock        0      1892      \\N             12   \n",
 252 |        "4                 Blacksmith Scene        0      1893      \\N              1   \n",
 253 |        "...                            ...      ...       ...     ...            ...   \n",
 254 |        "6672689              Episode #3.17        0      2010      \\N             \\N   \n",
 255 |        "6672690              Episode #3.19        0      2010      \\N             \\N   \n",
 256 |        "6672691              Episode #3.20        0      2010      \\N             \\N   \n",
 257 |        "6672692                   The Wind        0      2015      \\N             27   \n",
 258 |        "6672693  Horrid Henry Knows It All        0      2014      \\N             10   \n",
 259 |        "\n",
 260 |        "                           genres  \n",
 261 |        "0               Documentary,Short  \n",
 262 |        "1                 Animation,Short  \n",
 263 |        "2        Animation,Comedy,Romance  \n",
 264 |        "3                 Animation,Short  \n",
 265 |        "4                    Comedy,Short  \n",
 266 |        "...                           ...  \n",
 267 |        "6672689       Action,Drama,Family  \n",
 268 |        "6672690       Action,Drama,Family  \n",
 269 |        "6672691       Action,Drama,Family  \n",
 270 |        "6672692                     Short  \n",
 271 |        "6672693   Animation,Comedy,Family  \n",
 272 |        "\n",
 273 |        "[6672694 rows x 9 columns]"
 274 |       ]
 275 |      },
 276 |      "metadata": {},
 277 |      "output_type": "display_data"
 278 |     },
 279 |     {
 280 |      "data": {
 281 |       "text/html": [
 282 |        "<div>\n",
 283 |        "<style scoped>\n",
 284 |        "    .dataframe tbody tr th:only-of-type {\n",
 285 |        "        vertical-align: middle;\n",
 286 |        "    }\n",
 287 |        "\n",
 288 |        "    .dataframe tbody tr th {\n",
 289 |        "        vertical-align: top;\n",
 290 |        "    }\n",
 291 |        "\n",
 292 |        "    .dataframe thead th {\n",
 293 |        "        text-align: right;\n",
 294 |        "    }\n",
 295 |        "</style>\n",
 296 |        "<table border=\"1\" class=\"dataframe\">\n",
 297 |        "  <thead>\n",
 298 |        "    <tr style=\"text-align: right;\">\n",
 299 |        "      <th></th>\n",
 300 |        "      <th>tconst</th>\n",
 301 |        "      <th>averageRating</th>\n",
 302 |        "      <th>numVotes</th>\n",
 303 |        "    </tr>\n",
 304 |        "  </thead>\n",
 305 |        "  <tbody>\n",
 306 |        "    <tr>\n",
 307 |        "      <td>0</td>\n",
 308 |        "      <td>tt0000001</td>\n",
 309 |        "      <td>5.6</td>\n",
 310 |        "      <td>1591</td>\n",
 311 |        "    </tr>\n",
 312 |        "    <tr>\n",
 313 |        "      <td>1</td>\n",
 314 |        "      <td>tt0000002</td>\n",
 315 |        "      <td>6.1</td>\n",
 316 |        "      <td>194</td>\n",
 317 |        "    </tr>\n",
 318 |        "    <tr>\n",
 319 |        "      <td>2</td>\n",
 320 |        "      <td>tt0000003</td>\n",
 321 |        "      <td>6.5</td>\n",
 322 |        "      <td>1264</td>\n",
 323 |        "    </tr>\n",
 324 |        "    <tr>\n",
 325 |        "      <td>3</td>\n",
 326 |        "      <td>tt0000004</td>\n",
 327 |        "      <td>6.2</td>\n",
 328 |        "      <td>120</td>\n",
 329 |        "    </tr>\n",
 330 |        "    <tr>\n",
 331 |        "      <td>4</td>\n",
 332 |        "      <td>tt0000005</td>\n",
 333 |        "      <td>6.1</td>\n",
 334 |        "      <td>2025</td>\n",
 335 |        "    </tr>\n",
 336 |        "    <tr>\n",
 337 |        "      <td>...</td>\n",
 338 |        "      <td>...</td>\n",
 339 |        "      <td>...</td>\n",
 340 |        "      <td>...</td>\n",
 341 |        "    </tr>\n",
 342 |        "    <tr>\n",
 343 |        "      <td>1019001</td>\n",
 344 |        "      <td>tt9916576</td>\n",
 345 |        "      <td>6.0</td>\n",
 346 |        "      <td>9</td>\n",
 347 |        "    </tr>\n",
 348 |        "    <tr>\n",
 349 |        "      <td>1019002</td>\n",
 350 |        "      <td>tt9916578</td>\n",
 351 |        "      <td>8.5</td>\n",
 352 |        "      <td>16</td>\n",
 353 |        "    </tr>\n",
 354 |        "    <tr>\n",
 355 |        "      <td>1019003</td>\n",
 356 |        "      <td>tt9916720</td>\n",
 357 |        "      <td>5.5</td>\n",
 358 |        "      <td>48</td>\n",
 359 |        "    </tr>\n",
 360 |        "    <tr>\n",
 361 |        "      <td>1019004</td>\n",
 362 |        "      <td>tt9916766</td>\n",
 363 |        "      <td>6.8</td>\n",
 364 |        "      <td>13</td>\n",
 365 |        "    </tr>\n",
 366 |        "    <tr>\n",
 367 |        "      <td>1019005</td>\n",
 368 |        "      <td>tt9916778</td>\n",
 369 |        "      <td>7.2</td>\n",
 370 |        "      <td>20</td>\n",
 371 |        "    </tr>\n",
 372 |        "  </tbody>\n",
 373 |        "</table>\n",
 374 |        "<p>1019006 rows × 3 columns</p>\n",
 375 |        "</div>"
 376 |       ],
 377 |       "text/plain": [
 378 |        "            tconst  averageRating  numVotes\n",
 379 |        "0        tt0000001            5.6      1591\n",
 380 |        "1        tt0000002            6.1       194\n",
 381 |        "2        tt0000003            6.5      1264\n",
 382 |        "3        tt0000004            6.2       120\n",
 383 |        "4        tt0000005            6.1      2025\n",
 384 |        "...            ...            ...       ...\n",
 385 |        "1019001  tt9916576            6.0         9\n",
 386 |        "1019002  tt9916578            8.5        16\n",
 387 |        "1019003  tt9916720            5.5        48\n",
 388 |        "1019004  tt9916766            6.8        13\n",
 389 |        "1019005  tt9916778            7.2        20\n",
 390 |        "\n",
 391 |        "[1019006 rows x 3 columns]"
 392 |       ]
 393 |      },
 394 |      "metadata": {},
 395 |      "output_type": "display_data"
 396 |     },
 397 |     {
 398 |      "data": {
 399 |       "text/html": [
 400 |        "<div>\n",
 401 |        "<style scoped>\n",
 402 |        "    .dataframe tbody tr th:only-of-type {\n",
 403 |        "        vertical-align: middle;\n",
 404 |        "    }\n",
 405 |        "\n",
 406 |        "    .dataframe tbody tr th {\n",
 407 |        "        vertical-align: top;\n",
 408 |        "    }\n",
 409 |        "\n",
 410 |        "    .dataframe thead th {\n",
 411 |        "        text-align: right;\n",
 412 |        "    }\n",
 413 |        "</style>\n",
 414 |        "<table border=\"1\" class=\"dataframe\">\n",
 415 |        "  <thead>\n",
 416 |        "    <tr style=\"text-align: right;\">\n",
 417 |        "      <th></th>\n",
 418 |        "      <th>nconst</th>\n",
 419 |        "      <th>primaryName</th>\n",
 420 |        "      <th>birthYear</th>\n",
 421 |        "      <th>deathYear</th>\n",
 422 |        "      <th>primaryProfession</th>\n",
 423 |        "      <th>knownForTitles</th>\n",
 424 |        "    </tr>\n",
 425 |        "  </thead>\n",
 426 |        "  <tbody>\n",
 427 |        "    <tr>\n",
 428 |        "      <td>0</td>\n",
 429 |        "      <td>nm0000001</td>\n",
 430 |        "      <td>Fred Astaire</td>\n",
 431 |        "      <td>1899</td>\n",
 432 |        "      <td>1987</td>\n",
 433 |        "      <td>soundtrack,actor,miscellaneous</td>\n",
 434 |        "      <td>tt0050419,tt0072308,tt0053137,tt0043044</td>\n",
 435 |        "    </tr>\n",
 436 |        "    <tr>\n",
 437 |        "      <td>1</td>\n",
 438 |        "      <td>nm0000002</td>\n",
 439 |        "      <td>Lauren Bacall</td>\n",
 440 |        "      <td>1924</td>\n",
 441 |        "      <td>2014</td>\n",
 442 |        "      <td>actress,soundtrack</td>\n",
 443 |        "      <td>tt0117057,tt0038355,tt0037382,tt0071877</td>\n",
 444 |        "    </tr>\n",
 445 |        "    <tr>\n",
 446 |        "      <td>2</td>\n",
 447 |        "      <td>nm0000003</td>\n",
 448 |        "      <td>Brigitte Bardot</td>\n",
 449 |        "      <td>1934</td>\n",
 450 |        "      <td>\\N</td>\n",
 451 |        "      <td>actress,soundtrack,producer</td>\n",
 452 |        "      <td>tt0054452,tt0057345,tt0059956,tt0049189</td>\n",
 453 |        "    </tr>\n",
 454 |        "    <tr>\n",
 455 |        "      <td>3</td>\n",
 456 |        "      <td>nm0000004</td>\n",
 457 |        "      <td>John Belushi</td>\n",
 458 |        "      <td>1949</td>\n",
 459 |        "      <td>1982</td>\n",
 460 |        "      <td>actor,soundtrack,writer</td>\n",
 461 |        "      <td>tt0072562,tt0077975,tt0080455,tt0078723</td>\n",
 462 |        "    </tr>\n",
 463 |        "    <tr>\n",
 464 |        "      <td>4</td>\n",
 465 |        "      <td>nm0000005</td>\n",
 466 |        "      <td>Ingmar Bergman</td>\n",
 467 |        "      <td>1918</td>\n",
 468 |        "      <td>2007</td>\n",
 469 |        "      <td>writer,director,actor</td>\n",
 470 |        "      <td>tt0069467,tt0050986,tt0050976,tt0083922</td>\n",
 471 |        "    </tr>\n",
 472 |        "    <tr>\n",
 473 |        "      <td>...</td>\n",
 474 |        "      <td>...</td>\n",
 475 |        "      <td>...</td>\n",
 476 |        "      <td>...</td>\n",
 477 |        "      <td>...</td>\n",
 478 |        "      <td>...</td>\n",
 479 |        "      <td>...</td>\n",
 480 |        "    </tr>\n",
 481 |        "    <tr>\n",
 482 |        "      <td>9982871</td>\n",
 483 |        "      <td>nm9993714</td>\n",
 484 |        "      <td>Romeo del Rosario</td>\n",
 485 |        "      <td>\\N</td>\n",
 486 |        "      <td>\\N</td>\n",
 487 |        "      <td>animation_department,art_department</td>\n",
 488 |        "      <td>tt2455546</td>\n",
 489 |        "    </tr>\n",
 490 |        "    <tr>\n",
 491 |        "      <td>9982872</td>\n",
 492 |        "      <td>nm9993716</td>\n",
 493 |        "      <td>Essias Loberg</td>\n",
 494 |        "      <td>\\N</td>\n",
 495 |        "      <td>\\N</td>\n",
 496 |        "      <td>NaN</td>\n",
 497 |        "      <td>\\N</td>\n",
 498 |        "    </tr>\n",
 499 |        "    <tr>\n",
 500 |        "      <td>9982873</td>\n",
 501 |        "      <td>nm9993717</td>\n",
 502 |        "      <td>Harikrishnan Rajan</td>\n",
 503 |        "      <td>\\N</td>\n",
 504 |        "      <td>\\N</td>\n",
 505 |        "      <td>cinematographer</td>\n",
 506 |        "      <td>tt8736744</td>\n",
 507 |        "    </tr>\n",
 508 |        "    <tr>\n",
 509 |        "      <td>9982874</td>\n",
 510 |        "      <td>nm9993718</td>\n",
 511 |        "      <td>Aayush Nair</td>\n",
 512 |        "      <td>\\N</td>\n",
 513 |        "      <td>\\N</td>\n",
 514 |        "      <td>cinematographer</td>\n",
 515 |        "      <td>\\N</td>\n",
 516 |        "    </tr>\n",
 517 |        "    <tr>\n",
 518 |        "      <td>9982875</td>\n",
 519 |        "      <td>nm9993719</td>\n",
 520 |        "      <td>Andre Hill</td>\n",
 521 |        "      <td>\\N</td>\n",
 522 |        "      <td>\\N</td>\n",
 523 |        "      <td>NaN</td>\n",
 524 |        "      <td>\\N</td>\n",
 525 |        "    </tr>\n",
 526 |        "  </tbody>\n",
 527 |        "</table>\n",
 528 |        "<p>9982876 rows × 6 columns</p>\n",
 529 |        "</div>"
 530 |       ],
 531 |       "text/plain": [
 532 |        "            nconst         primaryName birthYear deathYear  \\\n",
 533 |        "0        nm0000001        Fred Astaire      1899      1987   \n",
 534 |        "1        nm0000002       Lauren Bacall      1924      2014   \n",
 535 |        "2        nm0000003     Brigitte Bardot      1934        \\N   \n",
 536 |        "3        nm0000004        John Belushi      1949      1982   \n",
 537 |        "4        nm0000005      Ingmar Bergman      1918      2007   \n",
 538 |        "...            ...                 ...       ...       ...   \n",
 539 |        "9982871  nm9993714   Romeo del Rosario        \\N        \\N   \n",
 540 |        "9982872  nm9993716       Essias Loberg        \\N        \\N   \n",
 541 |        "9982873  nm9993717  Harikrishnan Rajan        \\N        \\N   \n",
 542 |        "9982874  nm9993718         Aayush Nair        \\N        \\N   \n",
 543 |        "9982875  nm9993719          Andre Hill        \\N        \\N   \n",
 544 |        "\n",
 545 |        "                           primaryProfession  \\\n",
 546 |        "0             soundtrack,actor,miscellaneous   \n",
 547 |        "1                         actress,soundtrack   \n",
 548 |        "2                actress,soundtrack,producer   \n",
 549 |        "3                    actor,soundtrack,writer   \n",
 550 |        "4                      writer,director,actor   \n",
 551 |        "...                                      ...   \n",
 552 |        "9982871  animation_department,art_department   \n",
 553 |        "9982872                                  NaN   \n",
 554 |        "9982873                      cinematographer   \n",
 555 |        "9982874                      cinematographer   \n",
 556 |        "9982875                                  NaN   \n",
 557 |        "\n",
 558 |        "                                  knownForTitles  \n",
 559 |        "0        tt0050419,tt0072308,tt0053137,tt0043044  \n",
 560 |        "1        tt0117057,tt0038355,tt0037382,tt0071877  \n",
 561 |        "2        tt0054452,tt0057345,tt0059956,tt0049189  \n",
 562 |        "3        tt0072562,tt0077975,tt0080455,tt0078723  \n",
 563 |        "4        tt0069467,tt0050986,tt0050976,tt0083922  \n",
 564 |        "...                                          ...  \n",
 565 |        "9982871                                tt2455546  \n",
 566 |        "9982872                                       \\N  \n",
 567 |        "9982873                                tt8736744  \n",
 568 |        "9982874                                       \\N  \n",
 569 |        "9982875                                       \\N  \n",
 570 |        "\n",
 571 |        "[9982876 rows x 6 columns]"
 572 |       ]
 573 |      },
 574 |      "metadata": {},
 575 |      "output_type": "display_data"
 576 |     },
 577 |     {
 578 |      "data": {
 579 |       "text/html": [
 580 |        "<div>\n",
 581 |        "<style scoped>\n",
 582 |        "    .dataframe tbody tr th:only-of-type {\n",
 583 |        "        vertical-align: middle;\n",
 584 |        "    }\n",
 585 |        "\n",
 586 |        "    .dataframe tbody tr th {\n",
 587 |        "        vertical-align: top;\n",
 588 |        "    }\n",
 589 |        "\n",
 590 |        "    .dataframe thead th {\n",
 591 |        "        text-align: right;\n",
 592 |        "    }\n",
 593 |        "</style>\n",
 594 |        "<table border=\"1\" class=\"dataframe\">\n",
 595 |        "  <thead>\n",
 596 |        "    <tr style=\"text-align: right;\">\n",
 597 |        "      <th></th>\n",
 598 |        "      <th>tconst</th>\n",
 599 |        "      <th>ordering</th>\n",
 600 |        "      <th>nconst</th>\n",
 601 |        "      <th>category</th>\n",
 602 |        "      <th>job</th>\n",
 603 |        "      <th>characters</th>\n",
 604 |        "    </tr>\n",
 605 |        "  </thead>\n",
 606 |        "  <tbody>\n",
 607 |        "    <tr>\n",
 608 |        "      <td>0</td>\n",
 609 |        "      <td>tt0000001</td>\n",
 610 |        "      <td>1</td>\n",
 611 |        "      <td>nm1588970</td>\n",
 612 |        "      <td>self</td>\n",
 613 |        "      <td>\\N</td>\n",
 614 |        "      <td>[\"Self\"]</td>\n",
 615 |        "    </tr>\n",
 616 |        "    <tr>\n",
 617 |        "      <td>1</td>\n",
 618 |        "      <td>tt0000001</td>\n",
 619 |        "      <td>2</td>\n",
 620 |        "      <td>nm0005690</td>\n",
 621 |        "      <td>director</td>\n",
 622 |        "      <td>\\N</td>\n",
 623 |        "      <td>\\N</td>\n",
 624 |        "    </tr>\n",
 625 |        "    <tr>\n",
 626 |        "      <td>2</td>\n",
 627 |        "      <td>tt0000001</td>\n",
 628 |        "      <td>3</td>\n",
 629 |        "      <td>nm0374658</td>\n",
 630 |        "      <td>cinematographer</td>\n",
 631 |        "      <td>director of photography</td>\n",
 632 |        "      <td>\\N</td>\n",
 633 |        "    </tr>\n",
 634 |        "    <tr>\n",
 635 |        "      <td>3</td>\n",
 636 |        "      <td>tt0000002</td>\n",
 637 |        "      <td>1</td>\n",
 638 |        "      <td>nm0721526</td>\n",
 639 |        "      <td>director</td>\n",
 640 |        "      <td>\\N</td>\n",
 641 |        "      <td>\\N</td>\n",
 642 |        "    </tr>\n",
 643 |        "    <tr>\n",
 644 |        "      <td>4</td>\n",
 645 |        "      <td>tt0000002</td>\n",
 646 |        "      <td>2</td>\n",
 647 |        "      <td>nm1335271</td>\n",
 648 |        "      <td>composer</td>\n",
 649 |        "      <td>\\N</td>\n",
 650 |        "      <td>\\N</td>\n",
 651 |        "    </tr>\n",
 652 |        "    <tr>\n",
 653 |        "      <td>...</td>\n",
 654 |        "      <td>...</td>\n",
 655 |        "      <td>...</td>\n",
 656 |        "      <td>...</td>\n",
 657 |        "      <td>...</td>\n",
 658 |        "      <td>...</td>\n",
 659 |        "      <td>...</td>\n",
 660 |        "    </tr>\n",
 661 |        "    <tr>\n",
 662 |        "      <td>38538893</td>\n",
 663 |        "      <td>tt9916880</td>\n",
 664 |        "      <td>5</td>\n",
 665 |        "      <td>nm0996406</td>\n",
 666 |        "      <td>director</td>\n",
 667 |        "      <td>principal director</td>\n",
 668 |        "      <td>\\N</td>\n",
 669 |        "    </tr>\n",
 670 |        "    <tr>\n",
 671 |        "      <td>38538894</td>\n",
 672 |        "      <td>tt9916880</td>\n",
 673 |        "      <td>6</td>\n",
 674 |        "      <td>nm1482639</td>\n",
 675 |        "      <td>writer</td>\n",
 676 |        "      <td>\\N</td>\n",
 677 |        "      <td>\\N</td>\n",
 678 |        "    </tr>\n",
 679 |        "    <tr>\n",
 680 |        "      <td>38538895</td>\n",
 681 |        "      <td>tt9916880</td>\n",
 682 |        "      <td>7</td>\n",
 683 |        "      <td>nm2586970</td>\n",
 684 |        "      <td>writer</td>\n",
 685 |        "      <td>books</td>\n",
 686 |        "      <td>\\N</td>\n",
 687 |        "    </tr>\n",
 688 |        "    <tr>\n",
 689 |        "      <td>38538896</td>\n",
 690 |        "      <td>tt9916880</td>\n",
 691 |        "      <td>8</td>\n",
 692 |        "      <td>nm1594058</td>\n",
 693 |        "      <td>producer</td>\n",
 694 |        "      <td>producer</td>\n",
 695 |        "      <td>\\N</td>\n",
 696 |        "    </tr>\n",
 697 |        "    <tr>\n",
 698 |        "      <td>38538897</td>\n",
 699 |        "      <td>tt9916880</td>\n",
 700 |        "      <td>9</td>\n",
 701 |        "      <td>nm2676923</td>\n",
 702 |        "      <td>actress</td>\n",
 703 |        "      <td>\\N</td>\n",
 704 |        "      <td>[\"Sour Susan\",\"Goody-Goody Gordon\",\"Singing So...</td>\n",
 705 |        "    </tr>\n",
 706 |        "  </tbody>\n",
 707 |        "</table>\n",
 708 |        "<p>38538898 rows × 6 columns</p>\n",
 709 |        "</div>"
 710 |       ],
 711 |       "text/plain": [
 712 |        "             tconst  ordering     nconst         category  \\\n",
 713 |        "0         tt0000001         1  nm1588970             self   \n",
 714 |        "1         tt0000001         2  nm0005690         director   \n",
 715 |        "2         tt0000001         3  nm0374658  cinematographer   \n",
 716 |        "3         tt0000002         1  nm0721526         director   \n",
 717 |        "4         tt0000002         2  nm1335271         composer   \n",
 718 |        "...             ...       ...        ...              ...   \n",
 719 |        "38538893  tt9916880         5  nm0996406         director   \n",
 720 |        "38538894  tt9916880         6  nm1482639           writer   \n",
 721 |        "38538895  tt9916880         7  nm2586970           writer   \n",
 722 |        "38538896  tt9916880         8  nm1594058         producer   \n",
 723 |        "38538897  tt9916880         9  nm2676923          actress   \n",
 724 |        "\n",
 725 |        "                              job  \\\n",
 726 |        "0                              \\N   \n",
 727 |        "1                              \\N   \n",
 728 |        "2         director of photography   \n",
 729 |        "3                              \\N   \n",
 730 |        "4                              \\N   \n",
 731 |        "...                           ...   \n",
 732 |        "38538893       principal director   \n",
 733 |        "38538894                       \\N   \n",
 734 |        "38538895                    books   \n",
 735 |        "38538896                 producer   \n",
 736 |        "38538897                       \\N   \n",
 737 |        "\n",
 738 |        "                                                 characters  \n",
 739 |        "0                                                  [\"Self\"]  \n",
 740 |        "1                                                        \\N  \n",
 741 |        "2                                                        \\N  \n",
 742 |        "3                                                        \\N  \n",
 743 |        "4                                                        \\N  \n",
 744 |        "...                                                     ...  \n",
 745 |        "38538893                                                 \\N  \n",
 746 |        "38538894                                                 \\N  \n",
 747 |        "38538895                                                 \\N  \n",
 748 |        "38538896                                                 \\N  \n",
 749 |        "38538897  [\"Sour Susan\",\"Goody-Goody Gordon\",\"Singing So...  \n",
 750 |        "\n",
 751 |        "[38538898 rows x 6 columns]"
 752 |       ]
 753 |      },
 754 |      "metadata": {},
 755 |      "output_type": "display_data"
 756 |     },
 757 |     {
 758 |      "data": {
 759 |       "text/html": [
 760 |        "<div>\n",
 761 |        "<style scoped>\n",
 762 |        "    .dataframe tbody tr th:only-of-type {\n",
 763 |        "        vertical-align: middle;\n",
 764 |        "    }\n",
 765 |        "\n",
 766 |        "    .dataframe tbody tr th {\n",
 767 |        "        vertical-align: top;\n",
 768 |        "    }\n",
 769 |        "\n",
 770 |        "    .dataframe thead th {\n",
 771 |        "        text-align: right;\n",
 772 |        "    }\n",
 773 |        "</style>\n",
 774 |        "<table border=\"1\" class=\"dataframe\">\n",
 775 |        "  <thead>\n",
 776 |        "    <tr style=\"text-align: right;\">\n",
 777 |        "      <th></th>\n",
 778 |        "      <th>tconst</th>\n",
 779 |        "      <th>directors</th>\n",
 780 |        "      <th>writers</th>\n",
 781 |        "    </tr>\n",
 782 |        "  </thead>\n",
 783 |        "  <tbody>\n",
 784 |        "    <tr>\n",
 785 |        "      <td>0</td>\n",
 786 |        "      <td>tt0000001</td>\n",
 787 |        "      <td>nm0005690</td>\n",
 788 |        "      <td>\\N</td>\n",
 789 |        "    </tr>\n",
 790 |        "    <tr>\n",
 791 |        "      <td>1</td>\n",
 792 |        "      <td>tt0000002</td>\n",
 793 |        "      <td>nm0721526</td>\n",
 794 |        "      <td>\\N</td>\n",
 795 |        "    </tr>\n",
 796 |        "    <tr>\n",
 797 |        "      <td>2</td>\n",
 798 |        "      <td>tt0000003</td>\n",
 799 |        "      <td>nm0721526</td>\n",
 800 |        "      <td>\\N</td>\n",
 801 |        "    </tr>\n",
 802 |        "    <tr>\n",
 803 |        "      <td>3</td>\n",
 804 |        "      <td>tt0000004</td>\n",
 805 |        "      <td>nm0721526</td>\n",
 806 |        "      <td>\\N</td>\n",
 807 |        "    </tr>\n",
 808 |        "    <tr>\n",
 809 |        "      <td>4</td>\n",
 810 |        "      <td>tt0000005</td>\n",
 811 |        "      <td>nm0005690</td>\n",
 812 |        "      <td>\\N</td>\n",
 813 |        "    </tr>\n",
 814 |        "    <tr>\n",
 815 |        "      <td>...</td>\n",
 816 |        "      <td>...</td>\n",
 817 |        "      <td>...</td>\n",
 818 |        "      <td>...</td>\n",
 819 |        "    </tr>\n",
 820 |        "    <tr>\n",
 821 |        "      <td>6672689</td>\n",
 822 |        "      <td>tt9916848</td>\n",
 823 |        "      <td>nm5519454,nm5519375</td>\n",
 824 |        "      <td>nm6182221,nm1628284,nm2921377</td>\n",
 825 |        "    </tr>\n",
 826 |        "    <tr>\n",
 827 |        "      <td>6672690</td>\n",
 828 |        "      <td>tt9916850</td>\n",
 829 |        "      <td>nm5519375,nm5519454</td>\n",
 830 |        "      <td>nm6182221,nm1628284,nm2921377</td>\n",
 831 |        "    </tr>\n",
 832 |        "    <tr>\n",
 833 |        "      <td>6672691</td>\n",
 834 |        "      <td>tt9916852</td>\n",
 835 |        "      <td>nm5519375,nm5519454</td>\n",
 836 |        "      <td>nm6182221,nm1628284,nm2921377</td>\n",
 837 |        "    </tr>\n",
 838 |        "    <tr>\n",
 839 |        "      <td>6672692</td>\n",
 840 |        "      <td>tt9916856</td>\n",
 841 |        "      <td>nm10538645</td>\n",
 842 |        "      <td>nm6951431</td>\n",
 843 |        "    </tr>\n",
 844 |        "    <tr>\n",
 845 |        "      <td>6672693</td>\n",
 846 |        "      <td>tt9916880</td>\n",
 847 |        "      <td>nm0996406</td>\n",
 848 |        "      <td>nm1482639,nm2586970</td>\n",
 849 |        "    </tr>\n",
 850 |        "  </tbody>\n",
 851 |        "</table>\n",
 852 |        "<p>6672694 rows × 3 columns</p>\n",
 853 |        "</div>"
 854 |       ],
 855 |       "text/plain": [
 856 |        "            tconst            directors                        writers\n",
 857 |        "0        tt0000001            nm0005690                             \\N\n",
 858 |        "1        tt0000002            nm0721526                             \\N\n",
 859 |        "2        tt0000003            nm0721526                             \\N\n",
 860 |        "3        tt0000004            nm0721526                             \\N\n",
 861 |        "4        tt0000005            nm0005690                             \\N\n",
 862 |        "...            ...                  ...                            ...\n",
 863 |        "6672689  tt9916848  nm5519454,nm5519375  nm6182221,nm1628284,nm2921377\n",
 864 |        "6672690  tt9916850  nm5519375,nm5519454  nm6182221,nm1628284,nm2921377\n",
 865 |        "6672691  tt9916852  nm5519375,nm5519454  nm6182221,nm1628284,nm2921377\n",
 866 |        "6672692  tt9916856           nm10538645                      nm6951431\n",
 867 |        "6672693  tt9916880            nm0996406            nm1482639,nm2586970\n",
 868 |        "\n",
 869 |        "[6672694 rows x 3 columns]"
 870 |       ]
 871 |      },
 872 |      "metadata": {},
 873 |      "output_type": "display_data"
 874 |     },
 875 |     {
 876 |      "data": {
 877 |       "text/html": [
 878 |        "<div>\n",
 879 |        "<style scoped>\n",
 880 |        "    .dataframe tbody tr th:only-of-type {\n",
 881 |        "        vertical-align: middle;\n",
 882 |        "    }\n",
 883 |        "\n",
 884 |        "    .dataframe tbody tr th {\n",
 885 |        "        vertical-align: top;\n",
 886 |        "    }\n",
 887 |        "\n",
 888 |        "    .dataframe thead th {\n",
 889 |        "        text-align: right;\n",
 890 |        "    }\n",
 891 |        "</style>\n",
 892 |        "<table border=\"1\" class=\"dataframe\">\n",
 893 |        "  <thead>\n",
 894 |        "    <tr style=\"text-align: right;\">\n",
 895 |        "      <th></th>\n",
 896 |        "      <th>titleId</th>\n",
 897 |        "      <th>ordering</th>\n",
 898 |        "      <th>title</th>\n",
 899 |        "      <th>region</th>\n",
 900 |        "      <th>language</th>\n",
 901 |        "      <th>types</th>\n",
 902 |        "      <th>attributes</th>\n",
 903 |        "      <th>isOriginalTitle</th>\n",
 904 |        "    </tr>\n",
 905 |        "  </thead>\n",
 906 |        "  <tbody>\n",
 907 |        "    <tr>\n",
 908 |        "      <td>0</td>\n",
 909 |        "      <td>tt0000001</td>\n",
 910 |        "      <td>1</td>\n",
 911 |        "      <td>Carmencita</td>\n",
 912 |        "      <td>DE</td>\n",
 913 |        "      <td>\\N</td>\n",
 914 |        "      <td>\\N</td>\n",
 915 |        "      <td>literal title</td>\n",
 916 |        "      <td>0</td>\n",
 917 |        "    </tr>\n",
 918 |        "    <tr>\n",
 919 |        "      <td>1</td>\n",
 920 |        "      <td>tt0000001</td>\n",
 921 |        "      <td>2</td>\n",
 922 |        "      <td>Carmencita - spanyol tánc</td>\n",
 923 |        "      <td>HU</td>\n",
 924 |        "      <td>\\N</td>\n",
 925 |        "      <td>imdbDisplay</td>\n",
 926 |        "      <td>\\N</td>\n",
 927 |        "      <td>0</td>\n",
 928 |        "    </tr>\n",
 929 |        "    <tr>\n",
 930 |        "      <td>2</td>\n",
 931 |        "      <td>tt0000001</td>\n",
 932 |        "      <td>3</td>\n",
 933 |        "      <td>Καρμενσίτα</td>\n",
 934 |        "      <td>GR</td>\n",
 935 |        "      <td>\\N</td>\n",
 936 |        "      <td>imdbDisplay</td>\n",
 937 |        "      <td>\\N</td>\n",
 938 |        "      <td>0</td>\n",
 939 |        "    </tr>\n",
 940 |        "    <tr>\n",
 941 |        "      <td>3</td>\n",
 942 |        "      <td>tt0000001</td>\n",
 943 |        "      <td>4</td>\n",
 944 |        "      <td>Карменсита</td>\n",
 945 |        "      <td>RU</td>\n",
 946 |        "      <td>\\N</td>\n",
 947 |        "      <td>imdbDisplay</td>\n",
 948 |        "      <td>\\N</td>\n",
 949 |        "      <td>0</td>\n",
 950 |        "    </tr>\n",
 951 |        "    <tr>\n",
 952 |        "      <td>4</td>\n",
 953 |        "      <td>tt0000001</td>\n",
 954 |        "      <td>5</td>\n",
 955 |        "      <td>Carmencita</td>\n",
 956 |        "      <td>US</td>\n",
 957 |        "      <td>\\N</td>\n",
 958 |        "      <td>\\N</td>\n",
 959 |        "      <td>\\N</td>\n",
 960 |        "      <td>0</td>\n",
 961 |        "    </tr>\n",
 962 |        "    <tr>\n",
 963 |        "      <td>...</td>\n",
 964 |        "      <td>...</td>\n",
 965 |        "      <td>...</td>\n",
 966 |        "      <td>...</td>\n",
 967 |        "      <td>...</td>\n",
 968 |        "      <td>...</td>\n",
 969 |        "      <td>...</td>\n",
 970 |        "      <td>...</td>\n",
 971 |        "      <td>...</td>\n",
 972 |        "    </tr>\n",
 973 |        "    <tr>\n",
 974 |        "      <td>20839948</td>\n",
 975 |        "      <td>tt9916852</td>\n",
 976 |        "      <td>3</td>\n",
 977 |        "      <td>Folge #3.20</td>\n",
 978 |        "      <td>DE</td>\n",
 979 |        "      <td>de</td>\n",
 980 |        "      <td>\\N</td>\n",
 981 |        "      <td>\\N</td>\n",
 982 |        "      <td>0</td>\n",
 983 |        "    </tr>\n",
 984 |        "    <tr>\n",
 985 |        "      <td>20839949</td>\n",
 986 |        "      <td>tt9916852</td>\n",
 987 |        "      <td>4</td>\n",
 988 |        "      <td>エピソード #3.20</td>\n",
 989 |        "      <td>JP</td>\n",
 990 |        "      <td>ja</td>\n",
 991 |        "      <td>\\N</td>\n",
 992 |        "      <td>\\N</td>\n",
 993 |        "      <td>0</td>\n",
 994 |        "    </tr>\n",
 995 |        "    <tr>\n",
 996 |        "      <td>20839950</td>\n",
 997 |        "      <td>tt9916852</td>\n",
 998 |        "      <td>5</td>\n",
 999 |        "      <td>Episódio #3.20</td>\n",
1000 |        "      <td>PT</td>\n",
1001 |        "      <td>pt</td>\n",
1002 |        "      <td>\\N</td>\n",
1003 |        "      <td>\\N</td>\n",
1004 |        "      <td>0</td>\n",
1005 |        "    </tr>\n",
1006 |        "    <tr>\n",
1007 |        "      <td>20839951</td>\n",
1008 |        "      <td>tt9916852</td>\n",
1009 |        "      <td>6</td>\n",
1010 |        "      <td>Episodio #3.20</td>\n",
1011 |        "      <td>IT</td>\n",
1012 |        "      <td>it</td>\n",
1013 |        "      <td>\\N</td>\n",
1014 |        "      <td>\\N</td>\n",
1015 |        "      <td>0</td>\n",
1016 |        "    </tr>\n",
1017 |        "    <tr>\n",
1018 |        "      <td>20839952</td>\n",
1019 |        "      <td>tt9916852</td>\n",
1020 |        "      <td>7</td>\n",
1021 |        "      <td>एपिसोड #3.20</td>\n",
1022 |        "      <td>IN</td>\n",
1023 |        "      <td>hi</td>\n",
1024 |        "      <td>\\N</td>\n",
1025 |        "      <td>\\N</td>\n",
1026 |        "      <td>0</td>\n",
1027 |        "    </tr>\n",
1028 |        "  </tbody>\n",
1029 |        "</table>\n",
1030 |        "<p>20839953 rows × 8 columns</p>\n",
1031 |        "</div>"
1032 |       ],
1033 |       "text/plain": [
1034 |        "            titleId  ordering                      title region language  \\\n",
1035 |        "0         tt0000001         1                 Carmencita     DE       \\N   \n",
1036 |        "1         tt0000001         2  Carmencita - spanyol tánc     HU       \\N   \n",
1037 |        "2         tt0000001         3                 Καρμενσίτα     GR       \\N   \n",
1038 |        "3         tt0000001         4                 Карменсита     RU       \\N   \n",
1039 |        "4         tt0000001         5                 Carmencita     US       \\N   \n",
1040 |        "...             ...       ...                        ...    ...      ...   \n",
1041 |        "20839948  tt9916852         3                Folge #3.20     DE       de   \n",
1042 |        "20839949  tt9916852         4                エピソード #3.20     JP       ja   \n",
1043 |        "20839950  tt9916852         5             Episódio #3.20     PT       pt   \n",
1044 |        "20839951  tt9916852         6             Episodio #3.20     IT       it   \n",
1045 |        "20839952  tt9916852         7               एपिसोड #3.20     IN       hi   \n",
1046 |        "\n",
1047 |        "                types     attributes isOriginalTitle  \n",
1048 |        "0                  \\N  literal title               0  \n",
1049 |        "1         imdbDisplay             \\N               0  \n",
1050 |        "2         imdbDisplay             \\N               0  \n",
1051 |        "3         imdbDisplay             \\N               0  \n",
1052 |        "4                  \\N             \\N               0  \n",
1053 |        "...               ...            ...             ...  \n",
1054 |        "20839948           \\N             \\N               0  \n",
1055 |        "20839949           \\N             \\N               0  \n",
1056 |        "20839950           \\N             \\N               0  \n",
1057 |        "20839951           \\N             \\N               0  \n",
1058 |        "20839952           \\N             \\N               0  \n",
1059 |        "\n",
1060 |        "[20839953 rows x 8 columns]"
1061 |       ]
1062 |      },
1063 |      "metadata": {},
1064 |      "output_type": "display_data"
1065 |     },
1066 |     {
1067 |      "data": {
1068 |       "text/html": [
1069 |        "<div>\n",
1070 |        "<style scoped>\n",
1071 |        "    .dataframe tbody tr th:only-of-type {\n",
1072 |        "        vertical-align: middle;\n",
1073 |        "    }\n",
1074 |        "\n",
1075 |        "    .dataframe tbody tr th {\n",
1076 |        "        vertical-align: top;\n",
1077 |        "    }\n",
1078 |        "\n",
1079 |        "    .dataframe thead th {\n",
1080 |        "        text-align: right;\n",
1081 |        "    }\n",
1082 |        "</style>\n",
1083 |        "<table border=\"1\" class=\"dataframe\">\n",
1084 |        "  <thead>\n",
1085 |        "    <tr style=\"text-align: right;\">\n",
1086 |        "      <th></th>\n",
1087 |        "      <th>tconst</th>\n",
1088 |        "      <th>parentTconst</th>\n",
1089 |        "      <th>seasonNumber</th>\n",
1090 |        "      <th>episodeNumber</th>\n",
1091 |        "    </tr>\n",
1092 |        "  </thead>\n",
1093 |        "  <tbody>\n",
1094 |        "    <tr>\n",
1095 |        "      <td>0</td>\n",
1096 |        "      <td>tt0041951</td>\n",
1097 |        "      <td>tt0041038</td>\n",
1098 |        "      <td>1</td>\n",
1099 |        "      <td>9</td>\n",
1100 |        "    </tr>\n",
1101 |        "    <tr>\n",
1102 |        "      <td>1</td>\n",
1103 |        "      <td>tt0042816</td>\n",
1104 |        "      <td>tt0989125</td>\n",
1105 |        "      <td>1</td>\n",
1106 |        "      <td>17</td>\n",
1107 |        "    </tr>\n",
1108 |        "    <tr>\n",
1109 |        "      <td>2</td>\n",
1110 |        "      <td>tt0042889</td>\n",
1111 |        "      <td>tt0989125</td>\n",
1112 |        "      <td>\\N</td>\n",
1113 |        "      <td>\\N</td>\n",
1114 |        "    </tr>\n",
1115 |        "    <tr>\n",
1116 |        "      <td>3</td>\n",
1117 |        "      <td>tt0043426</td>\n",
1118 |        "      <td>tt0040051</td>\n",
1119 |        "      <td>3</td>\n",
1120 |        "      <td>42</td>\n",
1121 |        "    </tr>\n",
1122 |        "    <tr>\n",
1123 |        "      <td>4</td>\n",
1124 |        "      <td>tt0043631</td>\n",
1125 |        "      <td>tt0989125</td>\n",
1126 |        "      <td>2</td>\n",
1127 |        "      <td>16</td>\n",
1128 |        "    </tr>\n",
1129 |        "    <tr>\n",
1130 |        "      <td>...</td>\n",
1131 |        "      <td>...</td>\n",
1132 |        "      <td>...</td>\n",
1133 |        "      <td>...</td>\n",
1134 |        "      <td>...</td>\n",
1135 |        "    </tr>\n",
1136 |        "    <tr>\n",
1137 |        "      <td>4737120</td>\n",
1138 |        "      <td>tt9916846</td>\n",
1139 |        "      <td>tt1289683</td>\n",
1140 |        "      <td>3</td>\n",
1141 |        "      <td>18</td>\n",
1142 |        "    </tr>\n",
1143 |        "    <tr>\n",
1144 |        "      <td>4737121</td>\n",
1145 |        "      <td>tt9916848</td>\n",
1146 |        "      <td>tt1289683</td>\n",
1147 |        "      <td>3</td>\n",
1148 |        "      <td>17</td>\n",
1149 |        "    </tr>\n",
1150 |        "    <tr>\n",
1151 |        "      <td>4737122</td>\n",
1152 |        "      <td>tt9916850</td>\n",
1153 |        "      <td>tt1289683</td>\n",
1154 |        "      <td>3</td>\n",
1155 |        "      <td>19</td>\n",
1156 |        "    </tr>\n",
1157 |        "    <tr>\n",
1158 |        "      <td>4737123</td>\n",
1159 |        "      <td>tt9916852</td>\n",
1160 |        "      <td>tt1289683</td>\n",
1161 |        "      <td>3</td>\n",
1162 |        "      <td>20</td>\n",
1163 |        "    </tr>\n",
1164 |        "    <tr>\n",
1165 |        "      <td>4737124</td>\n",
1166 |        "      <td>tt9916880</td>\n",
1167 |        "      <td>tt0985991</td>\n",
1168 |        "      <td>4</td>\n",
1169 |        "      <td>2</td>\n",
1170 |        "    </tr>\n",
1171 |        "  </tbody>\n",
1172 |        "</table>\n",
1173 |        "<p>4737125 rows × 4 columns</p>\n",
1174 |        "</div>"
1175 |       ],
1176 |       "text/plain": [
1177 |        "            tconst parentTconst seasonNumber episodeNumber\n",
1178 |        "0        tt0041951    tt0041038            1             9\n",
1179 |        "1        tt0042816    tt0989125            1            17\n",
1180 |        "2        tt0042889    tt0989125           \\N            \\N\n",
1181 |        "3        tt0043426    tt0040051            3            42\n",
1182 |        "4        tt0043631    tt0989125            2            16\n",
1183 |        "...            ...          ...          ...           ...\n",
1184 |        "4737120  tt9916846    tt1289683            3            18\n",
1185 |        "4737121  tt9916848    tt1289683            3            17\n",
1186 |        "4737122  tt9916850    tt1289683            3            19\n",
1187 |        "4737123  tt9916852    tt1289683            3            20\n",
1188 |        "4737124  tt9916880    tt0985991            4             2\n",
1189 |        "\n",
1190 |        "[4737125 rows x 4 columns]"
1191 |       ]
1192 |      },
1193 |      "metadata": {},
1194 |      "output_type": "display_data"
1195 |     },
1196 |     {
1197 |      "name": "stdout",
1198 |      "output_type": "stream",
1199 |      "text": [
1200 |       "Wall time: 2min 40s\n"
1201 |      ]
1202 |     }
1203 |    ],
1204 |    "source": [
1205 |     "%%time\n",
1206 |     "imdb_api_file_list = ['title.basics.tsv.gz','title.ratings.tsv.gz','name.basics.tsv.gz','title.principals.tsv.gz','title.crew.tsv.gz','title.akas.tsv.gz','title.episode.tsv.gz']\n",
1207 |     "\n",
1208 |     "for package in imdb_api_file_list:\n",
1209 |     "    package_file_name = f'https://datasets.imdbws.com/{package}'\n",
1210 |     "    display(pd.read_csv(package_file_name, sep='\\t', low_memory=False))"
1211 |    ]
1212 |   },
1213 |   {
1214 |    "cell_type": "markdown",
1215 |    "metadata": {},
1216 |    "source": [
1217 |     "There is a lot of great data here! However, there isn't much text to work with: no plot summaries or even taglines. I will have to scrape for that information.\n",
1218 |     "\n",
1219 |     "I only want movie results (no TV or people), so I'm going to focus on `title.basics.tsv.gz` and `title.ratings.tsv.gz`.\n",
1220 |     "\n",
1221 |     "---\n",
1222 |     "\n",
1223 |     "Load `title.basics` and `title.ratings` into their own dataframes, merge them on `tconst`, and explore."
1224 |    ]
1225 |   },
1226 |   {
1227 |    "cell_type": "code",
1228 |    "execution_count": 12,
1229 |    "metadata": {},
1230 |    "outputs": [
1231 |     {
1232 |      "data": {
1233 |       "text/html": [
1234 |        "<div>\n",
1235 |        "<style scoped>\n",
1236 |        "    .dataframe tbody tr th:only-of-type {\n",
1237 |        "        vertical-align: middle;\n",
1238 |        "    }\n",
1239 |        "\n",
1240 |        "    .dataframe tbody tr th {\n",
1241 |        "        vertical-align: top;\n",
1242 |        "    }\n",
1243 |        "\n",
1244 |        "    .dataframe thead th {\n",
1245 |        "        text-align: right;\n",
1246 |        "    }\n",
1247 |        "</style>\n",
1248 |        "<table border=\"1\" class=\"dataframe\">\n",
1249 |        "  <thead>\n",
1250 |        "    <tr style=\"text-align: right;\">\n",
1251 |        "      <th></th>\n",
1252 |        "      <th>tconst</th>\n",
1253 |        "      <th>titleType</th>\n",
1254 |        "      <th>primaryTitle</th>\n",
1255 |        "      <th>originalTitle</th>\n",
1256 |        "      <th>isAdult</th>\n",
1257 |        "      <th>startYear</th>\n",
1258 |        "      <th>endYear</th>\n",
1259 |        "      <th>runtimeMinutes</th>\n",
1260 |        "      <th>genres</th>\n",
1261 |        "      <th>averageRating</th>\n",
1262 |        "      <th>numVotes</th>\n",
1263 |        "    </tr>\n",
1264 |        "  </thead>\n",
1265 |        "  <tbody>\n",
1266 |        "    <tr>\n",
1267 |        "      <td>0</td>\n",
1268 |        "      <td>tt0000001</td>\n",
1269 |        "      <td>short</td>\n",
1270 |        "      <td>Carmencita</td>\n",
1271 |        "      <td>Carmencita</td>\n",
1272 |        "      <td>0</td>\n",
1273 |        "      <td>1894</td>\n",
1274 |        "      <td>\\N</td>\n",
1275 |        "      <td>1</td>\n",
1276 |        "      <td>Documentary,Short</td>\n",
1277 |        "      <td>5.6</td>\n",
1278 |        "      <td>1591</td>\n",
1279 |        "    </tr>\n",
1280 |        "    <tr>\n",
1281 |        "      <td>1</td>\n",
1282 |        "      <td>tt0000002</td>\n",
1283 |        "      <td>short</td>\n",
1284 |        "      <td>Le clown et ses chiens</td>\n",
1285 |        "      <td>Le clown et ses chiens</td>\n",
1286 |        "      <td>0</td>\n",
1287 |        "      <td>1892</td>\n",
1288 |        "      <td>\\N</td>\n",
1289 |        "      <td>5</td>\n",
1290 |        "      <td>Animation,Short</td>\n",
1291 |        "      <td>6.1</td>\n",
1292 |        "      <td>194</td>\n",
1293 |        "    </tr>\n",
1294 |        "    <tr>\n",
1295 |        "      <td>2</td>\n",
1296 |        "      <td>tt0000003</td>\n",
1297 |        "      <td>short</td>\n",
1298 |        "      <td>Pauvre Pierrot</td>\n",
1299 |        "      <td>Pauvre Pierrot</td>\n",
1300 |        "      <td>0</td>\n",
1301 |        "      <td>1892</td>\n",
1302 |        "      <td>\\N</td>\n",
1303 |        "      <td>4</td>\n",
1304 |        "      <td>Animation,Comedy,Romance</td>\n",
1305 |        "      <td>6.5</td>\n",
1306 |        "      <td>1264</td>\n",
1307 |        "    </tr>\n",
1308 |        "    <tr>\n",
1309 |        "      <td>3</td>\n",
1310 |        "      <td>tt0000004</td>\n",
1311 |        "      <td>short</td>\n",
1312 |        "      <td>Un bon bock</td>\n",
1313 |        "      <td>Un bon bock</td>\n",
1314 |        "      <td>0</td>\n",
1315 |        "      <td>1892</td>\n",
1316 |        "      <td>\\N</td>\n",
1317 |        "      <td>12</td>\n",
1318 |        "      <td>Animation,Short</td>\n",
1319 |        "      <td>6.2</td>\n",
1320 |        "      <td>120</td>\n",
1321 |        "    </tr>\n",
1322 |        "    <tr>\n",
1323 |        "      <td>4</td>\n",
1324 |        "      <td>tt0000005</td>\n",
1325 |        "      <td>short</td>\n",
1326 |        "      <td>Blacksmith Scene</td>\n",
1327 |        "      <td>Blacksmith Scene</td>\n",
1328 |        "      <td>0</td>\n",
1329 |        "      <td>1893</td>\n",
1330 |        "      <td>\\N</td>\n",
1331 |        "      <td>1</td>\n",
1332 |        "      <td>Comedy,Short</td>\n",
1333 |        "      <td>6.1</td>\n",
1334 |        "      <td>2025</td>\n",
1335 |        "    </tr>\n",
1336 |        "    <tr>\n",
1337 |        "      <td>...</td>\n",
1338 |        "      <td>...</td>\n",
1339 |        "      <td>...</td>\n",
1340 |        "      <td>...</td>\n",
1341 |        "      <td>...</td>\n",
1342 |        "      <td>...</td>\n",
1343 |        "      <td>...</td>\n",
1344 |        "      <td>...</td>\n",
1345 |        "      <td>...</td>\n",
1346 |        "      <td>...</td>\n",
1347 |        "      <td>...</td>\n",
1348 |        "      <td>...</td>\n",
1349 |        "    </tr>\n",
1350 |        "    <tr>\n",
1351 |        "      <td>1018999</td>\n",
1352 |        "      <td>tt9916576</td>\n",
1353 |        "      <td>tvEpisode</td>\n",
1354 |        "      <td>Destinee's Story</td>\n",
1355 |        "      <td>Destinee's Story</td>\n",
1356 |        "      <td>0</td>\n",
1357 |        "      <td>2019</td>\n",
1358 |        "      <td>\\N</td>\n",
1359 |        "      <td>85</td>\n",
1360 |        "      <td>Reality-TV</td>\n",
1361 |        "      <td>6.0</td>\n",
1362 |        "      <td>9</td>\n",
1363 |        "    </tr>\n",
1364 |        "    <tr>\n",
1365 |        "      <td>1019000</td>\n",
1366 |        "      <td>tt9916578</td>\n",
1367 |        "      <td>tvEpisode</td>\n",
1368 |        "      <td>The Trial of Joan Collins</td>\n",
1369 |        "      <td>The Trial of Joan Collins</td>\n",
1370 |        "      <td>0</td>\n",
1371 |        "      <td>2019</td>\n",
1372 |        "      <td>\\N</td>\n",
1373 |        "      <td>\\N</td>\n",
1374 |        "      <td>Adventure,Biography,Comedy</td>\n",
1375 |        "      <td>8.5</td>\n",
1376 |        "      <td>16</td>\n",
1377 |        "    </tr>\n",
1378 |        "    <tr>\n",
1379 |        "      <td>1019001</td>\n",
1380 |        "      <td>tt9916720</td>\n",
1381 |        "      <td>short</td>\n",
1382 |        "      <td>The Nun 2</td>\n",
1383 |        "      <td>The Nun 2</td>\n",
1384 |        "      <td>0</td>\n",
1385 |        "      <td>2019</td>\n",
1386 |        "      <td>\\N</td>\n",
1387 |        "      <td>10</td>\n",
1388 |        "      <td>Comedy,Horror,Mystery</td>\n",
1389 |        "      <td>5.5</td>\n",
1390 |        "      <td>48</td>\n",
1391 |        "    </tr>\n",
1392 |        "    <tr>\n",
1393 |        "      <td>1019002</td>\n",
1394 |        "      <td>tt9916766</td>\n",
1395 |        "      <td>tvEpisode</td>\n",
1396 |        "      <td>Episode #10.15</td>\n",
1397 |        "      <td>Episode #10.15</td>\n",
1398 |        "      <td>0</td>\n",
1399 |        "      <td>2019</td>\n",
1400 |        "      <td>\\N</td>\n",
1401 |        "      <td>43</td>\n",
1402 |        "      <td>Family,Reality-TV</td>\n",
1403 |        "      <td>6.8</td>\n",
1404 |        "      <td>13</td>\n",
1405 |        "    </tr>\n",
1406 |        "    <tr>\n",
1407 |        "      <td>1019003</td>\n",
1408 |        "      <td>tt9916778</td>\n",
1409 |        "      <td>tvEpisode</td>\n",
1410 |        "      <td>Escape</td>\n",
1411 |        "      <td>Escape</td>\n",
1412 |        "      <td>0</td>\n",
1413 |        "      <td>2019</td>\n",
1414 |        "      <td>\\N</td>\n",
1415 |        "      <td>\\N</td>\n",
1416 |        "      <td>Drama</td>\n",
1417 |        "      <td>7.2</td>\n",
1418 |        "      <td>20</td>\n",
1419 |        "    </tr>\n",
1420 |        "  </tbody>\n",
1421 |        "</table>\n",
1422 |        "<p>1019004 rows × 11 columns</p>\n",
1423 |        "</div>"
1424 |       ],
1425 |       "text/plain": [
1426 |        "            tconst  titleType               primaryTitle  \\\n",
1427 |        "0        tt0000001      short                 Carmencita   \n",
1428 |        "1        tt0000002      short     Le clown et ses chiens   \n",
1429 |        "2        tt0000003      short             Pauvre Pierrot   \n",
1430 |        "3        tt0000004      short                Un bon bock   \n",
1431 |        "4        tt0000005      short           Blacksmith Scene   \n",
1432 |        "...            ...        ...                        ...   \n",
1433 |        "1018999  tt9916576  tvEpisode           Destinee's Story   \n",
1434 |        "1019000  tt9916578  tvEpisode  The Trial of Joan Collins   \n",
1435 |        "1019001  tt9916720      short                  The Nun 2   \n",
1436 |        "1019002  tt9916766  tvEpisode             Episode #10.15   \n",
1437 |        "1019003  tt9916778  tvEpisode                     Escape   \n",
1438 |        "\n",
1439 |        "                     originalTitle  isAdult startYear endYear runtimeMinutes  \\\n",
1440 |        "0                       Carmencita        0      1894      \\N              1   \n",
1441 |        "1           Le clown et ses chiens        0      1892      \\N              5   \n",
1442 |        "2                   Pauvre Pierrot        0      1892      \\N              4   \n",
1443 |        "3                      Un bon bock        0      1892      \\N             12   \n",
1444 |        "4                 Blacksmith Scene        0      1893      \\N              1   \n",
1445 |        "...                            ...      ...       ...     ...            ...   \n",
1446 |        "1018999           Destinee's Story        0      2019      \\N             85   \n",
1447 |        "1019000  The Trial of Joan Collins        0      2019      \\N             \\N   \n",
1448 |        "1019001                  The Nun 2        0      2019      \\N             10   \n",
1449 |        "1019002             Episode #10.15        0      2019      \\N             43   \n",
1450 |        "1019003                     Escape        0      2019      \\N             \\N   \n",
1451 |        "\n",
1452 |        "                             genres  averageRating  numVotes  \n",
1453 |        "0                 Documentary,Short            5.6      1591  \n",
1454 |        "1                   Animation,Short            6.1       194  \n",
1455 |        "2          Animation,Comedy,Romance            6.5      1264  \n",
1456 |        "3                   Animation,Short            6.2       120  \n",
1457 |        "4                      Comedy,Short            6.1      2025  \n",
1458 |        "...                             ...            ...       ...  \n",
1459 |        "1018999                  Reality-TV            6.0         9  \n",
1460 |        "1019000  Adventure,Biography,Comedy            8.5        16  \n",
1461 |        "1019001       Comedy,Horror,Mystery            5.5        48  \n",
1462 |        "1019002           Family,Reality-TV            6.8        13  \n",
1463 |        "1019003                       Drama            7.2        20  \n",
1464 |        "\n",
1465 |        "[1019004 rows x 11 columns]"
1466 |       ]
1467 |      },
1468 |      "execution_count": 12,
1469 |      "metadata": {},
1470 |      "output_type": "execute_result"
1471 |     }
1472 |    ],
1473 |    "source": [
1474 |     "df_basics  = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz',  sep='\\t', low_memory=False)\n",
1475 |     "df_ratings = pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', sep='\\t', low_memory=False)\n",
1476 |     "df_merge = pd.merge(df_basics, df_ratings, left_on='tconst', right_on='tconst')\n",
1477 |     "df_merge"
1478 |    ]
1479 |   },
1480 |   {
1481 |    "cell_type": "markdown",
1482 |    "metadata": {},
1483 |    "source": [
1484 |     "We have 1,000,000+ rows with basic movie information, now including number of votes and rating. This additional information will help to pare down this dataframe.\n",
1485 |     "\n",
1486 |     "There are still a LOT of rows that aren't needed. For example, anything where the `titleType` is not \"movie\" can be ignored."
1487 |    ]
1488 |   },
1489 |   {
1490 |    "cell_type": "code",
1491 |    "execution_count": 13,
1492 |    "metadata": {},
1493 |    "outputs": [
1494 |     {
1495 |      "data": {
1496 |       "text/plain": [
1497 |        "tconst             object\n",
1498 |        "titleType          object\n",
1499 |        "primaryTitle       object\n",
1500 |        "originalTitle      object\n",
1501 |        "isAdult             int64\n",
1502 |        "startYear          object\n",
1503 |        "endYear            object\n",
1504 |        "runtimeMinutes     object\n",
1505 |        "genres             object\n",
1506 |        "averageRating     float64\n",
1507 |        "numVotes            int64\n",
1508 |        "dtype: object"
1509 |       ]
1510 |      },
1511 |      "execution_count": 13,
1512 |      "metadata": {},
1513 |      "output_type": "execute_result"
1514 |     }
1515 |    ],
1516 |    "source": [
1517 |     "df_merge.dtypes"
1518 |    ]
1519 |   },
1520 |   {
1521 |    "cell_type": "markdown",
1522 |    "metadata": {},
1523 |    "source": [
1524 |     "The columns of our new dataframe are mostly `object` (string) types. Years and runtime should be integers. However, we don't need to worry about these here."
1525 |    ]
1526 |   },
1527 |   {
1528 |    "cell_type": "markdown",
1529 |    "metadata": {},
1530 |    "source": [
1531 |     "---\n",
1532 |     "\n",
1533 |     "Create a slimmed-down dataframe with the following changes:\n",
1534 |     "- Remove unreleased titles (where `startYear` is the null marker `\\N`)\n",
1535 |     "- Keep only titles of type 'movie'\n",
1536 |     "- Exclude adult films\n",
1537 |     "- Drop `endYear` (it only applies to TV) and `isAdult` (no longer needed after filtering)\n",
1538 |     "- Rename `startYear` to `year` and move it up to the third column"
1539 |    ]
1540 |   },
1541 |   {
1542 |    "cell_type": "code",
1543 |    "execution_count": 17,
1544 |    "metadata": {},
1545 |    "outputs": [
1546 |     {
1547 |      "data": {
1548 |       "text/html": [
1549 |        "<div>\n",
1550 |        "<style scoped>\n",
1551 |        "    .dataframe tbody tr th:only-of-type {\n",
1552 |        "        vertical-align: middle;\n",
1553 |        "    }\n",
1554 |        "\n",
1555 |        "    .dataframe tbody tr th {\n",
1556 |        "        vertical-align: top;\n",
1557 |        "    }\n",
1558 |        "\n",
1559 |        "    .dataframe thead th {\n",
1560 |        "        text-align: right;\n",
1561 |        "    }\n",
1562 |        "</style>\n",
1563 |        "<table border=\"1\" class=\"dataframe\">\n",
1564 |        "  <thead>\n",
1565 |        "    <tr style=\"text-align: right;\">\n",
1566 |        "      <th></th>\n",
1567 |        "      <th>tconst</th>\n",
1568 |        "      <th>titleType</th>\n",
1569 |        "      <th>year</th>\n",
1570 |        "      <th>primaryTitle</th>\n",
1571 |        "      <th>originalTitle</th>\n",
1572 |        "      <th>runtimeMinutes</th>\n",
1573 |        "      <th>genres</th>\n",
1574 |        "      <th>averageRating</th>\n",
1575 |        "      <th>numVotes</th>\n",
1576 |        "    </tr>\n",
1577 |        "  </thead>\n",
1578 |        "  <tbody>\n",
1579 |        "    <tr>\n",
1580 |        "      <td>8</td>\n",
1581 |        "      <td>tt0000009</td>\n",
1582 |        "      <td>movie</td>\n",
1583 |        "      <td>1894</td>\n",
1584 |        "      <td>Miss Jerry</td>\n",
1585 |        "      <td>Miss Jerry</td>\n",
1586 |        "      <td>45</td>\n",
1587 |        "      <td>Romance</td>\n",
1588 |        "      <td>5.4</td>\n",
1589 |        "      <td>89</td>\n",
1590 |        "    </tr>\n",
1591 |        "    <tr>\n",
1592 |        "      <td>144</td>\n",
1593 |        "      <td>tt0000147</td>\n",
1594 |        "      <td>movie</td>\n",
1595 |        "      <td>1897</td>\n",
1596 |        "      <td>The Corbett-Fitzsimmons Fight</td>\n",
1597 |        "      <td>The Corbett-Fitzsimmons Fight</td>\n",
1598 |        "      <td>20</td>\n",
1599 |        "      <td>Documentary,News,Sport</td>\n",
1600 |        "      <td>5.2</td>\n",
1601 |        "      <td>333</td>\n",
1602 |        "    </tr>\n",
1603 |        "    <tr>\n",
1604 |        "      <td>251</td>\n",
1605 |        "      <td>tt0000335</td>\n",
1606 |        "      <td>movie</td>\n",
1607 |        "      <td>1900</td>\n",
1608 |        "      <td>Soldiers of the Cross</td>\n",
1609 |        "      <td>Soldiers of the Cross</td>\n",
1610 |        "      <td>\\N</td>\n",
1611 |        "      <td>Biography,Drama</td>\n",
1612 |        "      <td>6.1</td>\n",
1613 |        "      <td>40</td>\n",
1614 |        "    </tr>\n",
1615 |        "    <tr>\n",
1616 |        "      <td>327</td>\n",
1617 |        "      <td>tt0000502</td>\n",
1618 |        "      <td>movie</td>\n",
1619 |        "      <td>1905</td>\n",
1620 |        "      <td>Bohemios</td>\n",
1621 |        "      <td>Bohemios</td>\n",
1622 |        "      <td>100</td>\n",
1623 |        "      <td>\\N</td>\n",
1624 |        "      <td>4.4</td>\n",
1625 |        "      <td>5</td>\n",
1626 |        "    </tr>\n",
1627 |        "    <tr>\n",
1628 |        "      <td>361</td>\n",
1629 |        "      <td>tt0000574</td>\n",
1630 |        "      <td>movie</td>\n",
1631 |        "      <td>1906</td>\n",
1632 |        "      <td>The Story of the Kelly Gang</td>\n",
1633 |        "      <td>The Story of the Kelly Gang</td>\n",
1634 |        "      <td>70</td>\n",
1635 |        "      <td>Biography,Crime,Drama</td>\n",
1636 |        "      <td>6.1</td>\n",
1637 |        "      <td>562</td>\n",
1638 |        "    </tr>\n",
1639 |        "    <tr>\n",
1640 |        "      <td>...</td>\n",
1641 |        "      <td>...</td>\n",
1642 |        "      <td>...</td>\n",
1643 |        "      <td>...</td>\n",
1644 |        "      <td>...</td>\n",
1645 |        "      <td>...</td>\n",
1646 |        "      <td>...</td>\n",
1647 |        "      <td>...</td>\n",
1648 |        "      <td>...</td>\n",
1649 |        "      <td>...</td>\n",
1650 |        "    </tr>\n",
1651 |        "    <tr>\n",
1652 |        "      <td>1018959</td>\n",
1653 |        "      <td>tt9914942</td>\n",
1654 |        "      <td>movie</td>\n",
1655 |        "      <td>2019</td>\n",
1656 |        "      <td>La vida sense la Sara Amat</td>\n",
1657 |        "      <td>La vida sense la Sara Amat</td>\n",
1658 |        "      <td>74</td>\n",
1659 |        "      <td>Drama</td>\n",
1660 |        "      <td>6.7</td>\n",
1661 |        "      <td>76</td>\n",
1662 |        "    </tr>\n",
1663 |        "    <tr>\n",
1664 |        "      <td>1018974</td>\n",
1665 |        "      <td>tt9915790</td>\n",
1666 |        "      <td>movie</td>\n",
1667 |        "      <td>2019</td>\n",
1668 |        "      <td>Bobbyr Bondhura</td>\n",
1669 |        "      <td>Bobbyr Bondhura</td>\n",
1670 |        "      <td>\\N</td>\n",
1671 |        "      <td>Family</td>\n",
1672 |        "      <td>7.6</td>\n",
1673 |        "      <td>13</td>\n",
1674 |        "    </tr>\n",
1675 |        "    <tr>\n",
1676 |        "      <td>1018987</td>\n",
1677 |        "      <td>tt9916160</td>\n",
1678 |        "      <td>movie</td>\n",
1679 |        "      <td>2019</td>\n",
1680 |        "      <td>Drømmeland</td>\n",
1681 |        "      <td>Drømmeland</td>\n",
1682 |        "      <td>72</td>\n",
1683 |        "      <td>Documentary</td>\n",
1684 |        "      <td>6.6</td>\n",
1685 |        "      <td>36</td>\n",
1686 |        "    </tr>\n",
1687 |        "    <tr>\n",
1688 |        "      <td>1018996</td>\n",
1689 |        "      <td>tt9916428</td>\n",
1690 |        "      <td>movie</td>\n",
1691 |        "      <td>2019</td>\n",
1692 |        "      <td>The Secret of China</td>\n",
1693 |        "      <td>The Secret of China</td>\n",
1694 |        "      <td>\\N</td>\n",
1695 |        "      <td>Adventure,History,War</td>\n",
1696 |        "      <td>3.3</td>\n",
1697 |        "      <td>11</td>\n",
1698 |        "    </tr>\n",
1699 |        "    <tr>\n",
1700 |        "      <td>1018997</td>\n",
1701 |        "      <td>tt9916538</td>\n",
1702 |        "      <td>movie</td>\n",
1703 |        "      <td>2019</td>\n",
1704 |        "      <td>Kuambil Lagi Hatiku</td>\n",
1705 |        "      <td>Kuambil Lagi Hatiku</td>\n",
1706 |        "      <td>123</td>\n",
1707 |        "      <td>Drama</td>\n",
1708 |        "      <td>8.4</td>\n",
1709 |        "      <td>5</td>\n",
1710 |        "    </tr>\n",
1711 |        "  </tbody>\n",
1712 |        "</table>\n",
1713 |        "<p>241980 rows × 9 columns</p>\n",
1714 |        "</div>"
1715 |       ],
1716 |       "text/plain": [
1717 |        "            tconst titleType  year                   primaryTitle  \\\n",
1718 |        "8        tt0000009     movie  1894                     Miss Jerry   \n",
1719 |        "144      tt0000147     movie  1897  The Corbett-Fitzsimmons Fight   \n",
1720 |        "251      tt0000335     movie  1900          Soldiers of the Cross   \n",
1721 |        "327      tt0000502     movie  1905                       Bohemios   \n",
1722 |        "361      tt0000574     movie  1906    The Story of the Kelly Gang   \n",
1723 |        "...            ...       ...   ...                            ...   \n",
1724 |        "1018959  tt9914942     movie  2019     La vida sense la Sara Amat   \n",
1725 |        "1018974  tt9915790     movie  2019                Bobbyr Bondhura   \n",
1726 |        "1018987  tt9916160     movie  2019                     Drømmeland   \n",
1727 |        "1018996  tt9916428     movie  2019            The Secret of China   \n",
1728 |        "1018997  tt9916538     movie  2019            Kuambil Lagi Hatiku   \n",
1729 |        "\n",
1730 |        "                         originalTitle runtimeMinutes                  genres  \\\n",
1731 |        "8                           Miss Jerry             45                 Romance   \n",
1732 |        "144      The Corbett-Fitzsimmons Fight             20  Documentary,News,Sport   \n",
1733 |        "251              Soldiers of the Cross             \\N         Biography,Drama   \n",
1734 |        "327                           Bohemios            100                      \\N   \n",
1735 |        "361        The Story of the Kelly Gang             70   Biography,Crime,Drama   \n",
1736 |        "...                                ...            ...                     ...   \n",
1737 |        "1018959     La vida sense la Sara Amat             74                   Drama   \n",
1738 |        "1018974                Bobbyr Bondhura             \\N                  Family   \n",
1739 |        "1018987                     Drømmeland             72             Documentary   \n",
1740 |        "1018996            The Secret of China             \\N   Adventure,History,War   \n",
1741 |        "1018997            Kuambil Lagi Hatiku            123                   Drama   \n",
1742 |        "\n",
1743 |        "         averageRating  numVotes  \n",
1744 |        "8                  5.4        89  \n",
1745 |        "144                5.2       333  \n",
1746 |        "251                6.1        40  \n",
1747 |        "327                4.4         5  \n",
1748 |        "361                6.1       562  \n",
1749 |        "...                ...       ...  \n",
1750 |        "1018959            6.7        76  \n",
1751 |        "1018974            7.6        13  \n",
1752 |        "1018987            6.6        36  \n",
1753 |        "1018996            3.3        11  \n",
1754 |        "1018997            8.4         5  \n",
1755 |        "\n",
1756 |        "[241980 rows x 9 columns]"
1757 |       ]
1758 |      },
1759 |      "execution_count": 17,
1760 |      "metadata": {},
1761 |      "output_type": "execute_result"
1762 |     }
1763 |    ],
1764 |    "source": [
1765 |     "df_slim = df_merge\n",
1766 |     "# Remove unreleased, non-movies, adult\n",
1767 |     "df_slim = df_slim.drop(df_slim[df_slim.startYear == '\\\\N'].index)\n",
1768 |     "df_slim = df_slim[ (df_slim['titleType'] == 'movie' ) & (df_slim['isAdult'] == 0) ]\n",
1769 |     "df_slim = df_slim.drop(['endYear', 'isAdult'], axis=1)\n",
1770 |     "\n",
1771 |     "# Reformat year column\n",
1772 |     "df_slim.insert(loc=2, column='year', value=df_slim['startYear'])\n",
1773 |     "df_slim = df_slim.drop(['startYear'], axis=1)\n",
1774 |     "df_slim"
1775 |    ]
1776 |   },
1777 |   {
1778 |    "cell_type": "code",
1779 |    "execution_count": 65,
1780 |    "metadata": {},
1781 |    "outputs": [
1782 |     {
1783 |      "data": {
1784 |       "text/plain": [
1785 |        "tconst            0\n",
1786 |        "titleType         0\n",
1787 |        "year              0\n",
1788 |        "primaryTitle      0\n",
1789 |        "originalTitle     0\n",
1790 |        "runtimeMinutes    0\n",
1791 |        "genres            0\n",
1792 |        "averageRating     0\n",
1793 |        "numVotes          0\n",
1794 |        "dtype: int64"
1795 |       ]
1796 |      },
1797 |      "execution_count": 65,
1798 |      "metadata": {},
1799 |      "output_type": "execute_result"
1800 |     }
1801 |    ],
1802 |    "source": [
1803 |     "df_slim.isna().sum()"
1804 |    ]
1805 |   },
1806 |   {
1807 |    "cell_type": "code",
1808 |    "execution_count": 26,
1809 |    "metadata": {},
1810 |    "outputs": [
1811 |     {
1812 |      "data": {
1813 |       "text/html": [
1814 |        "<div>\n",
1815 |        "<style scoped>\n",
1816 |        "    .dataframe tbody tr th:only-of-type {\n",
1817 |        "        vertical-align: middle;\n",
1818 |        "    }\n",
1819 |        "\n",
1820 |        "    .dataframe tbody tr th {\n",
1821 |        "        vertical-align: top;\n",
1822 |        "    }\n",
1823 |        "\n",
1824 |        "    .dataframe thead th {\n",
1825 |        "        text-align: right;\n",
1826 |        "    }\n",
1827 |        "</style>\n",
1828 |        "<table border=\"1\" class=\"dataframe\">\n",
1829 |        "  <thead>\n",
1830 |        "    <tr style=\"text-align: right;\">\n",
1831 |        "      <th></th>\n",
1832 |        "      <th>tconst</th>\n",
1833 |        "      <th>titleType</th>\n",
1834 |        "      <th>year</th>\n",
1835 |        "      <th>primaryTitle</th>\n",
1836 |        "      <th>originalTitle</th>\n",
1837 |        "      <th>runtimeMinutes</th>\n",
1838 |        "      <th>genres</th>\n",
1839 |        "      <th>averageRating</th>\n",
1840 |        "      <th>numVotes</th>\n",
1841 |        "    </tr>\n",
1842 |        "  </thead>\n",
1843 |        "  <tbody>\n",
1844 |        "    <tr>\n",
1845 |        "      <td>51794</td>\n",
1846 |        "      <td>tt0076759</td>\n",
1847 |        "      <td>movie</td>\n",
1848 |        "      <td>1977</td>\n",
1849 |        "      <td>Star Wars: Episode IV - A New Hope</td>\n",
1850 |        "      <td>Star Wars</td>\n",
1851 |        "      <td>121</td>\n",
1852 |        "      <td>Action,Adventure,Fantasy</td>\n",
1853 |        "      <td>8.6</td>\n",
1854 |        "      <td>1170498</td>\n",
1855 |        "    </tr>\n",
1856 |        "    <tr>\n",
1857 |        "      <td>998631</td>\n",
1858 |        "      <td>tt8946378</td>\n",
1859 |        "      <td>movie</td>\n",
1860 |        "      <td>2019</td>\n",
1861 |        "      <td>Knives Out</td>\n",
1862 |        "      <td>Knives Out</td>\n",
1863 |        "      <td>131</td>\n",
1864 |        "      <td>Comedy,Crime,Drama</td>\n",
1865 |        "      <td>8.0</td>\n",
1866 |        "      <td>140682</td>\n",
1867 |        "    </tr>\n",
1868 |        "  </tbody>\n",
1869 |        "</table>\n",
1870 |        "</div>"
1871 |       ],
1872 |       "text/plain": [
1873 |        "           tconst titleType  year                        primaryTitle  \\\n",
1874 |        "51794   tt0076759     movie  1977  Star Wars: Episode IV - A New Hope   \n",
1875 |        "998631  tt8946378     movie  2019                          Knives Out   \n",
1876 |        "\n",
1877 |        "       originalTitle runtimeMinutes                    genres  averageRating  \\\n",
1878 |        "51794      Star Wars            121  Action,Adventure,Fantasy            8.6   \n",
1879 |        "998631    Knives Out            131        Comedy,Crime,Drama            8.0   \n",
1880 |        "\n",
1881 |        "        numVotes  \n",
1882 |        "51794    1170498  \n",
1883 |        "998631    140682  "
1884 |       ]
1885 |      },
1886 |      "execution_count": 26,
1887 |      "metadata": {},
1888 |      "output_type": "execute_result"
1889 |     }
1890 |    ],
1891 |    "source": [
1892 |     "dft = df_slim[(df_slim['tconst'].isin(['tt8946378','tt0076759']))]\n",
1893 |     "dft"
1894 |    ]
1895 |   },
1896 |   {
1897 |    "cell_type": "markdown",
1898 |    "metadata": {},
1899 |    "source": [
1900 |     "Down to about 240,000 rows after those modifications.\n",
1901 |     "\n",
1902 |     "I pulled up a quick sample to check. I've watched both of these movies, and the genres jump out at me:\n",
1903 |     "- Star Wars is missing Sci-Fi\n",
1904 |     "- Knives Out is missing Mystery\n",
1905 |     "\n",
1906 |     "It turns out that genres in this dataset are capped at 3 per title, and only the first 3 alphabetically are kept. That is not reliable enough to classify genres correctly. As you can see from the screenshots (https://www.imdb.com/title/tt0076759/), there are actually 4 genres associated with this title on its IMDb page.\n",
1907 |     "\n",
1908 |     "I am going to scrape each title's page for all the information I want (storyline / plot summary, FULL genre list). All I need is IMDb's title ID, which is `tconst` in this dataset.\n",
1909 |     "\n",
1910 |     "![](images/imdb-top.png)\n",
1911 |     "\n",
1912 |     "![](images/imdb-bottom.png)\n",
1913 |     "\n",
1914 |     "---"
1915 |    ]
1916 |   },
1917 |   {
1918 |    "cell_type": "markdown",
1919 |    "metadata": {},
1920 |    "source": [
1921 |     "### New Plan:\n",
1922 |     "#### Export list of IMDb IDs so I can build the scraping URLs"
1923 |    ]
1924 |   },
1925 |   {
1926 |    "cell_type": "code",
1927 |    "execution_count": 51,
1928 |    "metadata": {},
1929 |    "outputs": [
1930 |     {
1931 |      "data": {
1932 |       "text/plain": [
1933 |        "tconst             object\n",
1934 |        "titleType          object\n",
1935 |        "year               object\n",
1936 |        "primaryTitle       object\n",
1937 |        "originalTitle      object\n",
1938 |        "runtimeMinutes     object\n",
1939 |        "genres             object\n",
1940 |        "averageRating     float64\n",
1941 |        "numVotes            int64\n",
1942 |        "dtype: object"
1943 |       ]
1944 |      },
1945 |      "execution_count": 51,
1946 |      "metadata": {},
1947 |      "output_type": "execute_result"
1948 |     }
1949 |    ],
1950 |    "source": [
1951 |     "df_slim.dtypes"
1952 |    ]
1953 |   },
1954 |   {
1955 |    "cell_type": "markdown",
1956 |    "metadata": {},
1957 |    "source": [
1958 |     "I want to do integer comparisons on `year`, but it is currently an `object`. I need to fix that now."
1959 |    ]
1960 |   },
1961 |   {
1962 |    "cell_type": "code",
1963 |    "execution_count": 53,
1964 |    "metadata": {},
1965 |    "outputs": [],
1966 |    "source": [
1967 |     "# Convert year from string (object) to int so it can be compared numerically\n",
1968 |     "df_slim['year'] = df_slim['year'].fillna(0.0).astype(int)"
1969 |    ]
1970 |   },
1971 |   {
1972 |    "cell_type": "code",
1973 |    "execution_count": 55,
1974 |    "metadata": {},
1975 |    "outputs": [
1976 |     {
1977 |      "data": {
1978 |       "text/html": [
1979 |        "<div>\n",
1980 |        "<style scoped>\n",
1981 |        "    .dataframe tbody tr th:only-of-type {\n",
1982 |        "        vertical-align: middle;\n",
1983 |        "    }\n",
1984 |        "\n",
1985 |        "    .dataframe tbody tr th {\n",
1986 |        "        vertical-align: top;\n",
1987 |        "    }\n",
1988 |        "\n",
1989 |        "    .dataframe thead th {\n",
1990 |        "        text-align: right;\n",
1991 |        "    }\n",
1992 |        "</style>\n",
1993 |        "<table border=\"1\" class=\"dataframe\">\n",
1994 |        "  <thead>\n",
1995 |        "    <tr style=\"text-align: right;\">\n",
1996 |        "      <th></th>\n",
1997 |        "      <th>tconst</th>\n",
1998 |        "      <th>titleType</th>\n",
1999 |        "      <th>year</th>\n",
2000 |        "      <th>primaryTitle</th>\n",
2001 |        "      <th>originalTitle</th>\n",
2002 |        "      <th>runtimeMinutes</th>\n",
2003 |        "      <th>genres</th>\n",
2004 |        "      <th>averageRating</th>\n",
2005 |        "      <th>numVotes</th>\n",
2006 |        "    </tr>\n",
2007 |        "  </thead>\n",
2008 |        "  <tbody>\n",
2009 |        "    <tr>\n",
2010 |        "      <td>80744</td>\n",
2011 |        "      <td>tt0111161</td>\n",
2012 |        "      <td>movie</td>\n",
2013 |        "      <td>1994</td>\n",
2014 |        "      <td>The Shawshank Redemption</td>\n",
2015 |        "      <td>The Shawshank Redemption</td>\n",
2016 |        "      <td>142</td>\n",
2017 |        "      <td>Drama</td>\n",
2018 |        "      <td>9.3</td>\n",
2019 |        "      <td>2203956</td>\n",
2020 |        "    </tr>\n",
2021 |        "    <tr>\n",
2022 |        "      <td>241990</td>\n",
2023 |        "      <td>tt0468569</td>\n",
2024 |        "      <td>movie</td>\n",
2025 |        "      <td>2008</td>\n",
2026 |        "      <td>The Dark Knight</td>\n",
2027 |        "      <td>The Dark Knight</td>\n",
2028 |        "      <td>152</td>\n",
2029 |        "      <td>Action,Crime,Drama</td>\n",
2030 |        "      <td>9.0</td>\n",
2031 |        "      <td>2184629</td>\n",
2032 |        "    </tr>\n",
2033 |        "    <tr>\n",
2034 |        "      <td>530003</td>\n",
2035 |        "      <td>tt1375666</td>\n",
2036 |        "      <td>movie</td>\n",
2037 |        "      <td>2010</td>\n",
2038 |        "      <td>Inception</td>\n",
2039 |        "      <td>Inception</td>\n",
2040 |        "      <td>148</td>\n",
2041 |        "      <td>Action,Adventure,Sci-Fi</td>\n",
2042 |        "      <td>8.8</td>\n",
2043 |        "      <td>1933557</td>\n",
2044 |        "    </tr>\n",
2045 |        "    <tr>\n",
2046 |        "      <td>96873</td>\n",
2047 |        "      <td>tt0137523</td>\n",
2048 |        "      <td>movie</td>\n",
2049 |        "      <td>1999</td>\n",
2050 |        "      <td>Fight Club</td>\n",
2051 |        "      <td>Fight Club</td>\n",
2052 |        "      <td>139</td>\n",
2053 |        "      <td>Drama</td>\n",
2054 |        "      <td>8.8</td>\n",
2055 |        "      <td>1759843</td>\n",
2056 |        "    </tr>\n",
2057 |        "    <tr>\n",
2058 |        "      <td>80528</td>\n",
2059 |        "      <td>tt0110912</td>\n",
2060 |        "      <td>movie</td>\n",
2061 |        "      <td>1994</td>\n",
2062 |        "      <td>Pulp Fiction</td>\n",
2063 |        "      <td>Pulp Fiction</td>\n",
2064 |        "      <td>154</td>\n",
2065 |        "      <td>Crime,Drama</td>\n",
2066 |        "      <td>8.9</td>\n",
2067 |        "      <td>1731665</td>\n",
2068 |        "    </tr>\n",
2069 |        "    <tr>\n",
2070 |        "      <td>...</td>\n",
2071 |        "      <td>...</td>\n",
2072 |        "      <td>...</td>\n",
2073 |        "      <td>...</td>\n",
2074 |        "      <td>...</td>\n",
2075 |        "      <td>...</td>\n",
2076 |        "      <td>...</td>\n",
2077 |        "      <td>...</td>\n",
2078 |        "      <td>...</td>\n",
2079 |        "      <td>...</td>\n",
2080 |        "    </tr>\n",
2081 |        "    <tr>\n",
2082 |        "      <td>843975</td>\n",
2083 |        "      <td>tt5275476</td>\n",
2084 |        "      <td>movie</td>\n",
2085 |        "      <td>2017</td>\n",
2086 |        "      <td>Bedbugs</td>\n",
2087 |        "      <td>Fikkefuchs</td>\n",
2088 |        "      <td>101</td>\n",
2089 |        "      <td>Comedy,Drama</td>\n",
2090 |        "      <td>6.2</td>\n",
2091 |        "      <td>1001</td>\n",
2092 |        "    </tr>\n",
2093 |        "    <tr>\n",
2094 |        "      <td>56437</td>\n",
2095 |        "      <td>tt0082210</td>\n",
2096 |        "      <td>movie</td>\n",
2097 |        "      <td>1981</td>\n",
2098 |        "      <td>El crack</td>\n",
2099 |        "      <td>El crack</td>\n",
2100 |        "      <td>131</td>\n",
2101 |        "      <td>Crime,Drama,Mystery</td>\n",
2102 |        "      <td>7.3</td>\n",
2103 |        "      <td>1001</td>\n",
2104 |        "    </tr>\n",
2105 |        "    <tr>\n",
2106 |        "      <td>25596</td>\n",
2107 |        "      <td>tt0045992</td>\n",
2108 |        "      <td>movie</td>\n",
2109 |        "      <td>1952</td>\n",
2110 |        "      <td>The Lawless Breed</td>\n",
2111 |        "      <td>The Lawless Breed</td>\n",
2112 |        "      <td>83</td>\n",
2113 |        "      <td>Western</td>\n",
2114 |        "      <td>6.3</td>\n",
2115 |        "      <td>1001</td>\n",
2116 |        "    </tr>\n",
2117 |        "    <tr>\n",
2118 |        "      <td>520460</td>\n",
2119 |        "      <td>tt1327833</td>\n",
2120 |        "      <td>movie</td>\n",
2121 |        "      <td>2008</td>\n",
2122 |        "      <td>Sorry Bhai!</td>\n",
2123 |        "      <td>Sorry Bhai!</td>\n",
2124 |        "      <td>154</td>\n",
2125 |        "      <td>Comedy,Drama,Romance</td>\n",
2126 |        "      <td>6.1</td>\n",
2127 |        "      <td>1001</td>\n",
2128 |        "    </tr>\n",
2129 |        "    <tr>\n",
2130 |        "      <td>244603</td>\n",
2131 |        "      <td>tt0475627</td>\n",
2132 |        "      <td>movie</td>\n",
2133 |        "      <td>2005</td>\n",
2134 |        "      <td>Shikhar</td>\n",
2135 |        "      <td>Shikhar</td>\n",
2136 |        "      <td>162</td>\n",
2137 |        "      <td>Drama</td>\n",
2138 |        "      <td>4.9</td>\n",
2139 |        "      <td>1001</td>\n",
2140 |        "    </tr>\n",
2141 |        "  </tbody>\n",
2142 |        "</table>\n",
2143 |        "<p>30403 rows × 9 columns</p>\n",
2144 |        "</div>"
2145 |       ],
2146 |       "text/plain": [
2147 |        "           tconst titleType  year              primaryTitle  \\\n",
2148 |        "80744   tt0111161     movie  1994  The Shawshank Redemption   \n",
2149 |        "241990  tt0468569     movie  2008           The Dark Knight   \n",
2150 |        "530003  tt1375666     movie  2010                 Inception   \n",
2151 |        "96873   tt0137523     movie  1999                Fight Club   \n",
2152 |        "80528   tt0110912     movie  1994              Pulp Fiction   \n",
2153 |        "...           ...       ...   ...                       ...   \n",
2154 |        "843975  tt5275476     movie  2017                   Bedbugs   \n",
2155 |        "56437   tt0082210     movie  1981                  El crack   \n",
2156 |        "25596   tt0045992     movie  1952         The Lawless Breed   \n",
2157 |        "520460  tt1327833     movie  2008               Sorry Bhai!   \n",
2158 |        "244603  tt0475627     movie  2005                   Shikhar   \n",
2159 |        "\n",
2160 |        "                   originalTitle runtimeMinutes                   genres  \\\n",
2161 |        "80744   The Shawshank Redemption            142                    Drama   \n",
2162 |        "241990           The Dark Knight            152       Action,Crime,Drama   \n",
2163 |        "530003                 Inception            148  Action,Adventure,Sci-Fi   \n",
2164 |        "96873                 Fight Club            139                    Drama   \n",
2165 |        "80528               Pulp Fiction            154              Crime,Drama   \n",
2166 |        "...                          ...            ...                      ...   \n",
2167 |        "843975                Fikkefuchs            101             Comedy,Drama   \n",
2168 |        "56437                   El crack            131      Crime,Drama,Mystery   \n",
2169 |        "25596          The Lawless Breed             83                  Western   \n",
2170 |        "520460               Sorry Bhai!            154     Comedy,Drama,Romance   \n",
2171 |        "244603                   Shikhar            162                    Drama   \n",
2172 |        "\n",
2173 |        "        averageRating  numVotes  \n",
2174 |        "80744             9.3   2203956  \n",
2175 |        "241990            9.0   2184629  \n",
2176 |        "530003            8.8   1933557  \n",
2177 |        "96873             8.8   1759843  \n",
2178 |        "80528             8.9   1731665  \n",
2179 |        "...               ...       ...  \n",
2180 |        "843975            6.2      1001  \n",
2181 |        "56437             7.3      1001  \n",
2182 |        "25596             6.3      1001  \n",
2183 |        "520460            6.1      1001  \n",
2184 |        "244603            4.9      1001  \n",
2185 |        "\n",
2186 |        "[30403 rows x 9 columns]"
2187 |       ]
2188 |      },
2189 |      "metadata": {},
2190 |      "output_type": "display_data"
2191 |     }
2192 |    ],
2193 |    "source": [
2194 |     "final_df = df_slim[(df_slim['year'] >= 1920) & (df_slim['numVotes'] > 1000)].sort_values(['numVotes'], ascending=False)\n",
2195 |     "display(final_df)"
2196 |    ]
2197 |   },
2198 |   {
2199 |    "cell_type": "markdown",
2200 |    "metadata": {},
2201 |    "source": [
2202 |     "This looks like a good amount: approx. 30,000 titles to scrape. I'm using an arbitrary threshold where `numVotes` is greater than 1000. I'm hoping entries that popular have a decent level of accuracy.\n",
2203 |     "\n",
2204 |     "Additionally, I'm limiting it to 1920 and later for an even '100 years' of movies.\n",
2205 |     "\n",
2206 |     "---\n",
2207 |     "\n",
2208 |     "I'm not going to scrape 2020, but this cell is for future use, to see how much 'new' training data I will have."
2209 |    ]
2210 |   },
2211 |   {
2212 |    "cell_type": "code",
2213 |    "execution_count": 61,
2214 |    "metadata": {},
2215 |    "outputs": [
2216 |     {
2217 |      "data": {
2218 |       "text/plain": [
2219 |        "(71, 9)"
2220 |       ]
2221 |      },
2222 |      "execution_count": 61,
2223 |      "metadata": {},
2224 |      "output_type": "execute_result"
2225 |     }
2226 |    ],
2227 |    "source": [
2228 |     "final_df[final_df['year'] >= 2020].shape"
2229 |    ]
2230 |   },
2231 |   {
2232 |    "cell_type": "markdown",
2233 |    "metadata": {},
2234 |    "source": [
2235 |     "Finally, export this list to a `.csv` file so I can look up each title by its `tconst` id later when scraping."
2236 |    ]
2237 |   },
2238 |   {
2239 |    "cell_type": "code",
2240 |    "execution_count": 60,
2241 |    "metadata": {},
2242 |    "outputs": [],
2243 |    "source": [
2244 |     "final_df.to_csv('imdb_movie_list.csv', header=True, index=False)"
2245 |    ]
2246 |   },
2247 |   {
2248 |    "cell_type": "markdown",
2249 |    "metadata": {},
2250 |    "source": [
2251 |     "---"
2252 |    ]
2253 |   },
2254 |   {
2255 |    "cell_type": "markdown",
2256 |    "metadata": {},
2257 |    "source": [
2258 |     "Genre Genie - Multi-label classification using NLP\n",
2259 |     "\n",
2260 |     "Tom Keith - 2020"
2261 |    ]
2262 |   }
2263 |  ],
2264 |  "metadata": {
2265 |   "kernelspec": {
2266 |    "display_name": "Python 3",
2267 |    "language": "python",
2268 |    "name": "python3"
2269 |   },
2270 |   "language_info": {
2271 |    "codemirror_mode": {
2272 |     "name": "ipython",
2273 |     "version": 3
2274 |    },
2275 |    "file_extension": ".py",
2276 |    "mimetype": "text/x-python",
2277 |    "name": "python",
2278 |    "nbconvert_exporter": "python",
2279 |    "pygments_lexer": "ipython3",
2280 |    "version": "3.7.4"
2281 |   }
2282 |  },
2283 |  "nbformat": 4,
2284 |  "nbformat_minor": 4
2285 | }
2286 | 


--------------------------------------------------------------------------------
/3.1-modeling.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "## Genre Genie - Multi-label Classification with NLP\n",
   8 |     "### Part 3.1: Modeling using OneVsRest\n",
   9 |     "\n",
  10 |     "#### Tom Keith\n",
  11 |     "\n",
  12 |     "---\n",
  13 |     "\n",
  14 |     "**Goal:** Fit and optimize a multi-label classification model on the train set, then score it on the test set."
  15 |    ]
  16 |   },
  17 |   {
  18 |    "cell_type": "code",
  19 |    "execution_count": 1,
  20 |    "metadata": {},
  21 |    "outputs": [],
  22 |    "source": [
  23 |     "# Standard imports\n",
  24 |     "import numpy as np\n",
  25 |     "import pandas as pd\n",
  26 |     "pd.set_option('display.max_columns', 500)\n",
  27 |     "import matplotlib.pyplot as plt\n",
  28 |     "%matplotlib inline\n",
  29 |     "import joblib"
  30 |    ]
  31 |   },
  32 |   {
  33 |    "cell_type": "markdown",
  34 |    "metadata": {},
  35 |    "source": [
  36 |     "Import the train and test dataframes from the previous step. They are large files, so loading takes a while (about 30 seconds here)."
  37 |    ]
  38 |   },
  39 |   {
  40 |    "cell_type": "code",
  41 |    "execution_count": 2,
  42 |    "metadata": {},
  43 |    "outputs": [
  44 |     {
  45 |      "name": "stdout",
  46 |      "output_type": "stream",
  47 |      "text": [
  48 |       "Wall time: 29.3 s\n"
  49 |      ]
  50 |     }
  51 |    ],
  52 |    "source": [
  53 |     "%%time\n",
  54 |     "train = pd.read_csv('data/train_dataframe.tsv', sep='\\t', index_col=0)\n",
  55 |     "test = pd.read_csv('data/test_dataframe.tsv', sep='\\t', index_col=0)"
  56 |    ]
  57 |   },
  58 |   {
  59 |    "cell_type": "markdown",
  60 |    "metadata": {},
  61 |    "source": [
  62 |     "Put the genre and feature names (the columns that aren't words) into lists for easy use."
  63 |    ]
  64 |   },
  65 |   {
  66 |    "cell_type": "code",
  67 |    "execution_count": 3,
  68 |    "metadata": {},
  69 |    "outputs": [
  70 |     {
  71 |      "name": "stdout",
  72 |      "output_type": "stream",
  73 |      "text": [
  74 |       "22\n",
  75 |       "['g_action', 'g_adventure', 'g_animation', 'g_biography', 'g_comedy', 'g_crime', 'g_documentary', 'g_drama', 'g_family', 'g_fantasy', 'g_film-noir', 'g_history', 'g_horror', 'g_music', 'g_musical', 'g_mystery', 'g_romance', 'g_sci-fi', 'g_sport', 'g_thriller', 'g_war', 'g_western']\n"
  76 |      ]
  77 |     }
  78 |    ],
  79 |    "source": [
  80 |     "cols = list(train.columns.values)\n",
  81 |     "genre_cols = cols[-22:]\n",
  82 |     "print(len(genre_cols))\n",
  83 |     "print(genre_cols)"
  84 |    ]
  85 |   },
  86 |   {
  87 |    "cell_type": "code",
  88 |    "execution_count": 4,
  89 |    "metadata": {},
  90 |    "outputs": [],
  91 |    "source": [
  92 |     "f_names = cols[:8]"
  93 |    ]
  94 |   },
  95 |   {
  96 |    "cell_type": "markdown",
  97 |    "metadata": {},
  98 |    "source": [
  99 |     "Separate X and y from the train and test dataframes. We want only the genre columns for `y` and everything except the genre columns for `X`."
 100 |    ]
 101 |   },
 102 |   {
 103 |    "cell_type": "code",
 104 |    "execution_count": 5,
 105 |    "metadata": {},
 106 |    "outputs": [],
 107 |    "source": [
 108 |     "X_train = train[train.columns[~train.columns.isin(genre_cols)]]\n",
 109 |     "y_train = train[train.columns[ train.columns.isin(genre_cols)]]\n",
 110 |     "\n",
 111 |     "X_test = test[test.columns[~test.columns.isin(genre_cols)]]\n",
 112 |     "y_test = test[test.columns[ test.columns.isin(genre_cols)]]"
 113 |    ]
 114 |   },
 115 |   {
 116 |    "cell_type": "markdown",
 117 |    "metadata": {},
 118 |    "source": [
 119 |     "---\n",
 120 |     "\n",
 121 |     "Before running a model, we need to scale our data. Both standard and min-max scaling were tested, and the standard scaler came out on top."
 122 |    ]
 123 |   },
 124 |   {
 125 |    "cell_type": "code",
 126 |    "execution_count": 8,
 127 |    "metadata": {},
 128 |    "outputs": [],
 129 |    "source": [
 130 |     "%%time\n",
 131 |     "# Scale data (Standard Scaler)\n",
 132 |     "from sklearn.preprocessing import StandardScaler \n",
 133 |     "my_standard_scaler = StandardScaler().fit(X_train)\n",
 134 |     "X_train_s = my_standard_scaler.transform(X_train)\n",
 135 |     "X_test_s = my_standard_scaler.transform(X_test)\n",
 136 |     "\n",
 137 |     "#joblib.dump(my_standard_scaler, 'models/my_standard_scaler.pkl')"
 138 |    ]
 139 |   },
 140 |   {
 141 |    "cell_type": "code",
 142 |    "execution_count": 98,
 143 |    "metadata": {},
 144 |    "outputs": [
 145 |     {
 146 |      "data": {
 147 |       "text/plain": [
 148 |        "['models/my_minmax_scaler.pkl']"
 149 |       ]
 150 |      },
 151 |      "execution_count": 98,
 152 |      "metadata": {},
 153 |      "output_type": "execute_result"
 154 |     }
 155 |    ],
 156 |    "source": [
 157 |     "# Scale data (MinMax Scaler)\n",
 158 |     "from sklearn.preprocessing import MinMaxScaler \n",
 159 |     "my_minmax_scaler = MinMaxScaler().fit(X_train)\n",
 160 |     "X_train_mm = my_minmax_scaler.transform(X_train)\n",
 161 |     "X_test_mm = my_minmax_scaler.transform(X_test)\n",
 162 |     "\n",
 163 |     "#joblib.dump(my_minmax_scaler, 'models/my_minmax_scaler.pkl')"
 164 |    ]
 165 |   },
 166 |   {
 167 |    "cell_type": "markdown",
 168 |    "metadata": {},
 169 |    "source": [
 170 |     "---\n",
 171 |     "\n",
 172 |     "### Please note\n",
 173 |     "Many models were tested and pickled. The optimized model comes first; everything after it is testing of other models, scalers, scoring, and hyperparameter tuning. I normally would not include all of it, but it remains for completeness.\n",
 174 |     "\n",
 175 |     "In the end, OneVsRest with Logistic Regression (C=0.01, solver='lbfgs') when scaled with a standard scaler was the best option."
 176 |    ]
 177 |   },
 178 |   {
 179 |    "cell_type": "code",
 180 |    "execution_count": 10,
 181 |    "metadata": {},
 182 |    "outputs": [],
 183 |    "source": [
 184 |     "import joblib\n",
 185 |     "#my_model = joblib.load('models/my_1vr_linear_svc_default.pkl')"
 186 |    ]
 187 |   },
 188 |   {
 189 |    "cell_type": "code",
 190 |    "execution_count": 9,
 191 |    "metadata": {},
 192 |    "outputs": [],
 193 |    "source": [
 194 |     "from sklearn.multiclass import OneVsRestClassifier\n",
 195 |     "from sklearn.svm import LinearSVC\n",
 196 |     "from sklearn.linear_model import LogisticRegression"
 197 |    ]
 198 |   },
 199 |   {
 200 |    "cell_type": "code",
 201 |    "execution_count": 113,
 202 |    "metadata": {},
 203 |    "outputs": [
 204 |     {
 205 |      "name": "stdout",
 206 |      "output_type": "stream",
 207 |      "text": [
 208 |       "[0.09696028 0.08988016 0.09786951 0.09476254 0.09476254]\n",
 209 |       "Fold 1: 0.09696028400266253\n",
 210 |       "Fold 2: 0.08988015978695073\n",
 211 |       "Fold 3: 0.09786950732356857\n",
 212 |       "Fold 4: 0.09476253883710609\n",
 213 |       "Fold 5: 0.09476253883710609\n",
 214 |       "Average Score:0.0948470057574788\n"
 215 |      ]
 216 |     }
 217 |    ],
 218 |    "source": [
 219 |     "from sklearn.model_selection import cross_val_score\n",
 220 |     "my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), n_jobs=-1)\n",
 221 |     "\n",
 222 |     "scores = cross_val_score(my_log_model, X_train_s, y_train, cv = 5)\n",
 223 |     "print(scores)\n",
 224 |     "\n",
 225 |     "for i in range(len(scores)) :\n",
 226 |     "    print(f\"Fold {i+1}: {scores[i]}\")\n",
 227 |     "print(f\"Average Score:{np.mean(scores)}\")"
 228 |    ]
 229 |   },
 230 |   {
 231 |    "cell_type": "code",
 232 |    "execution_count": 10,
 233 |    "metadata": {},
 234 |    "outputs": [
 235 |     {
 236 |      "name": "stdout",
 237 |      "output_type": "stream",
 238 |      "text": [
 239 |       "Wall time: 59.2 s\n"
 240 |      ]
 241 |     }
 242 |    ],
 243 |    "source": [
 244 |     "%%time\n",
 245 |     "my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), n_jobs=-1).fit(X_train_s, y_train)"
 246 |    ]
 247 |   },
 248 |   {
 249 |    "cell_type": "code",
 250 |    "execution_count": 11,
 251 |    "metadata": {},
 252 |    "outputs": [],
 253 |    "source": [
 254 |     "y_train_pred = my_log_model.predict(X_train_s)\n",
 255 |     "y_train_proba = my_log_model.predict_proba(X_train_s)\n",
 256 |     "y_test_pred = my_log_model.predict(X_test_s)\n",
 257 |     "y_test_proba = my_log_model.predict_proba(X_test_s)"
 258 |    ]
 259 |   },
 260 |   {
 261 |    "cell_type": "code",
 262 |    "execution_count": 13,
 263 |    "metadata": {},
 264 |    "outputs": [
 265 |     {
 266 |      "name": "stdout",
 267 |      "output_type": "stream",
 268 |      "text": [
 269 |       "Training score: 0.54290\n",
 270 |       "    Test score: 0.10558\n"
 271 |      ]
 272 |     }
 273 |    ],
 274 |    "source": [
 275 |     "from sklearn.metrics import accuracy_score  # on multi-label targets this is exact-match (subset) accuracy\n",
 276 |     "print(f'Training score: {accuracy_score(y_train, y_train_pred):0.5f}')\n",
 277 |     "print(f'    Test score: {accuracy_score(y_test, y_test_pred):0.5f}')"
 278 |    ]
 279 |   },
 280 |   {
 281 |    "cell_type": "code",
 282 |    "execution_count": 16,
 283 |    "metadata": {},
 284 |    "outputs": [
 285 |     {
 286 |      "name": "stdout",
 287 |      "output_type": "stream",
 288 |      "text": [
 289 |       "0.8350  g_action\n",
 290 |       "0.8771  g_adventure\n",
 291 |       "0.9703  g_animation\n",
 292 |       "0.9397  g_biography\n",
 293 |       "0.7344  g_comedy\n",
 294 |       "0.8410  g_crime\n",
 295 |       "0.9730  g_documentary\n",
 296 |       "0.6931  g_drama\n",
 297 |       "0.9318  g_family\n",
 298 |       "0.9140  g_fantasy\n",
 299 |       "0.9892  g_film-noir\n",
 300 |       "0.9506  g_history\n",
 301 |       "0.9116  g_horror\n",
 302 |       "0.9384  g_music\n",
 303 |       "0.9657  g_musical\n",
 304 |       "0.8823  g_mystery\n",
 305 |       "0.7822  g_romance\n",
 306 |       "0.9369  g_sci-fi\n",
 307 |       "0.9794  g_sport\n",
 308 |       "0.7769  g_thriller\n",
 309 |       "0.9611  g_war\n",
 310 |       "0.9867  g_western\n"
 311 |      ]
 312 |     }
 313 |    ],
 314 |    "source": [
 315 |     "y_pred_df = pd.DataFrame(y_test_pred, columns=genre_cols)\n",
 316 |     "\n",
 317 |     "# Test set predictions\n",
 318 |     "for g in genre_cols:\n",
 319 |     "    score = accuracy_score(y_test[g], y_pred_df[g])\n",
 320 |     "    print(f'{score:0.4f}  {g}')"
 321 |    ]
 322 |   },
 323 |   {
 324 |    "cell_type": "markdown",
 325 |    "metadata": {},
 326 |    "source": [
 327 |     "\n",
 328 |     "---\n",
 329 |     "\n",
 330 |     "## Everything below is model testing and optimizing"
 331 |    ]
 332 |   },
 333 |   {
 334 |    "cell_type": "code",
 335 |    "execution_count": 79,
 336 |    "metadata": {},
 337 |    "outputs": [
 338 |     {
 339 |      "name": "stdout",
 340 |      "output_type": "stream",
 341 |      "text": [
 342 |       "Wall time: 22min 6s\n"
 343 |      ]
 344 |     }
 345 |    ],
 346 |    "source": [
 347 |     "%%time\n",
 348 |     "#my_model = OneVsRestClassifier(LinearSVC(random_state=123, max_iter=3000), n_jobs=-1).fit(X_train_s, y_train)"
 349 |    ]
 350 |   },
 351 |   {
 352 |    "cell_type": "code",
 353 |    "execution_count": 92,
 354 |    "metadata": {},
 355 |    "outputs": [
 356 |     {
 357 |      "name": "stderr",
 358 |      "output_type": "stream",
 359 |      "text": [
 360 |       "C:\\Users\\Tom\\Anaconda3\\lib\\site-packages\\sklearn\\externals\\joblib\\__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.\n",
 361 |       "  warnings.warn(msg, category=DeprecationWarning)\n"
 362 |      ]
 363 |     },
 364 |     {
 365 |      "data": {
 366 |       "text/plain": [
 367 |        "['models/my_1vr_linear_svc_default.pkl']"
 368 |       ]
 369 |      },
 370 |      "execution_count": 92,
 371 |      "metadata": {},
 372 |      "output_type": "execute_result"
 373 |     }
 374 |    ],
 375 |    "source": [
 376 |     "# EXPORT AND SAVE THE MODEL\n",
 377 |     "#joblib.dump(my_model, 'models/my_1vr_linear_svc_default.pkl')"
 378 |    ]
 379 |   },
 380 |   {
 381 |    "cell_type": "code",
 382 |    "execution_count": null,
 383 |    "metadata": {},
 384 |    "outputs": [],
 385 |    "source": [
 386 |     "y_pred = my_model.predict(X_test_s)"
 387 |    ]
 388 |   },
 389 |   {
 390 |    "cell_type": "code",
 391 |    "execution_count": 12,
 392 |    "metadata": {},
 393 |    "outputs": [],
 394 |    "source": [
 395 |     "y_train_pred = my_model.predict(X_train_s)"
 396 |    ]
 397 |   },
 398 |   {
 399 |    "cell_type": "code",
 400 |    "execution_count": 13,
 401 |    "metadata": {},
 402 |    "outputs": [
 403 |     {
 404 |      "data": {
 405 |       "text/plain": [
 406 |        "True"
 407 |       ]
 408 |      },
 409 |      "execution_count": 13,
 410 |      "metadata": {},
 411 |      "output_type": "execute_result"
 412 |     }
 413 |    ],
 414 |    "source": [
 415 |     "my_model.multilabel_\n",
 416 |     "#my_model.predict_proba(X_train_s)"
 417 |    ]
 418 |   },
 419 |   {
 420 |    "cell_type": "code",
 421 |    "execution_count": 90,
 422 |    "metadata": {},
 423 |    "outputs": [
 424 |     {
 425 |      "name": "stdout",
 426 |      "output_type": "stream",
 427 |      "text": [
 428 |       "Training score: 0.49665\n",
 429 |       "    Test score: 0.04367\n"
 430 |      ]
 431 |     }
 432 |    ],
 433 |    "source": [
 434 |     "from sklearn.metrics import accuracy_score\n",
 435 |     "print(f'Training score: {accuracy_score(y_train, y_train_pred):0.5f}')\n",
 436 |     "print(f'    Test score: {accuracy_score(y_test, y_pred):0.5f}')"
 437 |    ]
 438 |   },
 439 |   {
 440 |    "cell_type": "code",
 441 |    "execution_count": 14,
 442 |    "metadata": {},
 443 |    "outputs": [],
 444 |    "source": [
 445 |     "from sklearn.metrics import confusion_matrix\n",
 446 |     "from sklearn.metrics import accuracy_score\n",
 447 |     "from sklearn.metrics import precision_score\n",
 448 |     "from sklearn.metrics import recall_score\n",
 449 |     "from sklearn.metrics import f1_score\n",
 450 |     "from sklearn.metrics import multilabel_confusion_matrix"
 451 |    ]
 452 |   },
 453 |   {
 454 |    "cell_type": "code",
 455 |    "execution_count": 67,
 456 |    "metadata": {},
 457 |    "outputs": [
 458 |     {
 459 |      "data": {
 460 |       "text/plain": [
 461 |        "array([[7360,   70],\n",
 462 |        "       [  63,   18]], dtype=int64)"
 463 |       ]
 464 |      },
 465 |      "execution_count": 67,
 466 |      "metadata": {},
 467 |      "output_type": "execute_result"
 468 |     }
 469 |    ],
 470 |    "source": [
 471 |     "# Confusion Matrix\n",
 472 |     "cm = multilabel_confusion_matrix(y_test, y_pred)\n",
 473 |     "\n",
 474 |     "g_cm_list = []\n",
 475 |     "for g in cm:\n",
 476 |     "    g_cm_list.append(pd.DataFrame(g, columns=['Predicted Negative (0)', 'Predicted Positive (1)'],\n",
 477 |     "                                  index=['Actual Negative (0)', 'Actual Positive (1)']))\n",
 478 |     "\n",
 479 |     "g_cm_list[10].values  # index 10 is g_film-noir"
 480 |    ]
 481 |   },
 482 |   {
 483 |    "cell_type": "code",
 484 |    "execution_count": 86,
 485 |    "metadata": {},
 486 |    "outputs": [],
 487 |    "source": [
 488 |     "y_train_pred_df = pd.DataFrame(y_train_pred)\n",
 489 |     "y_train_pred_df.columns = genre_cols\n",
 490 |     "y_pred_df = pd.DataFrame(y_pred)\n",
 491 |     "y_pred_df.columns = genre_cols"
 492 |    ]
 493 |   },
 494 |   {
 495 |    "cell_type": "code",
 496 |    "execution_count": 17,
 497 |    "metadata": {},
 498 |    "outputs": [
 499 |     {
 500 |      "name": "stdout",
 501 |      "output_type": "stream",
 502 |      "text": [
 503 |       "0.8350  g_action\n",
 504 |       "0.8771  g_adventure\n",
 505 |       "0.9703  g_animation\n",
 506 |       "0.9397  g_biography\n",
 507 |       "0.7344  g_comedy\n",
 508 |       "0.8410  g_crime\n",
 509 |       "0.9730  g_documentary\n",
 510 |       "0.6931  g_drama\n",
 511 |       "0.9318  g_family\n",
 512 |       "0.9140  g_fantasy\n",
 513 |       "0.9892  g_film-noir\n",
 514 |       "0.9506  g_history\n",
 515 |       "0.9116  g_horror\n",
 516 |       "0.9384  g_music\n",
 517 |       "0.9657  g_musical\n",
 518 |       "0.8823  g_mystery\n",
 519 |       "0.7822  g_romance\n",
 520 |       "0.9369  g_sci-fi\n",
 521 |       "0.9794  g_sport\n",
 522 |       "0.7769  g_thriller\n",
 523 |       "0.9611  g_war\n",
 524 |       "0.9867  g_western\n"
 525 |      ]
 526 |     }
 527 |    ],
 528 |    "source": [
 529 |     "test_acc_dict = {}\n",
 530 |     "# Test set predictions\n",
 531 |     "for g in genre_cols:\n",
 532 |     "    score = accuracy_score(y_test[g], y_pred_df[g])\n",
 533 |     "    test_acc_dict.update( {g[2:] : score} )\n",
 534 |     "    print(f'{score:0.4f}  {g}')"
 535 |    ]
 536 |   },
 537 |   {
 538 |    "cell_type": "code",
 539 |    "execution_count": 18,
 540 |    "metadata": {},
 541 |    "outputs": [],
 542 |    "source": [
 543 |     "test_scores = pd.DataFrame.from_dict(test_acc_dict, orient='index', columns=['score'])"
 544 |    ]
 545 |   },
 546 |   {
 547 |    "cell_type": "code",
 548 |    "execution_count": 19,
 549 |    "metadata": {},
 550 |    "outputs": [],
 551 |    "source": [
 552 |     "test_scores.to_csv('test_scores_last_model.csv', index_label='genre')"
 553 |    ]
 554 |   },
 555 |   {
 556 |    "cell_type": "code",
 557 |    "execution_count": 22,
 558 |    "metadata": {},
 559 |    "outputs": [],
 560 |    "source": [
 561 |     "coefs = my_model.coef_"
 562 |    ]
 563 |   },
 564 |   {
 565 |    "cell_type": "code",
 566 |    "execution_count": 75,
 567 |    "metadata": {},
 568 |    "outputs": [
 569 |     {
 570 |      "data": {
 571 |       "text/html": [
 572 |        "<div>\n",
 573 |        "<style scoped>\n",
 574 |        "    .dataframe tbody tr th:only-of-type {\n",
 575 |        "        vertical-align: middle;\n",
 576 |        "    }\n",
 577 |        "\n",
 578 |        "    .dataframe tbody tr th {\n",
 579 |        "        vertical-align: top;\n",
 580 |        "    }\n",
 581 |        "\n",
 582 |        "    .dataframe thead th {\n",
 583 |        "        text-align: right;\n",
 584 |        "    }\n",
 585 |        "</style>\n",
 586 |        "<table border=\"1\" class=\"dataframe\">\n",
 587 |        "  <thead>\n",
 588 |        "    <tr style=\"text-align: right;\">\n",
 589 |        "      <th></th>\n",
 590 |        "      <th>g_action</th>\n",
 591 |        "      <th>g_adventure</th>\n",
 592 |        "      <th>g_animation</th>\n",
 593 |        "      <th>g_biography</th>\n",
 594 |        "      <th>g_comedy</th>\n",
 595 |        "      <th>g_crime</th>\n",
 596 |        "      <th>g_documentary</th>\n",
 597 |        "      <th>g_drama</th>\n",
 598 |        "      <th>g_family</th>\n",
 599 |        "      <th>g_fantasy</th>\n",
 600 |        "      <th>g_film-noir</th>\n",
 601 |        "      <th>g_history</th>\n",
 602 |        "      <th>g_horror</th>\n",
 603 |        "      <th>g_music</th>\n",
 604 |        "      <th>g_musical</th>\n",
 605 |        "      <th>g_mystery</th>\n",
 606 |        "      <th>g_romance</th>\n",
 607 |        "      <th>g_sci-fi</th>\n",
 608 |        "      <th>g_sport</th>\n",
 609 |        "      <th>g_thriller</th>\n",
 610 |        "      <th>g_war</th>\n",
 611 |        "      <th>g_western</th>\n",
 612 |        "    </tr>\n",
 613 |        "  </thead>\n",
 614 |        "  <tbody>\n",
 615 |        "    <tr>\n",
 616 |        "      <td>f_release_year</td>\n",
 617 |        "      <td>0.890920</td>\n",
 618 |        "      <td>-0.202585</td>\n",
 619 |        "      <td>0.205493</td>\n",
 620 |        "      <td>0.156150</td>\n",
 621 |        "      <td>0.090189</td>\n",
 622 |        "      <td>0.207143</td>\n",
 623 |        "      <td>0.112173</td>\n",
 624 |        "      <td>-0.024676</td>\n",
 625 |        "      <td>-0.005681</td>\n",
 626 |        "      <td>0.000552</td>\n",
 627 |        "      <td>-0.192294</td>\n",
 628 |        "      <td>0.262658</td>\n",
 629 |        "      <td>-0.124317</td>\n",
 630 |        "      <td>-0.247146</td>\n",
 631 |        "      <td>-0.202726</td>\n",
 632 |        "      <td>-0.000374</td>\n",
 633 |        "      <td>-0.163564</td>\n",
 634 |        "      <td>-0.082269</td>\n",
 635 |        "      <td>-0.013732</td>\n",
 636 |        "      <td>0.092245</td>\n",
 637 |        "      <td>0.168980</td>\n",
 638 |        "      <td>0.168377</td>\n",
 639 |        "    </tr>\n",
 640 |        "    <tr>\n",
 641 |        "      <td>f_release_month</td>\n",
 642 |        "      <td>0.162641</td>\n",
 643 |        "      <td>0.093539</td>\n",
 644 |        "      <td>0.032958</td>\n",
 645 |        "      <td>-0.002919</td>\n",
 646 |        "      <td>-0.007082</td>\n",
 647 |        "      <td>-0.054702</td>\n",
 648 |        "      <td>-0.018309</td>\n",
 649 |        "      <td>0.026233</td>\n",
 650 |        "      <td>0.056573</td>\n",
 651 |        "      <td>0.120411</td>\n",
 652 |        "      <td>0.000712</td>\n",
 653 |        "      <td>-0.090634</td>\n",
 654 |        "      <td>-0.025620</td>\n",
 655 |        "      <td>0.030871</td>\n",
 656 |        "      <td>0.001300</td>\n",
 657 |        "      <td>0.005833</td>\n",
 658 |        "      <td>0.013643</td>\n",
 659 |        "      <td>0.013330</td>\n",
 660 |        "      <td>-0.010611</td>\n",
 661 |        "      <td>0.009509</td>\n",
 662 |        "      <td>-0.108588</td>\n",
 663 |        "      <td>-0.121271</td>\n",
 664 |        "    </tr>\n",
 665 |        "    <tr>\n",
 666 |        "      <td>f_runtime</td>\n",
 667 |        "      <td>0.997796</td>\n",
 668 |        "      <td>0.052923</td>\n",
 669 |        "      <td>-0.329772</td>\n",
 670 |        "      <td>0.151680</td>\n",
 671 |        "      <td>-0.136051</td>\n",
 672 |        "      <td>-0.098478</td>\n",
 673 |        "      <td>-0.125171</td>\n",
 674 |        "      <td>0.161513</td>\n",
 675 |        "      <td>-0.157185</td>\n",
 676 |        "      <td>-0.139964</td>\n",
 677 |        "      <td>-0.046469</td>\n",
 678 |        "      <td>0.234240</td>\n",
 679 |        "      <td>-0.372855</td>\n",
 680 |        "      <td>0.120316</td>\n",
 681 |        "      <td>0.074502</td>\n",
 682 |        "      <td>0.034689</td>\n",
 683 |        "      <td>0.109954</td>\n",
 684 |        "      <td>-0.061331</td>\n",
 685 |        "      <td>0.012625</td>\n",
 686 |        "      <td>0.053091</td>\n",
 687 |        "      <td>0.236038</td>\n",
 688 |        "      <td>0.229005</td>\n",
 689 |        "    </tr>\n",
 690 |        "    <tr>\n",
 691 |        "      <td>f_word_count_long</td>\n",
 692 |        "      <td>0.221205</td>\n",
 693 |        "      <td>0.002507</td>\n",
 694 |        "      <td>0.022708</td>\n",
 695 |        "      <td>0.008983</td>\n",
 696 |        "      <td>-0.089621</td>\n",
 697 |        "      <td>-0.157927</td>\n",
 698 |        "      <td>0.015894</td>\n",
 699 |        "      <td>-0.003234</td>\n",
 700 |        "      <td>0.103574</td>\n",
 701 |        "      <td>0.235768</td>\n",
 702 |        "      <td>0.010266</td>\n",
 703 |        "      <td>0.010166</td>\n",
 704 |        "      <td>-0.077899</td>\n",
 705 |        "      <td>0.029570</td>\n",
 706 |        "      <td>0.042731</td>\n",
 707 |        "      <td>0.011146</td>\n",
 708 |        "      <td>-0.027546</td>\n",
 709 |        "      <td>0.017361</td>\n",
 710 |        "      <td>0.024619</td>\n",
 711 |        "      <td>0.003008</td>\n",
 712 |        "      <td>-0.013937</td>\n",
 713 |        "      <td>-0.003826</td>\n",
 714 |        "    </tr>\n",
 715 |        "    <tr>\n",
 716 |        "      <td>f_imdb_rating</td>\n",
 717 |        "      <td>-0.992456</td>\n",
 718 |        "      <td>-0.122722</td>\n",
 719 |        "      <td>0.277040</td>\n",
 720 |        "      <td>0.168530</td>\n",
 721 |        "      <td>-0.071904</td>\n",
 722 |        "      <td>1.286002</td>\n",
 723 |        "      <td>0.241488</td>\n",
 724 |        "      <td>0.239085</td>\n",
 725 |        "      <td>0.045930</td>\n",
 726 |        "      <td>0.015202</td>\n",
 727 |        "      <td>0.052162</td>\n",
 728 |        "      <td>0.150826</td>\n",
 729 |        "      <td>-0.554526</td>\n",
 730 |        "      <td>0.047041</td>\n",
 731 |        "      <td>0.012359</td>\n",
 732 |        "      <td>0.118412</td>\n",
 733 |        "      <td>-0.026726</td>\n",
 734 |        "      <td>-0.234360</td>\n",
 735 |        "      <td>0.013724</td>\n",
 736 |        "      <td>-0.086579</td>\n",
 737 |        "      <td>0.051071</td>\n",
 738 |        "      <td>0.007038</td>\n",
 739 |        "    </tr>\n",
 740 |        "    <tr>\n",
 741 |        "      <td>...</td>\n",
 742 |        "      <td>...</td>\n",
 743 |        "      <td>...</td>\n",
 744 |        "      <td>...</td>\n",
 745 |        "      <td>...</td>\n",
 746 |        "      <td>...</td>\n",
 747 |        "      <td>...</td>\n",
 748 |        "      <td>...</td>\n",
 749 |        "      <td>...</td>\n",
 750 |        "      <td>...</td>\n",
 751 |        "      <td>...</td>\n",
 752 |        "      <td>...</td>\n",
 753 |        "      <td>...</td>\n",
 754 |        "      <td>...</td>\n",
 755 |        "      <td>...</td>\n",
 756 |        "      <td>...</td>\n",
 757 |        "      <td>...</td>\n",
 758 |        "      <td>...</td>\n",
 759 |        "      <td>...</td>\n",
 760 |        "      <td>...</td>\n",
 761 |        "      <td>...</td>\n",
 762 |        "      <td>...</td>\n",
 763 |        "      <td>...</td>\n",
 764 |        "    </tr>\n",
 765 |        "    <tr>\n",
 766 |        "      <td>zealand</td>\n",
 767 |        "      <td>0.201899</td>\n",
 768 |        "      <td>-0.011669</td>\n",
 769 |        "      <td>-0.000164</td>\n",
 770 |        "      <td>0.038771</td>\n",
 771 |        "      <td>-0.038884</td>\n",
 772 |        "      <td>-0.200695</td>\n",
 773 |        "      <td>0.013058</td>\n",
 774 |        "      <td>0.013915</td>\n",
 775 |        "      <td>0.007938</td>\n",
 776 |        "      <td>0.020752</td>\n",
 777 |        "      <td>-0.001004</td>\n",
 778 |        "      <td>-0.008869</td>\n",
 779 |        "      <td>-0.048644</td>\n",
 780 |        "      <td>0.001566</td>\n",
 781 |        "      <td>-0.016636</td>\n",
 782 |        "      <td>-0.099761</td>\n",
 783 |        "      <td>0.006259</td>\n",
 784 |        "      <td>-0.014353</td>\n",
 785 |        "      <td>0.009601</td>\n",
 786 |        "      <td>0.003707</td>\n",
 787 |        "      <td>0.014746</td>\n",
 788 |        "      <td>0.001099</td>\n",
 789 |        "    </tr>\n",
 790 |        "    <tr>\n",
 791 |        "      <td>zero</td>\n",
 792 |        "      <td>0.145570</td>\n",
 793 |        "      <td>-0.053593</td>\n",
 794 |        "      <td>0.022223</td>\n",
 795 |        "      <td>0.015668</td>\n",
 796 |        "      <td>0.019288</td>\n",
 797 |        "      <td>-0.090709</td>\n",
 798 |        "      <td>-0.003745</td>\n",
 799 |        "      <td>0.002425</td>\n",
 800 |        "      <td>-0.020840</td>\n",
 801 |        "      <td>0.058833</td>\n",
 802 |        "      <td>0.003737</td>\n",
 803 |        "      <td>0.005958</td>\n",
 804 |        "      <td>-0.093422</td>\n",
 805 |        "      <td>0.027760</td>\n",
 806 |        "      <td>0.030987</td>\n",
 807 |        "      <td>0.093844</td>\n",
 808 |        "      <td>0.031099</td>\n",
 809 |        "      <td>0.011547</td>\n",
 810 |        "      <td>0.006995</td>\n",
 811 |        "      <td>0.014213</td>\n",
 812 |        "      <td>0.008819</td>\n",
 813 |        "      <td>0.001835</td>\n",
 814 |        "    </tr>\n",
 815 |        "    <tr>\n",
 816 |        "      <td>zombi</td>\n",
 817 |        "      <td>-0.028788</td>\n",
 818 |        "      <td>-0.057147</td>\n",
 819 |        "      <td>-0.007456</td>\n",
 820 |        "      <td>0.005884</td>\n",
 821 |        "      <td>0.020771</td>\n",
 822 |        "      <td>-0.279043</td>\n",
 823 |        "      <td>-0.000784</td>\n",
 824 |        "      <td>-0.027079</td>\n",
 825 |        "      <td>-0.005703</td>\n",
 826 |        "      <td>-0.038465</td>\n",
 827 |        "      <td>0.003023</td>\n",
 828 |        "      <td>-0.020218</td>\n",
 829 |        "      <td>0.341943</td>\n",
 830 |        "      <td>0.012054</td>\n",
 831 |        "      <td>0.030725</td>\n",
 832 |        "      <td>-0.135719</td>\n",
 833 |        "      <td>-0.007812</td>\n",
 834 |        "      <td>0.022717</td>\n",
 835 |        "      <td>-0.009578</td>\n",
 836 |        "      <td>0.004558</td>\n",
 837 |        "      <td>-0.052168</td>\n",
 838 |        "      <td>-0.011931</td>\n",
 839 |        "    </tr>\n",
 840 |        "    <tr>\n",
 841 |        "      <td>zone</td>\n",
 842 |        "      <td>0.188484</td>\n",
 843 |        "      <td>0.066910</td>\n",
 844 |        "      <td>-0.011633</td>\n",
 845 |        "      <td>-0.034049</td>\n",
 846 |        "      <td>0.013658</td>\n",
 847 |        "      <td>-0.053566</td>\n",
 848 |        "      <td>0.012062</td>\n",
 849 |        "      <td>-0.014565</td>\n",
 850 |        "      <td>-0.003714</td>\n",
 851 |        "      <td>0.043539</td>\n",
 852 |        "      <td>-0.001000</td>\n",
 853 |        "      <td>0.025871</td>\n",
 854 |        "      <td>-0.015155</td>\n",
 855 |        "      <td>0.039160</td>\n",
 856 |        "      <td>-0.007195</td>\n",
 857 |        "      <td>-0.075479</td>\n",
 858 |        "      <td>-0.015189</td>\n",
 859 |        "      <td>0.036640</td>\n",
 860 |        "      <td>-0.033840</td>\n",
 861 |        "      <td>-0.004590</td>\n",
 862 |        "      <td>0.025977</td>\n",
 863 |        "      <td>-0.003066</td>\n",
 864 |        "    </tr>\n",
 865 |        "    <tr>\n",
 866 |        "      <td>zoo</td>\n",
 867 |        "      <td>-0.039962</td>\n",
 868 |        "      <td>0.073740</td>\n",
 869 |        "      <td>0.018455</td>\n",
 870 |        "      <td>-0.023292</td>\n",
 871 |        "      <td>0.014912</td>\n",
 872 |        "      <td>-0.352289</td>\n",
 873 |        "      <td>-0.006540</td>\n",
 874 |        "      <td>-0.004921</td>\n",
 875 |        "      <td>0.044167</td>\n",
 876 |        "      <td>0.052392</td>\n",
 877 |        "      <td>-0.002906</td>\n",
 878 |        "      <td>-0.003083</td>\n",
 879 |        "      <td>0.011880</td>\n",
 880 |        "      <td>0.013114</td>\n",
 881 |        "      <td>-0.004452</td>\n",
 882 |        "      <td>0.041213</td>\n",
 883 |        "      <td>0.002608</td>\n",
 884 |        "      <td>-0.005972</td>\n",
 885 |        "      <td>-0.011298</td>\n",
 886 |        "      <td>-0.001489</td>\n",
 887 |        "      <td>-0.008891</td>\n",
 888 |        "      <td>0.003304</td>\n",
 889 |        "    </tr>\n",
 890 |        "  </tbody>\n",
 891 |        "</table>\n",
 892 |        "<p>5459 rows × 22 columns</p>\n",
 893 |        "</div>"
 894 |       ],
 895 |       "text/plain": [
 896 |        "                   g_action  g_adventure  g_animation  g_biography  g_comedy  \\\n",
 897 |        "f_release_year     0.890920    -0.202585     0.205493     0.156150  0.090189   \n",
 898 |        "f_release_month    0.162641     0.093539     0.032958    -0.002919 -0.007082   \n",
 899 |        "f_runtime          0.997796     0.052923    -0.329772     0.151680 -0.136051   \n",
 900 |        "f_word_count_long  0.221205     0.002507     0.022708     0.008983 -0.089621   \n",
 901 |        "f_imdb_rating     -0.992456    -0.122722     0.277040     0.168530 -0.071904   \n",
 902 |        "...                     ...          ...          ...          ...       ...   \n",
 903 |        "zealand            0.201899    -0.011669    -0.000164     0.038771 -0.038884   \n",
 904 |        "zero               0.145570    -0.053593     0.022223     0.015668  0.019288   \n",
 905 |        "zombi             -0.028788    -0.057147    -0.007456     0.005884  0.020771   \n",
 906 |        "zone               0.188484     0.066910    -0.011633    -0.034049  0.013658   \n",
 907 |        "zoo               -0.039962     0.073740     0.018455    -0.023292  0.014912   \n",
 908 |        "\n",
 909 |        "                    g_crime  g_documentary   g_drama  g_family  g_fantasy  \\\n",
 910 |        "f_release_year     0.207143       0.112173 -0.024676 -0.005681   0.000552   \n",
 911 |        "f_release_month   -0.054702      -0.018309  0.026233  0.056573   0.120411   \n",
 912 |        "f_runtime         -0.098478      -0.125171  0.161513 -0.157185  -0.139964   \n",
 913 |        "f_word_count_long -0.157927       0.015894 -0.003234  0.103574   0.235768   \n",
 914 |        "f_imdb_rating      1.286002       0.241488  0.239085  0.045930   0.015202   \n",
 915 |        "...                     ...            ...       ...       ...        ...   \n",
 916 |        "zealand           -0.200695       0.013058  0.013915  0.007938   0.020752   \n",
 917 |        "zero              -0.090709      -0.003745  0.002425 -0.020840   0.058833   \n",
 918 |        "zombi             -0.279043      -0.000784 -0.027079 -0.005703  -0.038465   \n",
 919 |        "zone              -0.053566       0.012062 -0.014565 -0.003714   0.043539   \n",
 920 |        "zoo               -0.352289      -0.006540 -0.004921  0.044167   0.052392   \n",
 921 |        "\n",
 922 |        "                   g_film-noir  g_history  g_horror   g_music  g_musical  \\\n",
 923 |        "f_release_year       -0.192294   0.262658 -0.124317 -0.247146  -0.202726   \n",
 924 |        "f_release_month       0.000712  -0.090634 -0.025620  0.030871   0.001300   \n",
 925 |        "f_runtime            -0.046469   0.234240 -0.372855  0.120316   0.074502   \n",
 926 |        "f_word_count_long     0.010266   0.010166 -0.077899  0.029570   0.042731   \n",
 927 |        "f_imdb_rating         0.052162   0.150826 -0.554526  0.047041   0.012359   \n",
 928 |        "...                        ...        ...       ...       ...        ...   \n",
 929 |        "zealand              -0.001004  -0.008869 -0.048644  0.001566  -0.016636   \n",
 930 |        "zero                  0.003737   0.005958 -0.093422  0.027760   0.030987   \n",
 931 |        "zombi                 0.003023  -0.020218  0.341943  0.012054   0.030725   \n",
 932 |        "zone                 -0.001000   0.025871 -0.015155  0.039160  -0.007195   \n",
 933 |        "zoo                  -0.002906  -0.003083  0.011880  0.013114  -0.004452   \n",
 934 |        "\n",
 935 |        "                   g_mystery  g_romance  g_sci-fi   g_sport  g_thriller  \\\n",
 936 |        "f_release_year     -0.000374  -0.163564 -0.082269 -0.013732    0.092245   \n",
 937 |        "f_release_month     0.005833   0.013643  0.013330 -0.010611    0.009509   \n",
 938 |        "f_runtime           0.034689   0.109954 -0.061331  0.012625    0.053091   \n",
 939 |        "f_word_count_long   0.011146  -0.027546  0.017361  0.024619    0.003008   \n",
 940 |        "f_imdb_rating       0.118412  -0.026726 -0.234360  0.013724   -0.086579   \n",
 941 |        "...                      ...        ...       ...       ...         ...   \n",
 942 |        "zealand            -0.099761   0.006259 -0.014353  0.009601    0.003707   \n",
 943 |        "zero                0.093844   0.031099  0.011547  0.006995    0.014213   \n",
 944 |        "zombi              -0.135719  -0.007812  0.022717 -0.009578    0.004558   \n",
 945 |        "zone               -0.075479  -0.015189  0.036640 -0.033840   -0.004590   \n",
 946 |        "zoo                 0.041213   0.002608 -0.005972 -0.011298   -0.001489   \n",
 947 |        "\n",
 948 |        "                      g_war  g_western  \n",
 949 |        "f_release_year     0.168980   0.168377  \n",
 950 |        "f_release_month   -0.108588  -0.121271  \n",
 951 |        "f_runtime          0.236038   0.229005  \n",
 952 |        "f_word_count_long -0.013937  -0.003826  \n",
 953 |        "f_imdb_rating      0.051071   0.007038  \n",
 954 |        "...                     ...        ...  \n",
 955 |        "zealand            0.014746   0.001099  \n",
 956 |        "zero               0.008819   0.001835  \n",
 957 |        "zombi             -0.052168  -0.011931  \n",
 958 |        "zone               0.025977  -0.003066  \n",
 959 |        "zoo               -0.008891   0.003304  \n",
 960 |        "\n",
 961 |        "[5459 rows x 22 columns]"
 962 |       ]
 963 |      },
 964 |      "execution_count": 75,
 965 |      "metadata": {},
 966 |      "output_type": "execute_result"
 967 |     }
 968 |    ],
 969 |    "source": [
 970 |     "coef_df = pd.DataFrame(coefs, index=genre_cols, columns=X_train.columns)\n",
 971 |     "coef_tdf = coef_df.T\n",
 972 |     "coef_tdf"
 973 |    ]
 974 |   },
 975 |   {
 976 |    "cell_type": "code",
 977 |    "execution_count": 29,
 978 |    "metadata": {},
 979 |    "outputs": [],
 980 |    "source": [
 981 |     "coef_tdf.to_csv('my_1vr_linear_svc_default_coef.tsv', sep='\\t')"
 982 |    ]
 983 |   },
1012 |   {
1013 |    "cell_type": "code",
1014 |    "execution_count": 69,
1015 |    "metadata": {},
1016 |    "outputs": [
1017 |     {
1018 |      "name": "stdout",
1019 |      "output_type": "stream",
1020 |      "text": [
1021 |       "Wall time: 8min 26s\n"
1022 |      ]
1023 |     },
1024 |     {
1025 |      "data": {
1026 |       "text/plain": [
1027 |        "['models/my_1vr_logreg_default.pkl']"
1028 |       ]
1029 |      },
1030 |      "execution_count": 69,
1031 |      "metadata": {},
1032 |      "output_type": "execute_result"
1033 |     }
1034 |    ],
1035 |    "source": [
1036 |     "%%time\n",
1037 |     "my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, max_iter=3000), n_jobs=-1).fit(X_train_s, y_train)\n",
1038 |     "\n",
1039 |     "# EXPORT AND SAVE THE MODEL\n",
1040 |     "joblib.dump(my_log_model, 'models/my_1vr_logreg_default.pkl')"
1041 |    ]
1042 |   },
1043 |   {
1044 |    "cell_type": "code",
1045 |    "execution_count": 106,
1046 |    "metadata": {},
1047 |    "outputs": [
1048 |     {
1049 |      "name": "stdout",
1050 |      "output_type": "stream",
1051 |      "text": [
1052 |       "Wall time: 11 s\n"
1053 |      ]
1054 |     },
1055 |     {
1056 |      "data": {
1057 |       "text/plain": [
1058 |        "['models/my_1vr_logreg_minmax_0.01.pkl']"
1059 |       ]
1060 |      },
1061 |      "execution_count": 106,
1062 |      "metadata": {},
1063 |      "output_type": "execute_result"
1064 |     }
1065 |    ],
1066 |    "source": [
1067 |     "%%time\n",
1068 |     "my_log_model_mm = OneVsRestClassifier(LogisticRegression(random_state=123, max_iter=3000, C=0.01), n_jobs=-1).fit(X_train_mm, y_train)\n",
1069 |     "\n",
1070 |     "# EXPORT AND SAVE THE MODEL\n",
1071 |     "joblib.dump(my_log_model_mm, 'models/my_1vr_logreg_minmax_0.01.pkl')"
1072 |    ]
1073 |   },
1074 |   {
1075 |    "cell_type": "code",
1076 |    "execution_count": 109,
1077 |    "metadata": {},
1078 |    "outputs": [
1079 |     {
1080 |      "name": "stdout",
1081 |      "output_type": "stream",
1082 |      "text": [
1083 |       "Train: 0.08490524166703653\n",
1084 |       " Test: 0.07189455465317535\n",
1085 |       "0.8136  g_action\n",
1086 |       "0.8852  g_adventure\n",
1087 |       "0.9691  g_animation\n",
1088 |       "0.9478  g_biography\n",
1089 |       "0.6686  g_comedy\n",
1090 |       "0.8396  g_crime\n",
1091 |       "0.9525  g_documentary\n",
1092 |       "0.6873  g_drama\n",
1093 |       "0.9406  g_family\n",
1094 |       "0.9249  g_fantasy\n",
1095 |       "0.9892  g_film-noir\n",
1096 |       "0.9561  g_history\n",
1097 |       "0.8645  g_horror\n",
1098 |       "0.9342  g_music\n",
1099 |       "0.9698  g_musical\n",
1100 |       "0.9087  g_mystery\n",
1101 |       "0.7774  g_romance\n",
1102 |       "0.9179  g_sci-fi\n",
1103 |       "0.9708  g_sport\n",
1104 |       "0.7594  g_thriller\n",
1105 |       "0.9490  g_war\n",
1106 |       "0.9775  g_western\n"
1107 |      ]
1108 |     }
1109 |    ],
1110 |    "source": [
1111 |     "y_pred_log_mm = my_log_model_mm.predict(X_test_mm)\n",
1112 |     "y_train_pred_log_mm = my_log_model_mm.predict(X_train_mm)\n",
1113 |     "from sklearn.metrics import accuracy_score\n",
1114 |     "print(f'Train: {accuracy_score(y_train, y_train_pred_log_mm)}')\n",
1115 |     "print(f' Test: {accuracy_score(y_test, y_pred_log_mm)}')\n",
1116 |     "\n",
1117 |     "y_train_pred_log_mm_df = pd.DataFrame(y_train_pred_log_mm)\n",
1118 |     "y_train_pred_log_mm_df.columns = genre_cols\n",
1119 |     "\n",
1120 |     "y_pred_log_mm_df = pd.DataFrame(y_pred_log_mm)\n",
1121 |     "y_pred_log_mm_df.columns = genre_cols\n",
1122 |     "\n",
1123 |     "#test_acc_dict = {}\n",
1124 |     "# Test set predictions\n",
1125 |     "for g in genre_cols:\n",
1126 |     "    score = accuracy_score(y_test[g], y_pred_log_mm_df[g])\n",
1127 |     "    #test_acc_dict.update( {g[2:] : score} )\n",
1128 |     "    print(f'{score:0.4f}  {g}')"
1129 |    ]
1130 |   },
1145 |   {
1146 |    "cell_type": "code",
1147 |    "execution_count": 101,
1148 |    "metadata": {},
1149 |    "outputs": [],
1150 |    "source": [
1151 |     "y_pred_log = my_log_model.predict(X_test_s)"
1152 |    ]
1153 |   },
1154 |   {
1155 |    "cell_type": "code",
1156 |    "execution_count": 71,
1157 |    "metadata": {},
1158 |    "outputs": [],
1159 |    "source": [
1160 |     "y_train_pred_log = my_log_model.predict(X_train_s)"
1161 |    ]
1162 |   },
1163 |   {
1164 |    "cell_type": "code",
1165 |    "execution_count": 72,
1166 |    "metadata": {},
1167 |    "outputs": [
1168 |     {
1169 |      "data": {
1170 |       "text/plain": [
1171 |        "True"
1172 |       ]
1173 |      },
1174 |      "execution_count": 72,
1175 |      "metadata": {},
1176 |      "output_type": "execute_result"
1177 |     }
1178 |    ],
1179 |    "source": [
1180 |     "my_log_model.multilabel_\n",
1181 |     "#my_model.predict_proba(X_train_s)"
1182 |    ]
1183 |   },
1184 |   {
1185 |    "cell_type": "code",
1186 |    "execution_count": 73,
1187 |    "metadata": {},
1188 |    "outputs": [
1189 |     {
1190 |      "data": {
1191 |       "text/plain": [
1192 |        "0.622076250499312"
1193 |       ]
1194 |      },
1195 |      "execution_count": 73,
1196 |      "metadata": {},
1197 |      "output_type": "execute_result"
1198 |     }
1199 |    ],
1200 |    "source": [
1201 |     "from sklearn.metrics import accuracy_score\n",
1202 |     "accuracy_score(y_train, y_train_pred_log)"
1203 |    ]
1204 |   },
1205 |   {
1206 |    "cell_type": "code",
1207 |    "execution_count": 80,
1208 |    "metadata": {},
1209 |    "outputs": [
1210 |     {
1211 |      "data": {
1212 |       "text/plain": [
1213 |        "0.06403940886699508"
1214 |       ]
1215 |      },
1216 |      "execution_count": 80,
1217 |      "metadata": {},
1218 |      "output_type": "execute_result"
1219 |     }
1220 |    ],
1221 |    "source": [
1222 |     "from sklearn.metrics import accuracy_score\n",
1223 |     "accuracy_score(y_test, y_pred_log)"
1224 |    ]
1225 |   },
1226 |   {
1227 |    "cell_type": "code",
1228 |    "execution_count": 81,
1229 |    "metadata": {},
1230 |    "outputs": [],
1231 |    "source": [
1232 |     "y_train_pred_log_df = pd.DataFrame(y_train_pred_log)\n",
1233 |     "y_train_pred_log_df.columns = genre_cols\n",
1234 |     "\n",
1235 |     "y_pred_log_df = pd.DataFrame(y_pred_log)\n",
1236 |     "y_pred_log_df.columns = genre_cols"
1237 |    ]
1238 |   },
1239 |   {
1240 |    "cell_type": "code",
1241 |    "execution_count": 82,
1242 |    "metadata": {},
1243 |    "outputs": [
1244 |     {
1245 |      "name": "stdout",
1246 |      "output_type": "stream",
1247 |      "text": [
1248 |       "0.7809  g_action\n",
1249 |       "0.8340  g_adventure\n",
1250 |       "0.9561  g_animation\n",
1251 |       "0.9081  g_biography\n",
1252 |       "0.7252  g_comedy\n",
1253 |       "0.7891  g_crime\n",
1254 |       "0.9686  g_documentary\n",
1255 |       "0.6874  g_drama\n",
1256 |       "0.8992  g_family\n",
1257 |       "0.8768  g_fantasy\n",
1258 |       "0.9874  g_film-noir\n",
1259 |       "0.9217  g_history\n",
1260 |       "0.8882  g_horror\n",
1261 |       "0.9101  g_music\n",
1262 |       "0.9449  g_musical\n",
1263 |       "0.8338  g_mystery\n",
1264 |       "0.7617  g_romance\n",
1265 |       "0.9103  g_sci-fi\n",
1266 |       "0.9767  g_sport\n",
1267 |       "0.7566  g_thriller\n",
1268 |       "0.9451  g_war\n",
1269 |       "0.9840  g_western\n"
1270 |      ]
1271 |     }
1272 |    ],
1273 |    "source": [
1274 |     "test_acc_dict = {}\n",
1275 |     "# Test set predictions\n",
1276 |     "for g in genre_cols:\n",
1277 |     "    score = accuracy_score(y_test[g], y_pred_log_df[g])\n",
1278 |     "    test_acc_dict.update( {g[2:] : score} )\n",
1279 |     "    print(f'{score:0.4f}  {g}')"
1280 |    ]
1281 |   },
1282 |   {
1283 |    "cell_type": "code",
1284 |    "execution_count": 83,
1285 |    "metadata": {},
1286 |    "outputs": [],
1287 |    "source": [
1288 |     "test_scores_log = pd.DataFrame.from_dict(test_acc_dict, orient='index', columns=['score'])"
1289 |    ]
1290 |   },
1291 |   {
1292 |    "cell_type": "code",
1293 |    "execution_count": 84,
1294 |    "metadata": {},
1295 |    "outputs": [],
1296 |    "source": [
1297 |     "test_scores_log.to_csv('test_scores_model1.csv', index_label='genre')"
1298 |    ]
1299 |   },
1300 |   {
1301 |    "cell_type": "code",
1302 |    "execution_count": 85,
1303 |    "metadata": {},
1304 |    "outputs": [
1305 |     {
1306 |      "data": {
1307 |       "text/html": [
1308 |        "<div>\n",
1309 |        "<style scoped>\n",
1310 |        "    .dataframe tbody tr th:only-of-type {\n",
1311 |        "        vertical-align: middle;\n",
1312 |        "    }\n",
1313 |        "\n",
1314 |        "    .dataframe tbody tr th {\n",
1315 |        "        vertical-align: top;\n",
1316 |        "    }\n",
1317 |        "\n",
1318 |        "    .dataframe thead th {\n",
1319 |        "        text-align: right;\n",
1320 |        "    }\n",
1321 |        "</style>\n",
1322 |        "<table border=\"1\" class=\"dataframe\">\n",
1323 |        "  <thead>\n",
1324 |        "    <tr style=\"text-align: right;\">\n",
1325 |        "      <th></th>\n",
1326 |        "      <th>g_action</th>\n",
1327 |        "      <th>g_adventure</th>\n",
1328 |        "      <th>g_animation</th>\n",
1329 |        "      <th>g_biography</th>\n",
1330 |        "      <th>g_comedy</th>\n",
1331 |        "      <th>g_crime</th>\n",
1332 |        "      <th>g_documentary</th>\n",
1333 |        "      <th>g_drama</th>\n",
1334 |        "      <th>g_family</th>\n",
1335 |        "      <th>g_fantasy</th>\n",
1336 |        "      <th>g_film-noir</th>\n",
1337 |        "      <th>g_history</th>\n",
1338 |        "      <th>g_horror</th>\n",
1339 |        "      <th>g_music</th>\n",
1340 |        "      <th>g_musical</th>\n",
1341 |        "      <th>g_mystery</th>\n",
1342 |        "      <th>g_romance</th>\n",
1343 |        "      <th>g_sci-fi</th>\n",
1344 |        "      <th>g_sport</th>\n",
1345 |        "      <th>g_thriller</th>\n",
1346 |        "      <th>g_war</th>\n",
1347 |        "      <th>g_western</th>\n",
1348 |        "    </tr>\n",
1349 |        "  </thead>\n",
1350 |        "  <tbody>\n",
1351 |        "    <tr>\n",
1352 |        "      <td>f_release_year</td>\n",
1353 |        "      <td>0.890920</td>\n",
1354 |        "      <td>-0.202585</td>\n",
1355 |        "      <td>0.205493</td>\n",
1356 |        "      <td>0.156150</td>\n",
1357 |        "      <td>0.090189</td>\n",
1358 |        "      <td>0.207143</td>\n",
1359 |        "      <td>0.112173</td>\n",
1360 |        "      <td>-0.024676</td>\n",
1361 |        "      <td>-0.005681</td>\n",
1362 |        "      <td>0.000552</td>\n",
1363 |        "      <td>-0.192294</td>\n",
1364 |        "      <td>0.262658</td>\n",
1365 |        "      <td>-0.124317</td>\n",
1366 |        "      <td>-0.247146</td>\n",
1367 |        "      <td>-0.202726</td>\n",
1368 |        "      <td>-0.000374</td>\n",
1369 |        "      <td>-0.163564</td>\n",
1370 |        "      <td>-0.082269</td>\n",
1371 |        "      <td>-0.013732</td>\n",
1372 |        "      <td>0.092245</td>\n",
1373 |        "      <td>0.168980</td>\n",
1374 |        "      <td>0.168377</td>\n",
1375 |        "    </tr>\n",
1376 |        "    <tr>\n",
1377 |        "      <td>f_release_month</td>\n",
1378 |        "      <td>0.162641</td>\n",
1379 |        "      <td>0.093539</td>\n",
1380 |        "      <td>0.032958</td>\n",
1381 |        "      <td>-0.002919</td>\n",
1382 |        "      <td>-0.007082</td>\n",
1383 |        "      <td>-0.054702</td>\n",
1384 |        "      <td>-0.018309</td>\n",
1385 |        "      <td>0.026233</td>\n",
1386 |        "      <td>0.056573</td>\n",
1387 |        "      <td>0.120411</td>\n",
1388 |        "      <td>0.000712</td>\n",
1389 |        "      <td>-0.090634</td>\n",
1390 |        "      <td>-0.025620</td>\n",
1391 |        "      <td>0.030871</td>\n",
1392 |        "      <td>0.001300</td>\n",
1393 |        "      <td>0.005833</td>\n",
1394 |        "      <td>0.013643</td>\n",
1395 |        "      <td>0.013330</td>\n",
1396 |        "      <td>-0.010611</td>\n",
1397 |        "      <td>0.009509</td>\n",
1398 |        "      <td>-0.108588</td>\n",
1399 |        "      <td>-0.121271</td>\n",
1400 |        "    </tr>\n",
1401 |        "    <tr>\n",
1402 |        "      <td>f_runtime</td>\n",
1403 |        "      <td>0.997796</td>\n",
1404 |        "      <td>0.052923</td>\n",
1405 |        "      <td>-0.329772</td>\n",
1406 |        "      <td>0.151680</td>\n",
1407 |        "      <td>-0.136051</td>\n",
1408 |        "      <td>-0.098478</td>\n",
1409 |        "      <td>-0.125171</td>\n",
1410 |        "      <td>0.161513</td>\n",
1411 |        "      <td>-0.157185</td>\n",
1412 |        "      <td>-0.139964</td>\n",
1413 |        "      <td>-0.046469</td>\n",
1414 |        "      <td>0.234240</td>\n",
1415 |        "      <td>-0.372855</td>\n",
1416 |        "      <td>0.120316</td>\n",
1417 |        "      <td>0.074502</td>\n",
1418 |        "      <td>0.034689</td>\n",
1419 |        "      <td>0.109954</td>\n",
1420 |        "      <td>-0.061331</td>\n",
1421 |        "      <td>0.012625</td>\n",
1422 |        "      <td>0.053091</td>\n",
1423 |        "      <td>0.236038</td>\n",
1424 |        "      <td>0.229005</td>\n",
1425 |        "    </tr>\n",
1426 |        "    <tr>\n",
1427 |        "      <td>f_word_count_long</td>\n",
1428 |        "      <td>0.221205</td>\n",
1429 |        "      <td>0.002507</td>\n",
1430 |        "      <td>0.022708</td>\n",
1431 |        "      <td>0.008983</td>\n",
1432 |        "      <td>-0.089621</td>\n",
1433 |        "      <td>-0.157927</td>\n",
1434 |        "      <td>0.015894</td>\n",
1435 |        "      <td>-0.003234</td>\n",
1436 |        "      <td>0.103574</td>\n",
1437 |        "      <td>0.235768</td>\n",
1438 |        "      <td>0.010266</td>\n",
1439 |        "      <td>0.010166</td>\n",
1440 |        "      <td>-0.077899</td>\n",
1441 |        "      <td>0.029570</td>\n",
1442 |        "      <td>0.042731</td>\n",
1443 |        "      <td>0.011146</td>\n",
1444 |        "      <td>-0.027546</td>\n",
1445 |        "      <td>0.017361</td>\n",
1446 |        "      <td>0.024619</td>\n",
1447 |        "      <td>0.003008</td>\n",
1448 |        "      <td>-0.013937</td>\n",
1449 |        "      <td>-0.003826</td>\n",
1450 |        "    </tr>\n",
1451 |        "    <tr>\n",
1452 |        "      <td>f_imdb_rating</td>\n",
1453 |        "      <td>-0.992456</td>\n",
1454 |        "      <td>-0.122722</td>\n",
1455 |        "      <td>0.277040</td>\n",
1456 |        "      <td>0.168530</td>\n",
1457 |        "      <td>-0.071904</td>\n",
1458 |        "      <td>1.286002</td>\n",
1459 |        "      <td>0.241488</td>\n",
1460 |        "      <td>0.239085</td>\n",
1461 |        "      <td>0.045930</td>\n",
1462 |        "      <td>0.015202</td>\n",
1463 |        "      <td>0.052162</td>\n",
1464 |        "      <td>0.150826</td>\n",
1465 |        "      <td>-0.554526</td>\n",
1466 |        "      <td>0.047041</td>\n",
1467 |        "      <td>0.012359</td>\n",
1468 |        "      <td>0.118412</td>\n",
1469 |        "      <td>-0.026726</td>\n",
1470 |        "      <td>-0.234360</td>\n",
1471 |        "      <td>0.013724</td>\n",
1472 |        "      <td>-0.086579</td>\n",
1473 |        "      <td>0.051071</td>\n",
1474 |        "      <td>0.007038</td>\n",
1475 |        "    </tr>\n",
1476 |        "    <tr>\n",
1477 |        "      <td>...</td>\n",
1478 |        "      <td>...</td>\n",
1479 |        "      <td>...</td>\n",
1480 |        "      <td>...</td>\n",
1481 |        "      <td>...</td>\n",
1482 |        "      <td>...</td>\n",
1483 |        "      <td>...</td>\n",
1484 |        "      <td>...</td>\n",
1485 |        "      <td>...</td>\n",
1486 |        "      <td>...</td>\n",
1487 |        "      <td>...</td>\n",
1488 |        "      <td>...</td>\n",
1489 |        "      <td>...</td>\n",
1490 |        "      <td>...</td>\n",
1491 |        "      <td>...</td>\n",
1492 |        "      <td>...</td>\n",
1493 |        "      <td>...</td>\n",
1494 |        "      <td>...</td>\n",
1495 |        "      <td>...</td>\n",
1496 |        "      <td>...</td>\n",
1497 |        "      <td>...</td>\n",
1498 |        "      <td>...</td>\n",
1499 |        "      <td>...</td>\n",
1500 |        "    </tr>\n",
1501 |        "    <tr>\n",
1502 |        "      <td>zealand</td>\n",
1503 |        "      <td>0.201899</td>\n",
1504 |        "      <td>-0.011669</td>\n",
1505 |        "      <td>-0.000164</td>\n",
1506 |        "      <td>0.038771</td>\n",
1507 |        "      <td>-0.038884</td>\n",
1508 |        "      <td>-0.200695</td>\n",
1509 |        "      <td>0.013058</td>\n",
1510 |        "      <td>0.013915</td>\n",
1511 |        "      <td>0.007938</td>\n",
1512 |        "      <td>0.020752</td>\n",
1513 |        "      <td>-0.001004</td>\n",
1514 |        "      <td>-0.008869</td>\n",
1515 |        "      <td>-0.048644</td>\n",
1516 |        "      <td>0.001566</td>\n",
1517 |        "      <td>-0.016636</td>\n",
1518 |        "      <td>-0.099761</td>\n",
1519 |        "      <td>0.006259</td>\n",
1520 |        "      <td>-0.014353</td>\n",
1521 |        "      <td>0.009601</td>\n",
1522 |        "      <td>0.003707</td>\n",
1523 |        "      <td>0.014746</td>\n",
1524 |        "      <td>0.001099</td>\n",
1525 |        "    </tr>\n",
1526 |        "    <tr>\n",
1527 |        "      <td>zero</td>\n",
1528 |        "      <td>0.145570</td>\n",
1529 |        "      <td>-0.053593</td>\n",
1530 |        "      <td>0.022223</td>\n",
1531 |        "      <td>0.015668</td>\n",
1532 |        "      <td>0.019288</td>\n",
1533 |        "      <td>-0.090709</td>\n",
1534 |        "      <td>-0.003745</td>\n",
1535 |        "      <td>0.002425</td>\n",
1536 |        "      <td>-0.020840</td>\n",
1537 |        "      <td>0.058833</td>\n",
1538 |        "      <td>0.003737</td>\n",
1539 |        "      <td>0.005958</td>\n",
1540 |        "      <td>-0.093422</td>\n",
1541 |        "      <td>0.027760</td>\n",
1542 |        "      <td>0.030987</td>\n",
1543 |        "      <td>0.093844</td>\n",
1544 |        "      <td>0.031099</td>\n",
1545 |        "      <td>0.011547</td>\n",
1546 |        "      <td>0.006995</td>\n",
1547 |        "      <td>0.014213</td>\n",
1548 |        "      <td>0.008819</td>\n",
1549 |        "      <td>0.001835</td>\n",
1550 |        "    </tr>\n",
1551 |        "    <tr>\n",
1552 |        "      <td>zombi</td>\n",
1553 |        "      <td>-0.028788</td>\n",
1554 |        "      <td>-0.057147</td>\n",
1555 |        "      <td>-0.007456</td>\n",
1556 |        "      <td>0.005884</td>\n",
1557 |        "      <td>0.020771</td>\n",
1558 |        "      <td>-0.279043</td>\n",
1559 |        "      <td>-0.000784</td>\n",
1560 |        "      <td>-0.027079</td>\n",
1561 |        "      <td>-0.005703</td>\n",
1562 |        "      <td>-0.038465</td>\n",
1563 |        "      <td>0.003023</td>\n",
1564 |        "      <td>-0.020218</td>\n",
1565 |        "      <td>0.341943</td>\n",
1566 |        "      <td>0.012054</td>\n",
1567 |        "      <td>0.030725</td>\n",
1568 |        "      <td>-0.135719</td>\n",
1569 |        "      <td>-0.007812</td>\n",
1570 |        "      <td>0.022717</td>\n",
1571 |        "      <td>-0.009578</td>\n",
1572 |        "      <td>0.004558</td>\n",
1573 |        "      <td>-0.052168</td>\n",
1574 |        "      <td>-0.011931</td>\n",
1575 |        "    </tr>\n",
1576 |        "    <tr>\n",
1577 |        "      <td>zone</td>\n",
1578 |        "      <td>0.188484</td>\n",
1579 |        "      <td>0.066910</td>\n",
1580 |        "      <td>-0.011633</td>\n",
1581 |        "      <td>-0.034049</td>\n",
1582 |        "      <td>0.013658</td>\n",
1583 |        "      <td>-0.053566</td>\n",
1584 |        "      <td>0.012062</td>\n",
1585 |        "      <td>-0.014565</td>\n",
1586 |        "      <td>-0.003714</td>\n",
1587 |        "      <td>0.043539</td>\n",
1588 |        "      <td>-0.001000</td>\n",
1589 |        "      <td>0.025871</td>\n",
1590 |        "      <td>-0.015155</td>\n",
1591 |        "      <td>0.039160</td>\n",
1592 |        "      <td>-0.007195</td>\n",
1593 |        "      <td>-0.075479</td>\n",
1594 |        "      <td>-0.015189</td>\n",
1595 |        "      <td>0.036640</td>\n",
1596 |        "      <td>-0.033840</td>\n",
1597 |        "      <td>-0.004590</td>\n",
1598 |        "      <td>0.025977</td>\n",
1599 |        "      <td>-0.003066</td>\n",
1600 |        "    </tr>\n",
1601 |        "    <tr>\n",
1602 |        "      <td>zoo</td>\n",
1603 |        "      <td>-0.039962</td>\n",
1604 |        "      <td>0.073740</td>\n",
1605 |        "      <td>0.018455</td>\n",
1606 |        "      <td>-0.023292</td>\n",
1607 |        "      <td>0.014912</td>\n",
1608 |        "      <td>-0.352289</td>\n",
1609 |        "      <td>-0.006540</td>\n",
1610 |        "      <td>-0.004921</td>\n",
1611 |        "      <td>0.044167</td>\n",
1612 |        "      <td>0.052392</td>\n",
1613 |        "      <td>-0.002906</td>\n",
1614 |        "      <td>-0.003083</td>\n",
1615 |        "      <td>0.011880</td>\n",
1616 |        "      <td>0.013114</td>\n",
1617 |        "      <td>-0.004452</td>\n",
1618 |        "      <td>0.041213</td>\n",
1619 |        "      <td>0.002608</td>\n",
1620 |        "      <td>-0.005972</td>\n",
1621 |        "      <td>-0.011298</td>\n",
1622 |        "      <td>-0.001489</td>\n",
1623 |        "      <td>-0.008891</td>\n",
1624 |        "      <td>0.003304</td>\n",
1625 |        "    </tr>\n",
1626 |        "  </tbody>\n",
1627 |        "</table>\n",
1628 |        "<p>5459 rows × 22 columns</p>\n",
1629 |        "</div>"
1630 |       ],
1631 |       "text/plain": [
1632 |        "                   g_action  g_adventure  g_animation  g_biography  g_comedy  \\\n",
1633 |        "f_release_year     0.890920    -0.202585     0.205493     0.156150  0.090189   \n",
1634 |        "f_release_month    0.162641     0.093539     0.032958    -0.002919 -0.007082   \n",
1635 |        "f_runtime          0.997796     0.052923    -0.329772     0.151680 -0.136051   \n",
1636 |        "f_word_count_long  0.221205     0.002507     0.022708     0.008983 -0.089621   \n",
1637 |        "f_imdb_rating     -0.992456    -0.122722     0.277040     0.168530 -0.071904   \n",
1638 |        "...                     ...          ...          ...          ...       ...   \n",
1639 |        "zealand            0.201899    -0.011669    -0.000164     0.038771 -0.038884   \n",
1640 |        "zero               0.145570    -0.053593     0.022223     0.015668  0.019288   \n",
1641 |        "zombi             -0.028788    -0.057147    -0.007456     0.005884  0.020771   \n",
1642 |        "zone               0.188484     0.066910    -0.011633    -0.034049  0.013658   \n",
1643 |        "zoo               -0.039962     0.073740     0.018455    -0.023292  0.014912   \n",
1644 |        "\n",
1645 |        "                    g_crime  g_documentary   g_drama  g_family  g_fantasy  \\\n",
1646 |        "f_release_year     0.207143       0.112173 -0.024676 -0.005681   0.000552   \n",
1647 |        "f_release_month   -0.054702      -0.018309  0.026233  0.056573   0.120411   \n",
1648 |        "f_runtime         -0.098478      -0.125171  0.161513 -0.157185  -0.139964   \n",
1649 |        "f_word_count_long -0.157927       0.015894 -0.003234  0.103574   0.235768   \n",
1650 |        "f_imdb_rating      1.286002       0.241488  0.239085  0.045930   0.015202   \n",
1651 |        "...                     ...            ...       ...       ...        ...   \n",
1652 |        "zealand           -0.200695       0.013058  0.013915  0.007938   0.020752   \n",
1653 |        "zero              -0.090709      -0.003745  0.002425 -0.020840   0.058833   \n",
1654 |        "zombi             -0.279043      -0.000784 -0.027079 -0.005703  -0.038465   \n",
1655 |        "zone              -0.053566       0.012062 -0.014565 -0.003714   0.043539   \n",
1656 |        "zoo               -0.352289      -0.006540 -0.004921  0.044167   0.052392   \n",
1657 |        "\n",
1658 |        "                   g_film-noir  g_history  g_horror   g_music  g_musical  \\\n",
1659 |        "f_release_year       -0.192294   0.262658 -0.124317 -0.247146  -0.202726   \n",
1660 |        "f_release_month       0.000712  -0.090634 -0.025620  0.030871   0.001300   \n",
1661 |        "f_runtime            -0.046469   0.234240 -0.372855  0.120316   0.074502   \n",
1662 |        "f_word_count_long     0.010266   0.010166 -0.077899  0.029570   0.042731   \n",
1663 |        "f_imdb_rating         0.052162   0.150826 -0.554526  0.047041   0.012359   \n",
1664 |        "...                        ...        ...       ...       ...        ...   \n",
1665 |        "zealand              -0.001004  -0.008869 -0.048644  0.001566  -0.016636   \n",
1666 |        "zero                  0.003737   0.005958 -0.093422  0.027760   0.030987   \n",
1667 |        "zombi                 0.003023  -0.020218  0.341943  0.012054   0.030725   \n",
1668 |        "zone                 -0.001000   0.025871 -0.015155  0.039160  -0.007195   \n",
1669 |        "zoo                  -0.002906  -0.003083  0.011880  0.013114  -0.004452   \n",
1670 |        "\n",
1671 |        "                   g_mystery  g_romance  g_sci-fi   g_sport  g_thriller  \\\n",
1672 |        "f_release_year     -0.000374  -0.163564 -0.082269 -0.013732    0.092245   \n",
1673 |        "f_release_month     0.005833   0.013643  0.013330 -0.010611    0.009509   \n",
1674 |        "f_runtime           0.034689   0.109954 -0.061331  0.012625    0.053091   \n",
1675 |        "f_word_count_long   0.011146  -0.027546  0.017361  0.024619    0.003008   \n",
1676 |        "f_imdb_rating       0.118412  -0.026726 -0.234360  0.013724   -0.086579   \n",
1677 |        "...                      ...        ...       ...       ...         ...   \n",
1678 |        "zealand            -0.099761   0.006259 -0.014353  0.009601    0.003707   \n",
1679 |        "zero                0.093844   0.031099  0.011547  0.006995    0.014213   \n",
1680 |        "zombi              -0.135719  -0.007812  0.022717 -0.009578    0.004558   \n",
1681 |        "zone               -0.075479  -0.015189  0.036640 -0.033840   -0.004590   \n",
1682 |        "zoo                 0.041213   0.002608 -0.005972 -0.011298   -0.001489   \n",
1683 |        "\n",
1684 |        "                      g_war  g_western  \n",
1685 |        "f_release_year     0.168980   0.168377  \n",
1686 |        "f_release_month   -0.108588  -0.121271  \n",
1687 |        "f_runtime          0.236038   0.229005  \n",
1688 |        "f_word_count_long -0.013937  -0.003826  \n",
1689 |        "f_imdb_rating      0.051071   0.007038  \n",
1690 |        "...                     ...        ...  \n",
1691 |        "zealand            0.014746   0.001099  \n",
1692 |        "zero               0.008819   0.001835  \n",
1693 |        "zombi             -0.052168  -0.011931  \n",
1694 |        "zone               0.025977  -0.003066  \n",
1695 |        "zoo               -0.008891   0.003304  \n",
1696 |        "\n",
1697 |        "[5459 rows x 22 columns]"
1698 |       ]
1699 |      },
1700 |      "execution_count": 85,
1701 |      "metadata": {},
1702 |      "output_type": "execute_result"
1703 |     }
1704 |    ],
1705 |    "source": [
1706 |     "coef_df = pd.DataFrame(coefs, index=genre_cols, columns=X_train.columns)\n",
1707 |     "coef_tdf = coef_df.T\n",
1708 |     "coef_tdf"
1709 |    ]
1710 |   },
1711 |   {
1712 |    "cell_type": "code",
1713 |    "execution_count": null,
1714 |    "metadata": {},
1715 |    "outputs": [],
1716 |    "source": [
1717 |     "coef_tdf.to_csv('my_1vr_logreg_default_coef.tsv', sep='\\t')"
1718 |    ]
1719 |   },
1755 |   {
1756 |    "cell_type": "code",
1757 |    "execution_count": 95,
1758 |    "metadata": {
1759 |     "scrolled": true
1760 |    },
1761 |    "outputs": [
1762 |     {
1763 |      "name": "stdout",
1764 |      "output_type": "stream",
1765 |      "text": [
1766 |       "C:  1e-05\n",
1767 |       "Train score: 0.07767\n",
1768 |       " Test score: 0.07150\n",
1769 |       "0.8131  g_action\n",
1770 |       "0.8852  g_adventure\n",
1771 |       "0.9691  g_animation\n",
1772 |       "0.9478  g_biography\n",
1773 |       "0.6407  g_comedy\n",
1774 |       "0.8392  g_crime\n",
1775 |       "0.9525  g_documentary\n",
1776 |       "0.6377  g_drama\n",
1777 |       "0.9406  g_family\n",
1778 |       "0.9249  g_fantasy\n",
1779 |       "0.9892  g_film-noir\n",
1780 |       "0.9561  g_history\n",
1781 |       "0.8643  g_horror\n",
1782 |       "0.9342  g_music\n",
1783 |       "0.9698  g_musical\n",
1784 |       "0.9087  g_mystery\n",
1785 |       "0.7738  g_romance\n",
1786 |       "0.9179  g_sci-fi\n",
1787 |       "0.9708  g_sport\n",
1788 |       "0.7542  g_thriller\n",
1789 |       "0.9489  g_war\n",
1790 |       "0.9775  g_western\n",
1791 |       "C:  0.0001\n",
1792 |       "Train score: 0.17256\n",
1793 |       " Test score: 0.12355\n",
1794 |       "0.8506  g_action\n",
1795 |       "0.8946  g_adventure\n",
1796 |       "0.9691  g_animation\n",
1797 |       "0.9478  g_biography\n",
1798 |       "0.7606  g_comedy\n",
1799 |       "0.8627  g_crime\n",
1800 |       "0.9558  g_documentary\n",
1801 |       "0.7239  g_drama\n",
1802 |       "0.9409  g_family\n",
1803 |       "0.9282  g_fantasy\n",
1804 |       "0.9892  g_film-noir\n",
1805 |       "0.9562  g_history\n",
1806 |       "0.8956  g_horror\n",
1807 |       "0.9388  g_music\n",
1808 |       "0.9698  g_musical\n",
1809 |       "0.9100  g_mystery\n",
1810 |       "0.8046  g_romance\n",
1811 |       "0.9305  g_sci-fi\n",
1812 |       "0.9712  g_sport\n",
1813 |       "0.7984  g_thriller\n",
1814 |       "0.9539  g_war\n",
1815 |       "0.9784  g_western\n",
1816 |       "C:  0.001\n",
1817 |       "Train score: 0.35680\n",
1818 |       " Test score: 0.13993\n",
1819 |       "0.8631  g_action\n",
1820 |       "0.9027  g_adventure\n",
1821 |       "0.9714  g_animation\n",
1822 |       "0.9518  g_biography\n",
1823 |       "0.7512  g_comedy\n",
1824 |       "0.8729  g_crime\n",
1825 |       "0.9720  g_documentary\n",
1826 |       "0.7047  g_drama\n",
1827 |       "0.9439  g_family\n",
1828 |       "0.9330  g_fantasy\n",
1829 |       "0.9891  g_film-noir\n",
1830 |       "0.9615  g_history\n",
1831 |       "0.9270  g_horror\n",
1832 |       "0.9494  g_music\n",
1833 |       "0.9703  g_musical\n",
1834 |       "0.9103  g_mystery\n",
1835 |       "0.8112  g_romance\n",
1836 |       "0.9467  g_sci-fi\n",
1837 |       "0.9792  g_sport\n",
1838 |       "0.7975  g_thriller\n",
1839 |       "0.9649  g_war\n",
1840 |       "0.9855  g_western\n",
1841 |       "C:  0.01\n",
1842 |       "Train score: 0.54267\n",
1843 |       " Test score: 0.10558\n",
1844 |       "0.8352  g_action\n",
1845 |       "0.8771  g_adventure\n",
1846 |       "0.9702  g_animation\n",
1847 |       "0.9394  g_biography\n",
1848 |       "0.7344  g_comedy\n",
1849 |       "0.8410  g_crime\n",
1850 |       "0.9731  g_documentary\n",
1851 |       "0.6931  g_drama\n",
1852 |       "0.9317  g_family\n",
1853 |       "0.9137  g_fantasy\n",
1854 |       "0.9892  g_film-noir\n",
1855 |       "0.9506  g_history\n",
1856 |       "0.9117  g_horror\n",
1857 |       "0.9385  g_music\n",
1858 |       "0.9655  g_musical\n",
1859 |       "0.8823  g_mystery\n",
1860 |       "0.7822  g_romance\n",
1861 |       "0.9370  g_sci-fi\n",
1862 |       "0.9794  g_sport\n",
1863 |       "0.7769  g_thriller\n",
1864 |       "0.9610  g_war\n",
1865 |       "0.9868  g_western\n",
1866 |       "C:  0.1\n",
1867 |       "Train score: 0.60312\n",
1868 |       " Test score: 0.07882\n",
1869 |       "0.8002  g_action\n",
1870 |       "0.8517  g_adventure\n",
1871 |       "0.9643  g_animation\n",
1872 |       "0.9276  g_biography\n",
1873 |       "0.7265  g_comedy\n",
1874 |       "0.8064  g_crime\n",
1875 |       "0.9720  g_documentary\n",
1876 |       "0.6883  g_drama\n",
1877 |       "0.9197  g_family\n",
1878 |       "0.8938  g_fantasy\n",
1879 |       "0.9888  g_film-noir\n",
1880 |       "0.9392  g_history\n",
1881 |       "0.9019  g_horror\n",
1882 |       "0.9240  g_music\n",
1883 |       "0.9587  g_musical\n",
1884 |       "0.8558  g_mystery\n",
1885 |       "0.7657  g_romance\n",
1886 |       "0.9261  g_sci-fi\n",
1887 |       "0.9796  g_sport\n",
1888 |       "0.7593  g_thriller\n",
1889 |       "0.9569  g_war\n",
1890 |       "0.9868  g_western\n",
1891 |       "C:  1\n",
1892 |       "Train score: 0.62265\n",
1893 |       " Test score: 0.06763\n",
1894 |       "0.7848  g_action\n",
1895 |       "0.8404  g_adventure\n",
1896 |       "0.9593  g_animation\n",
1897 |       "0.9157  g_biography\n",
1898 |       "0.7252  g_comedy\n",
1899 |       "0.7922  g_crime\n",
1900 |       "0.9695  g_documentary\n",
1901 |       "0.6874  g_drama\n",
1902 |       "0.9060  g_family\n",
1903 |       "0.8818  g_fantasy\n",
1904 |       "0.9879  g_film-noir\n",
1905 |       "0.9289  g_history\n",
1906 |       "0.8923  g_horror\n",
1907 |       "0.9136  g_music\n",
1908 |       "0.9497  g_musical\n",
1909 |       "0.8408  g_mystery\n",
1910 |       "0.7615  g_romance\n",
1911 |       "0.9157  g_sci-fi\n",
1912 |       "0.9778  g_sport\n",
1913 |       "0.7568  g_thriller\n",
1914 |       "0.9501  g_war\n",
1915 |       "0.9854  g_western\n",
1916 |       "Wall time: 1h 48min 14s\n"
1917 |      ]
1918 |     }
1919 |    ],
1920 |    "source": [
1921 |     "%%time\n",
1922 |     "from sklearn.metrics import accuracy_score\n",
1923 |     "c_values = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]\n",
1924 |     "train_scores = []\n",
1925 |     "test_scores = []\n",
1926 |     "\n",
1927 |     "for c_val in c_values:\n",
1928 |     "    my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='sag', max_iter=3000, C=c_val, n_jobs=-1), n_jobs=-1).fit(X_train_s, y_train)\n",
1929 |     "\n",
1930 |     "    # EXPORT AND SAVE THE MODEL\n",
1931 |     "    joblib.dump(my_log_model, f'models/my_1vr_logreg_sag_{c_val}.pkl')\n",
1932 |     "    \n",
1933 |     "    # Make predictions\n",
1934 |     "    y_train_pred_log = my_log_model.predict(X_train_s)\n",
1935 |     "    y_pred_log = my_log_model.predict(X_test_s)\n",
1936 |     "\n",
1937 |     "    #my_log_model.multilabel_\n",
1938 |     "    #my_model.predict_proba(X_train_s)\n",
1939 |     "    \n",
1940 |     "    # Check overall accuracies (for multilabel targets, accuracy_score is\n",
1941 |     "    # exact-match subset accuracy: every genre label must be correct)\n",
1942 |     "    train_acc = accuracy_score(y_train, y_train_pred_log)\n",
1943 |     "    test_acc = accuracy_score(y_test, y_pred_log)\n",
1944 |     "    train_scores.append(train_acc)\n",
1945 |     "    test_scores.append(test_acc)\n",
1946 |     "    print(f'C:  {c_val}')\n",
1947 |     "    print(f'Train score: {train_acc:0.5f}')\n",
1948 |     "    print(f' Test score: {test_acc:0.5f}')\n",
1949 |     "\n",
1950 |     "    y_train_pred_log_df = pd.DataFrame(y_train_pred_log)\n",
1951 |     "    y_train_pred_log_df.columns = genre_cols\n",
1952 |     "\n",
1953 |     "    y_pred_log_df = pd.DataFrame(y_pred_log)\n",
1954 |     "    y_pred_log_df.columns = genre_cols\n",
1955 |     "\n",
1956 |     "    test_acc_dict = {}\n",
1957 |     "    # Test genre set predictions\n",
1958 |     "    for g in genre_cols:\n",
1959 |     "        score = accuracy_score(y_test[g], y_pred_log_df[g])\n",
1960 |     "        test_acc_dict.update( {g[2:] : score} )\n",
1961 |     "        print(f'{score:0.4f}  {g}')\n",
1962 |     "\n",
1963 |     "    # Export genre scores\n",
1964 |     "    test_scores_log = pd.DataFrame.from_dict(test_acc_dict, orient='index', columns=['score'])\n",
1965 |     "    test_scores_log.to_csv(f'test_scores_logreg_sag_{c_val}.csv', index_label='genre')"
1966 |    ]
1967 |   },
1968 |   {
1969 |    "cell_type": "code",
1970 |    "execution_count": 96,
1971 |    "metadata": {},
1972 |    "outputs": [
1973 |     {
1974 |      "data": {
1975 |       "text/plain": [
1976 |        "<function matplotlib.pyplot.show(*args, **kw)>"
1977 |       ]
1978 |      },
1979 |      "execution_count": 96,
1980 |      "metadata": {},
1981 |      "output_type": "execute_result"
1982 |     },
1983 |     {
1984 |      "data": {
1986 |       "text/plain": [
1987 |        "<Figure size 432x288 with 1 Axes>"
1988 |       ]
1989 |      },
1990 |      "metadata": {
1991 |       "needs_background": "light"
1992 |      },
1993 |      "output_type": "display_data"
1994 |     }
1995 |    ],
1996 |    "source": [
1997 |     "plt.plot(c_values, train_scores, label='train')\n",
1998 |     "plt.plot(c_values, test_scores, label='test')\n",
1999 |     "plt.xscale('log')\n",
2000 |     "plt.legend()\n",
2001 |     "plt.show()"
2002 |    ]
2003 |   },
2004 |   {
2005 |    "cell_type": "code",
2006 |    "execution_count": 110,
2007 |    "metadata": {},
2008 |    "outputs": [
2009 |     {
2010 |      "data": {
2011 |       "text/plain": [
2012 |        "[0.07767076472415782,\n",
2013 |        " 0.17256224757001465,\n",
2014 |        " 0.35679730149571703,\n",
2015 |        " 0.5426745373041587,\n",
2016 |        " 0.6031245839066175,\n",
2017 |        " 0.6226532333229773]"
2018 |       ]
2019 |      },
2020 |      "execution_count": 110,
2021 |      "metadata": {},
2022 |      "output_type": "execute_result"
2023 |     }
2024 |    ],
2025 |    "source": [
2026 |     "train_scores"
2027 |    ]
2028 |   },
2029 |   {
2030 |    "cell_type": "code",
2031 |    "execution_count": 111,
2032 |    "metadata": {},
2033 |    "outputs": [
2034 |     {
2035 |      "data": {
2036 |       "text/plain": [
2037 |        "[0.0714951404606577,\n",
2038 |        " 0.12355212355212356,\n",
2039 |        " 0.1399281054453468,\n",
2040 |        " 0.10557848488882972,\n",
2041 |        " 0.07881773399014778,\n",
2042 |        " 0.06763413659965384]"
2043 |       ]
2044 |      },
2045 |      "execution_count": 111,
2046 |      "metadata": {},
2047 |      "output_type": "execute_result"
2048 |     }
2049 |    ],
2050 |    "source": [
2051 |     "test_scores"
2052 |    ]
2053 |   },
2054 |   {
2055 |    "cell_type": "code",
2056 |    "execution_count": null,
2057 |    "metadata": {},
2058 |    "outputs": [],
2059 |    "source": []
2060 |   }
2061 |  ],
2062 |  "metadata": {
2063 |   "kernelspec": {
2064 |    "display_name": "Python 3",
2065 |    "language": "python",
2066 |    "name": "python3"
2067 |   },
2068 |   "language_info": {
2069 |    "codemirror_mode": {
2070 |     "name": "ipython",
2071 |     "version": 3
2072 |    },
2073 |    "file_extension": ".py",
2074 |    "mimetype": "text/x-python",
2075 |    "name": "python",
2076 |    "nbconvert_exporter": "python",
2077 |    "pygments_lexer": "ipython3",
2078 |    "version": "3.7.4"
2079 |   }
2080 |  },
2081 |  "nbformat": 4,
2082 |  "nbformat_minor": 4
2083 | }
2084 | 
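
A note on the scores printed by the C sweep above: scikit-learn's `accuracy_score`, when given multilabel indicator targets, reports subset accuracy, so a movie counts as correct only if all of its genre flags are predicted correctly. That would explain how the overall test score can sit near 0.07 to 0.14 while each per-genre accuracy lies between roughly 0.64 and 0.99. The sketch below illustrates the difference on made-up labels (three hypothetical genre columns, not the IMDb data); the helper names `subset_accuracy` and `per_label_accuracy` are assumptions introduced for illustration, not functions from the notebook.

```python
# Toy multilabel example (made-up labels, NOT the IMDb data):
# rows are movies, columns are three hypothetical genre flags.
y_true = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
y_pred = [
    [1, 0, 0],  # every label right
    [0, 1, 0],  # one label wrong
    [1, 0, 0],  # one label wrong
    [0, 0, 1],  # every label right
]


def subset_accuracy(y_true, y_pred):
    """Fraction of rows where every label matches (exact match)."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Accuracy of each label column, scored independently."""
    n_rows = len(y_true)
    n_labels = len(y_true[0])
    return [
        sum(y_true[i][j] == y_pred[i][j] for i in range(n_rows)) / n_rows
        for j in range(n_labels)
    ]


print(subset_accuracy(y_true, y_pred))     # 0.5
print(per_label_accuracy(y_true, y_pred))  # [1.0, 0.75, 0.75]
```

With 22 genre columns, a single wrong flag anywhere in a row makes the whole row count as a miss, so the exact-match score can stay low even when each column looks accurate on its own.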


--------------------------------------------------------------------------------
/demo/models/my_best_model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/models/my_best_model.pkl


--------------------------------------------------------------------------------
/demo/models/my_best_scaler.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/models/my_best_scaler.pkl


--------------------------------------------------------------------------------
/demo/models/my_best_tfidf.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/models/my_best_tfidf.pkl


--------------------------------------------------------------------------------
/demo/predict.py:
--------------------------------------------------------------------------------
  1 | from flask import Flask, render_template, flash, request
  2 | 
  3 | # Import the packages we will be using
  4 | # Basic Packages
  5 | import numpy as np
  6 | import pandas as pd
  7 | #pd.set_option('display.max_columns', 500)
  8 | #np.set_printoptions(suppress=True)
  9 | 
 10 | # NLTK Packages
 11 | #import nltk
 12 | # Use the code below to download the NLTK package, a straightforward GUI should pop up
 13 | #nltk.download()
 14 | #from nltk.corpus import stopwords
 15 | from nltk.tokenize import word_tokenize
 16 | from nltk.stem import PorterStemmer
 17 | from nltk.stem import WordNetLemmatizer
 18 | import joblib
 19 | 
 20 | #stop_words = stopwords.words('english')
 21 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
 22 | #Adds stuff to our stop words list
 23 | stop_words.extend(['.',','])
 24 | 
 25 | ## This function can improve, simplify. Look into Text Data Lecture
 26 | def remove_stopwords(list_of_tokens):
 27 |     """
 28 |     Removes stopwords
 29 |     """
 30 | 
 31 |     cleaned_tokens = []
 32 | 
 33 |     for token in list_of_tokens:
 34 |         if token in stop_words: continue
 35 |         cleaned_tokens.append(token)
 36 | 
 37 |     return cleaned_tokens
 38 | def stemmer(list_of_tokens):
 39 |     '''
 40 |     Takes in an input which is a list of tokens, and spits out a list of stemmed tokens.
 41 |     '''
 42 | 
 43 |     stemmed_tokens_list = []
 44 | 
 45 |     for i in list_of_tokens:
 46 | 
 47 |         token = PorterStemmer().stem(i)
 48 |         stemmed_tokens_list.append(token)
 49 | 
 50 |     return stemmed_tokens_list
 51 | 
 52 | #from nltk.stem import WordNetLemmatizer
 53 | 
 54 | def lemmatizer(list_of_tokens):
 55 | 
 56 |     lemmatized_tokens_list = []
 57 | 
 58 |     for i in list_of_tokens:
 59 |         token = WordNetLemmatizer().lemmatize(i)
 60 |         lemmatized_tokens_list.append(token)
 61 | 
 62 |     return lemmatized_tokens_list
 63 | 
 64 | 
 65 | def the_untokenizer(token_list):
 66 |         '''
 67 |         Returns all the tokenized words in the list to one string.
 68 |         Used after the pre processing, such as removing stopwords, and lemmatizing.
 69 |         '''
 70 |         return " ".join(token_list)
 71 | 
 72 | def clean_string(my_string):
 73 |     tokenized_list = word_tokenize(my_string)
 74 |     removed_stopwords = remove_stopwords(tokenized_list)
 75 |     stemmed_words = stemmer(removed_stopwords)
 76 |     lemmatized_words = lemmatizer(stemmed_words)
 77 |     back_to_string = the_untokenizer(lemmatized_words)
 78 |     return back_to_string
 79 | 
 80 | app = Flask("genre_prediction", template_folder='templates')
 81 | #app = Flask("genre_prediction", template_folder='/home/TomKeith/genre/templates')
 82 | app.secret_key = "super secret key"
 83 | 
 84 | @app.route('/', methods=["GET","POST"])
 85 | def predict():
 86 | 
 87 |     #error = 'trying post'
 88 |     try:
 89 |         if request.method == "POST":
 90 |             my_string = request.form['plot']
 91 | 
 92 |             train_df = pd.read_csv('train_medians.csv')
 93 |             my_model = joblib.load('models/my_best_model.pkl')
 94 |             my_scaler = joblib.load('models/my_best_scaler.pkl')
 95 |             my_tfidf = joblib.load('models/my_best_tfidf.pkl')
 96 |             
 97 |             #train_df = pd.read_csv('/home/TomKeith/genre/train_medians.csv')
 98 |             #my_model = joblib.load('/home/TomKeith/genre/models/my_best_model.pkl')
 99 |             #my_scaler = joblib.load('/home/TomKeith/genre/models/my_best_scaler.pkl')
100 |             #my_tfidf = joblib.load('/home/TomKeith/genre/models/my_best_tfidf.pkl')
101 | 
102 |             genre_cols = ['action','adventure','animation','biography','comedy','crime','documentary',\
103 |                           'drama','family','fantasy','film-noir','history','horror','music','musical',\
104 |                           'mystery','romance','sci-fi','sport','thriller','war','western']
105 | 
106 |             feature_cols = ['f_release_year','f_release_month','f_runtime','f_word_count_long','f_imdb_rating',\
107 |                             'f_num_imdb_votes','f_num_user_reviews','f_num_critic_reviews']
108 | 
109 | 
110 |             feature_cols_df = pd.DataFrame([[0]*8 ], columns=feature_cols)
111 | 
112 |             input_tfidf = my_tfidf.transform([clean_string(my_string)])
113 |             input_transformed_df = pd.DataFrame(input_tfidf.toarray(), columns=my_tfidf.get_feature_names())
114 | 
115 |             input_final = pd.concat([feature_cols_df, input_transformed_df], axis=1)
116 | 
117 |             for col in feature_cols:
118 |                 input_final.at[0,col] = train_df[col].median()
119 |             input_final.at[0,'f_word_count_long'] = len(my_string.split())  # count words, not characters
120 |             input_final_df = my_scaler.transform(input_final)
121 | 
122 |             input_pred = my_model.predict_proba(input_final_df)
123 | 
124 |             df = pd.DataFrame(input_pred, columns=genre_cols).T.sort_values(0, ascending=False)
125 |             output_list = []
126 |             for index, row in df.iterrows():
127 |                 if row.values[0] >= 0.2:
128 |                     temp_list = [int(round(row.values[0]*100,0)), index.capitalize()]
129 |                     output_list.append(temp_list)
130 |             return render_template('predict.html', results=output_list, my_string=my_string)
131 |         else:
132 |             return render_template('predict.html')
133 | 
134 |     except Exception as e:
135 |         print(e)
136 |         #return json.dumps({'success':True}, 200, {'ContentType':'application/json'})
137 |         return render_template("predict.html", error = e)
138 | 
139 | if __name__ == "__main__":
140 |     app.debug = True
141 |     app.run()


--------------------------------------------------------------------------------
/demo/static/css/styles.css:
--------------------------------------------------------------------------------
 1 | @import url('https://fonts.googleapis.com/css2?family=Raleway:wght@400;700&display=swap');
 2 | 
 3 | .container {
 4 | 	margin: auto;
 5 | 	width: 800px;
 6 | }
 7 | 
 8 | body {
 9 | 	text-align: center;
10 | 	background-color: #f46524;
11 | 	font-family: 'Raleway', sans-serif;
12 | 	color: white;
13 | }
14 | h1 {
15 | 	font-weight: 700;
16 | 	font-size: 64pt;
17 | 	font-family: 'Raleway', sans-serif;
18 | 	margin-bottom: 6px;
19 | 	margin-top: 10px;
20 | 
21 | }
22 | 
23 | h2 {
24 | 	font-weight: 700;
25 | 	font-size: 48pt;
26 | 	font-family: 'Raleway', sans-serif;
27 | 	margin-bottom: 8px;
28 | 	margin-top: 10px;
29 | 
30 | }
31 | 
32 | h3 {
33 | 	font-weight: 700;
34 | 	font-size: 24pt;
35 | 	font-family: 'Raleway', sans-serif;
36 | 	margin-bottom: 10px;
37 | 	margin-top: 10px;
38 | 
39 | }
40 | 
41 | p {
42 | 	color: white;
43 | }
44 | 
45 | .results {
46 | 	/* width: 600px; */
47 | }
48 | 
49 | .genre_name{
50 | 	font-size: 24pt;
51 | }
52 | 
53 | .genre_score{
54 | 	font-size: 16pt;
55 | }
56 | 
57 | .genre_block{
58 | 	width: 25%;
59 | 	display:inline-block;
60 | 	/* float: center; (not a valid float value, so browsers ignore it) */
61 | }
62 | 
63 | img{
64 | 	width: 50px;
65 | }
66 | 
67 | .btn {
68 | 	background-color: white;
69 | 	border: none;
70 | 	color: #f46524;
71 | 	padding: 15px 32px;
72 | 	text-align: center;
73 | 	text-decoration: none;
74 | 	display: inline-block;
75 | 	font-size: 16px;
76 | 	font-weight: 700;
77 | 	border-radius: 6px;
78 | }
79 | 
80 | .footer {
81 |     position: fixed;
82 |     left: 0;
83 |     bottom: 0;
84 |     width: 100%;
85 |     /* background-color: red; */
86 |     color: black;
87 |     text-align: center;
88 |     /* font-weight: bold; */
89 |     font-size: 10pt;
90 | }
91 | 
92 | a{
93 |     text-decoration: none;
94 |     /* font-weight: bold; */
95 |     color: white;
96 | }


--------------------------------------------------------------------------------
/demo/static/images/Action.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Action.png


--------------------------------------------------------------------------------
/demo/static/images/Adventure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Adventure.png


--------------------------------------------------------------------------------
/demo/static/images/Animation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Animation.png


--------------------------------------------------------------------------------
/demo/static/images/Biography.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Biography.png


--------------------------------------------------------------------------------
/demo/static/images/Comedy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Comedy.png


--------------------------------------------------------------------------------
/demo/static/images/Crime.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Crime.png


--------------------------------------------------------------------------------
/demo/static/images/Documentary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Documentary.png


--------------------------------------------------------------------------------
/demo/static/images/Drama.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Drama.png


--------------------------------------------------------------------------------
/demo/static/images/Family.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Family.png


--------------------------------------------------------------------------------
/demo/static/images/Fantasy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Fantasy.png


--------------------------------------------------------------------------------
/demo/static/images/Film-noir.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Film-noir.png


--------------------------------------------------------------------------------
/demo/static/images/History.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/History.png


--------------------------------------------------------------------------------
/demo/static/images/Horror.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Horror.png


--------------------------------------------------------------------------------
/demo/static/images/Music.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Music.png


--------------------------------------------------------------------------------
/demo/static/images/Musical.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Musical.png


--------------------------------------------------------------------------------
/demo/static/images/Mystery.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Mystery.png


--------------------------------------------------------------------------------
/demo/static/images/Romance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Romance.png


--------------------------------------------------------------------------------
/demo/static/images/Sci-fi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Sci-fi.png


--------------------------------------------------------------------------------
/demo/static/images/Sport.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Sport.png


--------------------------------------------------------------------------------
/demo/static/images/Thriller.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Thriller.png


--------------------------------------------------------------------------------
/demo/static/images/War.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/War.png


--------------------------------------------------------------------------------
/demo/static/images/Western.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Western.png


--------------------------------------------------------------------------------
/demo/static/images/magic-lamp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/magic-lamp.png


--------------------------------------------------------------------------------
/demo/templates/predict.html:
--------------------------------------------------------------------------------
 1 | <html>
 2 | <head>
 3 | <link rel="stylesheet" href="{{ url_for('static',filename='css/styles.css') }}">
 4 | <meta name="viewport" content="width=device-width, initial-scale=1">
 5 | <title>Genre Genie - Movie Genre Predictor - Tom Keith</title>
 6 | </head>
 7 | <body>
 8 | 	<h1>Genre Genie</h1><img src="{{ url_for('static',filename='images/') }}magic-lamp.png" />
 9 | 	<h3>Movie Genre Predictor</h3>
10 | 	<div class="container">
11 | 		<p>Enter a plot summary to get genre predictions.</p>
12 | 		<form action="" class="form-inline" method="post">
13 | 			<textarea rows="12" cols="100" type="textarea" class="form-control" placeholder = "{{ my_string }}" name="plot" value="{{request.form.plot}}" >{% if my_string %}{{ my_string }}{% else %}Luke Skywalker, Han Solo, Princess Leia and Chewbacca face attack by the Imperial forces and its AT-AT walkers on the ice planet Hoth. While Han and Leia escape in the Millennium Falcon, Luke travels to Dagobah in search of Yoda. Only with the Jedi Master's help will Luke survive when the Dark Side of the Force beckons him into the ultimate duel with Darth Vader.{% endif %}</textarea>
14 | 			<br /><br />
15 | 			<input class="btn" type="submit" value="MAKE PREDICTION">
16 | 		<br />
17 | 		</form>
18 | 		<br />
19 | 		{% if results %}
20 | 			<div class="results">
21 | 			{% for g in results %}
22 | 				<div class="genre_block">
23 | 					<div class="genre_icon"><img  src="{{ url_for('static',filename='images/') }}{{g[1]}}.png" /></div>
24 | 					<div class="genre_name">{{g[1]}}</div>
25 | 					<div class="genre_score">{{g[0]}} %</div>
26 | 				<br />
27 | 				</div>
28 | 
29 | 			{% endfor %}
30 | 			</div>
31 | 		{% endif %}
32 | 	</div>
33 | 	{% if error %}<p>{{error}}</p>{% endif %}
34 | 	<div class="footer">
35 |         <p>Created by Tom Keith - <a href="https://github.com/tomkeith" target="_blank">GitHub</a> | <a href="https://www.linkedin.com/in/tomkeithdata/" target="_blank">LinkedIn</a></p>
36 |     </div>
37 | </body>
38 | </html>


--------------------------------------------------------------------------------
/demo/train_medians.csv:
--------------------------------------------------------------------------------
1 | f_release_year,f_release_month,f_runtime,f_word_count_long,f_imdb_rating,f_num_imdb_votes,f_num_user_reviews,f_num_critic_reviews
2 | 2004,7,100,76,6.5,3564,35,29
3 | 


--------------------------------------------------------------------------------
/images/app.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/app.png


--------------------------------------------------------------------------------
/images/app2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/app2.png


--------------------------------------------------------------------------------
/images/genre-counts-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/genre-counts-graph.png


--------------------------------------------------------------------------------
/images/imdb-bottom.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/imdb-bottom.png


--------------------------------------------------------------------------------
/images/imdb-top.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/imdb-top.png


--------------------------------------------------------------------------------
/images/results-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/results-graph.png


--------------------------------------------------------------------------------
/images/wc_img.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/wc_img.png


--------------------------------------------------------------------------------
/models/my_1vr_logreg_0.01.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_1vr_logreg_0.01.pkl


--------------------------------------------------------------------------------
/models/my_1vr_logreg_default.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_1vr_logreg_default.pkl


--------------------------------------------------------------------------------
/models/my_best_model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_best_model.pkl


--------------------------------------------------------------------------------
/models/my_best_scaler.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_best_scaler.pkl


--------------------------------------------------------------------------------
/models/my_best_tfidf.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_best_tfidf.pkl


--------------------------------------------------------------------------------
/models/my_minmax_scaler.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_minmax_scaler.pkl


--------------------------------------------------------------------------------
/models/my_standard_scaler.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_standard_scaler.pkl


--------------------------------------------------------------------------------
/models/my_tfidf_min20.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_tfidf_min20.pkl


--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
 1 | # Genre Genie - Movie Genre Predictions
 2 | 
 3 | ## Multi-label Classification with Natural Language Processing
 4 | 
 5 | ### Tom Keith - BrainStation Data Science Diploma Capstone - March 2020
 6 | 
 7 | **DEMO**: http://tomkeith.pythonanywhere.com/
 8 | 
 9 | <a href="http://tomkeith.pythonanywhere.com/" target="_blank"><img src="images/app2.png" alt="Genre Genie app"></a>
10 | 
11 | ---
12 | 
13 | ### Genre Genie - Movie Genre Predictions
14 | 
15 | As a huge fan of movies and the information associated with them, I have spent countless hours on the Internet Movie Database (<a href="https://www.imdb.com/" target="_blank">IMDb</a>) looking up movie trivia and box office figures, following people through film (degrees of Kevin Bacon), and exploring movies similar to my favourites. Like many, I prefer some genres to others. What makes a genre? Which words can ‘define’ a genre? How can data answer this question? These burning questions gave rise to Genre Genie.
16 | 
17 | Genre Genie has been trained on information from 30,000 movies - including brief plot summaries - with the goal of accurately predicting the genres most associated with new, unseen plot summaries. I set out to tackle this multi-label classification problem using web scraping, natural language processing (NLP) and machine learning. A secondary goal was to build an interactive element for display and demonstration.
18 | 
19 | ### Multi-label Classification
20 | 
21 | During my computer science undergraduate degree, I was in a database class learning about 'many-to-many' relationships – a relationship that is exemplified well by movies and genres. Simply put, a movie can have many genres associated with it, and a genre can be associated with many movies (hence 'many-to-many'). This differs from the more common multi-class (or binary) classification, which is 'one-to-many'. Take image classification of animals, for example: dog, cat, horse, etc. An image can’t be classified as a dog *and* a horse, it can only be one – that is multi-*class* classification, because the classes are mutually exclusive. Movie genres are multi-*label* and are **not** mutually exclusive – as best seen by the success of romantic comedies.
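
To make the distinction concrete, here is a minimal sketch of how a multi-label target can be encoded: one 0/1 indicator per genre, with any number of them set at once. The three-genre list and the function name `encode_multilabel` are illustrative assumptions, not taken from the project notebooks.

```python
# Hypothetical three-genre vocabulary, for illustration only.
GENRES = ["action", "comedy", "romance"]


def encode_multilabel(movie_genres, genres=GENRES):
    """Return one 0/1 indicator per genre; several can be 1 at once."""
    tags = set(movie_genres)
    return [1 if g in tags else 0 for g in genres]


# A romantic comedy switches on two labels, which a multi-class
# target (exactly one class per sample) could not represent.
rom_com = encode_multilabel(["romance", "comedy"])
print(rom_com)  # [0, 1, 1]
```

A multi-class encoding of the same movie would have to pick a single genre, which is exactly the information a multi-label target keeps.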
22 | 
23 | While multi-label classification comes with its own set of problems, it still overlaps heavily with multi-class classification. In the articles I sourced on multi-label classification with machine learning[<a href="https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5" target="_blank">1</a>][<a href="https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff" target="_blank">2</a>], a popular method is the ‘OneVsRest’ approach, which is discussed in more detail later. Multi-label classification has many real-world applications, especially in text classification: news articles, blog websites or, as in the sourced articles, toxicity levels in comments.
24 | 
25 | ### The Data
26 | 
27 | Finding data for this problem proved interesting, and the challenge was welcome. IMDb has its own <a href="https://www.imdb.com/interfaces/" target="_blank">sets of open data</a>, which provided a great starting point, but lacked some major components needed to solve this problem. Most importantly, no text data (other than titles) was present to utilize NLP. Additionally, the dataset lists only the first 3 genres (alphabetically) per movie, rather than the full set shown on IMDb.com - where a movie can have up to 7 genres.
28 | 
29 | Fortunately, this dataset has IMDb IDs (aka `tconst`) which can be extracted and used to scrape IMDb directly through its simple URL construction - <a href="https://www.imdb.com/title/tt0076759/" target="_blank">imdb.com/title/tt0076759/</a> where `tt0076759` is the `tconst`. I fetched the `tconst` values of over 30,000 movies, each with more than 1,000 rating votes and released in the last 100 years (1920-2019). The vote threshold rests on the assumption that the IMDb data for a movie with that many rating votes can be presumed accurate. I chose to go back 100 years for a representative sample distribution with variety across all genres, storylines and time periods.
30 | 
31 | Armed with 30,000+ IMDb IDs, I scraped IMDb for more information than required, seeking to have anything that could aid the model in making better predictions. Attempts at scraping a very long plot synopsis, potentially yielding more text to train on, were halted when 2/3 of scrapes returned null. I used a shorter summary instead, which averages about 90 words and has no null values. This text and the additional numeric values scraped were evaluated for feature selection.
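
The selection and URL construction described above can be sketched as follows. The record fields (`num_votes`, `year`) and the sample values are hypothetical; only the URL pattern and the two thresholds come from the text.

```python
def title_url(tconst):
    """Build the IMDb title page URL from a tconst such as 'tt0076759'."""
    return f"https://www.imdb.com/title/{tconst}/"


def keep(movie):
    """Apply the two selection rules: more than 1,000 votes, released 1920-2019."""
    return movie["num_votes"] > 1000 and 1920 <= movie["year"] <= 2019


# Made-up records; the field names are assumptions for this sketch.
movies = [
    {"tconst": "tt0076759", "num_votes": 1200000, "year": 1977},
    {"tconst": "tt0000001", "num_votes": 1500, "year": 1894},
    {"tconst": "tt9999999", "num_votes": 12, "year": 2015},
]

urls = [title_url(m["tconst"]) for m in movies if keep(m)]
print(urls)  # ['https://www.imdb.com/title/tt0076759/']
```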
32 | 
33 | ### EDA and Feature Selection
34 | 
35 | After scraping, some features just didn’t have enough data. For example, the MPAA rating (the content rating of a movie) was missing for over 5,000 entries. After some research, I found that the modern MPAA rating system didn't yet exist for many of the movies in my dataset (these often just say 'passed' rather than a rating such as 'PG-13'). Additionally, the Metacritic rating was null for about half the data. These two features, along with the long synopsis mentioned above, were dropped altogether.
36 | 
37 | I feature-engineered a word count (of the plot summary) and the release month of each movie. In the end, prior to pre-processing the text for NLP, I had 1 text column and 8 numerical features: release year, release month, runtime, word count, IMDb rating, number of IMDb rating votes, number of user reviews and number of critic reviews.
38 | 
39 | The genres, our target variables, displayed some interesting trends, but they were not evenly represented. For example, of all 25 genre tags, the top 4 (drama, comedy, thriller, romance) made up over 50% of all the genre tags (more on this in results). Movies associated with the bottom 3 (game-show, news, adult) were dropped altogether or absorbed into another genre. For example, the 'news' genre proved redundant as it was always shared with 'documentary'. This left 22 genres.
40 | 
41 | ![](images/genre-counts-graph.png)
42 | 
43 | ### NLP Pre-processing and TF-IDF
44 | 
45 | Once I had my final dataset, I needed to convert the text into numerical values for modelling, which meant pre-processing the text before vectorizing it. I used a tokenizer to break strings down into single words, removed stop-words, applied a stemmer to chop off the ends of words and a lemmatizer to reduce words to their base form, and finally used an un-tokenizer to join the words back into one string.
46 | 
47 | Once the text was pre-processed, Term Frequency-Inverse Document Frequency (TF-IDF) was used to transform the pre-processed text into vectorized numerical values. But first, and most importantly, the data must be split! 25% was set aside for a test set (75% train). The vectorizer used unigrams (single words only) and a minimum document frequency of 20: a word had to appear in at least 20 documents to count as a feature. Once the vectorizer had been fit on the train data, the text from both the train and test sets was transformed, then combined with the remaining features (including targets). Finally, I exported separate ‘train’ and ‘test’ files for use in modeling.
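
The shape of that pipeline can be sketched with the standard library alone. This is a simplified stand-in, not the project's code: the regex tokenizer (which also lowercases), the five-word stop list and the suffix-stripping `toy_stem` are assumptions that take the place of NLTK's `word_tokenize`, its English stop-word list, `PorterStemmer` and `WordNetLemmatizer`.

```python
import re

# Tiny illustrative stop-word list; a real English list is far longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is"}


def tokenize(text):
    """Stand-in tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z']+", text.lower())


def toy_stem(token):
    """Stand-in stemmer: strip one of a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def clean_string(text, stem=toy_stem):
    """Tokenize, drop stop-words, stem, then join back into one string."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(t) for t in tokens]
    return " ".join(tokens)


print(clean_string("The rebels escaped in the ships"))  # rebel escap ship
```

Passing the stemmer in as a parameter keeps the pipeline order fixed while letting a real stemmer or lemmatizer replace the toy one.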
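
Two ideas in this step are easy to get wrong: the vocabulary is learned from the train split only, and rare words are dropped by a minimum document frequency. The sketch below illustrates both in plain Python. It is not the project's vectorizer: it omits the row normalisation a library vectorizer would normally apply, uses a toy threshold of 2 instead of 20, and the helper names are made up for this example.

```python
import math
from collections import Counter


def fit_vocabulary(train_docs, min_df):
    """Learn IDF weights for words found in at least `min_df` train documents."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    n_docs = len(train_docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {
        word: math.log((1 + n_docs) / (1 + df)) + 1
        for word, df in sorted(doc_freq.items())
        if df >= min_df
    }


def transform(doc, idf):
    """Weight one document's word counts, ignoring out-of-vocabulary words."""
    counts = Counter(w for w in doc.split() if w in idf)
    return {w: counts[w] * idf[w] for w in counts}


train_docs = ["jedi fight empire", "jedi train", "empire strike jedi"]
test_docs = ["jedi meet wookiee"]

idf = fit_vocabulary(train_docs, min_df=2)   # fit on the train split only
test_vectors = [transform(d, idf) for d in test_docs]

print(sorted(idf))      # ['empire', 'jedi']
print(test_vectors[0])  # {'jedi': 1.0}
```

Words that appear only in the test split ("meet", "wookiee") never become features, which is the point of fitting before transforming.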
48 | 
49 | ### Modeling with OneVsRest
50 | 
51 | Prior to fitting, as with most modeling, scaling was required. Both standard and min-max scalers were tested; the standard scaler produced better results.
52 | 
53 | When using a multi-label classifier, the target variable is not just one column, but many. My approach considered each label individually by fitting an independent model to each of the labels. This process is simplified using the 'OneVsRest' classifier, which takes one model type (logistic regression) as a parameter and is then fitted on the entire multi-dimensional target array.
54 | 
55 | OneVsRest takes each target column and evaluates it independently of the others (is it action, or not, hence 'one vs rest'). Thus, when the trained model predicts, it can output more than one label. Logistic regression with tuned hyperparameters (`C=0.01`, `solver='lbfgs'`) proved to be the best model for this data set (vs Linear SVM).
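
As a rough sketch of this setup (not the notebook code), the example below scales the features and fits scikit-learn's `OneVsRestClassifier` around a `LogisticRegression` with the hyperparameters quoted above. The random data, the shapes and the variable names are assumptions for illustration; the real features were the TF-IDF columns plus the numerical features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 5 features, 3 binary "genre" columns.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
Y_train = (X_train[:, :3] + 0.5 * rng.normal(size=(200, 3)) > 0).astype(int)
X_test = rng.normal(size=(50, 5))

# Fit the scaler on the train set only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# One independent logistic regression is fitted per target column.
model = OneVsRestClassifier(LogisticRegression(C=0.01, solver="lbfgs"))
model.fit(X_train_scaled, Y_train)

# Each row of the prediction can contain any number of 1s (labels).
Y_pred = model.predict(X_test_scaled)
print(Y_pred[:5])
```

After fitting, `model.estimators_` holds one fitted logistic regression per genre column, which is what makes the per-genre results independent of each other.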
56 | 
57 | ### Results
58 | 
59 | Similar to how the model is fitted one genre at a time, the results in the chart below are also independent of each other. For each genre, the predicted values are checked against the test set targets to give an 'accuracy score'. As shown in the chart, the accuracy score is much higher for genres which are less represented (right-most), and lower for genres which have greater representation (left-most). The higher accuracies could be attributed to having more distinct words/data to identify that genre; conversely, the lower accuracies could come from too much overlap with other genres. Overall, the average genre accuracy was 90% - a score I consider 'good enough' for this application.
60 | 
61 | ![](images/results-graph.png)
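
The per-genre accuracy and its average can be computed along these lines. This is a sketch with made-up arrays (`y_true`, `y_pred`) and NumPy only, not the project's actual evaluation code.

```python
import numpy as np

# Hypothetical test targets and predictions: 4 movies (rows), 3 genres (columns).
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0],
                   [1, 1, 1]])

# Accuracy per genre: share of movies where that genre's 0/1 value matches.
genre_accuracy = (y_true == y_pred).mean(axis=0)

# Average of the per-genre accuracies (the kind of figure quoted as 90% above).
average_accuracy = genre_accuracy.mean()

print(genre_accuracy)    # [0.75 0.5  1.  ]
print(average_accuracy)  # 0.75
```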
62 | 
63 | ### Demo with Flask
64 | 
65 | While outside the scope of my initial goal, I wanted a fun way of demonstrating my model to the public on demo day. I created a Flask app where a user can input text (a plot summary) and the model will predict the genres most associated with that input text. All of the demo code is included in this project, and the app is temporarily available to demo here: http://tomkeith.pythonanywhere.com/
66 | 
67 | <a href="http://tomkeith.pythonanywhere.com/" target="_blank"><img src="images/app.png" alt="Genre Genie app" width="200"/></a>
68 | 
69 | ### Next Steps
70 | 
71 | This is an unbalanced dataset, and there are many options for dealing with unbalanced datasets. Next steps could handle the imbalance more deliberately than relying on logistic regression alone, which copes with it decently well. Over/under sampling, different scoring methods such as F1 score, precision and recall, and a review of each of the 22 genres' confusion matrices would give a better understanding of how accurate the predictions are. Finally, multi-label classification using NLP could be applied to a more business-oriented situation, such as classifying customer complaints or articles/blogs.
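
To illustrate why these extra metrics matter, here is a small sketch using scikit-learn's `recall_score`, `f1_score` and `multilabel_confusion_matrix`. The arrays are invented: one common genre and one rare genre that the hypothetical model never predicts.

```python
import numpy as np
from sklearn.metrics import f1_score, multilabel_confusion_matrix, recall_score

# Hypothetical targets for 10 movies: column 0 is a common genre, column 1 a rare one.
y_true = np.column_stack([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
# Hypothetical predictions: the rare genre is never predicted.
y_pred = np.column_stack([[1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
                          [0] * 10])

accuracy = (y_true == y_pred).mean(axis=0)
recall = recall_score(y_true, y_pred, average=None, zero_division=0)
f1 = f1_score(y_true, y_pred, average=None, zero_division=0)

# One 2x2 matrix per genre, laid out as [[TN, FP], [FN, TP]].
matrices = multilabel_confusion_matrix(y_true, y_pred)

print(accuracy)     # [0.7 0.9]
print(recall)       # [0.6 0. ]
print(matrices[1])  # [[9 0]
                    #  [1 0]]
```

In this toy case the rare genre scores 90% accuracy while its recall and F1 are both zero, which is the kind of gap that per-genre confusion matrices would expose.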
72 | 
73 | ---
74 | 
75 | ## PROJECT FILES AND FOLDERS
76 | 
77 | ### Files
78 | 
79 | - **`1.1-imdb-datasets.ipynb`** - Explore IMDb datasets
80 | - **`1.2-scraping-imdb.ipynb`** - Web scraper designed to scrape IMDb.com titles and export .tsv files.
81 | - **`1.3-data-merge-clean-encode.ipynb`** - Merge scraped data from IMDb, clean, and binary encode the genres (from list format).
82 | - **`2.1-eda.ipynb`** - EDA, feature engineering and prepare dataset for NLP preprocessing.
83 | - **`2.2-data-preprocessing.ipynb`** - Pre-process text data, split into train and test sets, TF-IDF.
84 | - **`3.1-modeling.ipynb`** - Fit and optimize multi-label classification model (OneVsRest) and measure accuracy.
85 | - **`3.2-best-model.ipynb`** - Create final model from optimized hyperparameters on full dataset.
86 | - **`3.3-wordclouds.ipynb`** - Bonus workbook only used to create a wordcloud in my presentation.
87 | 
88 | ### Folders
89 | 
90 | - **`/data/`** - **Not all data has been included** due to file size limits. Full project with <a href="https://www.dropbox.com/sh/na8dzbpic4mo8fe/AABdryo23KpN6gtluH662xwha?dl=0" target="_blank">data files here</a>.
91 | - **`/demo/`** - Local Flask app (run with predict.py). Alternatively, visit: http://tomkeith.pythonanywhere.com/
92 | - **`/images/`** - Contains misc images used in report and workbooks.
93 | - **`/models/`** - Contains only the best models (.pkl) and a few samples.
94 | - **`/rawdata/`** - Contains 100 .tsv files of movie data, one for each year.
95 | 
96 | ---
97 | 


--------------------------------------------------------------------------------