├── .gitattributes
├── .gitignore
├── 1.1-imdb-datasets.ipynb
├── 1.2-scraping-imdb.ipynb
├── 1.3-data-merge-clean-encode.ipynb
├── 2.1-eda.ipynb
├── 2.2-data-preprocessing.ipynb
├── 3.1-modeling.ipynb
├── 3.2-best-model.ipynb
├── 3.3-wordclouds.ipynb
├── data
    ├── clean_df.tsv
    ├── encoded_genres.tsv
    └── imdb_movie_list.csv
├── demo
    ├── models
    │   ├── my_best_model.pkl
    │   ├── my_best_scaler.pkl
    │   └── my_best_tfidf.pkl
    ├── predict.py
    ├── static
    │   ├── css
    │   │   └── styles.css
    │   └── images
    │   │   ├── Action.png
    │   │   ├── Adventure.png
    │   │   ├── Animation.png
    │   │   ├── Biography.png
    │   │   ├── Comedy.png
    │   │   ├── Crime.png
    │   │   ├── Documentary.png
    │   │   ├── Drama.png
    │   │   ├── Family.png
    │   │   ├── Fantasy.png
    │   │   ├── Film-noir.png
    │   │   ├── History.png
    │   │   ├── Horror.png
    │   │   ├── Music.png
    │   │   ├── Musical.png
    │   │   ├── Mystery.png
    │   │   ├── Romance.png
    │   │   ├── Sci-fi.png
    │   │   ├── Sport.png
    │   │   ├── Thriller.png
    │   │   ├── War.png
    │   │   ├── Western.png
    │   │   └── magic-lamp.png
    ├── templates
    │   └── predict.html
    └── train_medians.csv
├── images
    ├── app.png
    ├── app2.png
    ├── genre-counts-graph.png
    ├── imdb-bottom.png
    ├── imdb-top.png
    ├── results-graph.png
    └── wc_img.png
├── models
    ├── my_1vr_logreg_0.01.pkl
    ├── my_1vr_logreg_default.pkl
    ├── my_best_model.pkl
    ├── my_best_scaler.pkl
    ├── my_best_tfidf.pkl
    ├── my_minmax_scaler.pkl
    ├── my_standard_scaler.pkl
    └── my_tfidf_min20.pkl
└── readme.md

/.gitattributes:
--------------------------------------------------------------------------------
1 | *.tsv filter=lfs diff=lfs merge=lfs -text
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .tsv
--------------------------------------------------------------------------------
/1.1-imdb-datasets.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Genre Genie - 
Multi-label Classification with NLP\n", 8 | "### Part 1.1: IMDb dataset\n", 9 | "\n", 10 | "#### Tom Keith\n", 11 | "\n", 12 | "---\n", 13 | "\n", 14 | "**Goal:** Explore IMDb datasets." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "IMDb offers datasets with loads of information. I explored these sets to see what I could use while still figuring out what direction to go with this project.\n", 22 | "\n", 23 | "More information on these datasets and the features they have can be found here: https://www.imdb.com/interfaces/\n", 24 | "\n", 25 | "I will be using the direct links for the compressed `.tsv` files: https://datasets.imdbws.com/\n", 26 | "\n", 27 | "---" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 48, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "import pandas as pd\n", 37 | "import numpy as np\n", 38 | "import matplotlib.pyplot as plt\n", 39 | "from PIL import Image\n", 40 | "\n", 41 | "pd.set_option('display.max_rows', 200)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "---\n", 49 | "**Exploring IMDb Datasets**\n", 50 | "\n", 51 | "Loop through the files to peek at what each dataset looks like. While these files are updated daily, the data used throughout this project was fetched on February 2, 2020.\n", 52 | "\n", 53 | "An important note is that `NULL` values are represented as `\\N` in these sets." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 47, 59 | "metadata": { 60 | "scrolled": true 61 | }, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/html": [ 66 | "
\n", 67 | "\n", 80 | "\n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | "
\n", 230 | "

6672694 rows × 9 columns

\n", 231 | "
" 232 | ], 233 | "text/plain": [ 234 | " tconst titleType primaryTitle \\\n", 235 | "0 tt0000001 short Carmencita \n", 236 | "1 tt0000002 short Le clown et ses chiens \n", 237 | "2 tt0000003 short Pauvre Pierrot \n", 238 | "3 tt0000004 short Un bon bock \n", 239 | "4 tt0000005 short Blacksmith Scene \n", 240 | "... ... ... ... \n", 241 | "6672689 tt9916848 tvEpisode Episode #3.17 \n", 242 | "6672690 tt9916850 tvEpisode Episode #3.19 \n", 243 | "6672691 tt9916852 tvEpisode Episode #3.20 \n", 244 | "6672692 tt9916856 short The Wind \n", 245 | "6672693 tt9916880 tvEpisode Horrid Henry Knows It All \n", 246 | "\n", 247 | " originalTitle isAdult startYear endYear runtimeMinutes \\\n", 248 | "0 Carmencita 0 1894 \\N 1 \n", 249 | "1 Le clown et ses chiens 0 1892 \\N 5 \n", 250 | "2 Pauvre Pierrot 0 1892 \\N 4 \n", 251 | "3 Un bon bock 0 1892 \\N 12 \n", 252 | "4 Blacksmith Scene 0 1893 \\N 1 \n", 253 | "... ... ... ... ... ... \n", 254 | "6672689 Episode #3.17 0 2010 \\N \\N \n", 255 | "6672690 Episode #3.19 0 2010 \\N \\N \n", 256 | "6672691 Episode #3.20 0 2010 \\N \\N \n", 257 | "6672692 The Wind 0 2015 \\N 27 \n", 258 | "6672693 Horrid Henry Knows It All 0 2014 \\N 10 \n", 259 | "\n", 260 | " genres \n", 261 | "0 Documentary,Short \n", 262 | "1 Animation,Short \n", 263 | "2 Animation,Comedy,Romance \n", 264 | "3 Animation,Short \n", 265 | "4 Comedy,Short \n", 266 | "... ... \n", 267 | "6672689 Action,Drama,Family \n", 268 | "6672690 Action,Drama,Family \n", 269 | "6672691 Action,Drama,Family \n", 270 | "6672692 Short \n", 271 | "6672693 Animation,Comedy,Family \n", 272 | "\n", 273 | "[6672694 rows x 9 columns]" 274 | ] 275 | }, 276 | "metadata": {}, 277 | "output_type": "display_data" 278 | }, 279 | { 280 | "data": { 281 | "text/html": [ 282 | "
\n", 283 | "\n", 296 | "\n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | "
\n", 374 | "

1019006 rows × 3 columns

\n", 375 | "
" 376 | ], 377 | "text/plain": [ 378 | " tconst averageRating numVotes\n", 379 | "0 tt0000001 5.6 1591\n", 380 | "1 tt0000002 6.1 194\n", 381 | "2 tt0000003 6.5 1264\n", 382 | "3 tt0000004 6.2 120\n", 383 | "4 tt0000005 6.1 2025\n", 384 | "... ... ... ...\n", 385 | "1019001 tt9916576 6.0 9\n", 386 | "1019002 tt9916578 8.5 16\n", 387 | "1019003 tt9916720 5.5 48\n", 388 | "1019004 tt9916766 6.8 13\n", 389 | "1019005 tt9916778 7.2 20\n", 390 | "\n", 391 | "[1019006 rows x 3 columns]" 392 | ] 393 | }, 394 | "metadata": {}, 395 | "output_type": "display_data" 396 | }, 397 | { 398 | "data": { 399 | "text/html": [ 400 | "
\n", 401 | "\n", 414 | "\n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | "
\n", 528 | "

9982876 rows × 6 columns

\n", 529 | "
" 530 | ], 531 | "text/plain": [ 532 | " nconst primaryName birthYear deathYear \\\n", 533 | "0 nm0000001 Fred Astaire 1899 1987 \n", 534 | "1 nm0000002 Lauren Bacall 1924 2014 \n", 535 | "2 nm0000003 Brigitte Bardot 1934 \\N \n", 536 | "3 nm0000004 John Belushi 1949 1982 \n", 537 | "4 nm0000005 Ingmar Bergman 1918 2007 \n", 538 | "... ... ... ... ... \n", 539 | "9982871 nm9993714 Romeo del Rosario \\N \\N \n", 540 | "9982872 nm9993716 Essias Loberg \\N \\N \n", 541 | "9982873 nm9993717 Harikrishnan Rajan \\N \\N \n", 542 | "9982874 nm9993718 Aayush Nair \\N \\N \n", 543 | "9982875 nm9993719 Andre Hill \\N \\N \n", 544 | "\n", 545 | " primaryProfession \\\n", 546 | "0 soundtrack,actor,miscellaneous \n", 547 | "1 actress,soundtrack \n", 548 | "2 actress,soundtrack,producer \n", 549 | "3 actor,soundtrack,writer \n", 550 | "4 writer,director,actor \n", 551 | "... ... \n", 552 | "9982871 animation_department,art_department \n", 553 | "9982872 NaN \n", 554 | "9982873 cinematographer \n", 555 | "9982874 cinematographer \n", 556 | "9982875 NaN \n", 557 | "\n", 558 | " knownForTitles \n", 559 | "0 tt0050419,tt0072308,tt0053137,tt0043044 \n", 560 | "1 tt0117057,tt0038355,tt0037382,tt0071877 \n", 561 | "2 tt0054452,tt0057345,tt0059956,tt0049189 \n", 562 | "3 tt0072562,tt0077975,tt0080455,tt0078723 \n", 563 | "4 tt0069467,tt0050986,tt0050976,tt0083922 \n", 564 | "... ... \n", 565 | "9982871 tt2455546 \n", 566 | "9982872 \\N \n", 567 | "9982873 tt8736744 \n", 568 | "9982874 \\N \n", 569 | "9982875 \\N \n", 570 | "\n", 571 | "[9982876 rows x 6 columns]" 572 | ] 573 | }, 574 | "metadata": {}, 575 | "output_type": "display_data" 576 | }, 577 | { 578 | "data": { 579 | "text/html": [ 580 | "
\n", 581 | "\n", 594 | "\n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | "
\n", 708 | "

38538898 rows × 6 columns

\n", 709 | "
" 710 | ], 711 | "text/plain": [ 712 | " tconst ordering nconst category \\\n", 713 | "0 tt0000001 1 nm1588970 self \n", 714 | "1 tt0000001 2 nm0005690 director \n", 715 | "2 tt0000001 3 nm0374658 cinematographer \n", 716 | "3 tt0000002 1 nm0721526 director \n", 717 | "4 tt0000002 2 nm1335271 composer \n", 718 | "... ... ... ... ... \n", 719 | "38538893 tt9916880 5 nm0996406 director \n", 720 | "38538894 tt9916880 6 nm1482639 writer \n", 721 | "38538895 tt9916880 7 nm2586970 writer \n", 722 | "38538896 tt9916880 8 nm1594058 producer \n", 723 | "38538897 tt9916880 9 nm2676923 actress \n", 724 | "\n", 725 | " job \\\n", 726 | "0 \\N \n", 727 | "1 \\N \n", 728 | "2 director of photography \n", 729 | "3 \\N \n", 730 | "4 \\N \n", 731 | "... ... \n", 732 | "38538893 principal director \n", 733 | "38538894 \\N \n", 734 | "38538895 books \n", 735 | "38538896 producer \n", 736 | "38538897 \\N \n", 737 | "\n", 738 | " characters \n", 739 | "0 [\"Self\"] \n", 740 | "1 \\N \n", 741 | "2 \\N \n", 742 | "3 \\N \n", 743 | "4 \\N \n", 744 | "... ... \n", 745 | "38538893 \\N \n", 746 | "38538894 \\N \n", 747 | "38538895 \\N \n", 748 | "38538896 \\N \n", 749 | "38538897 [\"Sour Susan\",\"Goody-Goody Gordon\",\"Singing So... \n", 750 | "\n", 751 | "[38538898 rows x 6 columns]" 752 | ] 753 | }, 754 | "metadata": {}, 755 | "output_type": "display_data" 756 | }, 757 | { 758 | "data": { 759 | "text/html": [ 760 | "
\n", 761 | "\n", 774 | "\n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | "
\n", 852 | "

6672694 rows × 3 columns

\n", 853 | "
" 854 | ], 855 | "text/plain": [ 856 | " tconst directors writers\n", 857 | "0 tt0000001 nm0005690 \\N\n", 858 | "1 tt0000002 nm0721526 \\N\n", 859 | "2 tt0000003 nm0721526 \\N\n", 860 | "3 tt0000004 nm0721526 \\N\n", 861 | "4 tt0000005 nm0005690 \\N\n", 862 | "... ... ... ...\n", 863 | "6672689 tt9916848 nm5519454,nm5519375 nm6182221,nm1628284,nm2921377\n", 864 | "6672690 tt9916850 nm5519375,nm5519454 nm6182221,nm1628284,nm2921377\n", 865 | "6672691 tt9916852 nm5519375,nm5519454 nm6182221,nm1628284,nm2921377\n", 866 | "6672692 tt9916856 nm10538645 nm6951431\n", 867 | "6672693 tt9916880 nm0996406 nm1482639,nm2586970\n", 868 | "\n", 869 | "[6672694 rows x 3 columns]" 870 | ] 871 | }, 872 | "metadata": {}, 873 | "output_type": "display_data" 874 | }, 875 | { 876 | "data": { 877 | "text/html": [ 878 | "
\n", 879 | "\n", 892 | "\n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | "
\n", 1030 | "

20839953 rows × 8 columns

\n", 1031 | "
" 1032 | ], 1033 | "text/plain": [ 1034 | " titleId ordering title region language \\\n", 1035 | "0 tt0000001 1 Carmencita DE \\N \n", 1036 | "1 tt0000001 2 Carmencita - spanyol tánc HU \\N \n", 1037 | "2 tt0000001 3 Καρμενσίτα GR \\N \n", 1038 | "3 tt0000001 4 Карменсита RU \\N \n", 1039 | "4 tt0000001 5 Carmencita US \\N \n", 1040 | "... ... ... ... ... ... \n", 1041 | "20839948 tt9916852 3 Folge #3.20 DE de \n", 1042 | "20839949 tt9916852 4 エピソード #3.20 JP ja \n", 1043 | "20839950 tt9916852 5 Episódio #3.20 PT pt \n", 1044 | "20839951 tt9916852 6 Episodio #3.20 IT it \n", 1045 | "20839952 tt9916852 7 एपिसोड #3.20 IN hi \n", 1046 | "\n", 1047 | " types attributes isOriginalTitle \n", 1048 | "0 \\N literal title 0 \n", 1049 | "1 imdbDisplay \\N 0 \n", 1050 | "2 imdbDisplay \\N 0 \n", 1051 | "3 imdbDisplay \\N 0 \n", 1052 | "4 \\N \\N 0 \n", 1053 | "... ... ... ... \n", 1054 | "20839948 \\N \\N 0 \n", 1055 | "20839949 \\N \\N 0 \n", 1056 | "20839950 \\N \\N 0 \n", 1057 | "20839951 \\N \\N 0 \n", 1058 | "20839952 \\N \\N 0 \n", 1059 | "\n", 1060 | "[20839953 rows x 8 columns]" 1061 | ] 1062 | }, 1063 | "metadata": {}, 1064 | "output_type": "display_data" 1065 | }, 1066 | { 1067 | "data": { 1068 | "text/html": [ 1069 | "
\n", 1070 | "\n", 1083 | "\n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | "
\n", 1173 | "

4737125 rows × 4 columns

\n", 1174 | "
" 1175 | ], 1176 | "text/plain": [ 1177 | " tconst parentTconst seasonNumber episodeNumber\n", 1178 | "0 tt0041951 tt0041038 1 9\n", 1179 | "1 tt0042816 tt0989125 1 17\n", 1180 | "2 tt0042889 tt0989125 \\N \\N\n", 1181 | "3 tt0043426 tt0040051 3 42\n", 1182 | "4 tt0043631 tt0989125 2 16\n", 1183 | "... ... ... ... ...\n", 1184 | "4737120 tt9916846 tt1289683 3 18\n", 1185 | "4737121 tt9916848 tt1289683 3 17\n", 1186 | "4737122 tt9916850 tt1289683 3 19\n", 1187 | "4737123 tt9916852 tt1289683 3 20\n", 1188 | "4737124 tt9916880 tt0985991 4 2\n", 1189 | "\n", 1190 | "[4737125 rows x 4 columns]" 1191 | ] 1192 | }, 1193 | "metadata": {}, 1194 | "output_type": "display_data" 1195 | }, 1196 | { 1197 | "name": "stdout", 1198 | "output_type": "stream", 1199 | "text": [ 1200 | "Wall time: 2min 40s\n" 1201 | ] 1202 | } 1203 | ], 1204 | "source": [ 1205 | "%%time\n", 1206 | "imdb_api_file_list = ['title.basics.tsv.gz','title.ratings.tsv.gz','name.basics.tsv.gz','title.principals.tsv.gz','title.crew.tsv.gz','title.akas.tsv.gz','title.episode.tsv.gz']\n", 1207 | "\n", 1208 | "for package in imdb_api_file_list:\n", 1209 | " package_file_name = f'https://datasets.imdbws.com/{package}'\n", 1210 | " display(pd.read_csv(package_file_name, sep='\\t', low_memory=False))" 1211 | ] 1212 | }, 1213 | { 1214 | "cell_type": "markdown", 1215 | "metadata": {}, 1216 | "source": [ 1217 | "There is a lot of great data here! However, there isn't much text to work with - no plot summaries or even taglines. I will have to scrape for that information.\n", 1218 | "\n", 1219 | "I only want movie results (no TV or people), so I'm going to focus on `title.basics.tsv.gz` and `title.ratings.tsv.gz`.\n", 1220 | "\n", 1221 | "---\n", 1222 | "\n", 1223 | "Save `title.basics` and `title.ratings` into their own dataframes, merge them on `tconst`, and explore." 
1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": 12, 1229 | "metadata": {}, 1230 | "outputs": [ 1231 | { 1232 | "data": { 1233 | "text/html": [ 1234 | "
\n", 1235 | "\n", 1248 | "\n", 1249 | " \n", 1250 | " \n", 1251 | " \n", 1252 | " \n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " 
\n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | "
\n", 1422 | "

1019004 rows × 11 columns

\n", 1423 | "
" 1424 | ], 1425 | "text/plain": [ 1426 | " tconst titleType primaryTitle \\\n", 1427 | "0 tt0000001 short Carmencita \n", 1428 | "1 tt0000002 short Le clown et ses chiens \n", 1429 | "2 tt0000003 short Pauvre Pierrot \n", 1430 | "3 tt0000004 short Un bon bock \n", 1431 | "4 tt0000005 short Blacksmith Scene \n", 1432 | "... ... ... ... \n", 1433 | "1018999 tt9916576 tvEpisode Destinee's Story \n", 1434 | "1019000 tt9916578 tvEpisode The Trial of Joan Collins \n", 1435 | "1019001 tt9916720 short The Nun 2 \n", 1436 | "1019002 tt9916766 tvEpisode Episode #10.15 \n", 1437 | "1019003 tt9916778 tvEpisode Escape \n", 1438 | "\n", 1439 | " originalTitle isAdult startYear endYear runtimeMinutes \\\n", 1440 | "0 Carmencita 0 1894 \\N 1 \n", 1441 | "1 Le clown et ses chiens 0 1892 \\N 5 \n", 1442 | "2 Pauvre Pierrot 0 1892 \\N 4 \n", 1443 | "3 Un bon bock 0 1892 \\N 12 \n", 1444 | "4 Blacksmith Scene 0 1893 \\N 1 \n", 1445 | "... ... ... ... ... ... \n", 1446 | "1018999 Destinee's Story 0 2019 \\N 85 \n", 1447 | "1019000 The Trial of Joan Collins 0 2019 \\N \\N \n", 1448 | "1019001 The Nun 2 0 2019 \\N 10 \n", 1449 | "1019002 Episode #10.15 0 2019 \\N 43 \n", 1450 | "1019003 Escape 0 2019 \\N \\N \n", 1451 | "\n", 1452 | " genres averageRating numVotes \n", 1453 | "0 Documentary,Short 5.6 1591 \n", 1454 | "1 Animation,Short 6.1 194 \n", 1455 | "2 Animation,Comedy,Romance 6.5 1264 \n", 1456 | "3 Animation,Short 6.2 120 \n", 1457 | "4 Comedy,Short 6.1 2025 \n", 1458 | "... ... ... ... 
\n", 1459 | "1018999 Reality-TV 6.0 9 \n", 1460 | "1019000 Adventure,Biography,Comedy 8.5 16 \n", 1461 | "1019001 Comedy,Horror,Mystery 5.5 48 \n", 1462 | "1019002 Family,Reality-TV 6.8 13 \n", 1463 | "1019003 Drama 7.2 20 \n", 1464 | "\n", 1465 | "[1019004 rows x 11 columns]" 1466 | ] 1467 | }, 1468 | "execution_count": 12, 1469 | "metadata": {}, 1470 | "output_type": "execute_result" 1471 | } 1472 | ], 1473 | "source": [ 1474 | "df_basics = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', sep='\\t', low_memory=False)\n", 1475 | "df_ratings = pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', sep='\\t', low_memory=False)\n", 1476 | "df_merge = pd.merge(df_basics, df_ratings, left_on='tconst', right_on='tconst')\n", 1477 | "df_merge" 1478 | ] 1479 | }, 1480 | { 1481 | "cell_type": "markdown", 1482 | "metadata": {}, 1483 | "source": [ 1484 | "We have 1,000,000+ rows with basic movie information, now including number of votes and rating. This additional information will help to pare down this dataframe.\n", 1485 | "\n", 1486 | "There are still a LOT of rows that aren't needed. For example, anything where the `titleType` is not \"movie\" can be ignored." 
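The filter described above can be sketched without pandas. This is a minimal, dependency-free illustration on a small sample in the same tab-separated layout; the third row, the constant names, and the helper `load_titles` are assumptions made up for the example, not part of the notebook. It treats the `\N` marker as missing and keeps only released rows whose `titleType` is `movie`. In pandas, passing `na_values='\\N'` to `pd.read_csv` should give similar null handling at load time.

```python
import csv
import io

# Hypothetical sample in the same tab-separated layout as the IMDb title files
SAMPLE_TSV = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tshort\tCarmencita\t1894\n"
    "tt0000009\tmovie\tMiss Jerry\t1894\n"
    "tt9999999\tmovie\tUnreleased Example\t\\N\n"
)

NULL_MARKER = "\\N"  # IMDb writes missing values as a literal backslash-N


def load_titles(tsv_text):
    """Parse TSV text into dicts, mapping the IMDb null marker to None."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        {key: (None if value == NULL_MARKER else value) for key, value in row.items()}
        for row in reader
    ]


titles = load_titles(SAMPLE_TSV)

# Keep released movies only: titleType must be "movie" and startYear must be known
movies = [
    t for t in titles
    if t["titleType"] == "movie" and t["startYear"] is not None
]
print([m["primaryTitle"] for m in movies])
```

Converting the marker to a real missing value first keeps the later year and runtime conversions from tripping over a stray string.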
1487 | ] 1488 | }, 1489 | { 1490 | "cell_type": "code", 1491 | "execution_count": 13, 1492 | "metadata": {}, 1493 | "outputs": [ 1494 | { 1495 | "data": { 1496 | "text/plain": [ 1497 | "tconst object\n", 1498 | "titleType object\n", 1499 | "primaryTitle object\n", 1500 | "originalTitle object\n", 1501 | "isAdult int64\n", 1502 | "startYear object\n", 1503 | "endYear object\n", 1504 | "runtimeMinutes object\n", 1505 | "genres object\n", 1506 | "averageRating float64\n", 1507 | "numVotes int64\n", 1508 | "dtype: object" 1509 | ] 1510 | }, 1511 | "execution_count": 13, 1512 | "metadata": {}, 1513 | "output_type": "execute_result" 1514 | } 1515 | ], 1516 | "source": [ 1517 | "df_merge.dtypes" 1518 | ] 1519 | }, 1520 | { 1521 | "cell_type": "markdown", 1522 | "metadata": {}, 1523 | "source": [ 1524 | "The columns of our new dataframe are mostly `object` (string) type. Years and runtime should be integers, but we don't need to worry about that yet." 1525 | ] 1526 | }, 1527 | { 1528 | "cell_type": "markdown", 1529 | "metadata": {}, 1530 | "source": [ 1531 | "---\n", 1532 | "\n", 1533 | "Create a slimmed-down dataframe with the following changes:\n", 1534 | "- Remove unreleased titles (where `startYear` is NULL)\n", 1535 | "- Keep only type 'movie'\n", 1536 | "- Exclude adult films\n", 1537 | "- Drop `endYear` as it only applies to TV\n", 1538 | "- Rename `startYear` to `year` and move it near the front (third column)" 1539 | ] 1540 | }, 1541 | { 1542 | "cell_type": "code", 1543 | "execution_count": 17, 1544 | "metadata": {}, 1545 | "outputs": [ 1546 | { 1547 | "data": { 1548 | "text/html": [ 1549 | "
\n", 1550 | "\n", 1563 | "\n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " 
\n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | "
|  | tconst | titleType | year | primaryTitle | originalTitle | runtimeMinutes | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|---|---|---|
| 8 | tt0000009 | movie | 1894 | Miss Jerry | Miss Jerry | 45 | Romance | 5.4 | 89 |
| 144 | tt0000147 | movie | 1897 | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 20 | Documentary,News,Sport | 5.2 | 333 |
| 251 | tt0000335 | movie | 1900 | Soldiers of the Cross | Soldiers of the Cross | \\N | Biography,Drama | 6.1 | 40 |
| 327 | tt0000502 | movie | 1905 | Bohemios | Bohemios | 100 | \\N | 4.4 | 5 |
| 361 | tt0000574 | movie | 1906 | The Story of the Kelly Gang | The Story of the Kelly Gang | 70 | Biography,Crime,Drama | 6.1 | 562 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1018959 | tt9914942 | movie | 2019 | La vida sense la Sara Amat | La vida sense la Sara Amat | 74 | Drama | 6.7 | 76 |
| 1018974 | tt9915790 | movie | 2019 | Bobbyr Bondhura | Bobbyr Bondhura | \\N | Family | 7.6 | 13 |
| 1018987 | tt9916160 | movie | 2019 | Drømmeland | Drømmeland | 72 | Documentary | 6.6 | 36 |
| 1018996 | tt9916428 | movie | 2019 | The Secret of China | The Secret of China | \\N | Adventure,History,War | 3.3 | 11 |
| 1018997 | tt9916538 | movie | 2019 | Kuambil Lagi Hatiku | Kuambil Lagi Hatiku | 123 | Drama | 8.4 | 5 |
\n", 1713 | "

241980 rows × 9 columns

\n", 1714 | "
" 1715 | ], 1716 | "text/plain": [ 1717 | " tconst titleType year primaryTitle \\\n", 1718 | "8 tt0000009 movie 1894 Miss Jerry \n", 1719 | "144 tt0000147 movie 1897 The Corbett-Fitzsimmons Fight \n", 1720 | "251 tt0000335 movie 1900 Soldiers of the Cross \n", 1721 | "327 tt0000502 movie 1905 Bohemios \n", 1722 | "361 tt0000574 movie 1906 The Story of the Kelly Gang \n", 1723 | "... ... ... ... ... \n", 1724 | "1018959 tt9914942 movie 2019 La vida sense la Sara Amat \n", 1725 | "1018974 tt9915790 movie 2019 Bobbyr Bondhura \n", 1726 | "1018987 tt9916160 movie 2019 Drømmeland \n", 1727 | "1018996 tt9916428 movie 2019 The Secret of China \n", 1728 | "1018997 tt9916538 movie 2019 Kuambil Lagi Hatiku \n", 1729 | "\n", 1730 | " originalTitle runtimeMinutes genres \\\n", 1731 | "8 Miss Jerry 45 Romance \n", 1732 | "144 The Corbett-Fitzsimmons Fight 20 Documentary,News,Sport \n", 1733 | "251 Soldiers of the Cross \\N Biography,Drama \n", 1734 | "327 Bohemios 100 \\N \n", 1735 | "361 The Story of the Kelly Gang 70 Biography,Crime,Drama \n", 1736 | "... ... ... ... \n", 1737 | "1018959 La vida sense la Sara Amat 74 Drama \n", 1738 | "1018974 Bobbyr Bondhura \\N Family \n", 1739 | "1018987 Drømmeland 72 Documentary \n", 1740 | "1018996 The Secret of China \\N Adventure,History,War \n", 1741 | "1018997 Kuambil Lagi Hatiku 123 Drama \n", 1742 | "\n", 1743 | " averageRating numVotes \n", 1744 | "8 5.4 89 \n", 1745 | "144 5.2 333 \n", 1746 | "251 6.1 40 \n", 1747 | "327 4.4 5 \n", 1748 | "361 6.1 562 \n", 1749 | "... ... ... 
\n", 1750 | "1018959 6.7 76 \n", 1751 | "1018974 7.6 13 \n", 1752 | "1018987 6.6 36 \n", 1753 | "1018996 3.3 11 \n", 1754 | "1018997 8.4 5 \n", 1755 | "\n", 1756 | "[241980 rows x 9 columns]" 1757 | ] 1758 | }, 1759 | "execution_count": 17, 1760 | "metadata": {}, 1761 | "output_type": "execute_result" 1762 | } 1763 | ], 1764 | "source": [ 1765 | "df_slim = df_merge\n", 1766 | "# Remove unreleased, non-movies, adult\n", 1767 | "df_slim = df_slim.drop(df_slim[df_slim.startYear == '\\\\N'].index)\n", 1768 | "df_slim = df_slim[ (df_slim['titleType'] == 'movie' ) & (df_slim['isAdult'] == 0) ]\n", 1769 | "df_slim = df_slim.drop(['endYear', 'isAdult'], axis=1)\n", 1770 | "\n", 1771 | "# Reformat year column\n", 1772 | "df_slim.insert(loc=2, column='year', value=df_slim['startYear'])\n", 1773 | "df_slim = df_slim.drop(['startYear'], axis=1)\n", 1774 | "df_slim" 1775 | ] 1776 | }, 1777 | { 1778 | "cell_type": "code", 1779 | "execution_count": 65, 1780 | "metadata": {}, 1781 | "outputs": [ 1782 | { 1783 | "data": { 1784 | "text/plain": [ 1785 | "tconst 0\n", 1786 | "titleType 0\n", 1787 | "year 0\n", 1788 | "primaryTitle 0\n", 1789 | "originalTitle 0\n", 1790 | "runtimeMinutes 0\n", 1791 | "genres 0\n", 1792 | "averageRating 0\n", 1793 | "numVotes 0\n", 1794 | "dtype: int64" 1795 | ] 1796 | }, 1797 | "execution_count": 65, 1798 | "metadata": {}, 1799 | "output_type": "execute_result" 1800 | } 1801 | ], 1802 | "source": [ 1803 | "df_slim.isna().sum()" 1804 | ] 1805 | }, 1806 | { 1807 | "cell_type": "code", 1808 | "execution_count": 26, 1809 | "metadata": {}, 1810 | "outputs": [ 1811 | { 1812 | "data": { 1813 | "text/html": [ 1814 | "
\n", 1815 | "\n", 1828 | "\n", 1829 | " \n", 1830 | " \n", 1831 | " \n", 1832 | " \n", 1833 | " \n", 1834 | " \n", 1835 | " \n", 1836 | " \n", 1837 | " \n", 1838 | " \n", 1839 | " \n", 1840 | " \n", 1841 | " \n", 1842 | " \n", 1843 | " \n", 1844 | " \n", 1845 | " \n", 1846 | " \n", 1847 | " \n", 1848 | " \n", 1849 | " \n", 1850 | " \n", 1851 | " \n", 1852 | " \n", 1853 | " \n", 1854 | " \n", 1855 | " \n", 1856 | " \n", 1857 | " \n", 1858 | " \n", 1859 | " \n", 1860 | " \n", 1861 | " \n", 1862 | " \n", 1863 | " \n", 1864 | " \n", 1865 | " \n", 1866 | " \n", 1867 | " \n", 1868 | " \n", 1869 | "
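|  | tconst | titleType | year | primaryTitle | originalTitle | runtimeMinutes | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|---|---|---|
| 51794 | tt0076759 | movie | 1977 | Star Wars: Episode IV - A New Hope | Star Wars | 121 | Action,Adventure,Fantasy | 8.6 | 1170498 |
| 998631 | tt8946378 | movie | 2019 | Knives Out | Knives Out | 131 | Comedy,Crime,Drama | 8.0 | 140682 |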
\n", 1870 | "
" 1871 | ], 1872 | "text/plain": [ 1873 | " tconst titleType year primaryTitle \\\n", 1874 | "51794 tt0076759 movie 1977 Star Wars: Episode IV - A New Hope \n", 1875 | "998631 tt8946378 movie 2019 Knives Out \n", 1876 | "\n", 1877 | " originalTitle runtimeMinutes genres averageRating \\\n", 1878 | "51794 Star Wars 121 Action,Adventure,Fantasy 8.6 \n", 1879 | "998631 Knives Out 131 Comedy,Crime,Drama 8.0 \n", 1880 | "\n", 1881 | " numVotes \n", 1882 | "51794 1170498 \n", 1883 | "998631 140682 " 1884 | ] 1885 | }, 1886 | "execution_count": 26, 1887 | "metadata": {}, 1888 | "output_type": "execute_result" 1889 | } 1890 | ], 1891 | "source": [ 1892 | "dft = df_slim[(df_slim['tconst'].isin(['tt8946378','tt0076759']))]\n", 1893 | "dft" 1894 | ] 1895 | }, 1896 | { 1897 | "cell_type": "markdown", 1898 | "metadata": {}, 1899 | "source": [ 1900 | "Down to about 240,000 rows after those modifications.\n", 1901 | "\n", 1902 | "Pulled up a quick sample to check. I've watched both of these movies, the genres jump out to me.\n", 1903 | "- Star Wars missing Sci-fi\n", 1904 | "- Knives Out missing Mystery\n", 1905 | "\n", 1906 | "It turns out, genres are limited to a count of 3, and it's only the first 3 alphabetically. That is not reliable data to correctly classify genres. As you can see from the screenshots (https://www.imdb.com/title/tt0076759/), there are actually 4 genres on this page that associate with this title.\n", 1907 | "\n", 1908 | "I am going to scrape this page for all the information I want (storyline / plot summary, FULL genre list). 
All I need is IMDb's title ID - which is `tconst` in this dataset.\n", 1909 | "\n", 1910 | "![](images/imdb-top.png)\n", 1911 | "\n", 1912 | "![](images/imdb-bottom.png)\n", 1913 | "\n", 1914 | "---" 1915 | ] 1916 | }, 1917 | { 1918 | "cell_type": "markdown", 1919 | "metadata": {}, 1920 | "source": [ 1921 | "### New Plan:\n", 1922 | "#### Export the list of IMDb IDs so I can build the scraping URLs" 1923 | ] 1924 | }, 1925 | { 1926 | "cell_type": "code", 1927 | "execution_count": 51, 1928 | "metadata": {}, 1929 | "outputs": [ 1930 | { 1931 | "data": { 1932 | "text/plain": [ 1933 | "tconst object\n", 1934 | "titleType object\n", 1935 | "year object\n", 1936 | "primaryTitle object\n", 1937 | "originalTitle object\n", 1938 | "runtimeMinutes object\n", 1939 | "genres object\n", 1940 | "averageRating float64\n", 1941 | "numVotes int64\n", 1942 | "dtype: object" 1943 | ] 1944 | }, 1945 | "execution_count": 51, 1946 | "metadata": {}, 1947 | "output_type": "execute_result" 1948 | } 1949 | ], 1950 | "source": [ 1951 | "df_slim.dtypes" 1952 | ] 1953 | }, 1954 | { 1955 | "cell_type": "markdown", 1956 | "metadata": {}, 1957 | "source": [ 1958 | "I want to do integer comparisons on `year`, but it is currently an `object`. That needs fixing now." 1959 | ] 1960 | }, 1961 | { 1962 | "cell_type": "code", 1963 | "execution_count": 53, 1964 | "metadata": {}, 1965 | "outputs": [], 1966 | "source": [ 1967 | "# Convert year to int (rows with a '\\N' year were dropped above, so fillna is only a safeguard)\n", 1968 | "df_slim['year'] = df_slim['year'].fillna(0.0).astype(int)" 1969 | ] 1970 | }, 1971 | { 1972 | "cell_type": "code", 1973 | "execution_count": 55, 1974 | "metadata": {}, 1975 | "outputs": [ 1976 | { 1977 | "data": { 1978 | "text/html": [ 1979 | "
\n", 1980 | "\n", 1993 | "\n", 1994 | " \n", 1995 | " \n", 1996 | " \n", 1997 | " \n", 1998 | " \n", 1999 | " \n", 2000 | " \n", 2001 | " \n", 2002 | " \n", 2003 | " \n", 2004 | " \n", 2005 | " \n", 2006 | " \n", 2007 | " \n", 2008 | " \n", 2009 | " \n", 2010 | " \n", 2011 | " \n", 2012 | " \n", 2013 | " \n", 2014 | " \n", 2015 | " \n", 2016 | " \n", 2017 | " \n", 2018 | " \n", 2019 | " \n", 2020 | " \n", 2021 | " \n", 2022 | " \n", 2023 | " \n", 2024 | " \n", 2025 | " \n", 2026 | " \n", 2027 | " \n", 2028 | " \n", 2029 | " \n", 2030 | " \n", 2031 | " \n", 2032 | " \n", 2033 | " \n", 2034 | " \n", 2035 | " \n", 2036 | " \n", 2037 | " \n", 2038 | " \n", 2039 | " \n", 2040 | " \n", 2041 | " \n", 2042 | " \n", 2043 | " \n", 2044 | " \n", 2045 | " \n", 2046 | " \n", 2047 | " \n", 2048 | " \n", 2049 | " \n", 2050 | " \n", 2051 | " \n", 2052 | " \n", 2053 | " \n", 2054 | " \n", 2055 | " \n", 2056 | " \n", 2057 | " \n", 2058 | " \n", 2059 | " \n", 2060 | " \n", 2061 | " \n", 2062 | " \n", 2063 | " \n", 2064 | " \n", 2065 | " \n", 2066 | " \n", 2067 | " \n", 2068 | " \n", 2069 | " \n", 2070 | " \n", 2071 | " \n", 2072 | " \n", 2073 | " \n", 2074 | " \n", 2075 | " \n", 2076 | " \n", 2077 | " \n", 2078 | " \n", 2079 | " \n", 2080 | " \n", 2081 | " \n", 2082 | " \n", 2083 | " \n", 2084 | " \n", 2085 | " \n", 2086 | " \n", 2087 | " \n", 2088 | " \n", 2089 | " \n", 2090 | " \n", 2091 | " \n", 2092 | " \n", 2093 | " \n", 2094 | " \n", 2095 | " \n", 2096 | " \n", 2097 | " \n", 2098 | " \n", 2099 | " \n", 2100 | " \n", 2101 | " \n", 2102 | " \n", 2103 | " \n", 2104 | " \n", 2105 | " \n", 2106 | " \n", 2107 | " \n", 2108 | " \n", 2109 | " \n", 2110 | " \n", 2111 | " \n", 2112 | " \n", 2113 | " \n", 2114 | " \n", 2115 | " \n", 2116 | " \n", 2117 | " \n", 2118 | " \n", 2119 | " \n", 2120 | " \n", 2121 | " \n", 2122 | " \n", 2123 | " \n", 2124 | " \n", 2125 | " \n", 2126 | " \n", 2127 | " \n", 2128 | " \n", 2129 | " \n", 2130 | " \n", 2131 | " \n", 2132 | " \n", 2133 | " \n", 2134 | " 
\n", 2135 | " \n", 2136 | " \n", 2137 | " \n", 2138 | " \n", 2139 | " \n", 2140 | " \n", 2141 | " \n", 2142 | "
|  | tconst | titleType | year | primaryTitle | originalTitle | runtimeMinutes | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|---|---|---|
| 80744 | tt0111161 | movie | 1994 | The Shawshank Redemption | The Shawshank Redemption | 142 | Drama | 9.3 | 2203956 |
| 241990 | tt0468569 | movie | 2008 | The Dark Knight | The Dark Knight | 152 | Action,Crime,Drama | 9.0 | 2184629 |
| 530003 | tt1375666 | movie | 2010 | Inception | Inception | 148 | Action,Adventure,Sci-Fi | 8.8 | 1933557 |
| 96873 | tt0137523 | movie | 1999 | Fight Club | Fight Club | 139 | Drama | 8.8 | 1759843 |
| 80528 | tt0110912 | movie | 1994 | Pulp Fiction | Pulp Fiction | 154 | Crime,Drama | 8.9 | 1731665 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 843975 | tt5275476 | movie | 2017 | Bedbugs | Fikkefuchs | 101 | Comedy,Drama | 6.2 | 1001 |
| 56437 | tt0082210 | movie | 1981 | El crack | El crack | 131 | Crime,Drama,Mystery | 7.3 | 1001 |
| 25596 | tt0045992 | movie | 1952 | The Lawless Breed | The Lawless Breed | 83 | Western | 6.3 | 1001 |
| 520460 | tt1327833 | movie | 2008 | Sorry Bhai! | Sorry Bhai! | 154 | Comedy,Drama,Romance | 6.1 | 1001 |
| 244603 | tt0475627 | movie | 2005 | Shikhar | Shikhar | 162 | Drama | 4.9 | 1001 |
\n", 2143 | "

30403 rows × 9 columns

\n", 2144 | "
" 2145 | ], 2146 | "text/plain": [ 2147 | " tconst titleType year primaryTitle \\\n", 2148 | "80744 tt0111161 movie 1994 The Shawshank Redemption \n", 2149 | "241990 tt0468569 movie 2008 The Dark Knight \n", 2150 | "530003 tt1375666 movie 2010 Inception \n", 2151 | "96873 tt0137523 movie 1999 Fight Club \n", 2152 | "80528 tt0110912 movie 1994 Pulp Fiction \n", 2153 | "... ... ... ... ... \n", 2154 | "843975 tt5275476 movie 2017 Bedbugs \n", 2155 | "56437 tt0082210 movie 1981 El crack \n", 2156 | "25596 tt0045992 movie 1952 The Lawless Breed \n", 2157 | "520460 tt1327833 movie 2008 Sorry Bhai! \n", 2158 | "244603 tt0475627 movie 2005 Shikhar \n", 2159 | "\n", 2160 | " originalTitle runtimeMinutes genres \\\n", 2161 | "80744 The Shawshank Redemption 142 Drama \n", 2162 | "241990 The Dark Knight 152 Action,Crime,Drama \n", 2163 | "530003 Inception 148 Action,Adventure,Sci-Fi \n", 2164 | "96873 Fight Club 139 Drama \n", 2165 | "80528 Pulp Fiction 154 Crime,Drama \n", 2166 | "... ... ... ... \n", 2167 | "843975 Fikkefuchs 101 Comedy,Drama \n", 2168 | "56437 El crack 131 Crime,Drama,Mystery \n", 2169 | "25596 The Lawless Breed 83 Western \n", 2170 | "520460 Sorry Bhai! 154 Comedy,Drama,Romance \n", 2171 | "244603 Shikhar 162 Drama \n", 2172 | "\n", 2173 | " averageRating numVotes \n", 2174 | "80744 9.3 2203956 \n", 2175 | "241990 9.0 2184629 \n", 2176 | "530003 8.8 1933557 \n", 2177 | "96873 8.8 1759843 \n", 2178 | "80528 8.9 1731665 \n", 2179 | "... ... ... 
\n", 2180 | "843975 6.2 1001 \n", 2181 | "56437 7.3 1001 \n", 2182 | "25596 6.3 1001 \n", 2183 | "520460 6.1 1001 \n", 2184 | "244603 4.9 1001 \n", 2185 | "\n", 2186 | "[30403 rows x 9 columns]" 2187 | ] 2188 | }, 2189 | "metadata": {}, 2190 | "output_type": "display_data" 2191 | } 2192 | ], 2193 | "source": [ 2194 | "final_df = df_slim[(df_slim['year'] >= 1920) & (df_slim['numVotes'] > 1000)].sort_values(['numVotes'], ascending=False)\n", 2195 | "display(final_df)" 2196 | ] 2197 | }, 2198 | { 2199 | "cell_type": "markdown", 2200 | "metadata": {}, 2201 | "source": [ 2202 | "This looks like a good amount, about 30,000 titles to scrape. I'm using an arbitrary threshold where `numVotes` is greater than 1000. I'm hoping those entries have some decent level of accuracy if they are that popular.\n", 2203 | "\n", 2204 | "Additionally, I'm limiting it to 1920 and later for an even '100 years' of movies.\n", 2205 | "\n", 2206 | "---\n", 2207 | "\n", 2208 | "I'm not going to scrape 2020, but this cell is for future use to see how much 'new' training data I will have." 2209 | ] 2210 | }, 2211 | { 2212 | "cell_type": "code", 2213 | "execution_count": 61, 2214 | "metadata": {}, 2215 | "outputs": [ 2216 | { 2217 | "data": { 2218 | "text/plain": [ 2219 | "(71, 9)" 2220 | ] 2221 | }, 2222 | "execution_count": 61, 2223 | "metadata": {}, 2224 | "output_type": "execute_result" 2225 | } 2226 | ], 2227 | "source": [ 2228 | "final_df[final_df['year'] >= 2020].shape" 2229 | ] 2230 | }, 2231 | { 2232 | "cell_type": "markdown", 2233 | "metadata": {}, 2234 | "source": [ 2235 | "Finally, export this list to a `.csv` file so I can access it later for scraping by `tconst` ID." 
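Since the scraping step only needs the title ID, the page URL can be derived from `tconst`. A minimal sketch, assuming the title-page pattern seen in the Star Wars link earlier in this notebook (`https://www.imdb.com/title/<tconst>/`); the helper name `title_url` and the ID format check are assumptions for illustration, not something IMDb guarantees:

```python
import re

# Pattern taken from the Star Wars title link shown earlier in this notebook
BASE_URL = "https://www.imdb.com/title/"

# Assumption: IDs look like "tt" followed by digits, as in the tconst column
TCONST_PATTERN = re.compile(r"tt\d+")


def title_url(tconst):
    """Build the title-page URL for one tconst, rejecting malformed IDs."""
    if not TCONST_PATTERN.fullmatch(tconst):
        raise ValueError(f"unexpected tconst format: {tconst!r}")
    return f"{BASE_URL}{tconst}/"


# The two titles spot-checked earlier in the notebook
for tconst in ["tt0076759", "tt8946378"]:
    print(title_url(tconst))
```

Validating the ID before building the URL means a malformed row in the exported list fails loudly instead of producing a request to a nonsense page.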
2236 | ] 2237 | }, 2238 | { 2239 | "cell_type": "code", 2240 | "execution_count": 60, 2241 | "metadata": {}, 2242 | "outputs": [], 2243 | "source": [ 2244 | "final_df.to_csv('imdb_movie_list.csv', header=True, index=False)" 2245 | ] 2246 | }, 2247 | { 2248 | "cell_type": "markdown", 2249 | "metadata": {}, 2250 | "source": [ 2251 | "---" 2252 | ] 2253 | }, 2254 | { 2255 | "cell_type": "markdown", 2256 | "metadata": {}, 2257 | "source": [ 2258 | "Genre Genie - Multi-label classification using NLP\n", 2259 | "\n", 2260 | "Tom Keith - 2020" 2261 | ] 2262 | } 2263 | ], 2264 | "metadata": { 2265 | "kernelspec": { 2266 | "display_name": "Python 3", 2267 | "language": "python", 2268 | "name": "python3" 2269 | }, 2270 | "language_info": { 2271 | "codemirror_mode": { 2272 | "name": "ipython", 2273 | "version": 3 2274 | }, 2275 | "file_extension": ".py", 2276 | "mimetype": "text/x-python", 2277 | "name": "python", 2278 | "nbconvert_exporter": "python", 2279 | "pygments_lexer": "ipython3", 2280 | "version": "3.7.4" 2281 | } 2282 | }, 2283 | "nbformat": 4, 2284 | "nbformat_minor": 4 2285 | } 2286 | -------------------------------------------------------------------------------- /3.1-modeling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Genre Genie - Multi-label Classification with NLP\n", 8 | "### Part 3.1: Modeling using OneVsRest\n", 9 | "\n", 10 | "#### Tom Keith\n", 11 | "\n", 12 | "---\n", 13 | "\n", 14 | "**Goal:** Fit and optimize multi-label classification model on the train set. Finally, score on test set." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# Standard imports\n", 24 | "import numpy as np\n", 25 | "import pandas as pd\n", 26 | "pd.set_option('display.max_columns', 500)\n", 27 | "import matplotlib.pyplot as plt\n", 28 | "%matplotlib inline\n", 29 | "import joblib" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "Import the train and test dataframes from the previous step. They are large files." 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "name": "stdout", 46 | "output_type": "stream", 47 | "text": [ 48 | "Wall time: 29.3 s\n" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "%%time\n", 54 | "train = pd.read_csv('data/train_dataframe.tsv', sep='\\t', index_col=0)\n", 55 | "test = pd.read_csv('data/test_dataframe.tsv', sep='\\t', index_col=0)" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "Put the genre names and the names of the non-word features into lists for easy use." 
63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 3, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "name": "stdout", 72 | "output_type": "stream", 73 | "text": [ 74 | "22\n", 75 | "['g_action', 'g_adventure', 'g_animation', 'g_biography', 'g_comedy', 'g_crime', 'g_documentary', 'g_drama', 'g_family', 'g_fantasy', 'g_film-noir', 'g_history', 'g_horror', 'g_music', 'g_musical', 'g_mystery', 'g_romance', 'g_sci-fi', 'g_sport', 'g_thriller', 'g_war', 'g_western']\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "cols = list(train.columns.values)\n", 81 | "genre_cols = cols[-22:]\n", 82 | "print(len(genre_cols))\n", 83 | "print(genre_cols)" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 4, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "f_names = cols[:8]" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "Separate X and y from the train and test dataframes. We want JUST the genre columns for `y` and everything except the genre columns for `X`." 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "X_train = train[train.columns[~train.columns.isin(genre_cols)]]\n", 109 | "y_train = train[train.columns[ train.columns.isin(genre_cols)]]\n", 110 | "\n", 111 | "X_test = test[test.columns[~test.columns.isin(genre_cols)]]\n", 112 | "y_test = test[test.columns[ test.columns.isin(genre_cols)]]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "---\n", 120 | "\n", 121 | "Before running a model, we need to scale our data. Both standard and min-max scaling were tested, and the standard scaler came out on top." 
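The detail that matters when scaling is that the statistics come from the training set only and are then reused on the test set, so nothing about the test data leaks into the fit. A dependency-free sketch of that idea for a single column follows. The numbers and the helper names `fit_standardizer` and `standardize` are made up for illustration; it is a stand-in for the per-feature idea behind `StandardScaler`, not the notebook's code.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant columns, which would otherwise divide by zero
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]


train_runtime = [90.0, 100.0, 110.0]  # made-up runtimes in minutes
test_runtime = [100.0, 130.0]

mean, std = fit_standardizer(train_runtime)
train_scaled = standardize(train_runtime, mean, std)
test_scaled = standardize(test_runtime, mean, std)  # reuses the train statistics

print(train_scaled)
print(test_scaled)
```

The scaled test values are not centered on zero, and that is expected: they are expressed relative to the training distribution, which is the same transformation any future input would receive.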
122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 8, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "%%time\n", 131 | "# Scale data (Standard Scaler)\n", 132 | "from sklearn.preprocessing import StandardScaler \n", 133 | "my_standard_scaler = StandardScaler().fit(X_train)\n", 134 | "X_train_s = my_standard_scaler.transform(X_train)\n", 135 | "X_test_s = my_standard_scaler.transform(X_test)\n", 136 | "\n", 137 | "#joblib.dump(my_standard_scaler, 'models/my_standard_scaler.pkl')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 98, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "['models/my_minmax_scaler.pkl']" 149 | ] 150 | }, 151 | "execution_count": 98, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "# Scale data (MinMax Scaler)\n", 158 | "from sklearn.preprocessing import MinMaxScaler \n", 159 | "my_minmax_scaler = MinMaxScaler().fit(X_train)\n", 160 | "X_train_mm = my_minmax_scaler.transform(X_train)\n", 161 | "X_test_mm = my_minmax_scaler.transform(X_test)\n", 162 | "\n", 163 | "#joblib.dump(my_minmax_scaler, 'models/my_minmax_scaler.pkl')" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "---\n", 171 | "\n", 172 | "### Please note\n", 173 | "MANY models were tested and pkl'd. Below is the optimized model. After that, everything below it is testing of other models, scalers, score grading, and tuning hyperparameters. I normally would not include all of them, but they remain for completeness.\n", 174 | "\n", 175 | "In the end, OneVsRest with Logistic Regression (C=0.01, solver='lbfgs') when scaled with a standard scaler was the best option." 
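When reading the scores for this model in the cells that follow, it helps to know that `accuracy_score` on multi-label targets counts a row as correct only if every genre matches (subset accuracy), which is much stricter than accuracy computed one genre at a time. A small pure-Python sketch with made-up labels shows how far the two can diverge; the function names are hypothetical:

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows where the full label vector matches exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Accuracy for each label column, scored independently."""
    n_rows = len(y_true)
    n_labels = len(y_true[0])
    return [
        sum(1 for t, p in zip(y_true, y_pred) if t[j] == p[j]) / n_rows
        for j in range(n_labels)
    ]


# Made-up example: 4 movies, 3 genres (say action, comedy, drama)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(subset_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

This is why an overall test score near 0.11 can coexist with per-genre accuracies between roughly 0.69 and 0.99: a single wrong genre out of 22 is enough to mark the whole row as a miss.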
176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 10, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "import joblib\n", 185 | "#my_model = joblib.load('models/my_1vr_linear_svc_default.pkl')" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 9, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "from sklearn.multiclass import OneVsRestClassifier\n", 195 | "from sklearn.svm import LinearSVC\n", 196 | "from sklearn.linear_model import LogisticRegression" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 113, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "name": "stdout", 206 | "output_type": "stream", 207 | "text": [ 208 | "[0.09696028 0.08988016 0.09786951 0.09476254 0.09476254]\n", 209 | "Fold 1: 0.09696028400266253\n", 210 | "Fold 2: 0.08988015978695073\n", 211 | "Fold 3: 0.09786950732356857\n", 212 | "Fold 4: 0.09476253883710609\n", 213 | "Fold 5: 0.09476253883710609\n", 214 | "Average Score:0.0948470057574788\n" 215 | ] 216 | } 217 | ], 218 | "source": [ 219 | "from sklearn.model_selection import cross_val_score\n", 220 | "my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), n_jobs=-1)\n", 221 | "\n", 222 | "scores = cross_val_score(my_log_model, X_train_s, y_train, cv = 5)\n", 223 | "print(scores)\n", 224 | "\n", 225 | "for i in range(len(scores)) :\n", 226 | " print(f\"Fold {i+1}: {scores[i]}\")\n", 227 | "print(f\"Average Score:{np.mean(scores)}\")" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 10, 233 | "metadata": {}, 234 | "outputs": [ 235 | { 236 | "name": "stdout", 237 | "output_type": "stream", 238 | "text": [ 239 | "Wall time: 59.2 s\n" 240 | ] 241 | } 242 | ], 243 | "source": [ 244 | "%%time\n", 245 | "my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), 
n_jobs=-1).fit(X_train_s, y_train)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 11, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "y_train_pred = my_log_model.predict(X_train_s)\n", 255 | "y_train_proba = my_log_model.predict_proba(X_train_s)\n", 256 | "y_test_pred = my_log_model.predict(X_test_s)\n", 257 | "y_test_proba = my_log_model.predict_proba(X_test_s)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 13, 263 | "metadata": {}, 264 | "outputs": [ 265 | { 266 | "name": "stdout", 267 | "output_type": "stream", 268 | "text": [ 269 | "Training score: 0.54290\n", 270 | " Test score: 0.10558\n" 271 | ] 272 | } 273 | ], 274 | "source": [ 275 | "from sklearn.metrics import accuracy_score\n", 276 | "print(f'Training score: {accuracy_score(y_train, y_train_pred):0.5f}')\n", 277 | "print(f' Test score: {accuracy_score(y_test, y_test_pred):0.5f}')" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 16, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | "0.8350 g_action\n", 290 | "0.8771 g_adventure\n", 291 | "0.9703 g_animation\n", 292 | "0.9397 g_biography\n", 293 | "0.7344 g_comedy\n", 294 | "0.8410 g_crime\n", 295 | "0.9730 g_documentary\n", 296 | "0.6931 g_drama\n", 297 | "0.9318 g_family\n", 298 | "0.9140 g_fantasy\n", 299 | "0.9892 g_film-noir\n", 300 | "0.9506 g_history\n", 301 | "0.9116 g_horror\n", 302 | "0.9384 g_music\n", 303 | "0.9657 g_musical\n", 304 | "0.8823 g_mystery\n", 305 | "0.7822 g_romance\n", 306 | "0.9369 g_sci-fi\n", 307 | "0.9794 g_sport\n", 308 | "0.7769 g_thriller\n", 309 | "0.9611 g_war\n", 310 | "0.9867 g_western\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "y_pred_df = pd.DataFrame(y_test_pred, columns=genre_cols)\n", 316 | "\n", 317 | "# Test set predictions\n", 318 | "for g in genre_cols:\n", 319 | " score = accuracy_score(y_test[g], y_pred_df[g])\n", 
320 | " print(f'{score:0.4f} {g}')" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "\n", 328 | "---\n", 329 | "\n", 330 | "## Everything below is model testing and optimizing" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 79, 336 | "metadata": {}, 337 | "outputs": [ 338 | { 339 | "name": "stdout", 340 | "output_type": "stream", 341 | "text": [ 342 | "Wall time: 22min 6s\n" 343 | ] 344 | } 345 | ], 346 | "source": [ 347 | "%%time\n", 348 | "#my_model = OneVsRestClassifier(LinearSVC(random_state=123, max_iter=3000), n_jobs=-1).fit(X_train_s, y_train)" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 92, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stderr", 358 | "output_type": "stream", 359 | "text": [ 360 | "C:\\Users\\Tom\\Anaconda3\\lib\\site-packages\\sklearn\\externals\\joblib\\__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. 
If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.\n", 361 | " warnings.warn(msg, category=DeprecationWarning)\n" 362 | ] 363 | }, 364 | { 365 | "data": { 366 | "text/plain": [ 367 | "['models/my_1vr_linear_svc_default.pkl']" 368 | ] 369 | }, 370 | "execution_count": 92, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "# EXPORT AND SAVE THE MODEL\n", 377 | "#joblib.dump(my_model, 'models/my_1vr_linear_svc_default.pkl')" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "y_pred = my_model.predict(X_test_s)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 12, 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [ 395 | "y_train_pred = my_model.predict(X_train_s)" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 13, 401 | "metadata": {}, 402 | "outputs": [ 403 | { 404 | "data": { 405 | "text/plain": [ 406 | "True" 407 | ] 408 | }, 409 | "execution_count": 13, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "my_model.multilabel_\n", 416 | "#my_model.predict_proba(X_train_s)" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": 90, 422 | "metadata": {}, 423 | "outputs": [ 424 | { 425 | "name": "stdout", 426 | "output_type": "stream", 427 | "text": [ 428 | "Training score: 0.49665\n", 429 | " Test score: 0.04367\n" 430 | ] 431 | } 432 | ], 433 | "source": [ 434 | "from sklearn.metrics import accuracy_score\n", 435 | "print(f'Training score: {accuracy_score(y_train, y_train_pred):0.5f}')\n", 436 | "print(f' Test score: {accuracy_score(y_test, y_pred):0.5f}')" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 14, 442 | "metadata": {}, 443 | "outputs": [], 444 | "source": [ 445 | "from 
sklearn.metrics import confusion_matrix\n", 446 | "from sklearn.metrics import accuracy_score\n", 447 | "from sklearn.metrics import precision_score\n", 448 | "from sklearn.metrics import recall_score\n", 449 | "from sklearn.metrics import f1_score\n", 450 | "from sklearn.metrics import multilabel_confusion_matrix" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 67, 456 | "metadata": {}, 457 | "outputs": [ 458 | { 459 | "data": { 460 | "text/plain": [ 461 | "array([[7360, 70],\n", 462 | " [ 63, 18]], dtype=int64)" 463 | ] 464 | }, 465 | "execution_count": 67, 466 | "metadata": {}, 467 | "output_type": "execute_result" 468 | } 469 | ], 470 | "source": [ 471 | "# Per-genre confusion matrices (rows: actual label, columns: predicted label)\n", 472 | "cm = multilabel_confusion_matrix(y_test, y_pred)\n", 473 | "\n", 474 | "g_cm_list = []\n", 475 | "for g in cm:\n", 476 | " g_cm_list.append(pd.DataFrame(g, columns=['Predicted Negative (0)', 'Predicted Positive (1)'],\n", 477 | " index=['Actual Negative (0)', 'Actual Positive (1)']))\n", 478 | "\n", 479 | "g_cm_list[10].values" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": 86, 485 | "metadata": {}, 486 | "outputs": [], 487 | "source": [ 488 | "y_train_pred_df = pd.DataFrame(y_train_pred)\n", 489 | "y_train_pred_df.columns = genre_cols\n", 490 | "y_pred_df = pd.DataFrame(y_pred)\n", 491 | "y_pred_df.columns = genre_cols" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 17, 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "name": "stdout", 501 | "output_type": "stream", 502 | "text": [ 503 | "0.8350 g_action\n", 504 | "0.8771 g_adventure\n", 505 | "0.9703 g_animation\n", 506 | "0.9397 g_biography\n", 507 | "0.7344 g_comedy\n", 508 | "0.8410 g_crime\n", 509 | "0.9730 g_documentary\n", 510 | "0.6931 g_drama\n", 511 | "0.9318 g_family\n", 512 | "0.9140 g_fantasy\n", 513 | "0.9892 g_film-noir\n", 514 | "0.9506 g_history\n", 515 | "0.9116 g_horror\n", 516 | "0.9384 g_music\n", 517 | "0.9657 
g_musical\n", 518 | "0.8823 g_mystery\n", 519 | "0.7822 g_romance\n", 520 | "0.9369 g_sci-fi\n", 521 | "0.9794 g_sport\n", 522 | "0.7769 g_thriller\n", 523 | "0.9611 g_war\n", 524 | "0.9867 g_western\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "test_acc_dict = {}\n", 530 | "# Test set predictions\n", 531 | "for g in genre_cols:\n", 532 | " score = accuracy_score(y_test[g], y_pred_df[g])\n", 533 | " test_acc_dict.update( {g[2:] : score} )\n", 534 | " print(f'{score:0.4f} {g}')" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 18, 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "test_scores = pd.DataFrame.from_dict(test_acc_dict, orient='index', columns=['score'])" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 19, 549 | "metadata": {}, 550 | "outputs": [], 551 | "source": [ 552 | "test_scores.to_csv('test_scores_last_model.csv', index_label='genre')" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 22, 558 | "metadata": {}, 559 | "outputs": [], 560 | "source": [ 561 | "coefs = my_model.coef_" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 75, 567 | "metadata": {}, 568 | "outputs": [ 569 | { 570 | "data": { 571 | "text/html": [ 572 | "
\n", 573 | "Coefficient table (features × genres), 5459 rows × 22 columns; the values are listed in the text/plain output.\n", 893 | "
" 894 | ], 895 | "text/plain": [ 896 | " g_action g_adventure g_animation g_biography g_comedy \\\n", 897 | "f_release_year 0.890920 -0.202585 0.205493 0.156150 0.090189 \n", 898 | "f_release_month 0.162641 0.093539 0.032958 -0.002919 -0.007082 \n", 899 | "f_runtime 0.997796 0.052923 -0.329772 0.151680 -0.136051 \n", 900 | "f_word_count_long 0.221205 0.002507 0.022708 0.008983 -0.089621 \n", 901 | "f_imdb_rating -0.992456 -0.122722 0.277040 0.168530 -0.071904 \n", 902 | "... ... ... ... ... ... \n", 903 | "zealand 0.201899 -0.011669 -0.000164 0.038771 -0.038884 \n", 904 | "zero 0.145570 -0.053593 0.022223 0.015668 0.019288 \n", 905 | "zombi -0.028788 -0.057147 -0.007456 0.005884 0.020771 \n", 906 | "zone 0.188484 0.066910 -0.011633 -0.034049 0.013658 \n", 907 | "zoo -0.039962 0.073740 0.018455 -0.023292 0.014912 \n", 908 | "\n", 909 | " g_crime g_documentary g_drama g_family g_fantasy \\\n", 910 | "f_release_year 0.207143 0.112173 -0.024676 -0.005681 0.000552 \n", 911 | "f_release_month -0.054702 -0.018309 0.026233 0.056573 0.120411 \n", 912 | "f_runtime -0.098478 -0.125171 0.161513 -0.157185 -0.139964 \n", 913 | "f_word_count_long -0.157927 0.015894 -0.003234 0.103574 0.235768 \n", 914 | "f_imdb_rating 1.286002 0.241488 0.239085 0.045930 0.015202 \n", 915 | "... ... ... ... ... ... 
\n", 916 | "zealand -0.200695 0.013058 0.013915 0.007938 0.020752 \n", 917 | "zero -0.090709 -0.003745 0.002425 -0.020840 0.058833 \n", 918 | "zombi -0.279043 -0.000784 -0.027079 -0.005703 -0.038465 \n", 919 | "zone -0.053566 0.012062 -0.014565 -0.003714 0.043539 \n", 920 | "zoo -0.352289 -0.006540 -0.004921 0.044167 0.052392 \n", 921 | "\n", 922 | " g_film-noir g_history g_horror g_music g_musical \\\n", 923 | "f_release_year -0.192294 0.262658 -0.124317 -0.247146 -0.202726 \n", 924 | "f_release_month 0.000712 -0.090634 -0.025620 0.030871 0.001300 \n", 925 | "f_runtime -0.046469 0.234240 -0.372855 0.120316 0.074502 \n", 926 | "f_word_count_long 0.010266 0.010166 -0.077899 0.029570 0.042731 \n", 927 | "f_imdb_rating 0.052162 0.150826 -0.554526 0.047041 0.012359 \n", 928 | "... ... ... ... ... ... \n", 929 | "zealand -0.001004 -0.008869 -0.048644 0.001566 -0.016636 \n", 930 | "zero 0.003737 0.005958 -0.093422 0.027760 0.030987 \n", 931 | "zombi 0.003023 -0.020218 0.341943 0.012054 0.030725 \n", 932 | "zone -0.001000 0.025871 -0.015155 0.039160 -0.007195 \n", 933 | "zoo -0.002906 -0.003083 0.011880 0.013114 -0.004452 \n", 934 | "\n", 935 | " g_mystery g_romance g_sci-fi g_sport g_thriller \\\n", 936 | "f_release_year -0.000374 -0.163564 -0.082269 -0.013732 0.092245 \n", 937 | "f_release_month 0.005833 0.013643 0.013330 -0.010611 0.009509 \n", 938 | "f_runtime 0.034689 0.109954 -0.061331 0.012625 0.053091 \n", 939 | "f_word_count_long 0.011146 -0.027546 0.017361 0.024619 0.003008 \n", 940 | "f_imdb_rating 0.118412 -0.026726 -0.234360 0.013724 -0.086579 \n", 941 | "... ... ... ... ... ... 
\n", 942 | "zealand -0.099761 0.006259 -0.014353 0.009601 0.003707 \n", 943 | "zero 0.093844 0.031099 0.011547 0.006995 0.014213 \n", 944 | "zombi -0.135719 -0.007812 0.022717 -0.009578 0.004558 \n", 945 | "zone -0.075479 -0.015189 0.036640 -0.033840 -0.004590 \n", 946 | "zoo 0.041213 0.002608 -0.005972 -0.011298 -0.001489 \n", 947 | "\n", 948 | " g_war g_western \n", 949 | "f_release_year 0.168980 0.168377 \n", 950 | "f_release_month -0.108588 -0.121271 \n", 951 | "f_runtime 0.236038 0.229005 \n", 952 | "f_word_count_long -0.013937 -0.003826 \n", 953 | "f_imdb_rating 0.051071 0.007038 \n", 954 | "... ... ... \n", 955 | "zealand 0.014746 0.001099 \n", 956 | "zero 0.008819 0.001835 \n", 957 | "zombi -0.052168 -0.011931 \n", 958 | "zone 0.025977 -0.003066 \n", 959 | "zoo -0.008891 0.003304 \n", 960 | "\n", 961 | "[5459 rows x 22 columns]" 962 | ] 963 | }, 964 | "execution_count": 75, 965 | "metadata": {}, 966 | "output_type": "execute_result" 967 | } 968 | ], 969 | "source": [ 970 | "coef_df = pd.DataFrame(coefs, index=genre_cols, columns=X_train.columns)\n", 971 | "coef_tdf = coef_df.T\n", 972 | "coef_tdf" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": 29, 978 | "metadata": {}, 979 | "outputs": [], 980 | "source": [ 981 | "coef_tdf.to_csv('my_1vr_linear_svc_default_coef.tsv', sep='\\t')" 982 | ] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "execution_count": null, 987 | "metadata": {}, 988 | "outputs": [], 989 | "source": [] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": null, 994 | "metadata": {}, 995 | "outputs": [], 996 | "source": [] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "metadata": {}, 1002 | "outputs": [], 1003 | "source": [] 1004 | }, 1005 | { 1006 | "cell_type": "code", 1007 | "execution_count": null, 1008 | "metadata": {}, 1009 | "outputs": [], 1010 | "source": [] 1011 | }, 1012 | { 1013 | "cell_type": "code", 1014 | "execution_count": 69, 1015 | 
"metadata": {}, 1016 | "outputs": [ 1017 | { 1018 | "name": "stdout", 1019 | "output_type": "stream", 1020 | "text": [ 1021 | "Wall time: 8min 26s\n" 1022 | ] 1023 | }, 1024 | { 1025 | "data": { 1026 | "text/plain": [ 1027 | "['models/my_1vr_logreg_default.pkl']" 1028 | ] 1029 | }, 1030 | "execution_count": 69, 1031 | "metadata": {}, 1032 | "output_type": "execute_result" 1033 | } 1034 | ], 1035 | "source": [ 1036 | "%%time\n", 1037 | "my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, max_iter=3000), n_jobs=-1).fit(X_train_s, y_train)\n", 1038 | "\n", 1039 | "# EXPORT AND SAVE THE MODEL\n", 1040 | "joblib.dump(my_log_model, 'models/my_1vr_logreg_default.pkl')" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "code", 1045 | "execution_count": 106, 1046 | "metadata": {}, 1047 | "outputs": [ 1048 | { 1049 | "name": "stdout", 1050 | "output_type": "stream", 1051 | "text": [ 1052 | "Wall time: 11 s\n" 1053 | ] 1054 | }, 1055 | { 1056 | "data": { 1057 | "text/plain": [ 1058 | "['models/my_1vr_logreg_minmax_0.01.pkl']" 1059 | ] 1060 | }, 1061 | "execution_count": 106, 1062 | "metadata": {}, 1063 | "output_type": "execute_result" 1064 | } 1065 | ], 1066 | "source": [ 1067 | "%%time\n", 1068 | "my_log_model_mm = OneVsRestClassifier(LogisticRegression(random_state=123, max_iter=3000, C=0.01), n_jobs=-1).fit(X_train_mm, y_train)\n", 1069 | "\n", 1070 | "# EXPORT AND SAVE THE MODEL\n", 1071 | "joblib.dump(my_log_model_mm, 'models/my_1vr_logreg_minmax_0.01.pkl')" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": 109, 1077 | "metadata": {}, 1078 | "outputs": [ 1079 | { 1080 | "name": "stdout", 1081 | "output_type": "stream", 1082 | "text": [ 1083 | "Train: 0.08490524166703653\n", 1084 | " Test: 0.07189455465317535\n", 1085 | "0.8136 g_action\n", 1086 | "0.8852 g_adventure\n", 1087 | "0.9691 g_animation\n", 1088 | "0.9478 g_biography\n", 1089 | "0.6686 g_comedy\n", 1090 | "0.8396 g_crime\n", 1091 | "0.9525 g_documentary\n", 
1092 | "0.6873 g_drama\n", 1093 | "0.9406 g_family\n", 1094 | "0.9249 g_fantasy\n", 1095 | "0.9892 g_film-noir\n", 1096 | "0.9561 g_history\n", 1097 | "0.8645 g_horror\n", 1098 | "0.9342 g_music\n", 1099 | "0.9698 g_musical\n", 1100 | "0.9087 g_mystery\n", 1101 | "0.7774 g_romance\n", 1102 | "0.9179 g_sci-fi\n", 1103 | "0.9708 g_sport\n", 1104 | "0.7594 g_thriller\n", 1105 | "0.9490 g_war\n", 1106 | "0.9775 g_western\n" 1107 | ] 1108 | } 1109 | ], 1110 | "source": [ 1111 | "y_pred_log_mm = my_log_model_mm.predict(X_test_mm)\n", 1112 | "y_train_pred_log_mm = my_log_model_mm.predict(X_train_mm)\n", 1113 | "from sklearn.metrics import accuracy_score\n", 1114 | "print(f'Train: {accuracy_score(y_train, y_train_pred_log_mm)}')\n", 1115 | "print(f' Test: {accuracy_score(y_test, y_pred_log_mm)}')\n", 1116 | "\n", 1117 | "y_train_pred_log_mm_df = pd.DataFrame(y_train_pred_log_mm)\n", 1118 | "y_train_pred_log_mm_df.columns = genre_cols\n", 1119 | "\n", 1120 | "y_pred_log_mm_df = pd.DataFrame(y_pred_log_mm)\n", 1121 | "y_pred_log_mm_df.columns = genre_cols\n", 1122 | "\n", 1123 | "#test_acc_dict = {}\n", 1124 | "# Test set predictions\n", 1125 | "for g in genre_cols:\n", 1126 | " score = accuracy_score(y_test[g], y_pred_log_mm_df[g])\n", 1127 | " #test_acc_dict.update( {g[2:] : score} )\n", 1128 | " print(f'{score:0.4f} {g}')" 1129 | ] 1130 | }, 1131 | { 1132 | "cell_type": "code", 1133 | "execution_count": null, 1134 | "metadata": {}, 1135 | "outputs": [], 1136 | "source": [] 1137 | }, 1138 | { 1139 | "cell_type": "code", 1140 | "execution_count": null, 1141 | "metadata": {}, 1142 | "outputs": [], 1143 | "source": [] 1144 | }, 1145 | { 1146 | "cell_type": "code", 1147 | "execution_count": 101, 1148 | "metadata": {}, 1149 | "outputs": [], 1150 | "source": [ 1151 | "y_pred_log = my_log_model.predict(X_test_s)" 1152 | ] 1153 | }, 1154 | { 1155 | "cell_type": "code", 1156 | "execution_count": 71, 1157 | "metadata": {}, 1158 | "outputs": [], 1159 | "source": [ 1160 | 
"y_train_pred_log = my_log_model.predict(X_train_s)" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": 72, 1166 | "metadata": {}, 1167 | "outputs": [ 1168 | { 1169 | "data": { 1170 | "text/plain": [ 1171 | "True" 1172 | ] 1173 | }, 1174 | "execution_count": 72, 1175 | "metadata": {}, 1176 | "output_type": "execute_result" 1177 | } 1178 | ], 1179 | "source": [ 1180 | "my_log_model.multilabel_\n", 1181 | "#my_model.predict_proba(X_train_s)" 1182 | ] 1183 | }, 1184 | { 1185 | "cell_type": "code", 1186 | "execution_count": 73, 1187 | "metadata": {}, 1188 | "outputs": [ 1189 | { 1190 | "data": { 1191 | "text/plain": [ 1192 | "0.622076250499312" 1193 | ] 1194 | }, 1195 | "execution_count": 73, 1196 | "metadata": {}, 1197 | "output_type": "execute_result" 1198 | } 1199 | ], 1200 | "source": [ 1201 | "from sklearn.metrics import accuracy_score\n", 1202 | "accuracy_score(y_train, y_train_pred_log)" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "code", 1207 | "execution_count": 80, 1208 | "metadata": {}, 1209 | "outputs": [ 1210 | { 1211 | "data": { 1212 | "text/plain": [ 1213 | "0.06403940886699508" 1214 | ] 1215 | }, 1216 | "execution_count": 80, 1217 | "metadata": {}, 1218 | "output_type": "execute_result" 1219 | } 1220 | ], 1221 | "source": [ 1222 | "from sklearn.metrics import accuracy_score\n", 1223 | "accuracy_score(y_test, y_pred_log)" 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": 81, 1229 | "metadata": {}, 1230 | "outputs": [], 1231 | "source": [ 1232 | "y_train_pred_log_df = pd.DataFrame(y_train_pred_log)\n", 1233 | "y_train_pred_log_df.columns = genre_cols\n", 1234 | "\n", 1235 | "y_pred_log_df = pd.DataFrame(y_pred_log)\n", 1236 | "y_pred_log_df.columns = genre_cols" 1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": 82, 1242 | "metadata": {}, 1243 | "outputs": [ 1244 | { 1245 | "name": "stdout", 1246 | "output_type": "stream", 1247 | "text": [ 1248 | "0.7809 
g_action\n", 1249 | "0.8340 g_adventure\n", 1250 | "0.9561 g_animation\n", 1251 | "0.9081 g_biography\n", 1252 | "0.7252 g_comedy\n", 1253 | "0.7891 g_crime\n", 1254 | "0.9686 g_documentary\n", 1255 | "0.6874 g_drama\n", 1256 | "0.8992 g_family\n", 1257 | "0.8768 g_fantasy\n", 1258 | "0.9874 g_film-noir\n", 1259 | "0.9217 g_history\n", 1260 | "0.8882 g_horror\n", 1261 | "0.9101 g_music\n", 1262 | "0.9449 g_musical\n", 1263 | "0.8338 g_mystery\n", 1264 | "0.7617 g_romance\n", 1265 | "0.9103 g_sci-fi\n", 1266 | "0.9767 g_sport\n", 1267 | "0.7566 g_thriller\n", 1268 | "0.9451 g_war\n", 1269 | "0.9840 g_western\n" 1270 | ] 1271 | } 1272 | ], 1273 | "source": [ 1274 | "test_acc_dict = {}\n", 1275 | "# Test set predictions\n", 1276 | "for g in genre_cols:\n", 1277 | " score = accuracy_score(y_test[g], y_pred_log_df[g])\n", 1278 | " test_acc_dict.update( {g[2:] : score} )\n", 1279 | " print(f'{score:0.4f} {g}')" 1280 | ] 1281 | }, 1282 | { 1283 | "cell_type": "code", 1284 | "execution_count": 83, 1285 | "metadata": {}, 1286 | "outputs": [], 1287 | "source": [ 1288 | "test_scores_log = pd.DataFrame.from_dict(test_acc_dict, orient='index', columns=['score'])" 1289 | ] 1290 | }, 1291 | { 1292 | "cell_type": "code", 1293 | "execution_count": 84, 1294 | "metadata": {}, 1295 | "outputs": [], 1296 | "source": [ 1297 | "test_scores_log.to_csv('test_scores_model1.csv', index_label='genre')" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "code", 1302 | "execution_count": 85, 1303 | "metadata": {}, 1304 | "outputs": [ 1305 | { 1306 | "data": { 1307 | "text/html": [ 1308 | "
\n", 1309 | "Coefficient table (features × genres), 5459 rows × 22 columns; the values are listed in the text/plain output.\n", 1629 | "
" 1630 | ], 1631 | "text/plain": [ 1632 | " g_action g_adventure g_animation g_biography g_comedy \\\n", 1633 | "f_release_year 0.890920 -0.202585 0.205493 0.156150 0.090189 \n", 1634 | "f_release_month 0.162641 0.093539 0.032958 -0.002919 -0.007082 \n", 1635 | "f_runtime 0.997796 0.052923 -0.329772 0.151680 -0.136051 \n", 1636 | "f_word_count_long 0.221205 0.002507 0.022708 0.008983 -0.089621 \n", 1637 | "f_imdb_rating -0.992456 -0.122722 0.277040 0.168530 -0.071904 \n", 1638 | "... ... ... ... ... ... \n", 1639 | "zealand 0.201899 -0.011669 -0.000164 0.038771 -0.038884 \n", 1640 | "zero 0.145570 -0.053593 0.022223 0.015668 0.019288 \n", 1641 | "zombi -0.028788 -0.057147 -0.007456 0.005884 0.020771 \n", 1642 | "zone 0.188484 0.066910 -0.011633 -0.034049 0.013658 \n", 1643 | "zoo -0.039962 0.073740 0.018455 -0.023292 0.014912 \n", 1644 | "\n", 1645 | " g_crime g_documentary g_drama g_family g_fantasy \\\n", 1646 | "f_release_year 0.207143 0.112173 -0.024676 -0.005681 0.000552 \n", 1647 | "f_release_month -0.054702 -0.018309 0.026233 0.056573 0.120411 \n", 1648 | "f_runtime -0.098478 -0.125171 0.161513 -0.157185 -0.139964 \n", 1649 | "f_word_count_long -0.157927 0.015894 -0.003234 0.103574 0.235768 \n", 1650 | "f_imdb_rating 1.286002 0.241488 0.239085 0.045930 0.015202 \n", 1651 | "... ... ... ... ... ... 
\n", 1652 | "zealand -0.200695 0.013058 0.013915 0.007938 0.020752 \n", 1653 | "zero -0.090709 -0.003745 0.002425 -0.020840 0.058833 \n", 1654 | "zombi -0.279043 -0.000784 -0.027079 -0.005703 -0.038465 \n", 1655 | "zone -0.053566 0.012062 -0.014565 -0.003714 0.043539 \n", 1656 | "zoo -0.352289 -0.006540 -0.004921 0.044167 0.052392 \n", 1657 | "\n", 1658 | " g_film-noir g_history g_horror g_music g_musical \\\n", 1659 | "f_release_year -0.192294 0.262658 -0.124317 -0.247146 -0.202726 \n", 1660 | "f_release_month 0.000712 -0.090634 -0.025620 0.030871 0.001300 \n", 1661 | "f_runtime -0.046469 0.234240 -0.372855 0.120316 0.074502 \n", 1662 | "f_word_count_long 0.010266 0.010166 -0.077899 0.029570 0.042731 \n", 1663 | "f_imdb_rating 0.052162 0.150826 -0.554526 0.047041 0.012359 \n", 1664 | "... ... ... ... ... ... \n", 1665 | "zealand -0.001004 -0.008869 -0.048644 0.001566 -0.016636 \n", 1666 | "zero 0.003737 0.005958 -0.093422 0.027760 0.030987 \n", 1667 | "zombi 0.003023 -0.020218 0.341943 0.012054 0.030725 \n", 1668 | "zone -0.001000 0.025871 -0.015155 0.039160 -0.007195 \n", 1669 | "zoo -0.002906 -0.003083 0.011880 0.013114 -0.004452 \n", 1670 | "\n", 1671 | " g_mystery g_romance g_sci-fi g_sport g_thriller \\\n", 1672 | "f_release_year -0.000374 -0.163564 -0.082269 -0.013732 0.092245 \n", 1673 | "f_release_month 0.005833 0.013643 0.013330 -0.010611 0.009509 \n", 1674 | "f_runtime 0.034689 0.109954 -0.061331 0.012625 0.053091 \n", 1675 | "f_word_count_long 0.011146 -0.027546 0.017361 0.024619 0.003008 \n", 1676 | "f_imdb_rating 0.118412 -0.026726 -0.234360 0.013724 -0.086579 \n", 1677 | "... ... ... ... ... ... 
\n", 1678 | "zealand -0.099761 0.006259 -0.014353 0.009601 0.003707 \n", 1679 | "zero 0.093844 0.031099 0.011547 0.006995 0.014213 \n", 1680 | "zombi -0.135719 -0.007812 0.022717 -0.009578 0.004558 \n", 1681 | "zone -0.075479 -0.015189 0.036640 -0.033840 -0.004590 \n", 1682 | "zoo 0.041213 0.002608 -0.005972 -0.011298 -0.001489 \n", 1683 | "\n", 1684 | " g_war g_western \n", 1685 | "f_release_year 0.168980 0.168377 \n", 1686 | "f_release_month -0.108588 -0.121271 \n", 1687 | "f_runtime 0.236038 0.229005 \n", 1688 | "f_word_count_long -0.013937 -0.003826 \n", 1689 | "f_imdb_rating 0.051071 0.007038 \n", 1690 | "... ... ... \n", 1691 | "zealand 0.014746 0.001099 \n", 1692 | "zero 0.008819 0.001835 \n", 1693 | "zombi -0.052168 -0.011931 \n", 1694 | "zone 0.025977 -0.003066 \n", 1695 | "zoo -0.008891 0.003304 \n", 1696 | "\n", 1697 | "[5459 rows x 22 columns]" 1698 | ] 1699 | }, 1700 | "execution_count": 85, 1701 | "metadata": {}, 1702 | "output_type": "execute_result" 1703 | } 1704 | ], 1705 | "source": [ 1706 | "coef_df = pd.DataFrame(my_log_model.coef_, index=genre_cols, columns=X_train.columns)\n", 1707 | "coef_tdf = coef_df.T\n", 1708 | "coef_tdf" 1709 | ] 1710 | }, 1711 | { 1712 | "cell_type": "code", 1713 | "execution_count": null, 1714 | "metadata": {}, 1715 | "outputs": [], 1716 | "source": [ 1717 | "coef_tdf.to_csv('my_1vr_logreg_default_coef.tsv', sep='\\t')" 1718 | ] 1719 | }, 1720 | { 1721 | "cell_type": "code", 1722 | "execution_count": null, 1723 | "metadata": {}, 1724 | "outputs": [], 1725 | "source": [] 1726 | }, 1727 | { 1728 | "cell_type": "code", 1729 | "execution_count": null, 1730 | "metadata": {}, 1731 | "outputs": [], 1732 | "source": [] 1733 | }, 1734 | { 1735 | "cell_type": "code", 1736 | "execution_count": null, 1737 | "metadata": {}, 1738 | "outputs": [], 1739 | "source": [] 1740 | }, 1741 | { 1742 | "cell_type": "code", 1743 | "execution_count": null, 1744 | "metadata": {}, 1745 | "outputs": [], 1746 | "source": [] 1747 | }, 1748 | { 1749 | "cell_type": 
"code", 1750 | "execution_count": null, 1751 | "metadata": {}, 1752 | "outputs": [], 1753 | "source": [] 1754 | }, 1755 | { 1756 | "cell_type": "code", 1757 | "execution_count": 95, 1758 | "metadata": { 1759 | "scrolled": true 1760 | }, 1761 | "outputs": [ 1762 | { 1763 | "name": "stdout", 1764 | "output_type": "stream", 1765 | "text": [ 1766 | "C: 1e-05\n", 1767 | "Train score: 0.07767\n", 1768 | " Test score: 0.07150\n", 1769 | "0.8131 g_action\n", 1770 | "0.8852 g_adventure\n", 1771 | "0.9691 g_animation\n", 1772 | "0.9478 g_biography\n", 1773 | "0.6407 g_comedy\n", 1774 | "0.8392 g_crime\n", 1775 | "0.9525 g_documentary\n", 1776 | "0.6377 g_drama\n", 1777 | "0.9406 g_family\n", 1778 | "0.9249 g_fantasy\n", 1779 | "0.9892 g_film-noir\n", 1780 | "0.9561 g_history\n", 1781 | "0.8643 g_horror\n", 1782 | "0.9342 g_music\n", 1783 | "0.9698 g_musical\n", 1784 | "0.9087 g_mystery\n", 1785 | "0.7738 g_romance\n", 1786 | "0.9179 g_sci-fi\n", 1787 | "0.9708 g_sport\n", 1788 | "0.7542 g_thriller\n", 1789 | "0.9489 g_war\n", 1790 | "0.9775 g_western\n", 1791 | "C: 0.0001\n", 1792 | "Train score: 0.17256\n", 1793 | " Test score: 0.12355\n", 1794 | "0.8506 g_action\n", 1795 | "0.8946 g_adventure\n", 1796 | "0.9691 g_animation\n", 1797 | "0.9478 g_biography\n", 1798 | "0.7606 g_comedy\n", 1799 | "0.8627 g_crime\n", 1800 | "0.9558 g_documentary\n", 1801 | "0.7239 g_drama\n", 1802 | "0.9409 g_family\n", 1803 | "0.9282 g_fantasy\n", 1804 | "0.9892 g_film-noir\n", 1805 | "0.9562 g_history\n", 1806 | "0.8956 g_horror\n", 1807 | "0.9388 g_music\n", 1808 | "0.9698 g_musical\n", 1809 | "0.9100 g_mystery\n", 1810 | "0.8046 g_romance\n", 1811 | "0.9305 g_sci-fi\n", 1812 | "0.9712 g_sport\n", 1813 | "0.7984 g_thriller\n", 1814 | "0.9539 g_war\n", 1815 | "0.9784 g_western\n", 1816 | "C: 0.001\n", 1817 | "Train score: 0.35680\n", 1818 | " Test score: 0.13993\n", 1819 | "0.8631 g_action\n", 1820 | "0.9027 g_adventure\n", 1821 | "0.9714 g_animation\n", 1822 | "0.9518 g_biography\n", 1823 | 
"0.7512 g_comedy\n", 1824 | "0.8729 g_crime\n", 1825 | "0.9720 g_documentary\n", 1826 | "0.7047 g_drama\n", 1827 | "0.9439 g_family\n", 1828 | "0.9330 g_fantasy\n", 1829 | "0.9891 g_film-noir\n", 1830 | "0.9615 g_history\n", 1831 | "0.9270 g_horror\n", 1832 | "0.9494 g_music\n", 1833 | "0.9703 g_musical\n", 1834 | "0.9103 g_mystery\n", 1835 | "0.8112 g_romance\n", 1836 | "0.9467 g_sci-fi\n", 1837 | "0.9792 g_sport\n", 1838 | "0.7975 g_thriller\n", 1839 | "0.9649 g_war\n", 1840 | "0.9855 g_western\n", 1841 | "C: 0.01\n", 1842 | "Train score: 0.54267\n", 1843 | " Test score: 0.10558\n", 1844 | "0.8352 g_action\n", 1845 | "0.8771 g_adventure\n", 1846 | "0.9702 g_animation\n", 1847 | "0.9394 g_biography\n", 1848 | "0.7344 g_comedy\n", 1849 | "0.8410 g_crime\n", 1850 | "0.9731 g_documentary\n", 1851 | "0.6931 g_drama\n", 1852 | "0.9317 g_family\n", 1853 | "0.9137 g_fantasy\n", 1854 | "0.9892 g_film-noir\n", 1855 | "0.9506 g_history\n", 1856 | "0.9117 g_horror\n", 1857 | "0.9385 g_music\n", 1858 | "0.9655 g_musical\n", 1859 | "0.8823 g_mystery\n", 1860 | "0.7822 g_romance\n", 1861 | "0.9370 g_sci-fi\n", 1862 | "0.9794 g_sport\n", 1863 | "0.7769 g_thriller\n", 1864 | "0.9610 g_war\n", 1865 | "0.9868 g_western\n", 1866 | "C: 0.1\n", 1867 | "Train score: 0.60312\n", 1868 | " Test score: 0.07882\n", 1869 | "0.8002 g_action\n", 1870 | "0.8517 g_adventure\n", 1871 | "0.9643 g_animation\n", 1872 | "0.9276 g_biography\n", 1873 | "0.7265 g_comedy\n", 1874 | "0.8064 g_crime\n", 1875 | "0.9720 g_documentary\n", 1876 | "0.6883 g_drama\n", 1877 | "0.9197 g_family\n", 1878 | "0.8938 g_fantasy\n", 1879 | "0.9888 g_film-noir\n", 1880 | "0.9392 g_history\n", 1881 | "0.9019 g_horror\n", 1882 | "0.9240 g_music\n", 1883 | "0.9587 g_musical\n", 1884 | "0.8558 g_mystery\n", 1885 | "0.7657 g_romance\n", 1886 | "0.9261 g_sci-fi\n", 1887 | "0.9796 g_sport\n", 1888 | "0.7593 g_thriller\n", 1889 | "0.9569 g_war\n", 1890 | "0.9868 g_western\n", 1891 | "C: 1\n", 1892 | "Train score: 0.62265\n", 1893 
| " Test score: 0.06763\n", 1894 | "0.7848 g_action\n", 1895 | "0.8404 g_adventure\n", 1896 | "0.9593 g_animation\n", 1897 | "0.9157 g_biography\n", 1898 | "0.7252 g_comedy\n", 1899 | "0.7922 g_crime\n", 1900 | "0.9695 g_documentary\n", 1901 | "0.6874 g_drama\n", 1902 | "0.9060 g_family\n", 1903 | "0.8818 g_fantasy\n", 1904 | "0.9879 g_film-noir\n", 1905 | "0.9289 g_history\n", 1906 | "0.8923 g_horror\n", 1907 | "0.9136 g_music\n", 1908 | "0.9497 g_musical\n", 1909 | "0.8408 g_mystery\n", 1910 | "0.7615 g_romance\n", 1911 | "0.9157 g_sci-fi\n", 1912 | "0.9778 g_sport\n", 1913 | "0.7568 g_thriller\n", 1914 | "0.9501 g_war\n", 1915 | "0.9854 g_western\n", 1916 | "Wall time: 1h 48min 14s\n" 1917 | ] 1918 | } 1919 | ], 1920 | "source": [ 1921 | "%%time\n", 1922 | "\n", 1923 | "c_values = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]\n", 1924 | "train_scores = []\n", 1925 | "test_scores = []\n", 1926 | "\n", 1927 | "for c_val in c_values:\n", 1928 | " my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='sag', max_iter=3000, C=c_val, n_jobs=-1), n_jobs=-1).fit(X_train_s, y_train)\n", 1929 | "\n", 1930 | " # EXPORT AND SAVE THE MODEL\n", 1931 | " joblib.dump(my_log_model, f'models/my_1vr_logreg_sag_{c_val}.pkl')\n", 1932 | " \n", 1933 | " # Make predictions\n", 1934 | " y_train_pred_log = my_log_model.predict(X_train_s)\n", 1935 | " y_pred_log = my_log_model.predict(X_test_s)\n", 1936 | "\n", 1937 | " #my_log_model.multilabel_\n", 1938 | " #my_model.predict_proba(X_train_s)\n", 1939 | " \n", 1940 | " # Check overall accuracies\n", 1941 | " from sklearn.metrics import accuracy_score\n", 1942 | " train_acc = accuracy_score(y_train, y_train_pred_log)\n", 1943 | " test_acc = accuracy_score(y_test, y_pred_log)\n", 1944 | " train_scores.append(train_acc)\n", 1945 | " test_scores.append(test_acc)\n", 1946 | " print(f'C: {c_val}')\n", 1947 | " print(f'Train score: {train_acc:0.5f}')\n", 1948 | " print(f' Test score: {test_acc:0.5f}')\n", 1949 | "\n", 1950 | " 
y_train_pred_log_df = pd.DataFrame(y_train_pred_log)\n", 1951 | " y_train_pred_log_df.columns = genre_cols\n", 1952 | "\n", 1953 | " y_pred_log_df = pd.DataFrame(y_pred_log)\n", 1954 | " y_pred_log_df.columns = genre_cols\n", 1955 | "\n", 1956 | " test_acc_dict = {}\n", 1957 | " # Test genre set predictions\n", 1958 | " for g in genre_cols:\n", 1959 | " score = accuracy_score(y_test[g], y_pred_log_df[g])\n", 1960 | " test_acc_dict[g[2:]] = score\n", 1961 | " print(f'{score:0.4f} {g}')\n", 1962 | "\n", 1963 | " # Export genre scores\n", 1964 | " test_scores_log = pd.DataFrame.from_dict(test_acc_dict, orient='index', columns=['score'])\n", 1965 | " test_scores_log.to_csv(f'test_scores_logreg_sag_{c_val}.csv', index_label='genre')" 1966 | ] 1967 | }, 1968 | { 1969 | "cell_type": "code", 1970 | "execution_count": 96, 1971 | "metadata": {}, 1972 | "outputs": [ 1973 | { 1974 | "data": { 1975 | "text/plain": [ 1976 | "" 1977 | ] 1978 | }, 1979 | "execution_count": 96, 1980 | "metadata": {}, 1981 | "output_type": "execute_result" 1982 | }, 1983 | { 1984 | "data": { 1985 | "image/png": "[base64-encoded PNG data omitted: line plot of train and test accuracy against C, with a log-scaled x-axis]", 1986 | "text/plain": [ 1987 | "" 1988 | ] 1989 | }, 1990 | "metadata": { 1991 | "needs_background": "light" 1992 | }, 1993 | "output_type": "display_data" 1994 | } 1995 | ], 1996 | "source": [ 1997 | "plt.figure()\n", 1998 | "plt.plot(c_values, train_scores, label='train')\n", 1999 | "plt.plot(c_values, test_scores, label='test')\n", 2000 | "plt.xscale('log')\n", 2001 | "plt.show()" 2002 | ] 2003 | }, 2004 | { 2005 | "cell_type": "code", 2006 | "execution_count": 110, 2007 | "metadata": {}, 2008 | "outputs": [ 2009 | { 2010 | "data": { 2011 | "text/plain": [ 2012 | "[0.07767076472415782,\n", 2013 | " 0.17256224757001465,\n", 2014 | " 0.35679730149571703,\n", 2015 | " 0.5426745373041587,\n", 2016 | " 0.6031245839066175,\n", 2017 | " 0.6226532333229773]" 2018 | ] 2019 | }, 2020 | "execution_count": 110, 2021 | "metadata": {}, 2022 | "output_type": "execute_result" 2023 | } 2024 | ], 2025 | "source": [ 2026 | "train_scores" 2027 | ] 2028 | }, 2029 | { 2030 | "cell_type": "code", 2031 | "execution_count": 111, 2032 | "metadata": {}, 2033 | "outputs": [ 2034 | { 2035 | "data": { 2036 | "text/plain": [ 2037 | "[0.0714951404606577,\n", 2038 | " 0.12355212355212356,\n", 2039 | " 0.1399281054453468,\n", 2040 | " 0.10557848488882972,\n", 2041 | " 0.07881773399014778,\n", 2042 | " 0.06763413659965384]" 2043 | ] 2044 | }, 2045 | "execution_count": 111, 2046 | "metadata": {}, 2047 | "output_type": "execute_result" 2048 | } 2049 | ], 2050 | "source": [ 2051 | "test_scores" 2052 | ] 2053 | }, 2054 | { 2055 | "cell_type": "code", 2056 | "execution_count": null, 2057 | "metadata": {}, 2058 | "outputs": [], 2059 | "source": [] 2060 | } 2061 | ], 2062 | "metadata": { 2063 | "kernelspec": { 2064 | "display_name": "Python 3", 2065 | "language": "python", 2066 | "name": "python3" 2067 | }, 2068 | "language_info": { 2069 | "codemirror_mode": { 2070 | "name": "ipython", 2071 | "version": 3 2072 | }, 2073 | "file_extension": ".py", 2074 | "mimetype": "text/x-python", 2075 | "name": "python", 2076 | "nbconvert_exporter": 
"python", 2077 | "pygments_lexer": "ipython3", 2078 | "version": "3.7.4" 2079 | } 2080 | }, 2081 | "nbformat": 4, 2082 | "nbformat_minor": 4 2083 | } 2084 | -------------------------------------------------------------------------------- /demo/models/my_best_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/models/my_best_model.pkl -------------------------------------------------------------------------------- /demo/models/my_best_scaler.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/models/my_best_scaler.pkl -------------------------------------------------------------------------------- /demo/models/my_best_tfidf.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/models/my_best_tfidf.pkl -------------------------------------------------------------------------------- /demo/predict.py: -------------------------------------------------------------------------------- 1 | from flask import Flask, render_template, flash, request 2 | 3 | #Importing the packages we will be using 4 | # Basic Packages 5 | import numpy as np 6 | import pandas as pd 7 | #pd.set_option('display.max_columns', 500) 8 | #np.set_printoptions(suppress=True) 9 | 10 | # NLTK Packages 11 | #import nltk 12 | # Use the code below to download the NLTK package, a straightforward GUI should pop up 13 | #nltk.download() 14 | #from nltk.corpus import stopwords 15 | from nltk.tokenize import word_tokenize 16 | from nltk.stem import PorterStemmer 17 | from nltk.stem import WordNetLemmatizer 18 | import joblib 19 | 20 | 
#stop_words = stopwords.words('english') 21 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"] 22 | #Adds stuff to our stop words list 23 | stop_words.extend(['.',',']) 24 | 25 | ## This function can improve, simplify. 
Look into Text Data Lecture 26 | def remove_stopwords(list_of_tokens): 27 | """ 28 | Removes stopwords from a list of tokens and returns the remaining tokens. 29 | """ 30 | 31 | cleaned_tokens = [] 32 | 33 | for token in list_of_tokens: 34 | if token in stop_words: continue 35 | cleaned_tokens.append(token) 36 | 37 | return cleaned_tokens 38 | def stemmer(list_of_tokens): 39 | ''' 40 | Takes a list of tokens and returns a list of stemmed tokens. 41 | ''' 42 | 43 | stemmed_tokens_list = [] 44 | porter = PorterStemmer() # create the stemmer once instead of once per token 45 | for i in list_of_tokens: 46 | 47 | token = porter.stem(i) 48 | stemmed_tokens_list.append(token) 49 | 50 | return stemmed_tokens_list 51 | 52 | #from nltk.stem import WordNetLemmatizer 53 | 54 | def lemmatizer(list_of_tokens): 55 | '''Takes a list of tokens and returns a list of lemmatized tokens.''' 56 | lemmatized_tokens_list = [] 57 | wnl = WordNetLemmatizer() # create the lemmatizer once instead of once per token 58 | for i in list_of_tokens: 59 | token = wnl.lemmatize(i) 60 | lemmatized_tokens_list.append(token) 61 | 62 | return lemmatized_tokens_list 63 | 64 | 65 | def the_untokenizer(token_list): 66 | ''' 67 | Joins all the tokenized words in the list into one string. 68 | Used after preprocessing steps such as removing stopwords and lemmatizing. 
69 | ''' 70 | return " ".join(token_list) 71 | 72 | def clean_string(my_string): 73 | tokenized_list = word_tokenize(my_string) 74 | removed_stopwords = remove_stopwords(tokenized_list) 75 | stemmed_words = stemmer(removed_stopwords) 76 | lemmatized_words = lemmatizer(stemmed_words) 77 | back_to_string = the_untokenizer(lemmatized_words) 78 | return back_to_string 79 | 80 | app = Flask("genre_prediction", template_folder='templates') 81 | #app = Flask("genre_prediction", template_folder='/home/TomKeith/genre/templates') 82 | app.secret_key = "super secret key" 83 | 84 | @app.route('/', methods=["GET","POST"]) 85 | def predict(): 86 | 87 | #error = 'trying post' 88 | try: 89 | if request.method == "POST": 90 | my_string = request.form['plot'] 91 | 92 | train_df = pd.read_csv('train_medians.csv') 93 | my_model = joblib.load('models/my_best_model.pkl') 94 | my_scaler = joblib.load('models/my_best_scaler.pkl') 95 | my_tfidf = joblib.load('models/my_best_tfidf.pkl') 96 | 97 | #train_df = pd.read_csv('/home/TomKeith/genre/train_medians.csv') 98 | #my_model = joblib.load('/home/TomKeith/genre/models/my_best_model.pkl') 99 | #my_scaler = joblib.load('/home/TomKeith/genre/models/my_best_scaler.pkl') 100 | #my_tfidf = joblib.load('/home/TomKeith/genre/models/my_best_tfidf.pkl') 101 | 102 | genre_cols = ['action','adventure','animation','biography','comedy','crime','documentary',\ 103 | 'drama','family','fantasy','film-noir','history','horror','music','musical',\ 104 | 'mystery','romance','sci-fi','sport','thriller','war','western'] 105 | 106 | feature_cols = ['f_release_year','f_release_month','f_runtime','f_word_count_long','f_imdb_rating',\ 107 | 'f_num_imdb_votes','f_num_user_reviews','f_num_critic_reviews'] 108 | 109 | 110 | feature_cols_df = pd.DataFrame([[0]*8 ], columns=feature_cols) 111 | 112 | input_tfidf = my_tfidf.transform([clean_string(my_string)]) 113 | input_transformed_df = pd.DataFrame(input_tfidf.toarray(), columns=my_tfidf.get_feature_names()) 114 | 115 | 
input_final = pd.concat([feature_cols_df, input_transformed_df], axis=1) 116 | 117 | for col in feature_cols: 118 | input_final.at[0,col] = train_df[col].median() 119 | input_final.at[0,'f_word_count_long'] = len(my_string) 120 | input_final_df = my_scaler.transform(input_final) 121 | 122 | input_pred = my_model.predict_proba(input_final_df) 123 | 124 | df = pd.DataFrame(input_pred, columns=genre_cols).T.sort_values(0, ascending=False) 125 | output_list = [] 126 | for index, row in df.iterrows(): 127 | if row.values[0] >= 0.2: 128 | temp_list = [int(round(row.values[0]*100,0)), index.capitalize()] 129 | output_list.append(temp_list) 130 | return render_template('predict.html', results=output_list, my_string=my_string) 131 | else: 132 | return render_template('predict.html') 133 | 134 | except Exception as e: 135 | print(e) 136 | #return json.dumps({'success':True}, 200, {'ContentType':'application/json'}) 137 | return render_template("predict.html", error = e) 138 | 139 | if __name__ == "__main__": 140 | app.debug = True 141 | app.run() -------------------------------------------------------------------------------- /demo/static/css/styles.css: -------------------------------------------------------------------------------- 1 | @import url('https://fonts.googleapis.com/css2?family=Raleway:wght@400;700&display=swap'); 2 | 3 | .container { 4 | margin: auto; 5 | width: 800px; 6 | } 7 | 8 | body { 9 | text-align: center; 10 | background-color: #f46524; 11 | font-family: 'Raleway', sans-serif; 12 | color: white; 13 | } 14 | h1 { 15 | font-weight: 700; 16 | font-size: 64pt; 17 | font-family: 'Raleway', sans-serif; 18 | margin-bottom: 6px; 19 | margin-top: 10px; 20 | 21 | } 22 | 23 | h2 { 24 | font-weight: 700; 25 | font-size: 48pt; 26 | font-family: 'Raleway', sans-serif; 27 | margin-bottom: 8px; 28 | margin-top: 10px; 29 | 30 | } 31 | 32 | h3 { 33 | font-weight: 700; 34 | font-size: 24pt; 35 | font-family: 'Raleway', sans-serif; 36 | margin-bottom: 10px; 37 | 
margin-top: 10px; 38 | 39 | } 40 | 41 | p { 42 | color: white; 43 | } 44 | 45 | .results { 46 | /* width: 600px; */ 47 | } 48 | 49 | .genre_name{ 50 | font-size: 24pt; 51 | } 52 | 53 | .genre_score{ 54 | font-size: 16pt; 55 | } 56 | 57 | .genre_block{ 58 | width: 25%; 59 | display:inline-block; 60 | float: none; 61 | } 62 | 63 | img{ 64 | width: 50px; 65 | } 66 | 67 | .btn { 68 | background-color: white; 69 | border: none; 70 | color: #f46524; 71 | padding: 15px 32px; 72 | text-align: center; 73 | text-decoration: none; 74 | display: inline-block; 75 | font-size: 16px; 76 | font-weight: 700; 77 | border-radius: 6px; 78 | } 79 | 80 | .footer { 81 | position: fixed; 82 | left: 0; 83 | bottom: 0; 84 | width: 100%; 85 | /* background-color: red; */ 86 | color: black; 87 | text-align: center; 88 | /* font-weight: bold; */ 89 | font-size: 10pt; 90 | } 91 | 92 | a{ 93 | text-decoration: none; 94 | /* font-weight: bold; */ 95 | color: white; 96 | } -------------------------------------------------------------------------------- /demo/static/images/Action.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Action.png -------------------------------------------------------------------------------- /demo/static/images/Adventure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Adventure.png -------------------------------------------------------------------------------- /demo/static/images/Animation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Animation.png
-------------------------------------------------------------------------------- /demo/static/images/Biography.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Biography.png -------------------------------------------------------------------------------- /demo/static/images/Comedy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Comedy.png -------------------------------------------------------------------------------- /demo/static/images/Crime.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Crime.png -------------------------------------------------------------------------------- /demo/static/images/Documentary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Documentary.png -------------------------------------------------------------------------------- /demo/static/images/Drama.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Drama.png -------------------------------------------------------------------------------- /demo/static/images/Family.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Family.png -------------------------------------------------------------------------------- /demo/static/images/Fantasy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Fantasy.png -------------------------------------------------------------------------------- /demo/static/images/Film-noir.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Film-noir.png -------------------------------------------------------------------------------- /demo/static/images/History.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/History.png -------------------------------------------------------------------------------- /demo/static/images/Horror.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Horror.png -------------------------------------------------------------------------------- /demo/static/images/Music.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Music.png -------------------------------------------------------------------------------- /demo/static/images/Musical.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Musical.png -------------------------------------------------------------------------------- /demo/static/images/Mystery.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Mystery.png -------------------------------------------------------------------------------- /demo/static/images/Romance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Romance.png -------------------------------------------------------------------------------- /demo/static/images/Sci-fi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Sci-fi.png -------------------------------------------------------------------------------- /demo/static/images/Sport.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Sport.png -------------------------------------------------------------------------------- /demo/static/images/Thriller.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Thriller.png 
-------------------------------------------------------------------------------- /demo/static/images/War.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/War.png -------------------------------------------------------------------------------- /demo/static/images/Western.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/Western.png -------------------------------------------------------------------------------- /demo/static/images/magic-lamp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/demo/static/images/magic-lamp.png -------------------------------------------------------------------------------- /demo/templates/predict.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Genre Genie - Movie Genre Predictor - Tom Keith 6 | 7 | 8 |

Genre Genie

9 |

Movie Genre Predictor

10 |
11 |

Enter a plot summary to get genre predictions.

12 |
13 | 14 |

15 | 16 |
17 |
18 |
19 | {% if results %} 20 |
21 | {% for g in results %} 22 |
23 |
24 |
{{g[1]}}
25 |
{{g[0]}} %
26 |
27 |
28 | 29 | {% endfor %} 30 |
31 | {% endif %} 32 |
33 | {% if error %}

{{error}}

{% endif %} 34 | 37 | 38 | -------------------------------------------------------------------------------- /demo/train_medians.csv: -------------------------------------------------------------------------------- 1 | f_release_year,f_release_month,f_runtime,f_word_count_long,f_imdb_rating,f_num_imdb_votes,f_num_user_reviews,f_num_critic_reviews 2 | 2004,7,100,76,6.5,3564,35,29 3 | -------------------------------------------------------------------------------- /images/app.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/app.png -------------------------------------------------------------------------------- /images/app2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/app2.png -------------------------------------------------------------------------------- /images/genre-counts-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/genre-counts-graph.png -------------------------------------------------------------------------------- /images/imdb-bottom.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/imdb-bottom.png -------------------------------------------------------------------------------- /images/imdb-top.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/imdb-top.png -------------------------------------------------------------------------------- /images/results-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/results-graph.png -------------------------------------------------------------------------------- /images/wc_img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/images/wc_img.png -------------------------------------------------------------------------------- /models/my_1vr_logreg_0.01.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_1vr_logreg_0.01.pkl -------------------------------------------------------------------------------- /models/my_1vr_logreg_default.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_1vr_logreg_default.pkl -------------------------------------------------------------------------------- /models/my_best_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_best_model.pkl -------------------------------------------------------------------------------- /models/my_best_scaler.pkl: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_best_scaler.pkl -------------------------------------------------------------------------------- /models/my_best_tfidf.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_best_tfidf.pkl -------------------------------------------------------------------------------- /models/my_minmax_scaler.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_minmax_scaler.pkl -------------------------------------------------------------------------------- /models/my_standard_scaler.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_standard_scaler.pkl -------------------------------------------------------------------------------- /models/my_tfidf_min20.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tomkeith/Multi-label-classification-with-NLP/ceeb0393afd7c69995e2537ad2be491f03bcceca/models/my_tfidf_min20.pkl -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Genre Genie - Movie Genre Predictions 2 | 3 | ## Multi-label Classification with Natural Language Processing 4 | 5 | ### Tom Keith - BrainStation Data Science Diploma Capstone - March 2020 6 | 7 | **DEMO**: 
http://tomkeith.pythonanywhere.com/ 8 | 9 | Genre Genie app 10 | 11 | --- 12 | 13 | ### Genre Genie - Movie Genre Predictions 14 | 15 | As a huge fan of movies and the information associated with them, I have spent countless hours on the Internet Movie Database (IMDb) looking up movie trivia and box office figures, following people through film (degrees of Kevin Bacon), and exploring movies similar to my favourites. Like many, I prefer some genres to others. What makes a genre? Which words can ‘define’ a genre? How can data answer this question? These burning questions gave rise to Genre Genie. 16 | 17 | Genre Genie has been trained on information from 30,000 movies - including brief plot summaries - with the goal of accurately predicting the genres most associated with new, unseen plot summaries. I set out to tackle this multi-label classification problem using web scraping, natural language processing (NLP) and machine learning. A secondary goal was to create an interactive element for display and demonstration. 18 | 19 | ### Multi-label Classification 20 | 21 | During my computer science undergraduate degree, I was in a database class learning about 'many-to-many' relationships – a relationship that is exemplified well by movies and genres. Simply put, a movie can have many genres associated with it, and a genre can be associated with many movies (hence 'many-to-many'). This differs from the more common multi-class classification (or even binary classification), which is 'one-to-many'. Take image classification of animals: dog, cat, horse, etc. An image can’t be classified as a dog *and* a horse; it can only be one. That is multi-*class* classification, because the classes are mutually exclusive. Movie genres are multi-*label* classifications and are **not** mutually exclusive, as the success of romantic comedies shows. 
 22 | 23 | While multi-label classification comes with its own set of problems, it still overlaps heavily with multi-class classification. In the articles I consulted on multi-label classification with machine learning[1][2], a popular method is the ‘OneVsRest’ approach, which is discussed later. Multi-label classification has many real-world applications, especially in text classification: for example, news articles, blog websites, or, as in the examples from the sourced articles, toxicity levels in comments. 24 | 25 | ### The Data 26 | 27 | Finding data for this problem proved interesting, and the challenge was welcome. IMDb has its own sets of open data, which provided a great starting point but lacked some major components needed to solve this problem. Most importantly, no text data (other than titles) was present for NLP. Additionally, the dataset was limited to only the first 3 genres (alphabetically), rather than the full set shown on IMDb.com, where a movie can have up to 7 genres. 28 | 29 | Fortunately, this dataset has IMDb IDs (aka `tconst`), which can be extracted and used to scrape IMDb directly thanks to its simple URL construction - imdb.com/title/tt0076759/ where `tt0076759` is the `tconst`. A list of `tconst` values for over 30,000 movies was fetched, restricted to movies with more than 1,000 rating votes that were released in the last 100 years (1920-2019). The vote threshold was chosen on the assumption that if a movie has at least 1,000 rating votes, its IMDb data is presumed to be accurate. I chose to go back 100 years for a representative sample distribution with variety across all genres, storylines and time periods. 30 | 31 | Armed with 30,000+ IMDb IDs, I scraped IMDb for more information than required, seeking anything that could help the model make better predictions. Attempts at scraping a very long plot synopsis, which could have yielded more text to train on, were halted when 2/3 of scrapes returned null.
 I therefore used a shorter summary with an average of about 90 words and no null values. This text and the additional numeric values scraped were evaluated for feature selection. 32 | 33 | ### EDA and Feature Selection 34 | 35 | After scraping, some features just didn’t have enough data. For example, MPAA (the content rating of a movie) was missing for over 5,000 entries. After some research, I found that the modern MPAA rating system didn't yet exist for many of the movies in my dataset (which often just said 'passed' rather than a rating such as 'PG-13'). Additionally, the Metacritic rating was null for about half the data. These two features, along with the long synopsis mentioned above, were dropped altogether. 36 | 37 | I feature-engineered a word count (of the plot summary) and the release month of each movie. In the end, prior to pre-processing the text for NLP, I had 1 text column and 8 numerical features: release year, release month, runtime, word count, IMDb rating, number of IMDb rating votes, number of user reviews and number of critic reviews. 38 | 39 | The genres, our target variables, displayed some interesting trends. The representation among the genres was not balanced, however. For example, of all 25 genre tags, the top 4 (drama, comedy, thriller, romance) made up over 50% of all the genre tags (more on this in results). Movies associated with the bottom 3 (game-show, news, adult) were dropped altogether or absorbed into another genre. For example, the 'news' genre proved redundant as it was always shared with 'documentary'. This left 22 genres. 40 | 41 | ![](images/genre-counts-graph.png) 42 | 43 | ### NLP Pre-processing and TF-IDF 44 | 45 | Once I had my final dataset, I needed to convert the text into numerical values for modelling. To achieve this, I first needed to pre-process the text before vectorizing it.
 I used a tokenizer to break down strings into single words, a stemmer to chop off the ends of the words (and remove stop-words), a lemmatizer to change the words into their base form, and finally an un-tokenizer to put the words back into one string. 46 | 47 | Once the text was pre-processed, Term Frequency-Inverse Document Frequency (TF-IDF) was used to transform it into vectorized numerical values. But first, and most importantly, the data must be split! 25% was set aside for a test set (75% train). When vectorizing, I used unigrams only (single words) and a minimum document frequency of 20, meaning a word had to appear in at least 20 documents to count as a feature. Once the vectorizer had been fit on the train data, the text from both the train and test sets was transformed, then combined with the remaining features (including targets). Finally, I exported separate ‘train’ and ‘test’ files for use in modeling. 48 | 49 | ### Modeling with OneVsRest 50 | 51 | Prior to fitting, as with most modeling, scaling was required. Both standard and min-max scalers were tested; the standard scaler produced better results. 52 | 53 | When using a multi-label classifier, the target variable is not just one column, but many. My approach considered each label individually by fitting an independent model to each of the labels. This process is simplified by the 'OneVsRest' classifier, which takes one model type (logistic regression) as a parameter and is then fitted on the entire multi-dimensional target array. 54 | 55 | OneVsRest takes each target column and evaluates it independently from the others (is it action, or not – hence 'one vs rest'). Thus, when the trained model makes a prediction, it can output more than one label. Logistic regression proved to be the best model (vs Linear SVM) for this data set with the hyperparameters tuned (`C=0.01`, `solver='lbfgs'`).
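
The modeling steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the synthetic data from `make_multilabel_classification` stands in for the real TF-IDF and numeric features, and the variable names, sample sizes and `max_iter` value are assumptions. Only the 75/25 split, the standard scaler, and `C=0.01` with `solver='lbfgs'` come from the write-up.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 400 "movies", 30 features, 5 binary "genre" columns.
X, y = make_multilabel_classification(
    n_samples=400, n_features=30, n_classes=5, random_state=0
)

# Split first (75% train / 25% test), then fit the scaler on the train set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# OneVsRest fits one independent logistic regression per target column.
# max_iter=1000 is an assumption to avoid convergence warnings.
model = OneVsRestClassifier(
    LogisticRegression(C=0.01, solver="lbfgs", max_iter=1000)
)
model.fit(X_train_s, y_train)

proba = model.predict_proba(X_test_s)  # one probability per label, shape (n_test, n_labels)
pred = model.predict(X_test_s)         # 0/1 indicator per label; a row can have several 1s

# Accuracy computed one label column at a time, as in the results chart.
per_label_acc = (pred == y_test).mean(axis=0)
print(per_label_acc)
```

Because each label gets its own probability, a row of `proba` does not need to sum to 1, which is what lets a single plot summary come back with several genres.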
 56 | 57 | ### Results 58 | 59 | Similar to how the model is fitted one genre at a time, the results in the chart below are also independent of each other. The predicted values are checked against the test set targets and reported as an 'accuracy score'. As shown in the chart, the accuracy score is much higher for genres which are less represented (right-most), and the genres which have a greater representation have lower accuracy scores (left-most). These higher accuracies could be attributed to having more distinct words/data to identify that genre. The inverse is also true for the lower accuracies where there is too much overlap with other genres. Overall, the average genre accuracy was 90% - a score I consider 'good enough' for this application. 60 | 61 | ![](images/results-graph.png) 62 | 63 | ### Demo with Flask 64 | 65 | While outside the scope of my initial goal, I wanted a fun way of demonstrating my model to the public on demo day. I created a Flask app where a user can input text (a plot summary) and the model will predict the genres most associated with that input text. All of the demo code is included in this project, and the app is temporarily available to demo here: http://tomkeith.pythonanywhere.com/ 66 | 67 | Genre Genie app 68 | 69 | ### Next Steps 70 | 71 | This is an unbalanced dataset. There are many options for dealing with unbalanced data sets. Next steps could try to deal with this imbalance better than just using logistic regression, which handles it decently well. Over/under sampling, looking into different scoring methods like f1 score, precision and recall, and reviewing each of the 22 genres' confusion matrices would give a better understanding of how accurate the predictions are. Finally, the same multi-label NLP approach could be applied to a more business-oriented problem, such as classifying customer complaints or articles and blog posts.
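
As a rough sketch of the per-genre scoring suggested above, the snippet below compares accuracy with precision, recall, f1 and the confusion matrix for each label. The genre names and the tiny `y_true` / `y_pred` arrays are made up for illustration and are not results from this project.

```python
import numpy as np
from sklearn.metrics import (
    multilabel_confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical toy targets and predictions: 6 movies x 3 genre columns.
genres = ["drama", "comedy", "western"]
y_true = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
])
y_pred = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
])

# Per-label accuracy: fraction of rows where the 0/1 value matches.
accuracy = (y_true == y_pred).mean(axis=0)

# average=None returns one value per label instead of a single summary number.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

# One 2x2 matrix per label, laid out as [[tn, fp], [fn, tp]].
mcm = multilabel_confusion_matrix(y_true, y_pred)

for i, genre in enumerate(genres):
    tn, fp, fn, tp = mcm[i].ravel()
    print(
        f"{genre:8s} acc={accuracy[i]:.2f} prec={precision[i]:.2f} "
        f"rec={recall[i]:.2f} f1={f1[i]:.2f} tn={tn} fp={fp} fn={fn} tp={tp}"
    )
```

In this toy data the rare 'western' label is never predicted, yet 5 of its 6 rows still count as correct. That is the kind of gap between accuracy and recall that the per-genre confusion matrices would expose.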
72 | 73 | --- 74 | 75 | ## PROJECT FILES AND FOLDERS 76 | 77 | ### Files 78 | 79 | - **`1.1-imdb-datasets.ipynb`** - Explore IMDb datasets 80 | - **`1.2-scraping-imdb.ipynb`** - Web scraper designed to scrape IMDb.com titles and export .tsv files. 81 | - **`1.3-data-merge-clean-encode.ipynb`** - Merge scraped data from IMDb, clean, and binary encode the genres (from list format). 82 | - **`2.1-eda.ipynb`** - EDA, feature engineering and prepare dataset for NLP preprocessing. 83 | - **`2.2-data-preprocessing.ipynb`** - Pre-process text data, split into train and test sets, TF-IDF. 84 | - **`3.1-modeling.ipynb`** - Fit and optimize multi-label classification model (OneVsRest) and measure accuracy. 85 | - **`3.2-best-model.ipynb`** - Create final model from optimized hyperparameters on full dataset. 86 | - **`3.3-wordclouds.ipynb`** - Bonus workbook only used to create a wordcloud in my presentation. 87 | 88 | ### Folders 89 | 90 | - **`/data/`** - **Not all data has been included** due to file size limits. Full project with data files here. 91 | - **`/demo/`** - Local Flask app (run with predict.py). Alternatively, visit: http://tomkeith.pythonanywhere.com/ 92 | - **`/images/`** - Contains misc images used in report and workbooks. 93 | - **`/models/`** - Contains only the best models (.pkl) and a few samples. 94 | - **`/rawdata/`** - Contains 100 .tsv files of movie data, one for each year. 95 | 96 | --- 97 | --------------------------------------------------------------------------------