├── Amazon ML challenge data preprocessing intuition.docx ├── Leaderboard - Amazon ML Challenge _ HackerEarth - Google Chrome 8_2_2021 2_37_28 PM.png ├── Product_Based_Browse_Node.ipynb ├── README.md └── submission.csv /Amazon ML challenge data preprocessing intuition.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/akshatprogrammer/Amazon-ML-Challenge/ba4089b58431fb63042e5abd8c28d37dae3d55f1/Amazon ML challenge data preprocessing intuition.docx -------------------------------------------------------------------------------- /Leaderboard - Amazon ML Challenge _ HackerEarth - Google Chrome 8_2_2021 2_37_28 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/akshatprogrammer/Amazon-ML-Challenge/ba4089b58431fb63042e5abd8c28d37dae3d55f1/Leaderboard - Amazon ML Challenge _ HackerEarth - Google Chrome 8_2_2021 2_37_28 PM.png -------------------------------------------------------------------------------- /Product_Based_Browse_Node.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "engine.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "code", 20 | "metadata": { 21 | "colab": { 22 | "base_uri": "https://localhost:8080/" 23 | }, 24 | "id": "EPIz_ZOBSMU-", 25 | "outputId": "174f0363-3c16-47ad-8136-2a90c57b6202" 26 | }, 27 | "source": [ 28 | "from google.colab import drive\n", 29 | "drive.mount('/content/drive')\n", 30 | "import pandas as pd\n", 31 | "import numpy as np\n", 32 | "import csv\n", 33 | "import nltk\n", 34 | "from nltk.corpus import stopwords\n", 35 | "from nltk.stem import WordNetLemmatizer\n", 36 | "from 
sklearn.feature_extraction.text import TfidfVectorizer\n", 37 | "from sklearn.model_selection import train_test_split\n", 38 | "from sklearn.feature_selection import chi2\n", 39 | "import re\n", 40 | "from sklearn.model_selection import train_test_split\n", 41 | "from sklearn.linear_model import LogisticRegression\n", 42 | "from sklearn.naive_bayes import MultinomialNB\n", 43 | "from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix" 44 | ], 45 | "execution_count": 1, 46 | "outputs": [ 47 | { 48 | "output_type": "stream", 49 | "text": [ 50 | "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" 51 | ], 52 | "name": "stdout" 53 | } 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "metadata": { 59 | "colab": { 60 | "base_uri": "https://localhost:8080/", 61 | "height": 287 62 | }, 63 | "id": "tTBsJUAVb6Xx", 64 | "outputId": "633d2041-56e9-4f5d-a133-6c7e7722168e" 65 | }, 66 | "source": [ 67 | "#test file\n", 68 | "path = \"/content/drive/MyDrive/Datasets/test.csv\"\n", 69 | "df_test = pd.read_csv(path, escapechar = \"\\\\\", quoting = csv.QUOTE_NONE)\n", 70 | "df_test.head()" 71 | ], 72 | "execution_count": 2, 73 | "outputs": [ 74 | { 75 | "output_type": "execute_result", 76 | "data": { 77 | "text/html": [ 78 | "
\n", 79 | "\n", 92 | "\n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | "
| | PRODUCT_ID | TITLE | DESCRIPTION | BULLET_POINTS | BRAND |
|---|---|---|---|---|---|
| 0 | 1 | Command 3M Small Kitchen Hooks, White, Decorat... | Sale Unit: PACK | [INCLUDES - 9 hooks and 12 small indoor strips... | Command |
| 1 | 2 | O'Neal Jump Hardware JAG Unisex-Adult Glove (B... | Synthetic leather palm with double-layer thumb... | [Silicone printing for a better grip. Long las... | O'Neal |
| 2 | 3 | NFL Detroit Lions Portable Party Fridge, 15.8 ... | Boelter Brands lets you celebrate your favorit... | [Runs on 12 Volt DC Power or 110 Volt AC Power... | Boelter Brands |
| 3 | 4 | Panasonic Single Line KX-TS880MX Corded Phone ... | Features: 50 Station Phonebook Corded Phone Al... | Panasonic Landline Phones doesn't come with a ... | Panasonic |
| 4 | 5 | Zero Baby Girl's 100% Cotton Innerwear Bloomer... | Zero Baby Girl Panties Set. 100% Cotton, Breat... | [Zero Baby Girl Panties, Pack of 6, 100% Cotto... | Zero |
\n", 146 | "
" 147 | ], 148 | "text/plain": [ 149 | " PRODUCT_ID ... BRAND\n", 150 | "0 1 ... Command\n", 151 | "1 2 ... O'Neal\n", 152 | "2 3 ... Boelter Brands\n", 153 | "3 4 ... Panasonic\n", 154 | "4 5 ... Zero\n", 155 | "\n", 156 | "[5 rows x 5 columns]" 157 | ] 158 | }, 159 | "metadata": { 160 | "tags": [] 161 | }, 162 | "execution_count": 2 163 | } 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "metadata": { 169 | "colab": { 170 | "base_uri": "https://localhost:8080/", 171 | "height": 287 172 | }, 173 | "id": "q4uQ8Nfrgd9I", 174 | "outputId": "ff09accf-8e03-4a67-b203-9fdd9ff4f66a" 175 | }, 176 | "source": [ 177 | "#train file\n", 178 | "path = \"/content/drive/MyDrive/Datasets/train.csv\"\n", 179 | "df_train = pd.read_csv(path, escapechar = \"\\\\\", quoting = csv.QUOTE_NONE)\n", 180 | "df_train = df_train.dropna()\n", 181 | "df_train.head()" 182 | ], 183 | "execution_count": 3, 184 | "outputs": [ 185 | { 186 | "output_type": "execute_result", 187 | "data": { 188 | "text/html": [ 189 | "
\n", 190 | "\n", 203 | "\n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | "
| | TITLE | DESCRIPTION | BULLET_POINTS | BRAND | BROWSE_NODE_ID |
|---|---|---|---|---|---|
| 0 | Pete The Cat Bedtime Blues Doll, 14.5 Inch | Pete the Cat is the coolest, most popular cat ... | [Pete the Cat Bedtime Blues plush doll,Based o... | MerryMakers | 0 |
| 1 | The New Yorker NYHM014 Refrigerator Magnet, 2 ... | The New Yorker Handsome Cello Wrapped Hard Mag... | [Cat In A Tea Cup by New Yorker cover artist G... | The New Yorker | 1 |
| 5 | Men'S Full Sleeve Raglan T-Shirts Denim T-Shir... | Men'S Full Sleeve Raglan T-Shirts Denim T-Shir... | [Color: Blue,Sleeve: Full Sleeve,Material: Cot... | Bhavya Enterprise | 5 |
| 6 | Glance Women's Wallet (Black) (LW-21) | This Black wallet by Glance will be a treasure... | [The Most Comfortable Women's Wallet That You ... | Glance | 6 |
| 7 | Wild Animals Hungry Brain Educational Flash Ca... | Wild Animals are the animals that mostly stays... | [Playful learning: Flash cards develops the lo... | hungry brain | 7 |
\n", 257 | "
" 258 | ], 259 | "text/plain": [ 260 | " TITLE ... BROWSE_NODE_ID\n", 261 | "0 Pete The Cat Bedtime Blues Doll, 14.5 Inch ... 0\n", 262 | "1 The New Yorker NYHM014 Refrigerator Magnet, 2 ... ... 1\n", 263 | "5 Men'S Full Sleeve Raglan T-Shirts Denim T-Shir... ... 5\n", 264 | "6 Glance Women's Wallet (Black) (LW-21) ... 6\n", 265 | "7 Wild Animals Hungry Brain Educational Flash Ca... ... 7\n", 266 | "\n", 267 | "[5 rows x 5 columns]" 268 | ] 269 | }, 270 | "metadata": { 271 | "tags": [] 272 | }, 273 | "execution_count": 3 274 | } 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "metadata": { 280 | "colab": { 281 | "base_uri": "https://localhost:8080/" 282 | }, 283 | "id": "DfrjJPq8T8nZ", 284 | "outputId": "c6ae6051-73c4-4506-a93c-6dc5da96b52c" 285 | }, 286 | "source": [ 287 | "punctuation_signs = list(\"?:!.,;\")\n", 288 | "nltk.download('punkt')\n", 289 | "nltk.download('wordnet')\n", 290 | "wordnet_lemmatizer = WordNetLemmatizer()\n", 291 | "nltk.download('stopwords')\n", 292 | "stop_words = list(stopwords.words('english'))" 293 | ], 294 | "execution_count": 4, 295 | "outputs": [ 296 | { 297 | "output_type": "stream", 298 | "text": [ 299 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 300 | "[nltk_data] Package punkt is already up-to-date!\n", 301 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", 302 | "[nltk_data] Package wordnet is already up-to-date!\n", 303 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 304 | "[nltk_data] Package stopwords is already up-to-date!\n" 305 | ], 306 | "name": "stdout" 307 | } 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "metadata": { 313 | "id": "hw0_D27kT9AU" 314 | }, 315 | "source": [ 316 | "df_train['Title'] = df_train['TITLE'].str.replace(\"\\r\", \" \")\n", 317 | "df_train['Title'] = df_train['Title'].str.replace(\"\\n\", \" \")\n", 318 | "df_train['Title'] = df_train['Title'].str.replace(\" \", \" \")\n", 319 | "df_train['Title'] = 
df_train['Title'].str.replace('\"', '')\n", 320 | "df_train['Title'] = df_train['Title'].str.lower()\n", 321 | "for punct_sign in punctuation_signs:\n", 322 | " df_train['Title'] = df_train['Title'].str.replace(punct_sign, '')\n", 323 | "df_train['Title'] = df_train['Title'].str.replace(\"'s\", \"\")" 324 | ], 325 | "execution_count": 5, 326 | "outputs": [] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "metadata": { 331 | "id": "-OU_-P-0UB_V" 332 | }, 333 | "source": [ 334 | "final_cols = [\"Title\", \"BROWSE_NODE_ID\"]\n", 335 | "df_train = df_train[final_cols]\n", 336 | "df_train = df_train.iloc[:35000, :]" 337 | ], 338 | "execution_count": 6, 339 | "outputs": [] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "metadata": { 344 | "id": "_EJDqp4ZULE_" 345 | }, 346 | "source": [ 347 | "df_test['Title'] = df_test['TITLE'].str.replace(\"\\r\", \" \")\n", 348 | "df_test['Title'] = df_test['Title'].str.replace(\"\\n\", \" \")\n", 349 | "df_test['Title'] = df_test['Title'].str.replace(\" \", \" \")\n", 350 | "df_test['Title'] = df_test['Title'].str.replace('\"', '')\n", 351 | "df_test['Title'] = df_test['Title'].str.lower()\n", 352 | "for punct_sign in punctuation_signs:\n", 353 | " df_test['Title'] = df_test['Title'].str.replace(punct_sign, '')\n", 354 | "df_test['Title'] = df_test['Title'].str.replace(\"'s\", \"\")" 355 | ], 356 | "execution_count": 7, 357 | "outputs": [] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "metadata": { 362 | "id": "iO2eSD_06gwR" 363 | }, 364 | "source": [ 365 | "final_cols = [\"Title\", \"PRODUCT_ID\"]\n", 366 | "df_test = df_test[final_cols]" 367 | ], 368 | "execution_count": 8, 369 | "outputs": [] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "metadata": { 374 | "colab": { 375 | "base_uri": "https://localhost:8080/", 376 | "height": 203 377 | }, 378 | "id": "HcXWz9VgUgoR", 379 | "outputId": "fe862498-dfcb-4022-b59a-29f12238f83b" 380 | }, 381 | "source": [ 382 | "df_train.head()" 383 | ], 384 | "execution_count": 9, 385 | 
"outputs": [ 386 | { 387 | "output_type": "execute_result", 388 | "data": { 389 | "text/html": [ 390 | "
\n", 391 | "\n", 404 | "\n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | "
| | Title | BROWSE_NODE_ID |
|---|---|---|
| 0 | pete the cat bedtime blues doll 145 inch | 0 |
| 1 | the new yorker nyhm014 refrigerator magnet 2 x 35 | 1 |
| 5 | men full sleeve raglan t-shirts denim t-shirt ... | 5 |
| 6 | glance women wallet (black) (lw-21) | 6 |
| 7 | wild animals hungry brain educational flash ca... | 7 |
\n", 440 | "
" 441 | ], 442 | "text/plain": [ 443 | " Title BROWSE_NODE_ID\n", 444 | "0 pete the cat bedtime blues doll 145 inch 0\n", 445 | "1 the new yorker nyhm014 refrigerator magnet 2 x 35 1\n", 446 | "5 men full sleeve raglan t-shirts denim t-shirt ... 5\n", 447 | "6 glance women wallet (black) (lw-21) 6\n", 448 | "7 wild animals hungry brain educational flash ca... 7" 449 | ] 450 | }, 451 | "metadata": { 452 | "tags": [] 453 | }, 454 | "execution_count": 9 455 | } 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "metadata": { 461 | "colab": { 462 | "base_uri": "https://localhost:8080/", 463 | "height": 203 464 | }, 465 | "id": "0eAngSitUlP2", 466 | "outputId": "173baf1b-3f82-47db-de7f-5d76a6534015" 467 | }, 468 | "source": [ 469 | "df_test.head()" 470 | ], 471 | "execution_count": 10, 472 | "outputs": [ 473 | { 474 | "output_type": "execute_result", 475 | "data": { 476 | "text/html": [ 477 | "
\n", 478 | "\n", 491 | "\n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | "
| | Title | PRODUCT_ID |
|---|---|---|
| 0 | command 3m small kitchen hooks white decorate ... | 1 |
| 1 | o'neal jump hardware jag unisex-adult glove (b... | 2 |
| 2 | nfl detroit lions portable party fridge 158 quart | 3 |
| 3 | panasonic single line kx-ts880mx corded phone ... | 4 |
| 4 | zero baby girl 100% cotton innerwear bloomer d... | 5 |
\n", 527 | "
" 528 | ], 529 | "text/plain": [ 530 | " Title PRODUCT_ID\n", 531 | "0 command 3m small kitchen hooks white decorate ... 1\n", 532 | "1 o'neal jump hardware jag unisex-adult glove (b... 2\n", 533 | "2 nfl detroit lions portable party fridge 158 quart 3\n", 534 | "3 panasonic single line kx-ts880mx corded phone ... 4\n", 535 | "4 zero baby girl 100% cotton innerwear bloomer d... 5" 536 | ] 537 | }, 538 | "metadata": { 539 | "tags": [] 540 | }, 541 | "execution_count": 10 542 | } 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "metadata": { 548 | "colab": { 549 | "base_uri": "https://localhost:8080/" 550 | }, 551 | "id": "gPq6ragPUmtD", 552 | "outputId": "d0da8e4d-dc01-43c7-824a-2f605dbf3697" 553 | }, 554 | "source": [ 555 | "df_train.isna().sum()" 556 | ], 557 | "execution_count": 11, 558 | "outputs": [ 559 | { 560 | "output_type": "execute_result", 561 | "data": { 562 | "text/plain": [ 563 | "Title 0\n", 564 | "BROWSE_NODE_ID 0\n", 565 | "dtype: int64" 566 | ] 567 | }, 568 | "metadata": { 569 | "tags": [] 570 | }, 571 | "execution_count": 11 572 | } 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "metadata": { 578 | "id": "A0y2bReP75ic" 579 | }, 580 | "source": [ 581 | "df_test[\"Title\"].fillna(\"No Data\", inplace = True)" 582 | ], 583 | "execution_count": 12, 584 | "outputs": [] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "metadata": { 589 | "colab": { 590 | "base_uri": "https://localhost:8080/" 591 | }, 592 | "id": "91d96DmBaKp6", 593 | "outputId": "8e61dd4e-8c44-4b1e-8b7b-e7205e223dd3" 594 | }, 595 | "source": [ 596 | "df_test.isna().sum()" 597 | ], 598 | "execution_count": 13, 599 | "outputs": [ 600 | { 601 | "output_type": "execute_result", 602 | "data": { 603 | "text/plain": [ 604 | "Title 0\n", 605 | "PRODUCT_ID 0\n", 606 | "dtype: int64" 607 | ] 608 | }, 609 | "metadata": { 610 | "tags": [] 611 | }, 612 | "execution_count": 13 613 | } 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "metadata": { 619 | "id": 
"s4DVt4gsXQaX" 620 | }, 621 | "source": [ 622 | "X_train, X_test, y_train = df_train[\"Title\"], df_test[\"Title\"], df_train[\"BROWSE_NODE_ID\"]" 623 | ], 624 | "execution_count": 14, 625 | "outputs": [] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "metadata": { 630 | "colab": { 631 | "base_uri": "https://localhost:8080/" 632 | }, 633 | "id": "J5X_c72rUs1l", 634 | "outputId": "abd8aaef-f245-4e77-ff72-f9b2bb066067" 635 | }, 636 | "source": [ 637 | "tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=5)\n", 638 | "print('1')\n", 639 | "X_train_vectors_tfidf = tfidf.fit_transform(X_train)\n", 640 | "print(X_train_vectors_tfidf.shape)\n", 641 | "print('1')\n", 642 | "X_test_vectors_tfidf = tfidf.transform(X_test)\n", 643 | "print(X_test_vectors_tfidf.shape)" 644 | ], 645 | "execution_count": 15, 646 | "outputs": [ 647 | { 648 | "output_type": "stream", 649 | "text": [ 650 | "1\n", 651 | "(35000, 16942)\n", 652 | "1\n", 653 | "(110775, 16942)\n" 654 | ], 655 | "name": "stdout" 656 | } 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "metadata": { 662 | "colab": { 663 | "base_uri": "https://localhost:8080/" 664 | }, 665 | "id": "hO0q3lXFU2ZN", 666 | "outputId": "607a0b95-93aa-4076-b912-8dd7234b1b12" 667 | }, 668 | "source": [ 669 | "lr_tfidf=LogisticRegression(solver = 'liblinear', C=10, penalty = 'l2')\n", 670 | "lr_tfidf.fit(X_train_vectors_tfidf, y_train)" 671 | ], 672 | "execution_count": 16, 673 | "outputs": [ 674 | { 675 | "output_type": "execute_result", 676 | "data": { 677 | "text/plain": [ 678 | "LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,\n", 679 | " intercept_scaling=1, l1_ratio=None, max_iter=100,\n", 680 | " multi_class='auto', n_jobs=None, penalty='l2',\n", 681 | " random_state=None, solver='liblinear', tol=0.0001, verbose=0,\n", 682 | " warm_start=False)" 683 | ] 684 | }, 685 | "metadata": { 686 | "tags": [] 687 | }, 688 | "execution_count": 16 689 | } 690 | ] 691 | }, 692 | { 693 | 
"cell_type": "code", 694 | "metadata": { 695 | "id": "gpEPd5PxWude" 696 | }, 697 | "source": [ 698 | "y_predict = lr_tfidf.predict(X_test_vectors_tfidf)" 699 | ], 700 | "execution_count": 17, 701 | "outputs": [] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "metadata": { 706 | "colab": { 707 | "base_uri": "https://localhost:8080/" 708 | }, 709 | "id": "HSTJUhZtfLX4", 710 | "outputId": "69be8688-c9fc-4cfe-9f72-d5d137ee66fc" 711 | }, 712 | "source": [ 713 | "y_predict" 714 | ], 715 | "execution_count": 18, 716 | "outputs": [ 717 | { 718 | "output_type": "execute_result", 719 | "data": { 720 | "text/plain": [ 721 | "array([1140, 1045, 8269, ..., 800, 800, 800])" 722 | ] 723 | }, 724 | "metadata": { 725 | "tags": [] 726 | }, 727 | "execution_count": 18 728 | } 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "metadata": { 734 | "colab": { 735 | "base_uri": "https://localhost:8080/" 736 | }, 737 | "id": "64QUo0xcfTJI", 738 | "outputId": "50e42da2-b50d-4f24-b23d-f1b19deb1f1b" 739 | }, 740 | "source": [ 741 | "len(y_predict)" 742 | ], 743 | "execution_count": 19, 744 | "outputs": [ 745 | { 746 | "output_type": "execute_result", 747 | "data": { 748 | "text/plain": [ 749 | "110775" 750 | ] 751 | }, 752 | "metadata": { 753 | "tags": [] 754 | }, 755 | "execution_count": 19 756 | } 757 | ] 758 | }, 759 | { 760 | "cell_type": "code", 761 | "metadata": { 762 | "id": "-cEEL7QfgBzn" 763 | }, 764 | "source": [ 765 | "d = {\"BROWSE_NODE_ID\" : y_predict}" 766 | ], 767 | "execution_count": 20, 768 | "outputs": [] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "metadata": { 773 | "colab": { 774 | "base_uri": "https://localhost:8080/", 775 | "height": 417 776 | }, 777 | "id": "8ymabN1ugEVa", 778 | "outputId": "58f65c77-f6d3-4d3c-cb60-eab928589a8f" 779 | }, 780 | "source": [ 781 | "df1 = pd.DataFrame(data=d)\n", 782 | "df1" 783 | ], 784 | "execution_count": 21, 785 | "outputs": [ 786 | { 787 | "output_type": "execute_result", 788 | "data": { 789 | "text/html": [ 790 | "
\n", 791 | "\n", 804 | "\n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | "
| | BROWSE_NODE_ID |
|---|---|
| 0 | 1140 |
| 1 | 1045 |
| 2 | 8269 |
| 3 | 125 |
| 4 | 1922 |
| ... | ... |
| 110770 | 4368 |
| 110771 | 1551 |
| 110772 | 800 |
| 110773 | 800 |
| 110774 | 800 |
\n", 858 | "

110775 rows × 1 columns

\n", 859 | "
" 860 | ], 861 | "text/plain": [ 862 | " BROWSE_NODE_ID\n", 863 | "0 1140\n", 864 | "1 1045\n", 865 | "2 8269\n", 866 | "3 125\n", 867 | "4 1922\n", 868 | "... ...\n", 869 | "110770 4368\n", 870 | "110771 1551\n", 871 | "110772 800\n", 872 | "110773 800\n", 873 | "110774 800\n", 874 | "\n", 875 | "[110775 rows x 1 columns]" 876 | ] 877 | }, 878 | "metadata": { 879 | "tags": [] 880 | }, 881 | "execution_count": 21 882 | } 883 | ] 884 | }, 885 | { 886 | "cell_type": "code", 887 | "metadata": { 888 | "colab": { 889 | "base_uri": "https://localhost:8080/", 890 | "height": 203 891 | }, 892 | "id": "ZpXMka2cgqi9", 893 | "outputId": "585cdf6a-1719-477b-b764-1a9f57fbdc12" 894 | }, 895 | "source": [ 896 | "df_new = pd.concat([df_test, df1], axis = 1)\n", 897 | "df_new.head()" 898 | ], 899 | "execution_count": 22, 900 | "outputs": [ 901 | { 902 | "output_type": "execute_result", 903 | "data": { 904 | "text/html": [ 905 | "
\n", 906 | "\n", 919 | "\n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | "
| | Title | PRODUCT_ID | BROWSE_NODE_ID |
|---|---|---|---|
| 0 | command 3m small kitchen hooks white decorate ... | 1 | 1140 |
| 1 | o'neal jump hardware jag unisex-adult glove (b... | 2 | 1045 |
| 2 | nfl detroit lions portable party fridge 158 quart | 3 | 8269 |
| 3 | panasonic single line kx-ts880mx corded phone ... | 4 | 125 |
| 4 | zero baby girl 100% cotton innerwear bloomer d... | 5 | 1922 |
\n", 961 | "
" 962 | ], 963 | "text/plain": [ 964 | " Title ... BROWSE_NODE_ID\n", 965 | "0 command 3m small kitchen hooks white decorate ... ... 1140\n", 966 | "1 o'neal jump hardware jag unisex-adult glove (b... ... 1045\n", 967 | "2 nfl detroit lions portable party fridge 158 quart ... 8269\n", 968 | "3 panasonic single line kx-ts880mx corded phone ... ... 125\n", 969 | "4 zero baby girl 100% cotton innerwear bloomer d... ... 1922\n", 970 | "\n", 971 | "[5 rows x 3 columns]" 972 | ] 973 | }, 974 | "metadata": { 975 | "tags": [] 976 | }, 977 | "execution_count": 22 978 | } 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "metadata": { 984 | "colab": { 985 | "base_uri": "https://localhost:8080/", 986 | "height": 417 987 | }, 988 | "id": "0UiIR1C3g0aL", 989 | "outputId": "a869900d-cd40-4364-a8b6-abd3ad6dd8af" 990 | }, 991 | "source": [ 992 | "l = [\"PRODUCT_ID\", \"BROWSE_NODE_ID\"]\n", 993 | "df_new = df_new[l]\n", 994 | "df_new" 995 | ], 996 | "execution_count": 23, 997 | "outputs": [ 998 | { 999 | "output_type": "execute_result", 1000 | "data": { 1001 | "text/html": [ 1002 | "
\n", 1003 | "\n", 1016 | "\n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | "
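| | PRODUCT_ID | BROWSE_NODE_ID |
|---|---|---|
| 0 | 1 | 1140 |
| 1 | 2 | 1045 |
| 2 | 3 | 8269 |
| 3 | 4 | 125 |
| 4 | 5 | 1922 |
| ... | ... | ... |
| 110770 | 110771 | 4368 |
| 110771 | 110772 | 1551 |
| 110772 | 110773 | 800 |
| 110773 | 110774 | 800 |
| 110774 | 110775 | 800 |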
\n", 1082 | "

110775 rows × 2 columns

\n", 1083 | "
" 1084 | ], 1085 | "text/plain": [ 1086 | " PRODUCT_ID BROWSE_NODE_ID\n", 1087 | "0 1 1140\n", 1088 | "1 2 1045\n", 1089 | "2 3 8269\n", 1090 | "3 4 125\n", 1091 | "4 5 1922\n", 1092 | "... ... ...\n", 1093 | "110770 110771 4368\n", 1094 | "110771 110772 1551\n", 1095 | "110772 110773 800\n", 1096 | "110773 110774 800\n", 1097 | "110774 110775 800\n", 1098 | "\n", 1099 | "[110775 rows x 2 columns]" 1100 | ] 1101 | }, 1102 | "metadata": { 1103 | "tags": [] 1104 | }, 1105 | "execution_count": 23 1106 | } 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "metadata": { 1112 | "id": "JTyuVvuPhgSE" 1113 | }, 1114 | "source": [ 1115 | "df_new.to_csv(\"/content/drive/MyDrive/Datasets/submission.csv\", index = False, header = True)" 1116 | ], 1117 | "execution_count": 24, 1118 | "outputs": [] 1119 | }, 1120 | { 1121 | "cell_type": "code", 1122 | "metadata": { 1123 | "colab": { 1124 | "base_uri": "https://localhost:8080/", 1125 | "height": 203 1126 | }, 1127 | "id": "u8MNaBoTiuYD", 1128 | "outputId": "82ac3f23-96eb-4876-eda4-e51904beb539" 1129 | }, 1130 | "source": [ 1131 | "path = \"/content/drive/MyDrive/Datasets/submission.csv\"\n", 1132 | "df_sub = pd.read_csv(path)\n", 1133 | "df_sub.head()" 1134 | ], 1135 | "execution_count": 25, 1136 | "outputs": [ 1137 | { 1138 | "output_type": "execute_result", 1139 | "data": { 1140 | "text/html": [ 1141 | "
\n", 1142 | "\n", 1155 | "\n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | "
| | PRODUCT_ID | BROWSE_NODE_ID |
|---|---|---|
| 0 | 1 | 1140 |
| 1 | 2 | 1045 |
| 2 | 3 | 8269 |
| 3 | 4 | 125 |
| 4 | 5 | 1922 |
\n", 1191 | "
" 1192 | ], 1193 | "text/plain": [ 1194 | " PRODUCT_ID BROWSE_NODE_ID\n", 1195 | "0 1 1140\n", 1196 | "1 2 1045\n", 1197 | "2 3 8269\n", 1198 | "3 4 125\n", 1199 | "4 5 1922" 1200 | ] 1201 | }, 1202 | "metadata": { 1203 | "tags": [] 1204 | }, 1205 | "execution_count": 25 1206 | } 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "code", 1211 | "metadata": { 1212 | "colab": { 1213 | "base_uri": "https://localhost:8080/" 1214 | }, 1215 | "id": "ALnvwky6rAA0", 1216 | "outputId": "4d501fed-b5f8-44ab-9b52-c9d38c1a4911" 1217 | }, 1218 | "source": [ 1219 | "df_sub.isna().sum()" 1220 | ], 1221 | "execution_count": 26, 1222 | "outputs": [ 1223 | { 1224 | "output_type": "execute_result", 1225 | "data": { 1226 | "text/plain": [ 1227 | "PRODUCT_ID 0\n", 1228 | "BROWSE_NODE_ID 0\n", 1229 | "dtype: int64" 1230 | ] 1231 | }, 1232 | "metadata": { 1233 | "tags": [] 1234 | }, 1235 | "execution_count": 26 1236 | } 1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "metadata": { 1242 | "colab": { 1243 | "base_uri": "https://localhost:8080/", 1244 | "height": 295 1245 | }, 1246 | "id": "m5FYWOJ3rD1x", 1247 | "outputId": "84d5ac03-9cf2-41e7-a97b-cf1c550521c3" 1248 | }, 1249 | "source": [ 1250 | "df_sub.describe()" 1251 | ], 1252 | "execution_count": 27, 1253 | "outputs": [ 1254 | { 1255 | "output_type": "execute_result", 1256 | "data": { 1257 | "text/html": [ 1258 | "
\n", 1259 | "\n", 1272 | "\n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | "
| | PRODUCT_ID | BROWSE_NODE_ID |
|---|---|---|
| count | 110775.000000 | 110775.000000 |
| mean | 55388.000000 | 1814.854317 |
| std | 31978.132372 | 2984.057629 |
| min | 1.000000 | 0.000000 |
| 25% | 27694.500000 | 507.000000 |
| 50% | 55388.000000 | 1045.000000 |
| 75% | 83081.500000 | 1687.000000 |
| max | 110775.000000 | 47970.000000 |
\n", 1323 | "
" 1324 | ], 1325 | "text/plain": [ 1326 | " PRODUCT_ID BROWSE_NODE_ID\n", 1327 | "count 110775.000000 110775.000000\n", 1328 | "mean 55388.000000 1814.854317\n", 1329 | "std 31978.132372 2984.057629\n", 1330 | "min 1.000000 0.000000\n", 1331 | "25% 27694.500000 507.000000\n", 1332 | "50% 55388.000000 1045.000000\n", 1333 | "75% 83081.500000 1687.000000\n", 1334 | "max 110775.000000 47970.000000" 1335 | ] 1336 | }, 1337 | "metadata": { 1338 | "tags": [] 1339 | }, 1340 | "execution_count": 27 1341 | } 1342 | ] 1343 | }, 1344 | { 1345 | "cell_type": "code", 1346 | "metadata": { 1347 | "id": "Tx07ygYxrGlu" 1348 | }, 1349 | "source": [ 1350 | "" 1351 | ], 1352 | "execution_count": null, 1353 | "outputs": [] 1354 | } 1355 | ] 1356 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Amazon-ML-Challenge 2 | Product Browse Node Classification
3 | 
4 | ## About Amazon ML Challenge
5 | ![](https://he-s3.s3.amazonaws.com/media/cache/0a/be/0abe8c0908dcb9e67941600739f6d651.png)
6 | Amazon ML Challenge is a two-stage competition where students from all engineering campuses across India will get a unique opportunity to work on Amazon’s dataset to bring in fresh ideas and build innovative solutions for a real-world problem statement. The top three winning teams will receive pre-placement interviews (PPIs) for ML roles at Amazon along with cash prizes and certificates.
7 | 
8 | ## ML Challenge Stages
9 | ![](https://s3-ap-southeast-1.amazonaws.com/he-public-data/ML%20Challenge%20Stages_Hackerearth%201a6b4fa1.JPG)
10 | 
11 | ## Dataset Link🔗
12 | [Click Here for the Dataset](https://he-s3.s3.ap-southeast-1.amazonaws.com/media/hackathon/amazon-ml-challenge/product-browse-node-classification-2-7ff04e5a/546b594ee0a211eb.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=16c8b3b0a2fcd83db7ed5437de70012b5b4651ce5f543c1fe62512904d5f0ce7&X-Amz-Date=20210802T090306Z&X-Amz-Credential=AKIA6I2ISGOYH7WWS3G5%2F20210802%2Fap-southeast-1%2Fs3%2Faws4_request)
13 | 
14 | ## Problem
15 | Amazon catalog consists of billions of products that belong to thousands of browse nodes (each browse node represents a collection of items for sale). Browse nodes are used to help customers navigate through our website and classify products to product type groups. Hence, it is important to predict the node assignment at the time of listing of the product or when the browse node information is absent.
16 | 
17 | As part of this hackathon, you will use product metadata to classify products into browse nodes. You will have access to product title, description, bullet points etc. and labels for ~3MM products to train and test your submissions. Note that there is some noise in the data - real world data looks like this!!
18 | 
19 | ## Data Description
20 | 
21 | Full Train/Test dataset details:
22 | 23 | Key column – PRODUCT_ID
24 | Input features – TITLE, DESCRIPTION, BULLET_POINTS, BRAND
25 | Target column – BROWSE_NODE_ID
26 | Train dataset size – 2,903,024
27 | Number of classes in Train – 9,919
28 | Overall Test dataset size – 110,775
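
The column layout listed above can be sanity-checked before any modelling. What follows is a minimal sketch under stated assumptions, not the team's actual code. It uses Python's standard `csv` module with the same backslash escape character and disabled quoting that the notebook in this repository passes to `pandas.read_csv`. The helper name `summarize_train` and the sample rows are made up for illustration, and the header assumes the five train columns shown in the notebook output.

```python
import csv
import io
from collections import Counter

# Target column name taken from the Data Description above.
TARGET_COLUMN = "BROWSE_NODE_ID"


def summarize_train(handle):
    """Return (row_count, class_counts) for a train-style CSV file object.

    Hypothetical helper: assumes a header row that includes BROWSE_NODE_ID.
    """
    # Assumption: backslash escapes and no quoting, mirroring the
    # escapechar / csv.QUOTE_NONE settings used in the notebook.
    reader = csv.DictReader(handle, escapechar="\\", quoting=csv.QUOTE_NONE)
    class_counts = Counter()
    rows = 0
    for row in reader:
        rows += 1
        class_counts[row[TARGET_COLUMN]] += 1
    return rows, class_counts


# Tiny made-up sample standing in for train.csv (not real challenge data).
# The second row contains an escaped comma inside the title.
SAMPLE = (
    "TITLE,DESCRIPTION,BULLET_POINTS,BRAND,BROWSE_NODE_ID\n"
    "Red mug,Ceramic mug,[Holds 300 ml],Acme,10\n"
    "Blue mug\\, large,Ceramic mug,[Holds 500 ml],Acme,10\n"
    "Desk lamp,LED lamp,[USB powered],Lumo,42\n"
)

rows, counts = summarize_train(io.StringIO(SAMPLE))
print(rows, len(counts))  # prints: 3 2
```

On the real data the same function could be given an open file handle for `train.csv` instead of the in-memory sample. The resulting row and class counts can then be compared with the figures listed above.
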
29 | 
30 | ## Team Members
31 | 1. [Akshat Jain](https://github.com/akshatprogrammer)
32 | 2. [Anjali Pathak](https://github.com/anjalipathak13)
33 | 3. [Aman Gupta](https://github.com/Aman-Gupta-Ji)
34 | 4. [Himank Kandpal](https://github.com/Himank0)
35 | 
36 | ## Team Ranking
37 | ![](https://github.com/akshatprogrammer/Amazon-ML-Challenge/blob/main/Leaderboard%20-%20Amazon%20ML%20Challenge%20_%20HackerEarth%20-%20Google%20Chrome%208_2_2021%202_37_28%20PM.png)
38 | Out of 3,290 teams, we secured the 296th position.
39 | 
40 | 
41 | ## Contributors
42 | 
43 | 
44 | 
45 | 
46 | # Connect With Me
47 | LinkedIn : https://www.linkedin.com/in/akshatjaingeu/
48 | Email : akshat.kodia@gmail.com
49 | Twitter : www.twitter.com/akki_aj89
50 | Website : https://akshatprogrammer.github.io/portfolio/
51 | # Personal
52 | Name : Akshat Jain
53 | University : Graphic Era University, Dehradun (UK)
54 | 
55 | If you run into any problem with this program, reach me on Telegram.
56 | Here is the link -> https://t.me/akki_aj89
57 | 
58 | # Gratitude
59 | Thank you! If you like it, please leave a star.
60 | 
61 | 
--------------------------------------------------------------------------------