├── Amazon ML challenge data preprocessing intuition.docx
├── Leaderboard - Amazon ML Challenge _ HackerEarth - Google Chrome 8_2_2021 2_37_28 PM.png
├── Product_Based_Browse_Node.ipynb
├── README.md
└── submission.csv
/Amazon ML challenge data preprocessing intuition.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/akshatprogrammer/Amazon-ML-Challenge/ba4089b58431fb63042e5abd8c28d37dae3d55f1/Amazon ML challenge data preprocessing intuition.docx
--------------------------------------------------------------------------------
/Leaderboard - Amazon ML Challenge _ HackerEarth - Google Chrome 8_2_2021 2_37_28 PM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/akshatprogrammer/Amazon-ML-Challenge/ba4089b58431fb63042e5abd8c28d37dae3d55f1/Leaderboard - Amazon ML Challenge _ HackerEarth - Google Chrome 8_2_2021 2_37_28 PM.png
--------------------------------------------------------------------------------
/Product_Based_Browse_Node.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "engine.ipynb",
7 | "provenance": []
8 | },
9 | "kernelspec": {
10 | "name": "python3",
11 | "display_name": "Python 3"
12 | },
13 | "language_info": {
14 | "name": "python"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "code",
20 | "metadata": {
21 | "colab": {
22 | "base_uri": "https://localhost:8080/"
23 | },
24 | "id": "EPIz_ZOBSMU-",
25 | "outputId": "174f0363-3c16-47ad-8136-2a90c57b6202"
26 | },
27 | "source": [
28 | "from google.colab import drive\n",
29 | "drive.mount('/content/drive')\n",
30 | "import pandas as pd\n",
31 | "import numpy as np\n",
32 | "import csv\n",
33 | "import nltk\n",
34 | "from nltk.corpus import stopwords\n",
35 | "from nltk.stem import WordNetLemmatizer\n",
36 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
37 | "from sklearn.model_selection import train_test_split\n",
38 | "from sklearn.feature_selection import chi2\n",
39 | "import re\n",
40 | "from sklearn.model_selection import train_test_split\n",
41 | "from sklearn.linear_model import LogisticRegression\n",
42 | "from sklearn.naive_bayes import MultinomialNB\n",
43 | "from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix"
44 | ],
45 | "execution_count": 1,
46 | "outputs": [
47 | {
48 | "output_type": "stream",
49 | "text": [
50 | "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
51 | ],
52 | "name": "stdout"
53 | }
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "metadata": {
59 | "colab": {
60 | "base_uri": "https://localhost:8080/",
61 | "height": 287
62 | },
63 | "id": "tTBsJUAVb6Xx",
64 | "outputId": "633d2041-56e9-4f5d-a133-6c7e7722168e"
65 | },
66 | "source": [
67 | "#test file\n",
68 | "path = \"/content/drive/MyDrive/Datasets/test.csv\"\n",
69 | "df_test = pd.read_csv(path, escapechar = \"\\\\\", quoting = csv.QUOTE_NONE)\n",
70 | "df_test.head()"
71 | ],
72 | "execution_count": 2,
73 | "outputs": [
74 | {
75 | "output_type": "execute_result",
76 | "data": {
77 | "text/html": [
78 | "<div>\n",
92 | "<table border=\"1\" class=\"dataframe\">\n",
93 | "  <thead>\n",
94 | "    <tr style=\"text-align: right;\">\n",
95 | "      <th></th>\n",
96 | "      <th>PRODUCT_ID</th>\n",
97 | "      <th>TITLE</th>\n",
98 | "      <th>DESCRIPTION</th>\n",
99 | "      <th>BULLET_POINTS</th>\n",
100 | "      <th>BRAND</th>\n",
101 | "    </tr>\n",
102 | "  </thead>\n",
103 | "  <tbody>\n",
104 | "    <tr>\n",
105 | "      <th>0</th>\n",
106 | "      <td>1</td>\n",
107 | "      <td>Command 3M Small Kitchen Hooks, White, Decorat...</td>\n",
108 | "      <td>Sale Unit: PACK</td>\n",
109 | "      <td>[INCLUDES - 9 hooks and 12 small indoor strips...</td>\n",
110 | "      <td>Command</td>\n",
111 | "    </tr>\n",
112 | "    <tr>\n",
113 | "      <th>1</th>\n",
114 | "      <td>2</td>\n",
115 | "      <td>O'Neal Jump Hardware JAG Unisex-Adult Glove (B...</td>\n",
116 | "      <td>Synthetic leather palm with double-layer thumb...</td>\n",
117 | "      <td>[Silicone printing for a better grip. Long las...</td>\n",
118 | "      <td>O'Neal</td>\n",
119 | "    </tr>\n",
120 | "    <tr>\n",
121 | "      <th>2</th>\n",
122 | "      <td>3</td>\n",
123 | "      <td>NFL Detroit Lions Portable Party Fridge, 15.8 ...</td>\n",
124 | "      <td>Boelter Brands lets you celebrate your favorit...</td>\n",
125 | "      <td>[Runs on 12 Volt DC Power or 110 Volt AC Power...</td>\n",
126 | "      <td>Boelter Brands</td>\n",
127 | "    </tr>\n",
128 | "    <tr>\n",
129 | "      <th>3</th>\n",
130 | "      <td>4</td>\n",
131 | "      <td>Panasonic Single Line KX-TS880MX Corded Phone ...</td>\n",
132 | "      <td>Features: 50 Station Phonebook Corded Phone Al...</td>\n",
133 | "      <td>Panasonic Landline Phones doesn't come with a ...</td>\n",
134 | "      <td>Panasonic</td>\n",
135 | "    </tr>\n",
136 | "    <tr>\n",
137 | "      <th>4</th>\n",
138 | "      <td>5</td>\n",
139 | "      <td>Zero Baby Girl's 100% Cotton Innerwear Bloomer...</td>\n",
140 | "      <td>Zero Baby Girl Panties Set. 100% Cotton, Breat...</td>\n",
141 | "      <td>[Zero Baby Girl Panties, Pack of 6, 100% Cotto...</td>\n",
142 | "      <td>Zero</td>\n",
143 | "    </tr>\n",
144 | "  </tbody>\n",
145 | "</table>\n",
146 | "</div>"
147 | ],
148 | "text/plain": [
149 | " PRODUCT_ID ... BRAND\n",
150 | "0 1 ... Command\n",
151 | "1 2 ... O'Neal\n",
152 | "2 3 ... Boelter Brands\n",
153 | "3 4 ... Panasonic\n",
154 | "4 5 ... Zero\n",
155 | "\n",
156 | "[5 rows x 5 columns]"
157 | ]
158 | },
159 | "metadata": {
160 | "tags": []
161 | },
162 | "execution_count": 2
163 | }
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "metadata": {
169 | "colab": {
170 | "base_uri": "https://localhost:8080/",
171 | "height": 287
172 | },
173 | "id": "q4uQ8Nfrgd9I",
174 | "outputId": "ff09accf-8e03-4a67-b203-9fdd9ff4f66a"
175 | },
176 | "source": [
177 | "#train file\n",
178 | "path = \"/content/drive/MyDrive/Datasets/train.csv\"\n",
179 | "df_train = pd.read_csv(path, escapechar = \"\\\\\", quoting = csv.QUOTE_NONE)\n",
180 | "df_train = df_train.dropna()\n",
181 | "df_train.head()"
182 | ],
183 | "execution_count": 3,
184 | "outputs": [
185 | {
186 | "output_type": "execute_result",
187 | "data": {
188 | "text/html": [
189 | "<div>\n",
203 | "<table border=\"1\" class=\"dataframe\">\n",
204 | "  <thead>\n",
205 | "    <tr style=\"text-align: right;\">\n",
206 | "      <th></th>\n",
207 | "      <th>TITLE</th>\n",
208 | "      <th>DESCRIPTION</th>\n",
209 | "      <th>BULLET_POINTS</th>\n",
210 | "      <th>BRAND</th>\n",
211 | "      <th>BROWSE_NODE_ID</th>\n",
212 | "    </tr>\n",
213 | "  </thead>\n",
214 | "  <tbody>\n",
215 | "    <tr>\n",
216 | "      <th>0</th>\n",
217 | "      <td>Pete The Cat Bedtime Blues Doll, 14.5 Inch</td>\n",
218 | "      <td>Pete the Cat is the coolest, most popular cat ...</td>\n",
219 | "      <td>[Pete the Cat Bedtime Blues plush doll,Based o...</td>\n",
220 | "      <td>MerryMakers</td>\n",
221 | "      <td>0</td>\n",
222 | "    </tr>\n",
223 | "    <tr>\n",
224 | "      <th>1</th>\n",
225 | "      <td>The New Yorker NYHM014 Refrigerator Magnet, 2 ...</td>\n",
226 | "      <td>The New Yorker Handsome Cello Wrapped Hard Mag...</td>\n",
227 | "      <td>[Cat In A Tea Cup by New Yorker cover artist G...</td>\n",
228 | "      <td>The New Yorker</td>\n",
229 | "      <td>1</td>\n",
230 | "    </tr>\n",
231 | "    <tr>\n",
232 | "      <th>5</th>\n",
233 | "      <td>Men'S Full Sleeve Raglan T-Shirts Denim T-Shir...</td>\n",
234 | "      <td>Men'S Full Sleeve Raglan T-Shirts Denim T-Shir...</td>\n",
235 | "      <td>[Color: Blue,Sleeve: Full Sleeve,Material: Cot...</td>\n",
236 | "      <td>Bhavya Enterprise</td>\n",
237 | "      <td>5</td>\n",
238 | "    </tr>\n",
239 | "    <tr>\n",
240 | "      <th>6</th>\n",
241 | "      <td>Glance Women's Wallet (Black) (LW-21)</td>\n",
242 | "      <td>This Black wallet by Glance will be a treasure...</td>\n",
243 | "      <td>[The Most Comfortable Women's Wallet That You ...</td>\n",
244 | "      <td>Glance</td>\n",
245 | "      <td>6</td>\n",
246 | "    </tr>\n",
247 | "    <tr>\n",
248 | "      <th>7</th>\n",
249 | "      <td>Wild Animals Hungry Brain Educational Flash Ca...</td>\n",
250 | "      <td>Wild Animals are the animals that mostly stays...</td>\n",
251 | "      <td>[Playful learning: Flash cards develops the lo...</td>\n",
252 | "      <td>hungry brain</td>\n",
253 | "      <td>7</td>\n",
254 | "    </tr>\n",
255 | "  </tbody>\n",
256 | "</table>\n",
257 | "</div>"
258 | ],
259 | "text/plain": [
260 | " TITLE ... BROWSE_NODE_ID\n",
261 | "0 Pete The Cat Bedtime Blues Doll, 14.5 Inch ... 0\n",
262 | "1 The New Yorker NYHM014 Refrigerator Magnet, 2 ... ... 1\n",
263 | "5 Men'S Full Sleeve Raglan T-Shirts Denim T-Shir... ... 5\n",
264 | "6 Glance Women's Wallet (Black) (LW-21) ... 6\n",
265 | "7 Wild Animals Hungry Brain Educational Flash Ca... ... 7\n",
266 | "\n",
267 | "[5 rows x 5 columns]"
268 | ]
269 | },
270 | "metadata": {
271 | "tags": []
272 | },
273 | "execution_count": 3
274 | }
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "metadata": {
280 | "colab": {
281 | "base_uri": "https://localhost:8080/"
282 | },
283 | "id": "DfrjJPq8T8nZ",
284 | "outputId": "c6ae6051-73c4-4506-a93c-6dc5da96b52c"
285 | },
286 | "source": [
287 | "punctuation_signs = list(\"?:!.,;\")\n",
288 | "nltk.download('punkt')\n",
289 | "nltk.download('wordnet')\n",
290 | "wordnet_lemmatizer = WordNetLemmatizer()\n",
291 | "nltk.download('stopwords')\n",
292 | "stop_words = list(stopwords.words('english'))"
293 | ],
294 | "execution_count": 4,
295 | "outputs": [
296 | {
297 | "output_type": "stream",
298 | "text": [
299 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
300 | "[nltk_data] Package punkt is already up-to-date!\n",
301 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
302 | "[nltk_data] Package wordnet is already up-to-date!\n",
303 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
304 | "[nltk_data] Package stopwords is already up-to-date!\n"
305 | ],
306 | "name": "stdout"
307 | }
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "metadata": {
313 | "id": "hw0_D27kT9AU"
314 | },
315 | "source": [
316 | "df_train['Title'] = df_train['TITLE'].str.replace(\"\\r\", \" \", regex=False)\n",
317 | "df_train['Title'] = df_train['Title'].str.replace(\"\\n\", \" \", regex=False)\n",
318 | "df_train['Title'] = df_train['Title'].str.replace(r\"\\s+\", \" \", regex=True)  # collapse whitespace runs\n",
319 | "df_train['Title'] = df_train['Title'].str.replace('\"', '', regex=False)\n",
320 | "df_train['Title'] = df_train['Title'].str.lower()\n",
321 | "for punct_sign in punctuation_signs:\n",
322 | "    df_train['Title'] = df_train['Title'].str.replace(punct_sign, '', regex=False)  # literal match, not a regex\n",
323 | "df_train['Title'] = df_train['Title'].str.replace(\"'s\", \"\", regex=False)"
324 | ],
325 | "execution_count": 5,
326 | "outputs": []
327 | },
328 | {
329 | "cell_type": "code",
330 | "metadata": {
331 | "id": "-OU_-P-0UB_V"
332 | },
333 | "source": [
334 | "final_cols = [\"Title\", \"BROWSE_NODE_ID\"]\n",
335 | "df_train = df_train[final_cols]\n",
336 | "df_train = df_train.iloc[:35000, :]"
337 | ],
338 | "execution_count": 6,
339 | "outputs": []
340 | },
341 | {
342 | "cell_type": "code",
343 | "metadata": {
344 | "id": "_EJDqp4ZULE_"
345 | },
346 | "source": [
347 | "df_test['Title'] = df_test['TITLE'].str.replace(\"\\r\", \" \", regex=False)\n",
348 | "df_test['Title'] = df_test['Title'].str.replace(\"\\n\", \" \", regex=False)\n",
349 | "df_test['Title'] = df_test['Title'].str.replace(r\"\\s+\", \" \", regex=True)  # collapse whitespace runs\n",
350 | "df_test['Title'] = df_test['Title'].str.replace('\"', '', regex=False)\n",
351 | "df_test['Title'] = df_test['Title'].str.lower()\n",
352 | "for punct_sign in punctuation_signs:\n",
353 | "    df_test['Title'] = df_test['Title'].str.replace(punct_sign, '', regex=False)  # literal match, not a regex\n",
354 | "df_test['Title'] = df_test['Title'].str.replace(\"'s\", \"\", regex=False)"
355 | ],
356 | "execution_count": 7,
357 | "outputs": []
358 | },
359 | {
360 | "cell_type": "code",
361 | "metadata": {
362 | "id": "iO2eSD_06gwR"
363 | },
364 | "source": [
365 | "final_cols = [\"Title\", \"PRODUCT_ID\"]\n",
366 | "df_test = df_test[final_cols]"
367 | ],
368 | "execution_count": 8,
369 | "outputs": []
370 | },
371 | {
372 | "cell_type": "code",
373 | "metadata": {
374 | "colab": {
375 | "base_uri": "https://localhost:8080/",
376 | "height": 203
377 | },
378 | "id": "HcXWz9VgUgoR",
379 | "outputId": "fe862498-dfcb-4022-b59a-29f12238f83b"
380 | },
381 | "source": [
382 | "df_train.head()"
383 | ],
384 | "execution_count": 9,
385 | "outputs": [
386 | {
387 | "output_type": "execute_result",
388 | "data": {
442 | "text/plain": [
443 | " Title BROWSE_NODE_ID\n",
444 | "0 pete the cat bedtime blues doll 145 inch 0\n",
445 | "1 the new yorker nyhm014 refrigerator magnet 2 x 35 1\n",
446 | "5 men full sleeve raglan t-shirts denim t-shirt ... 5\n",
447 | "6 glance women wallet (black) (lw-21) 6\n",
448 | "7 wild animals hungry brain educational flash ca... 7"
449 | ]
450 | },
451 | "metadata": {
452 | "tags": []
453 | },
454 | "execution_count": 9
455 | }
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "metadata": {
461 | "colab": {
462 | "base_uri": "https://localhost:8080/",
463 | "height": 203
464 | },
465 | "id": "0eAngSitUlP2",
466 | "outputId": "173baf1b-3f82-47db-de7f-5d76a6534015"
467 | },
468 | "source": [
469 | "df_test.head()"
470 | ],
471 | "execution_count": 10,
472 | "outputs": [
473 | {
474 | "output_type": "execute_result",
475 | "data": {
529 | "text/plain": [
530 | " Title PRODUCT_ID\n",
531 | "0 command 3m small kitchen hooks white decorate ... 1\n",
532 | "1 o'neal jump hardware jag unisex-adult glove (b... 2\n",
533 | "2 nfl detroit lions portable party fridge 158 quart 3\n",
534 | "3 panasonic single line kx-ts880mx corded phone ... 4\n",
535 | "4 zero baby girl 100% cotton innerwear bloomer d... 5"
536 | ]
537 | },
538 | "metadata": {
539 | "tags": []
540 | },
541 | "execution_count": 10
542 | }
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "metadata": {
548 | "colab": {
549 | "base_uri": "https://localhost:8080/"
550 | },
551 | "id": "gPq6ragPUmtD",
552 | "outputId": "d0da8e4d-dc01-43c7-824a-2f605dbf3697"
553 | },
554 | "source": [
555 | "df_train.isna().sum()"
556 | ],
557 | "execution_count": 11,
558 | "outputs": [
559 | {
560 | "output_type": "execute_result",
561 | "data": {
562 | "text/plain": [
563 | "Title 0\n",
564 | "BROWSE_NODE_ID 0\n",
565 | "dtype: int64"
566 | ]
567 | },
568 | "metadata": {
569 | "tags": []
570 | },
571 | "execution_count": 11
572 | }
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "metadata": {
578 | "id": "A0y2bReP75ic"
579 | },
580 | "source": [
581 | "df_test[\"Title\"] = df_test[\"Title\"].fillna(\"No Data\")"
582 | ],
583 | "execution_count": 12,
584 | "outputs": []
585 | },
586 | {
587 | "cell_type": "code",
588 | "metadata": {
589 | "colab": {
590 | "base_uri": "https://localhost:8080/"
591 | },
592 | "id": "91d96DmBaKp6",
593 | "outputId": "8e61dd4e-8c44-4b1e-8b7b-e7205e223dd3"
594 | },
595 | "source": [
596 | "df_test.isna().sum()"
597 | ],
598 | "execution_count": 13,
599 | "outputs": [
600 | {
601 | "output_type": "execute_result",
602 | "data": {
603 | "text/plain": [
604 | "Title 0\n",
605 | "PRODUCT_ID 0\n",
606 | "dtype: int64"
607 | ]
608 | },
609 | "metadata": {
610 | "tags": []
611 | },
612 | "execution_count": 13
613 | }
614 | ]
615 | },
616 | {
617 | "cell_type": "code",
618 | "metadata": {
619 | "id": "s4DVt4gsXQaX"
620 | },
621 | "source": [
622 | "X_train, X_test, y_train = df_train[\"Title\"], df_test[\"Title\"], df_train[\"BROWSE_NODE_ID\"]"
623 | ],
624 | "execution_count": 14,
625 | "outputs": []
626 | },
627 | {
628 | "cell_type": "code",
629 | "metadata": {
630 | "colab": {
631 | "base_uri": "https://localhost:8080/"
632 | },
633 | "id": "J5X_c72rUs1l",
634 | "outputId": "abd8aaef-f245-4e77-ff72-f9b2bb066067"
635 | },
636 | "source": [
637 | "tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=5)\n",
638 | "print('1')\n",
639 | "X_train_vectors_tfidf = tfidf.fit_transform(X_train)\n",
640 | "print(X_train_vectors_tfidf.shape)\n",
641 | "print('1')\n",
642 | "X_test_vectors_tfidf = tfidf.transform(X_test)\n",
643 | "print(X_test_vectors_tfidf.shape)"
644 | ],
645 | "execution_count": 15,
646 | "outputs": [
647 | {
648 | "output_type": "stream",
649 | "text": [
650 | "1\n",
651 | "(35000, 16942)\n",
652 | "1\n",
653 | "(110775, 16942)\n"
654 | ],
655 | "name": "stdout"
656 | }
657 | ]
658 | },
659 | {
660 | "cell_type": "code",
661 | "metadata": {
662 | "colab": {
663 | "base_uri": "https://localhost:8080/"
664 | },
665 | "id": "hO0q3lXFU2ZN",
666 | "outputId": "607a0b95-93aa-4076-b912-8dd7234b1b12"
667 | },
668 | "source": [
669 | "lr_tfidf=LogisticRegression(solver = 'liblinear', C=10, penalty = 'l2')\n",
670 | "lr_tfidf.fit(X_train_vectors_tfidf, y_train)"
671 | ],
672 | "execution_count": 16,
673 | "outputs": [
674 | {
675 | "output_type": "execute_result",
676 | "data": {
677 | "text/plain": [
678 | "LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,\n",
679 | " intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
680 | " multi_class='auto', n_jobs=None, penalty='l2',\n",
681 | " random_state=None, solver='liblinear', tol=0.0001, verbose=0,\n",
682 | " warm_start=False)"
683 | ]
684 | },
685 | "metadata": {
686 | "tags": []
687 | },
688 | "execution_count": 16
689 | }
690 | ]
691 | },
692 | {
693 | "cell_type": "code",
694 | "metadata": {
695 | "id": "gpEPd5PxWude"
696 | },
697 | "source": [
698 | "y_predict = lr_tfidf.predict(X_test_vectors_tfidf)"
699 | ],
700 | "execution_count": 17,
701 | "outputs": []
702 | },
703 | {
704 | "cell_type": "code",
705 | "metadata": {
706 | "colab": {
707 | "base_uri": "https://localhost:8080/"
708 | },
709 | "id": "HSTJUhZtfLX4",
710 | "outputId": "69be8688-c9fc-4cfe-9f72-d5d137ee66fc"
711 | },
712 | "source": [
713 | "y_predict"
714 | ],
715 | "execution_count": 18,
716 | "outputs": [
717 | {
718 | "output_type": "execute_result",
719 | "data": {
720 | "text/plain": [
721 | "array([1140, 1045, 8269, ..., 800, 800, 800])"
722 | ]
723 | },
724 | "metadata": {
725 | "tags": []
726 | },
727 | "execution_count": 18
728 | }
729 | ]
730 | },
731 | {
732 | "cell_type": "code",
733 | "metadata": {
734 | "colab": {
735 | "base_uri": "https://localhost:8080/"
736 | },
737 | "id": "64QUo0xcfTJI",
738 | "outputId": "50e42da2-b50d-4f24-b23d-f1b19deb1f1b"
739 | },
740 | "source": [
741 | "len(y_predict)"
742 | ],
743 | "execution_count": 19,
744 | "outputs": [
745 | {
746 | "output_type": "execute_result",
747 | "data": {
748 | "text/plain": [
749 | "110775"
750 | ]
751 | },
752 | "metadata": {
753 | "tags": []
754 | },
755 | "execution_count": 19
756 | }
757 | ]
758 | },
759 | {
760 | "cell_type": "code",
761 | "metadata": {
762 | "id": "-cEEL7QfgBzn"
763 | },
764 | "source": [
765 | "d = {\"BROWSE_NODE_ID\" : y_predict}"
766 | ],
767 | "execution_count": 20,
768 | "outputs": []
769 | },
770 | {
771 | "cell_type": "code",
772 | "metadata": {
773 | "colab": {
774 | "base_uri": "https://localhost:8080/",
775 | "height": 417
776 | },
777 | "id": "8ymabN1ugEVa",
778 | "outputId": "58f65c77-f6d3-4d3c-cb60-eab928589a8f"
779 | },
780 | "source": [
781 | "df1 = pd.DataFrame(data=d)\n",
782 | "df1"
783 | ],
784 | "execution_count": 21,
785 | "outputs": [
786 | {
787 | "output_type": "execute_result",
788 | "data": {
861 | "text/plain": [
862 | " BROWSE_NODE_ID\n",
863 | "0 1140\n",
864 | "1 1045\n",
865 | "2 8269\n",
866 | "3 125\n",
867 | "4 1922\n",
868 | "... ...\n",
869 | "110770 4368\n",
870 | "110771 1551\n",
871 | "110772 800\n",
872 | "110773 800\n",
873 | "110774 800\n",
874 | "\n",
875 | "[110775 rows x 1 columns]"
876 | ]
877 | },
878 | "metadata": {
879 | "tags": []
880 | },
881 | "execution_count": 21
882 | }
883 | ]
884 | },
885 | {
886 | "cell_type": "code",
887 | "metadata": {
888 | "colab": {
889 | "base_uri": "https://localhost:8080/",
890 | "height": 203
891 | },
892 | "id": "ZpXMka2cgqi9",
893 | "outputId": "585cdf6a-1719-477b-b764-1a9f57fbdc12"
894 | },
895 | "source": [
896 | "df_new = pd.concat([df_test, df1], axis = 1)\n",
897 | "df_new.head()"
898 | ],
899 | "execution_count": 22,
900 | "outputs": [
901 | {
902 | "output_type": "execute_result",
903 | "data": {
963 | "text/plain": [
964 | " Title ... BROWSE_NODE_ID\n",
965 | "0 command 3m small kitchen hooks white decorate ... ... 1140\n",
966 | "1 o'neal jump hardware jag unisex-adult glove (b... ... 1045\n",
967 | "2 nfl detroit lions portable party fridge 158 quart ... 8269\n",
968 | "3 panasonic single line kx-ts880mx corded phone ... ... 125\n",
969 | "4 zero baby girl 100% cotton innerwear bloomer d... ... 1922\n",
970 | "\n",
971 | "[5 rows x 3 columns]"
972 | ]
973 | },
974 | "metadata": {
975 | "tags": []
976 | },
977 | "execution_count": 22
978 | }
979 | ]
980 | },
981 | {
982 | "cell_type": "code",
983 | "metadata": {
984 | "colab": {
985 | "base_uri": "https://localhost:8080/",
986 | "height": 417
987 | },
988 | "id": "0UiIR1C3g0aL",
989 | "outputId": "a869900d-cd40-4364-a8b6-abd3ad6dd8af"
990 | },
991 | "source": [
992 | "l = [\"PRODUCT_ID\", \"BROWSE_NODE_ID\"]\n",
993 | "df_new = df_new[l]\n",
994 | "df_new"
995 | ],
996 | "execution_count": 23,
997 | "outputs": [
998 | {
999 | "output_type": "execute_result",
1000 | "data": {
1085 | "text/plain": [
1086 | " PRODUCT_ID BROWSE_NODE_ID\n",
1087 | "0 1 1140\n",
1088 | "1 2 1045\n",
1089 | "2 3 8269\n",
1090 | "3 4 125\n",
1091 | "4 5 1922\n",
1092 | "... ... ...\n",
1093 | "110770 110771 4368\n",
1094 | "110771 110772 1551\n",
1095 | "110772 110773 800\n",
1096 | "110773 110774 800\n",
1097 | "110774 110775 800\n",
1098 | "\n",
1099 | "[110775 rows x 2 columns]"
1100 | ]
1101 | },
1102 | "metadata": {
1103 | "tags": []
1104 | },
1105 | "execution_count": 23
1106 | }
1107 | ]
1108 | },
1109 | {
1110 | "cell_type": "code",
1111 | "metadata": {
1112 | "id": "JTyuVvuPhgSE"
1113 | },
1114 | "source": [
1115 | "df_new.to_csv(\"/content/drive/MyDrive/Datasets/submission.csv\", index = False, header = True)"
1116 | ],
1117 | "execution_count": 24,
1118 | "outputs": []
1119 | },
1120 | {
1121 | "cell_type": "code",
1122 | "metadata": {
1123 | "colab": {
1124 | "base_uri": "https://localhost:8080/",
1125 | "height": 203
1126 | },
1127 | "id": "u8MNaBoTiuYD",
1128 | "outputId": "82ac3f23-96eb-4876-eda4-e51904beb539"
1129 | },
1130 | "source": [
1131 | "path = \"/content/drive/MyDrive/Datasets/submission.csv\"\n",
1132 | "df_sub = pd.read_csv(path)\n",
1133 | "df_sub.head()"
1134 | ],
1135 | "execution_count": 25,
1136 | "outputs": [
1137 | {
1138 | "output_type": "execute_result",
1139 | "data": {
1193 | "text/plain": [
1194 | " PRODUCT_ID BROWSE_NODE_ID\n",
1195 | "0 1 1140\n",
1196 | "1 2 1045\n",
1197 | "2 3 8269\n",
1198 | "3 4 125\n",
1199 | "4 5 1922"
1200 | ]
1201 | },
1202 | "metadata": {
1203 | "tags": []
1204 | },
1205 | "execution_count": 25
1206 | }
1207 | ]
1208 | },
1209 | {
1210 | "cell_type": "code",
1211 | "metadata": {
1212 | "colab": {
1213 | "base_uri": "https://localhost:8080/"
1214 | },
1215 | "id": "ALnvwky6rAA0",
1216 | "outputId": "4d501fed-b5f8-44ab-9b52-c9d38c1a4911"
1217 | },
1218 | "source": [
1219 | "df_sub.isna().sum()"
1220 | ],
1221 | "execution_count": 26,
1222 | "outputs": [
1223 | {
1224 | "output_type": "execute_result",
1225 | "data": {
1226 | "text/plain": [
1227 | "PRODUCT_ID 0\n",
1228 | "BROWSE_NODE_ID 0\n",
1229 | "dtype: int64"
1230 | ]
1231 | },
1232 | "metadata": {
1233 | "tags": []
1234 | },
1235 | "execution_count": 26
1236 | }
1237 | ]
1238 | },
1239 | {
1240 | "cell_type": "code",
1241 | "metadata": {
1242 | "colab": {
1243 | "base_uri": "https://localhost:8080/",
1244 | "height": 295
1245 | },
1246 | "id": "m5FYWOJ3rD1x",
1247 | "outputId": "84d5ac03-9cf2-41e7-a97b-cf1c550521c3"
1248 | },
1249 | "source": [
1250 | "df_sub.describe()"
1251 | ],
1252 | "execution_count": 27,
1253 | "outputs": [
1254 | {
1255 | "output_type": "execute_result",
1256 | "data": {
1325 | "text/plain": [
1326 | " PRODUCT_ID BROWSE_NODE_ID\n",
1327 | "count 110775.000000 110775.000000\n",
1328 | "mean 55388.000000 1814.854317\n",
1329 | "std 31978.132372 2984.057629\n",
1330 | "min 1.000000 0.000000\n",
1331 | "25% 27694.500000 507.000000\n",
1332 | "50% 55388.000000 1045.000000\n",
1333 | "75% 83081.500000 1687.000000\n",
1334 | "max 110775.000000 47970.000000"
1335 | ]
1336 | },
1337 | "metadata": {
1338 | "tags": []
1339 | },
1340 | "execution_count": 27
1341 | }
1342 | ]
1343 | },
1344 | {
1345 | "cell_type": "code",
1346 | "metadata": {
1347 | "id": "Tx07ygYxrGlu"
1348 | },
1349 | "source": [
1350 | ""
1351 | ],
1352 | "execution_count": null,
1353 | "outputs": []
1354 | }
1355 | ]
1356 | }
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Amazon-ML-Challenge
2 | Product Browse Node Classification
3 |
4 | ## About Amazon ML Challenge
5 | 
6 | Amazon ML Challenge is a two-stage competition where students from all engineering campuses across India will get a unique opportunity to work on Amazon’s dataset to bring in fresh ideas and build innovative solutions for a real-world problem statement. The top three winning teams will receive pre-placement interviews (PPIs) for ML roles at Amazon along with cash prizes and certificates.
7 |
8 | ## ML Challenge Stages
9 | 
10 |
11 | ## Dataset Link🔗
12 | [Click Here for the Dataset](https://he-s3.s3.ap-southeast-1.amazonaws.com/media/hackathon/amazon-ml-challenge/product-browse-node-classification-2-7ff04e5a/546b594ee0a211eb.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=16c8b3b0a2fcd83db7ed5437de70012b5b4651ce5f543c1fe62512904d5f0ce7&X-Amz-Date=20210802T090306Z&X-Amz-Credential=AKIA6I2ISGOYH7WWS3G5%2F20210802%2Fap-southeast-1%2Fs3%2Faws4_request)
13 |
14 | ## Problem
15 | The Amazon catalog consists of billions of products that belong to thousands of browse nodes (each browse node represents a collection of items for sale). Browse nodes help customers navigate the website and classify products into product type groups. Hence, it is important to predict the node assignment when a product is listed or when the browse node information is absent.
16 |
17 | As part of this hackathon, you will use product metadata to classify products into browse nodes. You will have access to the product title, description, bullet points, etc., and labels for ~3 million products to train and test your submissions. Note that there is some noise in the data; real-world data looks like this!
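
Because the data is noisy, titles benefit from light normalization before they are vectorized. The sketch below mirrors the cleaning steps in `Product_Based_Browse_Node.ipynb` (lowercasing, dropping double quotes, the punctuation signs `?:!.,;`, and the possessive `'s`) using only the Python standard library. The function name `clean_title` and the final whitespace-collapsing step are assumptions added for illustration and are not taken from the notebook.

```python
# Punctuation signs removed by the notebook's cleaning cells.
PUNCTUATION_SIGNS = "?:!.,;"


def clean_title(title: str) -> str:
    """Normalize one product title (hypothetical helper, not in the notebook)."""
    # Replace carriage returns and newlines with spaces.
    text = title.replace("\r", " ").replace("\n", " ")
    # Drop double quotes and lowercase everything.
    text = text.replace('"', "").lower()
    # Remove each punctuation sign as a literal character.
    for sign in PUNCTUATION_SIGNS:
        text = text.replace(sign, "")
    # Drop possessive suffixes, e.g. "women's" becomes "women".
    text = text.replace("'s", "")
    # Assumption: collapse any run of whitespace into a single space.
    return " ".join(text.split())


print(clean_title("Pete The Cat Bedtime Blues Doll, 14.5 Inch"))
print(clean_title("Glance Women's Wallet (Black) (LW-21)"))
```

Note that removing `.` also fuses decimals (`14.5` becomes `145`), which matches the cleaned titles shown in the notebook output; whether that helps or hurts the classifier is not evaluated here.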
18 |
19 | ## Data Description
20 |
21 | Full Train/Test dataset details:
22 |
23 | - Key column – PRODUCT_ID
24 | - Input features – TITLE, DESCRIPTION, BULLET_POINTS, BRAND
25 | - Target column – BROWSE_NODE_ID
26 | - Train dataset size – 2,903,024
27 | - Number of classes in Train – 9,919
28 | - Overall Test dataset size – 110,775
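
The notebook reads these files with quoting disabled and a backslash as the escape character (`quoting=csv.QUOTE_NONE`, `escapechar="\\"`). A minimal sketch of what those two settings do, using the standard-library `csv` module on a made-up one-row sample in the test-file layout; the sample row and the helper name `read_rows` are hypothetical, and the real files may differ in detail.

```python
import csv
import io

# Made-up sample: the comma inside the title is escaped with a backslash
# instead of the field being wrapped in quotes.
SAMPLE = (
    "PRODUCT_ID,TITLE,DESCRIPTION,BULLET_POINTS,BRAND\n"
    "1,Small Kitchen Hooks\\, White,Sale Unit: PACK,[9 hooks],Command\n"
)


def read_rows(handle):
    """Parse rows with quoting disabled and backslash as the escape character."""
    reader = csv.DictReader(handle, escapechar="\\", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_rows(io.StringIO(SAMPLE))
print(rows[0]["TITLE"])
```

With these settings the escaped comma stays inside the `TITLE` field rather than starting a new column, which is why the same options are passed to `pd.read_csv` in the notebook.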
29 |
30 | ## Team Members
31 | 1. [Akshat Jain](https://github.com/akshatprogrammer)
32 | 2. [Anjali Pathak](https://github.com/anjalipathak13)
33 | 3. [Aman Gupta](https://github.com/Aman-Gupta-Ji)
34 | 4. [Himank Kandpal](https://github.com/Himank0)
35 |
36 | ## Team Ranking
37 | 
38 | Out of 3,290 teams, we secured 296th position.
39 |
40 |
41 | ## Contributors
42 |
43 |
44 |
45 |
46 | # Connect With Me
47 | - LinkedIn : https://www.linkedin.com/in/akshatjaingeu/
48 | - Email : akshat.kodia@gmail.com
49 | - Twitter : www.twitter.com/akki_aj89
50 | - Website : https://akshatprogrammer.github.io/portfolio/
51 | # Personal
52 | - Name : Akshat Jain
53 | - University : Graphic Era University, Dehradun (UK)
54 |
55 | If you run into any problem with this program, reach me on Telegram:
56 | https://t.me/akki_aj89
57 |
58 | # Gratitude
59 | Thank you! If you like it, please leave a star.
60 |
61 |
--------------------------------------------------------------------------------