├── .gitignore
├── Deep Learning
    ├── Deep Learning Concepts - Hackers Realm.ipynb
    ├── audio data
    │   ├── OAF_back_fear.wav
    │   ├── OAF_back_happy.wav
    │   ├── OAF_back_ps.wav
    │   └── OAF_back_sad.wav
    └── image data
    │   ├── 1.jpg
    │   ├── 2.jpg
    │   ├── 3.jpg
    │   └── 4.jpg
├── Machine Learning
    ├── Machine Learning Concepts - Hackers Realm.ipynb
    └── data
    │   ├── 1000000 Sales Records.rar
    │   ├── Loan Prediction Dataset.csv
    │   ├── Traffic data.csv
    │   ├── bike sharing dataset.csv
    │   ├── creditcard.rar
    │   └── winequality.csv
├── NLP
    ├── Natural Language Processing(NLP) Concepts - Hackers Realm.ipynb
    └── data
    │   └── Twitter Sentiments.csv
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | Machine Learning/data/1000000 Sales Records.csv
3 | 


--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_fear.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_fear.wav


--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_happy.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_happy.wav


--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_ps.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_ps.wav


--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_sad.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_sad.wav


--------------------------------------------------------------------------------
/Deep Learning/image data/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/1.jpg


--------------------------------------------------------------------------------
/Deep Learning/image data/2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/2.jpg


--------------------------------------------------------------------------------
/Deep Learning/image data/3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/3.jpg


--------------------------------------------------------------------------------
/Deep Learning/image data/4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/4.jpg


--------------------------------------------------------------------------------
/Machine Learning/data/1000000 Sales Records.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Machine Learning/data/1000000 Sales Records.rar


--------------------------------------------------------------------------------
/Machine Learning/data/Loan Prediction Dataset.csv:
--------------------------------------------------------------------------------
  1 | Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
  2 | LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
  3 | LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
  4 | LP001005,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
  5 | LP001006,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y
  6 | LP001008,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,Y
  7 | LP001011,Male,Yes,2,Graduate,Yes,5417,4196,267,360,1,Urban,Y
  8 | LP001013,Male,Yes,0,Not Graduate,No,2333,1516,95,360,1,Urban,Y
  9 | LP001014,Male,Yes,3+,Graduate,No,3036,2504,158,360,0,Semiurban,N
 10 | LP001018,Male,Yes,2,Graduate,No,4006,1526,168,360,1,Urban,Y
 11 | LP001020,Male,Yes,1,Graduate,No,12841,10968,349,360,1,Semiurban,N
 12 | LP001024,Male,Yes,2,Graduate,No,3200,700,70,360,1,Urban,Y
 13 | LP001027,Male,Yes,2,Graduate,,2500,1840,109,360,1,Urban,Y
 14 | LP001028,Male,Yes,2,Graduate,No,3073,8106,200,360,1,Urban,Y
 15 | LP001029,Male,No,0,Graduate,No,1853,2840,114,360,1,Rural,N
 16 | LP001030,Male,Yes,2,Graduate,No,1299,1086,17,120,1,Urban,Y
 17 | LP001032,Male,No,0,Graduate,No,4950,0,125,360,1,Urban,Y
 18 | LP001034,Male,No,1,Not Graduate,No,3596,0,100,240,,Urban,Y
 19 | LP001036,Female,No,0,Graduate,No,3510,0,76,360,0,Urban,N
 20 | LP001038,Male,Yes,0,Not Graduate,No,4887,0,133,360,1,Rural,N
 21 | LP001041,Male,Yes,0,Graduate,,2600,3500,115,,1,Urban,Y
 22 | LP001043,Male,Yes,0,Not Graduate,No,7660,0,104,360,0,Urban,N
 23 | LP001046,Male,Yes,1,Graduate,No,5955,5625,315,360,1,Urban,Y
 24 | LP001047,Male,Yes,0,Not Graduate,No,2600,1911,116,360,0,Semiurban,N
 25 | LP001050,,Yes,2,Not Graduate,No,3365,1917,112,360,0,Rural,N
 26 | LP001052,Male,Yes,1,Graduate,,3717,2925,151,360,,Semiurban,N
 27 | LP001066,Male,Yes,0,Graduate,Yes,9560,0,191,360,1,Semiurban,Y
 28 | LP001068,Male,Yes,0,Graduate,No,2799,2253,122,360,1,Semiurban,Y
 29 | LP001073,Male,Yes,2,Not Graduate,No,4226,1040,110,360,1,Urban,Y
 30 | LP001086,Male,No,0,Not Graduate,No,1442,0,35,360,1,Urban,N
 31 | LP001087,Female,No,2,Graduate,,3750,2083,120,360,1,Semiurban,Y
 32 | LP001091,Male,Yes,1,Graduate,,4166,3369,201,360,,Urban,N
 33 | LP001095,Male,No,0,Graduate,No,3167,0,74,360,1,Urban,N
 34 | LP001097,Male,No,1,Graduate,Yes,4692,0,106,360,1,Rural,N
 35 | LP001098,Male,Yes,0,Graduate,No,3500,1667,114,360,1,Semiurban,Y
 36 | LP001100,Male,No,3+,Graduate,No,12500,3000,320,360,1,Rural,N
 37 | LP001106,Male,Yes,0,Graduate,No,2275,2067,,360,1,Urban,Y
 38 | LP001109,Male,Yes,0,Graduate,No,1828,1330,100,,0,Urban,N
 39 | LP001112,Female,Yes,0,Graduate,No,3667,1459,144,360,1,Semiurban,Y
 40 | LP001114,Male,No,0,Graduate,No,4166,7210,184,360,1,Urban,Y
 41 | LP001116,Male,No,0,Not Graduate,No,3748,1668,110,360,1,Semiurban,Y
 42 | LP001119,Male,No,0,Graduate,No,3600,0,80,360,1,Urban,N
 43 | LP001120,Male,No,0,Graduate,No,1800,1213,47,360,1,Urban,Y
 44 | LP001123,Male,Yes,0,Graduate,No,2400,0,75,360,,Urban,Y
 45 | LP001131,Male,Yes,0,Graduate,No,3941,2336,134,360,1,Semiurban,Y
 46 | LP001136,Male,Yes,0,Not Graduate,Yes,4695,0,96,,1,Urban,Y
 47 | LP001137,Female,No,0,Graduate,No,3410,0,88,,1,Urban,Y
 48 | LP001138,Male,Yes,1,Graduate,No,5649,0,44,360,1,Urban,Y
 49 | LP001144,Male,Yes,0,Graduate,No,5821,0,144,360,1,Urban,Y
 50 | LP001146,Female,Yes,0,Graduate,No,2645,3440,120,360,0,Urban,N
 51 | LP001151,Female,No,0,Graduate,No,4000,2275,144,360,1,Semiurban,Y
 52 | LP001155,Female,Yes,0,Not Graduate,No,1928,1644,100,360,1,Semiurban,Y
 53 | LP001157,Female,No,0,Graduate,No,3086,0,120,360,1,Semiurban,Y
 54 | LP001164,Female,No,0,Graduate,No,4230,0,112,360,1,Semiurban,N
 55 | LP001179,Male,Yes,2,Graduate,No,4616,0,134,360,1,Urban,N
 56 | LP001186,Female,Yes,1,Graduate,Yes,11500,0,286,360,0,Urban,N
 57 | LP001194,Male,Yes,2,Graduate,No,2708,1167,97,360,1,Semiurban,Y
 58 | LP001195,Male,Yes,0,Graduate,No,2132,1591,96,360,1,Semiurban,Y
 59 | LP001197,Male,Yes,0,Graduate,No,3366,2200,135,360,1,Rural,N
 60 | LP001198,Male,Yes,1,Graduate,No,8080,2250,180,360,1,Urban,Y
 61 | LP001199,Male,Yes,2,Not Graduate,No,3357,2859,144,360,1,Urban,Y
 62 | LP001205,Male,Yes,0,Graduate,No,2500,3796,120,360,1,Urban,Y
 63 | LP001206,Male,Yes,3+,Graduate,No,3029,0,99,360,1,Urban,Y
 64 | LP001207,Male,Yes,0,Not Graduate,Yes,2609,3449,165,180,0,Rural,N
 65 | LP001213,Male,Yes,1,Graduate,No,4945,0,,360,0,Rural,N
 66 | LP001222,Female,No,0,Graduate,No,4166,0,116,360,0,Semiurban,N
 67 | LP001225,Male,Yes,0,Graduate,No,5726,4595,258,360,1,Semiurban,N
 68 | LP001228,Male,No,0,Not Graduate,No,3200,2254,126,180,0,Urban,N
 69 | LP001233,Male,Yes,1,Graduate,No,10750,0,312,360,1,Urban,Y
 70 | LP001238,Male,Yes,3+,Not Graduate,Yes,7100,0,125,60,1,Urban,Y
 71 | LP001241,Female,No,0,Graduate,No,4300,0,136,360,0,Semiurban,N
 72 | LP001243,Male,Yes,0,Graduate,No,3208,3066,172,360,1,Urban,Y
 73 | LP001245,Male,Yes,2,Not Graduate,Yes,1875,1875,97,360,1,Semiurban,Y
 74 | LP001248,Male,No,0,Graduate,No,3500,0,81,300,1,Semiurban,Y
 75 | LP001250,Male,Yes,3+,Not Graduate,No,4755,0,95,,0,Semiurban,N
 76 | LP001253,Male,Yes,3+,Graduate,Yes,5266,1774,187,360,1,Semiurban,Y
 77 | LP001255,Male,No,0,Graduate,No,3750,0,113,480,1,Urban,N
 78 | LP001256,Male,No,0,Graduate,No,3750,4750,176,360,1,Urban,N
 79 | LP001259,Male,Yes,1,Graduate,Yes,1000,3022,110,360,1,Urban,N
 80 | LP001263,Male,Yes,3+,Graduate,No,3167,4000,180,300,0,Semiurban,N
 81 | LP001264,Male,Yes,3+,Not Graduate,Yes,3333,2166,130,360,,Semiurban,Y
 82 | LP001265,Female,No,0,Graduate,No,3846,0,111,360,1,Semiurban,Y
 83 | LP001266,Male,Yes,1,Graduate,Yes,2395,0,,360,1,Semiurban,Y
 84 | LP001267,Female,Yes,2,Graduate,No,1378,1881,167,360,1,Urban,N
 85 | LP001273,Male,Yes,0,Graduate,No,6000,2250,265,360,,Semiurban,N
 86 | LP001275,Male,Yes,1,Graduate,No,3988,0,50,240,1,Urban,Y
 87 | LP001279,Male,No,0,Graduate,No,2366,2531,136,360,1,Semiurban,Y
 88 | LP001280,Male,Yes,2,Not Graduate,No,3333,2000,99,360,,Semiurban,Y
 89 | LP001282,Male,Yes,0,Graduate,No,2500,2118,104,360,1,Semiurban,Y
 90 | LP001289,Male,No,0,Graduate,No,8566,0,210,360,1,Urban,Y
 91 | LP001310,Male,Yes,0,Graduate,No,5695,4167,175,360,1,Semiurban,Y
 92 | LP001316,Male,Yes,0,Graduate,No,2958,2900,131,360,1,Semiurban,Y
 93 | LP001318,Male,Yes,2,Graduate,No,6250,5654,188,180,1,Semiurban,Y
 94 | LP001319,Male,Yes,2,Not Graduate,No,3273,1820,81,360,1,Urban,Y
 95 | LP001322,Male,No,0,Graduate,No,4133,0,122,360,1,Semiurban,Y
 96 | LP001325,Male,No,0,Not Graduate,No,3620,0,25,120,1,Semiurban,Y
 97 | LP001326,Male,No,0,Graduate,,6782,0,,360,,Urban,N
 98 | LP001327,Female,Yes,0,Graduate,No,2484,2302,137,360,1,Semiurban,Y
 99 | LP001333,Male,Yes,0,Graduate,No,1977,997,50,360,1,Semiurban,Y
100 | LP001334,Male,Yes,0,Not Graduate,No,4188,0,115,180,1,Semiurban,Y
101 | LP001343,Male,Yes,0,Graduate,No,1759,3541,131,360,1,Semiurban,Y
102 | LP001345,Male,Yes,2,Not Graduate,No,4288,3263,133,180,1,Urban,Y
103 | LP001349,Male,No,0,Graduate,No,4843,3806,151,360,1,Semiurban,Y
104 | LP001350,Male,Yes,,Graduate,No,13650,0,,360,1,Urban,Y
105 | LP001356,Male,Yes,0,Graduate,No,4652,3583,,360,1,Semiurban,Y
106 | LP001357,Male,,,Graduate,No,3816,754,160,360,1,Urban,Y
107 | LP001367,Male,Yes,1,Graduate,No,3052,1030,100,360,1,Urban,Y
108 | LP001369,Male,Yes,2,Graduate,No,11417,1126,225,360,1,Urban,Y
109 | LP001370,Male,No,0,Not Graduate,,7333,0,120,360,1,Rural,N
110 | LP001379,Male,Yes,2,Graduate,No,3800,3600,216,360,0,Urban,N
111 | LP001384,Male,Yes,3+,Not Graduate,No,2071,754,94,480,1,Semiurban,Y
112 | LP001385,Male,No,0,Graduate,No,5316,0,136,360,1,Urban,Y
113 | LP001387,Female,Yes,0,Graduate,,2929,2333,139,360,1,Semiurban,Y
114 | LP001391,Male,Yes,0,Not Graduate,No,3572,4114,152,,0,Rural,N
115 | LP001392,Female,No,1,Graduate,Yes,7451,0,,360,1,Semiurban,Y
116 | LP001398,Male,No,0,Graduate,,5050,0,118,360,1,Semiurban,Y
117 | LP001401,Male,Yes,1,Graduate,No,14583,0,185,180,1,Rural,Y
118 | LP001404,Female,Yes,0,Graduate,No,3167,2283,154,360,1,Semiurban,Y
119 | LP001405,Male,Yes,1,Graduate,No,2214,1398,85,360,,Urban,Y
120 | LP001421,Male,Yes,0,Graduate,No,5568,2142,175,360,1,Rural,N
121 | LP001422,Female,No,0,Graduate,No,10408,0,259,360,1,Urban,Y
122 | LP001426,Male,Yes,,Graduate,No,5667,2667,180,360,1,Rural,Y
123 | LP001430,Female,No,0,Graduate,No,4166,0,44,360,1,Semiurban,Y
124 | LP001431,Female,No,0,Graduate,No,2137,8980,137,360,0,Semiurban,Y
125 | LP001432,Male,Yes,2,Graduate,No,2957,0,81,360,1,Semiurban,Y
126 | LP001439,Male,Yes,0,Not Graduate,No,4300,2014,194,360,1,Rural,Y
127 | LP001443,Female,No,0,Graduate,No,3692,0,93,360,,Rural,Y
128 | LP001448,,Yes,3+,Graduate,No,23803,0,370,360,1,Rural,Y
129 | LP001449,Male,No,0,Graduate,No,3865,1640,,360,1,Rural,Y
130 | LP001451,Male,Yes,1,Graduate,Yes,10513,3850,160,180,0,Urban,N
131 | LP001465,Male,Yes,0,Graduate,No,6080,2569,182,360,,Rural,N
132 | LP001469,Male,No,0,Graduate,Yes,20166,0,650,480,,Urban,Y
133 | LP001473,Male,No,0,Graduate,No,2014,1929,74,360,1,Urban,Y
134 | LP001478,Male,No,0,Graduate,No,2718,0,70,360,1,Semiurban,Y
135 | LP001482,Male,Yes,0,Graduate,Yes,3459,0,25,120,1,Semiurban,Y
136 | LP001487,Male,No,0,Graduate,No,4895,0,102,360,1,Semiurban,Y
137 | LP001488,Male,Yes,3+,Graduate,No,4000,7750,290,360,1,Semiurban,N
138 | LP001489,Female,Yes,0,Graduate,No,4583,0,84,360,1,Rural,N
139 | LP001491,Male,Yes,2,Graduate,Yes,3316,3500,88,360,1,Urban,Y
140 | LP001492,Male,No,0,Graduate,No,14999,0,242,360,0,Semiurban,N
141 | LP001493,Male,Yes,2,Not Graduate,No,4200,1430,129,360,1,Rural,N
142 | LP001497,Male,Yes,2,Graduate,No,5042,2083,185,360,1,Rural,N
143 | LP001498,Male,No,0,Graduate,No,5417,0,168,360,1,Urban,Y
144 | LP001504,Male,No,0,Graduate,Yes,6950,0,175,180,1,Semiurban,Y
145 | LP001507,Male,Yes,0,Graduate,No,2698,2034,122,360,1,Semiurban,Y
146 | LP001508,Male,Yes,2,Graduate,No,11757,0,187,180,1,Urban,Y
147 | LP001514,Female,Yes,0,Graduate,No,2330,4486,100,360,1,Semiurban,Y
148 | LP001516,Female,Yes,2,Graduate,No,14866,0,70,360,1,Urban,Y
149 | LP001518,Male,Yes,1,Graduate,No,1538,1425,30,360,1,Urban,Y
150 | LP001519,Female,No,0,Graduate,No,10000,1666,225,360,1,Rural,N
151 | LP001520,Male,Yes,0,Graduate,No,4860,830,125,360,1,Semiurban,Y
152 | LP001528,Male,No,0,Graduate,No,6277,0,118,360,0,Rural,N
153 | LP001529,Male,Yes,0,Graduate,Yes,2577,3750,152,360,1,Rural,Y
154 | LP001531,Male,No,0,Graduate,No,9166,0,244,360,1,Urban,N
155 | LP001532,Male,Yes,2,Not Graduate,No,2281,0,113,360,1,Rural,N
156 | LP001535,Male,No,0,Graduate,No,3254,0,50,360,1,Urban,Y
157 | LP001536,Male,Yes,3+,Graduate,No,39999,0,600,180,0,Semiurban,Y
158 | LP001541,Male,Yes,1,Graduate,No,6000,0,160,360,,Rural,Y
159 | LP001543,Male,Yes,1,Graduate,No,9538,0,187,360,1,Urban,Y
160 | LP001546,Male,No,0,Graduate,,2980,2083,120,360,1,Rural,Y
161 | LP001552,Male,Yes,0,Graduate,No,4583,5625,255,360,1,Semiurban,Y
162 | LP001560,Male,Yes,0,Not Graduate,No,1863,1041,98,360,1,Semiurban,Y
163 | LP001562,Male,Yes,0,Graduate,No,7933,0,275,360,1,Urban,N
164 | LP001565,Male,Yes,1,Graduate,No,3089,1280,121,360,0,Semiurban,N
165 | LP001570,Male,Yes,2,Graduate,No,4167,1447,158,360,1,Rural,Y
166 | LP001572,Male,Yes,0,Graduate,No,9323,0,75,180,1,Urban,Y
167 | LP001574,Male,Yes,0,Graduate,No,3707,3166,182,,1,Rural,Y
168 | LP001577,Female,Yes,0,Graduate,No,4583,0,112,360,1,Rural,N
169 | LP001578,Male,Yes,0,Graduate,No,2439,3333,129,360,1,Rural,Y
170 | LP001579,Male,No,0,Graduate,No,2237,0,63,480,0,Semiurban,N
171 | LP001580,Male,Yes,2,Graduate,No,8000,0,200,360,1,Semiurban,Y
172 | LP001581,Male,Yes,0,Not Graduate,,1820,1769,95,360,1,Rural,Y
173 | LP001585,,Yes,3+,Graduate,No,51763,0,700,300,1,Urban,Y
174 | LP001586,Male,Yes,3+,Not Graduate,No,3522,0,81,180,1,Rural,N
175 | LP001594,Male,Yes,0,Graduate,No,5708,5625,187,360,1,Semiurban,Y
176 | LP001603,Male,Yes,0,Not Graduate,Yes,4344,736,87,360,1,Semiurban,N
177 | LP001606,Male,Yes,0,Graduate,No,3497,1964,116,360,1,Rural,Y
178 | LP001608,Male,Yes,2,Graduate,No,2045,1619,101,360,1,Rural,Y
179 | LP001610,Male,Yes,3+,Graduate,No,5516,11300,495,360,0,Semiurban,N
180 | LP001616,Male,Yes,1,Graduate,No,3750,0,116,360,1,Semiurban,Y
181 | LP001630,Male,No,0,Not Graduate,No,2333,1451,102,480,0,Urban,N
182 | LP001633,Male,Yes,1,Graduate,No,6400,7250,180,360,0,Urban,N
183 | LP001634,Male,No,0,Graduate,No,1916,5063,67,360,,Rural,N
184 | LP001636,Male,Yes,0,Graduate,No,4600,0,73,180,1,Semiurban,Y
185 | LP001637,Male,Yes,1,Graduate,No,33846,0,260,360,1,Semiurban,N
186 | LP001639,Female,Yes,0,Graduate,No,3625,0,108,360,1,Semiurban,Y
187 | LP001640,Male,Yes,0,Graduate,Yes,39147,4750,120,360,1,Semiurban,Y
188 | LP001641,Male,Yes,1,Graduate,Yes,2178,0,66,300,0,Rural,N
189 | LP001643,Male,Yes,0,Graduate,No,2383,2138,58,360,,Rural,Y
190 | LP001644,,Yes,0,Graduate,Yes,674,5296,168,360,1,Rural,Y
191 | LP001647,Male,Yes,0,Graduate,No,9328,0,188,180,1,Rural,Y
192 | LP001653,Male,No,0,Not Graduate,No,4885,0,48,360,1,Rural,Y
193 | LP001656,Male,No,0,Graduate,No,12000,0,164,360,1,Semiurban,N
194 | LP001657,Male,Yes,0,Not Graduate,No,6033,0,160,360,1,Urban,N
195 | LP001658,Male,No,0,Graduate,No,3858,0,76,360,1,Semiurban,Y
196 | LP001664,Male,No,0,Graduate,No,4191,0,120,360,1,Rural,Y
197 | LP001665,Male,Yes,1,Graduate,No,3125,2583,170,360,1,Semiurban,N
198 | LP001666,Male,No,0,Graduate,No,8333,3750,187,360,1,Rural,Y
199 | LP001669,Female,No,0,Not Graduate,No,1907,2365,120,,1,Urban,Y
200 | LP001671,Female,Yes,0,Graduate,No,3416,2816,113,360,,Semiurban,Y
201 | LP001673,Male,No,0,Graduate,Yes,11000,0,83,360,1,Urban,N
202 | LP001674,Male,Yes,1,Not Graduate,No,2600,2500,90,360,1,Semiurban,Y
203 | LP001677,Male,No,2,Graduate,No,4923,0,166,360,0,Semiurban,Y
204 | LP001682,Male,Yes,3+,Not Graduate,No,3992,0,,180,1,Urban,N
205 | LP001688,Male,Yes,1,Not Graduate,No,3500,1083,135,360,1,Urban,Y
206 | LP001691,Male,Yes,2,Not Graduate,No,3917,0,124,360,1,Semiurban,Y
207 | LP001692,Female,No,0,Not Graduate,No,4408,0,120,360,1,Semiurban,Y
208 | LP001693,Female,No,0,Graduate,No,3244,0,80,360,1,Urban,Y
209 | LP001698,Male,No,0,Not Graduate,No,3975,2531,55,360,1,Rural,Y
210 | LP001699,Male,No,0,Graduate,No,2479,0,59,360,1,Urban,Y
211 | LP001702,Male,No,0,Graduate,No,3418,0,127,360,1,Semiurban,N
212 | LP001708,Female,No,0,Graduate,No,10000,0,214,360,1,Semiurban,N
213 | LP001711,Male,Yes,3+,Graduate,No,3430,1250,128,360,0,Semiurban,N
214 | LP001713,Male,Yes,1,Graduate,Yes,7787,0,240,360,1,Urban,Y
215 | LP001715,Male,Yes,3+,Not Graduate,Yes,5703,0,130,360,1,Rural,Y
216 | LP001716,Male,Yes,0,Graduate,No,3173,3021,137,360,1,Urban,Y
217 | LP001720,Male,Yes,3+,Not Graduate,No,3850,983,100,360,1,Semiurban,Y
218 | LP001722,Male,Yes,0,Graduate,No,150,1800,135,360,1,Rural,N
219 | LP001726,Male,Yes,0,Graduate,No,3727,1775,131,360,1,Semiurban,Y
220 | LP001732,Male,Yes,2,Graduate,,5000,0,72,360,0,Semiurban,N
221 | LP001734,Female,Yes,2,Graduate,No,4283,2383,127,360,,Semiurban,Y
222 | LP001736,Male,Yes,0,Graduate,No,2221,0,60,360,0,Urban,N
223 | LP001743,Male,Yes,2,Graduate,No,4009,1717,116,360,1,Semiurban,Y
224 | LP001744,Male,No,0,Graduate,No,2971,2791,144,360,1,Semiurban,Y
225 | LP001749,Male,Yes,0,Graduate,No,7578,1010,175,,1,Semiurban,Y
226 | LP001750,Male,Yes,0,Graduate,No,6250,0,128,360,1,Semiurban,Y
227 | LP001751,Male,Yes,0,Graduate,No,3250,0,170,360,1,Rural,N
228 | LP001754,Male,Yes,,Not Graduate,Yes,4735,0,138,360,1,Urban,N
229 | LP001758,Male,Yes,2,Graduate,No,6250,1695,210,360,1,Semiurban,Y
230 | LP001760,Male,,,Graduate,No,4758,0,158,480,1,Semiurban,Y
231 | LP001761,Male,No,0,Graduate,Yes,6400,0,200,360,1,Rural,Y
232 | LP001765,Male,Yes,1,Graduate,No,2491,2054,104,360,1,Semiurban,Y
233 | LP001768,Male,Yes,0,Graduate,,3716,0,42,180,1,Rural,Y
234 | LP001770,Male,No,0,Not Graduate,No,3189,2598,120,,1,Rural,Y
235 | LP001776,Female,No,0,Graduate,No,8333,0,280,360,1,Semiurban,Y
236 | LP001778,Male,Yes,1,Graduate,No,3155,1779,140,360,1,Semiurban,Y
237 | LP001784,Male,Yes,1,Graduate,No,5500,1260,170,360,1,Rural,Y
238 | LP001786,Male,Yes,0,Graduate,,5746,0,255,360,,Urban,N
239 | LP001788,Female,No,0,Graduate,Yes,3463,0,122,360,,Urban,Y
240 | LP001790,Female,No,1,Graduate,No,3812,0,112,360,1,Rural,Y
241 | LP001792,Male,Yes,1,Graduate,No,3315,0,96,360,1,Semiurban,Y
242 | LP001798,Male,Yes,2,Graduate,No,5819,5000,120,360,1,Rural,Y
243 | LP001800,Male,Yes,1,Not Graduate,No,2510,1983,140,180,1,Urban,N
244 | LP001806,Male,No,0,Graduate,No,2965,5701,155,60,1,Urban,Y
245 | LP001807,Male,Yes,2,Graduate,Yes,6250,1300,108,360,1,Rural,Y
246 | LP001811,Male,Yes,0,Not Graduate,No,3406,4417,123,360,1,Semiurban,Y
247 | LP001813,Male,No,0,Graduate,Yes,6050,4333,120,180,1,Urban,N
248 | LP001814,Male,Yes,2,Graduate,No,9703,0,112,360,1,Urban,Y
249 | LP001819,Male,Yes,1,Not Graduate,No,6608,0,137,180,1,Urban,Y
250 | LP001824,Male,Yes,1,Graduate,No,2882,1843,123,480,1,Semiurban,Y
251 | LP001825,Male,Yes,0,Graduate,No,1809,1868,90,360,1,Urban,Y
252 | LP001835,Male,Yes,0,Not Graduate,No,1668,3890,201,360,0,Semiurban,N
253 | LP001836,Female,No,2,Graduate,No,3427,0,138,360,1,Urban,N
254 | LP001841,Male,No,0,Not Graduate,Yes,2583,2167,104,360,1,Rural,Y
255 | LP001843,Male,Yes,1,Not Graduate,No,2661,7101,279,180,1,Semiurban,Y
256 | LP001844,Male,No,0,Graduate,Yes,16250,0,192,360,0,Urban,N
257 | LP001846,Female,No,3+,Graduate,No,3083,0,255,360,1,Rural,Y
258 | LP001849,Male,No,0,Not Graduate,No,6045,0,115,360,0,Rural,N
259 | LP001854,Male,Yes,3+,Graduate,No,5250,0,94,360,1,Urban,N
260 | LP001859,Male,Yes,0,Graduate,No,14683,2100,304,360,1,Rural,N
261 | LP001864,Male,Yes,3+,Not Graduate,No,4931,0,128,360,,Semiurban,N
262 | LP001865,Male,Yes,1,Graduate,No,6083,4250,330,360,,Urban,Y
263 | LP001868,Male,No,0,Graduate,No,2060,2209,134,360,1,Semiurban,Y
264 | LP001870,Female,No,1,Graduate,No,3481,0,155,36,1,Semiurban,N
265 | LP001871,Female,No,0,Graduate,No,7200,0,120,360,1,Rural,Y
266 | LP001872,Male,No,0,Graduate,Yes,5166,0,128,360,1,Semiurban,Y
267 | LP001875,Male,No,0,Graduate,No,4095,3447,151,360,1,Rural,Y
268 | LP001877,Male,Yes,2,Graduate,No,4708,1387,150,360,1,Semiurban,Y
269 | LP001882,Male,Yes,3+,Graduate,No,4333,1811,160,360,0,Urban,Y
270 | LP001883,Female,No,0,Graduate,,3418,0,135,360,1,Rural,N
271 | LP001884,Female,No,1,Graduate,No,2876,1560,90,360,1,Urban,Y
272 | LP001888,Female,No,0,Graduate,No,3237,0,30,360,1,Urban,Y
273 | LP001891,Male,Yes,0,Graduate,No,11146,0,136,360,1,Urban,Y
274 | LP001892,Male,No,0,Graduate,No,2833,1857,126,360,1,Rural,Y
275 | LP001894,Male,Yes,0,Graduate,No,2620,2223,150,360,1,Semiurban,Y
276 | LP001896,Male,Yes,2,Graduate,No,3900,0,90,360,1,Semiurban,Y
277 | LP001900,Male,Yes,1,Graduate,No,2750,1842,115,360,1,Semiurban,Y
278 | LP001903,Male,Yes,0,Graduate,No,3993,3274,207,360,1,Semiurban,Y
279 | LP001904,Male,Yes,0,Graduate,No,3103,1300,80,360,1,Urban,Y
280 | LP001907,Male,Yes,0,Graduate,No,14583,0,436,360,1,Semiurban,Y
281 | LP001908,Female,Yes,0,Not Graduate,No,4100,0,124,360,,Rural,Y
282 | LP001910,Male,No,1,Not Graduate,Yes,4053,2426,158,360,0,Urban,N
283 | LP001914,Male,Yes,0,Graduate,No,3927,800,112,360,1,Semiurban,Y
284 | LP001915,Male,Yes,2,Graduate,No,2301,985.7999878,78,180,1,Urban,Y
285 | LP001917,Female,No,0,Graduate,No,1811,1666,54,360,1,Urban,Y
286 | LP001922,Male,Yes,0,Graduate,No,20667,0,,360,1,Rural,N
287 | LP001924,Male,No,0,Graduate,No,3158,3053,89,360,1,Rural,Y
288 | LP001925,Female,No,0,Graduate,Yes,2600,1717,99,300,1,Semiurban,N
289 | LP001926,Male,Yes,0,Graduate,No,3704,2000,120,360,1,Rural,Y
290 | LP001931,Female,No,0,Graduate,No,4124,0,115,360,1,Semiurban,Y
291 | LP001935,Male,No,0,Graduate,No,9508,0,187,360,1,Rural,Y
292 | LP001936,Male,Yes,0,Graduate,No,3075,2416,139,360,1,Rural,Y
293 | LP001938,Male,Yes,2,Graduate,No,4400,0,127,360,0,Semiurban,N
294 | LP001940,Male,Yes,2,Graduate,No,3153,1560,134,360,1,Urban,Y
295 | LP001945,Female,No,,Graduate,No,5417,0,143,480,0,Urban,N
296 | LP001947,Male,Yes,0,Graduate,No,2383,3334,172,360,1,Semiurban,Y
297 | LP001949,Male,Yes,3+,Graduate,,4416,1250,110,360,1,Urban,Y
298 | LP001953,Male,Yes,1,Graduate,No,6875,0,200,360,1,Semiurban,Y
299 | LP001954,Female,Yes,1,Graduate,No,4666,0,135,360,1,Urban,Y
300 | LP001955,Female,No,0,Graduate,No,5000,2541,151,480,1,Rural,N
301 | LP001963,Male,Yes,1,Graduate,No,2014,2925,113,360,1,Urban,N
302 | LP001964,Male,Yes,0,Not Graduate,No,1800,2934,93,360,0,Urban,N
303 | LP001972,Male,Yes,,Not Graduate,No,2875,1750,105,360,1,Semiurban,Y
304 | LP001974,Female,No,0,Graduate,No,5000,0,132,360,1,Rural,Y
305 | LP001977,Male,Yes,1,Graduate,No,1625,1803,96,360,1,Urban,Y
306 | LP001978,Male,No,0,Graduate,No,4000,2500,140,360,1,Rural,Y
307 | LP001990,Male,No,0,Not Graduate,No,2000,0,,360,1,Urban,N
308 | LP001993,Female,No,0,Graduate,No,3762,1666,135,360,1,Rural,Y
309 | LP001994,Female,No,0,Graduate,No,2400,1863,104,360,0,Urban,N
310 | LP001996,Male,No,0,Graduate,No,20233,0,480,360,1,Rural,N
311 | LP001998,Male,Yes,2,Not Graduate,No,7667,0,185,360,,Rural,Y
312 | LP002002,Female,No,0,Graduate,No,2917,0,84,360,1,Semiurban,Y
313 | LP002004,Male,No,0,Not Graduate,No,2927,2405,111,360,1,Semiurban,Y
314 | LP002006,Female,No,0,Graduate,No,2507,0,56,360,1,Rural,Y
315 | LP002008,Male,Yes,2,Graduate,Yes,5746,0,144,84,,Rural,Y
316 | LP002024,,Yes,0,Graduate,No,2473,1843,159,360,1,Rural,N
317 | LP002031,Male,Yes,1,Not Graduate,No,3399,1640,111,180,1,Urban,Y
318 | LP002035,Male,Yes,2,Graduate,No,3717,0,120,360,1,Semiurban,Y
319 | LP002036,Male,Yes,0,Graduate,No,2058,2134,88,360,,Urban,Y
320 | LP002043,Female,No,1,Graduate,No,3541,0,112,360,,Semiurban,Y
321 | LP002050,Male,Yes,1,Graduate,Yes,10000,0,155,360,1,Rural,N
322 | LP002051,Male,Yes,0,Graduate,No,2400,2167,115,360,1,Semiurban,Y
323 | LP002053,Male,Yes,3+,Graduate,No,4342,189,124,360,1,Semiurban,Y
324 | LP002054,Male,Yes,2,Not Graduate,No,3601,1590,,360,1,Rural,Y
325 | LP002055,Female,No,0,Graduate,No,3166,2985,132,360,,Rural,Y
326 | LP002065,Male,Yes,3+,Graduate,No,15000,0,300,360,1,Rural,Y
327 | LP002067,Male,Yes,1,Graduate,Yes,8666,4983,376,360,0,Rural,N
328 | LP002068,Male,No,0,Graduate,No,4917,0,130,360,0,Rural,Y
329 | LP002082,Male,Yes,0,Graduate,Yes,5818,2160,184,360,1,Semiurban,Y
330 | LP002086,Female,Yes,0,Graduate,No,4333,2451,110,360,1,Urban,N
331 | LP002087,Female,No,0,Graduate,No,2500,0,67,360,1,Urban,Y
332 | LP002097,Male,No,1,Graduate,No,4384,1793,117,360,1,Urban,Y
333 | LP002098,Male,No,0,Graduate,No,2935,0,98,360,1,Semiurban,Y
334 | LP002100,Male,No,,Graduate,No,2833,0,71,360,1,Urban,Y
335 | LP002101,Male,Yes,0,Graduate,,63337,0,490,180,1,Urban,Y
336 | LP002103,,Yes,1,Graduate,Yes,9833,1833,182,180,1,Urban,Y
337 | LP002106,Male,Yes,,Graduate,Yes,5503,4490,70,,1,Semiurban,Y
338 | LP002110,Male,Yes,1,Graduate,,5250,688,160,360,1,Rural,Y
339 | LP002112,Male,Yes,2,Graduate,Yes,2500,4600,176,360,1,Rural,Y
340 | LP002113,Female,No,3+,Not Graduate,No,1830,0,,360,0,Urban,N
341 | LP002114,Female,No,0,Graduate,No,4160,0,71,360,1,Semiurban,Y
342 | LP002115,Male,Yes,3+,Not Graduate,No,2647,1587,173,360,1,Rural,N
343 | LP002116,Female,No,0,Graduate,No,2378,0,46,360,1,Rural,N
344 | LP002119,Male,Yes,1,Not Graduate,No,4554,1229,158,360,1,Urban,Y
345 | LP002126,Male,Yes,3+,Not Graduate,No,3173,0,74,360,1,Semiurban,Y
346 | LP002128,Male,Yes,2,Graduate,,2583,2330,125,360,1,Rural,Y
347 | LP002129,Male,Yes,0,Graduate,No,2499,2458,160,360,1,Semiurban,Y
348 | LP002130,Male,Yes,,Not Graduate,No,3523,3230,152,360,0,Rural,N
349 | LP002131,Male,Yes,2,Not Graduate,No,3083,2168,126,360,1,Urban,Y
350 | LP002137,Male,Yes,0,Graduate,No,6333,4583,259,360,,Semiurban,Y
351 | LP002138,Male,Yes,0,Graduate,No,2625,6250,187,360,1,Rural,Y
352 | LP002139,Male,Yes,0,Graduate,No,9083,0,228,360,1,Semiurban,Y
353 | LP002140,Male,No,0,Graduate,No,8750,4167,308,360,1,Rural,N
354 | LP002141,Male,Yes,3+,Graduate,No,2666,2083,95,360,1,Rural,Y
355 | LP002142,Female,Yes,0,Graduate,Yes,5500,0,105,360,0,Rural,N
356 | LP002143,Female,Yes,0,Graduate,No,2423,505,130,360,1,Semiurban,Y
357 | LP002144,Female,No,,Graduate,No,3813,0,116,180,1,Urban,Y
358 | LP002149,Male,Yes,2,Graduate,No,8333,3167,165,360,1,Rural,Y
359 | LP002151,Male,Yes,1,Graduate,No,3875,0,67,360,1,Urban,N
360 | LP002158,Male,Yes,0,Not Graduate,No,3000,1666,100,480,0,Urban,N
361 | LP002160,Male,Yes,3+,Graduate,No,5167,3167,200,360,1,Semiurban,Y
362 | LP002161,Female,No,1,Graduate,No,4723,0,81,360,1,Semiurban,N
363 | LP002170,Male,Yes,2,Graduate,No,5000,3667,236,360,1,Semiurban,Y
364 | LP002175,Male,Yes,0,Graduate,No,4750,2333,130,360,1,Urban,Y
365 | LP002178,Male,Yes,0,Graduate,No,3013,3033,95,300,,Urban,Y
366 | LP002180,Male,No,0,Graduate,Yes,6822,0,141,360,1,Rural,Y
367 | LP002181,Male,No,0,Not Graduate,No,6216,0,133,360,1,Rural,N
368 | LP002187,Male,No,0,Graduate,No,2500,0,96,480,1,Semiurban,N
369 | LP002188,Male,No,0,Graduate,No,5124,0,124,,0,Rural,N
370 | LP002190,Male,Yes,1,Graduate,No,6325,0,175,360,1,Semiurban,Y
371 | LP002191,Male,Yes,0,Graduate,No,19730,5266,570,360,1,Rural,N
372 | LP002194,Female,No,0,Graduate,Yes,15759,0,55,360,1,Semiurban,Y
373 | LP002197,Male,Yes,2,Graduate,No,5185,0,155,360,1,Semiurban,Y
374 | LP002201,Male,Yes,2,Graduate,Yes,9323,7873,380,300,1,Rural,Y
375 | LP002205,Male,No,1,Graduate,No,3062,1987,111,180,0,Urban,N
376 | LP002209,Female,No,0,Graduate,,2764,1459,110,360,1,Urban,Y
377 | LP002211,Male,Yes,0,Graduate,No,4817,923,120,180,1,Urban,Y
378 | LP002219,Male,Yes,3+,Graduate,No,8750,4996,130,360,1,Rural,Y
379 | LP002223,Male,Yes,0,Graduate,No,4310,0,130,360,,Semiurban,Y
380 | LP002224,Male,No,0,Graduate,No,3069,0,71,480,1,Urban,N
381 | LP002225,Male,Yes,2,Graduate,No,5391,0,130,360,1,Urban,Y
382 | LP002226,Male,Yes,0,Graduate,,3333,2500,128,360,1,Semiurban,Y
383 | LP002229,Male,No,0,Graduate,No,5941,4232,296,360,1,Semiurban,Y
384 | LP002231,Female,No,0,Graduate,No,6000,0,156,360,1,Urban,Y
385 | LP002234,Male,No,0,Graduate,Yes,7167,0,128,360,1,Urban,Y
386 | LP002236,Male,Yes,2,Graduate,No,4566,0,100,360,1,Urban,N
387 | LP002237,Male,No,1,Graduate,,3667,0,113,180,1,Urban,Y
388 | LP002239,Male,No,0,Not Graduate,No,2346,1600,132,360,1,Semiurban,Y
389 | LP002243,Male,Yes,0,Not Graduate,No,3010,3136,,360,0,Urban,N
390 | LP002244,Male,Yes,0,Graduate,No,2333,2417,136,360,1,Urban,Y
391 | LP002250,Male,Yes,0,Graduate,No,5488,0,125,360,1,Rural,Y
392 | LP002255,Male,No,3+,Graduate,No,9167,0,185,360,1,Rural,Y
393 | LP002262,Male,Yes,3+,Graduate,No,9504,0,275,360,1,Rural,Y
394 | LP002263,Male,Yes,0,Graduate,No,2583,2115,120,360,,Urban,Y
395 | LP002265,Male,Yes,2,Not Graduate,No,1993,1625,113,180,1,Semiurban,Y
396 | LP002266,Male,Yes,2,Graduate,No,3100,1400,113,360,1,Urban,Y
397 | LP002272,Male,Yes,2,Graduate,No,3276,484,135,360,,Semiurban,Y
398 | LP002277,Female,No,0,Graduate,No,3180,0,71,360,0,Urban,N
399 | LP002281,Male,Yes,0,Graduate,No,3033,1459,95,360,1,Urban,Y
400 | LP002284,Male,No,0,Not Graduate,No,3902,1666,109,360,1,Rural,Y
401 | LP002287,Female,No,0,Graduate,No,1500,1800,103,360,0,Semiurban,N
402 | LP002288,Male,Yes,2,Not Graduate,No,2889,0,45,180,0,Urban,N
403 | LP002296,Male,No,0,Not Graduate,No,2755,0,65,300,1,Rural,N
404 | LP002297,Male,No,0,Graduate,No,2500,20000,103,360,1,Semiurban,Y
405 | LP002300,Female,No,0,Not Graduate,No,1963,0,53,360,1,Semiurban,Y
406 | LP002301,Female,No,0,Graduate,Yes,7441,0,194,360,1,Rural,N
407 | LP002305,Female,No,0,Graduate,No,4547,0,115,360,1,Semiurban,Y
408 | LP002308,Male,Yes,0,Not Graduate,No,2167,2400,115,360,1,Urban,Y
409 | LP002314,Female,No,0,Not Graduate,No,2213,0,66,360,1,Rural,Y
410 | LP002315,Male,Yes,1,Graduate,No,8300,0,152,300,0,Semiurban,N
411 | LP002317,Male,Yes,3+,Graduate,No,81000,0,360,360,0,Rural,N
412 | LP002318,Female,No,1,Not Graduate,Yes,3867,0,62,360,1,Semiurban,N
413 | LP002319,Male,Yes,0,Graduate,,6256,0,160,360,,Urban,Y
414 | LP002328,Male,Yes,0,Not Graduate,No,6096,0,218,360,0,Rural,N
415 | LP002332,Male,Yes,0,Not Graduate,No,2253,2033,110,360,1,Rural,Y
416 | LP002335,Female,Yes,0,Not Graduate,No,2149,3237,178,360,0,Semiurban,N
417 | LP002337,Female,No,0,Graduate,No,2995,0,60,360,1,Urban,Y
418 | LP002341,Female,No,1,Graduate,No,2600,0,160,360,1,Urban,N
419 | LP002342,Male,Yes,2,Graduate,Yes,1600,20000,239,360,1,Urban,N
420 | LP002345,Male,Yes,0,Graduate,No,1025,2773,112,360,1,Rural,Y
421 | LP002347,Male,Yes,0,Graduate,No,3246,1417,138,360,1,Semiurban,Y
422 | LP002348,Male,Yes,0,Graduate,No,5829,0,138,360,1,Rural,Y
423 | LP002357,Female,No,0,Not Graduate,No,2720,0,80,,0,Urban,N
424 | LP002361,Male,Yes,0,Graduate,No,1820,1719,100,360,1,Urban,Y
425 | LP002362,Male,Yes,1,Graduate,No,7250,1667,110,,0,Urban,N
426 | LP002364,Male,Yes,0,Graduate,No,14880,0,96,360,1,Semiurban,Y
427 | LP002366,Male,Yes,0,Graduate,No,2666,4300,121,360,1,Rural,Y
428 | LP002367,Female,No,1,Not Graduate,No,4606,0,81,360,1,Rural,N
429 | LP002368,Male,Yes,2,Graduate,No,5935,0,133,360,1,Semiurban,Y
430 | LP002369,Male,Yes,0,Graduate,No,2920,16.12000084,87,360,1,Rural,Y
431 | LP002370,Male,No,0,Not Graduate,No,2717,0,60,180,1,Urban,Y
432 | LP002377,Female,No,1,Graduate,Yes,8624,0,150,360,1,Semiurban,Y
433 | LP002379,Male,No,0,Graduate,No,6500,0,105,360,0,Rural,N
434 | LP002386,Male,No,0,Graduate,,12876,0,405,360,1,Semiurban,Y
435 | LP002387,Male,Yes,0,Graduate,No,2425,2340,143,360,1,Semiurban,Y
436 | LP002390,Male,No,0,Graduate,No,3750,0,100,360,1,Urban,Y
437 | LP002393,Female,,,Graduate,No,10047,0,,240,1,Semiurban,Y
438 | LP002398,Male,No,0,Graduate,No,1926,1851,50,360,1,Semiurban,Y
439 | LP002401,Male,Yes,0,Graduate,No,2213,1125,,360,1,Urban,Y
440 | LP002403,Male,No,0,Graduate,Yes,10416,0,187,360,0,Urban,N
441 | LP002407,Female,Yes,0,Not Graduate,Yes,7142,0,138,360,1,Rural,Y
442 | LP002408,Male,No,0,Graduate,No,3660,5064,187,360,1,Semiurban,Y
443 | LP002409,Male,Yes,0,Graduate,No,7901,1833,180,360,1,Rural,Y
444 | LP002418,Male,No,3+,Not Graduate,No,4707,1993,148,360,1,Semiurban,Y
445 | LP002422,Male,No,1,Graduate,No,37719,0,152,360,1,Semiurban,Y
446 | LP002424,Male,Yes,0,Graduate,No,7333,8333,175,300,,Rural,Y
447 | LP002429,Male,Yes,1,Graduate,Yes,3466,1210,130,360,1,Rural,Y
448 | LP002434,Male,Yes,2,Not Graduate,No,4652,0,110,360,1,Rural,Y
449 | LP002435,Male,Yes,0,Graduate,,3539,1376,55,360,1,Rural,N
450 | LP002443,Male,Yes,2,Graduate,No,3340,1710,150,360,0,Rural,N
451 | LP002444,Male,No,1,Not Graduate,Yes,2769,1542,190,360,,Semiurban,N
452 | LP002446,Male,Yes,2,Not Graduate,No,2309,1255,125,360,0,Rural,N
453 | LP002447,Male,Yes,2,Not Graduate,No,1958,1456,60,300,,Urban,Y
454 | LP002448,Male,Yes,0,Graduate,No,3948,1733,149,360,0,Rural,N
455 | LP002449,Male,Yes,0,Graduate,No,2483,2466,90,180,0,Rural,Y
456 | LP002453,Male,No,0,Graduate,Yes,7085,0,84,360,1,Semiurban,Y
457 | LP002455,Male,Yes,2,Graduate,No,3859,0,96,360,1,Semiurban,Y
458 | LP002459,Male,Yes,0,Graduate,No,4301,0,118,360,1,Urban,Y
459 | LP002467,Male,Yes,0,Graduate,No,3708,2569,173,360,1,Urban,N
460 | LP002472,Male,No,2,Graduate,No,4354,0,136,360,1,Rural,Y
461 | LP002473,Male,Yes,0,Graduate,No,8334,0,160,360,1,Semiurban,N
462 | LP002478,,Yes,0,Graduate,Yes,2083,4083,160,360,,Semiurban,Y
463 | LP002484,Male,Yes,3+,Graduate,No,7740,0,128,180,1,Urban,Y
464 | LP002487,Male,Yes,0,Graduate,No,3015,2188,153,360,1,Rural,Y
465 | LP002489,Female,No,1,Not Graduate,,5191,0,132,360,1,Semiurban,Y
466 | LP002493,Male,No,0,Graduate,No,4166,0,98,360,0,Semiurban,N
467 | LP002494,Male,No,0,Graduate,No,6000,0,140,360,1,Rural,Y
468 | LP002500,Male,Yes,3+,Not Graduate,No,2947,1664,70,180,0,Urban,N
469 | LP002501,,Yes,0,Graduate,No,16692,0,110,360,1,Semiurban,Y
470 | LP002502,Female,Yes,2,Not Graduate,,210,2917,98,360,1,Semiurban,Y
471 | LP002505,Male,Yes,0,Graduate,No,4333,2451,110,360,1,Urban,N
472 | LP002515,Male,Yes,1,Graduate,Yes,3450,2079,162,360,1,Semiurban,Y
473 | LP002517,Male,Yes,1,Not Graduate,No,2653,1500,113,180,0,Rural,N
474 | LP002519,Male,Yes,3+,Graduate,No,4691,0,100,360,1,Semiurban,Y
475 | LP002522,Female,No,0,Graduate,Yes,2500,0,93,360,,Urban,Y
476 | LP002524,Male,No,2,Graduate,No,5532,4648,162,360,1,Rural,Y
477 | LP002527,Male,Yes,2,Graduate,Yes,16525,1014,150,360,1,Rural,Y
478 | LP002529,Male,Yes,2,Graduate,No,6700,1750,230,300,1,Semiurban,Y
479 | LP002530,,Yes,2,Graduate,No,2873,1872,132,360,0,Semiurban,N
480 | LP002531,Male,Yes,1,Graduate,Yes,16667,2250,86,360,1,Semiurban,Y
481 | LP002533,Male,Yes,2,Graduate,No,2947,1603,,360,1,Urban,N
482 | LP002534,Female,No,0,Not Graduate,No,4350,0,154,360,1,Rural,Y
483 | LP002536,Male,Yes,3+,Not Graduate,No,3095,0,113,360,1,Rural,Y
484 | LP002537,Male,Yes,0,Graduate,No,2083,3150,128,360,1,Semiurban,Y
485 | LP002541,Male,Yes,0,Graduate,No,10833,0,234,360,1,Semiurban,Y
486 | LP002543,Male,Yes,2,Graduate,No,8333,0,246,360,1,Semiurban,Y
487 | LP002544,Male,Yes,1,Not Graduate,No,1958,2436,131,360,1,Rural,Y
488 | LP002545,Male,No,2,Graduate,No,3547,0,80,360,0,Rural,N
489 | LP002547,Male,Yes,1,Graduate,No,18333,0,500,360,1,Urban,N
490 | LP002555,Male,Yes,2,Graduate,Yes,4583,2083,160,360,1,Semiurban,Y
491 | LP002556,Male,No,0,Graduate,No,2435,0,75,360,1,Urban,N
492 | LP002560,Male,No,0,Not Graduate,No,2699,2785,96,360,,Semiurban,Y
493 | LP002562,Male,Yes,1,Not Graduate,No,5333,1131,186,360,,Urban,Y
494 | LP002571,Male,No,0,Not Graduate,No,3691,0,110,360,1,Rural,Y
495 | LP002582,Female,No,0,Not Graduate,Yes,17263,0,225,360,1,Semiurban,Y
496 | LP002585,Male,Yes,0,Graduate,No,3597,2157,119,360,0,Rural,N
497 | LP002586,Female,Yes,1,Graduate,No,3326,913,105,84,1,Semiurban,Y
498 | LP002587,Male,Yes,0,Not Graduate,No,2600,1700,107,360,1,Rural,Y
499 | LP002588,Male,Yes,0,Graduate,No,4625,2857,111,12,,Urban,Y
500 | LP002600,Male,Yes,1,Graduate,Yes,2895,0,95,360,1,Semiurban,Y
501 | LP002602,Male,No,0,Graduate,No,6283,4416,209,360,0,Rural,N
502 | LP002603,Female,No,0,Graduate,No,645,3683,113,480,1,Rural,Y
503 | LP002606,Female,No,0,Graduate,No,3159,0,100,360,1,Semiurban,Y
504 | LP002615,Male,Yes,2,Graduate,No,4865,5624,208,360,1,Semiurban,Y
505 | LP002618,Male,Yes,1,Not Graduate,No,4050,5302,138,360,,Rural,N
506 | LP002619,Male,Yes,0,Not Graduate,No,3814,1483,124,300,1,Semiurban,Y
507 | LP002622,Male,Yes,2,Graduate,No,3510,4416,243,360,1,Rural,Y
508 | LP002624,Male,Yes,0,Graduate,No,20833,6667,480,360,,Urban,Y
509 | LP002625,,No,0,Graduate,No,3583,0,96,360,1,Urban,N
510 | LP002626,Male,Yes,0,Graduate,Yes,2479,3013,188,360,1,Urban,Y
511 | LP002634,Female,No,1,Graduate,No,13262,0,40,360,1,Urban,Y
512 | LP002637,Male,No,0,Not Graduate,No,3598,1287,100,360,1,Rural,N
513 | LP002640,Male,Yes,1,Graduate,No,6065,2004,250,360,1,Semiurban,Y
514 | LP002643,Male,Yes,2,Graduate,No,3283,2035,148,360,1,Urban,Y
515 | LP002648,Male,Yes,0,Graduate,No,2130,6666,70,180,1,Semiurban,N
516 | LP002652,Male,No,0,Graduate,No,5815,3666,311,360,1,Rural,N
517 | LP002659,Male,Yes,3+,Graduate,No,3466,3428,150,360,1,Rural,Y
518 | LP002670,Female,Yes,2,Graduate,No,2031,1632,113,480,1,Semiurban,Y
519 | LP002682,Male,Yes,,Not Graduate,No,3074,1800,123,360,0,Semiurban,N
520 | LP002683,Male,No,0,Graduate,No,4683,1915,185,360,1,Semiurban,N
521 | LP002684,Female,No,0,Not Graduate,No,3400,0,95,360,1,Rural,N
522 | LP002689,Male,Yes,2,Not Graduate,No,2192,1742,45,360,1,Semiurban,Y
523 | LP002690,Male,No,0,Graduate,No,2500,0,55,360,1,Semiurban,Y
524 | LP002692,Male,Yes,3+,Graduate,Yes,5677,1424,100,360,1,Rural,Y
525 | LP002693,Male,Yes,2,Graduate,Yes,7948,7166,480,360,1,Rural,Y
526 | LP002697,Male,No,0,Graduate,No,4680,2087,,360,1,Semiurban,N
527 | LP002699,Male,Yes,2,Graduate,Yes,17500,0,400,360,1,Rural,Y
528 | LP002705,Male,Yes,0,Graduate,No,3775,0,110,360,1,Semiurban,Y
529 | LP002706,Male,Yes,1,Not Graduate,No,5285,1430,161,360,0,Semiurban,Y
530 | LP002714,Male,No,1,Not Graduate,No,2679,1302,94,360,1,Semiurban,Y
531 | LP002716,Male,No,0,Not Graduate,No,6783,0,130,360,1,Semiurban,Y
532 | LP002717,Male,Yes,0,Graduate,No,1025,5500,216,360,,Rural,Y
533 | LP002720,Male,Yes,3+,Graduate,No,4281,0,100,360,1,Urban,Y
534 | LP002723,Male,No,2,Graduate,No,3588,0,110,360,0,Rural,N
535 | LP002729,Male,No,1,Graduate,No,11250,0,196,360,,Semiurban,N
536 | LP002731,Female,No,0,Not Graduate,Yes,18165,0,125,360,1,Urban,Y
537 | LP002732,Male,No,0,Not Graduate,,2550,2042,126,360,1,Rural,Y
538 | LP002734,Male,Yes,0,Graduate,No,6133,3906,324,360,1,Urban,Y
539 | LP002738,Male,No,2,Graduate,No,3617,0,107,360,1,Semiurban,Y
540 | LP002739,Male,Yes,0,Not Graduate,No,2917,536,66,360,1,Rural,N
541 | LP002740,Male,Yes,3+,Graduate,No,6417,0,157,180,1,Rural,Y
542 | LP002741,Female,Yes,1,Graduate,No,4608,2845,140,180,1,Semiurban,Y
543 | LP002743,Female,No,0,Graduate,No,2138,0,99,360,0,Semiurban,N
544 | LP002753,Female,No,1,Graduate,,3652,0,95,360,1,Semiurban,Y
545 | LP002755,Male,Yes,1,Not Graduate,No,2239,2524,128,360,1,Urban,Y
546 | LP002757,Female,Yes,0,Not Graduate,No,3017,663,102,360,,Semiurban,Y
547 | LP002767,Male,Yes,0,Graduate,No,2768,1950,155,360,1,Rural,Y
548 | LP002768,Male,No,0,Not Graduate,No,3358,0,80,36,1,Semiurban,N
549 | LP002772,Male,No,0,Graduate,No,2526,1783,145,360,1,Rural,Y
550 | LP002776,Female,No,0,Graduate,No,5000,0,103,360,0,Semiurban,N
551 | LP002777,Male,Yes,0,Graduate,No,2785,2016,110,360,1,Rural,Y
552 | LP002778,Male,Yes,2,Graduate,Yes,6633,0,,360,0,Rural,N
553 | LP002784,Male,Yes,1,Not Graduate,No,2492,2375,,360,1,Rural,Y
554 | LP002785,Male,Yes,1,Graduate,No,3333,3250,158,360,1,Urban,Y
555 | LP002788,Male,Yes,0,Not Graduate,No,2454,2333,181,360,0,Urban,N
556 | LP002789,Male,Yes,0,Graduate,No,3593,4266,132,180,0,Rural,N
557 | LP002792,Male,Yes,1,Graduate,No,5468,1032,26,360,1,Semiurban,Y
558 | LP002794,Female,No,0,Graduate,No,2667,1625,84,360,,Urban,Y
559 | LP002795,Male,Yes,3+,Graduate,Yes,10139,0,260,360,1,Semiurban,Y
560 | LP002798,Male,Yes,0,Graduate,No,3887,2669,162,360,1,Semiurban,Y
561 | LP002804,Female,Yes,0,Graduate,No,4180,2306,182,360,1,Semiurban,Y
562 | LP002807,Male,Yes,2,Not Graduate,No,3675,242,108,360,1,Semiurban,Y
563 | LP002813,Female,Yes,1,Graduate,Yes,19484,0,600,360,1,Semiurban,Y
564 | LP002820,Male,Yes,0,Graduate,No,5923,2054,211,360,1,Rural,Y
565 | LP002821,Male,No,0,Not Graduate,Yes,5800,0,132,360,1,Semiurban,Y
566 | LP002832,Male,Yes,2,Graduate,No,8799,0,258,360,0,Urban,N
567 | LP002833,Male,Yes,0,Not Graduate,No,4467,0,120,360,,Rural,Y
568 | LP002836,Male,No,0,Graduate,No,3333,0,70,360,1,Urban,Y
569 | LP002837,Male,Yes,3+,Graduate,No,3400,2500,123,360,0,Rural,N
570 | LP002840,Female,No,0,Graduate,No,2378,0,9,360,1,Urban,N
571 | LP002841,Male,Yes,0,Graduate,No,3166,2064,104,360,0,Urban,N
572 | LP002842,Male,Yes,1,Graduate,No,3417,1750,186,360,1,Urban,Y
573 | LP002847,Male,Yes,,Graduate,No,5116,1451,165,360,0,Urban,N
574 | LP002855,Male,Yes,2,Graduate,No,16666,0,275,360,1,Urban,Y
575 | LP002862,Male,Yes,2,Not Graduate,No,6125,1625,187,480,1,Semiurban,N
576 | LP002863,Male,Yes,3+,Graduate,No,6406,0,150,360,1,Semiurban,N
577 | LP002868,Male,Yes,2,Graduate,No,3159,461,108,84,1,Urban,Y
578 | LP002872,,Yes,0,Graduate,No,3087,2210,136,360,0,Semiurban,N
579 | LP002874,Male,No,0,Graduate,No,3229,2739,110,360,1,Urban,Y
580 | LP002877,Male,Yes,1,Graduate,No,1782,2232,107,360,1,Rural,Y
581 | LP002888,Male,No,0,Graduate,,3182,2917,161,360,1,Urban,Y
582 | LP002892,Male,Yes,2,Graduate,No,6540,0,205,360,1,Semiurban,Y
583 | LP002893,Male,No,0,Graduate,No,1836,33837,90,360,1,Urban,N
584 | LP002894,Female,Yes,0,Graduate,No,3166,0,36,360,1,Semiurban,Y
585 | LP002898,Male,Yes,1,Graduate,No,1880,0,61,360,,Rural,N
586 | LP002911,Male,Yes,1,Graduate,No,2787,1917,146,360,0,Rural,N
587 | LP002912,Male,Yes,1,Graduate,No,4283,3000,172,84,1,Rural,N
588 | LP002916,Male,Yes,0,Graduate,No,2297,1522,104,360,1,Urban,Y
589 | LP002917,Female,No,0,Not Graduate,No,2165,0,70,360,1,Semiurban,Y
590 | LP002925,,No,0,Graduate,No,4750,0,94,360,1,Semiurban,Y
591 | LP002926,Male,Yes,2,Graduate,Yes,2726,0,106,360,0,Semiurban,N
592 | LP002928,Male,Yes,0,Graduate,No,3000,3416,56,180,1,Semiurban,Y
593 | LP002931,Male,Yes,2,Graduate,Yes,6000,0,205,240,1,Semiurban,N
594 | LP002933,,No,3+,Graduate,Yes,9357,0,292,360,1,Semiurban,Y
595 | LP002936,Male,Yes,0,Graduate,No,3859,3300,142,180,1,Rural,Y
596 | LP002938,Male,Yes,0,Graduate,Yes,16120,0,260,360,1,Urban,Y
597 | LP002940,Male,No,0,Not Graduate,No,3833,0,110,360,1,Rural,Y
598 | LP002941,Male,Yes,2,Not Graduate,Yes,6383,1000,187,360,1,Rural,N
599 | LP002943,Male,No,,Graduate,No,2987,0,88,360,0,Semiurban,N
600 | LP002945,Male,Yes,0,Graduate,Yes,9963,0,180,360,1,Rural,Y
601 | LP002948,Male,Yes,2,Graduate,No,5780,0,192,360,1,Urban,Y
602 | LP002949,Female,No,3+,Graduate,,416,41667,350,180,,Urban,N
603 | LP002950,Male,Yes,0,Not Graduate,,2894,2792,155,360,1,Rural,Y
604 | LP002953,Male,Yes,3+,Graduate,No,5703,0,128,360,1,Urban,Y
605 | LP002958,Male,No,0,Graduate,No,3676,4301,172,360,1,Rural,Y
606 | LP002959,Female,Yes,1,Graduate,No,12000,0,496,360,1,Semiurban,Y
607 | LP002960,Male,Yes,0,Not Graduate,No,2400,3800,,180,1,Urban,N
608 | LP002961,Male,Yes,1,Graduate,No,3400,2500,173,360,1,Semiurban,Y
609 | LP002964,Male,Yes,2,Not Graduate,No,3987,1411,157,360,1,Rural,Y
610 | LP002974,Male,Yes,0,Graduate,No,3232,1950,108,360,1,Rural,Y
611 | LP002978,Female,No,0,Graduate,No,2900,0,71,360,1,Rural,Y
612 | LP002979,Male,Yes,3+,Graduate,No,4106,0,40,180,1,Rural,Y
613 | LP002983,Male,Yes,1,Graduate,No,8072,240,253,360,1,Urban,Y
614 | LP002984,Male,Yes,2,Graduate,No,7583,0,187,360,1,Urban,Y
615 | LP002990,Female,No,0,Graduate,Yes,4583,0,133,360,0,Semiurban,N
616 | 


--------------------------------------------------------------------------------
/Machine Learning/data/creditcard.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Machine Learning/data/creditcard.rar


--------------------------------------------------------------------------------
/NLP/Natural Language Processing(NLP) Concepts - Hackers Realm.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Tokenization\n",
   8 |     "\n",
   9 |     "Tokenization is the process of breaking down the given text in natural language processing into the smallest unit in a sentence called a token. Punctuation marks, words, and numbers can be considered tokens."
  10 |    ]
  11 |   },
  12 |   {
  13 |    "cell_type": "code",
  14 |    "execution_count": 6,
  15 |    "metadata": {},
  16 |    "outputs": [],
  17 |    "source": [
  18 |     "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
  19 |    ]
  20 |   },
  21 |   {
  22 |    "cell_type": "code",
  23 |    "execution_count": 7,
  24 |    "metadata": {},
  25 |    "outputs": [
  26 |     {
  27 |      "data": {
  28 |       "text/plain": [
  29 |        "['Hi',\n",
  30 |        " 'Everyone!',\n",
  31 |        " 'This',\n",
  32 |        " 'is',\n",
  33 |        " 'Hackers',\n",
  34 |        " 'Realm.',\n",
  35 |        " 'We',\n",
  36 |        " 'are',\n",
  37 |        " 'learning',\n",
  38 |        " 'Natural',\n",
  39 |        " 'Language',\n",
  40 |        " 'Processing.',\n",
  41 |        " 'We',\n",
  42 |        " 'reached',\n",
  43 |        " '1000000',\n",
  44 |        " 'views.']"
  45 |       ]
  46 |      },
  47 |      "execution_count": 7,
  48 |      "metadata": {},
  49 |      "output_type": "execute_result"
  50 |     }
  51 |    ],
  52 |    "source": [
  53 |     "text.split(' ')"
  54 |    ]
  55 |   },
  56 |   {
  57 |    "cell_type": "code",
  58 |    "execution_count": 8,
  59 |    "metadata": {},
  60 |    "outputs": [],
  61 |    "source": [
  62 |     "from nltk import sent_tokenize, word_tokenize"
  63 |    ]
  64 |   },
  65 |   {
  66 |    "cell_type": "code",
  67 |    "execution_count": 9,
  68 |    "metadata": {},
  69 |    "outputs": [
  70 |     {
  71 |      "data": {
  72 |       "text/plain": [
  73 |        "['Hi Everyone!',\n",
  74 |        " 'This is Hackers Realm.',\n",
  75 |        " 'We are learning Natural Language Processing.',\n",
  76 |        " 'We reached 1000000 views.']"
  77 |       ]
  78 |      },
  79 |      "execution_count": 9,
  80 |      "metadata": {},
  81 |      "output_type": "execute_result"
  82 |     }
  83 |    ],
  84 |    "source": [
  85 |     "# split the text into sentences\n",
  86 |     "sent_tokens = sent_tokenize(text)\n",
  87 |     "sent_tokens"
  88 |    ]
  89 |   },
  90 |   {
  91 |    "cell_type": "code",
  92 |    "execution_count": 10,
  93 |    "metadata": {},
  94 |    "outputs": [
  95 |     {
  96 |      "data": {
  97 |       "text/plain": [
  98 |        "['Hi',\n",
  99 |        " 'Everyone',\n",
 100 |        " '!',\n",
 101 |        " 'This',\n",
 102 |        " 'is',\n",
 103 |        " 'Hackers',\n",
 104 |        " 'Realm',\n",
 105 |        " '.',\n",
 106 |        " 'We',\n",
 107 |        " 'are',\n",
 108 |        " 'learning',\n",
 109 |        " 'Natural',\n",
 110 |        " 'Language',\n",
 111 |        " 'Processing',\n",
 112 |        " '.',\n",
 113 |        " 'We',\n",
 114 |        " 'reached',\n",
 115 |        " '1000000',\n",
 116 |        " 'views',\n",
 117 |        " '.']"
 118 |       ]
 119 |      },
 120 |      "execution_count": 10,
 121 |      "metadata": {},
 122 |      "output_type": "execute_result"
 123 |     }
 124 |    ],
 125 |    "source": [
 126 |     "# split the text into words\n",
 127 |     "word_tokens = word_tokenize(text)\n",
 128 |     "word_tokens"
 129 |    ]
 130 |   },
 131 |   {
 132 |    "cell_type": "code",
 133 |    "execution_count": null,
 134 |    "metadata": {},
 135 |    "outputs": [],
 136 |    "source": []
 137 |   },
 138 |   {
 139 |    "cell_type": "markdown",
 140 |    "metadata": {},
 141 |    "source": [
 142 |     "# Stemming\n",
 143 |     "\n",
 144 |     "Stemming is the process of finding the root of words. A word stem need not be the same root as a dictionary-based morphological root, it just is an equal to or smaller form of the word."
 145 |    ]
 146 |   },
 147 |   {
 148 |    "cell_type": "code",
 149 |    "execution_count": 13,
 150 |    "metadata": {},
 151 |    "outputs": [],
 152 |    "source": [
 153 |     "from nltk.stem import PorterStemmer, SnowballStemmer\n",
 154 |     "ps = PorterStemmer()"
 155 |    ]
 156 |   },
 157 |   {
 158 |    "cell_type": "code",
 159 |    "execution_count": 17,
 160 |    "metadata": {},
 161 |    "outputs": [
 162 |     {
 163 |      "data": {
 164 |       "text/plain": [
 165 |        "'eat'"
 166 |       ]
 167 |      },
 168 |      "execution_count": 17,
 169 |      "metadata": {},
 170 |      "output_type": "execute_result"
 171 |     }
 172 |    ],
 173 |    "source": [
 174 |     "word = ('eats')\n",
 175 |     "ps.stem(word)"
 176 |    ]
 177 |   },
 178 |   {
 179 |    "cell_type": "code",
 180 |    "execution_count": 16,
 181 |    "metadata": {},
 182 |    "outputs": [
 183 |     {
 184 |      "data": {
 185 |       "text/plain": [
 186 |        "'eat'"
 187 |       ]
 188 |      },
 189 |      "execution_count": 16,
 190 |      "metadata": {},
 191 |      "output_type": "execute_result"
 192 |     }
 193 |    ],
 194 |    "source": [
 195 |     "word = ('eating')\n",
 196 |     "ps.stem(word)"
 197 |    ]
 198 |   },
 199 |   {
 200 |    "cell_type": "code",
 201 |    "execution_count": 18,
 202 |    "metadata": {},
 203 |    "outputs": [
 204 |     {
 205 |      "data": {
 206 |       "text/plain": [
 207 |        "'eaten'"
 208 |       ]
 209 |      },
 210 |      "execution_count": 18,
 211 |      "metadata": {},
 212 |      "output_type": "execute_result"
 213 |     }
 214 |    ],
 215 |    "source": [
 216 |     "word = ('eaten')\n",
 217 |     "ps.stem(word)"
 218 |    ]
 219 |   },
 220 |   {
 221 |    "cell_type": "code",
 222 |    "execution_count": 19,
 223 |    "metadata": {},
 224 |    "outputs": [],
 225 |    "source": [
 226 |     "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
 227 |    ]
 228 |   },
 229 |   {
 230 |    "cell_type": "code",
 231 |    "execution_count": 20,
 232 |    "metadata": {},
 233 |    "outputs": [],
 234 |    "source": [
 235 |     "word_tokens = word_tokenize(text)"
 236 |    ]
 237 |   },
 238 |   {
 239 |    "cell_type": "code",
 240 |    "execution_count": 21,
 241 |    "metadata": {},
 242 |    "outputs": [
 243 |     {
 244 |      "data": {
 245 |       "text/plain": [
 246 |        "'Hi everyon ! thi is hacker realm . We are learn natur languag process . We reach 1000000 view .'"
 247 |       ]
 248 |      },
 249 |      "execution_count": 21,
 250 |      "metadata": {},
 251 |      "output_type": "execute_result"
 252 |     }
 253 |    ],
 254 |    "source": [
 255 |     "stemmed_sentence = \" \".join(ps.stem(word) for word in word_tokens)\n",
 256 |     "stemmed_sentence"
 257 |    ]
 258 |   },
 259 |   {
 260 |    "cell_type": "code",
 261 |    "execution_count": null,
 262 |    "metadata": {},
 263 |    "outputs": [],
 264 |    "source": []
 265 |   },
 266 |   {
 267 |    "cell_type": "markdown",
 268 |    "metadata": {},
 269 |    "source": [
 270 |     "# Lemmatization\n",
 271 |     "\n",
 272 |     "Lemmatization is the process of finding the form of the related word in the dictionary. It is different from Stemming. It involves longer processes to calculate than Stemming."
 273 |    ]
 274 |   },
 275 |   {
 276 |    "cell_type": "code",
 277 |    "execution_count": 22,
 278 |    "metadata": {},
 279 |    "outputs": [],
 280 |    "source": [
 281 |     "from nltk.stem import WordNetLemmatizer\n",
 282 |     "lemmatizer = WordNetLemmatizer()"
 283 |    ]
 284 |   },
 285 |   {
 286 |    "cell_type": "code",
 287 |    "execution_count": 30,
 288 |    "metadata": {},
 289 |    "outputs": [
 290 |     {
 291 |      "data": {
 292 |       "text/plain": [
 293 |        "'worker'"
 294 |       ]
 295 |      },
 296 |      "execution_count": 30,
 297 |      "metadata": {},
 298 |      "output_type": "execute_result"
 299 |     }
 300 |    ],
 301 |    "source": [
 302 |     "lemmatizer.lemmatize('workers')"
 303 |    ]
 304 |   },
 305 |   {
 306 |    "cell_type": "code",
 307 |    "execution_count": 31,
 308 |    "metadata": {},
 309 |    "outputs": [
 310 |     {
 311 |      "data": {
 312 |       "text/plain": [
 313 |        "'word'"
 314 |       ]
 315 |      },
 316 |      "execution_count": 31,
 317 |      "metadata": {},
 318 |      "output_type": "execute_result"
 319 |     }
 320 |    ],
 321 |    "source": [
 322 |     "lemmatizer.lemmatize('words')"
 323 |    ]
 324 |   },
 325 |   {
 326 |    "cell_type": "code",
 327 |    "execution_count": 37,
 328 |    "metadata": {},
 329 |    "outputs": [
 330 |     {
 331 |      "data": {
 332 |       "text/plain": [
 333 |        "'foot'"
 334 |       ]
 335 |      },
 336 |      "execution_count": 37,
 337 |      "metadata": {},
 338 |      "output_type": "execute_result"
 339 |     }
 340 |    ],
 341 |    "source": [
 342 |     "lemmatizer.lemmatize('feet')"
 343 |    ]
 344 |   },
 345 |   {
 346 |    "cell_type": "code",
 347 |    "execution_count": 39,
 348 |    "metadata": {},
 349 |    "outputs": [
 350 |     {
 351 |      "data": {
 352 |       "text/plain": [
 353 |        "'strip'"
 354 |       ]
 355 |      },
 356 |      "execution_count": 39,
 357 |      "metadata": {},
 358 |      "output_type": "execute_result"
 359 |     }
 360 |    ],
 361 |    "source": [
 362 |     "lemmatizer.lemmatize('stripes', 'v')"
 363 |    ]
 364 |   },
 365 |   {
 366 |    "cell_type": "code",
 367 |    "execution_count": 40,
 368 |    "metadata": {},
 369 |    "outputs": [
 370 |     {
 371 |      "data": {
 372 |       "text/plain": [
 373 |        "'stripe'"
 374 |       ]
 375 |      },
 376 |      "execution_count": 40,
 377 |      "metadata": {},
 378 |      "output_type": "execute_result"
 379 |     }
 380 |    ],
 381 |    "source": [
 382 |     "lemmatizer.lemmatize('stripes', 'n')"
 383 |    ]
 384 |   },
 385 |   {
 386 |    "cell_type": "code",
 387 |    "execution_count": 41,
 388 |    "metadata": {},
 389 |    "outputs": [],
 390 |    "source": [
 391 |     "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
 392 |    ]
 393 |   },
 394 |   {
 395 |    "cell_type": "code",
 396 |    "execution_count": 42,
 397 |    "metadata": {},
 398 |    "outputs": [],
 399 |    "source": [
 400 |     "word_tokens = word_tokenize(text)"
 401 |    ]
 402 |   },
 403 |   {
 404 |    "cell_type": "code",
 405 |    "execution_count": 44,
 406 |    "metadata": {},
 407 |    "outputs": [
 408 |     {
 409 |      "data": {
 410 |       "text/plain": [
 411 |        "'hi everyone ! this is hacker realm . we are learning natural language processing . we reached 1000000 view .'"
 412 |       ]
 413 |      },
 414 |      "execution_count": 44,
 415 |      "metadata": {},
 416 |      "output_type": "execute_result"
 417 |     }
 418 |    ],
 419 |    "source": [
 420 |     "lemmatized_sentence = \" \".join(lemmatizer.lemmatize(word.lower()) for word in word_tokens)\n",
 421 |     "lemmatized_sentence"
 422 |    ]
 423 |   },
 424 |   {
 425 |    "cell_type": "code",
 426 |    "execution_count": null,
 427 |    "metadata": {},
 428 |    "outputs": [],
 429 |    "source": []
 430 |   },
 431 |   {
 432 |    "cell_type": "markdown",
 433 |    "metadata": {},
 434 |    "source": [
 435 |     "# Part of Speech Tagging (POS)\n",
 436 |     "\n",
 437 |     "Part of Speech Tagging is a process of converting a sentence to forms — list of words, list of tuples (where each tuple is having a form (word, tag)). The tag in case of is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.\n",
 438 |     "\n",
 439 |     "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html"
 440 |    ]
 441 |   },
 442 |   {
 443 |    "cell_type": "code",
 444 |    "execution_count": 45,
 445 |    "metadata": {},
 446 |    "outputs": [],
 447 |    "source": [
 448 |     "from nltk import pos_tag"
 449 |    ]
 450 |   },
 451 |   {
 452 |    "cell_type": "code",
 453 |    "execution_count": 51,
 454 |    "metadata": {},
 455 |    "outputs": [
 456 |     {
 457 |      "data": {
 458 |       "text/plain": [
 459 |        "[('fighting', 'VBG')]"
 460 |       ]
 461 |      },
 462 |      "execution_count": 51,
 463 |      "metadata": {},
 464 |      "output_type": "execute_result"
 465 |     }
 466 |    ],
 467 |    "source": [
 468 |     "pos_tag(['fighting'])"
 469 |    ]
 470 |   },
 471 |   {
 472 |    "cell_type": "code",
 473 |    "execution_count": 46,
 474 |    "metadata": {},
 475 |    "outputs": [],
 476 |    "source": [
 477 |     "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
 478 |    ]
 479 |   },
 480 |   {
 481 |    "cell_type": "code",
 482 |    "execution_count": 47,
 483 |    "metadata": {},
 484 |    "outputs": [],
 485 |    "source": [
 486 |     "word_tokens = word_tokenize(text)"
 487 |    ]
 488 |   },
 489 |   {
 490 |    "cell_type": "code",
 491 |    "execution_count": 52,
 492 |    "metadata": {},
 493 |    "outputs": [
 494 |     {
 495 |      "data": {
 496 |       "text/plain": [
 497 |        "[('Hi', 'NNP'),\n",
 498 |        " ('Everyone', 'NN'),\n",
 499 |        " ('!', '.'),\n",
 500 |        " ('This', 'DT'),\n",
 501 |        " ('is', 'VBZ'),\n",
 502 |        " ('Hackers', 'NNP'),\n",
 503 |        " ('Realm', 'NNP'),\n",
 504 |        " ('.', '.'),\n",
 505 |        " ('We', 'PRP'),\n",
 506 |        " ('are', 'VBP'),\n",
 507 |        " ('learning', 'VBG'),\n",
 508 |        " ('Natural', 'NNP'),\n",
 509 |        " ('Language', 'NNP'),\n",
 510 |        " ('Processing', 'NNP'),\n",
 511 |        " ('.', '.'),\n",
 512 |        " ('We', 'PRP'),\n",
 513 |        " ('reached', 'VBD'),\n",
 514 |        " ('1000000', 'CD'),\n",
 515 |        " ('views', 'NNS'),\n",
 516 |        " ('.', '.')]"
 517 |       ]
 518 |      },
 519 |      "execution_count": 52,
 520 |      "metadata": {},
 521 |      "output_type": "execute_result"
 522 |     }
 523 |    ],
 524 |    "source": [
 525 |     "pos_tag(word_tokens)"
 526 |    ]
 527 |   },
 528 |   {
 529 |    "cell_type": "code",
 530 |    "execution_count": null,
 531 |    "metadata": {},
 532 |    "outputs": [],
 533 |    "source": []
 534 |   },
 535 |   {
 536 |    "cell_type": "markdown",
 537 |    "metadata": {},
 538 |    "source": [
 539 |     "# Text Preprocessing (Clean Data)"
 540 |    ]
 541 |   },
 542 |   {
 543 |    "cell_type": "code",
 544 |    "execution_count": 9,
 545 |    "metadata": {},
 546 |    "outputs": [
 547 |     {
 548 |      "data": {
 549 |       "text/html": [
 550 |        "<div>\n",
 551 |        "<style scoped>\n",
 552 |        "    .dataframe tbody tr th:only-of-type {\n",
 553 |        "        vertical-align: middle;\n",
 554 |        "    }\n",
 555 |        "\n",
 556 |        "    .dataframe tbody tr th {\n",
 557 |        "        vertical-align: top;\n",
 558 |        "    }\n",
 559 |        "\n",
 560 |        "    .dataframe thead th {\n",
 561 |        "        text-align: right;\n",
 562 |        "    }\n",
 563 |        "</style>\n",
 564 |        "<table border=\"1\" class=\"dataframe\">\n",
 565 |        "  <thead>\n",
 566 |        "    <tr style=\"text-align: right;\">\n",
 567 |        "      <th></th>\n",
 568 |        "      <th>tweet</th>\n",
 569 |        "    </tr>\n",
 570 |        "  </thead>\n",
 571 |        "  <tbody>\n",
 572 |        "    <tr>\n",
 573 |        "      <th>0</th>\n",
 574 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
 575 |        "    </tr>\n",
 576 |        "    <tr>\n",
 577 |        "      <th>1</th>\n",
 578 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
 579 |        "    </tr>\n",
 580 |        "    <tr>\n",
 581 |        "      <th>2</th>\n",
 582 |        "      <td>bihday your majesty</td>\n",
 583 |        "    </tr>\n",
 584 |        "    <tr>\n",
 585 |        "      <th>3</th>\n",
 586 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
 587 |        "    </tr>\n",
 588 |        "    <tr>\n",
 589 |        "      <th>4</th>\n",
 590 |        "      <td>factsguide: society now    #motivation</td>\n",
 591 |        "    </tr>\n",
 592 |        "  </tbody>\n",
 593 |        "</table>\n",
 594 |        "</div>"
 595 |       ],
 596 |       "text/plain": [
 597 |        "                                               tweet\n",
 598 |        "0   @user when a father is dysfunctional and is s...\n",
 599 |        "1  @user @user thanks for #lyft credit i can't us...\n",
 600 |        "2                                bihday your majesty\n",
 601 |        "3  #model   i love u take with u all the time in ...\n",
 602 |        "4             factsguide: society now    #motivation"
 603 |       ]
 604 |      },
 605 |      "execution_count": 9,
 606 |      "metadata": {},
 607 |      "output_type": "execute_result"
 608 |     }
 609 |    ],
 610 |    "source": [
 611 |     "import pandas as pd\n",
 612 |     "import string\n",
 613 |     "df = pd.read_csv('data/Twitter Sentiments.csv')\n",
 614 |     "# drop the columns\n",
 615 |     "df = df.drop(columns=['id', 'label'], axis=1)\n",
 616 |     "df.head()"
 617 |    ]
 618 |   },
 619 |   {
 620 |    "cell_type": "markdown",
 621 |    "metadata": {},
 622 |    "source": [
 623 |     "## Convert to lowercase"
 624 |    ]
 625 |   },
 626 |   {
 627 |    "cell_type": "code",
 628 |    "execution_count": 10,
 629 |    "metadata": {},
 630 |    "outputs": [
 631 |     {
 632 |      "data": {
 633 |       "text/html": [
 634 |        "<div>\n",
 635 |        "<style scoped>\n",
 636 |        "    .dataframe tbody tr th:only-of-type {\n",
 637 |        "        vertical-align: middle;\n",
 638 |        "    }\n",
 639 |        "\n",
 640 |        "    .dataframe tbody tr th {\n",
 641 |        "        vertical-align: top;\n",
 642 |        "    }\n",
 643 |        "\n",
 644 |        "    .dataframe thead th {\n",
 645 |        "        text-align: right;\n",
 646 |        "    }\n",
 647 |        "</style>\n",
 648 |        "<table border=\"1\" class=\"dataframe\">\n",
 649 |        "  <thead>\n",
 650 |        "    <tr style=\"text-align: right;\">\n",
 651 |        "      <th></th>\n",
 652 |        "      <th>tweet</th>\n",
 653 |        "      <th>clean_text</th>\n",
 654 |        "    </tr>\n",
 655 |        "  </thead>\n",
 656 |        "  <tbody>\n",
 657 |        "    <tr>\n",
 658 |        "      <th>0</th>\n",
 659 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
 660 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
 661 |        "    </tr>\n",
 662 |        "    <tr>\n",
 663 |        "      <th>1</th>\n",
 664 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
 665 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
 666 |        "    </tr>\n",
 667 |        "    <tr>\n",
 668 |        "      <th>2</th>\n",
 669 |        "      <td>bihday your majesty</td>\n",
 670 |        "      <td>bihday your majesty</td>\n",
 671 |        "    </tr>\n",
 672 |        "    <tr>\n",
 673 |        "      <th>3</th>\n",
 674 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
 675 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
 676 |        "    </tr>\n",
 677 |        "    <tr>\n",
 678 |        "      <th>4</th>\n",
 679 |        "      <td>factsguide: society now    #motivation</td>\n",
 680 |        "      <td>factsguide: society now    #motivation</td>\n",
 681 |        "    </tr>\n",
 682 |        "  </tbody>\n",
 683 |        "</table>\n",
 684 |        "</div>"
 685 |       ],
 686 |       "text/plain": [
 687 |        "                                               tweet  \\\n",
 688 |        "0   @user when a father is dysfunctional and is s...   \n",
 689 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
 690 |        "2                                bihday your majesty   \n",
 691 |        "3  #model   i love u take with u all the time in ...   \n",
 692 |        "4             factsguide: society now    #motivation   \n",
 693 |        "\n",
 694 |        "                                          clean_text  \n",
 695 |        "0   @user when a father is dysfunctional and is s...  \n",
 696 |        "1  @user @user thanks for #lyft credit i can't us...  \n",
 697 |        "2                                bihday your majesty  \n",
 698 |        "3  #model   i love u take with u all the time in ...  \n",
 699 |        "4             factsguide: society now    #motivation  "
 700 |       ]
 701 |      },
 702 |      "execution_count": 10,
 703 |      "metadata": {},
 704 |      "output_type": "execute_result"
 705 |     }
 706 |    ],
 707 |    "source": [
 708 |     "df['clean_text'] = df['tweet'].str.lower()\n",
 709 |     "df.head()"
 710 |    ]
 711 |   },
 712 |   {
 713 |    "cell_type": "markdown",
 714 |    "metadata": {},
 715 |    "source": [
 716 |     "## Removal of Punctuations"
 717 |    ]
 718 |   },
 719 |   {
 720 |    "cell_type": "code",
 721 |    "execution_count": 12,
 722 |    "metadata": {},
 723 |    "outputs": [
 724 |     {
 725 |      "data": {
 726 |       "text/plain": [
 727 |        "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'"
 728 |       ]
 729 |      },
 730 |      "execution_count": 12,
 731 |      "metadata": {},
 732 |      "output_type": "execute_result"
 733 |     }
 734 |    ],
 735 |    "source": [
 736 |     "string.punctuation"
 737 |    ]
 738 |   },
 739 |   {
 740 |    "cell_type": "code",
 741 |    "execution_count": 13,
 742 |    "metadata": {},
 743 |    "outputs": [],
 744 |    "source": [
 745 |     "def remove_punctuations(text):\n",
 746 |     "    punctuations = string.punctuation\n",
 747 |     "    return text.translate(str.maketrans('', '', punctuations))"
 748 |    ]
 749 |   },
 750 |   {
 751 |    "cell_type": "code",
 752 |    "execution_count": 14,
 753 |    "metadata": {},
 754 |    "outputs": [
 755 |     {
 756 |      "data": {
 757 |       "text/html": [
 758 |        "<div>\n",
 759 |        "<style scoped>\n",
 760 |        "    .dataframe tbody tr th:only-of-type {\n",
 761 |        "        vertical-align: middle;\n",
 762 |        "    }\n",
 763 |        "\n",
 764 |        "    .dataframe tbody tr th {\n",
 765 |        "        vertical-align: top;\n",
 766 |        "    }\n",
 767 |        "\n",
 768 |        "    .dataframe thead th {\n",
 769 |        "        text-align: right;\n",
 770 |        "    }\n",
 771 |        "</style>\n",
 772 |        "<table border=\"1\" class=\"dataframe\">\n",
 773 |        "  <thead>\n",
 774 |        "    <tr style=\"text-align: right;\">\n",
 775 |        "      <th></th>\n",
 776 |        "      <th>tweet</th>\n",
 777 |        "      <th>clean_text</th>\n",
 778 |        "    </tr>\n",
 779 |        "  </thead>\n",
 780 |        "  <tbody>\n",
 781 |        "    <tr>\n",
 782 |        "      <th>0</th>\n",
 783 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
 784 |        "      <td>user when a father is dysfunctional and is so...</td>\n",
 785 |        "    </tr>\n",
 786 |        "    <tr>\n",
 787 |        "      <th>1</th>\n",
 788 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
 789 |        "      <td>user user thanks for lyft credit i cant use ca...</td>\n",
 790 |        "    </tr>\n",
 791 |        "    <tr>\n",
 792 |        "      <th>2</th>\n",
 793 |        "      <td>bihday your majesty</td>\n",
 794 |        "      <td>bihday your majesty</td>\n",
 795 |        "    </tr>\n",
 796 |        "    <tr>\n",
 797 |        "      <th>3</th>\n",
 798 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
 799 |        "      <td>model   i love u take with u all the time in u...</td>\n",
 800 |        "    </tr>\n",
 801 |        "    <tr>\n",
 802 |        "      <th>4</th>\n",
 803 |        "      <td>factsguide: society now    #motivation</td>\n",
 804 |        "      <td>factsguide society now    motivation</td>\n",
 805 |        "    </tr>\n",
 806 |        "  </tbody>\n",
 807 |        "</table>\n",
 808 |        "</div>"
 809 |       ],
 810 |       "text/plain": [
 811 |        "                                               tweet  \\\n",
 812 |        "0   @user when a father is dysfunctional and is s...   \n",
 813 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
 814 |        "2                                bihday your majesty   \n",
 815 |        "3  #model   i love u take with u all the time in ...   \n",
 816 |        "4             factsguide: society now    #motivation   \n",
 817 |        "\n",
 818 |        "                                          clean_text  \n",
 819 |        "0   user when a father is dysfunctional and is so...  \n",
 820 |        "1  user user thanks for lyft credit i cant use ca...  \n",
 821 |        "2                                bihday your majesty  \n",
 822 |        "3  model   i love u take with u all the time in u...  \n",
 823 |        "4               factsguide society now    motivation  "
 824 |       ]
 825 |      },
 826 |      "execution_count": 14,
 827 |      "metadata": {},
 828 |      "output_type": "execute_result"
 829 |     }
 830 |    ],
 831 |    "source": [
 832 |     "df['clean_text'] = df['clean_text'].apply(lambda x: remove_punctuations(x))\n",
 833 |     "df.head()"
 834 |    ]
 835 |   },
 836 |   {
 837 |    "cell_type": "markdown",
 838 |    "metadata": {},
 839 |    "source": [
 840 |     "## Removal of Stopwords"
 841 |    ]
 842 |   },
 843 |   {
 844 |    "cell_type": "code",
 845 |    "execution_count": 17,
 846 |    "metadata": {},
 847 |    "outputs": [
 848 |     {
 849 |      "data": {
 850 |       "text/plain": [
 851 |        "\"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't\""
 852 |       ]
 853 |      },
 854 |      "execution_count": 17,
 855 |      "metadata": {},
 856 |      "output_type": "execute_result"
 857 |     }
 858 |    ],
 859 |    "source": [
 860 |     "from nltk.corpus import stopwords\n",
 861 |     "\", \".join(stopwords.words('english'))"
 862 |    ]
 863 |   },
 864 |   {
 865 |    "cell_type": "code",
 866 |    "execution_count": 18,
 867 |    "metadata": {},
 868 |    "outputs": [],
 869 |    "source": [
 870 |     "STOPWORDS = set(stopwords.words('english'))\n",
 871 |     "def remove_stopwords(text):\n",
 872 |     "    return \" \".join([word for word in text.split() if word not in STOPWORDS])"
 873 |    ]
 874 |   },
 875 |   {
 876 |    "cell_type": "code",
 877 |    "execution_count": 19,
 878 |    "metadata": {},
 879 |    "outputs": [
 880 |     {
 881 |      "data": {
 882 |       "text/html": [
 883 |        "<div>\n",
 884 |        "<style scoped>\n",
 885 |        "    .dataframe tbody tr th:only-of-type {\n",
 886 |        "        vertical-align: middle;\n",
 887 |        "    }\n",
 888 |        "\n",
 889 |        "    .dataframe tbody tr th {\n",
 890 |        "        vertical-align: top;\n",
 891 |        "    }\n",
 892 |        "\n",
 893 |        "    .dataframe thead th {\n",
 894 |        "        text-align: right;\n",
 895 |        "    }\n",
 896 |        "</style>\n",
 897 |        "<table border=\"1\" class=\"dataframe\">\n",
 898 |        "  <thead>\n",
 899 |        "    <tr style=\"text-align: right;\">\n",
 900 |        "      <th></th>\n",
 901 |        "      <th>tweet</th>\n",
 902 |        "      <th>clean_text</th>\n",
 903 |        "    </tr>\n",
 904 |        "  </thead>\n",
 905 |        "  <tbody>\n",
 906 |        "    <tr>\n",
 907 |        "      <th>0</th>\n",
 908 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
 909 |        "      <td>user father dysfunctional selfish drags kids d...</td>\n",
 910 |        "    </tr>\n",
 911 |        "    <tr>\n",
 912 |        "      <th>1</th>\n",
 913 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
 914 |        "      <td>user user thanks lyft credit cant use cause do...</td>\n",
 915 |        "    </tr>\n",
 916 |        "    <tr>\n",
 917 |        "      <th>2</th>\n",
 918 |        "      <td>bihday your majesty</td>\n",
 919 |        "      <td>bihday majesty</td>\n",
 920 |        "    </tr>\n",
 921 |        "    <tr>\n",
 922 |        "      <th>3</th>\n",
 923 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
 924 |        "      <td>model love u take u time urð± ðððð...</td>\n",
 925 |        "    </tr>\n",
 926 |        "    <tr>\n",
 927 |        "      <th>4</th>\n",
 928 |        "      <td>factsguide: society now    #motivation</td>\n",
 929 |        "      <td>factsguide society motivation</td>\n",
 930 |        "    </tr>\n",
 931 |        "  </tbody>\n",
 932 |        "</table>\n",
 933 |        "</div>"
 934 |       ],
 935 |       "text/plain": [
 936 |        "                                               tweet  \\\n",
 937 |        "0   @user when a father is dysfunctional and is s...   \n",
 938 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
 939 |        "2                                bihday your majesty   \n",
 940 |        "3  #model   i love u take with u all the time in ...   \n",
 941 |        "4             factsguide: society now    #motivation   \n",
 942 |        "\n",
 943 |        "                                          clean_text  \n",
 944 |        "0  user father dysfunctional selfish drags kids d...  \n",
 945 |        "1  user user thanks lyft credit cant use cause do...  \n",
 946 |        "2                                     bihday majesty  \n",
 947 |        "3  model love u take u time urð± ðððð...  \n",
 948 |        "4                      factsguide society motivation  "
 949 |       ]
 950 |      },
 951 |      "execution_count": 19,
 952 |      "metadata": {},
 953 |      "output_type": "execute_result"
 954 |     }
 955 |    ],
 956 |    "source": [
 957 |     "df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))\n",
 958 |     "df.head()"
 959 |    ]
 960 |   },
 961 |   {
 962 |    "cell_type": "markdown",
 963 |    "metadata": {},
 964 |    "source": [
 965 |     "## Removal of Frequent Words"
 966 |    ]
 967 |   },
 968 |   {
 969 |    "cell_type": "code",
 970 |    "execution_count": 23,
 971 |    "metadata": {},
 972 |    "outputs": [
 973 |     {
 974 |      "data": {
 975 |       "text/plain": [
 976 |        "[('user', 17473),\n",
 977 |        " ('love', 2647),\n",
 978 |        " ('day', 2198),\n",
 979 |        " ('happy', 1663),\n",
 980 |        " ('amp', 1582),\n",
 981 |        " ('im', 1139),\n",
 982 |        " ('u', 1136),\n",
 983 |        " ('time', 1110),\n",
 984 |        " ('life', 1086),\n",
 985 |        " ('like', 1042)]"
 986 |       ]
 987 |      },
 988 |      "execution_count": 23,
 989 |      "metadata": {},
 990 |      "output_type": "execute_result"
 991 |     }
 992 |    ],
 993 |    "source": [
 994 |     "from collections import Counter\n",
 995 |     "word_count = Counter()\n",
 996 |     "for text in df['clean_text']:\n",
 997 |     "    for word in text.split():\n",
 998 |     "        word_count[word] += 1\n",
 999 |     "        \n",
1000 |     "word_count.most_common(10)"
1001 |    ]
1002 |   },
1003 |   {
1004 |    "cell_type": "code",
1005 |    "execution_count": 24,
1006 |    "metadata": {},
1007 |    "outputs": [],
1008 |    "source": [
1009 |     "FREQUENT_WORDS = set(word for (word, wc) in word_count.most_common(3))\n",
1010 |     "def remove_freq_words(text):\n",
1011 |     "    return \" \".join([word for word in text.split() if word not in FREQUENT_WORDS])"
1012 |    ]
1013 |   },
1014 |   {
1015 |    "cell_type": "code",
1016 |    "execution_count": 25,
1017 |    "metadata": {},
1018 |    "outputs": [
1019 |     {
1020 |      "data": {
1021 |       "text/html": [
1022 |        "<div>\n",
1023 |        "<style scoped>\n",
1024 |        "    .dataframe tbody tr th:only-of-type {\n",
1025 |        "        vertical-align: middle;\n",
1026 |        "    }\n",
1027 |        "\n",
1028 |        "    .dataframe tbody tr th {\n",
1029 |        "        vertical-align: top;\n",
1030 |        "    }\n",
1031 |        "\n",
1032 |        "    .dataframe thead th {\n",
1033 |        "        text-align: right;\n",
1034 |        "    }\n",
1035 |        "</style>\n",
1036 |        "<table border=\"1\" class=\"dataframe\">\n",
1037 |        "  <thead>\n",
1038 |        "    <tr style=\"text-align: right;\">\n",
1039 |        "      <th></th>\n",
1040 |        "      <th>tweet</th>\n",
1041 |        "      <th>clean_text</th>\n",
1042 |        "    </tr>\n",
1043 |        "  </thead>\n",
1044 |        "  <tbody>\n",
1045 |        "    <tr>\n",
1046 |        "      <th>0</th>\n",
1047 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
1048 |        "      <td>father dysfunctional selfish drags kids dysfun...</td>\n",
1049 |        "    </tr>\n",
1050 |        "    <tr>\n",
1051 |        "      <th>1</th>\n",
1052 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
1053 |        "      <td>thanks lyft credit cant use cause dont offer w...</td>\n",
1054 |        "    </tr>\n",
1055 |        "    <tr>\n",
1056 |        "      <th>2</th>\n",
1057 |        "      <td>bihday your majesty</td>\n",
1058 |        "      <td>bihday majesty</td>\n",
1059 |        "    </tr>\n",
1060 |        "    <tr>\n",
1061 |        "      <th>3</th>\n",
1062 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
1063 |        "      <td>model u take u time urð± ðððð ð...</td>\n",
1064 |        "    </tr>\n",
1065 |        "    <tr>\n",
1066 |        "      <th>4</th>\n",
1067 |        "      <td>factsguide: society now    #motivation</td>\n",
1068 |        "      <td>factsguide society motivation</td>\n",
1069 |        "    </tr>\n",
1070 |        "  </tbody>\n",
1071 |        "</table>\n",
1072 |        "</div>"
1073 |       ],
1074 |       "text/plain": [
1075 |        "                                               tweet  \\\n",
1076 |        "0   @user when a father is dysfunctional and is s...   \n",
1077 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
1078 |        "2                                bihday your majesty   \n",
1079 |        "3  #model   i love u take with u all the time in ...   \n",
1080 |        "4             factsguide: society now    #motivation   \n",
1081 |        "\n",
1082 |        "                                          clean_text  \n",
1083 |        "0  father dysfunctional selfish drags kids dysfun...  \n",
1084 |        "1  thanks lyft credit cant use cause dont offer w...  \n",
1085 |        "2                                     bihday majesty  \n",
1086 |        "3  model u take u time urð± ðððð ð...  \n",
1087 |        "4                      factsguide society motivation  "
1088 |       ]
1089 |      },
1090 |      "execution_count": 25,
1091 |      "metadata": {},
1092 |      "output_type": "execute_result"
1093 |     }
1094 |    ],
1095 |    "source": [
1096 |     "df['clean_text'] = df['clean_text'].apply(lambda x: remove_freq_words(x))\n",
1097 |     "df.head()"
1098 |    ]
1099 |   },
1100 |   {
1101 |    "cell_type": "markdown",
1102 |    "metadata": {},
1103 |    "source": [
1104 |     "## Removal of Rare Words"
1105 |    ]
1106 |   },
1107 |   {
1108 |    "cell_type": "code",
1109 |    "execution_count": 30,
1110 |    "metadata": {},
1111 |    "outputs": [
1112 |     {
1113 |      "data": {
1114 |       "text/plain": [
1115 |        "{'airwaves',\n",
1116 |        " 'carnt',\n",
1117 |        " 'chisolm',\n",
1118 |        " 'ibizabringitonmallorcaholidayssummer',\n",
1119 |        " 'isz',\n",
1120 |        " 'mantle',\n",
1121 |        " 'shirley',\n",
1122 |        " 'youuuð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dâ\\x9d¤ï¸\\x8f',\n",
1123 |        " 'ð\\x9f\\x99\\x8fð\\x9f\\x8f¼ð\\x9f\\x8d¹ð\\x9f\\x98\\x8eð\\x9f\\x8eµ'}"
1124 |       ]
1125 |      },
1126 |      "execution_count": 30,
1127 |      "metadata": {},
1128 |      "output_type": "execute_result"
1129 |     }
1130 |    ],
1131 |    "source": [
1132 |     "RARE_WORDS = set(word for (word, wc) in word_count.most_common()[:-10:-1])\n",
1133 |     "RARE_WORDS"
1134 |    ]
1135 |   },
1136 |   {
1137 |    "cell_type": "code",
1138 |    "execution_count": 31,
1139 |    "metadata": {},
1140 |    "outputs": [],
1141 |    "source": [
1142 |     "def remove_rare_words(text):\n",
1143 |     "    return \" \".join([word for word in text.split() if word not in RARE_WORDS])"
1144 |    ]
1145 |   },
1146 |   {
1147 |    "cell_type": "code",
1148 |    "execution_count": 32,
1149 |    "metadata": {},
1150 |    "outputs": [
1151 |     {
1152 |      "data": {
1153 |       "text/html": [
1154 |        "<div>\n",
1155 |        "<style scoped>\n",
1156 |        "    .dataframe tbody tr th:only-of-type {\n",
1157 |        "        vertical-align: middle;\n",
1158 |        "    }\n",
1159 |        "\n",
1160 |        "    .dataframe tbody tr th {\n",
1161 |        "        vertical-align: top;\n",
1162 |        "    }\n",
1163 |        "\n",
1164 |        "    .dataframe thead th {\n",
1165 |        "        text-align: right;\n",
1166 |        "    }\n",
1167 |        "</style>\n",
1168 |        "<table border=\"1\" class=\"dataframe\">\n",
1169 |        "  <thead>\n",
1170 |        "    <tr style=\"text-align: right;\">\n",
1171 |        "      <th></th>\n",
1172 |        "      <th>tweet</th>\n",
1173 |        "      <th>clean_text</th>\n",
1174 |        "    </tr>\n",
1175 |        "  </thead>\n",
1176 |        "  <tbody>\n",
1177 |        "    <tr>\n",
1178 |        "      <th>0</th>\n",
1179 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
1180 |        "      <td>father dysfunctional selfish drags kids dysfun...</td>\n",
1181 |        "    </tr>\n",
1182 |        "    <tr>\n",
1183 |        "      <th>1</th>\n",
1184 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
1185 |        "      <td>thanks lyft credit cant use cause dont offer w...</td>\n",
1186 |        "    </tr>\n",
1187 |        "    <tr>\n",
1188 |        "      <th>2</th>\n",
1189 |        "      <td>bihday your majesty</td>\n",
1190 |        "      <td>bihday majesty</td>\n",
1191 |        "    </tr>\n",
1192 |        "    <tr>\n",
1193 |        "      <th>3</th>\n",
1194 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
1195 |        "      <td>model u take u time urð± ðððð ð...</td>\n",
1196 |        "    </tr>\n",
1197 |        "    <tr>\n",
1198 |        "      <th>4</th>\n",
1199 |        "      <td>factsguide: society now    #motivation</td>\n",
1200 |        "      <td>factsguide society motivation</td>\n",
1201 |        "    </tr>\n",
1202 |        "  </tbody>\n",
1203 |        "</table>\n",
1204 |        "</div>"
1205 |       ],
1206 |       "text/plain": [
1207 |        "                                               tweet  \\\n",
1208 |        "0   @user when a father is dysfunctional and is s...   \n",
1209 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
1210 |        "2                                bihday your majesty   \n",
1211 |        "3  #model   i love u take with u all the time in ...   \n",
1212 |        "4             factsguide: society now    #motivation   \n",
1213 |        "\n",
1214 |        "                                          clean_text  \n",
1215 |        "0  father dysfunctional selfish drags kids dysfun...  \n",
1216 |        "1  thanks lyft credit cant use cause dont offer w...  \n",
1217 |        "2                                     bihday majesty  \n",
1218 |        "3  model u take u time urð± ðððð ð...  \n",
1219 |        "4                      factsguide society motivation  "
1220 |       ]
1221 |      },
1222 |      "execution_count": 32,
1223 |      "metadata": {},
1224 |      "output_type": "execute_result"
1225 |     }
1226 |    ],
1227 |    "source": [
1228 |     "df['clean_text'] = df['clean_text'].apply(lambda x: remove_rare_words(x))\n",
1229 |     "df.head()"
1230 |    ]
1231 |   },
1232 |   {
1233 |    "cell_type": "markdown",
1234 |    "metadata": {},
1235 |    "source": [
1236 |     "## Removal of Special characters"
1237 |    ]
1238 |   },
1239 |   {
1240 |    "cell_type": "code",
1241 |    "execution_count": 33,
1242 |    "metadata": {},
1243 |    "outputs": [],
1244 |    "source": [
1245 |     "import re\n",
1246 |     "def remove_spl_chars(text):\n",
1247 |     "    text = re.sub('[^a-zA-Z0-9]', ' ', text)\n",
1248 |     "    text = re.sub('\\s+', ' ', text)\n",
1249 |     "    return text"
1250 |    ]
1251 |   },
1252 |   {
1253 |    "cell_type": "code",
1254 |    "execution_count": 34,
1255 |    "metadata": {},
1256 |    "outputs": [
1257 |     {
1258 |      "data": {
1259 |       "text/html": [
1260 |        "<div>\n",
1261 |        "<style scoped>\n",
1262 |        "    .dataframe tbody tr th:only-of-type {\n",
1263 |        "        vertical-align: middle;\n",
1264 |        "    }\n",
1265 |        "\n",
1266 |        "    .dataframe tbody tr th {\n",
1267 |        "        vertical-align: top;\n",
1268 |        "    }\n",
1269 |        "\n",
1270 |        "    .dataframe thead th {\n",
1271 |        "        text-align: right;\n",
1272 |        "    }\n",
1273 |        "</style>\n",
1274 |        "<table border=\"1\" class=\"dataframe\">\n",
1275 |        "  <thead>\n",
1276 |        "    <tr style=\"text-align: right;\">\n",
1277 |        "      <th></th>\n",
1278 |        "      <th>tweet</th>\n",
1279 |        "      <th>clean_text</th>\n",
1280 |        "    </tr>\n",
1281 |        "  </thead>\n",
1282 |        "  <tbody>\n",
1283 |        "    <tr>\n",
1284 |        "      <th>0</th>\n",
1285 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
1286 |        "      <td>father dysfunctional selfish drags kids dysfun...</td>\n",
1287 |        "    </tr>\n",
1288 |        "    <tr>\n",
1289 |        "      <th>1</th>\n",
1290 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
1291 |        "      <td>thanks lyft credit cant use cause dont offer w...</td>\n",
1292 |        "    </tr>\n",
1293 |        "    <tr>\n",
1294 |        "      <th>2</th>\n",
1295 |        "      <td>bihday your majesty</td>\n",
1296 |        "      <td>bihday majesty</td>\n",
1297 |        "    </tr>\n",
1298 |        "    <tr>\n",
1299 |        "      <th>3</th>\n",
1300 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
1301 |        "      <td>model u take u time ur</td>\n",
1302 |        "    </tr>\n",
1303 |        "    <tr>\n",
1304 |        "      <th>4</th>\n",
1305 |        "      <td>factsguide: society now    #motivation</td>\n",
1306 |        "      <td>factsguide society motivation</td>\n",
1307 |        "    </tr>\n",
1308 |        "  </tbody>\n",
1309 |        "</table>\n",
1310 |        "</div>"
1311 |       ],
1312 |       "text/plain": [
1313 |        "                                               tweet  \\\n",
1314 |        "0   @user when a father is dysfunctional and is s...   \n",
1315 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
1316 |        "2                                bihday your majesty   \n",
1317 |        "3  #model   i love u take with u all the time in ...   \n",
1318 |        "4             factsguide: society now    #motivation   \n",
1319 |        "\n",
1320 |        "                                          clean_text  \n",
1321 |        "0  father dysfunctional selfish drags kids dysfun...  \n",
1322 |        "1  thanks lyft credit cant use cause dont offer w...  \n",
1323 |        "2                                     bihday majesty  \n",
1324 |        "3                            model u take u time ur   \n",
1325 |        "4                      factsguide society motivation  "
1326 |       ]
1327 |      },
1328 |      "execution_count": 34,
1329 |      "metadata": {},
1330 |      "output_type": "execute_result"
1331 |     }
1332 |    ],
1333 |    "source": [
1334 |     "df['clean_text'] = df['clean_text'].apply(lambda x: remove_spl_chars(x))\n",
1335 |     "df.head()"
1336 |    ]
1337 |   },
1338 |   {
1339 |    "cell_type": "markdown",
1340 |    "metadata": {},
1341 |    "source": [
1342 |     "## Stemming"
1343 |    ]
1344 |   },
1345 |   {
1346 |    "cell_type": "code",
1347 |    "execution_count": 35,
1348 |    "metadata": {},
1349 |    "outputs": [],
1350 |    "source": [
1351 |     "from nltk.stem.porter import PorterStemmer\n",
1352 |     "ps = PorterStemmer()\n",
1353 |     "def stem_words(text):\n",
1354 |     "    return \" \".join([ps.stem(word) for word in text.split()])"
1355 |    ]
1356 |   },
1357 |   {
1358 |    "cell_type": "code",
1359 |    "execution_count": 36,
1360 |    "metadata": {},
1361 |    "outputs": [
1362 |     {
1363 |      "data": {
1364 |       "text/html": [
1365 |        "<div>\n",
1366 |        "<style scoped>\n",
1367 |        "    .dataframe tbody tr th:only-of-type {\n",
1368 |        "        vertical-align: middle;\n",
1369 |        "    }\n",
1370 |        "\n",
1371 |        "    .dataframe tbody tr th {\n",
1372 |        "        vertical-align: top;\n",
1373 |        "    }\n",
1374 |        "\n",
1375 |        "    .dataframe thead th {\n",
1376 |        "        text-align: right;\n",
1377 |        "    }\n",
1378 |        "</style>\n",
1379 |        "<table border=\"1\" class=\"dataframe\">\n",
1380 |        "  <thead>\n",
1381 |        "    <tr style=\"text-align: right;\">\n",
1382 |        "      <th></th>\n",
1383 |        "      <th>tweet</th>\n",
1384 |        "      <th>clean_text</th>\n",
1385 |        "      <th>stemmed_text</th>\n",
1386 |        "    </tr>\n",
1387 |        "  </thead>\n",
1388 |        "  <tbody>\n",
1389 |        "    <tr>\n",
1390 |        "      <th>0</th>\n",
1391 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
1392 |        "      <td>father dysfunctional selfish drags kids dysfun...</td>\n",
1393 |        "      <td>father dysfunct selfish drag kid dysfunct run</td>\n",
1394 |        "    </tr>\n",
1395 |        "    <tr>\n",
1396 |        "      <th>1</th>\n",
1397 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
1398 |        "      <td>thanks lyft credit cant use cause dont offer w...</td>\n",
1399 |        "      <td>thank lyft credit cant use caus dont offer whe...</td>\n",
1400 |        "    </tr>\n",
1401 |        "    <tr>\n",
1402 |        "      <th>2</th>\n",
1403 |        "      <td>bihday your majesty</td>\n",
1404 |        "      <td>bihday majesty</td>\n",
1405 |        "      <td>bihday majesti</td>\n",
1406 |        "    </tr>\n",
1407 |        "    <tr>\n",
1408 |        "      <th>3</th>\n",
1409 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
1410 |        "      <td>model u take u time ur</td>\n",
1411 |        "      <td>model u take u time ur</td>\n",
1412 |        "    </tr>\n",
1413 |        "    <tr>\n",
1414 |        "      <th>4</th>\n",
1415 |        "      <td>factsguide: society now    #motivation</td>\n",
1416 |        "      <td>factsguide society motivation</td>\n",
1417 |        "      <td>factsguid societi motiv</td>\n",
1418 |        "    </tr>\n",
1419 |        "  </tbody>\n",
1420 |        "</table>\n",
1421 |        "</div>"
1422 |       ],
1423 |       "text/plain": [
1424 |        "                                               tweet  \\\n",
1425 |        "0   @user when a father is dysfunctional and is s...   \n",
1426 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
1427 |        "2                                bihday your majesty   \n",
1428 |        "3  #model   i love u take with u all the time in ...   \n",
1429 |        "4             factsguide: society now    #motivation   \n",
1430 |        "\n",
1431 |        "                                          clean_text  \\\n",
1432 |        "0  father dysfunctional selfish drags kids dysfun...   \n",
1433 |        "1  thanks lyft credit cant use cause dont offer w...   \n",
1434 |        "2                                     bihday majesty   \n",
1435 |        "3                            model u take u time ur    \n",
1436 |        "4                      factsguide society motivation   \n",
1437 |        "\n",
1438 |        "                                        stemmed_text  \n",
1439 |        "0      father dysfunct selfish drag kid dysfunct run  \n",
1440 |        "1  thank lyft credit cant use caus dont offer whe...  \n",
1441 |        "2                                     bihday majesti  \n",
1442 |        "3                             model u take u time ur  \n",
1443 |        "4                            factsguid societi motiv  "
1444 |       ]
1445 |      },
1446 |      "execution_count": 36,
1447 |      "metadata": {},
1448 |      "output_type": "execute_result"
1449 |     }
1450 |    ],
1451 |    "source": [
1452 |     "df['stemmed_text'] = df['clean_text'].apply(lambda x: stem_words(x))\n",
1453 |     "df.head()"
1454 |    ]
1455 |   },
1456 |   {
1457 |    "cell_type": "markdown",
1458 |    "metadata": {},
1459 |    "source": [
1460 |     "## Lemmatization & POS Tagging"
1461 |    ]
1462 |   },
1463 |   {
1464 |    "cell_type": "code",
1465 |    "execution_count": 41,
1466 |    "metadata": {},
1467 |    "outputs": [],
1468 |    "source": [
1469 |     "from nltk import pos_tag\n",
1470 |     "from nltk.corpus import wordnet\n",
1471 |     "from nltk.stem import WordNetLemmatizer\n",
1472 |     "\n",
1473 |     "lemmatizer = WordNetLemmatizer()\n",
1474 |     "wordnet_map = {\"N\":wordnet.NOUN, \"V\": wordnet.VERB, \"J\": wordnet.ADJ, \"R\": wordnet.ADV}\n",
1475 |     "\n",
1476 |     "def lemmatize_words(text):\n",
1477 |     "    # find pos tags\n",
1478 |     "    pos_text = pos_tag(text.split())\n",
1479 |     "    return \" \".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_text])"
1480 |    ]
1481 |   },
1482 |   {
1483 |    "cell_type": "code",
1484 |    "execution_count": 42,
1485 |    "metadata": {},
1486 |    "outputs": [
1487 |     {
1488 |      "data": {
1489 |       "text/plain": [
1490 |        "'n'"
1491 |       ]
1492 |      },
1493 |      "execution_count": 42,
1494 |      "metadata": {},
1495 |      "output_type": "execute_result"
1496 |     }
1497 |    ],
1498 |    "source": [
1499 |     "wordnet.NOUN"
1500 |    ]
1501 |   },
1502 |   {
1503 |    "cell_type": "code",
1504 |    "execution_count": 43,
1505 |    "metadata": {},
1506 |    "outputs": [
1507 |     {
1508 |      "data": {
1509 |       "text/html": [
1510 |        "<div>\n",
1511 |        "<style scoped>\n",
1512 |        "    .dataframe tbody tr th:only-of-type {\n",
1513 |        "        vertical-align: middle;\n",
1514 |        "    }\n",
1515 |        "\n",
1516 |        "    .dataframe tbody tr th {\n",
1517 |        "        vertical-align: top;\n",
1518 |        "    }\n",
1519 |        "\n",
1520 |        "    .dataframe thead th {\n",
1521 |        "        text-align: right;\n",
1522 |        "    }\n",
1523 |        "</style>\n",
1524 |        "<table border=\"1\" class=\"dataframe\">\n",
1525 |        "  <thead>\n",
1526 |        "    <tr style=\"text-align: right;\">\n",
1527 |        "      <th></th>\n",
1528 |        "      <th>tweet</th>\n",
1529 |        "      <th>clean_text</th>\n",
1530 |        "      <th>stemmed_text</th>\n",
1531 |        "      <th>lemmatized_text</th>\n",
1532 |        "    </tr>\n",
1533 |        "  </thead>\n",
1534 |        "  <tbody>\n",
1535 |        "    <tr>\n",
1536 |        "      <th>0</th>\n",
1537 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
1538 |        "      <td>father dysfunctional selfish drags kids dysfun...</td>\n",
1539 |        "      <td>father dysfunct selfish drag kid dysfunct run</td>\n",
1540 |        "      <td>father dysfunctional selfish drag kid dysfunct...</td>\n",
1541 |        "    </tr>\n",
1542 |        "    <tr>\n",
1543 |        "      <th>1</th>\n",
1544 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
1545 |        "      <td>thanks lyft credit cant use cause dont offer w...</td>\n",
1546 |        "      <td>thank lyft credit cant use caus dont offer whe...</td>\n",
1547 |        "      <td>thanks lyft credit cant use cause dont offer w...</td>\n",
1548 |        "    </tr>\n",
1549 |        "    <tr>\n",
1550 |        "      <th>2</th>\n",
1551 |        "      <td>bihday your majesty</td>\n",
1552 |        "      <td>bihday majesty</td>\n",
1553 |        "      <td>bihday majesti</td>\n",
1554 |        "      <td>bihday majesty</td>\n",
1555 |        "    </tr>\n",
1556 |        "    <tr>\n",
1557 |        "      <th>3</th>\n",
1558 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
1559 |        "      <td>model u take u time ur</td>\n",
1560 |        "      <td>model u take u time ur</td>\n",
1561 |        "      <td>model u take u time ur</td>\n",
1562 |        "    </tr>\n",
1563 |        "    <tr>\n",
1564 |        "      <th>4</th>\n",
1565 |        "      <td>factsguide: society now    #motivation</td>\n",
1566 |        "      <td>factsguide society motivation</td>\n",
1567 |        "      <td>factsguid societi motiv</td>\n",
1568 |        "      <td>factsguide society motivation</td>\n",
1569 |        "    </tr>\n",
1570 |        "  </tbody>\n",
1571 |        "</table>\n",
1572 |        "</div>"
1573 |       ],
1574 |       "text/plain": [
1575 |        "                                               tweet  \\\n",
1576 |        "0   @user when a father is dysfunctional and is s...   \n",
1577 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
1578 |        "2                                bihday your majesty   \n",
1579 |        "3  #model   i love u take with u all the time in ...   \n",
1580 |        "4             factsguide: society now    #motivation   \n",
1581 |        "\n",
1582 |        "                                          clean_text  \\\n",
1583 |        "0  father dysfunctional selfish drags kids dysfun...   \n",
1584 |        "1  thanks lyft credit cant use cause dont offer w...   \n",
1585 |        "2                                     bihday majesty   \n",
1586 |        "3                            model u take u time ur    \n",
1587 |        "4                      factsguide society motivation   \n",
1588 |        "\n",
1589 |        "                                        stemmed_text  \\\n",
1590 |        "0      father dysfunct selfish drag kid dysfunct run   \n",
1591 |        "1  thank lyft credit cant use caus dont offer whe...   \n",
1592 |        "2                                     bihday majesti   \n",
1593 |        "3                             model u take u time ur   \n",
1594 |        "4                            factsguid societi motiv   \n",
1595 |        "\n",
1596 |        "                                     lemmatized_text  \n",
1597 |        "0  father dysfunctional selfish drag kid dysfunct...  \n",
1598 |        "1  thanks lyft credit cant use cause dont offer w...  \n",
1599 |        "2                                     bihday majesty  \n",
1600 |        "3                             model u take u time ur  \n",
1601 |        "4                      factsguide society motivation  "
1602 |       ]
1603 |      },
1604 |      "execution_count": 43,
1605 |      "metadata": {},
1606 |      "output_type": "execute_result"
1607 |     }
1608 |    ],
1609 |    "source": [
1610 |     "df['lemmatized_text'] = df['clean_text'].apply(lambda x: lemmatize_words(x))\n",
1611 |     "df.head()"
1612 |    ]
1613 |   },
1614 |   {
1615 |    "cell_type": "code",
1616 |    "execution_count": 44,
1617 |    "metadata": {},
1618 |    "outputs": [
1619 |     {
1620 |      "data": {
1621 |       "text/html": [
1622 |        "<div>\n",
1623 |        "<style scoped>\n",
1624 |        "    .dataframe tbody tr th:only-of-type {\n",
1625 |        "        vertical-align: middle;\n",
1626 |        "    }\n",
1627 |        "\n",
1628 |        "    .dataframe tbody tr th {\n",
1629 |        "        vertical-align: top;\n",
1630 |        "    }\n",
1631 |        "\n",
1632 |        "    .dataframe thead th {\n",
1633 |        "        text-align: right;\n",
1634 |        "    }\n",
1635 |        "</style>\n",
1636 |        "<table border=\"1\" class=\"dataframe\">\n",
1637 |        "  <thead>\n",
1638 |        "    <tr style=\"text-align: right;\">\n",
1639 |        "      <th></th>\n",
1640 |        "      <th>tweet</th>\n",
1641 |        "      <th>clean_text</th>\n",
1642 |        "      <th>stemmed_text</th>\n",
1643 |        "      <th>lemmatized_text</th>\n",
1644 |        "    </tr>\n",
1645 |        "  </thead>\n",
1646 |        "  <tbody>\n",
1647 |        "    <tr>\n",
1648 |        "      <th>21468</th>\n",
1649 |        "      <td>@user for real now: we will be playing @user ...</td>\n",
1650 |        "      <td>real playing czech republic china championship...</td>\n",
1651 |        "      <td>real play czech republ china championship wugc...</td>\n",
1652 |        "      <td>real play czech republic china championship wu...</td>\n",
1653 |        "    </tr>\n",
1654 |        "    <tr>\n",
1655 |        "      <th>9568</th>\n",
1656 |        "      <td>dear america. please don't let this influence ...</td>\n",
1657 |        "      <td>dear america please dont let influence vote tr...</td>\n",
1658 |        "      <td>dear america pleas dont let influenc vote trum...</td>\n",
1659 |        "      <td>dear america please dont let influence vote tr...</td>\n",
1660 |        "    </tr>\n",
1661 |        "    <tr>\n",
1662 |        "      <th>19804</th>\n",
1663 |        "      <td>finally... now on to other suppos~   #leagueof...</td>\n",
1664 |        "      <td>finally suppos leagueoflegends</td>\n",
1665 |        "      <td>final suppo leagueoflegend</td>\n",
1666 |        "      <td>finally suppos leagueoflegends</td>\n",
1667 |        "    </tr>\n",
1668 |        "    <tr>\n",
1669 |        "      <th>22323</th>\n",
1670 |        "      <td>@user @user @user @user feeling    #worried.</td>\n",
1671 |        "      <td>feeling worried</td>\n",
1672 |        "      <td>feel worri</td>\n",
1673 |        "      <td>feel worry</td>\n",
1674 |        "    </tr>\n",
1675 |        "    <tr>\n",
1676 |        "      <th>20171</th>\n",
1677 |        "      <td>i am valued. #i_am #positive #affirmation</td>\n",
1678 |        "      <td>valued iam positive affirmation</td>\n",
1679 |        "      <td>valu iam posit affirm</td>\n",
1680 |        "      <td>value iam positive affirmation</td>\n",
1681 |        "    </tr>\n",
1682 |        "    <tr>\n",
1683 |        "      <th>29669</th>\n",
1684 |        "      <td>fathers day selfie â¤ï¸   #grandad #selfie #...</td>\n",
1685 |        "      <td>fathers selfie grandad selfie fathersday bless...</td>\n",
1686 |        "      <td>father selfi grandad selfi fathersday bless su...</td>\n",
1687 |        "      <td>father selfie grandad selfie fathersday bless ...</td>\n",
1688 |        "    </tr>\n",
1689 |        "    <tr>\n",
1690 |        "      <th>4360</th>\n",
1691 |        "      <td>when 8th #graders say they're   for high #school</td>\n",
1692 |        "      <td>8th graders say theyre high school</td>\n",
1693 |        "      <td>8th grader say theyr high school</td>\n",
1694 |        "      <td>8th grader say theyre high school</td>\n",
1695 |        "    </tr>\n",
1696 |        "    <tr>\n",
1697 |        "      <th>15915</th>\n",
1698 |        "      <td>current mood ðð¦   #alone #anxiety #rain ...</td>\n",
1699 |        "      <td>current mood alone anxiety rain thistooshallpass</td>\n",
1700 |        "      <td>current mood alon anxieti rain thistooshallpass</td>\n",
1701 |        "      <td>current mood alone anxiety rain thistooshallpass</td>\n",
1702 |        "    </tr>\n",
1703 |        "    <tr>\n",
1704 |        "      <th>92</th>\n",
1705 |        "      <td>yes! received my acceptance letter for my mast...</td>\n",
1706 |        "      <td>yes received acceptance letter masters back oc...</td>\n",
1707 |        "      <td>ye receiv accept letter master back octob good...</td>\n",
1708 |        "      <td>yes receive acceptance letter master back octo...</td>\n",
1709 |        "    </tr>\n",
1710 |        "    <tr>\n",
1711 |        "      <th>18745</th>\n",
1712 |        "      <td>@user @user this so made me smile</td>\n",
1713 |        "      <td>made smile</td>\n",
1714 |        "      <td>made smile</td>\n",
1715 |        "      <td>make smile</td>\n",
1716 |        "    </tr>\n",
1717 |        "  </tbody>\n",
1718 |        "</table>\n",
1719 |        "</div>"
1720 |       ],
1721 |       "text/plain": [
1722 |        "                                                   tweet  \\\n",
1723 |        "21468   @user for real now: we will be playing @user ...   \n",
1724 |        "9568   dear america. please don't let this influence ...   \n",
1725 |        "19804  finally... now on to other suppos~   #leagueof...   \n",
1726 |        "22323      @user @user @user @user feeling    #worried.    \n",
1727 |        "20171     i am valued. #i_am #positive #affirmation        \n",
1728 |        "29669  fathers day selfie â¤ï¸   #grandad #selfie #...   \n",
1729 |        "4360   when 8th #graders say they're   for high #school    \n",
1730 |        "15915  current mood ðð¦   #alone #anxiety #rain ...   \n",
1731 |        "92     yes! received my acceptance letter for my mast...   \n",
1732 |        "18745                 @user @user this so made me smile    \n",
1733 |        "\n",
1734 |        "                                              clean_text  \\\n",
1735 |        "21468  real playing czech republic china championship...   \n",
1736 |        "9568   dear america please dont let influence vote tr...   \n",
1737 |        "19804                     finally suppos leagueoflegends   \n",
1738 |        "22323                                    feeling worried   \n",
1739 |        "20171                    valued iam positive affirmation   \n",
1740 |        "29669  fathers selfie grandad selfie fathersday bless...   \n",
1741 |        "4360                  8th graders say theyre high school   \n",
1742 |        "15915   current mood alone anxiety rain thistooshallpass   \n",
1743 |        "92     yes received acceptance letter masters back oc...   \n",
1744 |        "18745                                         made smile   \n",
1745 |        "\n",
1746 |        "                                            stemmed_text  \\\n",
1747 |        "21468  real play czech republ china championship wugc...   \n",
1748 |        "9568   dear america pleas dont let influenc vote trum...   \n",
1749 |        "19804                         final suppo leagueoflegend   \n",
1750 |        "22323                                         feel worri   \n",
1751 |        "20171                              valu iam posit affirm   \n",
1752 |        "29669  father selfi grandad selfi fathersday bless su...   \n",
1753 |        "4360                    8th grader say theyr high school   \n",
1754 |        "15915    current mood alon anxieti rain thistooshallpass   \n",
1755 |        "92     ye receiv accept letter master back octob good...   \n",
1756 |        "18745                                         made smile   \n",
1757 |        "\n",
1758 |        "                                         lemmatized_text  \n",
1759 |        "21468  real play czech republic china championship wu...  \n",
1760 |        "9568   dear america please dont let influence vote tr...  \n",
1761 |        "19804                     finally suppos leagueoflegends  \n",
1762 |        "22323                                         feel worry  \n",
1763 |        "20171                     value iam positive affirmation  \n",
1764 |        "29669  father selfie grandad selfie fathersday bless ...  \n",
1765 |        "4360                   8th grader say theyre high school  \n",
1766 |        "15915   current mood alone anxiety rain thistooshallpass  \n",
1767 |        "92     yes receive acceptance letter master back octo...  \n",
1768 |        "18745                                         make smile  "
1769 |       ]
1770 |      },
1771 |      "execution_count": 44,
1772 |      "metadata": {},
1773 |      "output_type": "execute_result"
1774 |     }
1775 |    ],
1776 |    "source": [
1777 |     "df.sample(frac=1).head(10)"
1778 |    ]
1779 |   },
1780 |   {
1781 |    "cell_type": "markdown",
1782 |    "metadata": {},
1783 |    "source": [
1784 |     "## Removal of URLs"
1785 |    ]
1786 |   },
1787 |   {
1788 |    "cell_type": "code",
1789 |    "execution_count": 53,
1790 |    "metadata": {},
1791 |    "outputs": [],
1792 |    "source": [
1793 |     "text = \"https://www.hackersrealm.net is the URL of the channel Hackers Realm\""
1794 |    ]
1795 |   },
1796 |   {
1797 |    "cell_type": "code",
1798 |    "execution_count": 54,
1799 |    "metadata": {},
1800 |    "outputs": [],
1801 |    "source": [
1802 |     "def remove_url(text):\n",
1803 |     "    return re.sub(r'https?://\\S+|www\\.\\S+', '', text)"
1804 |    ]
1805 |   },
1806 |   {
1807 |    "cell_type": "code",
1808 |    "execution_count": 55,
1809 |    "metadata": {},
1810 |    "outputs": [
1811 |     {
1812 |      "data": {
1813 |       "text/plain": [
1814 |        "' is the URL of the channel Hackers Realm'"
1815 |       ]
1816 |      },
1817 |      "execution_count": 55,
1818 |      "metadata": {},
1819 |      "output_type": "execute_result"
1820 |     }
1821 |    ],
1822 |    "source": [
1823 |     "remove_url(text)"
1824 |    ]
1825 |   },
1826 |   {
1827 |    "cell_type": "markdown",
1828 |    "metadata": {},
1829 |    "source": [
1830 |     "## Removal of HTML Tags"
1831 |    ]
1832 |   },
1833 |   {
1834 |    "cell_type": "code",
1835 |    "execution_count": 56,
1836 |    "metadata": {},
1837 |    "outputs": [],
1838 |    "source": [
1839 |     "text = \"<html><body> <h1>Hackers Realm</h1> <p>This is NLP text preprocessing tutorial</p> </body></html>\""
1840 |    ]
1841 |   },
1842 |   {
1843 |    "cell_type": "code",
1844 |    "execution_count": 57,
1845 |    "metadata": {},
1846 |    "outputs": [],
1847 |    "source": [
1848 |     "def remove_html_tags(text):\n",
1849 |     "    return re.sub(r'<.*?>', '', text)"
1850 |    ]
1851 |   },
1852 |   {
1853 |    "cell_type": "code",
1854 |    "execution_count": 58,
1855 |    "metadata": {},
1856 |    "outputs": [
1857 |     {
1858 |      "data": {
1859 |       "text/plain": [
1860 |        "' Hackers Realm This is NLP text preprocessing tutorial '"
1861 |       ]
1862 |      },
1863 |      "execution_count": 58,
1864 |      "metadata": {},
1865 |      "output_type": "execute_result"
1866 |     }
1867 |    ],
1868 |    "source": [
1869 |     "remove_html_tags(text)"
1870 |    ]
1871 |   },
1872 |   {
1873 |    "cell_type": "markdown",
1874 |    "metadata": {},
1875 |    "source": [
1876 |     "## Spelling Correction"
1877 |    ]
1878 |   },
1879 |   {
1880 |    "cell_type": "code",
1881 |    "execution_count": 64,
1882 |    "metadata": {},
1883 |    "outputs": [],
1884 |    "source": [
1885 |     "!pip install pyspellchecker"
1886 |    ]
1887 |   },
1888 |   {
1889 |    "cell_type": "code",
1890 |    "execution_count": 7,
1891 |    "metadata": {},
1892 |    "outputs": [],
1893 |    "source": [
1894 |     "text = 'natur is a beuty'"
1895 |    ]
1896 |   },
1897 |   {
1898 |    "cell_type": "code",
1899 |    "execution_count": 8,
1900 |    "metadata": {},
1901 |    "outputs": [],
1902 |    "source": [
1903 |     "from spellchecker import SpellChecker\n",
1904 |     "spell = SpellChecker()\n",
1905 |     "\n",
1906 |     "def correct_spellings(text):\n",
1907 |     "    corrected_text = []\n",
1908 |     "    misspelled_text = spell.unknown(text.split())\n",
1909 |     "    # print(misspelled_text)\n",
1910 |     "    for word in text.split():\n",
1911 |     "        if word in misspelled_text:\n",
1912 |     "            corrected_text.append(spell.correction(word))\n",
1913 |     "        else:\n",
1914 |     "            corrected_text.append(word)\n",
1915 |     "            \n",
1916 |     "    return \" \".join(corrected_text)"
1917 |    ]
1918 |   },
1919 |   {
1920 |    "cell_type": "code",
1921 |    "execution_count": 9,
1922 |    "metadata": {},
1923 |    "outputs": [
1924 |     {
1925 |      "data": {
1926 |       "text/plain": [
1927 |        "'nature is a beauty'"
1928 |       ]
1929 |      },
1930 |      "execution_count": 9,
1931 |      "metadata": {},
1932 |      "output_type": "execute_result"
1933 |     }
1934 |    ],
1935 |    "source": [
1936 |     "correct_spellings(text)"
1937 |    ]
1938 |   },
1939 |   {
1940 |    "cell_type": "code",
1941 |    "execution_count": null,
1942 |    "metadata": {},
1943 |    "outputs": [],
1944 |    "source": []
1945 |   },
1946 |   {
1947 |    "cell_type": "markdown",
1948 |    "metadata": {},
1949 |    "source": [
1950 |     "# Feature Extraction from Text Data"
1951 |    ]
1952 |   },
1953 |   {
1954 |    "cell_type": "markdown",
1955 |    "metadata": {},
1956 |    "source": [
1957 |     "## Bag of Words\n",
1958 |     "\n",
1959 |     "A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: A vocabulary of known words. A measure of the presence of known words."
1960 |    ]
1961 |   },
1962 |   {
1963 |    "cell_type": "code",
1964 |    "execution_count": 5,
1965 |    "metadata": {},
1966 |    "outputs": [],
1967 |    "source": [
1968 |     "text_data = ['I am interested in NLP', 'This is a good tutorial with good topic', 'Feature extraction is very important topic']"
1969 |    ]
1970 |   },
1971 |   {
1972 |    "cell_type": "code",
1973 |    "execution_count": 6,
1974 |    "metadata": {},
1975 |    "outputs": [],
1976 |    "source": [
1977 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
1978 |     "bow = CountVectorizer(stop_words='english')"
1979 |    ]
1980 |   },
1981 |   {
1982 |    "cell_type": "code",
1983 |    "execution_count": 7,
1984 |    "metadata": {},
1985 |    "outputs": [
1986 |     {
1987 |      "data": {
1988 |       "text/plain": [
1989 |        "CountVectorizer(stop_words='english')"
1990 |       ]
1991 |      },
1992 |      "execution_count": 7,
1993 |      "metadata": {},
1994 |      "output_type": "execute_result"
1995 |     }
1996 |    ],
1997 |    "source": [
1998 |     "# fit the data\n",
1999 |     "bow.fit(text_data)"
2000 |    ]
2001 |   },
2002 |   {
2003 |    "cell_type": "code",
2004 |    "execution_count": 8,
2005 |    "metadata": {},
2006 |    "outputs": [
2007 |     {
2008 |      "data": {
2009 |       "text/plain": [
2010 |        "['extraction',\n",
2011 |        " 'feature',\n",
2012 |        " 'good',\n",
2013 |        " 'important',\n",
2014 |        " 'interested',\n",
2015 |        " 'nlp',\n",
2016 |        " 'topic',\n",
2017 |        " 'tutorial']"
2018 |       ]
2019 |      },
2020 |      "execution_count": 8,
2021 |      "metadata": {},
2022 |      "output_type": "execute_result"
2023 |     }
2024 |    ],
2025 |    "source": [
2026 |     "# get the vocabulary list\n",
2027 |     "bow.get_feature_names()"
2028 |    ]
2029 |   },
2030 |   {
2031 |    "cell_type": "code",
2032 |    "execution_count": 9,
2033 |    "metadata": {},
2034 |    "outputs": [
2035 |     {
2036 |      "data": {
2037 |       "text/plain": [
2038 |        "<3x8 sparse matrix of type '<class 'numpy.int64'>'\n",
2039 |        "\twith 9 stored elements in Compressed Sparse Row format>"
2040 |       ]
2041 |      },
2042 |      "execution_count": 9,
2043 |      "metadata": {},
2044 |      "output_type": "execute_result"
2045 |     }
2046 |    ],
2047 |    "source": [
2048 |     "bow_features = bow.transform(text_data)\n",
2049 |     "bow_features"
2050 |    ]
2051 |   },
2052 |   {
2053 |    "cell_type": "code",
2054 |    "execution_count": 10,
2055 |    "metadata": {},
2056 |    "outputs": [
2057 |     {
2058 |      "data": {
2059 |       "text/plain": [
2060 |        "array([[0, 0, 0, 0, 1, 1, 0, 0],\n",
2061 |        "       [0, 0, 2, 0, 0, 0, 1, 1],\n",
2062 |        "       [1, 1, 0, 1, 0, 0, 1, 0]], dtype=int64)"
2063 |       ]
2064 |      },
2065 |      "execution_count": 10,
2066 |      "metadata": {},
2067 |      "output_type": "execute_result"
2068 |     }
2069 |    ],
2070 |    "source": [
2071 |     "bow_feature_array = bow_features.toarray()\n",
2072 |     "bow_feature_array"
2073 |    ]
2074 |   },
2075 |   {
2076 |    "cell_type": "code",
2077 |    "execution_count": 11,
2078 |    "metadata": {},
2079 |    "outputs": [
2080 |     {
2081 |      "name": "stdout",
2082 |      "output_type": "stream",
2083 |      "text": [
2084 |       "['extraction', 'feature', 'good', 'important', 'interested', 'nlp', 'topic', 'tutorial']\n",
2085 |       "I am interested in NLP\n",
2086 |       "[0 0 0 0 1 1 0 0]\n",
2087 |       "This is a good tutorial with good topic\n",
2088 |       "[0 0 2 0 0 0 1 1]\n",
2089 |       "Feature extraction is very important topic\n",
2090 |       "[1 1 0 1 0 0 1 0]\n"
2091 |      ]
2092 |     }
2093 |    ],
2094 |    "source": [
2095 |     "print(bow.get_feature_names())\n",
2096 |     "for sentence, feature in zip(text_data, bow_feature_array):\n",
2097 |     "    print(sentence)\n",
2098 |     "    print(feature)"
2099 |    ]
2100 |   },
2101 |   {
2102 |    "cell_type": "code",
2103 |    "execution_count": null,
2104 |    "metadata": {},
2105 |    "outputs": [],
2106 |    "source": []
2107 |   },
2108 |   {
2109 |    "cell_type": "markdown",
2110 |    "metadata": {},
2111 |    "source": [
2112 |     "## TF-IDF (Term Frequency/Inverse Document Frequency)\n",
2113 |     "\n",
2114 |     "TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc)  in a document amongst a collection of documents"
2115 |    ]
2116 |   },
2117 |   {
2118 |    "cell_type": "code",
2119 |    "execution_count": 12,
2120 |    "metadata": {},
2121 |    "outputs": [],
2122 |    "source": [
2123 |     "text_data = ['I am interested in NLP', 'This is a good tutorial with good topic', 'Feature extraction is very important topic']"
2124 |    ]
2125 |   },
2126 |   {
2127 |    "cell_type": "code",
2128 |    "execution_count": 13,
2129 |    "metadata": {},
2130 |    "outputs": [],
2131 |    "source": [
2132 |     "from sklearn.feature_extraction.text import TfidfVectorizer\n",
2133 |     "tfidf = TfidfVectorizer(stop_words='english')"
2134 |    ]
2135 |   },
2136 |   {
2137 |    "cell_type": "code",
2138 |    "execution_count": 14,
2139 |    "metadata": {},
2140 |    "outputs": [
2141 |     {
2142 |      "data": {
2143 |       "text/plain": [
2144 |        "TfidfVectorizer(stop_words='english')"
2145 |       ]
2146 |      },
2147 |      "execution_count": 14,
2148 |      "metadata": {},
2149 |      "output_type": "execute_result"
2150 |     }
2151 |    ],
2152 |    "source": [
2153 |     "# fit the data\n",
2154 |     "tfidf.fit(text_data)"
2155 |    ]
2156 |   },
2157 |   {
2158 |    "cell_type": "code",
2159 |    "execution_count": 15,
2160 |    "metadata": {},
2161 |    "outputs": [
2162 |     {
2163 |      "data": {
2164 |       "text/plain": [
2165 |        "{'interested': 4,\n",
2166 |        " 'nlp': 5,\n",
2167 |        " 'good': 2,\n",
2168 |        " 'tutorial': 7,\n",
2169 |        " 'topic': 6,\n",
2170 |        " 'feature': 1,\n",
2171 |        " 'extraction': 0,\n",
2172 |        " 'important': 3}"
2173 |       ]
2174 |      },
2175 |      "execution_count": 15,
2176 |      "metadata": {},
2177 |      "output_type": "execute_result"
2178 |     }
2179 |    ],
2180 |    "source": [
2181 |     "# get the vocabulary list\n",
2182 |     "tfidf.vocabulary_"
2183 |    ]
2184 |   },
2185 |   {
2186 |    "cell_type": "code",
2187 |    "execution_count": 16,
2188 |    "metadata": {},
2189 |    "outputs": [
2190 |     {
2191 |      "data": {
2192 |       "text/plain": [
2193 |        "<3x8 sparse matrix of type '<class 'numpy.float64'>'\n",
2194 |        "\twith 9 stored elements in Compressed Sparse Row format>"
2195 |       ]
2196 |      },
2197 |      "execution_count": 16,
2198 |      "metadata": {},
2199 |      "output_type": "execute_result"
2200 |     }
2201 |    ],
2202 |    "source": [
2203 |     "tfidf_features = tfidf.transform(text_data)\n",
2204 |     "tfidf_features"
2205 |    ]
2206 |   },
2207 |   {
2208 |    "cell_type": "code",
2209 |    "execution_count": 17,
2210 |    "metadata": {},
2211 |    "outputs": [
2212 |     {
2213 |      "data": {
2214 |       "text/plain": [
2215 |        "array([[0.        , 0.        , 0.        , 0.        , 0.70710678,\n",
2216 |        "        0.70710678, 0.        , 0.        ],\n",
2217 |        "       [0.        , 0.        , 0.84678897, 0.        , 0.        ,\n",
2218 |        "        0.        , 0.32200242, 0.42339448],\n",
2219 |        "       [0.52863461, 0.52863461, 0.        , 0.52863461, 0.        ,\n",
2220 |        "        0.        , 0.40204024, 0.        ]])"
2221 |       ]
2222 |      },
2223 |      "execution_count": 17,
2224 |      "metadata": {},
2225 |      "output_type": "execute_result"
2226 |     }
2227 |    ],
2228 |    "source": [
2229 |     "tfidf_feature_array = tfidf_features.toarray()\n",
2230 |     "tfidf_feature_array"
2231 |    ]
2232 |   },
2233 |   {
2234 |    "cell_type": "code",
2235 |    "execution_count": 19,
2236 |    "metadata": {},
2237 |    "outputs": [
2238 |     {
2239 |      "name": "stdout",
2240 |      "output_type": "stream",
2241 |      "text": [
2242 |       "I am interested in NLP\n",
2243 |       "  (0, 5)\t0.7071067811865476\n",
2244 |       "  (0, 4)\t0.7071067811865476\n",
2245 |       "This is a good tutorial with good topic\n",
2246 |       "  (0, 7)\t0.42339448341195934\n",
2247 |       "  (0, 6)\t0.3220024178194947\n",
2248 |       "  (0, 2)\t0.8467889668239187\n",
2249 |       "Feature extraction is very important topic\n",
2250 |       "  (0, 6)\t0.4020402441612698\n",
2251 |       "  (0, 3)\t0.5286346066596935\n",
2252 |       "  (0, 1)\t0.5286346066596935\n",
2253 |       "  (0, 0)\t0.5286346066596935\n"
2254 |      ]
2255 |     }
2256 |    ],
2257 |    "source": [
2258 |     "for sentence, feature in zip(text_data, tfidf_features):\n",
2259 |     "    print(sentence)\n",
2260 |     "    print(feature)"
2261 |    ]
2262 |   },
2263 |   {
2264 |    "cell_type": "code",
2265 |    "execution_count": null,
2266 |    "metadata": {},
2267 |    "outputs": [],
2268 |    "source": []
2269 |   },
2270 |   {
2271 |    "cell_type": "markdown",
2272 |    "metadata": {},
2273 |    "source": [
2274 |     "## Word2vec\n",
2275 |     "\n",
2276 |     "The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence."
2277 |    ]
2278 |   },
2279 |   {
2280 |    "cell_type": "code",
2281 |    "execution_count": 20,
2282 |    "metadata": {},
2283 |    "outputs": [],
2284 |    "source": [
2285 |     "from gensim.test.utils import common_texts\n",
2286 |     "from gensim.models import Word2Vec"
2287 |    ]
2288 |   },
2289 |   {
2290 |    "cell_type": "code",
2291 |    "execution_count": 21,
2292 |    "metadata": {},
2293 |    "outputs": [
2294 |     {
2295 |      "data": {
2296 |       "text/plain": [
2297 |        "[['human', 'interface', 'computer'],\n",
2298 |        " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
2299 |        " ['eps', 'user', 'interface', 'system'],\n",
2300 |        " ['system', 'human', 'system', 'eps'],\n",
2301 |        " ['user', 'response', 'time'],\n",
2302 |        " ['trees'],\n",
2303 |        " ['graph', 'trees'],\n",
2304 |        " ['graph', 'minors', 'trees'],\n",
2305 |        " ['graph', 'minors', 'survey']]"
2306 |       ]
2307 |      },
2308 |      "execution_count": 21,
2309 |      "metadata": {},
2310 |      "output_type": "execute_result"
2311 |     }
2312 |    ],
2313 |    "source": [
2314 |     "# text data\n",
2315 |     "common_texts"
2316 |    ]
2317 |   },
2318 |   {
2319 |    "cell_type": "code",
2320 |    "execution_count": 23,
2321 |    "metadata": {},
2322 |    "outputs": [],
2323 |    "source": [
2324 |     "# initialize and fit the data\n",
2325 |     "model = Word2Vec(common_texts, size=100, min_count=1)"
2326 |    ]
2327 |   },
2328 |   {
2329 |    "cell_type": "code",
2330 |    "execution_count": 25,
2331 |    "metadata": {},
2332 |    "outputs": [
2333 |     {
2334 |      "data": {
2335 |       "text/plain": [
2336 |        "array([-0.00042112,  0.00126945, -0.00348724,  0.00373327,  0.00387501,\n",
2337 |        "       -0.00306736, -0.00138952, -0.00139083,  0.00334137,  0.00413064,\n",
2338 |        "        0.00045129, -0.00390373, -0.00159695, -0.00369461, -0.00036086,\n",
2339 |        "        0.00444261, -0.00391653,  0.00447466, -0.00032617,  0.00056412,\n",
2340 |        "       -0.00017338, -0.00464378,  0.00039338, -0.00353649,  0.0040346 ,\n",
2341 |        "        0.00179682, -0.00186994, -0.00121431, -0.00370716,  0.00039535,\n",
2342 |        "       -0.00117291,  0.00498948, -0.00243317,  0.00480749, -0.00128626,\n",
2343 |        "       -0.0018426 , -0.00086148, -0.00347201, -0.0025697 , -0.00409948,\n",
2344 |        "        0.00433477, -0.00424404,  0.00389087,  0.0024296 ,  0.0009781 ,\n",
2345 |        "       -0.00267652, -0.00039598,  0.00188174, -0.00141169,  0.00143257,\n",
2346 |        "        0.00363962, -0.00445332,  0.00499313, -0.00013036,  0.00411159,\n",
2347 |        "        0.00307077, -0.00048517,  0.00491026, -0.00315512, -0.00091287,\n",
2348 |        "        0.00465486,  0.00034458,  0.00097905,  0.00187424, -0.00452135,\n",
2349 |        "       -0.00365111,  0.00260027,  0.00464861, -0.00243504, -0.00425601,\n",
2350 |        "       -0.00265299, -0.00108813,  0.00284521, -0.00437486, -0.0015496 ,\n",
2351 |        "       -0.00054869,  0.00228153,  0.00360572,  0.00255484, -0.00357945,\n",
2352 |        "       -0.00235164,  0.00220505, -0.0016885 ,  0.00294839, -0.00337972,\n",
2353 |        "        0.00291201,  0.00250298,  0.00447992, -0.00129002,  0.0025    ,\n",
2354 |        "       -0.00430755, -0.00419162, -0.00029911,  0.00166961,  0.00417119,\n",
2355 |        "       -0.00209666,  0.00452041,  0.00010931, -0.00115822, -0.00154263],\n",
2356 |        "      dtype=float32)"
2357 |       ]
2358 |      },
2359 |      "execution_count": 25,
2360 |      "metadata": {},
2361 |      "output_type": "execute_result"
2362 |     }
2363 |    ],
2364 |    "source": [
2365 |     "model.wv['graph']"
2366 |    ]
2367 |   },
2368 |   {
2369 |    "cell_type": "code",
2370 |    "execution_count": 26,
2371 |    "metadata": {},
2372 |    "outputs": [
2373 |     {
2374 |      "data": {
2375 |       "text/plain": [
2376 |        "[('interface', 0.1710839718580246),\n",
2377 |        " ('user', 0.08987751603126526),\n",
2378 |        " ('trees', 0.07364125549793243),\n",
2379 |        " ('minors', 0.045832667499780655),\n",
2380 |        " ('computer', 0.025292515754699707),\n",
2381 |        " ('system', 0.012846874073147774),\n",
2382 |        " ('human', -0.03873271495103836),\n",
2383 |        " ('survey', -0.06853737682104111),\n",
2384 |        " ('time', -0.07515352964401245),\n",
2385 |        " ('eps', -0.07798048853874207)]"
2386 |       ]
2387 |      },
2388 |      "execution_count": 26,
2389 |      "metadata": {},
2390 |      "output_type": "execute_result"
2391 |     }
2392 |    ],
2393 |    "source": [
2394 |     "model.wv.most_similar('graph')"
2395 |    ]
2396 |   },
2397 |   {
2398 |    "cell_type": "code",
2399 |    "execution_count": null,
2400 |    "metadata": {},
2401 |    "outputs": [],
2402 |    "source": []
2403 |   },
2404 |   {
2405 |    "cell_type": "markdown",
2406 |    "metadata": {},
2407 |    "source": [
2408 |     "## Word Embedding using Glove\n",
2409 |     "\n",
2410 |     "GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space\n",
2411 |     "\n",
2412 |     "Download link: https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt"
2413 |    ]
2414 |   },
2415 |   {
2416 |    "cell_type": "code",
2417 |    "execution_count": 28,
2418 |    "metadata": {},
2419 |    "outputs": [
2420 |     {
2421 |      "data": {
2422 |       "text/html": [
2423 |        "<div>\n",
2424 |        "<style scoped>\n",
2425 |        "    .dataframe tbody tr th:only-of-type {\n",
2426 |        "        vertical-align: middle;\n",
2427 |        "    }\n",
2428 |        "\n",
2429 |        "    .dataframe tbody tr th {\n",
2430 |        "        vertical-align: top;\n",
2431 |        "    }\n",
2432 |        "\n",
2433 |        "    .dataframe thead th {\n",
2434 |        "        text-align: right;\n",
2435 |        "    }\n",
2436 |        "</style>\n",
2437 |        "<table border=\"1\" class=\"dataframe\">\n",
2438 |        "  <thead>\n",
2439 |        "    <tr style=\"text-align: right;\">\n",
2440 |        "      <th></th>\n",
2441 |        "      <th>tweet</th>\n",
2442 |        "      <th>clean_text</th>\n",
2443 |        "    </tr>\n",
2444 |        "  </thead>\n",
2445 |        "  <tbody>\n",
2446 |        "    <tr>\n",
2447 |        "      <th>0</th>\n",
2448 |        "      <td>@user when a father is dysfunctional and is s...</td>\n",
2449 |        "      <td>user father dysfunctional selfish drags kids ...</td>\n",
2450 |        "    </tr>\n",
2451 |        "    <tr>\n",
2452 |        "      <th>1</th>\n",
2453 |        "      <td>@user @user thanks for #lyft credit i can't us...</td>\n",
2454 |        "      <td>user user thanks lyft credit can t use cause ...</td>\n",
2455 |        "    </tr>\n",
2456 |        "    <tr>\n",
2457 |        "      <th>2</th>\n",
2458 |        "      <td>bihday your majesty</td>\n",
2459 |        "      <td>bihday majesty</td>\n",
2460 |        "    </tr>\n",
2461 |        "    <tr>\n",
2462 |        "      <th>3</th>\n",
2463 |        "      <td>#model   i love u take with u all the time in ...</td>\n",
2464 |        "      <td>model love u take u time ur</td>\n",
2465 |        "    </tr>\n",
2466 |        "    <tr>\n",
2467 |        "      <th>4</th>\n",
2468 |        "      <td>factsguide: society now    #motivation</td>\n",
2469 |        "      <td>factsguide society motivation</td>\n",
2470 |        "    </tr>\n",
2471 |        "  </tbody>\n",
2472 |        "</table>\n",
2473 |        "</div>"
2474 |       ],
2475 |       "text/plain": [
2476 |        "                                               tweet  \\\n",
2477 |        "0   @user when a father is dysfunctional and is s...   \n",
2478 |        "1  @user @user thanks for #lyft credit i can't us...   \n",
2479 |        "2                                bihday your majesty   \n",
2480 |        "3  #model   i love u take with u all the time in ...   \n",
2481 |        "4             factsguide: society now    #motivation   \n",
2482 |        "\n",
2483 |        "                                          clean_text  \n",
2484 |        "0   user father dysfunctional selfish drags kids ...  \n",
2485 |        "1   user user thanks lyft credit can t use cause ...  \n",
2486 |        "2                                     bihday majesty  \n",
2487 |        "3                       model love u take u time ur   \n",
2488 |        "4                      factsguide society motivation  "
2489 |       ]
2490 |      },
2491 |      "execution_count": 28,
2492 |      "metadata": {},
2493 |      "output_type": "execute_result"
2494 |     }
2495 |    ],
2496 |    "source": [
2497 |     "import pandas as pd\n",
2498 |     "import string\n",
2499 |     "from nltk.corpus import stopwords\n",
2500 |     "df = pd.read_csv('data/Twitter Sentiments.csv')\n",
2501 |     "# drop the columns\n",
2502 |     "df = df.drop(columns=['id', 'label'], axis=1)\n",
2503 |     "\n",
2504 |     "df['clean_text'] = df['tweet'].str.lower()\n",
2505 |     "\n",
2506 |     "STOPWORDS = set(stopwords.words('english'))\n",
2507 |     "def remove_stopwords(text):\n",
2508 |     "    return \" \".join([word for word in text.split() if word not in STOPWORDS])\n",
2509 |     "df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))\n",
2510 |     "\n",
2511 |     "import re\n",
2512 |     "def remove_spl_chars(text):\n",
2513 |     "    text = re.sub('[^a-zA-Z0-9]', ' ', text)\n",
2514 |     "    text = re.sub('\\s+', ' ', text)\n",
2515 |     "    return text\n",
2516 |     "df['clean_text'] = df['clean_text'].apply(lambda x: remove_spl_chars(x))\n",
2517 |     "\n",
2518 |     "df.head()"
2519 |    ]
2520 |   },
2521 |   {
2522 |    "cell_type": "code",
2523 |    "execution_count": 34,
2524 |    "metadata": {},
2525 |    "outputs": [],
2526 |    "source": [
2527 |     "from keras.preprocessing.text import Tokenizer\n",
2528 |     "from keras.preprocessing.sequence import pad_sequences\n",
2529 |     "import numpy as np"
2530 |    ]
2531 |   },
2532 |   {
2533 |    "cell_type": "code",
2534 |    "execution_count": 30,
2535 |    "metadata": {},
2536 |    "outputs": [
2537 |     {
2538 |      "data": {
2539 |       "text/plain": [
2540 |        "39085"
2541 |       ]
2542 |      },
2543 |      "execution_count": 30,
2544 |      "metadata": {},
2545 |      "output_type": "execute_result"
2546 |     }
2547 |    ],
2548 |    "source": [
2549 |     "# tokenize text\n",
2550 |     "tokenizer = Tokenizer()\n",
2551 |     "tokenizer.fit_on_texts(df['clean_text'])\n",
2552 |     "\n",
2553 |     "word_index = tokenizer.word_index\n",
2554 |     "vocab_size = len(word_index)\n",
2555 |     "vocab_size"
2556 |    ]
2557 |   },
2558 |   {
2559 |    "cell_type": "code",
2560 |    "execution_count": 40,
2561 |    "metadata": {},
2562 |    "outputs": [],
2563 |    "source": [
2564 |     "# word_index"
2565 |    ]
2566 |   },
2567 |   {
2568 |    "cell_type": "code",
2569 |    "execution_count": 31,
2570 |    "metadata": {},
2571 |    "outputs": [
2572 |     {
2573 |      "data": {
2574 |       "text/plain": [
2575 |        "131"
2576 |       ]
2577 |      },
2578 |      "execution_count": 31,
2579 |      "metadata": {},
2580 |      "output_type": "execute_result"
2581 |     }
2582 |    ],
2583 |    "source": [
2584 |     "max(len(data) for data in df['clean_text'])"
2585 |    ]
2586 |   },
2587 |   {
2588 |    "cell_type": "code",
2589 |    "execution_count": 32,
2590 |    "metadata": {},
2591 |    "outputs": [],
2592 |    "source": [
2593 |     "# padding text data\n",
2594 |     "sequences = tokenizer.texts_to_sequences(df['clean_text'])\n",
2595 |     "padded_seq = pad_sequences(sequences, maxlen=131, padding='post', truncating='post')"
2596 |    ]
2597 |   },
2598 |   {
2599 |    "cell_type": "code",
2600 |    "execution_count": 33,
2601 |    "metadata": {},
2602 |    "outputs": [
2603 |     {
2604 |      "data": {
2605 |       "text/plain": [
2606 |        "array([    1,    28, 15330,  2630,  6365,   184,  7786,   385,     0,\n",
2607 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2608 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2609 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2610 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2611 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2612 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2613 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2614 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2615 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2616 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2617 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2618 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2619 |        "           0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
2620 |        "           0,     0,     0,     0,     0])"
2621 |       ]
2622 |      },
2623 |      "execution_count": 33,
2624 |      "metadata": {},
2625 |      "output_type": "execute_result"
2626 |     }
2627 |    ],
2628 |    "source": [
2629 |     "padded_seq[0]"
2630 |    ]
2631 |   },
2632 |   {
2633 |    "cell_type": "code",
2634 |    "execution_count": 35,
2635 |    "metadata": {},
2636 |    "outputs": [],
2637 |    "source": [
2638 |     "# create embedding index\n",
2639 |     "embedding_index = {}\n",
2640 |     "with open('glove.6B.100d.txt', encoding='utf-8') as f:\n",
2641 |     "    for line in f:\n",
2642 |     "        values = line.split()\n",
2643 |     "        word = values[0]\n",
2644 |     "        coefs = np.asarray(values[1:], dtype='float32')\n",
2645 |     "        embedding_index[word] = coefs"
2646 |    ]
2647 |   },
2648 |   {
2649 |    "cell_type": "code",
2650 |    "execution_count": 36,
2651 |    "metadata": {},
2652 |    "outputs": [
2653 |     {
2654 |      "data": {
2655 |       "text/plain": [
2656 |        "array([-0.030769 ,  0.11993  ,  0.53909  , -0.43696  , -0.73937  ,\n",
2657 |        "       -0.15345  ,  0.081126 , -0.38559  , -0.68797  , -0.41632  ,\n",
2658 |        "       -0.13183  , -0.24922  ,  0.441    ,  0.085919 ,  0.20871  ,\n",
2659 |        "       -0.063582 ,  0.062228 , -0.051234 , -0.13398  ,  1.1418   ,\n",
2660 |        "        0.036526 ,  0.49029  , -0.24567  , -0.412    ,  0.12349  ,\n",
2661 |        "        0.41336  , -0.48397  , -0.54243  , -0.27787  , -0.26015  ,\n",
2662 |        "       -0.38485  ,  0.78656  ,  0.1023   , -0.20712  ,  0.40751  ,\n",
2663 |        "        0.32026  , -0.51052  ,  0.48362  , -0.0099498, -0.38685  ,\n",
2664 |        "        0.034975 , -0.167    ,  0.4237   , -0.54164  , -0.30323  ,\n",
2665 |        "       -0.36983  ,  0.082836 , -0.52538  , -0.064531 , -1.398    ,\n",
2666 |        "       -0.14873  , -0.35327  , -0.1118   ,  1.0912   ,  0.095864 ,\n",
2667 |        "       -2.8129   ,  0.45238  ,  0.46213  ,  1.6012   , -0.20837  ,\n",
2668 |        "       -0.27377  ,  0.71197  , -1.0754   , -0.046974 ,  0.67479  ,\n",
2669 |        "       -0.065839 ,  0.75824  ,  0.39405  ,  0.15507  , -0.64719  ,\n",
2670 |        "        0.32796  , -0.031748 ,  0.52899  , -0.43886  ,  0.67405  ,\n",
2671 |        "        0.42136  , -0.11981  , -0.21777  , -0.29756  , -0.1351   ,\n",
2672 |        "        0.59898  ,  0.46529  , -0.58258  , -0.02323  , -1.5442   ,\n",
2673 |        "        0.01901  , -0.015877 ,  0.024499 , -0.58017  , -0.67659  ,\n",
2674 |        "       -0.040379 , -0.44043  ,  0.083292 ,  0.20035  , -0.75499  ,\n",
2675 |        "        0.16918  , -0.26573  , -0.52878  ,  0.17584  ,  1.065    ],\n",
2676 |        "      dtype=float32)"
2677 |       ]
2678 |      },
2679 |      "execution_count": 36,
2680 |      "metadata": {},
2681 |      "output_type": "execute_result"
2682 |     }
2683 |    ],
2684 |    "source": [
2685 |     "embedding_index['good']"
2686 |    ]
2687 |   },
2688 |   {
2689 |    "cell_type": "code",
2690 |    "execution_count": 41,
2691 |    "metadata": {},
2692 |    "outputs": [],
2693 |    "source": [
2694 |     "# create embedding matrix\n",
2695 |     "embedding_matrix = np.zeros((vocab_size+1, 100))\n",
2696 |     "for word, i in word_index.items():\n",
2697 |     "    embedding_vector = embedding_index.get(word)\n",
2698 |     "    if embedding_vector is not None:\n",
2699 |     "        embedding_matrix[i] = embedding_vector"
2700 |    ]
2701 |   },
2702 |   {
2703 |    "cell_type": "code",
2704 |    "execution_count": 42,
2705 |    "metadata": {},
2706 |    "outputs": [
2707 |     {
2708 |      "data": {
2709 |       "text/plain": [
2710 |        "(39086, 100)"
2711 |       ]
2712 |      },
2713 |      "execution_count": 42,
2714 |      "metadata": {},
2715 |      "output_type": "execute_result"
2716 |     }
2717 |    ],
2718 |    "source": [
2719 |     "embedding_matrix.shape"
2720 |    ]
2721 |   },
2722 |   {
2723 |    "cell_type": "code",
2724 |    "execution_count": null,
2725 |    "metadata": {},
2726 |    "outputs": [],
2727 |    "source": []
2728 |   },
2729 |   {
2730 |    "cell_type": "markdown",
2731 |    "metadata": {},
2732 |    "source": [
2733 |     "# Named Entity Recognition"
2734 |    ]
2735 |   },
2736 |   {
2737 |    "cell_type": "code",
2738 |    "execution_count": null,
2739 |    "metadata": {
2740 |     "id": "PCnOhgdCcked"
2741 |    },
2742 |    "outputs": [],
2743 |    "source": [
2744 |     "# !pip install -U pip setuptools wheel\n",
2745 |     "# !pip install -U spacy\n",
2746 |     "# !python -m spacy download en_core_web_sm"
2747 |    ]
2748 |   },
2749 |   {
2750 |    "cell_type": "code",
2751 |    "execution_count": null,
2752 |    "metadata": {
2753 |     "id": "9LfSlz4Ye9SD"
2754 |    },
2755 |    "outputs": [],
2756 |    "source": [
2757 |     "import spacy\n",
2758 |     "from spacy import displacy"
2759 |    ]
2760 |   },
2761 |   {
2762 |    "cell_type": "code",
2763 |    "execution_count": null,
2764 |    "metadata": {
2765 |     "id": "nD0z2jmtfGfk"
2766 |    },
2767 |    "outputs": [],
2768 |    "source": [
2769 |     "NER = spacy.load('en_core_web_sm')"
2770 |    ]
2771 |   },
2772 |   {
2773 |    "cell_type": "code",
2774 |    "execution_count": null,
2775 |    "metadata": {
2776 |     "id": "Q5OJPYA2fyiK"
2777 |    },
2778 |    "outputs": [],
2779 |    "source": [
2780 |     "text = 'Mark Zuckerberg is one of the founders of Facebook, a company from the United States'"
2781 |    ]
2782 |   },
2783 |   {
2784 |    "cell_type": "code",
2785 |    "execution_count": null,
2786 |    "metadata": {
2787 |     "id": "pYE8UjWsgf6S"
2788 |    },
2789 |    "outputs": [],
2790 |    "source": [
2791 |     "ner_text = NER(text)"
2792 |    ]
2793 |   },
2794 |   {
2795 |    "cell_type": "code",
2796 |    "execution_count": null,
2797 |    "metadata": {
2798 |     "colab": {
2799 |      "base_uri": "https://localhost:8080/"
2800 |     },
2801 |     "id": "kpWcdBlWgf38",
2802 |     "outputId": "b8cac9aa-02e2-42a3-dc01-fc3970b7c7e5"
2803 |    },
2804 |    "outputs": [
2805 |     {
2806 |      "name": "stdout",
2807 |      "output_type": "stream",
2808 |      "text": [
2809 |       "Mark Zuckerberg PERSON\n",
2810 |       "one CARDINAL\n",
2811 |       "Facebook ORG\n",
2812 |       "the United States GPE\n"
2813 |      ]
2814 |     }
2815 |    ],
2816 |    "source": [
2817 |     "for word in ner_text.ents:\n",
2818 |     "    print(word.text, word.label_)"
2819 |    ]
2820 |   },
2821 |   {
2822 |    "cell_type": "code",
2823 |    "execution_count": null,
2824 |    "metadata": {
2825 |     "colab": {
2826 |      "base_uri": "https://localhost:8080/",
2827 |      "height": 35
2828 |     },
2829 |     "id": "iFBCYIvDgrL6",
2830 |     "outputId": "5e0291e3-8c6d-4082-f3c0-473ad0bdac43"
2831 |    },
2832 |    "outputs": [
2833 |     {
2834 |      "data": {
2835 |       "application/vnd.google.colaboratory.intrinsic+json": {
2836 |        "type": "string"
2837 |       },
2838 |       "text/plain": [
2839 |        "'Countries, cities, states'"
2840 |       ]
2841 |      },
2842 |      "execution_count": 13,
2843 |      "metadata": {},
2844 |      "output_type": "execute_result"
2845 |     }
2846 |    ],
2847 |    "source": [
2848 |     "spacy.explain('GPE')"
2849 |    ]
2850 |   },
2851 |   {
2852 |    "cell_type": "code",
2853 |    "execution_count": null,
2854 |    "metadata": {
2855 |     "colab": {
2856 |      "base_uri": "https://localhost:8080/",
2857 |      "height": 35
2858 |     },
2859 |     "id": "vkzMb7Bwg1Fi",
2860 |     "outputId": "4b6c9ed6-1270-4b9f-a35a-28e24122d7d3"
2861 |    },
2862 |    "outputs": [
2863 |     {
2864 |      "data": {
2865 |       "application/vnd.google.colaboratory.intrinsic+json": {
2866 |        "type": "string"
2867 |       },
2868 |       "text/plain": [
2869 |        "'Numerals that do not fall under another type'"
2870 |       ]
2871 |      },
2872 |      "execution_count": 14,
2873 |      "metadata": {},
2874 |      "output_type": "execute_result"
2875 |     }
2876 |    ],
2877 |    "source": [
2878 |     "spacy.explain('CARDINAL')"
2879 |    ]
2880 |   },
2881 |   {
2882 |    "cell_type": "code",
2883 |    "execution_count": null,
2884 |    "metadata": {
2885 |     "colab": {
2886 |      "base_uri": "https://localhost:8080/",
2887 |      "height": 52
2888 |     },
2889 |     "id": "LBPSsLT5g9nS",
2890 |     "outputId": "69a16d27-bf86-4b7f-e5cd-b3e56adeb149"
2891 |    },
2892 |    "outputs": [
2893 |     {
2894 |      "data": {
2895 |       "text/html": [
2896 |        "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
2897 |        "<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
2898 |        "    Mark Zuckerberg\n",
2899 |        "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n",
2900 |        "</mark>\n",
2901 |        " is \n",
2902 |        "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
2903 |        "    one\n",
2904 |        "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
2905 |        "</mark>\n",
2906 |        " of the founders of \n",
2907 |        "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
2908 |        "    Facebook\n",
2909 |        "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
2910 |        "</mark>\n",
2911 |        ", a company from \n",
2912 |        "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
2913 |        "    the United States\n",
2914 |        "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
2915 |        "</mark>\n",
2916 |        "</div></span>"
2917 |       ],
2918 |       "text/plain": [
2919 |        "<IPython.core.display.HTML object>"
2920 |       ]
2921 |      },
2922 |      "metadata": {},
2923 |      "output_type": "display_data"
2924 |     }
2925 |    ],
2926 |    "source": [
2927 |     "displacy.render(ner_text, style='ent', jupyter=True)"
2928 |    ]
2929 |   },
2930 |   {
2931 |    "cell_type": "code",
2932 |    "execution_count": null,
2933 |    "metadata": {},
2934 |    "outputs": [],
2935 |    "source": []
2936 |   },
2937 |   {
2938 |    "cell_type": "markdown",
2939 |    "metadata": {},
2940 |    "source": [
2941 |     "# Data Augmentation for Text"
2942 |    ]
2943 |   },
2944 |   {
2945 |    "cell_type": "code",
2946 |    "execution_count": null,
2947 |    "metadata": {},
2948 |    "outputs": [],
2949 |    "source": [
2950 |     "# uses\n",
2951 |     "# 1. increase the dataset size by creating more samples\n",
2952 |     "# 2. reduce overfitting\n",
2953 |     "# 3. improve model generalization\n",
2954 |     "# 4. handling imbalance dataset"
2955 |    ]
2956 |   },
2957 |   {
2958 |    "cell_type": "code",
2959 |    "execution_count": null,
2960 |    "metadata": {},
2961 |    "outputs": [],
2962 |    "source": [
2963 |     "!pip install nlpaug\n",
2964 |     "!pip install sacremoses"
2965 |    ]
2966 |   },
2967 |   {
2968 |    "cell_type": "code",
2969 |    "execution_count": 2,
2970 |    "metadata": {},
2971 |    "outputs": [],
2972 |    "source": [
2973 |     "import nlpaug.augmenter.word as naw"
2974 |    ]
2975 |   },
2976 |   {
2977 |    "cell_type": "code",
2978 |    "execution_count": 3,
2979 |    "metadata": {},
2980 |    "outputs": [],
2981 |    "source": [
2982 |     "text = 'The quick brown fox jumps over a lazy dog'"
2983 |    ]
2984 |   },
2985 |   {
2986 |    "cell_type": "markdown",
2987 |    "metadata": {},
2988 |    "source": [
2989 |     "### Synonym Replacement"
2990 |    ]
2991 |   },
2992 |   {
2993 |    "cell_type": "code",
2994 |    "execution_count": 10,
2995 |    "metadata": {},
2996 |    "outputs": [
2997 |     {
2998 |      "name": "stdout",
2999 |      "output_type": "stream",
3000 |      "text": [
3001 |       "Synonym Text: ['The flying brownness fox jumps over a lazy andiron']\n"
3002 |      ]
3003 |     }
3004 |    ],
3005 |    "source": [
3006 |     "syn_aug = naw.synonym.SynonymAug(aug_src='wordnet')\n",
3007 |     "synonym_text = syn_aug.augment(text)\n",
3008 |     "print('Synonym Text:', synonym_text)"
3009 |    ]
3010 |   },
3011 |   {
3012 |    "cell_type": "markdown",
3013 |    "metadata": {},
3014 |    "source": [
3015 |     "### Random Substitution"
3016 |    ]
3017 |   },
3018 |   {
3019 |    "cell_type": "code",
3020 |    "execution_count": 11,
3021 |    "metadata": {},
3022 |    "outputs": [
3023 |     {
3024 |      "name": "stdout",
3025 |      "output_type": "stream",
3026 |      "text": [
3027 |       "Substituted Text: ['_ _ brown fox jumps _ a lazy dog']\n"
3028 |      ]
3029 |     }
3030 |    ],
3031 |    "source": [
3032 |     "sub_aug = naw.random.RandomWordAug(action='substitute')\n",
3033 |     "substituted_text = sub_aug.augment(text)\n",
3034 |     "print('Substituted Text:', substituted_text)"
3035 |    ]
3036 |   },
3037 |   {
3038 |    "cell_type": "markdown",
3039 |    "metadata": {},
3040 |    "source": [
3041 |     "### Random Deletion"
3042 |    ]
3043 |   },
3044 |   {
3045 |    "cell_type": "code",
3046 |    "execution_count": 12,
3047 |    "metadata": {},
3048 |    "outputs": [
3049 |     {
3050 |      "name": "stdout",
3051 |      "output_type": "stream",
3052 |      "text": [
3053 |       "Deletion Text: ['Quick brown jumps over a lazy dog']\n"
3054 |      ]
3055 |     }
3056 |    ],
3057 |    "source": [
3058 |     "del_aug = naw.random.RandomWordAug(action='delete')\n",
3059 |     "deletion_text = del_aug.augment(text)\n",
3060 |     "print('Deletion Text:', deletion_text)"
3061 |    ]
3062 |   },
3063 |   {
3064 |    "cell_type": "markdown",
3065 |    "metadata": {},
3066 |    "source": [
3067 |     "### Random Swap"
3068 |    ]
3069 |   },
3070 |   {
3071 |    "cell_type": "code",
3072 |    "execution_count": 13,
3073 |    "metadata": {},
3074 |    "outputs": [
3075 |     {
3076 |      "name": "stdout",
3077 |      "output_type": "stream",
3078 |      "text": [
3079 |       "Swap Text: ['The quick brown jumps fox a lazy over dog']\n"
3080 |      ]
3081 |     }
3082 |    ],
3083 |    "source": [
3084 |     "swap_aug = naw.random.RandomWordAug(action='swap')\n",
3085 |     "swap_text = swap_aug.augment(text)\n",
3086 |     "print('Swap Text:', swap_text)"
3087 |    ]
3088 |   },
3089 |   {
3090 |    "cell_type": "markdown",
3091 |    "metadata": {},
3092 |    "source": [
3093 |     "### Back Translation"
3094 |    ]
3095 |   },
3096 |   {
3097 |    "cell_type": "code",
3098 |    "execution_count": 15,
3099 |    "metadata": {},
3100 |    "outputs": [
3101 |     {
3102 |      "name": "stdout",
3103 |      "output_type": "stream",
3104 |      "text": [
3105 |       "Back Translated Text: ['The speedy brown fox jumps over a lazy dog']\n"
3106 |      ]
3107 |     }
3108 |    ],
3109 |    "source": [
3110 |     "# translate original text to other language (german) and convert back to english language\n",
3111 |     "back_trans_aug = naw.back_translation.BackTranslationAug()\n",
3112 |     "back_trans_text = back_trans_aug.augment(text)\n",
3113 |     "print('Back Translated Text:', back_trans_text)"
3114 |    ]
3115 |   },
3116 |   {
3117 |    "cell_type": "code",
3118 |    "execution_count": null,
3119 |    "metadata": {},
3120 |    "outputs": [],
3121 |    "source": []
3122 |   },
3123 |   {
3124 |    "cell_type": "code",
3125 |    "execution_count": null,
3126 |    "metadata": {},
3127 |    "outputs": [],
3128 |    "source": []
3129 |   }
3130 |  ],
3131 |  "metadata": {
3132 |   "kernelspec": {
3133 |    "display_name": "Python 3 (ipykernel)",
3134 |    "language": "python",
3135 |    "name": "python3"
3136 |   },
3137 |   "language_info": {
3138 |    "codemirror_mode": {
3139 |     "name": "ipython",
3140 |     "version": 3
3141 |    },
3142 |    "file_extension": ".py",
3143 |    "mimetype": "text/x-python",
3144 |    "name": "python",
3145 |    "nbconvert_exporter": "python",
3146 |    "pygments_lexer": "ipython3",
3147 |    "version": "3.11.5"
3148 |   }
3149 |  },
3150 |  "nbformat": 4,
3151 |  "nbformat_minor": 4
3152 | }
3153 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Data Science Concepts
 2 |  
 3 | The repository contains all the machine learning, deep learning and NLP concepts with examples in python
 4 | 
 5 | 💻  Machine learning concepts playlist: http://bit.ly/mlconcepts
 6 | 
 7 | ✍🏼  Natural Language Processing(NLP) concepts playlist: http://bit.ly/nlpconcepts
 8 | 
 9 | 
10 | ## Machine Learning
11 | 
12 | 1. Normalize data using Max Absolute & Min Max Scaling - https://youtu.be/wSgWf-lUdDU
13 | 2. Standardize data using Z-Score/Standard Scalar - https://youtu.be/AmCkjGPmdvI
14 | 3. Detect and Remove Outliers in the Data - https://youtu.be/Cw2IvmWRcXs
15 | 4. Label Encoding for Categorical Attributes - https://youtu.be/YuzLkF7Ymf4
16 | 5. One Hot Encoding for Categorical Attributes - https://youtu.be/LqMHkc_F1WA
17 | 6. Target/Mean Encoding for Categorical Attributes - https://youtu.be/nd7vc4MZQz4
18 | 7. Frequency Encoding & Binary Encoding - https://youtu.be/2oCfBpnWQws
19 | 8. Extract Features from Datetime Attribute - https://youtu.be/PbyHFUVuqn8
20 | 9. How to Fill Missing Values in Dataset - https://youtu.be/FEQpdgoH_pM
21 | 10. Feature Selection using Correlation Matrix (Numerical) - https://youtu.be/1fFVt4tQjRE
22 | 11. Feature Selection using Chi Square (Category) - https://youtu.be/6N9H9KxdZdk
23 | 12. Feature Selection using Recursive Feature Elimination (RFE) - https://youtu.be/vxdVKbAv6as
24 | 13. Repeated Stratified KFold Cross Validation - https://youtu.be/cChWbibT-JI
25 | 14. How to handle Imbalanced Classes in Dataset - https://youtu.be/rVuUqpyPwEs
26 | 15. Ensemble Techniques to improve Model Performance - https://youtu.be/qPN-S5Ltbm4
27 | 16. Dimensionality Reduction using PCA vs LDA vs t-SNE vs UMAP - https://youtu.be/gk7ntPrxy-k
28 | 17. Handle Large Data using pandas - https://youtu.be/bd_1T2JCr4M
29 | 
30 | ## Natural Language Processing
31 | 
32 | 1. Tokenization - https://youtu.be/ivCcY8JCxeY
33 | 2. Stemming | Extract Root Words - https://youtu.be/O-SaH_dnb9A
34 | 3. Lemmatization - https://youtu.be/uvKKEkYZcdw
35 | 4. Part of Speech Tagging (POS) - https://youtu.be/n6j-T3_F9dI
36 | 5. Text Preprocessing in NLP - https://youtu.be/Br5dmsa49wo
37 | 6. Bag of Words (BOW) - https://youtu.be/dSce20oYIPY
38 | 7. Term Frequency - Inverse Document Frequency (TF-IDF) - https://youtu.be/O9aAwvk6SNI
39 | 8. Word2Vec - https://youtu.be/4DoJcQblpGQ
40 | 9. Word Embedding | GloVe - https://youtu.be/6uxVtUMtqtk
41 | 
42 | ## Deep Learning
43 | 
44 | 1. 


--------------------------------------------------------------------------------