├── .gitignore
├── Deep Learning
│   ├── Deep Learning Concepts - Hackers Realm.ipynb
│   ├── audio data
│   │   ├── OAF_back_fear.wav
│   │   ├── OAF_back_happy.wav
│   │   ├── OAF_back_ps.wav
│   │   └── OAF_back_sad.wav
│   └── image data
│       ├── 1.jpg
│       ├── 2.jpg
│       ├── 3.jpg
│       └── 4.jpg
├── Machine Learning
│   ├── Machine Learning Concepts - Hackers Realm.ipynb
│   └── data
│       ├── 1000000 Sales Records.rar
│       ├── Loan Prediction Dataset.csv
│       ├── Traffic data.csv
│       ├── bike sharing dataset.csv
│       ├── creditcard.rar
│       └── winequality.csv
├── NLP
│   ├── Natural Language Processing(NLP) Concepts - Hackers Realm.ipynb
│   └── data
│       └── Twitter Sentiments.csv
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | Machine Learning/data/1000000 Sales Records.csv
3 | 
--------------------------------------------------------------------------------

/Deep Learning/audio data/OAF_back_fear.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_fear.wav
--------------------------------------------------------------------------------

/Deep Learning/audio data/OAF_back_happy.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_happy.wav
--------------------------------------------------------------------------------

/Deep Learning/audio data/OAF_back_ps.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_ps.wav
--------------------------------------------------------------------------------

/Deep Learning/audio data/OAF_back_sad.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_sad.wav
--------------------------------------------------------------------------------

/Deep Learning/image data/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/1.jpg
--------------------------------------------------------------------------------

/Deep Learning/image data/2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/2.jpg
--------------------------------------------------------------------------------

/Deep Learning/image data/3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/3.jpg
--------------------------------------------------------------------------------

/Deep Learning/image data/4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image 
data/4.jpg -------------------------------------------------------------------------------- /Machine Learning/data/1000000 Sales Records.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Machine Learning/data/1000000 Sales Records.rar -------------------------------------------------------------------------------- /Machine Learning/data/Loan Prediction Dataset.csv: -------------------------------------------------------------------------------- 1 | Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status 2 | LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y 3 | LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N 4 | LP001005,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y 5 | LP001006,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y 6 | LP001008,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,Y 7 | LP001011,Male,Yes,2,Graduate,Yes,5417,4196,267,360,1,Urban,Y 8 | LP001013,Male,Yes,0,Not Graduate,No,2333,1516,95,360,1,Urban,Y 9 | LP001014,Male,Yes,3+,Graduate,No,3036,2504,158,360,0,Semiurban,N 10 | LP001018,Male,Yes,2,Graduate,No,4006,1526,168,360,1,Urban,Y 11 | LP001020,Male,Yes,1,Graduate,No,12841,10968,349,360,1,Semiurban,N 12 | LP001024,Male,Yes,2,Graduate,No,3200,700,70,360,1,Urban,Y 13 | LP001027,Male,Yes,2,Graduate,,2500,1840,109,360,1,Urban,Y 14 | LP001028,Male,Yes,2,Graduate,No,3073,8106,200,360,1,Urban,Y 15 | LP001029,Male,No,0,Graduate,No,1853,2840,114,360,1,Rural,N 16 | LP001030,Male,Yes,2,Graduate,No,1299,1086,17,120,1,Urban,Y 17 | LP001032,Male,No,0,Graduate,No,4950,0,125,360,1,Urban,Y 18 | LP001034,Male,No,1,Not Graduate,No,3596,0,100,240,,Urban,Y 19 | LP001036,Female,No,0,Graduate,No,3510,0,76,360,0,Urban,N 20 | LP001038,Male,Yes,0,Not Graduate,No,4887,0,133,360,1,Rural,N 21 | LP001041,Male,Yes,0,Graduate,,2600,3500,115,,1,Urban,Y 22 | LP001043,Male,Yes,0,Not Graduate,No,7660,0,104,360,0,Urban,N 23 | LP001046,Male,Yes,1,Graduate,No,5955,5625,315,360,1,Urban,Y 24 | LP001047,Male,Yes,0,Not Graduate,No,2600,1911,116,360,0,Semiurban,N 25 | LP001050,,Yes,2,Not Graduate,No,3365,1917,112,360,0,Rural,N 26 | LP001052,Male,Yes,1,Graduate,,3717,2925,151,360,,Semiurban,N 27 | LP001066,Male,Yes,0,Graduate,Yes,9560,0,191,360,1,Semiurban,Y 28 | LP001068,Male,Yes,0,Graduate,No,2799,2253,122,360,1,Semiurban,Y 29 | LP001073,Male,Yes,2,Not Graduate,No,4226,1040,110,360,1,Urban,Y 30 | LP001086,Male,No,0,Not Graduate,No,1442,0,35,360,1,Urban,N 31 | LP001087,Female,No,2,Graduate,,3750,2083,120,360,1,Semiurban,Y 32 | LP001091,Male,Yes,1,Graduate,,4166,3369,201,360,,Urban,N 33 | LP001095,Male,No,0,Graduate,No,3167,0,74,360,1,Urban,N 34 | LP001097,Male,No,1,Graduate,Yes,4692,0,106,360,1,Rural,N 35 | LP001098,Male,Yes,0,Graduate,No,3500,1667,114,360,1,Semiurban,Y 36 | LP001100,Male,No,3+,Graduate,No,12500,3000,320,360,1,Rural,N 37 | LP001106,Male,Yes,0,Graduate,No,2275,2067,,360,1,Urban,Y 38 | LP001109,Male,Yes,0,Graduate,No,1828,1330,100,,0,Urban,N 39 | LP001112,Female,Yes,0,Graduate,No,3667,1459,144,360,1,Semiurban,Y 40 | LP001114,Male,No,0,Graduate,No,4166,7210,184,360,1,Urban,Y 41 | LP001116,Male,No,0,Not Graduate,No,3748,1668,110,360,1,Semiurban,Y 42 | LP001119,Male,No,0,Graduate,No,3600,0,80,360,1,Urban,N 43 | LP001120,Male,No,0,Graduate,No,1800,1213,47,360,1,Urban,Y 44 | LP001123,Male,Yes,0,Graduate,No,2400,0,75,360,,Urban,Y 45 | 
LP001131,Male,Yes,0,Graduate,No,3941,2336,134,360,1,Semiurban,Y 46 | LP001136,Male,Yes,0,Not Graduate,Yes,4695,0,96,,1,Urban,Y 47 | LP001137,Female,No,0,Graduate,No,3410,0,88,,1,Urban,Y 48 | LP001138,Male,Yes,1,Graduate,No,5649,0,44,360,1,Urban,Y 49 | LP001144,Male,Yes,0,Graduate,No,5821,0,144,360,1,Urban,Y 50 | LP001146,Female,Yes,0,Graduate,No,2645,3440,120,360,0,Urban,N 51 | LP001151,Female,No,0,Graduate,No,4000,2275,144,360,1,Semiurban,Y 52 | LP001155,Female,Yes,0,Not Graduate,No,1928,1644,100,360,1,Semiurban,Y 53 | LP001157,Female,No,0,Graduate,No,3086,0,120,360,1,Semiurban,Y 54 | LP001164,Female,No,0,Graduate,No,4230,0,112,360,1,Semiurban,N 55 | LP001179,Male,Yes,2,Graduate,No,4616,0,134,360,1,Urban,N 56 | LP001186,Female,Yes,1,Graduate,Yes,11500,0,286,360,0,Urban,N 57 | LP001194,Male,Yes,2,Graduate,No,2708,1167,97,360,1,Semiurban,Y 58 | LP001195,Male,Yes,0,Graduate,No,2132,1591,96,360,1,Semiurban,Y 59 | LP001197,Male,Yes,0,Graduate,No,3366,2200,135,360,1,Rural,N 60 | LP001198,Male,Yes,1,Graduate,No,8080,2250,180,360,1,Urban,Y 61 | LP001199,Male,Yes,2,Not Graduate,No,3357,2859,144,360,1,Urban,Y 62 | LP001205,Male,Yes,0,Graduate,No,2500,3796,120,360,1,Urban,Y 63 | LP001206,Male,Yes,3+,Graduate,No,3029,0,99,360,1,Urban,Y 64 | LP001207,Male,Yes,0,Not Graduate,Yes,2609,3449,165,180,0,Rural,N 65 | LP001213,Male,Yes,1,Graduate,No,4945,0,,360,0,Rural,N 66 | LP001222,Female,No,0,Graduate,No,4166,0,116,360,0,Semiurban,N 67 | LP001225,Male,Yes,0,Graduate,No,5726,4595,258,360,1,Semiurban,N 68 | LP001228,Male,No,0,Not Graduate,No,3200,2254,126,180,0,Urban,N 69 | LP001233,Male,Yes,1,Graduate,No,10750,0,312,360,1,Urban,Y 70 | LP001238,Male,Yes,3+,Not Graduate,Yes,7100,0,125,60,1,Urban,Y 71 | LP001241,Female,No,0,Graduate,No,4300,0,136,360,0,Semiurban,N 72 | LP001243,Male,Yes,0,Graduate,No,3208,3066,172,360,1,Urban,Y 73 | LP001245,Male,Yes,2,Not Graduate,Yes,1875,1875,97,360,1,Semiurban,Y 74 | LP001248,Male,No,0,Graduate,No,3500,0,81,300,1,Semiurban,Y 75 | LP001250,Male,Yes,3+,Not Graduate,No,4755,0,95,,0,Semiurban,N 76 | LP001253,Male,Yes,3+,Graduate,Yes,5266,1774,187,360,1,Semiurban,Y 77 | LP001255,Male,No,0,Graduate,No,3750,0,113,480,1,Urban,N 78 | LP001256,Male,No,0,Graduate,No,3750,4750,176,360,1,Urban,N 79 | LP001259,Male,Yes,1,Graduate,Yes,1000,3022,110,360,1,Urban,N 80 | LP001263,Male,Yes,3+,Graduate,No,3167,4000,180,300,0,Semiurban,N 81 | LP001264,Male,Yes,3+,Not Graduate,Yes,3333,2166,130,360,,Semiurban,Y 82 | LP001265,Female,No,0,Graduate,No,3846,0,111,360,1,Semiurban,Y 83 | LP001266,Male,Yes,1,Graduate,Yes,2395,0,,360,1,Semiurban,Y 84 | LP001267,Female,Yes,2,Graduate,No,1378,1881,167,360,1,Urban,N 85 | LP001273,Male,Yes,0,Graduate,No,6000,2250,265,360,,Semiurban,N 86 | LP001275,Male,Yes,1,Graduate,No,3988,0,50,240,1,Urban,Y 87 | LP001279,Male,No,0,Graduate,No,2366,2531,136,360,1,Semiurban,Y 88 | LP001280,Male,Yes,2,Not Graduate,No,3333,2000,99,360,,Semiurban,Y 89 | LP001282,Male,Yes,0,Graduate,No,2500,2118,104,360,1,Semiurban,Y 90 | LP001289,Male,No,0,Graduate,No,8566,0,210,360,1,Urban,Y 91 | LP001310,Male,Yes,0,Graduate,No,5695,4167,175,360,1,Semiurban,Y 92 | LP001316,Male,Yes,0,Graduate,No,2958,2900,131,360,1,Semiurban,Y 93 | LP001318,Male,Yes,2,Graduate,No,6250,5654,188,180,1,Semiurban,Y 94 | LP001319,Male,Yes,2,Not Graduate,No,3273,1820,81,360,1,Urban,Y 95 | LP001322,Male,No,0,Graduate,No,4133,0,122,360,1,Semiurban,Y 96 | LP001325,Male,No,0,Not Graduate,No,3620,0,25,120,1,Semiurban,Y 97 | LP001326,Male,No,0,Graduate,,6782,0,,360,,Urban,N 98 | 
LP001327,Female,Yes,0,Graduate,No,2484,2302,137,360,1,Semiurban,Y 99 | LP001333,Male,Yes,0,Graduate,No,1977,997,50,360,1,Semiurban,Y 100 | LP001334,Male,Yes,0,Not Graduate,No,4188,0,115,180,1,Semiurban,Y 101 | LP001343,Male,Yes,0,Graduate,No,1759,3541,131,360,1,Semiurban,Y 102 | LP001345,Male,Yes,2,Not Graduate,No,4288,3263,133,180,1,Urban,Y 103 | LP001349,Male,No,0,Graduate,No,4843,3806,151,360,1,Semiurban,Y 104 | LP001350,Male,Yes,,Graduate,No,13650,0,,360,1,Urban,Y 105 | LP001356,Male,Yes,0,Graduate,No,4652,3583,,360,1,Semiurban,Y 106 | LP001357,Male,,,Graduate,No,3816,754,160,360,1,Urban,Y 107 | LP001367,Male,Yes,1,Graduate,No,3052,1030,100,360,1,Urban,Y 108 | LP001369,Male,Yes,2,Graduate,No,11417,1126,225,360,1,Urban,Y 109 | LP001370,Male,No,0,Not Graduate,,7333,0,120,360,1,Rural,N 110 | LP001379,Male,Yes,2,Graduate,No,3800,3600,216,360,0,Urban,N 111 | LP001384,Male,Yes,3+,Not Graduate,No,2071,754,94,480,1,Semiurban,Y 112 | LP001385,Male,No,0,Graduate,No,5316,0,136,360,1,Urban,Y 113 | LP001387,Female,Yes,0,Graduate,,2929,2333,139,360,1,Semiurban,Y 114 | LP001391,Male,Yes,0,Not Graduate,No,3572,4114,152,,0,Rural,N 115 | LP001392,Female,No,1,Graduate,Yes,7451,0,,360,1,Semiurban,Y 116 | LP001398,Male,No,0,Graduate,,5050,0,118,360,1,Semiurban,Y 117 | LP001401,Male,Yes,1,Graduate,No,14583,0,185,180,1,Rural,Y 118 | LP001404,Female,Yes,0,Graduate,No,3167,2283,154,360,1,Semiurban,Y 119 | LP001405,Male,Yes,1,Graduate,No,2214,1398,85,360,,Urban,Y 120 | LP001421,Male,Yes,0,Graduate,No,5568,2142,175,360,1,Rural,N 121 | LP001422,Female,No,0,Graduate,No,10408,0,259,360,1,Urban,Y 122 | LP001426,Male,Yes,,Graduate,No,5667,2667,180,360,1,Rural,Y 123 | LP001430,Female,No,0,Graduate,No,4166,0,44,360,1,Semiurban,Y 124 | LP001431,Female,No,0,Graduate,No,2137,8980,137,360,0,Semiurban,Y 125 | LP001432,Male,Yes,2,Graduate,No,2957,0,81,360,1,Semiurban,Y 126 | LP001439,Male,Yes,0,Not Graduate,No,4300,2014,194,360,1,Rural,Y 127 | LP001443,Female,No,0,Graduate,No,3692,0,93,360,,Rural,Y 128 | LP001448,,Yes,3+,Graduate,No,23803,0,370,360,1,Rural,Y 129 | LP001449,Male,No,0,Graduate,No,3865,1640,,360,1,Rural,Y 130 | LP001451,Male,Yes,1,Graduate,Yes,10513,3850,160,180,0,Urban,N 131 | LP001465,Male,Yes,0,Graduate,No,6080,2569,182,360,,Rural,N 132 | LP001469,Male,No,0,Graduate,Yes,20166,0,650,480,,Urban,Y 133 | LP001473,Male,No,0,Graduate,No,2014,1929,74,360,1,Urban,Y 134 | LP001478,Male,No,0,Graduate,No,2718,0,70,360,1,Semiurban,Y 135 | LP001482,Male,Yes,0,Graduate,Yes,3459,0,25,120,1,Semiurban,Y 136 | LP001487,Male,No,0,Graduate,No,4895,0,102,360,1,Semiurban,Y 137 | LP001488,Male,Yes,3+,Graduate,No,4000,7750,290,360,1,Semiurban,N 138 | LP001489,Female,Yes,0,Graduate,No,4583,0,84,360,1,Rural,N 139 | LP001491,Male,Yes,2,Graduate,Yes,3316,3500,88,360,1,Urban,Y 140 | LP001492,Male,No,0,Graduate,No,14999,0,242,360,0,Semiurban,N 141 | LP001493,Male,Yes,2,Not Graduate,No,4200,1430,129,360,1,Rural,N 142 | LP001497,Male,Yes,2,Graduate,No,5042,2083,185,360,1,Rural,N 143 | LP001498,Male,No,0,Graduate,No,5417,0,168,360,1,Urban,Y 144 | LP001504,Male,No,0,Graduate,Yes,6950,0,175,180,1,Semiurban,Y 145 | LP001507,Male,Yes,0,Graduate,No,2698,2034,122,360,1,Semiurban,Y 146 | LP001508,Male,Yes,2,Graduate,No,11757,0,187,180,1,Urban,Y 147 | LP001514,Female,Yes,0,Graduate,No,2330,4486,100,360,1,Semiurban,Y 148 | LP001516,Female,Yes,2,Graduate,No,14866,0,70,360,1,Urban,Y 149 | LP001518,Male,Yes,1,Graduate,No,1538,1425,30,360,1,Urban,Y 150 | LP001519,Female,No,0,Graduate,No,10000,1666,225,360,1,Rural,N 151 | 
LP001520,Male,Yes,0,Graduate,No,4860,830,125,360,1,Semiurban,Y 152 | LP001528,Male,No,0,Graduate,No,6277,0,118,360,0,Rural,N 153 | LP001529,Male,Yes,0,Graduate,Yes,2577,3750,152,360,1,Rural,Y 154 | LP001531,Male,No,0,Graduate,No,9166,0,244,360,1,Urban,N 155 | LP001532,Male,Yes,2,Not Graduate,No,2281,0,113,360,1,Rural,N 156 | LP001535,Male,No,0,Graduate,No,3254,0,50,360,1,Urban,Y 157 | LP001536,Male,Yes,3+,Graduate,No,39999,0,600,180,0,Semiurban,Y 158 | LP001541,Male,Yes,1,Graduate,No,6000,0,160,360,,Rural,Y 159 | LP001543,Male,Yes,1,Graduate,No,9538,0,187,360,1,Urban,Y 160 | LP001546,Male,No,0,Graduate,,2980,2083,120,360,1,Rural,Y 161 | LP001552,Male,Yes,0,Graduate,No,4583,5625,255,360,1,Semiurban,Y 162 | LP001560,Male,Yes,0,Not Graduate,No,1863,1041,98,360,1,Semiurban,Y 163 | LP001562,Male,Yes,0,Graduate,No,7933,0,275,360,1,Urban,N 164 | LP001565,Male,Yes,1,Graduate,No,3089,1280,121,360,0,Semiurban,N 165 | LP001570,Male,Yes,2,Graduate,No,4167,1447,158,360,1,Rural,Y 166 | LP001572,Male,Yes,0,Graduate,No,9323,0,75,180,1,Urban,Y 167 | LP001574,Male,Yes,0,Graduate,No,3707,3166,182,,1,Rural,Y 168 | LP001577,Female,Yes,0,Graduate,No,4583,0,112,360,1,Rural,N 169 | LP001578,Male,Yes,0,Graduate,No,2439,3333,129,360,1,Rural,Y 170 | LP001579,Male,No,0,Graduate,No,2237,0,63,480,0,Semiurban,N 171 | LP001580,Male,Yes,2,Graduate,No,8000,0,200,360,1,Semiurban,Y 172 | LP001581,Male,Yes,0,Not Graduate,,1820,1769,95,360,1,Rural,Y 173 | LP001585,,Yes,3+,Graduate,No,51763,0,700,300,1,Urban,Y 174 | LP001586,Male,Yes,3+,Not Graduate,No,3522,0,81,180,1,Rural,N 175 | LP001594,Male,Yes,0,Graduate,No,5708,5625,187,360,1,Semiurban,Y 176 | LP001603,Male,Yes,0,Not Graduate,Yes,4344,736,87,360,1,Semiurban,N 177 | LP001606,Male,Yes,0,Graduate,No,3497,1964,116,360,1,Rural,Y 178 | LP001608,Male,Yes,2,Graduate,No,2045,1619,101,360,1,Rural,Y 179 | LP001610,Male,Yes,3+,Graduate,No,5516,11300,495,360,0,Semiurban,N 180 | LP001616,Male,Yes,1,Graduate,No,3750,0,116,360,1,Semiurban,Y 181 | LP001630,Male,No,0,Not Graduate,No,2333,1451,102,480,0,Urban,N 182 | LP001633,Male,Yes,1,Graduate,No,6400,7250,180,360,0,Urban,N 183 | LP001634,Male,No,0,Graduate,No,1916,5063,67,360,,Rural,N 184 | LP001636,Male,Yes,0,Graduate,No,4600,0,73,180,1,Semiurban,Y 185 | LP001637,Male,Yes,1,Graduate,No,33846,0,260,360,1,Semiurban,N 186 | LP001639,Female,Yes,0,Graduate,No,3625,0,108,360,1,Semiurban,Y 187 | LP001640,Male,Yes,0,Graduate,Yes,39147,4750,120,360,1,Semiurban,Y 188 | LP001641,Male,Yes,1,Graduate,Yes,2178,0,66,300,0,Rural,N 189 | LP001643,Male,Yes,0,Graduate,No,2383,2138,58,360,,Rural,Y 190 | LP001644,,Yes,0,Graduate,Yes,674,5296,168,360,1,Rural,Y 191 | LP001647,Male,Yes,0,Graduate,No,9328,0,188,180,1,Rural,Y 192 | LP001653,Male,No,0,Not Graduate,No,4885,0,48,360,1,Rural,Y 193 | LP001656,Male,No,0,Graduate,No,12000,0,164,360,1,Semiurban,N 194 | LP001657,Male,Yes,0,Not Graduate,No,6033,0,160,360,1,Urban,N 195 | LP001658,Male,No,0,Graduate,No,3858,0,76,360,1,Semiurban,Y 196 | LP001664,Male,No,0,Graduate,No,4191,0,120,360,1,Rural,Y 197 | LP001665,Male,Yes,1,Graduate,No,3125,2583,170,360,1,Semiurban,N 198 | LP001666,Male,No,0,Graduate,No,8333,3750,187,360,1,Rural,Y 199 | LP001669,Female,No,0,Not Graduate,No,1907,2365,120,,1,Urban,Y 200 | LP001671,Female,Yes,0,Graduate,No,3416,2816,113,360,,Semiurban,Y 201 | LP001673,Male,No,0,Graduate,Yes,11000,0,83,360,1,Urban,N 202 | LP001674,Male,Yes,1,Not Graduate,No,2600,2500,90,360,1,Semiurban,Y 203 | LP001677,Male,No,2,Graduate,No,4923,0,166,360,0,Semiurban,Y 204 | LP001682,Male,Yes,3+,Not 
Graduate,No,3992,0,,180,1,Urban,N 205 | LP001688,Male,Yes,1,Not Graduate,No,3500,1083,135,360,1,Urban,Y 206 | LP001691,Male,Yes,2,Not Graduate,No,3917,0,124,360,1,Semiurban,Y 207 | LP001692,Female,No,0,Not Graduate,No,4408,0,120,360,1,Semiurban,Y 208 | LP001693,Female,No,0,Graduate,No,3244,0,80,360,1,Urban,Y 209 | LP001698,Male,No,0,Not Graduate,No,3975,2531,55,360,1,Rural,Y 210 | LP001699,Male,No,0,Graduate,No,2479,0,59,360,1,Urban,Y 211 | LP001702,Male,No,0,Graduate,No,3418,0,127,360,1,Semiurban,N 212 | LP001708,Female,No,0,Graduate,No,10000,0,214,360,1,Semiurban,N 213 | LP001711,Male,Yes,3+,Graduate,No,3430,1250,128,360,0,Semiurban,N 214 | LP001713,Male,Yes,1,Graduate,Yes,7787,0,240,360,1,Urban,Y 215 | LP001715,Male,Yes,3+,Not Graduate,Yes,5703,0,130,360,1,Rural,Y 216 | LP001716,Male,Yes,0,Graduate,No,3173,3021,137,360,1,Urban,Y 217 | LP001720,Male,Yes,3+,Not Graduate,No,3850,983,100,360,1,Semiurban,Y 218 | LP001722,Male,Yes,0,Graduate,No,150,1800,135,360,1,Rural,N 219 | LP001726,Male,Yes,0,Graduate,No,3727,1775,131,360,1,Semiurban,Y 220 | LP001732,Male,Yes,2,Graduate,,5000,0,72,360,0,Semiurban,N 221 | LP001734,Female,Yes,2,Graduate,No,4283,2383,127,360,,Semiurban,Y 222 | LP001736,Male,Yes,0,Graduate,No,2221,0,60,360,0,Urban,N 223 | LP001743,Male,Yes,2,Graduate,No,4009,1717,116,360,1,Semiurban,Y 224 | LP001744,Male,No,0,Graduate,No,2971,2791,144,360,1,Semiurban,Y 225 | LP001749,Male,Yes,0,Graduate,No,7578,1010,175,,1,Semiurban,Y 226 | LP001750,Male,Yes,0,Graduate,No,6250,0,128,360,1,Semiurban,Y 227 | LP001751,Male,Yes,0,Graduate,No,3250,0,170,360,1,Rural,N 228 | LP001754,Male,Yes,,Not Graduate,Yes,4735,0,138,360,1,Urban,N 229 | LP001758,Male,Yes,2,Graduate,No,6250,1695,210,360,1,Semiurban,Y 230 | LP001760,Male,,,Graduate,No,4758,0,158,480,1,Semiurban,Y 231 | LP001761,Male,No,0,Graduate,Yes,6400,0,200,360,1,Rural,Y 232 | LP001765,Male,Yes,1,Graduate,No,2491,2054,104,360,1,Semiurban,Y 233 | LP001768,Male,Yes,0,Graduate,,3716,0,42,180,1,Rural,Y 234 | LP001770,Male,No,0,Not Graduate,No,3189,2598,120,,1,Rural,Y 235 | LP001776,Female,No,0,Graduate,No,8333,0,280,360,1,Semiurban,Y 236 | LP001778,Male,Yes,1,Graduate,No,3155,1779,140,360,1,Semiurban,Y 237 | LP001784,Male,Yes,1,Graduate,No,5500,1260,170,360,1,Rural,Y 238 | LP001786,Male,Yes,0,Graduate,,5746,0,255,360,,Urban,N 239 | LP001788,Female,No,0,Graduate,Yes,3463,0,122,360,,Urban,Y 240 | LP001790,Female,No,1,Graduate,No,3812,0,112,360,1,Rural,Y 241 | LP001792,Male,Yes,1,Graduate,No,3315,0,96,360,1,Semiurban,Y 242 | LP001798,Male,Yes,2,Graduate,No,5819,5000,120,360,1,Rural,Y 243 | LP001800,Male,Yes,1,Not Graduate,No,2510,1983,140,180,1,Urban,N 244 | LP001806,Male,No,0,Graduate,No,2965,5701,155,60,1,Urban,Y 245 | LP001807,Male,Yes,2,Graduate,Yes,6250,1300,108,360,1,Rural,Y 246 | LP001811,Male,Yes,0,Not Graduate,No,3406,4417,123,360,1,Semiurban,Y 247 | LP001813,Male,No,0,Graduate,Yes,6050,4333,120,180,1,Urban,N 248 | LP001814,Male,Yes,2,Graduate,No,9703,0,112,360,1,Urban,Y 249 | LP001819,Male,Yes,1,Not Graduate,No,6608,0,137,180,1,Urban,Y 250 | LP001824,Male,Yes,1,Graduate,No,2882,1843,123,480,1,Semiurban,Y 251 | LP001825,Male,Yes,0,Graduate,No,1809,1868,90,360,1,Urban,Y 252 | LP001835,Male,Yes,0,Not Graduate,No,1668,3890,201,360,0,Semiurban,N 253 | LP001836,Female,No,2,Graduate,No,3427,0,138,360,1,Urban,N 254 | LP001841,Male,No,0,Not Graduate,Yes,2583,2167,104,360,1,Rural,Y 255 | LP001843,Male,Yes,1,Not Graduate,No,2661,7101,279,180,1,Semiurban,Y 256 | LP001844,Male,No,0,Graduate,Yes,16250,0,192,360,0,Urban,N 257 | 
LP001846,Female,No,3+,Graduate,No,3083,0,255,360,1,Rural,Y 258 | LP001849,Male,No,0,Not Graduate,No,6045,0,115,360,0,Rural,N 259 | LP001854,Male,Yes,3+,Graduate,No,5250,0,94,360,1,Urban,N 260 | LP001859,Male,Yes,0,Graduate,No,14683,2100,304,360,1,Rural,N 261 | LP001864,Male,Yes,3+,Not Graduate,No,4931,0,128,360,,Semiurban,N 262 | LP001865,Male,Yes,1,Graduate,No,6083,4250,330,360,,Urban,Y 263 | LP001868,Male,No,0,Graduate,No,2060,2209,134,360,1,Semiurban,Y 264 | LP001870,Female,No,1,Graduate,No,3481,0,155,36,1,Semiurban,N 265 | LP001871,Female,No,0,Graduate,No,7200,0,120,360,1,Rural,Y 266 | LP001872,Male,No,0,Graduate,Yes,5166,0,128,360,1,Semiurban,Y 267 | LP001875,Male,No,0,Graduate,No,4095,3447,151,360,1,Rural,Y 268 | LP001877,Male,Yes,2,Graduate,No,4708,1387,150,360,1,Semiurban,Y 269 | LP001882,Male,Yes,3+,Graduate,No,4333,1811,160,360,0,Urban,Y 270 | LP001883,Female,No,0,Graduate,,3418,0,135,360,1,Rural,N 271 | LP001884,Female,No,1,Graduate,No,2876,1560,90,360,1,Urban,Y 272 | LP001888,Female,No,0,Graduate,No,3237,0,30,360,1,Urban,Y 273 | LP001891,Male,Yes,0,Graduate,No,11146,0,136,360,1,Urban,Y 274 | LP001892,Male,No,0,Graduate,No,2833,1857,126,360,1,Rural,Y 275 | LP001894,Male,Yes,0,Graduate,No,2620,2223,150,360,1,Semiurban,Y 276 | LP001896,Male,Yes,2,Graduate,No,3900,0,90,360,1,Semiurban,Y 277 | LP001900,Male,Yes,1,Graduate,No,2750,1842,115,360,1,Semiurban,Y 278 | LP001903,Male,Yes,0,Graduate,No,3993,3274,207,360,1,Semiurban,Y 279 | LP001904,Male,Yes,0,Graduate,No,3103,1300,80,360,1,Urban,Y 280 | LP001907,Male,Yes,0,Graduate,No,14583,0,436,360,1,Semiurban,Y 281 | LP001908,Female,Yes,0,Not Graduate,No,4100,0,124,360,,Rural,Y 282 | LP001910,Male,No,1,Not Graduate,Yes,4053,2426,158,360,0,Urban,N 283 | LP001914,Male,Yes,0,Graduate,No,3927,800,112,360,1,Semiurban,Y 284 | LP001915,Male,Yes,2,Graduate,No,2301,985.7999878,78,180,1,Urban,Y 285 | LP001917,Female,No,0,Graduate,No,1811,1666,54,360,1,Urban,Y 286 | LP001922,Male,Yes,0,Graduate,No,20667,0,,360,1,Rural,N 287 | LP001924,Male,No,0,Graduate,No,3158,3053,89,360,1,Rural,Y 288 | LP001925,Female,No,0,Graduate,Yes,2600,1717,99,300,1,Semiurban,N 289 | LP001926,Male,Yes,0,Graduate,No,3704,2000,120,360,1,Rural,Y 290 | LP001931,Female,No,0,Graduate,No,4124,0,115,360,1,Semiurban,Y 291 | LP001935,Male,No,0,Graduate,No,9508,0,187,360,1,Rural,Y 292 | LP001936,Male,Yes,0,Graduate,No,3075,2416,139,360,1,Rural,Y 293 | LP001938,Male,Yes,2,Graduate,No,4400,0,127,360,0,Semiurban,N 294 | LP001940,Male,Yes,2,Graduate,No,3153,1560,134,360,1,Urban,Y 295 | LP001945,Female,No,,Graduate,No,5417,0,143,480,0,Urban,N 296 | LP001947,Male,Yes,0,Graduate,No,2383,3334,172,360,1,Semiurban,Y 297 | LP001949,Male,Yes,3+,Graduate,,4416,1250,110,360,1,Urban,Y 298 | LP001953,Male,Yes,1,Graduate,No,6875,0,200,360,1,Semiurban,Y 299 | LP001954,Female,Yes,1,Graduate,No,4666,0,135,360,1,Urban,Y 300 | LP001955,Female,No,0,Graduate,No,5000,2541,151,480,1,Rural,N 301 | LP001963,Male,Yes,1,Graduate,No,2014,2925,113,360,1,Urban,N 302 | LP001964,Male,Yes,0,Not Graduate,No,1800,2934,93,360,0,Urban,N 303 | LP001972,Male,Yes,,Not Graduate,No,2875,1750,105,360,1,Semiurban,Y 304 | LP001974,Female,No,0,Graduate,No,5000,0,132,360,1,Rural,Y 305 | LP001977,Male,Yes,1,Graduate,No,1625,1803,96,360,1,Urban,Y 306 | LP001978,Male,No,0,Graduate,No,4000,2500,140,360,1,Rural,Y 307 | LP001990,Male,No,0,Not Graduate,No,2000,0,,360,1,Urban,N 308 | LP001993,Female,No,0,Graduate,No,3762,1666,135,360,1,Rural,Y 309 | LP001994,Female,No,0,Graduate,No,2400,1863,104,360,0,Urban,N 310 | 
LP001996,Male,No,0,Graduate,No,20233,0,480,360,1,Rural,N 311 | LP001998,Male,Yes,2,Not Graduate,No,7667,0,185,360,,Rural,Y 312 | LP002002,Female,No,0,Graduate,No,2917,0,84,360,1,Semiurban,Y 313 | LP002004,Male,No,0,Not Graduate,No,2927,2405,111,360,1,Semiurban,Y 314 | LP002006,Female,No,0,Graduate,No,2507,0,56,360,1,Rural,Y 315 | LP002008,Male,Yes,2,Graduate,Yes,5746,0,144,84,,Rural,Y 316 | LP002024,,Yes,0,Graduate,No,2473,1843,159,360,1,Rural,N 317 | LP002031,Male,Yes,1,Not Graduate,No,3399,1640,111,180,1,Urban,Y 318 | LP002035,Male,Yes,2,Graduate,No,3717,0,120,360,1,Semiurban,Y 319 | LP002036,Male,Yes,0,Graduate,No,2058,2134,88,360,,Urban,Y 320 | LP002043,Female,No,1,Graduate,No,3541,0,112,360,,Semiurban,Y 321 | LP002050,Male,Yes,1,Graduate,Yes,10000,0,155,360,1,Rural,N 322 | LP002051,Male,Yes,0,Graduate,No,2400,2167,115,360,1,Semiurban,Y 323 | LP002053,Male,Yes,3+,Graduate,No,4342,189,124,360,1,Semiurban,Y 324 | LP002054,Male,Yes,2,Not Graduate,No,3601,1590,,360,1,Rural,Y 325 | LP002055,Female,No,0,Graduate,No,3166,2985,132,360,,Rural,Y 326 | LP002065,Male,Yes,3+,Graduate,No,15000,0,300,360,1,Rural,Y 327 | LP002067,Male,Yes,1,Graduate,Yes,8666,4983,376,360,0,Rural,N 328 | LP002068,Male,No,0,Graduate,No,4917,0,130,360,0,Rural,Y 329 | LP002082,Male,Yes,0,Graduate,Yes,5818,2160,184,360,1,Semiurban,Y 330 | LP002086,Female,Yes,0,Graduate,No,4333,2451,110,360,1,Urban,N 331 | LP002087,Female,No,0,Graduate,No,2500,0,67,360,1,Urban,Y 332 | LP002097,Male,No,1,Graduate,No,4384,1793,117,360,1,Urban,Y 333 | LP002098,Male,No,0,Graduate,No,2935,0,98,360,1,Semiurban,Y 334 | LP002100,Male,No,,Graduate,No,2833,0,71,360,1,Urban,Y 335 | LP002101,Male,Yes,0,Graduate,,63337,0,490,180,1,Urban,Y 336 | LP002103,,Yes,1,Graduate,Yes,9833,1833,182,180,1,Urban,Y 337 | LP002106,Male,Yes,,Graduate,Yes,5503,4490,70,,1,Semiurban,Y 338 | LP002110,Male,Yes,1,Graduate,,5250,688,160,360,1,Rural,Y 339 | LP002112,Male,Yes,2,Graduate,Yes,2500,4600,176,360,1,Rural,Y 340 | LP002113,Female,No,3+,Not Graduate,No,1830,0,,360,0,Urban,N 341 | LP002114,Female,No,0,Graduate,No,4160,0,71,360,1,Semiurban,Y 342 | LP002115,Male,Yes,3+,Not Graduate,No,2647,1587,173,360,1,Rural,N 343 | LP002116,Female,No,0,Graduate,No,2378,0,46,360,1,Rural,N 344 | LP002119,Male,Yes,1,Not Graduate,No,4554,1229,158,360,1,Urban,Y 345 | LP002126,Male,Yes,3+,Not Graduate,No,3173,0,74,360,1,Semiurban,Y 346 | LP002128,Male,Yes,2,Graduate,,2583,2330,125,360,1,Rural,Y 347 | LP002129,Male,Yes,0,Graduate,No,2499,2458,160,360,1,Semiurban,Y 348 | LP002130,Male,Yes,,Not Graduate,No,3523,3230,152,360,0,Rural,N 349 | LP002131,Male,Yes,2,Not Graduate,No,3083,2168,126,360,1,Urban,Y 350 | LP002137,Male,Yes,0,Graduate,No,6333,4583,259,360,,Semiurban,Y 351 | LP002138,Male,Yes,0,Graduate,No,2625,6250,187,360,1,Rural,Y 352 | LP002139,Male,Yes,0,Graduate,No,9083,0,228,360,1,Semiurban,Y 353 | LP002140,Male,No,0,Graduate,No,8750,4167,308,360,1,Rural,N 354 | LP002141,Male,Yes,3+,Graduate,No,2666,2083,95,360,1,Rural,Y 355 | LP002142,Female,Yes,0,Graduate,Yes,5500,0,105,360,0,Rural,N 356 | LP002143,Female,Yes,0,Graduate,No,2423,505,130,360,1,Semiurban,Y 357 | LP002144,Female,No,,Graduate,No,3813,0,116,180,1,Urban,Y 358 | LP002149,Male,Yes,2,Graduate,No,8333,3167,165,360,1,Rural,Y 359 | LP002151,Male,Yes,1,Graduate,No,3875,0,67,360,1,Urban,N 360 | LP002158,Male,Yes,0,Not Graduate,No,3000,1666,100,480,0,Urban,N 361 | LP002160,Male,Yes,3+,Graduate,No,5167,3167,200,360,1,Semiurban,Y 362 | LP002161,Female,No,1,Graduate,No,4723,0,81,360,1,Semiurban,N 363 | 
LP002170,Male,Yes,2,Graduate,No,5000,3667,236,360,1,Semiurban,Y 364 | LP002175,Male,Yes,0,Graduate,No,4750,2333,130,360,1,Urban,Y 365 | LP002178,Male,Yes,0,Graduate,No,3013,3033,95,300,,Urban,Y 366 | LP002180,Male,No,0,Graduate,Yes,6822,0,141,360,1,Rural,Y 367 | LP002181,Male,No,0,Not Graduate,No,6216,0,133,360,1,Rural,N 368 | LP002187,Male,No,0,Graduate,No,2500,0,96,480,1,Semiurban,N 369 | LP002188,Male,No,0,Graduate,No,5124,0,124,,0,Rural,N 370 | LP002190,Male,Yes,1,Graduate,No,6325,0,175,360,1,Semiurban,Y 371 | LP002191,Male,Yes,0,Graduate,No,19730,5266,570,360,1,Rural,N 372 | LP002194,Female,No,0,Graduate,Yes,15759,0,55,360,1,Semiurban,Y 373 | LP002197,Male,Yes,2,Graduate,No,5185,0,155,360,1,Semiurban,Y 374 | LP002201,Male,Yes,2,Graduate,Yes,9323,7873,380,300,1,Rural,Y 375 | LP002205,Male,No,1,Graduate,No,3062,1987,111,180,0,Urban,N 376 | LP002209,Female,No,0,Graduate,,2764,1459,110,360,1,Urban,Y 377 | LP002211,Male,Yes,0,Graduate,No,4817,923,120,180,1,Urban,Y 378 | LP002219,Male,Yes,3+,Graduate,No,8750,4996,130,360,1,Rural,Y 379 | LP002223,Male,Yes,0,Graduate,No,4310,0,130,360,,Semiurban,Y 380 | LP002224,Male,No,0,Graduate,No,3069,0,71,480,1,Urban,N 381 | LP002225,Male,Yes,2,Graduate,No,5391,0,130,360,1,Urban,Y 382 | LP002226,Male,Yes,0,Graduate,,3333,2500,128,360,1,Semiurban,Y 383 | LP002229,Male,No,0,Graduate,No,5941,4232,296,360,1,Semiurban,Y 384 | LP002231,Female,No,0,Graduate,No,6000,0,156,360,1,Urban,Y 385 | LP002234,Male,No,0,Graduate,Yes,7167,0,128,360,1,Urban,Y 386 | LP002236,Male,Yes,2,Graduate,No,4566,0,100,360,1,Urban,N 387 | LP002237,Male,No,1,Graduate,,3667,0,113,180,1,Urban,Y 388 | LP002239,Male,No,0,Not Graduate,No,2346,1600,132,360,1,Semiurban,Y 389 | LP002243,Male,Yes,0,Not Graduate,No,3010,3136,,360,0,Urban,N 390 | LP002244,Male,Yes,0,Graduate,No,2333,2417,136,360,1,Urban,Y 391 | LP002250,Male,Yes,0,Graduate,No,5488,0,125,360,1,Rural,Y 392 | LP002255,Male,No,3+,Graduate,No,9167,0,185,360,1,Rural,Y 393 | LP002262,Male,Yes,3+,Graduate,No,9504,0,275,360,1,Rural,Y 394 | LP002263,Male,Yes,0,Graduate,No,2583,2115,120,360,,Urban,Y 395 | LP002265,Male,Yes,2,Not Graduate,No,1993,1625,113,180,1,Semiurban,Y 396 | LP002266,Male,Yes,2,Graduate,No,3100,1400,113,360,1,Urban,Y 397 | LP002272,Male,Yes,2,Graduate,No,3276,484,135,360,,Semiurban,Y 398 | LP002277,Female,No,0,Graduate,No,3180,0,71,360,0,Urban,N 399 | LP002281,Male,Yes,0,Graduate,No,3033,1459,95,360,1,Urban,Y 400 | LP002284,Male,No,0,Not Graduate,No,3902,1666,109,360,1,Rural,Y 401 | LP002287,Female,No,0,Graduate,No,1500,1800,103,360,0,Semiurban,N 402 | LP002288,Male,Yes,2,Not Graduate,No,2889,0,45,180,0,Urban,N 403 | LP002296,Male,No,0,Not Graduate,No,2755,0,65,300,1,Rural,N 404 | LP002297,Male,No,0,Graduate,No,2500,20000,103,360,1,Semiurban,Y 405 | LP002300,Female,No,0,Not Graduate,No,1963,0,53,360,1,Semiurban,Y 406 | LP002301,Female,No,0,Graduate,Yes,7441,0,194,360,1,Rural,N 407 | LP002305,Female,No,0,Graduate,No,4547,0,115,360,1,Semiurban,Y 408 | LP002308,Male,Yes,0,Not Graduate,No,2167,2400,115,360,1,Urban,Y 409 | LP002314,Female,No,0,Not Graduate,No,2213,0,66,360,1,Rural,Y 410 | LP002315,Male,Yes,1,Graduate,No,8300,0,152,300,0,Semiurban,N 411 | LP002317,Male,Yes,3+,Graduate,No,81000,0,360,360,0,Rural,N 412 | LP002318,Female,No,1,Not Graduate,Yes,3867,0,62,360,1,Semiurban,N 413 | LP002319,Male,Yes,0,Graduate,,6256,0,160,360,,Urban,Y 414 | LP002328,Male,Yes,0,Not Graduate,No,6096,0,218,360,0,Rural,N 415 | LP002332,Male,Yes,0,Not Graduate,No,2253,2033,110,360,1,Rural,Y 416 | LP002335,Female,Yes,0,Not 
Graduate,No,2149,3237,178,360,0,Semiurban,N 417 | LP002337,Female,No,0,Graduate,No,2995,0,60,360,1,Urban,Y 418 | LP002341,Female,No,1,Graduate,No,2600,0,160,360,1,Urban,N 419 | LP002342,Male,Yes,2,Graduate,Yes,1600,20000,239,360,1,Urban,N 420 | LP002345,Male,Yes,0,Graduate,No,1025,2773,112,360,1,Rural,Y 421 | LP002347,Male,Yes,0,Graduate,No,3246,1417,138,360,1,Semiurban,Y 422 | LP002348,Male,Yes,0,Graduate,No,5829,0,138,360,1,Rural,Y 423 | LP002357,Female,No,0,Not Graduate,No,2720,0,80,,0,Urban,N 424 | LP002361,Male,Yes,0,Graduate,No,1820,1719,100,360,1,Urban,Y 425 | LP002362,Male,Yes,1,Graduate,No,7250,1667,110,,0,Urban,N 426 | LP002364,Male,Yes,0,Graduate,No,14880,0,96,360,1,Semiurban,Y 427 | LP002366,Male,Yes,0,Graduate,No,2666,4300,121,360,1,Rural,Y 428 | LP002367,Female,No,1,Not Graduate,No,4606,0,81,360,1,Rural,N 429 | LP002368,Male,Yes,2,Graduate,No,5935,0,133,360,1,Semiurban,Y 430 | LP002369,Male,Yes,0,Graduate,No,2920,16.12000084,87,360,1,Rural,Y 431 | LP002370,Male,No,0,Not Graduate,No,2717,0,60,180,1,Urban,Y 432 | LP002377,Female,No,1,Graduate,Yes,8624,0,150,360,1,Semiurban,Y 433 | LP002379,Male,No,0,Graduate,No,6500,0,105,360,0,Rural,N 434 | LP002386,Male,No,0,Graduate,,12876,0,405,360,1,Semiurban,Y 435 | LP002387,Male,Yes,0,Graduate,No,2425,2340,143,360,1,Semiurban,Y 436 | LP002390,Male,No,0,Graduate,No,3750,0,100,360,1,Urban,Y 437 | LP002393,Female,,,Graduate,No,10047,0,,240,1,Semiurban,Y 438 | LP002398,Male,No,0,Graduate,No,1926,1851,50,360,1,Semiurban,Y 439 | LP002401,Male,Yes,0,Graduate,No,2213,1125,,360,1,Urban,Y 440 | LP002403,Male,No,0,Graduate,Yes,10416,0,187,360,0,Urban,N 441 | LP002407,Female,Yes,0,Not Graduate,Yes,7142,0,138,360,1,Rural,Y 442 | LP002408,Male,No,0,Graduate,No,3660,5064,187,360,1,Semiurban,Y 443 | LP002409,Male,Yes,0,Graduate,No,7901,1833,180,360,1,Rural,Y 444 | LP002418,Male,No,3+,Not Graduate,No,4707,1993,148,360,1,Semiurban,Y 445 | LP002422,Male,No,1,Graduate,No,37719,0,152,360,1,Semiurban,Y 446 | LP002424,Male,Yes,0,Graduate,No,7333,8333,175,300,,Rural,Y 447 | LP002429,Male,Yes,1,Graduate,Yes,3466,1210,130,360,1,Rural,Y 448 | LP002434,Male,Yes,2,Not Graduate,No,4652,0,110,360,1,Rural,Y 449 | LP002435,Male,Yes,0,Graduate,,3539,1376,55,360,1,Rural,N 450 | LP002443,Male,Yes,2,Graduate,No,3340,1710,150,360,0,Rural,N 451 | LP002444,Male,No,1,Not Graduate,Yes,2769,1542,190,360,,Semiurban,N 452 | LP002446,Male,Yes,2,Not Graduate,No,2309,1255,125,360,0,Rural,N 453 | LP002447,Male,Yes,2,Not Graduate,No,1958,1456,60,300,,Urban,Y 454 | LP002448,Male,Yes,0,Graduate,No,3948,1733,149,360,0,Rural,N 455 | LP002449,Male,Yes,0,Graduate,No,2483,2466,90,180,0,Rural,Y 456 | LP002453,Male,No,0,Graduate,Yes,7085,0,84,360,1,Semiurban,Y 457 | LP002455,Male,Yes,2,Graduate,No,3859,0,96,360,1,Semiurban,Y 458 | LP002459,Male,Yes,0,Graduate,No,4301,0,118,360,1,Urban,Y 459 | LP002467,Male,Yes,0,Graduate,No,3708,2569,173,360,1,Urban,N 460 | LP002472,Male,No,2,Graduate,No,4354,0,136,360,1,Rural,Y 461 | LP002473,Male,Yes,0,Graduate,No,8334,0,160,360,1,Semiurban,N 462 | LP002478,,Yes,0,Graduate,Yes,2083,4083,160,360,,Semiurban,Y 463 | LP002484,Male,Yes,3+,Graduate,No,7740,0,128,180,1,Urban,Y 464 | LP002487,Male,Yes,0,Graduate,No,3015,2188,153,360,1,Rural,Y 465 | LP002489,Female,No,1,Not Graduate,,5191,0,132,360,1,Semiurban,Y 466 | LP002493,Male,No,0,Graduate,No,4166,0,98,360,0,Semiurban,N 467 | LP002494,Male,No,0,Graduate,No,6000,0,140,360,1,Rural,Y 468 | LP002500,Male,Yes,3+,Not Graduate,No,2947,1664,70,180,0,Urban,N 469 | LP002501,,Yes,0,Graduate,No,16692,0,110,360,1,Semiurban,Y 
470 | LP002502,Female,Yes,2,Not Graduate,,210,2917,98,360,1,Semiurban,Y 471 | LP002505,Male,Yes,0,Graduate,No,4333,2451,110,360,1,Urban,N 472 | LP002515,Male,Yes,1,Graduate,Yes,3450,2079,162,360,1,Semiurban,Y 473 | LP002517,Male,Yes,1,Not Graduate,No,2653,1500,113,180,0,Rural,N 474 | LP002519,Male,Yes,3+,Graduate,No,4691,0,100,360,1,Semiurban,Y 475 | LP002522,Female,No,0,Graduate,Yes,2500,0,93,360,,Urban,Y 476 | LP002524,Male,No,2,Graduate,No,5532,4648,162,360,1,Rural,Y 477 | LP002527,Male,Yes,2,Graduate,Yes,16525,1014,150,360,1,Rural,Y 478 | LP002529,Male,Yes,2,Graduate,No,6700,1750,230,300,1,Semiurban,Y 479 | LP002530,,Yes,2,Graduate,No,2873,1872,132,360,0,Semiurban,N 480 | LP002531,Male,Yes,1,Graduate,Yes,16667,2250,86,360,1,Semiurban,Y 481 | LP002533,Male,Yes,2,Graduate,No,2947,1603,,360,1,Urban,N 482 | LP002534,Female,No,0,Not Graduate,No,4350,0,154,360,1,Rural,Y 483 | LP002536,Male,Yes,3+,Not Graduate,No,3095,0,113,360,1,Rural,Y 484 | LP002537,Male,Yes,0,Graduate,No,2083,3150,128,360,1,Semiurban,Y 485 | LP002541,Male,Yes,0,Graduate,No,10833,0,234,360,1,Semiurban,Y 486 | LP002543,Male,Yes,2,Graduate,No,8333,0,246,360,1,Semiurban,Y 487 | LP002544,Male,Yes,1,Not Graduate,No,1958,2436,131,360,1,Rural,Y 488 | LP002545,Male,No,2,Graduate,No,3547,0,80,360,0,Rural,N 489 | LP002547,Male,Yes,1,Graduate,No,18333,0,500,360,1,Urban,N 490 | LP002555,Male,Yes,2,Graduate,Yes,4583,2083,160,360,1,Semiurban,Y 491 | LP002556,Male,No,0,Graduate,No,2435,0,75,360,1,Urban,N 492 | LP002560,Male,No,0,Not Graduate,No,2699,2785,96,360,,Semiurban,Y 493 | LP002562,Male,Yes,1,Not Graduate,No,5333,1131,186,360,,Urban,Y 494 | LP002571,Male,No,0,Not Graduate,No,3691,0,110,360,1,Rural,Y 495 | LP002582,Female,No,0,Not Graduate,Yes,17263,0,225,360,1,Semiurban,Y 496 | LP002585,Male,Yes,0,Graduate,No,3597,2157,119,360,0,Rural,N 497 | LP002586,Female,Yes,1,Graduate,No,3326,913,105,84,1,Semiurban,Y 498 | LP002587,Male,Yes,0,Not Graduate,No,2600,1700,107,360,1,Rural,Y 499 | LP002588,Male,Yes,0,Graduate,No,4625,2857,111,12,,Urban,Y 500 | LP002600,Male,Yes,1,Graduate,Yes,2895,0,95,360,1,Semiurban,Y 501 | LP002602,Male,No,0,Graduate,No,6283,4416,209,360,0,Rural,N 502 | LP002603,Female,No,0,Graduate,No,645,3683,113,480,1,Rural,Y 503 | LP002606,Female,No,0,Graduate,No,3159,0,100,360,1,Semiurban,Y 504 | LP002615,Male,Yes,2,Graduate,No,4865,5624,208,360,1,Semiurban,Y 505 | LP002618,Male,Yes,1,Not Graduate,No,4050,5302,138,360,,Rural,N 506 | LP002619,Male,Yes,0,Not Graduate,No,3814,1483,124,300,1,Semiurban,Y 507 | LP002622,Male,Yes,2,Graduate,No,3510,4416,243,360,1,Rural,Y 508 | LP002624,Male,Yes,0,Graduate,No,20833,6667,480,360,,Urban,Y 509 | LP002625,,No,0,Graduate,No,3583,0,96,360,1,Urban,N 510 | LP002626,Male,Yes,0,Graduate,Yes,2479,3013,188,360,1,Urban,Y 511 | LP002634,Female,No,1,Graduate,No,13262,0,40,360,1,Urban,Y 512 | LP002637,Male,No,0,Not Graduate,No,3598,1287,100,360,1,Rural,N 513 | LP002640,Male,Yes,1,Graduate,No,6065,2004,250,360,1,Semiurban,Y 514 | LP002643,Male,Yes,2,Graduate,No,3283,2035,148,360,1,Urban,Y 515 | LP002648,Male,Yes,0,Graduate,No,2130,6666,70,180,1,Semiurban,N 516 | LP002652,Male,No,0,Graduate,No,5815,3666,311,360,1,Rural,N 517 | LP002659,Male,Yes,3+,Graduate,No,3466,3428,150,360,1,Rural,Y 518 | LP002670,Female,Yes,2,Graduate,No,2031,1632,113,480,1,Semiurban,Y 519 | LP002682,Male,Yes,,Not Graduate,No,3074,1800,123,360,0,Semiurban,N 520 | LP002683,Male,No,0,Graduate,No,4683,1915,185,360,1,Semiurban,N 521 | LP002684,Female,No,0,Not Graduate,No,3400,0,95,360,1,Rural,N 522 | LP002689,Male,Yes,2,Not 
Graduate,No,2192,1742,45,360,1,Semiurban,Y 523 | LP002690,Male,No,0,Graduate,No,2500,0,55,360,1,Semiurban,Y 524 | LP002692,Male,Yes,3+,Graduate,Yes,5677,1424,100,360,1,Rural,Y 525 | LP002693,Male,Yes,2,Graduate,Yes,7948,7166,480,360,1,Rural,Y 526 | LP002697,Male,No,0,Graduate,No,4680,2087,,360,1,Semiurban,N 527 | LP002699,Male,Yes,2,Graduate,Yes,17500,0,400,360,1,Rural,Y 528 | LP002705,Male,Yes,0,Graduate,No,3775,0,110,360,1,Semiurban,Y 529 | LP002706,Male,Yes,1,Not Graduate,No,5285,1430,161,360,0,Semiurban,Y 530 | LP002714,Male,No,1,Not Graduate,No,2679,1302,94,360,1,Semiurban,Y 531 | LP002716,Male,No,0,Not Graduate,No,6783,0,130,360,1,Semiurban,Y 532 | LP002717,Male,Yes,0,Graduate,No,1025,5500,216,360,,Rural,Y 533 | LP002720,Male,Yes,3+,Graduate,No,4281,0,100,360,1,Urban,Y 534 | LP002723,Male,No,2,Graduate,No,3588,0,110,360,0,Rural,N 535 | LP002729,Male,No,1,Graduate,No,11250,0,196,360,,Semiurban,N 536 | LP002731,Female,No,0,Not Graduate,Yes,18165,0,125,360,1,Urban,Y 537 | LP002732,Male,No,0,Not Graduate,,2550,2042,126,360,1,Rural,Y 538 | LP002734,Male,Yes,0,Graduate,No,6133,3906,324,360,1,Urban,Y 539 | LP002738,Male,No,2,Graduate,No,3617,0,107,360,1,Semiurban,Y 540 | LP002739,Male,Yes,0,Not Graduate,No,2917,536,66,360,1,Rural,N 541 | LP002740,Male,Yes,3+,Graduate,No,6417,0,157,180,1,Rural,Y 542 | LP002741,Female,Yes,1,Graduate,No,4608,2845,140,180,1,Semiurban,Y 543 | LP002743,Female,No,0,Graduate,No,2138,0,99,360,0,Semiurban,N 544 | LP002753,Female,No,1,Graduate,,3652,0,95,360,1,Semiurban,Y 545 | LP002755,Male,Yes,1,Not Graduate,No,2239,2524,128,360,1,Urban,Y 546 | LP002757,Female,Yes,0,Not Graduate,No,3017,663,102,360,,Semiurban,Y 547 | LP002767,Male,Yes,0,Graduate,No,2768,1950,155,360,1,Rural,Y 548 | LP002768,Male,No,0,Not Graduate,No,3358,0,80,36,1,Semiurban,N 549 | LP002772,Male,No,0,Graduate,No,2526,1783,145,360,1,Rural,Y 550 | LP002776,Female,No,0,Graduate,No,5000,0,103,360,0,Semiurban,N 551 | LP002777,Male,Yes,0,Graduate,No,2785,2016,110,360,1,Rural,Y 552 | LP002778,Male,Yes,2,Graduate,Yes,6633,0,,360,0,Rural,N 553 | LP002784,Male,Yes,1,Not Graduate,No,2492,2375,,360,1,Rural,Y 554 | LP002785,Male,Yes,1,Graduate,No,3333,3250,158,360,1,Urban,Y 555 | LP002788,Male,Yes,0,Not Graduate,No,2454,2333,181,360,0,Urban,N 556 | LP002789,Male,Yes,0,Graduate,No,3593,4266,132,180,0,Rural,N 557 | LP002792,Male,Yes,1,Graduate,No,5468,1032,26,360,1,Semiurban,Y 558 | LP002794,Female,No,0,Graduate,No,2667,1625,84,360,,Urban,Y 559 | LP002795,Male,Yes,3+,Graduate,Yes,10139,0,260,360,1,Semiurban,Y 560 | LP002798,Male,Yes,0,Graduate,No,3887,2669,162,360,1,Semiurban,Y 561 | LP002804,Female,Yes,0,Graduate,No,4180,2306,182,360,1,Semiurban,Y 562 | LP002807,Male,Yes,2,Not Graduate,No,3675,242,108,360,1,Semiurban,Y 563 | LP002813,Female,Yes,1,Graduate,Yes,19484,0,600,360,1,Semiurban,Y 564 | LP002820,Male,Yes,0,Graduate,No,5923,2054,211,360,1,Rural,Y 565 | LP002821,Male,No,0,Not Graduate,Yes,5800,0,132,360,1,Semiurban,Y 566 | LP002832,Male,Yes,2,Graduate,No,8799,0,258,360,0,Urban,N 567 | LP002833,Male,Yes,0,Not Graduate,No,4467,0,120,360,,Rural,Y 568 | LP002836,Male,No,0,Graduate,No,3333,0,70,360,1,Urban,Y 569 | LP002837,Male,Yes,3+,Graduate,No,3400,2500,123,360,0,Rural,N 570 | LP002840,Female,No,0,Graduate,No,2378,0,9,360,1,Urban,N 571 | LP002841,Male,Yes,0,Graduate,No,3166,2064,104,360,0,Urban,N 572 | LP002842,Male,Yes,1,Graduate,No,3417,1750,186,360,1,Urban,Y 573 | LP002847,Male,Yes,,Graduate,No,5116,1451,165,360,0,Urban,N 574 | LP002855,Male,Yes,2,Graduate,No,16666,0,275,360,1,Urban,Y 575 | 
LP002862,Male,Yes,2,Not Graduate,No,6125,1625,187,480,1,Semiurban,N 576 | LP002863,Male,Yes,3+,Graduate,No,6406,0,150,360,1,Semiurban,N 577 | LP002868,Male,Yes,2,Graduate,No,3159,461,108,84,1,Urban,Y 578 | LP002872,,Yes,0,Graduate,No,3087,2210,136,360,0,Semiurban,N 579 | LP002874,Male,No,0,Graduate,No,3229,2739,110,360,1,Urban,Y 580 | LP002877,Male,Yes,1,Graduate,No,1782,2232,107,360,1,Rural,Y 581 | LP002888,Male,No,0,Graduate,,3182,2917,161,360,1,Urban,Y 582 | LP002892,Male,Yes,2,Graduate,No,6540,0,205,360,1,Semiurban,Y 583 | LP002893,Male,No,0,Graduate,No,1836,33837,90,360,1,Urban,N 584 | LP002894,Female,Yes,0,Graduate,No,3166,0,36,360,1,Semiurban,Y 585 | LP002898,Male,Yes,1,Graduate,No,1880,0,61,360,,Rural,N 586 | LP002911,Male,Yes,1,Graduate,No,2787,1917,146,360,0,Rural,N 587 | LP002912,Male,Yes,1,Graduate,No,4283,3000,172,84,1,Rural,N 588 | LP002916,Male,Yes,0,Graduate,No,2297,1522,104,360,1,Urban,Y 589 | LP002917,Female,No,0,Not Graduate,No,2165,0,70,360,1,Semiurban,Y 590 | LP002925,,No,0,Graduate,No,4750,0,94,360,1,Semiurban,Y 591 | LP002926,Male,Yes,2,Graduate,Yes,2726,0,106,360,0,Semiurban,N 592 | LP002928,Male,Yes,0,Graduate,No,3000,3416,56,180,1,Semiurban,Y 593 | LP002931,Male,Yes,2,Graduate,Yes,6000,0,205,240,1,Semiurban,N 594 | LP002933,,No,3+,Graduate,Yes,9357,0,292,360,1,Semiurban,Y 595 | LP002936,Male,Yes,0,Graduate,No,3859,3300,142,180,1,Rural,Y 596 | LP002938,Male,Yes,0,Graduate,Yes,16120,0,260,360,1,Urban,Y 597 | LP002940,Male,No,0,Not Graduate,No,3833,0,110,360,1,Rural,Y 598 | LP002941,Male,Yes,2,Not Graduate,Yes,6383,1000,187,360,1,Rural,N 599 | LP002943,Male,No,,Graduate,No,2987,0,88,360,0,Semiurban,N 600 | LP002945,Male,Yes,0,Graduate,Yes,9963,0,180,360,1,Rural,Y 601 | LP002948,Male,Yes,2,Graduate,No,5780,0,192,360,1,Urban,Y 602 | LP002949,Female,No,3+,Graduate,,416,41667,350,180,,Urban,N 603 | LP002950,Male,Yes,0,Not Graduate,,2894,2792,155,360,1,Rural,Y 604 | LP002953,Male,Yes,3+,Graduate,No,5703,0,128,360,1,Urban,Y 605 | LP002958,Male,No,0,Graduate,No,3676,4301,172,360,1,Rural,Y 606 | LP002959,Female,Yes,1,Graduate,No,12000,0,496,360,1,Semiurban,Y 607 | LP002960,Male,Yes,0,Not Graduate,No,2400,3800,,180,1,Urban,N 608 | LP002961,Male,Yes,1,Graduate,No,3400,2500,173,360,1,Semiurban,Y 609 | LP002964,Male,Yes,2,Not Graduate,No,3987,1411,157,360,1,Rural,Y 610 | LP002974,Male,Yes,0,Graduate,No,3232,1950,108,360,1,Rural,Y 611 | LP002978,Female,No,0,Graduate,No,2900,0,71,360,1,Rural,Y 612 | LP002979,Male,Yes,3+,Graduate,No,4106,0,40,180,1,Rural,Y 613 | LP002983,Male,Yes,1,Graduate,No,8072,240,253,360,1,Urban,Y 614 | LP002984,Male,Yes,2,Graduate,No,7583,0,187,360,1,Urban,Y 615 | LP002990,Female,No,0,Graduate,Yes,4583,0,133,360,0,Semiurban,N 616 | -------------------------------------------------------------------------------- /Machine Learning/data/creditcard.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Machine Learning/data/creditcard.rar -------------------------------------------------------------------------------- /NLP/Natural Language Processing(NLP) Concepts - Hackers Realm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tokenization\n", 8 | "\n", 9 | "Tokenization is the process of breaking down the given text in natural language processing into the smallest unit in a sentence 
called a token. Punctuation marks, words, and numbers can be considered tokens." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 6, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 7, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "data": { 28 | "text/plain": [ 29 | "['Hi',\n", 30 | " 'Everyone!',\n", 31 | " 'This',\n", 32 | " 'is',\n", 33 | " 'Hackers',\n", 34 | " 'Realm.',\n", 35 | " 'We',\n", 36 | " 'are',\n", 37 | " 'learning',\n", 38 | " 'Natural',\n", 39 | " 'Language',\n", 40 | " 'Processing.',\n", 41 | " 'We',\n", 42 | " 'reached',\n", 43 | " '1000000',\n", 44 | " 'views.']" 45 | ] 46 | }, 47 | "execution_count": 7, 48 | "metadata": {}, 49 | "output_type": "execute_result" 50 | } 51 | ], 52 | "source": [ 53 | "text.split(' ')" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 8, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "from nltk import sent_tokenize, word_tokenize" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 9, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "data": { 72 | "text/plain": [ 73 | "['Hi Everyone!',\n", 74 | " 'This is Hackers Realm.',\n", 75 | " 'We are learning Natural Language Processing.',\n", 76 | " 'We reached 1000000 views.']" 77 | ] 78 | }, 79 | "execution_count": 9, 80 | "metadata": {}, 81 | "output_type": "execute_result" 82 | } 83 | ], 84 | "source": [ 85 | "# split the text into sentences\n", 86 | "sent_tokens = sent_tokenize(text)\n", 87 | "sent_tokens" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 10, 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/plain": [ 98 | "['Hi',\n", 99 | " 'Everyone',\n", 100 | " '!',\n", 101 | " 'This',\n", 102 | " 'is',\n", 103 | " 'Hackers',\n", 104 | " 'Realm',\n", 105 | " '.',\n", 106 | " 'We',\n", 107 | " 'are',\n", 108 | " 'learning',\n", 109 | " 'Natural',\n", 110 | " 'Language',\n", 111 | " 'Processing',\n", 112 | " '.',\n", 113 | " 'We',\n", 114 | " 'reached',\n", 115 | " '1000000',\n", 116 | " 'views',\n", 117 | " '.']" 118 | ] 119 | }, 120 | "execution_count": 10, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "# split the text into words\n", 127 | "word_tokens = word_tokenize(text)\n", 128 | "word_tokens" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "# Stemming\n", 143 | "\n", 144 | "Stemming is the process of finding the root of words. A word stem need not be the same root as a dictionary-based morphological root, it just is an equal to or smaller form of the word." 
145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 13, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "from nltk.stem import PorterStemmer, SnowballStemmer\n", 154 | "ps = PorterStemmer()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 17, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "data": { 164 | "text/plain": [ 165 | "'eat'" 166 | ] 167 | }, 168 | "execution_count": 17, 169 | "metadata": {}, 170 | "output_type": "execute_result" 171 | } 172 | ], 173 | "source": [ 174 | "word = ('eats')\n", 175 | "ps.stem(word)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 16, 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "data": { 185 | "text/plain": [ 186 | "'eat'" 187 | ] 188 | }, 189 | "execution_count": 16, 190 | "metadata": {}, 191 | "output_type": "execute_result" 192 | } 193 | ], 194 | "source": [ 195 | "word = ('eating')\n", 196 | "ps.stem(word)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 18, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "data": { 206 | "text/plain": [ 207 | "'eaten'" 208 | ] 209 | }, 210 | "execution_count": 18, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "word = ('eaten')\n", 217 | "ps.stem(word)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 19, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 20, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "word_tokens = word_tokenize(text)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 21, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/plain": [ 246 | "'Hi everyon ! thi is hacker realm . We are learn natur languag process . We reach 1000000 view .'" 247 | ] 248 | }, 249 | "execution_count": 21, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "stemmed_sentence = \" \".join(ps.stem(word) for word in word_tokens)\n", 256 | "stemmed_sentence" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "# Lemmatization\n", 271 | "\n", 272 | "Lemmatization is the process of finding the form of the related word in the dictionary. It is different from Stemming. It involves longer processes to calculate than Stemming." 
273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 22, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "from nltk.stem import WordNetLemmatizer\n", 282 | "lemmatizer = WordNetLemmatizer()" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 30, 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "data": { 292 | "text/plain": [ 293 | "'worker'" 294 | ] 295 | }, 296 | "execution_count": 30, 297 | "metadata": {}, 298 | "output_type": "execute_result" 299 | } 300 | ], 301 | "source": [ 302 | "lemmatizer.lemmatize('workers')" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 31, 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "data": { 312 | "text/plain": [ 313 | "'word'" 314 | ] 315 | }, 316 | "execution_count": 31, 317 | "metadata": {}, 318 | "output_type": "execute_result" 319 | } 320 | ], 321 | "source": [ 322 | "lemmatizer.lemmatize('words')" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 37, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "data": { 332 | "text/plain": [ 333 | "'foot'" 334 | ] 335 | }, 336 | "execution_count": 37, 337 | "metadata": {}, 338 | "output_type": "execute_result" 339 | } 340 | ], 341 | "source": [ 342 | "lemmatizer.lemmatize('feet')" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 39, 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "data": { 352 | "text/plain": [ 353 | "'strip'" 354 | ] 355 | }, 356 | "execution_count": 39, 357 | "metadata": {}, 358 | "output_type": "execute_result" 359 | } 360 | ], 361 | "source": [ 362 | "lemmatizer.lemmatize('stripes', 'v')" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 40, 368 | "metadata": {}, 369 | "outputs": [ 370 | { 371 | "data": { 372 | "text/plain": [ 373 | "'stripe'" 374 | ] 375 | }, 376 | "execution_count": 40, 377 | "metadata": {}, 378 | "output_type": "execute_result" 379 | } 380 | ], 381 | "source": [ 382 | "lemmatizer.lemmatize('stripes', 'n')" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 41, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [ 391 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 42, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "word_tokens = word_tokenize(text)" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 44, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "data": { 410 | "text/plain": [ 411 | "'hi everyone ! this is hacker realm . we are learning natural language processing . we reached 1000000 view .'" 412 | ] 413 | }, 414 | "execution_count": 44, 415 | "metadata": {}, 416 | "output_type": "execute_result" 417 | } 418 | ], 419 | "source": [ 420 | "lemmatized_sentence = \" \".join(lemmatizer.lemmatize(word.lower()) for word in word_tokens)\n", 421 | "lemmatized_sentence" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": null, 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "# Part of Speech Tagging (POS)\n", 436 | "\n", 437 | "Part of Speech Tagging is a process of converting a sentence to forms — list of words, list of tuples (where each tuple is having a form (word, tag)). 
The tag in case of is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.\n", 438 | "\n", 439 | "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 45, 445 | "metadata": {}, 446 | "outputs": [], 447 | "source": [ 448 | "from nltk import pos_tag" 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": 51, 454 | "metadata": {}, 455 | "outputs": [ 456 | { 457 | "data": { 458 | "text/plain": [ 459 | "[('fighting', 'VBG')]" 460 | ] 461 | }, 462 | "execution_count": 51, 463 | "metadata": {}, 464 | "output_type": "execute_result" 465 | } 466 | ], 467 | "source": [ 468 | "pos_tag(['fighting'])" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 46, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 47, 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [ 486 | "word_tokens = word_tokenize(text)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": 52, 492 | "metadata": {}, 493 | "outputs": [ 494 | { 495 | "data": { 496 | "text/plain": [ 497 | "[('Hi', 'NNP'),\n", 498 | " ('Everyone', 'NN'),\n", 499 | " ('!', '.'),\n", 500 | " ('This', 'DT'),\n", 501 | " ('is', 'VBZ'),\n", 502 | " ('Hackers', 'NNP'),\n", 503 | " ('Realm', 'NNP'),\n", 504 | " ('.', '.'),\n", 505 | " ('We', 'PRP'),\n", 506 | " ('are', 'VBP'),\n", 507 | " ('learning', 'VBG'),\n", 508 | " ('Natural', 'NNP'),\n", 509 | " ('Language', 'NNP'),\n", 510 | " ('Processing', 'NNP'),\n", 511 | " ('.', '.'),\n", 512 | " ('We', 'PRP'),\n", 513 | " ('reached', 'VBD'),\n", 514 | " ('1000000', 'CD'),\n", 515 | " ('views', 'NNS'),\n", 516 | " ('.', '.')]" 517 | ] 518 | }, 519 | "execution_count": 52, 520 | "metadata": {}, 521 | "output_type": "execute_result" 522 | } 523 | ], 524 | "source": [ 525 | "pos_tag(word_tokens)" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": null, 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "# Text Preprocessing (Clean Data)" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 9, 545 | "metadata": {}, 546 | "outputs": [ 547 | { 548 | "data": { 549 | "text/html": [ 550 | "
\n", 551 | "\n", 564 | "\n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | "
tweet
0@user when a father is dysfunctional and is s...
1@user @user thanks for #lyft credit i can't us...
2bihday your majesty
3#model i love u take with u all the time in ...
4factsguide: society now #motivation
\n", 594 | "
" 595 | ], 596 | "text/plain": [ 597 | " tweet\n", 598 | "0 @user when a father is dysfunctional and is s...\n", 599 | "1 @user @user thanks for #lyft credit i can't us...\n", 600 | "2 bihday your majesty\n", 601 | "3 #model i love u take with u all the time in ...\n", 602 | "4 factsguide: society now #motivation" 603 | ] 604 | }, 605 | "execution_count": 9, 606 | "metadata": {}, 607 | "output_type": "execute_result" 608 | } 609 | ], 610 | "source": [ 611 | "import pandas as pd\n", 612 | "import string\n", 613 | "df = pd.read_csv('data/Twitter Sentiments.csv')\n", 614 | "# drop the columns\n", 615 | "df = df.drop(columns=['id', 'label'], axis=1)\n", 616 | "df.head()" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "metadata": {}, 622 | "source": [ 623 | "## Convert to lowercase" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": 10, 629 | "metadata": {}, 630 | "outputs": [ 631 | { 632 | "data": { 633 | "text/html": [ 634 | "
\n", 635 | "\n", 648 | "\n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | "
tweetclean_text
0@user when a father is dysfunctional and is s...@user when a father is dysfunctional and is s...
1@user @user thanks for #lyft credit i can't us...@user @user thanks for #lyft credit i can't us...
2bihday your majestybihday your majesty
3#model i love u take with u all the time in ...#model i love u take with u all the time in ...
4factsguide: society now #motivationfactsguide: society now #motivation
\n", 684 | "
" 685 | ], 686 | "text/plain": [ 687 | " tweet \\\n", 688 | "0 @user when a father is dysfunctional and is s... \n", 689 | "1 @user @user thanks for #lyft credit i can't us... \n", 690 | "2 bihday your majesty \n", 691 | "3 #model i love u take with u all the time in ... \n", 692 | "4 factsguide: society now #motivation \n", 693 | "\n", 694 | " clean_text \n", 695 | "0 @user when a father is dysfunctional and is s... \n", 696 | "1 @user @user thanks for #lyft credit i can't us... \n", 697 | "2 bihday your majesty \n", 698 | "3 #model i love u take with u all the time in ... \n", 699 | "4 factsguide: society now #motivation " 700 | ] 701 | }, 702 | "execution_count": 10, 703 | "metadata": {}, 704 | "output_type": "execute_result" 705 | } 706 | ], 707 | "source": [ 708 | "df['clean_text'] = df['tweet'].str.lower()\n", 709 | "df.head()" 710 | ] 711 | }, 712 | { 713 | "cell_type": "markdown", 714 | "metadata": {}, 715 | "source": [ 716 | "## Removal of Punctuations" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 12, 722 | "metadata": {}, 723 | "outputs": [ 724 | { 725 | "data": { 726 | "text/plain": [ 727 | "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" 728 | ] 729 | }, 730 | "execution_count": 12, 731 | "metadata": {}, 732 | "output_type": "execute_result" 733 | } 734 | ], 735 | "source": [ 736 | "string.punctuation" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": 13, 742 | "metadata": {}, 743 | "outputs": [], 744 | "source": [ 745 | "def remove_punctuations(text):\n", 746 | " punctuations = string.punctuation\n", 747 | " return text.translate(str.maketrans('', '', punctuations))" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 14, 753 | "metadata": {}, 754 | "outputs": [ 755 | { 756 | "data": { 757 | "text/html": [ 758 | "
\n", 759 | "\n", 772 | "\n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | "
tweetclean_text
0@user when a father is dysfunctional and is s...user when a father is dysfunctional and is so...
1@user @user thanks for #lyft credit i can't us...user user thanks for lyft credit i cant use ca...
2bihday your majestybihday your majesty
3#model i love u take with u all the time in ...model i love u take with u all the time in u...
4factsguide: society now #motivationfactsguide society now motivation
\n", 808 | "
" 809 | ], 810 | "text/plain": [ 811 | " tweet \\\n", 812 | "0 @user when a father is dysfunctional and is s... \n", 813 | "1 @user @user thanks for #lyft credit i can't us... \n", 814 | "2 bihday your majesty \n", 815 | "3 #model i love u take with u all the time in ... \n", 816 | "4 factsguide: society now #motivation \n", 817 | "\n", 818 | " clean_text \n", 819 | "0 user when a father is dysfunctional and is so... \n", 820 | "1 user user thanks for lyft credit i cant use ca... \n", 821 | "2 bihday your majesty \n", 822 | "3 model i love u take with u all the time in u... \n", 823 | "4 factsguide society now motivation " 824 | ] 825 | }, 826 | "execution_count": 14, 827 | "metadata": {}, 828 | "output_type": "execute_result" 829 | } 830 | ], 831 | "source": [ 832 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_punctuations(x))\n", 833 | "df.head()" 834 | ] 835 | }, 836 | { 837 | "cell_type": "markdown", 838 | "metadata": {}, 839 | "source": [ 840 | "## Removal of Stopwords" 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": 17, 846 | "metadata": {}, 847 | "outputs": [ 848 | { 849 | "data": { 850 | "text/plain": [ 851 | "\"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't\"" 852 | ] 853 | }, 854 | "execution_count": 17, 855 | "metadata": {}, 856 | "output_type": "execute_result" 857 | } 858 | ], 859 | "source": [ 860 | "from nltk.corpus import stopwords\n", 861 | "\", \".join(stopwords.words('english'))" 862 | ] 863 | }, 864 | { 865 | "cell_type": "code", 866 | "execution_count": 18, 867 | "metadata": {}, 868 | "outputs": [], 869 | "source": [ 870 | "STOPWORDS = set(stopwords.words('english'))\n", 871 | "def remove_stopwords(text):\n", 872 | " return \" \".join([word for word in text.split() if word not in STOPWORDS])" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": 19, 878 | "metadata": {}, 879 | "outputs": [ 880 | { 881 | "data": { 882 | "text/html": [ 883 | "
\n", 884 | "\n", 897 | "\n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | "
tweetclean_text
0@user when a father is dysfunctional and is s...user father dysfunctional selfish drags kids d...
1@user @user thanks for #lyft credit i can't us...user user thanks lyft credit cant use cause do...
2bihday your majestybihday majesty
3#model i love u take with u all the time in ...model love u take u time ur📱 😙😎👄ðŸ...
4factsguide: society now #motivationfactsguide society motivation
\n", 933 | "
" 934 | ], 935 | "text/plain": [ 936 | " tweet \\\n", 937 | "0 @user when a father is dysfunctional and is s... \n", 938 | "1 @user @user thanks for #lyft credit i can't us... \n", 939 | "2 bihday your majesty \n", 940 | "3 #model i love u take with u all the time in ... \n", 941 | "4 factsguide: society now #motivation \n", 942 | "\n", 943 | " clean_text \n", 944 | "0 user father dysfunctional selfish drags kids d... \n", 945 | "1 user user thanks lyft credit cant use cause do... \n", 946 | "2 bihday majesty \n", 947 | "3 model love u take u time ur📱 😙😎👄ðŸ... \n", 948 | "4 factsguide society motivation " 949 | ] 950 | }, 951 | "execution_count": 19, 952 | "metadata": {}, 953 | "output_type": "execute_result" 954 | } 955 | ], 956 | "source": [ 957 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))\n", 958 | "df.head()" 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": {}, 964 | "source": [ 965 | "## Removal of Frequent Words" 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "execution_count": 23, 971 | "metadata": {}, 972 | "outputs": [ 973 | { 974 | "data": { 975 | "text/plain": [ 976 | "[('user', 17473),\n", 977 | " ('love', 2647),\n", 978 | " ('day', 2198),\n", 979 | " ('happy', 1663),\n", 980 | " ('amp', 1582),\n", 981 | " ('im', 1139),\n", 982 | " ('u', 1136),\n", 983 | " ('time', 1110),\n", 984 | " ('life', 1086),\n", 985 | " ('like', 1042)]" 986 | ] 987 | }, 988 | "execution_count": 23, 989 | "metadata": {}, 990 | "output_type": "execute_result" 991 | } 992 | ], 993 | "source": [ 994 | "from collections import Counter\n", 995 | "word_count = Counter()\n", 996 | "for text in df['clean_text']:\n", 997 | " for word in text.split():\n", 998 | " word_count[word] += 1\n", 999 | " \n", 1000 | "word_count.most_common(10)" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": 24, 1006 | "metadata": {}, 1007 | "outputs": [], 1008 | "source": [ 1009 | "FREQUENT_WORDS = set(word for (word, wc) in word_count.most_common(3))\n", 1010 | "def remove_freq_words(text):\n", 1011 | " return \" \".join([word for word in text.split() if word not in FREQUENT_WORDS])" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": 25, 1017 | "metadata": {}, 1018 | "outputs": [ 1019 | { 1020 | "data": { 1021 | "text/html": [ 1022 | "
\n", 1023 | "\n", 1036 | "\n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | "
tweetclean_text
0@user when a father is dysfunctional and is s...father dysfunctional selfish drags kids dysfun...
1@user @user thanks for #lyft credit i can't us...thanks lyft credit cant use cause dont offer w...
2bihday your majestybihday majesty
3#model i love u take with u all the time in ...model u take u time ur📱 😙😎👄👠ðŸ’...
4factsguide: society now #motivationfactsguide society motivation
\n", 1072 | "
" 1073 | ], 1074 | "text/plain": [ 1075 | " tweet \\\n", 1076 | "0 @user when a father is dysfunctional and is s... \n", 1077 | "1 @user @user thanks for #lyft credit i can't us... \n", 1078 | "2 bihday your majesty \n", 1079 | "3 #model i love u take with u all the time in ... \n", 1080 | "4 factsguide: society now #motivation \n", 1081 | "\n", 1082 | " clean_text \n", 1083 | "0 father dysfunctional selfish drags kids dysfun... \n", 1084 | "1 thanks lyft credit cant use cause dont offer w... \n", 1085 | "2 bihday majesty \n", 1086 | "3 model u take u time ur📱 😙😎👄👠ðŸ’... \n", 1087 | "4 factsguide society motivation " 1088 | ] 1089 | }, 1090 | "execution_count": 25, 1091 | "metadata": {}, 1092 | "output_type": "execute_result" 1093 | } 1094 | ], 1095 | "source": [ 1096 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_freq_words(x))\n", 1097 | "df.head()" 1098 | ] 1099 | }, 1100 | { 1101 | "cell_type": "markdown", 1102 | "metadata": {}, 1103 | "source": [ 1104 | "## Removal of Rare Words" 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "code", 1109 | "execution_count": 30, 1110 | "metadata": {}, 1111 | "outputs": [ 1112 | { 1113 | "data": { 1114 | "text/plain": [ 1115 | "{'airwaves',\n", 1116 | " 'carnt',\n", 1117 | " 'chisolm',\n", 1118 | " 'ibizabringitonmallorcaholidayssummer',\n", 1119 | " 'isz',\n", 1120 | " 'mantle',\n", 1121 | " 'shirley',\n", 1122 | " 'youuuð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dâ\\x9d¤ï¸\\x8f',\n", 1123 | " 'ð\\x9f\\x99\\x8fð\\x9f\\x8f¼ð\\x9f\\x8d¹ð\\x9f\\x98\\x8eð\\x9f\\x8eµ'}" 1124 | ] 1125 | }, 1126 | "execution_count": 30, 1127 | "metadata": {}, 1128 | "output_type": "execute_result" 1129 | } 1130 | ], 1131 | "source": [ 1132 | "RARE_WORDS = set(word for (word, wc) in word_count.most_common()[:-10:-1])\n", 1133 | "RARE_WORDS" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": 31, 1139 | "metadata": {}, 1140 | "outputs": [], 1141 | "source": [ 1142 | "def remove_rare_words(text):\n", 1143 | " return \" \".join([word for word in text.split() if word not in RARE_WORDS])" 1144 | ] 1145 | }, 1146 | { 1147 | "cell_type": "code", 1148 | "execution_count": 32, 1149 | "metadata": {}, 1150 | "outputs": [ 1151 | { 1152 | "data": { 1153 | "text/html": [ 1154 | "
\n", 1155 | "\n", 1168 | "\n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | "
tweetclean_text
0@user when a father is dysfunctional and is s...father dysfunctional selfish drags kids dysfun...
1@user @user thanks for #lyft credit i can't us...thanks lyft credit cant use cause dont offer w...
2bihday your majestybihday majesty
3#model i love u take with u all the time in ...model u take u time ur📱 😙😎👄👠ðŸ’...
4factsguide: society now #motivationfactsguide society motivation
\n", 1204 | "
" 1205 | ], 1206 | "text/plain": [ 1207 | " tweet \\\n", 1208 | "0 @user when a father is dysfunctional and is s... \n", 1209 | "1 @user @user thanks for #lyft credit i can't us... \n", 1210 | "2 bihday your majesty \n", 1211 | "3 #model i love u take with u all the time in ... \n", 1212 | "4 factsguide: society now #motivation \n", 1213 | "\n", 1214 | " clean_text \n", 1215 | "0 father dysfunctional selfish drags kids dysfun... \n", 1216 | "1 thanks lyft credit cant use cause dont offer w... \n", 1217 | "2 bihday majesty \n", 1218 | "3 model u take u time ur📱 😙😎👄👠ðŸ’... \n", 1219 | "4 factsguide society motivation " 1220 | ] 1221 | }, 1222 | "execution_count": 32, 1223 | "metadata": {}, 1224 | "output_type": "execute_result" 1225 | } 1226 | ], 1227 | "source": [ 1228 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_rare_words(x))\n", 1229 | "df.head()" 1230 | ] 1231 | }, 1232 | { 1233 | "cell_type": "markdown", 1234 | "metadata": {}, 1235 | "source": [ 1236 | "## Removal of Special characters" 1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": 33, 1242 | "metadata": {}, 1243 | "outputs": [], 1244 | "source": [ 1245 | "import re\n", 1246 | "def remove_spl_chars(text):\n", 1247 | " text = re.sub('[^a-zA-Z0-9]', ' ', text)\n", 1248 | " text = re.sub('\\s+', ' ', text)\n", 1249 | " return text" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "code", 1254 | "execution_count": 34, 1255 | "metadata": {}, 1256 | "outputs": [ 1257 | { 1258 | "data": { 1259 | "text/html": [ 1260 | "
\n", 1261 | "\n", 1274 | "\n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | "
tweetclean_text
0@user when a father is dysfunctional and is s...father dysfunctional selfish drags kids dysfun...
1@user @user thanks for #lyft credit i can't us...thanks lyft credit cant use cause dont offer w...
2bihday your majestybihday majesty
3#model i love u take with u all the time in ...model u take u time ur
4factsguide: society now #motivationfactsguide society motivation
\n", 1310 | "
" 1311 | ], 1312 | "text/plain": [ 1313 | " tweet \\\n", 1314 | "0 @user when a father is dysfunctional and is s... \n", 1315 | "1 @user @user thanks for #lyft credit i can't us... \n", 1316 | "2 bihday your majesty \n", 1317 | "3 #model i love u take with u all the time in ... \n", 1318 | "4 factsguide: society now #motivation \n", 1319 | "\n", 1320 | " clean_text \n", 1321 | "0 father dysfunctional selfish drags kids dysfun... \n", 1322 | "1 thanks lyft credit cant use cause dont offer w... \n", 1323 | "2 bihday majesty \n", 1324 | "3 model u take u time ur \n", 1325 | "4 factsguide society motivation " 1326 | ] 1327 | }, 1328 | "execution_count": 34, 1329 | "metadata": {}, 1330 | "output_type": "execute_result" 1331 | } 1332 | ], 1333 | "source": [ 1334 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_spl_chars(x))\n", 1335 | "df.head()" 1336 | ] 1337 | }, 1338 | { 1339 | "cell_type": "markdown", 1340 | "metadata": {}, 1341 | "source": [ 1342 | "## Stemming" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "code", 1347 | "execution_count": 35, 1348 | "metadata": {}, 1349 | "outputs": [], 1350 | "source": [ 1351 | "from nltk.stem.porter import PorterStemmer\n", 1352 | "ps = PorterStemmer()\n", 1353 | "def stem_words(text):\n", 1354 | " return \" \".join([ps.stem(word) for word in text.split()])" 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "code", 1359 | "execution_count": 36, 1360 | "metadata": {}, 1361 | "outputs": [ 1362 | { 1363 | "data": { 1364 | "text/html": [ 1365 | "
\n", 1366 | "\n", 1379 | "\n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | "
tweetclean_textstemmed_text
0@user when a father is dysfunctional and is s...father dysfunctional selfish drags kids dysfun...father dysfunct selfish drag kid dysfunct run
1@user @user thanks for #lyft credit i can't us...thanks lyft credit cant use cause dont offer w...thank lyft credit cant use caus dont offer whe...
2bihday your majestybihday majestybihday majesti
3#model i love u take with u all the time in ...model u take u time urmodel u take u time ur
4factsguide: society now #motivationfactsguide society motivationfactsguid societi motiv
\n", 1421 | "
" 1422 | ], 1423 | "text/plain": [ 1424 | " tweet \\\n", 1425 | "0 @user when a father is dysfunctional and is s... \n", 1426 | "1 @user @user thanks for #lyft credit i can't us... \n", 1427 | "2 bihday your majesty \n", 1428 | "3 #model i love u take with u all the time in ... \n", 1429 | "4 factsguide: society now #motivation \n", 1430 | "\n", 1431 | " clean_text \\\n", 1432 | "0 father dysfunctional selfish drags kids dysfun... \n", 1433 | "1 thanks lyft credit cant use cause dont offer w... \n", 1434 | "2 bihday majesty \n", 1435 | "3 model u take u time ur \n", 1436 | "4 factsguide society motivation \n", 1437 | "\n", 1438 | " stemmed_text \n", 1439 | "0 father dysfunct selfish drag kid dysfunct run \n", 1440 | "1 thank lyft credit cant use caus dont offer whe... \n", 1441 | "2 bihday majesti \n", 1442 | "3 model u take u time ur \n", 1443 | "4 factsguid societi motiv " 1444 | ] 1445 | }, 1446 | "execution_count": 36, 1447 | "metadata": {}, 1448 | "output_type": "execute_result" 1449 | } 1450 | ], 1451 | "source": [ 1452 | "df['stemmed_text'] = df['clean_text'].apply(lambda x: stem_words(x))\n", 1453 | "df.head()" 1454 | ] 1455 | }, 1456 | { 1457 | "cell_type": "markdown", 1458 | "metadata": {}, 1459 | "source": [ 1460 | "## Lemmatization & POS Tagging" 1461 | ] 1462 | }, 1463 | { 1464 | "cell_type": "code", 1465 | "execution_count": 41, 1466 | "metadata": {}, 1467 | "outputs": [], 1468 | "source": [ 1469 | "from nltk import pos_tag\n", 1470 | "from nltk.corpus import wordnet\n", 1471 | "from nltk.stem import WordNetLemmatizer\n", 1472 | "\n", 1473 | "lemmatizer = WordNetLemmatizer()\n", 1474 | "wordnet_map = {\"N\":wordnet.NOUN, \"V\": wordnet.VERB, \"J\": wordnet.ADJ, \"R\": wordnet.ADV}\n", 1475 | "\n", 1476 | "def lemmatize_words(text):\n", 1477 | " # find pos tags\n", 1478 | " pos_text = pos_tag(text.split())\n", 1479 | " return \" \".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_text])" 1480 | ] 1481 | }, 1482 | { 1483 | "cell_type": "code", 1484 | "execution_count": 42, 1485 | "metadata": {}, 1486 | "outputs": [ 1487 | { 1488 | "data": { 1489 | "text/plain": [ 1490 | "'n'" 1491 | ] 1492 | }, 1493 | "execution_count": 42, 1494 | "metadata": {}, 1495 | "output_type": "execute_result" 1496 | } 1497 | ], 1498 | "source": [ 1499 | "wordnet.NOUN" 1500 | ] 1501 | }, 1502 | { 1503 | "cell_type": "code", 1504 | "execution_count": 43, 1505 | "metadata": {}, 1506 | "outputs": [ 1507 | { 1508 | "data": { 1509 | "text/html": [ 1510 | "
\n", 1511 | "\n", 1524 | "\n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | "
tweetclean_textstemmed_textlemmatized_text
0@user when a father is dysfunctional and is s...father dysfunctional selfish drags kids dysfun...father dysfunct selfish drag kid dysfunct runfather dysfunctional selfish drag kid dysfunct...
1@user @user thanks for #lyft credit i can't us...thanks lyft credit cant use cause dont offer w...thank lyft credit cant use caus dont offer whe...thanks lyft credit cant use cause dont offer w...
2bihday your majestybihday majestybihday majestibihday majesty
3#model i love u take with u all the time in ...model u take u time urmodel u take u time urmodel u take u time ur
4factsguide: society now #motivationfactsguide society motivationfactsguid societi motivfactsguide society motivation
\n", 1572 | "
" 1573 | ], 1574 | "text/plain": [ 1575 | " tweet \\\n", 1576 | "0 @user when a father is dysfunctional and is s... \n", 1577 | "1 @user @user thanks for #lyft credit i can't us... \n", 1578 | "2 bihday your majesty \n", 1579 | "3 #model i love u take with u all the time in ... \n", 1580 | "4 factsguide: society now #motivation \n", 1581 | "\n", 1582 | " clean_text \\\n", 1583 | "0 father dysfunctional selfish drags kids dysfun... \n", 1584 | "1 thanks lyft credit cant use cause dont offer w... \n", 1585 | "2 bihday majesty \n", 1586 | "3 model u take u time ur \n", 1587 | "4 factsguide society motivation \n", 1588 | "\n", 1589 | " stemmed_text \\\n", 1590 | "0 father dysfunct selfish drag kid dysfunct run \n", 1591 | "1 thank lyft credit cant use caus dont offer whe... \n", 1592 | "2 bihday majesti \n", 1593 | "3 model u take u time ur \n", 1594 | "4 factsguid societi motiv \n", 1595 | "\n", 1596 | " lemmatized_text \n", 1597 | "0 father dysfunctional selfish drag kid dysfunct... \n", 1598 | "1 thanks lyft credit cant use cause dont offer w... \n", 1599 | "2 bihday majesty \n", 1600 | "3 model u take u time ur \n", 1601 | "4 factsguide society motivation " 1602 | ] 1603 | }, 1604 | "execution_count": 43, 1605 | "metadata": {}, 1606 | "output_type": "execute_result" 1607 | } 1608 | ], 1609 | "source": [ 1610 | "df['lemmatized_text'] = df['clean_text'].apply(lambda x: lemmatize_words(x))\n", 1611 | "df.head()" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "code", 1616 | "execution_count": 44, 1617 | "metadata": {}, 1618 | "outputs": [ 1619 | { 1620 | "data": { 1621 | "text/html": [ 1622 | "
\n", 1623 | "\n", 1636 | "\n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | "
tweetclean_textstemmed_textlemmatized_text
21468@user for real now: we will be playing @user ...real playing czech republic china championship...real play czech republ china championship wugc...real play czech republic china championship wu...
9568dear america. please don't let this influence ...dear america please dont let influence vote tr...dear america pleas dont let influenc vote trum...dear america please dont let influence vote tr...
19804finally... now on to other suppos~ #leagueof...finally suppos leagueoflegendsfinal suppo leagueoflegendfinally suppos leagueoflegends
22323@user @user @user @user feeling #worried.feeling worriedfeel worrifeel worry
20171i am valued. #i_am #positive #affirmationvalued iam positive affirmationvalu iam posit affirmvalue iam positive affirmation
29669fathers day selfie ❤️ #grandad #selfie #...fathers selfie grandad selfie fathersday bless...father selfi grandad selfi fathersday bless su...father selfie grandad selfie fathersday bless ...
4360when 8th #graders say they're for high #school8th graders say theyre high school8th grader say theyr high school8th grader say theyre high school
15915current mood 😔💦 #alone #anxiety #rain ...current mood alone anxiety rain thistooshallpasscurrent mood alon anxieti rain thistooshallpasscurrent mood alone anxiety rain thistooshallpass
92yes! received my acceptance letter for my mast...yes received acceptance letter masters back oc...ye receiv accept letter master back octob good...yes receive acceptance letter master back octo...
18745@user @user this so made me smilemade smilemade smilemake smile
\n", 1719 | "
" 1720 | ], 1721 | "text/plain": [ 1722 | " tweet \\\n", 1723 | "21468 @user for real now: we will be playing @user ... \n", 1724 | "9568 dear america. please don't let this influence ... \n", 1725 | "19804 finally... now on to other suppos~ #leagueof... \n", 1726 | "22323 @user @user @user @user feeling #worried. \n", 1727 | "20171 i am valued. #i_am #positive #affirmation \n", 1728 | "29669 fathers day selfie ❤️ #grandad #selfie #... \n", 1729 | "4360 when 8th #graders say they're for high #school \n", 1730 | "15915 current mood 😔💦 #alone #anxiety #rain ... \n", 1731 | "92 yes! received my acceptance letter for my mast... \n", 1732 | "18745 @user @user this so made me smile \n", 1733 | "\n", 1734 | " clean_text \\\n", 1735 | "21468 real playing czech republic china championship... \n", 1736 | "9568 dear america please dont let influence vote tr... \n", 1737 | "19804 finally suppos leagueoflegends \n", 1738 | "22323 feeling worried \n", 1739 | "20171 valued iam positive affirmation \n", 1740 | "29669 fathers selfie grandad selfie fathersday bless... \n", 1741 | "4360 8th graders say theyre high school \n", 1742 | "15915 current mood alone anxiety rain thistooshallpass \n", 1743 | "92 yes received acceptance letter masters back oc... \n", 1744 | "18745 made smile \n", 1745 | "\n", 1746 | " stemmed_text \\\n", 1747 | "21468 real play czech republ china championship wugc... \n", 1748 | "9568 dear america pleas dont let influenc vote trum... \n", 1749 | "19804 final suppo leagueoflegend \n", 1750 | "22323 feel worri \n", 1751 | "20171 valu iam posit affirm \n", 1752 | "29669 father selfi grandad selfi fathersday bless su... \n", 1753 | "4360 8th grader say theyr high school \n", 1754 | "15915 current mood alon anxieti rain thistooshallpass \n", 1755 | "92 ye receiv accept letter master back octob good... \n", 1756 | "18745 made smile \n", 1757 | "\n", 1758 | " lemmatized_text \n", 1759 | "21468 real play czech republic china championship wu... \n", 1760 | "9568 dear america please dont let influence vote tr... \n", 1761 | "19804 finally suppos leagueoflegends \n", 1762 | "22323 feel worry \n", 1763 | "20171 value iam positive affirmation \n", 1764 | "29669 father selfie grandad selfie fathersday bless ... \n", 1765 | "4360 8th grader say theyre high school \n", 1766 | "15915 current mood alone anxiety rain thistooshallpass \n", 1767 | "92 yes receive acceptance letter master back octo... 
\n", 1768 | "18745 make smile " 1769 | ] 1770 | }, 1771 | "execution_count": 44, 1772 | "metadata": {}, 1773 | "output_type": "execute_result" 1774 | } 1775 | ], 1776 | "source": [ 1777 | "df.sample(frac=1).head(10)" 1778 | ] 1779 | }, 1780 | { 1781 | "cell_type": "markdown", 1782 | "metadata": {}, 1783 | "source": [ 1784 | "## Removal of URLs" 1785 | ] 1786 | }, 1787 | { 1788 | "cell_type": "code", 1789 | "execution_count": 53, 1790 | "metadata": {}, 1791 | "outputs": [], 1792 | "source": [ 1793 | "text = \"https://www.hackersrealm.net is the URL of the channel Hackers Realm\"" 1794 | ] 1795 | }, 1796 | { 1797 | "cell_type": "code", 1798 | "execution_count": 54, 1799 | "metadata": {}, 1800 | "outputs": [], 1801 | "source": [ 1802 | "def remove_url(text):\n", 1803 | " return re.sub(r'https?://\\S+|www\\.\\S+', '', text)" 1804 | ] 1805 | }, 1806 | { 1807 | "cell_type": "code", 1808 | "execution_count": 55, 1809 | "metadata": {}, 1810 | "outputs": [ 1811 | { 1812 | "data": { 1813 | "text/plain": [ 1814 | "' is the URL of the channel Hackers Realm'" 1815 | ] 1816 | }, 1817 | "execution_count": 55, 1818 | "metadata": {}, 1819 | "output_type": "execute_result" 1820 | } 1821 | ], 1822 | "source": [ 1823 | "remove_url(text)" 1824 | ] 1825 | }, 1826 | { 1827 | "cell_type": "markdown", 1828 | "metadata": {}, 1829 | "source": [ 1830 | "## Removal of HTML Tags" 1831 | ] 1832 | }, 1833 | { 1834 | "cell_type": "code", 1835 | "execution_count": 56, 1836 | "metadata": {}, 1837 | "outputs": [], 1838 | "source": [ 1839 | "text = \"
<html> <h1>Hackers Realm</h1> <p>This is NLP text preprocessing tutorial</p> </html>
\"" 1840 | ] 1841 | }, 1842 | { 1843 | "cell_type": "code", 1844 | "execution_count": 57, 1845 | "metadata": {}, 1846 | "outputs": [], 1847 | "source": [ 1848 | "def remove_html_tags(text):\n", 1849 | " return re.sub(r'<.*?>', '', text)" 1850 | ] 1851 | }, 1852 | { 1853 | "cell_type": "code", 1854 | "execution_count": 58, 1855 | "metadata": {}, 1856 | "outputs": [ 1857 | { 1858 | "data": { 1859 | "text/plain": [ 1860 | "' Hackers Realm This is NLP text preprocessing tutorial '" 1861 | ] 1862 | }, 1863 | "execution_count": 58, 1864 | "metadata": {}, 1865 | "output_type": "execute_result" 1866 | } 1867 | ], 1868 | "source": [ 1869 | "remove_html_tags(text)" 1870 | ] 1871 | }, 1872 | { 1873 | "cell_type": "markdown", 1874 | "metadata": {}, 1875 | "source": [ 1876 | "## Spelling Correction" 1877 | ] 1878 | }, 1879 | { 1880 | "cell_type": "code", 1881 | "execution_count": 64, 1882 | "metadata": {}, 1883 | "outputs": [], 1884 | "source": [ 1885 | "!pip install pyspellchecker" 1886 | ] 1887 | }, 1888 | { 1889 | "cell_type": "code", 1890 | "execution_count": 7, 1891 | "metadata": {}, 1892 | "outputs": [], 1893 | "source": [ 1894 | "text = 'natur is a beuty'" 1895 | ] 1896 | }, 1897 | { 1898 | "cell_type": "code", 1899 | "execution_count": 8, 1900 | "metadata": {}, 1901 | "outputs": [], 1902 | "source": [ 1903 | "from spellchecker import SpellChecker\n", 1904 | "spell = SpellChecker()\n", 1905 | "\n", 1906 | "def correct_spellings(text):\n", 1907 | " corrected_text = []\n", 1908 | " misspelled_text = spell.unknown(text.split())\n", 1909 | " # print(misspelled_text)\n", 1910 | " for word in text.split():\n", 1911 | " if word in misspelled_text:\n", 1912 | " corrected_text.append(spell.correction(word))\n", 1913 | " else:\n", 1914 | " corrected_text.append(word)\n", 1915 | " \n", 1916 | " return \" \".join(corrected_text)" 1917 | ] 1918 | }, 1919 | { 1920 | "cell_type": "code", 1921 | "execution_count": 9, 1922 | "metadata": {}, 1923 | "outputs": [ 1924 | { 1925 | "data": { 1926 | "text/plain": [ 1927 | "'nature is a beauty'" 1928 | ] 1929 | }, 1930 | "execution_count": 9, 1931 | "metadata": {}, 1932 | "output_type": "execute_result" 1933 | } 1934 | ], 1935 | "source": [ 1936 | "correct_spellings(text)" 1937 | ] 1938 | }, 1939 | { 1940 | "cell_type": "code", 1941 | "execution_count": null, 1942 | "metadata": {}, 1943 | "outputs": [], 1944 | "source": [] 1945 | }, 1946 | { 1947 | "cell_type": "markdown", 1948 | "metadata": {}, 1949 | "source": [ 1950 | "# Feature Extraction from Text Data" 1951 | ] 1952 | }, 1953 | { 1954 | "cell_type": "markdown", 1955 | "metadata": {}, 1956 | "source": [ 1957 | "## Bag of Words\n", 1958 | "\n", 1959 | "A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: A vocabulary of known words. A measure of the presence of known words." 
1960 | ] 1961 | }, 1962 | { 1963 | "cell_type": "code", 1964 | "execution_count": 5, 1965 | "metadata": {}, 1966 | "outputs": [], 1967 | "source": [ 1968 | "text_data = ['I am interested in NLP', 'This is a good tutorial with good topic', 'Feature extraction is very important topic']" 1969 | ] 1970 | }, 1971 | { 1972 | "cell_type": "code", 1973 | "execution_count": 6, 1974 | "metadata": {}, 1975 | "outputs": [], 1976 | "source": [ 1977 | "from sklearn.feature_extraction.text import CountVectorizer\n", 1978 | "bow = CountVectorizer(stop_words='english')" 1979 | ] 1980 | }, 1981 | { 1982 | "cell_type": "code", 1983 | "execution_count": 7, 1984 | "metadata": {}, 1985 | "outputs": [ 1986 | { 1987 | "data": { 1988 | "text/plain": [ 1989 | "CountVectorizer(stop_words='english')" 1990 | ] 1991 | }, 1992 | "execution_count": 7, 1993 | "metadata": {}, 1994 | "output_type": "execute_result" 1995 | } 1996 | ], 1997 | "source": [ 1998 | "# fit the data\n", 1999 | "bow.fit(text_data)" 2000 | ] 2001 | }, 2002 | { 2003 | "cell_type": "code", 2004 | "execution_count": 8, 2005 | "metadata": {}, 2006 | "outputs": [ 2007 | { 2008 | "data": { 2009 | "text/plain": [ 2010 | "['extraction',\n", 2011 | " 'feature',\n", 2012 | " 'good',\n", 2013 | " 'important',\n", 2014 | " 'interested',\n", 2015 | " 'nlp',\n", 2016 | " 'topic',\n", 2017 | " 'tutorial']" 2018 | ] 2019 | }, 2020 | "execution_count": 8, 2021 | "metadata": {}, 2022 | "output_type": "execute_result" 2023 | } 2024 | ], 2025 | "source": [ 2026 | "# get the vocabulary list\n", 2027 | "bow.get_feature_names()" 2028 | ] 2029 | }, 2030 | { 2031 | "cell_type": "code", 2032 | "execution_count": 9, 2033 | "metadata": {}, 2034 | "outputs": [ 2035 | { 2036 | "data": { 2037 | "text/plain": [ 2038 | "<3x8 sparse matrix of type ''\n", 2039 | "\twith 9 stored elements in Compressed Sparse Row format>" 2040 | ] 2041 | }, 2042 | "execution_count": 9, 2043 | "metadata": {}, 2044 | "output_type": "execute_result" 2045 | } 2046 | ], 2047 | "source": [ 2048 | "bow_features = bow.transform(text_data)\n", 2049 | "bow_features" 2050 | ] 2051 | }, 2052 | { 2053 | "cell_type": "code", 2054 | "execution_count": 10, 2055 | "metadata": {}, 2056 | "outputs": [ 2057 | { 2058 | "data": { 2059 | "text/plain": [ 2060 | "array([[0, 0, 0, 0, 1, 1, 0, 0],\n", 2061 | " [0, 0, 2, 0, 0, 0, 1, 1],\n", 2062 | " [1, 1, 0, 1, 0, 0, 1, 0]], dtype=int64)" 2063 | ] 2064 | }, 2065 | "execution_count": 10, 2066 | "metadata": {}, 2067 | "output_type": "execute_result" 2068 | } 2069 | ], 2070 | "source": [ 2071 | "bow_feature_array = bow_features.toarray()\n", 2072 | "bow_feature_array" 2073 | ] 2074 | }, 2075 | { 2076 | "cell_type": "code", 2077 | "execution_count": 11, 2078 | "metadata": {}, 2079 | "outputs": [ 2080 | { 2081 | "name": "stdout", 2082 | "output_type": "stream", 2083 | "text": [ 2084 | "['extraction', 'feature', 'good', 'important', 'interested', 'nlp', 'topic', 'tutorial']\n", 2085 | "I am interested in NLP\n", 2086 | "[0 0 0 0 1 1 0 0]\n", 2087 | "This is a good tutorial with good topic\n", 2088 | "[0 0 2 0 0 0 1 1]\n", 2089 | "Feature extraction is very important topic\n", 2090 | "[1 1 0 1 0 0 1 0]\n" 2091 | ] 2092 | } 2093 | ], 2094 | "source": [ 2095 | "print(bow.get_feature_names())\n", 2096 | "for sentence, feature in zip(text_data, bow_feature_array):\n", 2097 | " print(sentence)\n", 2098 | " print(feature)" 2099 | ] 2100 | }, 2101 | { 2102 | "cell_type": "code", 2103 | "execution_count": null, 2104 | "metadata": {}, 2105 | "outputs": [], 2106 | "source": [] 2107 | }, 2108 | { 
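One optional extension of the bag-of-words cells above (an added sketch, not part of the original notebook): CountVectorizer can also count word pairs through its ngram_range parameter, which keeps a little of the word order that single-word counts throw away.

# Bigram bag-of-words sketch (assumed extension; not a cell from the original notebook).
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['I am interested in NLP', 'This is a good tutorial with good topic',
             'Feature extraction is very important topic']
bow_bigram = CountVectorizer(stop_words='english', ngram_range=(1, 2))   # unigrams and bigrams
features = bow_bigram.fit_transform(text_data)
print(features.toarray())   # one row per sentence, one column per unigram/bigram
# Vocabulary lookup: bow_bigram.get_feature_names() on the sklearn version used in this notebook,
# bow_bigram.get_feature_names_out() on sklearn >= 1.0.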
2109 | "cell_type": "markdown", 2110 | "metadata": {}, 2111 | "source": [ 2112 | "## TF-IDF (Term Frequency/Inverse Document Frequency)\n", 2113 | "\n", 2114 | "TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents" 2115 | ] 2116 | }, 2117 | { 2118 | "cell_type": "code", 2119 | "execution_count": 12, 2120 | "metadata": {}, 2121 | "outputs": [], 2122 | "source": [ 2123 | "text_data = ['I am interested in NLP', 'This is a good tutorial with good topic', 'Feature extraction is very important topic']" 2124 | ] 2125 | }, 2126 | { 2127 | "cell_type": "code", 2128 | "execution_count": 13, 2129 | "metadata": {}, 2130 | "outputs": [], 2131 | "source": [ 2132 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 2133 | "tfidf = TfidfVectorizer(stop_words='english')" 2134 | ] 2135 | }, 2136 | { 2137 | "cell_type": "code", 2138 | "execution_count": 14, 2139 | "metadata": {}, 2140 | "outputs": [ 2141 | { 2142 | "data": { 2143 | "text/plain": [ 2144 | "TfidfVectorizer(stop_words='english')" 2145 | ] 2146 | }, 2147 | "execution_count": 14, 2148 | "metadata": {}, 2149 | "output_type": "execute_result" 2150 | } 2151 | ], 2152 | "source": [ 2153 | "# fit the data\n", 2154 | "tfidf.fit(text_data)" 2155 | ] 2156 | }, 2157 | { 2158 | "cell_type": "code", 2159 | "execution_count": 15, 2160 | "metadata": {}, 2161 | "outputs": [ 2162 | { 2163 | "data": { 2164 | "text/plain": [ 2165 | "{'interested': 4,\n", 2166 | " 'nlp': 5,\n", 2167 | " 'good': 2,\n", 2168 | " 'tutorial': 7,\n", 2169 | " 'topic': 6,\n", 2170 | " 'feature': 1,\n", 2171 | " 'extraction': 0,\n", 2172 | " 'important': 3}" 2173 | ] 2174 | }, 2175 | "execution_count": 15, 2176 | "metadata": {}, 2177 | "output_type": "execute_result" 2178 | } 2179 | ], 2180 | "source": [ 2181 | "# get the vocabulary list\n", 2182 | "tfidf.vocabulary_" 2183 | ] 2184 | }, 2185 | { 2186 | "cell_type": "code", 2187 | "execution_count": 16, 2188 | "metadata": {}, 2189 | "outputs": [ 2190 | { 2191 | "data": { 2192 | "text/plain": [ 2193 | "<3x8 sparse matrix of type ''\n", 2194 | "\twith 9 stored elements in Compressed Sparse Row format>" 2195 | ] 2196 | }, 2197 | "execution_count": 16, 2198 | "metadata": {}, 2199 | "output_type": "execute_result" 2200 | } 2201 | ], 2202 | "source": [ 2203 | "tfidf_features = tfidf.transform(text_data)\n", 2204 | "tfidf_features" 2205 | ] 2206 | }, 2207 | { 2208 | "cell_type": "code", 2209 | "execution_count": 17, 2210 | "metadata": {}, 2211 | "outputs": [ 2212 | { 2213 | "data": { 2214 | "text/plain": [ 2215 | "array([[0. , 0. , 0. , 0. , 0.70710678,\n", 2216 | " 0.70710678, 0. , 0. ],\n", 2217 | " [0. , 0. , 0.84678897, 0. , 0. ,\n", 2218 | " 0. , 0.32200242, 0.42339448],\n", 2219 | " [0.52863461, 0.52863461, 0. , 0.52863461, 0. ,\n", 2220 | " 0. , 0.40204024, 0. 
]])" 2221 | ] 2222 | }, 2223 | "execution_count": 17, 2224 | "metadata": {}, 2225 | "output_type": "execute_result" 2226 | } 2227 | ], 2228 | "source": [ 2229 | "tfidf_feature_array = tfidf_features.toarray()\n", 2230 | "tfidf_feature_array" 2231 | ] 2232 | }, 2233 | { 2234 | "cell_type": "code", 2235 | "execution_count": 19, 2236 | "metadata": {}, 2237 | "outputs": [ 2238 | { 2239 | "name": "stdout", 2240 | "output_type": "stream", 2241 | "text": [ 2242 | "I am interested in NLP\n", 2243 | " (0, 5)\t0.7071067811865476\n", 2244 | " (0, 4)\t0.7071067811865476\n", 2245 | "This is a good tutorial with good topic\n", 2246 | " (0, 7)\t0.42339448341195934\n", 2247 | " (0, 6)\t0.3220024178194947\n", 2248 | " (0, 2)\t0.8467889668239187\n", 2249 | "Feature extraction is very important topic\n", 2250 | " (0, 6)\t0.4020402441612698\n", 2251 | " (0, 3)\t0.5286346066596935\n", 2252 | " (0, 1)\t0.5286346066596935\n", 2253 | " (0, 0)\t0.5286346066596935\n" 2254 | ] 2255 | } 2256 | ], 2257 | "source": [ 2258 | "for sentence, feature in zip(text_data, tfidf_features):\n", 2259 | " print(sentence)\n", 2260 | " print(feature)" 2261 | ] 2262 | }, 2263 | { 2264 | "cell_type": "code", 2265 | "execution_count": null, 2266 | "metadata": {}, 2267 | "outputs": [], 2268 | "source": [] 2269 | }, 2270 | { 2271 | "cell_type": "markdown", 2272 | "metadata": {}, 2273 | "source": [ 2274 | "## Word2vec\n", 2275 | "\n", 2276 | "The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence." 2277 | ] 2278 | }, 2279 | { 2280 | "cell_type": "code", 2281 | "execution_count": 20, 2282 | "metadata": {}, 2283 | "outputs": [], 2284 | "source": [ 2285 | "from gensim.test.utils import common_texts\n", 2286 | "from gensim.models import Word2Vec" 2287 | ] 2288 | }, 2289 | { 2290 | "cell_type": "code", 2291 | "execution_count": 21, 2292 | "metadata": {}, 2293 | "outputs": [ 2294 | { 2295 | "data": { 2296 | "text/plain": [ 2297 | "[['human', 'interface', 'computer'],\n", 2298 | " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n", 2299 | " ['eps', 'user', 'interface', 'system'],\n", 2300 | " ['system', 'human', 'system', 'eps'],\n", 2301 | " ['user', 'response', 'time'],\n", 2302 | " ['trees'],\n", 2303 | " ['graph', 'trees'],\n", 2304 | " ['graph', 'minors', 'trees'],\n", 2305 | " ['graph', 'minors', 'survey']]" 2306 | ] 2307 | }, 2308 | "execution_count": 21, 2309 | "metadata": {}, 2310 | "output_type": "execute_result" 2311 | } 2312 | ], 2313 | "source": [ 2314 | "# text data\n", 2315 | "common_texts" 2316 | ] 2317 | }, 2318 | { 2319 | "cell_type": "code", 2320 | "execution_count": 23, 2321 | "metadata": {}, 2322 | "outputs": [], 2323 | "source": [ 2324 | "# initialize and fit the data\n", 2325 | "model = Word2Vec(common_texts, size=100, min_count=1)" 2326 | ] 2327 | }, 2328 | { 2329 | "cell_type": "code", 2330 | "execution_count": 25, 2331 | "metadata": {}, 2332 | "outputs": [ 2333 | { 2334 | "data": { 2335 | "text/plain": [ 2336 | "array([-0.00042112, 0.00126945, -0.00348724, 0.00373327, 0.00387501,\n", 2337 | " -0.00306736, -0.00138952, -0.00139083, 0.00334137, 0.00413064,\n", 2338 | " 0.00045129, -0.00390373, -0.00159695, -0.00369461, -0.00036086,\n", 2339 | " 0.00444261, -0.00391653, 0.00447466, -0.00032617, 0.00056412,\n", 2340 | " -0.00017338, -0.00464378, 0.00039338, -0.00353649, 0.0040346 ,\n", 2341 | " 0.00179682, -0.00186994, -0.00121431, -0.00370716, 
0.00039535,\n", 2342 | " -0.00117291, 0.00498948, -0.00243317, 0.00480749, -0.00128626,\n", 2343 | " -0.0018426 , -0.00086148, -0.00347201, -0.0025697 , -0.00409948,\n", 2344 | " 0.00433477, -0.00424404, 0.00389087, 0.0024296 , 0.0009781 ,\n", 2345 | " -0.00267652, -0.00039598, 0.00188174, -0.00141169, 0.00143257,\n", 2346 | " 0.00363962, -0.00445332, 0.00499313, -0.00013036, 0.00411159,\n", 2347 | " 0.00307077, -0.00048517, 0.00491026, -0.00315512, -0.00091287,\n", 2348 | " 0.00465486, 0.00034458, 0.00097905, 0.00187424, -0.00452135,\n", 2349 | " -0.00365111, 0.00260027, 0.00464861, -0.00243504, -0.00425601,\n", 2350 | " -0.00265299, -0.00108813, 0.00284521, -0.00437486, -0.0015496 ,\n", 2351 | " -0.00054869, 0.00228153, 0.00360572, 0.00255484, -0.00357945,\n", 2352 | " -0.00235164, 0.00220505, -0.0016885 , 0.00294839, -0.00337972,\n", 2353 | " 0.00291201, 0.00250298, 0.00447992, -0.00129002, 0.0025 ,\n", 2354 | " -0.00430755, -0.00419162, -0.00029911, 0.00166961, 0.00417119,\n", 2355 | " -0.00209666, 0.00452041, 0.00010931, -0.00115822, -0.00154263],\n", 2356 | " dtype=float32)" 2357 | ] 2358 | }, 2359 | "execution_count": 25, 2360 | "metadata": {}, 2361 | "output_type": "execute_result" 2362 | } 2363 | ], 2364 | "source": [ 2365 | "model.wv['graph']" 2366 | ] 2367 | }, 2368 | { 2369 | "cell_type": "code", 2370 | "execution_count": 26, 2371 | "metadata": {}, 2372 | "outputs": [ 2373 | { 2374 | "data": { 2375 | "text/plain": [ 2376 | "[('interface', 0.1710839718580246),\n", 2377 | " ('user', 0.08987751603126526),\n", 2378 | " ('trees', 0.07364125549793243),\n", 2379 | " ('minors', 0.045832667499780655),\n", 2380 | " ('computer', 0.025292515754699707),\n", 2381 | " ('system', 0.012846874073147774),\n", 2382 | " ('human', -0.03873271495103836),\n", 2383 | " ('survey', -0.06853737682104111),\n", 2384 | " ('time', -0.07515352964401245),\n", 2385 | " ('eps', -0.07798048853874207)]" 2386 | ] 2387 | }, 2388 | "execution_count": 26, 2389 | "metadata": {}, 2390 | "output_type": "execute_result" 2391 | } 2392 | ], 2393 | "source": [ 2394 | "model.wv.most_similar('graph')" 2395 | ] 2396 | }, 2397 | { 2398 | "cell_type": "code", 2399 | "execution_count": null, 2400 | "metadata": {}, 2401 | "outputs": [], 2402 | "source": [] 2403 | }, 2404 | { 2405 | "cell_type": "markdown", 2406 | "metadata": {}, 2407 | "source": [ 2408 | "## Word Embedding using Glove\n", 2409 | "\n", 2410 | "GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space\n", 2411 | "\n", 2412 | "Download link: https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt" 2413 | ] 2414 | }, 2415 | { 2416 | "cell_type": "code", 2417 | "execution_count": 28, 2418 | "metadata": {}, 2419 | "outputs": [ 2420 | { 2421 | "data": { 2422 | "text/html": [ 2423 | "
\n", 2424 | "\n", 2437 | "\n", 2438 | " \n", 2439 | " \n", 2440 | " \n", 2441 | " \n", 2442 | " \n", 2443 | " \n", 2444 | " \n", 2445 | " \n", 2446 | " \n", 2447 | " \n", 2448 | " \n", 2449 | " \n", 2450 | " \n", 2451 | " \n", 2452 | " \n", 2453 | " \n", 2454 | " \n", 2455 | " \n", 2456 | " \n", 2457 | " \n", 2458 | " \n", 2459 | " \n", 2460 | " \n", 2461 | " \n", 2462 | " \n", 2463 | " \n", 2464 | " \n", 2465 | " \n", 2466 | " \n", 2467 | " \n", 2468 | " \n", 2469 | " \n", 2470 | " \n", 2471 | " \n", 2472 | "
tweetclean_text
0@user when a father is dysfunctional and is s...user father dysfunctional selfish drags kids ...
1@user @user thanks for #lyft credit i can't us...user user thanks lyft credit can t use cause ...
2bihday your majestybihday majesty
3#model i love u take with u all the time in ...model love u take u time ur
4factsguide: society now #motivationfactsguide society motivation
\n", 2473 | "
" 2474 | ], 2475 | "text/plain": [ 2476 | " tweet \\\n", 2477 | "0 @user when a father is dysfunctional and is s... \n", 2478 | "1 @user @user thanks for #lyft credit i can't us... \n", 2479 | "2 bihday your majesty \n", 2480 | "3 #model i love u take with u all the time in ... \n", 2481 | "4 factsguide: society now #motivation \n", 2482 | "\n", 2483 | " clean_text \n", 2484 | "0 user father dysfunctional selfish drags kids ... \n", 2485 | "1 user user thanks lyft credit can t use cause ... \n", 2486 | "2 bihday majesty \n", 2487 | "3 model love u take u time ur \n", 2488 | "4 factsguide society motivation " 2489 | ] 2490 | }, 2491 | "execution_count": 28, 2492 | "metadata": {}, 2493 | "output_type": "execute_result" 2494 | } 2495 | ], 2496 | "source": [ 2497 | "import pandas as pd\n", 2498 | "import string\n", 2499 | "from nltk.corpus import stopwords\n", 2500 | "df = pd.read_csv('data/Twitter Sentiments.csv')\n", 2501 | "# drop the columns\n", 2502 | "df = df.drop(columns=['id', 'label'], axis=1)\n", 2503 | "\n", 2504 | "df['clean_text'] = df['tweet'].str.lower()\n", 2505 | "\n", 2506 | "STOPWORDS = set(stopwords.words('english'))\n", 2507 | "def remove_stopwords(text):\n", 2508 | " return \" \".join([word for word in text.split() if word not in STOPWORDS])\n", 2509 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))\n", 2510 | "\n", 2511 | "import re\n", 2512 | "def remove_spl_chars(text):\n", 2513 | " text = re.sub('[^a-zA-Z0-9]', ' ', text)\n", 2514 | " text = re.sub('\\s+', ' ', text)\n", 2515 | " return text\n", 2516 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_spl_chars(x))\n", 2517 | "\n", 2518 | "df.head()" 2519 | ] 2520 | }, 2521 | { 2522 | "cell_type": "code", 2523 | "execution_count": 34, 2524 | "metadata": {}, 2525 | "outputs": [], 2526 | "source": [ 2527 | "from keras.preprocessing.text import Tokenizer\n", 2528 | "from keras.preprocessing.sequence import pad_sequences\n", 2529 | "import numpy as np" 2530 | ] 2531 | }, 2532 | { 2533 | "cell_type": "code", 2534 | "execution_count": 30, 2535 | "metadata": {}, 2536 | "outputs": [ 2537 | { 2538 | "data": { 2539 | "text/plain": [ 2540 | "39085" 2541 | ] 2542 | }, 2543 | "execution_count": 30, 2544 | "metadata": {}, 2545 | "output_type": "execute_result" 2546 | } 2547 | ], 2548 | "source": [ 2549 | "# tokenize text\n", 2550 | "tokenizer = Tokenizer()\n", 2551 | "tokenizer.fit_on_texts(df['clean_text'])\n", 2552 | "\n", 2553 | "word_index = tokenizer.word_index\n", 2554 | "vocab_size = len(word_index)\n", 2555 | "vocab_size" 2556 | ] 2557 | }, 2558 | { 2559 | "cell_type": "code", 2560 | "execution_count": 40, 2561 | "metadata": {}, 2562 | "outputs": [], 2563 | "source": [ 2564 | "# word_index" 2565 | ] 2566 | }, 2567 | { 2568 | "cell_type": "code", 2569 | "execution_count": 31, 2570 | "metadata": {}, 2571 | "outputs": [ 2572 | { 2573 | "data": { 2574 | "text/plain": [ 2575 | "131" 2576 | ] 2577 | }, 2578 | "execution_count": 31, 2579 | "metadata": {}, 2580 | "output_type": "execute_result" 2581 | } 2582 | ], 2583 | "source": [ 2584 | "max(len(data) for data in df['clean_text'])" 2585 | ] 2586 | }, 2587 | { 2588 | "cell_type": "code", 2589 | "execution_count": 32, 2590 | "metadata": {}, 2591 | "outputs": [], 2592 | "source": [ 2593 | "# padding text data\n", 2594 | "sequences = tokenizer.texts_to_sequences(df['clean_text'])\n", 2595 | "padded_seq = pad_sequences(sequences, maxlen=131, padding='post', truncating='post')" 2596 | ] 2597 | }, 2598 | { 2599 | "cell_type": "code", 2600 | 
"execution_count": 33, 2601 | "metadata": {}, 2602 | "outputs": [ 2603 | { 2604 | "data": { 2605 | "text/plain": [ 2606 | "array([ 1, 28, 15330, 2630, 6365, 184, 7786, 385, 0,\n", 2607 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2608 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2609 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2610 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2611 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2612 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2613 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2614 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2615 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2616 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2617 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2618 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2619 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 2620 | " 0, 0, 0, 0, 0])" 2621 | ] 2622 | }, 2623 | "execution_count": 33, 2624 | "metadata": {}, 2625 | "output_type": "execute_result" 2626 | } 2627 | ], 2628 | "source": [ 2629 | "padded_seq[0]" 2630 | ] 2631 | }, 2632 | { 2633 | "cell_type": "code", 2634 | "execution_count": 35, 2635 | "metadata": {}, 2636 | "outputs": [], 2637 | "source": [ 2638 | "# create embedding index\n", 2639 | "embedding_index = {}\n", 2640 | "with open('glove.6B.100d.txt', encoding='utf-8') as f:\n", 2641 | " for line in f:\n", 2642 | " values = line.split()\n", 2643 | " word = values[0]\n", 2644 | " coefs = np.asarray(values[1:], dtype='float32')\n", 2645 | " embedding_index[word] = coefs" 2646 | ] 2647 | }, 2648 | { 2649 | "cell_type": "code", 2650 | "execution_count": 36, 2651 | "metadata": {}, 2652 | "outputs": [ 2653 | { 2654 | "data": { 2655 | "text/plain": [ 2656 | "array([-0.030769 , 0.11993 , 0.53909 , -0.43696 , -0.73937 ,\n", 2657 | " -0.15345 , 0.081126 , -0.38559 , -0.68797 , -0.41632 ,\n", 2658 | " -0.13183 , -0.24922 , 0.441 , 0.085919 , 0.20871 ,\n", 2659 | " -0.063582 , 0.062228 , -0.051234 , -0.13398 , 1.1418 ,\n", 2660 | " 0.036526 , 0.49029 , -0.24567 , -0.412 , 0.12349 ,\n", 2661 | " 0.41336 , -0.48397 , -0.54243 , -0.27787 , -0.26015 ,\n", 2662 | " -0.38485 , 0.78656 , 0.1023 , -0.20712 , 0.40751 ,\n", 2663 | " 0.32026 , -0.51052 , 0.48362 , -0.0099498, -0.38685 ,\n", 2664 | " 0.034975 , -0.167 , 0.4237 , -0.54164 , -0.30323 ,\n", 2665 | " -0.36983 , 0.082836 , -0.52538 , -0.064531 , -1.398 ,\n", 2666 | " -0.14873 , -0.35327 , -0.1118 , 1.0912 , 0.095864 ,\n", 2667 | " -2.8129 , 0.45238 , 0.46213 , 1.6012 , -0.20837 ,\n", 2668 | " -0.27377 , 0.71197 , -1.0754 , -0.046974 , 0.67479 ,\n", 2669 | " -0.065839 , 0.75824 , 0.39405 , 0.15507 , -0.64719 ,\n", 2670 | " 0.32796 , -0.031748 , 0.52899 , -0.43886 , 0.67405 ,\n", 2671 | " 0.42136 , -0.11981 , -0.21777 , -0.29756 , -0.1351 ,\n", 2672 | " 0.59898 , 0.46529 , -0.58258 , -0.02323 , -1.5442 ,\n", 2673 | " 0.01901 , -0.015877 , 0.024499 , -0.58017 , -0.67659 ,\n", 2674 | " -0.040379 , -0.44043 , 0.083292 , 0.20035 , -0.75499 ,\n", 2675 | " 0.16918 , -0.26573 , -0.52878 , 0.17584 , 1.065 ],\n", 2676 | " dtype=float32)" 2677 | ] 2678 | }, 2679 | "execution_count": 36, 2680 | "metadata": {}, 2681 | "output_type": "execute_result" 2682 | } 2683 | ], 2684 | "source": [ 2685 | "embedding_index['good']" 2686 | ] 2687 | }, 2688 | { 2689 | "cell_type": "code", 2690 | "execution_count": 41, 2691 | "metadata": {}, 2692 | "outputs": [], 2693 | "source": [ 2694 | "# create embedding matrix\n", 2695 | "embedding_matrix = np.zeros((vocab_size+1, 100))\n", 2696 | "for word, i in word_index.items():\n", 2697 | " embedding_vector = embedding_index.get(word)\n", 2698 | " if embedding_vector is not None:\n", 2699 | " embedding_matrix[i] = embedding_vector" 2700 | ] 2701 | }, 
2702 | { 2703 | "cell_type": "code", 2704 | "execution_count": 42, 2705 | "metadata": {}, 2706 | "outputs": [ 2707 | { 2708 | "data": { 2709 | "text/plain": [ 2710 | "(39086, 100)" 2711 | ] 2712 | }, 2713 | "execution_count": 42, 2714 | "metadata": {}, 2715 | "output_type": "execute_result" 2716 | } 2717 | ], 2718 | "source": [ 2719 | "embedding_matrix.shape" 2720 | ] 2721 | }, 2722 | { 2723 | "cell_type": "code", 2724 | "execution_count": null, 2725 | "metadata": {}, 2726 | "outputs": [], 2727 | "source": [] 2728 | }, 2729 | { 2730 | "cell_type": "markdown", 2731 | "metadata": {}, 2732 | "source": [ 2733 | "# Named Entity Recognition" 2734 | ] 2735 | }, 2736 | { 2737 | "cell_type": "code", 2738 | "execution_count": null, 2739 | "metadata": { 2740 | "id": "PCnOhgdCcked" 2741 | }, 2742 | "outputs": [], 2743 | "source": [ 2744 | "# !pip install -U pip setuptools wheel\n", 2745 | "# !pip install -U spacy\n", 2746 | "# !python -m spacy download en_core_web_sm" 2747 | ] 2748 | }, 2749 | { 2750 | "cell_type": "code", 2751 | "execution_count": null, 2752 | "metadata": { 2753 | "id": "9LfSlz4Ye9SD" 2754 | }, 2755 | "outputs": [], 2756 | "source": [ 2757 | "import spacy\n", 2758 | "from spacy import displacy" 2759 | ] 2760 | }, 2761 | { 2762 | "cell_type": "code", 2763 | "execution_count": null, 2764 | "metadata": { 2765 | "id": "nD0z2jmtfGfk" 2766 | }, 2767 | "outputs": [], 2768 | "source": [ 2769 | "NER = spacy.load('en_core_web_sm')" 2770 | ] 2771 | }, 2772 | { 2773 | "cell_type": "code", 2774 | "execution_count": null, 2775 | "metadata": { 2776 | "id": "Q5OJPYA2fyiK" 2777 | }, 2778 | "outputs": [], 2779 | "source": [ 2780 | "text = 'Mark Zuckerberg is one of the founders of Facebook, a company from the United States'" 2781 | ] 2782 | }, 2783 | { 2784 | "cell_type": "code", 2785 | "execution_count": null, 2786 | "metadata": { 2787 | "id": "pYE8UjWsgf6S" 2788 | }, 2789 | "outputs": [], 2790 | "source": [ 2791 | "ner_text = NER(text)" 2792 | ] 2793 | }, 2794 | { 2795 | "cell_type": "code", 2796 | "execution_count": null, 2797 | "metadata": { 2798 | "colab": { 2799 | "base_uri": "https://localhost:8080/" 2800 | }, 2801 | "id": "kpWcdBlWgf38", 2802 | "outputId": "b8cac9aa-02e2-42a3-dc01-fc3970b7c7e5" 2803 | }, 2804 | "outputs": [ 2805 | { 2806 | "name": "stdout", 2807 | "output_type": "stream", 2808 | "text": [ 2809 | "Mark Zuckerberg PERSON\n", 2810 | "one CARDINAL\n", 2811 | "Facebook ORG\n", 2812 | "the United States GPE\n" 2813 | ] 2814 | } 2815 | ], 2816 | "source": [ 2817 | "for word in ner_text.ents:\n", 2818 | " print(word.text, word.label_)" 2819 | ] 2820 | }, 2821 | { 2822 | "cell_type": "code", 2823 | "execution_count": null, 2824 | "metadata": { 2825 | "colab": { 2826 | "base_uri": "https://localhost:8080/", 2827 | "height": 35 2828 | }, 2829 | "id": "iFBCYIvDgrL6", 2830 | "outputId": "5e0291e3-8c6d-4082-f3c0-473ad0bdac43" 2831 | }, 2832 | "outputs": [ 2833 | { 2834 | "data": { 2835 | "application/vnd.google.colaboratory.intrinsic+json": { 2836 | "type": "string" 2837 | }, 2838 | "text/plain": [ 2839 | "'Countries, cities, states'" 2840 | ] 2841 | }, 2842 | "execution_count": 13, 2843 | "metadata": {}, 2844 | "output_type": "execute_result" 2845 | } 2846 | ], 2847 | "source": [ 2848 | "spacy.explain('GPE')" 2849 | ] 2850 | }, 2851 | { 2852 | "cell_type": "code", 2853 | "execution_count": null, 2854 | "metadata": { 2855 | "colab": { 2856 | "base_uri": "https://localhost:8080/", 2857 | "height": 35 2858 | }, 2859 | "id": "vkzMb7Bwg1Fi", 2860 | "outputId": 
"4b6c9ed6-1270-4b9f-a35a-28e24122d7d3" 2861 | }, 2862 | "outputs": [ 2863 | { 2864 | "data": { 2865 | "application/vnd.google.colaboratory.intrinsic+json": { 2866 | "type": "string" 2867 | }, 2868 | "text/plain": [ 2869 | "'Numerals that do not fall under another type'" 2870 | ] 2871 | }, 2872 | "execution_count": 14, 2873 | "metadata": {}, 2874 | "output_type": "execute_result" 2875 | } 2876 | ], 2877 | "source": [ 2878 | "spacy.explain('CARDINAL')" 2879 | ] 2880 | }, 2881 | { 2882 | "cell_type": "code", 2883 | "execution_count": null, 2884 | "metadata": { 2885 | "colab": { 2886 | "base_uri": "https://localhost:8080/", 2887 | "height": 52 2888 | }, 2889 | "id": "LBPSsLT5g9nS", 2890 | "outputId": "69a16d27-bf86-4b7f-e5cd-b3e56adeb149" 2891 | }, 2892 | "outputs": [ 2893 | { 2894 | "data": { 2895 | "text/html": [ 2896 | "
\n", 2897 | "\n", 2898 | " Mark Zuckerberg\n", 2899 | " PERSON\n", 2900 | "\n", 2901 | " is \n", 2902 | "\n", 2903 | " one\n", 2904 | " CARDINAL\n", 2905 | "\n", 2906 | " of the founders of \n", 2907 | "\n", 2908 | " Facebook\n", 2909 | " ORG\n", 2910 | "\n", 2911 | ", a company from \n", 2912 | "\n", 2913 | " the United States\n", 2914 | " GPE\n", 2915 | "\n", 2916 | "
" 2917 | ], 2918 | "text/plain": [ 2919 | "" 2920 | ] 2921 | }, 2922 | "metadata": {}, 2923 | "output_type": "display_data" 2924 | } 2925 | ], 2926 | "source": [ 2927 | "displacy.render(ner_text, style='ent', jupyter=True)" 2928 | ] 2929 | }, 2930 | { 2931 | "cell_type": "code", 2932 | "execution_count": null, 2933 | "metadata": {}, 2934 | "outputs": [], 2935 | "source": [] 2936 | }, 2937 | { 2938 | "cell_type": "markdown", 2939 | "metadata": {}, 2940 | "source": [ 2941 | "# Data Augmentation for Text" 2942 | ] 2943 | }, 2944 | { 2945 | "cell_type": "code", 2946 | "execution_count": null, 2947 | "metadata": {}, 2948 | "outputs": [], 2949 | "source": [ 2950 | "# uses\n", 2951 | "# 1. increase the dataset size by creating more samples\n", 2952 | "# 2. reduce overfitting\n", 2953 | "# 3. improve model generalization\n", 2954 | "# 4. handling imbalance dataset" 2955 | ] 2956 | }, 2957 | { 2958 | "cell_type": "code", 2959 | "execution_count": null, 2960 | "metadata": {}, 2961 | "outputs": [], 2962 | "source": [ 2963 | "!pip install nlpaug\n", 2964 | "!pip install sacremoses" 2965 | ] 2966 | }, 2967 | { 2968 | "cell_type": "code", 2969 | "execution_count": 2, 2970 | "metadata": {}, 2971 | "outputs": [], 2972 | "source": [ 2973 | "import nlpaug.augmenter.word as naw" 2974 | ] 2975 | }, 2976 | { 2977 | "cell_type": "code", 2978 | "execution_count": 3, 2979 | "metadata": {}, 2980 | "outputs": [], 2981 | "source": [ 2982 | "text = 'The quick brown fox jumps over a lazy dog'" 2983 | ] 2984 | }, 2985 | { 2986 | "cell_type": "markdown", 2987 | "metadata": {}, 2988 | "source": [ 2989 | "### Synonym Replacement" 2990 | ] 2991 | }, 2992 | { 2993 | "cell_type": "code", 2994 | "execution_count": 10, 2995 | "metadata": {}, 2996 | "outputs": [ 2997 | { 2998 | "name": "stdout", 2999 | "output_type": "stream", 3000 | "text": [ 3001 | "Synonym Text: ['The flying brownness fox jumps over a lazy andiron']\n" 3002 | ] 3003 | } 3004 | ], 3005 | "source": [ 3006 | "syn_aug = naw.synonym.SynonymAug(aug_src='wordnet')\n", 3007 | "synonym_text = syn_aug.augment(text)\n", 3008 | "print('Synonym Text:', synonym_text)" 3009 | ] 3010 | }, 3011 | { 3012 | "cell_type": "markdown", 3013 | "metadata": {}, 3014 | "source": [ 3015 | "### Random Substitution" 3016 | ] 3017 | }, 3018 | { 3019 | "cell_type": "code", 3020 | "execution_count": 11, 3021 | "metadata": {}, 3022 | "outputs": [ 3023 | { 3024 | "name": "stdout", 3025 | "output_type": "stream", 3026 | "text": [ 3027 | "Substituted Text: ['_ _ brown fox jumps _ a lazy dog']\n" 3028 | ] 3029 | } 3030 | ], 3031 | "source": [ 3032 | "sub_aug = naw.random.RandomWordAug(action='substitute')\n", 3033 | "substituted_text = sub_aug.augment(text)\n", 3034 | "print('Substituted Text:', substituted_text)" 3035 | ] 3036 | }, 3037 | { 3038 | "cell_type": "markdown", 3039 | "metadata": {}, 3040 | "source": [ 3041 | "### Random Deletion" 3042 | ] 3043 | }, 3044 | { 3045 | "cell_type": "code", 3046 | "execution_count": 12, 3047 | "metadata": {}, 3048 | "outputs": [ 3049 | { 3050 | "name": "stdout", 3051 | "output_type": "stream", 3052 | "text": [ 3053 | "Deletion Text: ['Quick brown jumps over a lazy dog']\n" 3054 | ] 3055 | } 3056 | ], 3057 | "source": [ 3058 | "del_aug = naw.random.RandomWordAug(action='delete')\n", 3059 | "deletion_text = del_aug.augment(text)\n", 3060 | "print('Deletion Text:', deletion_text)" 3061 | ] 3062 | }, 3063 | { 3064 | "cell_type": "markdown", 3065 | "metadata": {}, 3066 | "source": [ 3067 | "### Random Swap" 3068 | ] 3069 | }, 3070 | { 3071 | "cell_type": 
"code", 3072 | "execution_count": 13, 3073 | "metadata": {}, 3074 | "outputs": [ 3075 | { 3076 | "name": "stdout", 3077 | "output_type": "stream", 3078 | "text": [ 3079 | "Swap Text: ['The quick brown jumps fox a lazy over dog']\n" 3080 | ] 3081 | } 3082 | ], 3083 | "source": [ 3084 | "swap_aug = naw.random.RandomWordAug(action='swap')\n", 3085 | "swap_text = swap_aug.augment(text)\n", 3086 | "print('Swap Text:', swap_text)" 3087 | ] 3088 | }, 3089 | { 3090 | "cell_type": "markdown", 3091 | "metadata": {}, 3092 | "source": [ 3093 | "### Back Translation" 3094 | ] 3095 | }, 3096 | { 3097 | "cell_type": "code", 3098 | "execution_count": 15, 3099 | "metadata": {}, 3100 | "outputs": [ 3101 | { 3102 | "name": "stdout", 3103 | "output_type": "stream", 3104 | "text": [ 3105 | "Back Translated Text: ['The speedy brown fox jumps over a lazy dog']\n" 3106 | ] 3107 | } 3108 | ], 3109 | "source": [ 3110 | "# translate original text to other language (german) and convert back to english language\n", 3111 | "back_trans_aug = naw.back_translation.BackTranslationAug()\n", 3112 | "back_trans_text = back_trans_aug.augment(text)\n", 3113 | "print('Back Translated Text:', back_trans_text)" 3114 | ] 3115 | }, 3116 | { 3117 | "cell_type": "code", 3118 | "execution_count": null, 3119 | "metadata": {}, 3120 | "outputs": [], 3121 | "source": [] 3122 | }, 3123 | { 3124 | "cell_type": "code", 3125 | "execution_count": null, 3126 | "metadata": {}, 3127 | "outputs": [], 3128 | "source": [] 3129 | } 3130 | ], 3131 | "metadata": { 3132 | "kernelspec": { 3133 | "display_name": "Python 3 (ipykernel)", 3134 | "language": "python", 3135 | "name": "python3" 3136 | }, 3137 | "language_info": { 3138 | "codemirror_mode": { 3139 | "name": "ipython", 3140 | "version": 3 3141 | }, 3142 | "file_extension": ".py", 3143 | "mimetype": "text/x-python", 3144 | "name": "python", 3145 | "nbconvert_exporter": "python", 3146 | "pygments_lexer": "ipython3", 3147 | "version": "3.11.5" 3148 | } 3149 | }, 3150 | "nbformat": 4, 3151 | "nbformat_minor": 4 3152 | } 3153 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science Concepts 2 | 3 | The repository contains all the machine learning, deep learning and NLP concepts with examples in python 4 | 5 | 💻 Machine learning concepts playlist: http://bit.ly/mlconcepts 6 | 7 | ✍🏼 Natural Language Processing(NLP) concepts playlist: http://bit.ly/nlpconcepts 8 | 9 | 10 | ## Machine Learning 11 | 12 | 1. Normalize data using Max Absolute & Min Max Scaling - https://youtu.be/wSgWf-lUdDU 13 | 2. Standardize data using Z-Score/Standard Scalar - https://youtu.be/AmCkjGPmdvI 14 | 3. Detect and Remove Outliers in the Data - https://youtu.be/Cw2IvmWRcXs 15 | 4. Label Encoding for Categorical Attributes - https://youtu.be/YuzLkF7Ymf4 16 | 5. One Hot Encoding for Categorical Attributes - https://youtu.be/LqMHkc_F1WA 17 | 6. Target/Mean Encoding for Categorical Attributes - https://youtu.be/nd7vc4MZQz4 18 | 7. Frequency Encoding & Binary Encoding - https://youtu.be/2oCfBpnWQws 19 | 8. Extract Features from Datetime Attribute - https://youtu.be/PbyHFUVuqn8 20 | 9. How to Fill Missing Values in Dataset - https://youtu.be/FEQpdgoH_pM 21 | 10. Feature Selection using Correlation Matrix (Numerical) - https://youtu.be/1fFVt4tQjRE 22 | 11. Feature Selection using Chi Square (Category) - https://youtu.be/6N9H9KxdZdk 23 | 12. 
Feature Selection using Recursive Feature Elimination (RFE) - https://youtu.be/vxdVKbAv6as 24 | 13. Repeated Stratified KFold Cross Validation - https://youtu.be/cChWbibT-JI 25 | 14. How to handle Imbalanced Classes in Dataset - https://youtu.be/rVuUqpyPwEs 26 | 15. Ensemble Techniques to improve Model Performance - https://youtu.be/qPN-S5Ltbm4 27 | 16. Dimensionality Reduction using PCA vs LDA vs t-SNE vs UMAP - https://youtu.be/gk7ntPrxy-k 28 | 17. Handle Large Data using pandas - https://youtu.be/bd_1T2JCr4M 29 | 30 | ## Natural Language Processing 31 | 32 | 1. Tokenization - https://youtu.be/ivCcY8JCxeY 33 | 2. Stemming | Extract Root Words - https://youtu.be/O-SaH_dnb9A 34 | 3. Lemmatization - https://youtu.be/uvKKEkYZcdw 35 | 4. Part of Speech Tagging (POS) - https://youtu.be/n6j-T3_F9dI 36 | 5. Text Preprocessing in NLP - https://youtu.be/Br5dmsa49wo 37 | 6. Bag of Words (BOW) - https://youtu.be/dSce20oYIPY 38 | 7. Term Frequency - Inverse Document Frequency (TF-IDF) - https://youtu.be/O9aAwvk6SNI 39 | 8. Word2Vec - https://youtu.be/4DoJcQblpGQ 40 | 9. Word Embedding | GloVe - https://youtu.be/6uxVtUMtqtk 41 | 42 | ## Deep Learning 43 | 44 | 1. --------------------------------------------------------------------------------