├── .gitignore
├── Deep Learning
│   ├── Deep Learning Concepts - Hackers Realm.ipynb
│   ├── audio data
│   │   ├── OAF_back_fear.wav
│   │   ├── OAF_back_happy.wav
│   │   ├── OAF_back_ps.wav
│   │   └── OAF_back_sad.wav
│   └── image data
│       ├── 1.jpg
│       ├── 2.jpg
│       ├── 3.jpg
│       └── 4.jpg
├── Machine Learning
│   ├── Machine Learning Concepts - Hackers Realm.ipynb
│   └── data
│       ├── 1000000 Sales Records.rar
│       ├── Loan Prediction Dataset.csv
│       ├── Traffic data.csv
│       ├── bike sharing dataset.csv
│       ├── creditcard.rar
│       └── winequality.csv
├── NLP
│   ├── Natural Language Processing(NLP) Concepts - Hackers Realm.ipynb
│   └── data
│       └── Twitter Sentiments.csv
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | Machine Learning/data/1000000 Sales Records.csv
3 |
--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_fear.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_fear.wav
--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_happy.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_happy.wav
--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_ps.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_ps.wav
--------------------------------------------------------------------------------
/Deep Learning/audio data/OAF_back_sad.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/audio data/OAF_back_sad.wav
--------------------------------------------------------------------------------
/Deep Learning/image data/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/1.jpg
--------------------------------------------------------------------------------
/Deep Learning/image data/2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/2.jpg
--------------------------------------------------------------------------------
/Deep Learning/image data/3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/3.jpg
--------------------------------------------------------------------------------
/Deep Learning/image data/4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Deep Learning/image data/4.jpg
--------------------------------------------------------------------------------
/Machine Learning/data/1000000 Sales Records.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Machine Learning/data/1000000 Sales Records.rar
--------------------------------------------------------------------------------
/Machine Learning/data/Loan Prediction Dataset.csv:
--------------------------------------------------------------------------------
1 | Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
2 | LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
3 | LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
4 | LP001005,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
5 | LP001006,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y
6 | LP001008,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,Y
7 | LP001011,Male,Yes,2,Graduate,Yes,5417,4196,267,360,1,Urban,Y
8 | LP001013,Male,Yes,0,Not Graduate,No,2333,1516,95,360,1,Urban,Y
9 | LP001014,Male,Yes,3+,Graduate,No,3036,2504,158,360,0,Semiurban,N
10 | LP001018,Male,Yes,2,Graduate,No,4006,1526,168,360,1,Urban,Y
11 | LP001020,Male,Yes,1,Graduate,No,12841,10968,349,360,1,Semiurban,N
12 | LP001024,Male,Yes,2,Graduate,No,3200,700,70,360,1,Urban,Y
13 | LP001027,Male,Yes,2,Graduate,,2500,1840,109,360,1,Urban,Y
14 | LP001028,Male,Yes,2,Graduate,No,3073,8106,200,360,1,Urban,Y
15 | LP001029,Male,No,0,Graduate,No,1853,2840,114,360,1,Rural,N
16 | LP001030,Male,Yes,2,Graduate,No,1299,1086,17,120,1,Urban,Y
17 | LP001032,Male,No,0,Graduate,No,4950,0,125,360,1,Urban,Y
18 | LP001034,Male,No,1,Not Graduate,No,3596,0,100,240,,Urban,Y
19 | LP001036,Female,No,0,Graduate,No,3510,0,76,360,0,Urban,N
20 | LP001038,Male,Yes,0,Not Graduate,No,4887,0,133,360,1,Rural,N
21 | LP001041,Male,Yes,0,Graduate,,2600,3500,115,,1,Urban,Y
22 | LP001043,Male,Yes,0,Not Graduate,No,7660,0,104,360,0,Urban,N
23 | LP001046,Male,Yes,1,Graduate,No,5955,5625,315,360,1,Urban,Y
24 | LP001047,Male,Yes,0,Not Graduate,No,2600,1911,116,360,0,Semiurban,N
25 | LP001050,,Yes,2,Not Graduate,No,3365,1917,112,360,0,Rural,N
26 | LP001052,Male,Yes,1,Graduate,,3717,2925,151,360,,Semiurban,N
27 | LP001066,Male,Yes,0,Graduate,Yes,9560,0,191,360,1,Semiurban,Y
28 | LP001068,Male,Yes,0,Graduate,No,2799,2253,122,360,1,Semiurban,Y
29 | LP001073,Male,Yes,2,Not Graduate,No,4226,1040,110,360,1,Urban,Y
30 | LP001086,Male,No,0,Not Graduate,No,1442,0,35,360,1,Urban,N
31 | LP001087,Female,No,2,Graduate,,3750,2083,120,360,1,Semiurban,Y
32 | LP001091,Male,Yes,1,Graduate,,4166,3369,201,360,,Urban,N
33 | LP001095,Male,No,0,Graduate,No,3167,0,74,360,1,Urban,N
34 | LP001097,Male,No,1,Graduate,Yes,4692,0,106,360,1,Rural,N
35 | LP001098,Male,Yes,0,Graduate,No,3500,1667,114,360,1,Semiurban,Y
36 | LP001100,Male,No,3+,Graduate,No,12500,3000,320,360,1,Rural,N
37 | LP001106,Male,Yes,0,Graduate,No,2275,2067,,360,1,Urban,Y
38 | LP001109,Male,Yes,0,Graduate,No,1828,1330,100,,0,Urban,N
39 | LP001112,Female,Yes,0,Graduate,No,3667,1459,144,360,1,Semiurban,Y
40 | LP001114,Male,No,0,Graduate,No,4166,7210,184,360,1,Urban,Y
41 | LP001116,Male,No,0,Not Graduate,No,3748,1668,110,360,1,Semiurban,Y
42 | LP001119,Male,No,0,Graduate,No,3600,0,80,360,1,Urban,N
43 | LP001120,Male,No,0,Graduate,No,1800,1213,47,360,1,Urban,Y
44 | LP001123,Male,Yes,0,Graduate,No,2400,0,75,360,,Urban,Y
45 | LP001131,Male,Yes,0,Graduate,No,3941,2336,134,360,1,Semiurban,Y
46 | LP001136,Male,Yes,0,Not Graduate,Yes,4695,0,96,,1,Urban,Y
47 | LP001137,Female,No,0,Graduate,No,3410,0,88,,1,Urban,Y
48 | LP001138,Male,Yes,1,Graduate,No,5649,0,44,360,1,Urban,Y
49 | LP001144,Male,Yes,0,Graduate,No,5821,0,144,360,1,Urban,Y
50 | LP001146,Female,Yes,0,Graduate,No,2645,3440,120,360,0,Urban,N
51 | LP001151,Female,No,0,Graduate,No,4000,2275,144,360,1,Semiurban,Y
52 | LP001155,Female,Yes,0,Not Graduate,No,1928,1644,100,360,1,Semiurban,Y
53 | LP001157,Female,No,0,Graduate,No,3086,0,120,360,1,Semiurban,Y
54 | LP001164,Female,No,0,Graduate,No,4230,0,112,360,1,Semiurban,N
55 | LP001179,Male,Yes,2,Graduate,No,4616,0,134,360,1,Urban,N
56 | LP001186,Female,Yes,1,Graduate,Yes,11500,0,286,360,0,Urban,N
57 | LP001194,Male,Yes,2,Graduate,No,2708,1167,97,360,1,Semiurban,Y
58 | LP001195,Male,Yes,0,Graduate,No,2132,1591,96,360,1,Semiurban,Y
59 | LP001197,Male,Yes,0,Graduate,No,3366,2200,135,360,1,Rural,N
60 | LP001198,Male,Yes,1,Graduate,No,8080,2250,180,360,1,Urban,Y
61 | LP001199,Male,Yes,2,Not Graduate,No,3357,2859,144,360,1,Urban,Y
62 | LP001205,Male,Yes,0,Graduate,No,2500,3796,120,360,1,Urban,Y
63 | LP001206,Male,Yes,3+,Graduate,No,3029,0,99,360,1,Urban,Y
64 | LP001207,Male,Yes,0,Not Graduate,Yes,2609,3449,165,180,0,Rural,N
65 | LP001213,Male,Yes,1,Graduate,No,4945,0,,360,0,Rural,N
66 | LP001222,Female,No,0,Graduate,No,4166,0,116,360,0,Semiurban,N
67 | LP001225,Male,Yes,0,Graduate,No,5726,4595,258,360,1,Semiurban,N
68 | LP001228,Male,No,0,Not Graduate,No,3200,2254,126,180,0,Urban,N
69 | LP001233,Male,Yes,1,Graduate,No,10750,0,312,360,1,Urban,Y
70 | LP001238,Male,Yes,3+,Not Graduate,Yes,7100,0,125,60,1,Urban,Y
71 | LP001241,Female,No,0,Graduate,No,4300,0,136,360,0,Semiurban,N
72 | LP001243,Male,Yes,0,Graduate,No,3208,3066,172,360,1,Urban,Y
73 | LP001245,Male,Yes,2,Not Graduate,Yes,1875,1875,97,360,1,Semiurban,Y
74 | LP001248,Male,No,0,Graduate,No,3500,0,81,300,1,Semiurban,Y
75 | LP001250,Male,Yes,3+,Not Graduate,No,4755,0,95,,0,Semiurban,N
76 | LP001253,Male,Yes,3+,Graduate,Yes,5266,1774,187,360,1,Semiurban,Y
77 | LP001255,Male,No,0,Graduate,No,3750,0,113,480,1,Urban,N
78 | LP001256,Male,No,0,Graduate,No,3750,4750,176,360,1,Urban,N
79 | LP001259,Male,Yes,1,Graduate,Yes,1000,3022,110,360,1,Urban,N
80 | LP001263,Male,Yes,3+,Graduate,No,3167,4000,180,300,0,Semiurban,N
81 | LP001264,Male,Yes,3+,Not Graduate,Yes,3333,2166,130,360,,Semiurban,Y
82 | LP001265,Female,No,0,Graduate,No,3846,0,111,360,1,Semiurban,Y
83 | LP001266,Male,Yes,1,Graduate,Yes,2395,0,,360,1,Semiurban,Y
84 | LP001267,Female,Yes,2,Graduate,No,1378,1881,167,360,1,Urban,N
85 | LP001273,Male,Yes,0,Graduate,No,6000,2250,265,360,,Semiurban,N
86 | LP001275,Male,Yes,1,Graduate,No,3988,0,50,240,1,Urban,Y
87 | LP001279,Male,No,0,Graduate,No,2366,2531,136,360,1,Semiurban,Y
88 | LP001280,Male,Yes,2,Not Graduate,No,3333,2000,99,360,,Semiurban,Y
89 | LP001282,Male,Yes,0,Graduate,No,2500,2118,104,360,1,Semiurban,Y
90 | LP001289,Male,No,0,Graduate,No,8566,0,210,360,1,Urban,Y
91 | LP001310,Male,Yes,0,Graduate,No,5695,4167,175,360,1,Semiurban,Y
92 | LP001316,Male,Yes,0,Graduate,No,2958,2900,131,360,1,Semiurban,Y
93 | LP001318,Male,Yes,2,Graduate,No,6250,5654,188,180,1,Semiurban,Y
94 | LP001319,Male,Yes,2,Not Graduate,No,3273,1820,81,360,1,Urban,Y
95 | LP001322,Male,No,0,Graduate,No,4133,0,122,360,1,Semiurban,Y
96 | LP001325,Male,No,0,Not Graduate,No,3620,0,25,120,1,Semiurban,Y
97 | LP001326,Male,No,0,Graduate,,6782,0,,360,,Urban,N
98 | LP001327,Female,Yes,0,Graduate,No,2484,2302,137,360,1,Semiurban,Y
99 | LP001333,Male,Yes,0,Graduate,No,1977,997,50,360,1,Semiurban,Y
100 | LP001334,Male,Yes,0,Not Graduate,No,4188,0,115,180,1,Semiurban,Y
101 | LP001343,Male,Yes,0,Graduate,No,1759,3541,131,360,1,Semiurban,Y
102 | LP001345,Male,Yes,2,Not Graduate,No,4288,3263,133,180,1,Urban,Y
103 | LP001349,Male,No,0,Graduate,No,4843,3806,151,360,1,Semiurban,Y
104 | LP001350,Male,Yes,,Graduate,No,13650,0,,360,1,Urban,Y
105 | LP001356,Male,Yes,0,Graduate,No,4652,3583,,360,1,Semiurban,Y
106 | LP001357,Male,,,Graduate,No,3816,754,160,360,1,Urban,Y
107 | LP001367,Male,Yes,1,Graduate,No,3052,1030,100,360,1,Urban,Y
108 | LP001369,Male,Yes,2,Graduate,No,11417,1126,225,360,1,Urban,Y
109 | LP001370,Male,No,0,Not Graduate,,7333,0,120,360,1,Rural,N
110 | LP001379,Male,Yes,2,Graduate,No,3800,3600,216,360,0,Urban,N
111 | LP001384,Male,Yes,3+,Not Graduate,No,2071,754,94,480,1,Semiurban,Y
112 | LP001385,Male,No,0,Graduate,No,5316,0,136,360,1,Urban,Y
113 | LP001387,Female,Yes,0,Graduate,,2929,2333,139,360,1,Semiurban,Y
114 | LP001391,Male,Yes,0,Not Graduate,No,3572,4114,152,,0,Rural,N
115 | LP001392,Female,No,1,Graduate,Yes,7451,0,,360,1,Semiurban,Y
116 | LP001398,Male,No,0,Graduate,,5050,0,118,360,1,Semiurban,Y
117 | LP001401,Male,Yes,1,Graduate,No,14583,0,185,180,1,Rural,Y
118 | LP001404,Female,Yes,0,Graduate,No,3167,2283,154,360,1,Semiurban,Y
119 | LP001405,Male,Yes,1,Graduate,No,2214,1398,85,360,,Urban,Y
120 | LP001421,Male,Yes,0,Graduate,No,5568,2142,175,360,1,Rural,N
121 | LP001422,Female,No,0,Graduate,No,10408,0,259,360,1,Urban,Y
122 | LP001426,Male,Yes,,Graduate,No,5667,2667,180,360,1,Rural,Y
123 | LP001430,Female,No,0,Graduate,No,4166,0,44,360,1,Semiurban,Y
124 | LP001431,Female,No,0,Graduate,No,2137,8980,137,360,0,Semiurban,Y
125 | LP001432,Male,Yes,2,Graduate,No,2957,0,81,360,1,Semiurban,Y
126 | LP001439,Male,Yes,0,Not Graduate,No,4300,2014,194,360,1,Rural,Y
127 | LP001443,Female,No,0,Graduate,No,3692,0,93,360,,Rural,Y
128 | LP001448,,Yes,3+,Graduate,No,23803,0,370,360,1,Rural,Y
129 | LP001449,Male,No,0,Graduate,No,3865,1640,,360,1,Rural,Y
130 | LP001451,Male,Yes,1,Graduate,Yes,10513,3850,160,180,0,Urban,N
131 | LP001465,Male,Yes,0,Graduate,No,6080,2569,182,360,,Rural,N
132 | LP001469,Male,No,0,Graduate,Yes,20166,0,650,480,,Urban,Y
133 | LP001473,Male,No,0,Graduate,No,2014,1929,74,360,1,Urban,Y
134 | LP001478,Male,No,0,Graduate,No,2718,0,70,360,1,Semiurban,Y
135 | LP001482,Male,Yes,0,Graduate,Yes,3459,0,25,120,1,Semiurban,Y
136 | LP001487,Male,No,0,Graduate,No,4895,0,102,360,1,Semiurban,Y
137 | LP001488,Male,Yes,3+,Graduate,No,4000,7750,290,360,1,Semiurban,N
138 | LP001489,Female,Yes,0,Graduate,No,4583,0,84,360,1,Rural,N
139 | LP001491,Male,Yes,2,Graduate,Yes,3316,3500,88,360,1,Urban,Y
140 | LP001492,Male,No,0,Graduate,No,14999,0,242,360,0,Semiurban,N
141 | LP001493,Male,Yes,2,Not Graduate,No,4200,1430,129,360,1,Rural,N
142 | LP001497,Male,Yes,2,Graduate,No,5042,2083,185,360,1,Rural,N
143 | LP001498,Male,No,0,Graduate,No,5417,0,168,360,1,Urban,Y
144 | LP001504,Male,No,0,Graduate,Yes,6950,0,175,180,1,Semiurban,Y
145 | LP001507,Male,Yes,0,Graduate,No,2698,2034,122,360,1,Semiurban,Y
146 | LP001508,Male,Yes,2,Graduate,No,11757,0,187,180,1,Urban,Y
147 | LP001514,Female,Yes,0,Graduate,No,2330,4486,100,360,1,Semiurban,Y
148 | LP001516,Female,Yes,2,Graduate,No,14866,0,70,360,1,Urban,Y
149 | LP001518,Male,Yes,1,Graduate,No,1538,1425,30,360,1,Urban,Y
150 | LP001519,Female,No,0,Graduate,No,10000,1666,225,360,1,Rural,N
151 | LP001520,Male,Yes,0,Graduate,No,4860,830,125,360,1,Semiurban,Y
152 | LP001528,Male,No,0,Graduate,No,6277,0,118,360,0,Rural,N
153 | LP001529,Male,Yes,0,Graduate,Yes,2577,3750,152,360,1,Rural,Y
154 | LP001531,Male,No,0,Graduate,No,9166,0,244,360,1,Urban,N
155 | LP001532,Male,Yes,2,Not Graduate,No,2281,0,113,360,1,Rural,N
156 | LP001535,Male,No,0,Graduate,No,3254,0,50,360,1,Urban,Y
157 | LP001536,Male,Yes,3+,Graduate,No,39999,0,600,180,0,Semiurban,Y
158 | LP001541,Male,Yes,1,Graduate,No,6000,0,160,360,,Rural,Y
159 | LP001543,Male,Yes,1,Graduate,No,9538,0,187,360,1,Urban,Y
160 | LP001546,Male,No,0,Graduate,,2980,2083,120,360,1,Rural,Y
161 | LP001552,Male,Yes,0,Graduate,No,4583,5625,255,360,1,Semiurban,Y
162 | LP001560,Male,Yes,0,Not Graduate,No,1863,1041,98,360,1,Semiurban,Y
163 | LP001562,Male,Yes,0,Graduate,No,7933,0,275,360,1,Urban,N
164 | LP001565,Male,Yes,1,Graduate,No,3089,1280,121,360,0,Semiurban,N
165 | LP001570,Male,Yes,2,Graduate,No,4167,1447,158,360,1,Rural,Y
166 | LP001572,Male,Yes,0,Graduate,No,9323,0,75,180,1,Urban,Y
167 | LP001574,Male,Yes,0,Graduate,No,3707,3166,182,,1,Rural,Y
168 | LP001577,Female,Yes,0,Graduate,No,4583,0,112,360,1,Rural,N
169 | LP001578,Male,Yes,0,Graduate,No,2439,3333,129,360,1,Rural,Y
170 | LP001579,Male,No,0,Graduate,No,2237,0,63,480,0,Semiurban,N
171 | LP001580,Male,Yes,2,Graduate,No,8000,0,200,360,1,Semiurban,Y
172 | LP001581,Male,Yes,0,Not Graduate,,1820,1769,95,360,1,Rural,Y
173 | LP001585,,Yes,3+,Graduate,No,51763,0,700,300,1,Urban,Y
174 | LP001586,Male,Yes,3+,Not Graduate,No,3522,0,81,180,1,Rural,N
175 | LP001594,Male,Yes,0,Graduate,No,5708,5625,187,360,1,Semiurban,Y
176 | LP001603,Male,Yes,0,Not Graduate,Yes,4344,736,87,360,1,Semiurban,N
177 | LP001606,Male,Yes,0,Graduate,No,3497,1964,116,360,1,Rural,Y
178 | LP001608,Male,Yes,2,Graduate,No,2045,1619,101,360,1,Rural,Y
179 | LP001610,Male,Yes,3+,Graduate,No,5516,11300,495,360,0,Semiurban,N
180 | LP001616,Male,Yes,1,Graduate,No,3750,0,116,360,1,Semiurban,Y
181 | LP001630,Male,No,0,Not Graduate,No,2333,1451,102,480,0,Urban,N
182 | LP001633,Male,Yes,1,Graduate,No,6400,7250,180,360,0,Urban,N
183 | LP001634,Male,No,0,Graduate,No,1916,5063,67,360,,Rural,N
184 | LP001636,Male,Yes,0,Graduate,No,4600,0,73,180,1,Semiurban,Y
185 | LP001637,Male,Yes,1,Graduate,No,33846,0,260,360,1,Semiurban,N
186 | LP001639,Female,Yes,0,Graduate,No,3625,0,108,360,1,Semiurban,Y
187 | LP001640,Male,Yes,0,Graduate,Yes,39147,4750,120,360,1,Semiurban,Y
188 | LP001641,Male,Yes,1,Graduate,Yes,2178,0,66,300,0,Rural,N
189 | LP001643,Male,Yes,0,Graduate,No,2383,2138,58,360,,Rural,Y
190 | LP001644,,Yes,0,Graduate,Yes,674,5296,168,360,1,Rural,Y
191 | LP001647,Male,Yes,0,Graduate,No,9328,0,188,180,1,Rural,Y
192 | LP001653,Male,No,0,Not Graduate,No,4885,0,48,360,1,Rural,Y
193 | LP001656,Male,No,0,Graduate,No,12000,0,164,360,1,Semiurban,N
194 | LP001657,Male,Yes,0,Not Graduate,No,6033,0,160,360,1,Urban,N
195 | LP001658,Male,No,0,Graduate,No,3858,0,76,360,1,Semiurban,Y
196 | LP001664,Male,No,0,Graduate,No,4191,0,120,360,1,Rural,Y
197 | LP001665,Male,Yes,1,Graduate,No,3125,2583,170,360,1,Semiurban,N
198 | LP001666,Male,No,0,Graduate,No,8333,3750,187,360,1,Rural,Y
199 | LP001669,Female,No,0,Not Graduate,No,1907,2365,120,,1,Urban,Y
200 | LP001671,Female,Yes,0,Graduate,No,3416,2816,113,360,,Semiurban,Y
201 | LP001673,Male,No,0,Graduate,Yes,11000,0,83,360,1,Urban,N
202 | LP001674,Male,Yes,1,Not Graduate,No,2600,2500,90,360,1,Semiurban,Y
203 | LP001677,Male,No,2,Graduate,No,4923,0,166,360,0,Semiurban,Y
204 | LP001682,Male,Yes,3+,Not Graduate,No,3992,0,,180,1,Urban,N
205 | LP001688,Male,Yes,1,Not Graduate,No,3500,1083,135,360,1,Urban,Y
206 | LP001691,Male,Yes,2,Not Graduate,No,3917,0,124,360,1,Semiurban,Y
207 | LP001692,Female,No,0,Not Graduate,No,4408,0,120,360,1,Semiurban,Y
208 | LP001693,Female,No,0,Graduate,No,3244,0,80,360,1,Urban,Y
209 | LP001698,Male,No,0,Not Graduate,No,3975,2531,55,360,1,Rural,Y
210 | LP001699,Male,No,0,Graduate,No,2479,0,59,360,1,Urban,Y
211 | LP001702,Male,No,0,Graduate,No,3418,0,127,360,1,Semiurban,N
212 | LP001708,Female,No,0,Graduate,No,10000,0,214,360,1,Semiurban,N
213 | LP001711,Male,Yes,3+,Graduate,No,3430,1250,128,360,0,Semiurban,N
214 | LP001713,Male,Yes,1,Graduate,Yes,7787,0,240,360,1,Urban,Y
215 | LP001715,Male,Yes,3+,Not Graduate,Yes,5703,0,130,360,1,Rural,Y
216 | LP001716,Male,Yes,0,Graduate,No,3173,3021,137,360,1,Urban,Y
217 | LP001720,Male,Yes,3+,Not Graduate,No,3850,983,100,360,1,Semiurban,Y
218 | LP001722,Male,Yes,0,Graduate,No,150,1800,135,360,1,Rural,N
219 | LP001726,Male,Yes,0,Graduate,No,3727,1775,131,360,1,Semiurban,Y
220 | LP001732,Male,Yes,2,Graduate,,5000,0,72,360,0,Semiurban,N
221 | LP001734,Female,Yes,2,Graduate,No,4283,2383,127,360,,Semiurban,Y
222 | LP001736,Male,Yes,0,Graduate,No,2221,0,60,360,0,Urban,N
223 | LP001743,Male,Yes,2,Graduate,No,4009,1717,116,360,1,Semiurban,Y
224 | LP001744,Male,No,0,Graduate,No,2971,2791,144,360,1,Semiurban,Y
225 | LP001749,Male,Yes,0,Graduate,No,7578,1010,175,,1,Semiurban,Y
226 | LP001750,Male,Yes,0,Graduate,No,6250,0,128,360,1,Semiurban,Y
227 | LP001751,Male,Yes,0,Graduate,No,3250,0,170,360,1,Rural,N
228 | LP001754,Male,Yes,,Not Graduate,Yes,4735,0,138,360,1,Urban,N
229 | LP001758,Male,Yes,2,Graduate,No,6250,1695,210,360,1,Semiurban,Y
230 | LP001760,Male,,,Graduate,No,4758,0,158,480,1,Semiurban,Y
231 | LP001761,Male,No,0,Graduate,Yes,6400,0,200,360,1,Rural,Y
232 | LP001765,Male,Yes,1,Graduate,No,2491,2054,104,360,1,Semiurban,Y
233 | LP001768,Male,Yes,0,Graduate,,3716,0,42,180,1,Rural,Y
234 | LP001770,Male,No,0,Not Graduate,No,3189,2598,120,,1,Rural,Y
235 | LP001776,Female,No,0,Graduate,No,8333,0,280,360,1,Semiurban,Y
236 | LP001778,Male,Yes,1,Graduate,No,3155,1779,140,360,1,Semiurban,Y
237 | LP001784,Male,Yes,1,Graduate,No,5500,1260,170,360,1,Rural,Y
238 | LP001786,Male,Yes,0,Graduate,,5746,0,255,360,,Urban,N
239 | LP001788,Female,No,0,Graduate,Yes,3463,0,122,360,,Urban,Y
240 | LP001790,Female,No,1,Graduate,No,3812,0,112,360,1,Rural,Y
241 | LP001792,Male,Yes,1,Graduate,No,3315,0,96,360,1,Semiurban,Y
242 | LP001798,Male,Yes,2,Graduate,No,5819,5000,120,360,1,Rural,Y
243 | LP001800,Male,Yes,1,Not Graduate,No,2510,1983,140,180,1,Urban,N
244 | LP001806,Male,No,0,Graduate,No,2965,5701,155,60,1,Urban,Y
245 | LP001807,Male,Yes,2,Graduate,Yes,6250,1300,108,360,1,Rural,Y
246 | LP001811,Male,Yes,0,Not Graduate,No,3406,4417,123,360,1,Semiurban,Y
247 | LP001813,Male,No,0,Graduate,Yes,6050,4333,120,180,1,Urban,N
248 | LP001814,Male,Yes,2,Graduate,No,9703,0,112,360,1,Urban,Y
249 | LP001819,Male,Yes,1,Not Graduate,No,6608,0,137,180,1,Urban,Y
250 | LP001824,Male,Yes,1,Graduate,No,2882,1843,123,480,1,Semiurban,Y
251 | LP001825,Male,Yes,0,Graduate,No,1809,1868,90,360,1,Urban,Y
252 | LP001835,Male,Yes,0,Not Graduate,No,1668,3890,201,360,0,Semiurban,N
253 | LP001836,Female,No,2,Graduate,No,3427,0,138,360,1,Urban,N
254 | LP001841,Male,No,0,Not Graduate,Yes,2583,2167,104,360,1,Rural,Y
255 | LP001843,Male,Yes,1,Not Graduate,No,2661,7101,279,180,1,Semiurban,Y
256 | LP001844,Male,No,0,Graduate,Yes,16250,0,192,360,0,Urban,N
257 | LP001846,Female,No,3+,Graduate,No,3083,0,255,360,1,Rural,Y
258 | LP001849,Male,No,0,Not Graduate,No,6045,0,115,360,0,Rural,N
259 | LP001854,Male,Yes,3+,Graduate,No,5250,0,94,360,1,Urban,N
260 | LP001859,Male,Yes,0,Graduate,No,14683,2100,304,360,1,Rural,N
261 | LP001864,Male,Yes,3+,Not Graduate,No,4931,0,128,360,,Semiurban,N
262 | LP001865,Male,Yes,1,Graduate,No,6083,4250,330,360,,Urban,Y
263 | LP001868,Male,No,0,Graduate,No,2060,2209,134,360,1,Semiurban,Y
264 | LP001870,Female,No,1,Graduate,No,3481,0,155,36,1,Semiurban,N
265 | LP001871,Female,No,0,Graduate,No,7200,0,120,360,1,Rural,Y
266 | LP001872,Male,No,0,Graduate,Yes,5166,0,128,360,1,Semiurban,Y
267 | LP001875,Male,No,0,Graduate,No,4095,3447,151,360,1,Rural,Y
268 | LP001877,Male,Yes,2,Graduate,No,4708,1387,150,360,1,Semiurban,Y
269 | LP001882,Male,Yes,3+,Graduate,No,4333,1811,160,360,0,Urban,Y
270 | LP001883,Female,No,0,Graduate,,3418,0,135,360,1,Rural,N
271 | LP001884,Female,No,1,Graduate,No,2876,1560,90,360,1,Urban,Y
272 | LP001888,Female,No,0,Graduate,No,3237,0,30,360,1,Urban,Y
273 | LP001891,Male,Yes,0,Graduate,No,11146,0,136,360,1,Urban,Y
274 | LP001892,Male,No,0,Graduate,No,2833,1857,126,360,1,Rural,Y
275 | LP001894,Male,Yes,0,Graduate,No,2620,2223,150,360,1,Semiurban,Y
276 | LP001896,Male,Yes,2,Graduate,No,3900,0,90,360,1,Semiurban,Y
277 | LP001900,Male,Yes,1,Graduate,No,2750,1842,115,360,1,Semiurban,Y
278 | LP001903,Male,Yes,0,Graduate,No,3993,3274,207,360,1,Semiurban,Y
279 | LP001904,Male,Yes,0,Graduate,No,3103,1300,80,360,1,Urban,Y
280 | LP001907,Male,Yes,0,Graduate,No,14583,0,436,360,1,Semiurban,Y
281 | LP001908,Female,Yes,0,Not Graduate,No,4100,0,124,360,,Rural,Y
282 | LP001910,Male,No,1,Not Graduate,Yes,4053,2426,158,360,0,Urban,N
283 | LP001914,Male,Yes,0,Graduate,No,3927,800,112,360,1,Semiurban,Y
284 | LP001915,Male,Yes,2,Graduate,No,2301,985.7999878,78,180,1,Urban,Y
285 | LP001917,Female,No,0,Graduate,No,1811,1666,54,360,1,Urban,Y
286 | LP001922,Male,Yes,0,Graduate,No,20667,0,,360,1,Rural,N
287 | LP001924,Male,No,0,Graduate,No,3158,3053,89,360,1,Rural,Y
288 | LP001925,Female,No,0,Graduate,Yes,2600,1717,99,300,1,Semiurban,N
289 | LP001926,Male,Yes,0,Graduate,No,3704,2000,120,360,1,Rural,Y
290 | LP001931,Female,No,0,Graduate,No,4124,0,115,360,1,Semiurban,Y
291 | LP001935,Male,No,0,Graduate,No,9508,0,187,360,1,Rural,Y
292 | LP001936,Male,Yes,0,Graduate,No,3075,2416,139,360,1,Rural,Y
293 | LP001938,Male,Yes,2,Graduate,No,4400,0,127,360,0,Semiurban,N
294 | LP001940,Male,Yes,2,Graduate,No,3153,1560,134,360,1,Urban,Y
295 | LP001945,Female,No,,Graduate,No,5417,0,143,480,0,Urban,N
296 | LP001947,Male,Yes,0,Graduate,No,2383,3334,172,360,1,Semiurban,Y
297 | LP001949,Male,Yes,3+,Graduate,,4416,1250,110,360,1,Urban,Y
298 | LP001953,Male,Yes,1,Graduate,No,6875,0,200,360,1,Semiurban,Y
299 | LP001954,Female,Yes,1,Graduate,No,4666,0,135,360,1,Urban,Y
300 | LP001955,Female,No,0,Graduate,No,5000,2541,151,480,1,Rural,N
301 | LP001963,Male,Yes,1,Graduate,No,2014,2925,113,360,1,Urban,N
302 | LP001964,Male,Yes,0,Not Graduate,No,1800,2934,93,360,0,Urban,N
303 | LP001972,Male,Yes,,Not Graduate,No,2875,1750,105,360,1,Semiurban,Y
304 | LP001974,Female,No,0,Graduate,No,5000,0,132,360,1,Rural,Y
305 | LP001977,Male,Yes,1,Graduate,No,1625,1803,96,360,1,Urban,Y
306 | LP001978,Male,No,0,Graduate,No,4000,2500,140,360,1,Rural,Y
307 | LP001990,Male,No,0,Not Graduate,No,2000,0,,360,1,Urban,N
308 | LP001993,Female,No,0,Graduate,No,3762,1666,135,360,1,Rural,Y
309 | LP001994,Female,No,0,Graduate,No,2400,1863,104,360,0,Urban,N
310 | LP001996,Male,No,0,Graduate,No,20233,0,480,360,1,Rural,N
311 | LP001998,Male,Yes,2,Not Graduate,No,7667,0,185,360,,Rural,Y
312 | LP002002,Female,No,0,Graduate,No,2917,0,84,360,1,Semiurban,Y
313 | LP002004,Male,No,0,Not Graduate,No,2927,2405,111,360,1,Semiurban,Y
314 | LP002006,Female,No,0,Graduate,No,2507,0,56,360,1,Rural,Y
315 | LP002008,Male,Yes,2,Graduate,Yes,5746,0,144,84,,Rural,Y
316 | LP002024,,Yes,0,Graduate,No,2473,1843,159,360,1,Rural,N
317 | LP002031,Male,Yes,1,Not Graduate,No,3399,1640,111,180,1,Urban,Y
318 | LP002035,Male,Yes,2,Graduate,No,3717,0,120,360,1,Semiurban,Y
319 | LP002036,Male,Yes,0,Graduate,No,2058,2134,88,360,,Urban,Y
320 | LP002043,Female,No,1,Graduate,No,3541,0,112,360,,Semiurban,Y
321 | LP002050,Male,Yes,1,Graduate,Yes,10000,0,155,360,1,Rural,N
322 | LP002051,Male,Yes,0,Graduate,No,2400,2167,115,360,1,Semiurban,Y
323 | LP002053,Male,Yes,3+,Graduate,No,4342,189,124,360,1,Semiurban,Y
324 | LP002054,Male,Yes,2,Not Graduate,No,3601,1590,,360,1,Rural,Y
325 | LP002055,Female,No,0,Graduate,No,3166,2985,132,360,,Rural,Y
326 | LP002065,Male,Yes,3+,Graduate,No,15000,0,300,360,1,Rural,Y
327 | LP002067,Male,Yes,1,Graduate,Yes,8666,4983,376,360,0,Rural,N
328 | LP002068,Male,No,0,Graduate,No,4917,0,130,360,0,Rural,Y
329 | LP002082,Male,Yes,0,Graduate,Yes,5818,2160,184,360,1,Semiurban,Y
330 | LP002086,Female,Yes,0,Graduate,No,4333,2451,110,360,1,Urban,N
331 | LP002087,Female,No,0,Graduate,No,2500,0,67,360,1,Urban,Y
332 | LP002097,Male,No,1,Graduate,No,4384,1793,117,360,1,Urban,Y
333 | LP002098,Male,No,0,Graduate,No,2935,0,98,360,1,Semiurban,Y
334 | LP002100,Male,No,,Graduate,No,2833,0,71,360,1,Urban,Y
335 | LP002101,Male,Yes,0,Graduate,,63337,0,490,180,1,Urban,Y
336 | LP002103,,Yes,1,Graduate,Yes,9833,1833,182,180,1,Urban,Y
337 | LP002106,Male,Yes,,Graduate,Yes,5503,4490,70,,1,Semiurban,Y
338 | LP002110,Male,Yes,1,Graduate,,5250,688,160,360,1,Rural,Y
339 | LP002112,Male,Yes,2,Graduate,Yes,2500,4600,176,360,1,Rural,Y
340 | LP002113,Female,No,3+,Not Graduate,No,1830,0,,360,0,Urban,N
341 | LP002114,Female,No,0,Graduate,No,4160,0,71,360,1,Semiurban,Y
342 | LP002115,Male,Yes,3+,Not Graduate,No,2647,1587,173,360,1,Rural,N
343 | LP002116,Female,No,0,Graduate,No,2378,0,46,360,1,Rural,N
344 | LP002119,Male,Yes,1,Not Graduate,No,4554,1229,158,360,1,Urban,Y
345 | LP002126,Male,Yes,3+,Not Graduate,No,3173,0,74,360,1,Semiurban,Y
346 | LP002128,Male,Yes,2,Graduate,,2583,2330,125,360,1,Rural,Y
347 | LP002129,Male,Yes,0,Graduate,No,2499,2458,160,360,1,Semiurban,Y
348 | LP002130,Male,Yes,,Not Graduate,No,3523,3230,152,360,0,Rural,N
349 | LP002131,Male,Yes,2,Not Graduate,No,3083,2168,126,360,1,Urban,Y
350 | LP002137,Male,Yes,0,Graduate,No,6333,4583,259,360,,Semiurban,Y
351 | LP002138,Male,Yes,0,Graduate,No,2625,6250,187,360,1,Rural,Y
352 | LP002139,Male,Yes,0,Graduate,No,9083,0,228,360,1,Semiurban,Y
353 | LP002140,Male,No,0,Graduate,No,8750,4167,308,360,1,Rural,N
354 | LP002141,Male,Yes,3+,Graduate,No,2666,2083,95,360,1,Rural,Y
355 | LP002142,Female,Yes,0,Graduate,Yes,5500,0,105,360,0,Rural,N
356 | LP002143,Female,Yes,0,Graduate,No,2423,505,130,360,1,Semiurban,Y
357 | LP002144,Female,No,,Graduate,No,3813,0,116,180,1,Urban,Y
358 | LP002149,Male,Yes,2,Graduate,No,8333,3167,165,360,1,Rural,Y
359 | LP002151,Male,Yes,1,Graduate,No,3875,0,67,360,1,Urban,N
360 | LP002158,Male,Yes,0,Not Graduate,No,3000,1666,100,480,0,Urban,N
361 | LP002160,Male,Yes,3+,Graduate,No,5167,3167,200,360,1,Semiurban,Y
362 | LP002161,Female,No,1,Graduate,No,4723,0,81,360,1,Semiurban,N
363 | LP002170,Male,Yes,2,Graduate,No,5000,3667,236,360,1,Semiurban,Y
364 | LP002175,Male,Yes,0,Graduate,No,4750,2333,130,360,1,Urban,Y
365 | LP002178,Male,Yes,0,Graduate,No,3013,3033,95,300,,Urban,Y
366 | LP002180,Male,No,0,Graduate,Yes,6822,0,141,360,1,Rural,Y
367 | LP002181,Male,No,0,Not Graduate,No,6216,0,133,360,1,Rural,N
368 | LP002187,Male,No,0,Graduate,No,2500,0,96,480,1,Semiurban,N
369 | LP002188,Male,No,0,Graduate,No,5124,0,124,,0,Rural,N
370 | LP002190,Male,Yes,1,Graduate,No,6325,0,175,360,1,Semiurban,Y
371 | LP002191,Male,Yes,0,Graduate,No,19730,5266,570,360,1,Rural,N
372 | LP002194,Female,No,0,Graduate,Yes,15759,0,55,360,1,Semiurban,Y
373 | LP002197,Male,Yes,2,Graduate,No,5185,0,155,360,1,Semiurban,Y
374 | LP002201,Male,Yes,2,Graduate,Yes,9323,7873,380,300,1,Rural,Y
375 | LP002205,Male,No,1,Graduate,No,3062,1987,111,180,0,Urban,N
376 | LP002209,Female,No,0,Graduate,,2764,1459,110,360,1,Urban,Y
377 | LP002211,Male,Yes,0,Graduate,No,4817,923,120,180,1,Urban,Y
378 | LP002219,Male,Yes,3+,Graduate,No,8750,4996,130,360,1,Rural,Y
379 | LP002223,Male,Yes,0,Graduate,No,4310,0,130,360,,Semiurban,Y
380 | LP002224,Male,No,0,Graduate,No,3069,0,71,480,1,Urban,N
381 | LP002225,Male,Yes,2,Graduate,No,5391,0,130,360,1,Urban,Y
382 | LP002226,Male,Yes,0,Graduate,,3333,2500,128,360,1,Semiurban,Y
383 | LP002229,Male,No,0,Graduate,No,5941,4232,296,360,1,Semiurban,Y
384 | LP002231,Female,No,0,Graduate,No,6000,0,156,360,1,Urban,Y
385 | LP002234,Male,No,0,Graduate,Yes,7167,0,128,360,1,Urban,Y
386 | LP002236,Male,Yes,2,Graduate,No,4566,0,100,360,1,Urban,N
387 | LP002237,Male,No,1,Graduate,,3667,0,113,180,1,Urban,Y
388 | LP002239,Male,No,0,Not Graduate,No,2346,1600,132,360,1,Semiurban,Y
389 | LP002243,Male,Yes,0,Not Graduate,No,3010,3136,,360,0,Urban,N
390 | LP002244,Male,Yes,0,Graduate,No,2333,2417,136,360,1,Urban,Y
391 | LP002250,Male,Yes,0,Graduate,No,5488,0,125,360,1,Rural,Y
392 | LP002255,Male,No,3+,Graduate,No,9167,0,185,360,1,Rural,Y
393 | LP002262,Male,Yes,3+,Graduate,No,9504,0,275,360,1,Rural,Y
394 | LP002263,Male,Yes,0,Graduate,No,2583,2115,120,360,,Urban,Y
395 | LP002265,Male,Yes,2,Not Graduate,No,1993,1625,113,180,1,Semiurban,Y
396 | LP002266,Male,Yes,2,Graduate,No,3100,1400,113,360,1,Urban,Y
397 | LP002272,Male,Yes,2,Graduate,No,3276,484,135,360,,Semiurban,Y
398 | LP002277,Female,No,0,Graduate,No,3180,0,71,360,0,Urban,N
399 | LP002281,Male,Yes,0,Graduate,No,3033,1459,95,360,1,Urban,Y
400 | LP002284,Male,No,0,Not Graduate,No,3902,1666,109,360,1,Rural,Y
401 | LP002287,Female,No,0,Graduate,No,1500,1800,103,360,0,Semiurban,N
402 | LP002288,Male,Yes,2,Not Graduate,No,2889,0,45,180,0,Urban,N
403 | LP002296,Male,No,0,Not Graduate,No,2755,0,65,300,1,Rural,N
404 | LP002297,Male,No,0,Graduate,No,2500,20000,103,360,1,Semiurban,Y
405 | LP002300,Female,No,0,Not Graduate,No,1963,0,53,360,1,Semiurban,Y
406 | LP002301,Female,No,0,Graduate,Yes,7441,0,194,360,1,Rural,N
407 | LP002305,Female,No,0,Graduate,No,4547,0,115,360,1,Semiurban,Y
408 | LP002308,Male,Yes,0,Not Graduate,No,2167,2400,115,360,1,Urban,Y
409 | LP002314,Female,No,0,Not Graduate,No,2213,0,66,360,1,Rural,Y
410 | LP002315,Male,Yes,1,Graduate,No,8300,0,152,300,0,Semiurban,N
411 | LP002317,Male,Yes,3+,Graduate,No,81000,0,360,360,0,Rural,N
412 | LP002318,Female,No,1,Not Graduate,Yes,3867,0,62,360,1,Semiurban,N
413 | LP002319,Male,Yes,0,Graduate,,6256,0,160,360,,Urban,Y
414 | LP002328,Male,Yes,0,Not Graduate,No,6096,0,218,360,0,Rural,N
415 | LP002332,Male,Yes,0,Not Graduate,No,2253,2033,110,360,1,Rural,Y
416 | LP002335,Female,Yes,0,Not Graduate,No,2149,3237,178,360,0,Semiurban,N
417 | LP002337,Female,No,0,Graduate,No,2995,0,60,360,1,Urban,Y
418 | LP002341,Female,No,1,Graduate,No,2600,0,160,360,1,Urban,N
419 | LP002342,Male,Yes,2,Graduate,Yes,1600,20000,239,360,1,Urban,N
420 | LP002345,Male,Yes,0,Graduate,No,1025,2773,112,360,1,Rural,Y
421 | LP002347,Male,Yes,0,Graduate,No,3246,1417,138,360,1,Semiurban,Y
422 | LP002348,Male,Yes,0,Graduate,No,5829,0,138,360,1,Rural,Y
423 | LP002357,Female,No,0,Not Graduate,No,2720,0,80,,0,Urban,N
424 | LP002361,Male,Yes,0,Graduate,No,1820,1719,100,360,1,Urban,Y
425 | LP002362,Male,Yes,1,Graduate,No,7250,1667,110,,0,Urban,N
426 | LP002364,Male,Yes,0,Graduate,No,14880,0,96,360,1,Semiurban,Y
427 | LP002366,Male,Yes,0,Graduate,No,2666,4300,121,360,1,Rural,Y
428 | LP002367,Female,No,1,Not Graduate,No,4606,0,81,360,1,Rural,N
429 | LP002368,Male,Yes,2,Graduate,No,5935,0,133,360,1,Semiurban,Y
430 | LP002369,Male,Yes,0,Graduate,No,2920,16.12000084,87,360,1,Rural,Y
431 | LP002370,Male,No,0,Not Graduate,No,2717,0,60,180,1,Urban,Y
432 | LP002377,Female,No,1,Graduate,Yes,8624,0,150,360,1,Semiurban,Y
433 | LP002379,Male,No,0,Graduate,No,6500,0,105,360,0,Rural,N
434 | LP002386,Male,No,0,Graduate,,12876,0,405,360,1,Semiurban,Y
435 | LP002387,Male,Yes,0,Graduate,No,2425,2340,143,360,1,Semiurban,Y
436 | LP002390,Male,No,0,Graduate,No,3750,0,100,360,1,Urban,Y
437 | LP002393,Female,,,Graduate,No,10047,0,,240,1,Semiurban,Y
438 | LP002398,Male,No,0,Graduate,No,1926,1851,50,360,1,Semiurban,Y
439 | LP002401,Male,Yes,0,Graduate,No,2213,1125,,360,1,Urban,Y
440 | LP002403,Male,No,0,Graduate,Yes,10416,0,187,360,0,Urban,N
441 | LP002407,Female,Yes,0,Not Graduate,Yes,7142,0,138,360,1,Rural,Y
442 | LP002408,Male,No,0,Graduate,No,3660,5064,187,360,1,Semiurban,Y
443 | LP002409,Male,Yes,0,Graduate,No,7901,1833,180,360,1,Rural,Y
444 | LP002418,Male,No,3+,Not Graduate,No,4707,1993,148,360,1,Semiurban,Y
445 | LP002422,Male,No,1,Graduate,No,37719,0,152,360,1,Semiurban,Y
446 | LP002424,Male,Yes,0,Graduate,No,7333,8333,175,300,,Rural,Y
447 | LP002429,Male,Yes,1,Graduate,Yes,3466,1210,130,360,1,Rural,Y
448 | LP002434,Male,Yes,2,Not Graduate,No,4652,0,110,360,1,Rural,Y
449 | LP002435,Male,Yes,0,Graduate,,3539,1376,55,360,1,Rural,N
450 | LP002443,Male,Yes,2,Graduate,No,3340,1710,150,360,0,Rural,N
451 | LP002444,Male,No,1,Not Graduate,Yes,2769,1542,190,360,,Semiurban,N
452 | LP002446,Male,Yes,2,Not Graduate,No,2309,1255,125,360,0,Rural,N
453 | LP002447,Male,Yes,2,Not Graduate,No,1958,1456,60,300,,Urban,Y
454 | LP002448,Male,Yes,0,Graduate,No,3948,1733,149,360,0,Rural,N
455 | LP002449,Male,Yes,0,Graduate,No,2483,2466,90,180,0,Rural,Y
456 | LP002453,Male,No,0,Graduate,Yes,7085,0,84,360,1,Semiurban,Y
457 | LP002455,Male,Yes,2,Graduate,No,3859,0,96,360,1,Semiurban,Y
458 | LP002459,Male,Yes,0,Graduate,No,4301,0,118,360,1,Urban,Y
459 | LP002467,Male,Yes,0,Graduate,No,3708,2569,173,360,1,Urban,N
460 | LP002472,Male,No,2,Graduate,No,4354,0,136,360,1,Rural,Y
461 | LP002473,Male,Yes,0,Graduate,No,8334,0,160,360,1,Semiurban,N
462 | LP002478,,Yes,0,Graduate,Yes,2083,4083,160,360,,Semiurban,Y
463 | LP002484,Male,Yes,3+,Graduate,No,7740,0,128,180,1,Urban,Y
464 | LP002487,Male,Yes,0,Graduate,No,3015,2188,153,360,1,Rural,Y
465 | LP002489,Female,No,1,Not Graduate,,5191,0,132,360,1,Semiurban,Y
466 | LP002493,Male,No,0,Graduate,No,4166,0,98,360,0,Semiurban,N
467 | LP002494,Male,No,0,Graduate,No,6000,0,140,360,1,Rural,Y
468 | LP002500,Male,Yes,3+,Not Graduate,No,2947,1664,70,180,0,Urban,N
469 | LP002501,,Yes,0,Graduate,No,16692,0,110,360,1,Semiurban,Y
470 | LP002502,Female,Yes,2,Not Graduate,,210,2917,98,360,1,Semiurban,Y
471 | LP002505,Male,Yes,0,Graduate,No,4333,2451,110,360,1,Urban,N
472 | LP002515,Male,Yes,1,Graduate,Yes,3450,2079,162,360,1,Semiurban,Y
473 | LP002517,Male,Yes,1,Not Graduate,No,2653,1500,113,180,0,Rural,N
474 | LP002519,Male,Yes,3+,Graduate,No,4691,0,100,360,1,Semiurban,Y
475 | LP002522,Female,No,0,Graduate,Yes,2500,0,93,360,,Urban,Y
476 | LP002524,Male,No,2,Graduate,No,5532,4648,162,360,1,Rural,Y
477 | LP002527,Male,Yes,2,Graduate,Yes,16525,1014,150,360,1,Rural,Y
478 | LP002529,Male,Yes,2,Graduate,No,6700,1750,230,300,1,Semiurban,Y
479 | LP002530,,Yes,2,Graduate,No,2873,1872,132,360,0,Semiurban,N
480 | LP002531,Male,Yes,1,Graduate,Yes,16667,2250,86,360,1,Semiurban,Y
481 | LP002533,Male,Yes,2,Graduate,No,2947,1603,,360,1,Urban,N
482 | LP002534,Female,No,0,Not Graduate,No,4350,0,154,360,1,Rural,Y
483 | LP002536,Male,Yes,3+,Not Graduate,No,3095,0,113,360,1,Rural,Y
484 | LP002537,Male,Yes,0,Graduate,No,2083,3150,128,360,1,Semiurban,Y
485 | LP002541,Male,Yes,0,Graduate,No,10833,0,234,360,1,Semiurban,Y
486 | LP002543,Male,Yes,2,Graduate,No,8333,0,246,360,1,Semiurban,Y
487 | LP002544,Male,Yes,1,Not Graduate,No,1958,2436,131,360,1,Rural,Y
488 | LP002545,Male,No,2,Graduate,No,3547,0,80,360,0,Rural,N
489 | LP002547,Male,Yes,1,Graduate,No,18333,0,500,360,1,Urban,N
490 | LP002555,Male,Yes,2,Graduate,Yes,4583,2083,160,360,1,Semiurban,Y
491 | LP002556,Male,No,0,Graduate,No,2435,0,75,360,1,Urban,N
492 | LP002560,Male,No,0,Not Graduate,No,2699,2785,96,360,,Semiurban,Y
493 | LP002562,Male,Yes,1,Not Graduate,No,5333,1131,186,360,,Urban,Y
494 | LP002571,Male,No,0,Not Graduate,No,3691,0,110,360,1,Rural,Y
495 | LP002582,Female,No,0,Not Graduate,Yes,17263,0,225,360,1,Semiurban,Y
496 | LP002585,Male,Yes,0,Graduate,No,3597,2157,119,360,0,Rural,N
497 | LP002586,Female,Yes,1,Graduate,No,3326,913,105,84,1,Semiurban,Y
498 | LP002587,Male,Yes,0,Not Graduate,No,2600,1700,107,360,1,Rural,Y
499 | LP002588,Male,Yes,0,Graduate,No,4625,2857,111,12,,Urban,Y
500 | LP002600,Male,Yes,1,Graduate,Yes,2895,0,95,360,1,Semiurban,Y
501 | LP002602,Male,No,0,Graduate,No,6283,4416,209,360,0,Rural,N
502 | LP002603,Female,No,0,Graduate,No,645,3683,113,480,1,Rural,Y
503 | LP002606,Female,No,0,Graduate,No,3159,0,100,360,1,Semiurban,Y
504 | LP002615,Male,Yes,2,Graduate,No,4865,5624,208,360,1,Semiurban,Y
505 | LP002618,Male,Yes,1,Not Graduate,No,4050,5302,138,360,,Rural,N
506 | LP002619,Male,Yes,0,Not Graduate,No,3814,1483,124,300,1,Semiurban,Y
507 | LP002622,Male,Yes,2,Graduate,No,3510,4416,243,360,1,Rural,Y
508 | LP002624,Male,Yes,0,Graduate,No,20833,6667,480,360,,Urban,Y
509 | LP002625,,No,0,Graduate,No,3583,0,96,360,1,Urban,N
510 | LP002626,Male,Yes,0,Graduate,Yes,2479,3013,188,360,1,Urban,Y
511 | LP002634,Female,No,1,Graduate,No,13262,0,40,360,1,Urban,Y
512 | LP002637,Male,No,0,Not Graduate,No,3598,1287,100,360,1,Rural,N
513 | LP002640,Male,Yes,1,Graduate,No,6065,2004,250,360,1,Semiurban,Y
514 | LP002643,Male,Yes,2,Graduate,No,3283,2035,148,360,1,Urban,Y
515 | LP002648,Male,Yes,0,Graduate,No,2130,6666,70,180,1,Semiurban,N
516 | LP002652,Male,No,0,Graduate,No,5815,3666,311,360,1,Rural,N
517 | LP002659,Male,Yes,3+,Graduate,No,3466,3428,150,360,1,Rural,Y
518 | LP002670,Female,Yes,2,Graduate,No,2031,1632,113,480,1,Semiurban,Y
519 | LP002682,Male,Yes,,Not Graduate,No,3074,1800,123,360,0,Semiurban,N
520 | LP002683,Male,No,0,Graduate,No,4683,1915,185,360,1,Semiurban,N
521 | LP002684,Female,No,0,Not Graduate,No,3400,0,95,360,1,Rural,N
522 | LP002689,Male,Yes,2,Not Graduate,No,2192,1742,45,360,1,Semiurban,Y
523 | LP002690,Male,No,0,Graduate,No,2500,0,55,360,1,Semiurban,Y
524 | LP002692,Male,Yes,3+,Graduate,Yes,5677,1424,100,360,1,Rural,Y
525 | LP002693,Male,Yes,2,Graduate,Yes,7948,7166,480,360,1,Rural,Y
526 | LP002697,Male,No,0,Graduate,No,4680,2087,,360,1,Semiurban,N
527 | LP002699,Male,Yes,2,Graduate,Yes,17500,0,400,360,1,Rural,Y
528 | LP002705,Male,Yes,0,Graduate,No,3775,0,110,360,1,Semiurban,Y
529 | LP002706,Male,Yes,1,Not Graduate,No,5285,1430,161,360,0,Semiurban,Y
530 | LP002714,Male,No,1,Not Graduate,No,2679,1302,94,360,1,Semiurban,Y
531 | LP002716,Male,No,0,Not Graduate,No,6783,0,130,360,1,Semiurban,Y
532 | LP002717,Male,Yes,0,Graduate,No,1025,5500,216,360,,Rural,Y
533 | LP002720,Male,Yes,3+,Graduate,No,4281,0,100,360,1,Urban,Y
534 | LP002723,Male,No,2,Graduate,No,3588,0,110,360,0,Rural,N
535 | LP002729,Male,No,1,Graduate,No,11250,0,196,360,,Semiurban,N
536 | LP002731,Female,No,0,Not Graduate,Yes,18165,0,125,360,1,Urban,Y
537 | LP002732,Male,No,0,Not Graduate,,2550,2042,126,360,1,Rural,Y
538 | LP002734,Male,Yes,0,Graduate,No,6133,3906,324,360,1,Urban,Y
539 | LP002738,Male,No,2,Graduate,No,3617,0,107,360,1,Semiurban,Y
540 | LP002739,Male,Yes,0,Not Graduate,No,2917,536,66,360,1,Rural,N
541 | LP002740,Male,Yes,3+,Graduate,No,6417,0,157,180,1,Rural,Y
542 | LP002741,Female,Yes,1,Graduate,No,4608,2845,140,180,1,Semiurban,Y
543 | LP002743,Female,No,0,Graduate,No,2138,0,99,360,0,Semiurban,N
544 | LP002753,Female,No,1,Graduate,,3652,0,95,360,1,Semiurban,Y
545 | LP002755,Male,Yes,1,Not Graduate,No,2239,2524,128,360,1,Urban,Y
546 | LP002757,Female,Yes,0,Not Graduate,No,3017,663,102,360,,Semiurban,Y
547 | LP002767,Male,Yes,0,Graduate,No,2768,1950,155,360,1,Rural,Y
548 | LP002768,Male,No,0,Not Graduate,No,3358,0,80,36,1,Semiurban,N
549 | LP002772,Male,No,0,Graduate,No,2526,1783,145,360,1,Rural,Y
550 | LP002776,Female,No,0,Graduate,No,5000,0,103,360,0,Semiurban,N
551 | LP002777,Male,Yes,0,Graduate,No,2785,2016,110,360,1,Rural,Y
552 | LP002778,Male,Yes,2,Graduate,Yes,6633,0,,360,0,Rural,N
553 | LP002784,Male,Yes,1,Not Graduate,No,2492,2375,,360,1,Rural,Y
554 | LP002785,Male,Yes,1,Graduate,No,3333,3250,158,360,1,Urban,Y
555 | LP002788,Male,Yes,0,Not Graduate,No,2454,2333,181,360,0,Urban,N
556 | LP002789,Male,Yes,0,Graduate,No,3593,4266,132,180,0,Rural,N
557 | LP002792,Male,Yes,1,Graduate,No,5468,1032,26,360,1,Semiurban,Y
558 | LP002794,Female,No,0,Graduate,No,2667,1625,84,360,,Urban,Y
559 | LP002795,Male,Yes,3+,Graduate,Yes,10139,0,260,360,1,Semiurban,Y
560 | LP002798,Male,Yes,0,Graduate,No,3887,2669,162,360,1,Semiurban,Y
561 | LP002804,Female,Yes,0,Graduate,No,4180,2306,182,360,1,Semiurban,Y
562 | LP002807,Male,Yes,2,Not Graduate,No,3675,242,108,360,1,Semiurban,Y
563 | LP002813,Female,Yes,1,Graduate,Yes,19484,0,600,360,1,Semiurban,Y
564 | LP002820,Male,Yes,0,Graduate,No,5923,2054,211,360,1,Rural,Y
565 | LP002821,Male,No,0,Not Graduate,Yes,5800,0,132,360,1,Semiurban,Y
566 | LP002832,Male,Yes,2,Graduate,No,8799,0,258,360,0,Urban,N
567 | LP002833,Male,Yes,0,Not Graduate,No,4467,0,120,360,,Rural,Y
568 | LP002836,Male,No,0,Graduate,No,3333,0,70,360,1,Urban,Y
569 | LP002837,Male,Yes,3+,Graduate,No,3400,2500,123,360,0,Rural,N
570 | LP002840,Female,No,0,Graduate,No,2378,0,9,360,1,Urban,N
571 | LP002841,Male,Yes,0,Graduate,No,3166,2064,104,360,0,Urban,N
572 | LP002842,Male,Yes,1,Graduate,No,3417,1750,186,360,1,Urban,Y
573 | LP002847,Male,Yes,,Graduate,No,5116,1451,165,360,0,Urban,N
574 | LP002855,Male,Yes,2,Graduate,No,16666,0,275,360,1,Urban,Y
575 | LP002862,Male,Yes,2,Not Graduate,No,6125,1625,187,480,1,Semiurban,N
576 | LP002863,Male,Yes,3+,Graduate,No,6406,0,150,360,1,Semiurban,N
577 | LP002868,Male,Yes,2,Graduate,No,3159,461,108,84,1,Urban,Y
578 | LP002872,,Yes,0,Graduate,No,3087,2210,136,360,0,Semiurban,N
579 | LP002874,Male,No,0,Graduate,No,3229,2739,110,360,1,Urban,Y
580 | LP002877,Male,Yes,1,Graduate,No,1782,2232,107,360,1,Rural,Y
581 | LP002888,Male,No,0,Graduate,,3182,2917,161,360,1,Urban,Y
582 | LP002892,Male,Yes,2,Graduate,No,6540,0,205,360,1,Semiurban,Y
583 | LP002893,Male,No,0,Graduate,No,1836,33837,90,360,1,Urban,N
584 | LP002894,Female,Yes,0,Graduate,No,3166,0,36,360,1,Semiurban,Y
585 | LP002898,Male,Yes,1,Graduate,No,1880,0,61,360,,Rural,N
586 | LP002911,Male,Yes,1,Graduate,No,2787,1917,146,360,0,Rural,N
587 | LP002912,Male,Yes,1,Graduate,No,4283,3000,172,84,1,Rural,N
588 | LP002916,Male,Yes,0,Graduate,No,2297,1522,104,360,1,Urban,Y
589 | LP002917,Female,No,0,Not Graduate,No,2165,0,70,360,1,Semiurban,Y
590 | LP002925,,No,0,Graduate,No,4750,0,94,360,1,Semiurban,Y
591 | LP002926,Male,Yes,2,Graduate,Yes,2726,0,106,360,0,Semiurban,N
592 | LP002928,Male,Yes,0,Graduate,No,3000,3416,56,180,1,Semiurban,Y
593 | LP002931,Male,Yes,2,Graduate,Yes,6000,0,205,240,1,Semiurban,N
594 | LP002933,,No,3+,Graduate,Yes,9357,0,292,360,1,Semiurban,Y
595 | LP002936,Male,Yes,0,Graduate,No,3859,3300,142,180,1,Rural,Y
596 | LP002938,Male,Yes,0,Graduate,Yes,16120,0,260,360,1,Urban,Y
597 | LP002940,Male,No,0,Not Graduate,No,3833,0,110,360,1,Rural,Y
598 | LP002941,Male,Yes,2,Not Graduate,Yes,6383,1000,187,360,1,Rural,N
599 | LP002943,Male,No,,Graduate,No,2987,0,88,360,0,Semiurban,N
600 | LP002945,Male,Yes,0,Graduate,Yes,9963,0,180,360,1,Rural,Y
601 | LP002948,Male,Yes,2,Graduate,No,5780,0,192,360,1,Urban,Y
602 | LP002949,Female,No,3+,Graduate,,416,41667,350,180,,Urban,N
603 | LP002950,Male,Yes,0,Not Graduate,,2894,2792,155,360,1,Rural,Y
604 | LP002953,Male,Yes,3+,Graduate,No,5703,0,128,360,1,Urban,Y
605 | LP002958,Male,No,0,Graduate,No,3676,4301,172,360,1,Rural,Y
606 | LP002959,Female,Yes,1,Graduate,No,12000,0,496,360,1,Semiurban,Y
607 | LP002960,Male,Yes,0,Not Graduate,No,2400,3800,,180,1,Urban,N
608 | LP002961,Male,Yes,1,Graduate,No,3400,2500,173,360,1,Semiurban,Y
609 | LP002964,Male,Yes,2,Not Graduate,No,3987,1411,157,360,1,Rural,Y
610 | LP002974,Male,Yes,0,Graduate,No,3232,1950,108,360,1,Rural,Y
611 | LP002978,Female,No,0,Graduate,No,2900,0,71,360,1,Rural,Y
612 | LP002979,Male,Yes,3+,Graduate,No,4106,0,40,180,1,Rural,Y
613 | LP002983,Male,Yes,1,Graduate,No,8072,240,253,360,1,Urban,Y
614 | LP002984,Male,Yes,2,Graduate,No,7583,0,187,360,1,Urban,Y
615 | LP002990,Female,No,0,Graduate,Yes,4583,0,133,360,0,Semiurban,N
616 |
--------------------------------------------------------------------------------
/Machine Learning/data/creditcard.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aswintechguy/Data-Science-Concepts/9025d8ee8526a15b73b9df56c79338eae7b498dc/Machine Learning/data/creditcard.rar
--------------------------------------------------------------------------------
/NLP/Natural Language Processing(NLP) Concepts - Hackers Realm.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tokenization\n",
8 | "\n",
9 | "Tokenization is the process of breaking text down into its smallest units, called tokens. Words, numbers, and punctuation marks can all be tokens."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 6,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 7,
24 | "metadata": {},
25 | "outputs": [
26 | {
27 | "data": {
28 | "text/plain": [
29 | "['Hi',\n",
30 | " 'Everyone!',\n",
31 | " 'This',\n",
32 | " 'is',\n",
33 | " 'Hackers',\n",
34 | " 'Realm.',\n",
35 | " 'We',\n",
36 | " 'are',\n",
37 | " 'learning',\n",
38 | " 'Natural',\n",
39 | " 'Language',\n",
40 | " 'Processing.',\n",
41 | " 'We',\n",
42 | " 'reached',\n",
43 | " '1000000',\n",
44 | " 'views.']"
45 | ]
46 | },
47 | "execution_count": 7,
48 | "metadata": {},
49 | "output_type": "execute_result"
50 | }
51 | ],
52 | "source": [
53 | "text.split(' ')"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 8,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "from nltk import sent_tokenize, word_tokenize"
63 | ]
64 | },
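{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added note (not part of the original notebook run):* `sent_tokenize` and `word_tokenize` rely on NLTK's Punkt tokenizer models. If they are not already installed, a one-time download such as the sketch below is needed (recent NLTK releases may ask for `punkt_tab` instead)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one-time download of the Punkt models used by sent_tokenize / word_tokenize\n",
"import nltk\n",
"nltk.download('punkt')"
]
},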
65 | {
66 | "cell_type": "code",
67 | "execution_count": 9,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
72 | "text/plain": [
73 | "['Hi Everyone!',\n",
74 | " 'This is Hackers Realm.',\n",
75 | " 'We are learning Natural Language Processing.',\n",
76 | " 'We reached 1000000 views.']"
77 | ]
78 | },
79 | "execution_count": 9,
80 | "metadata": {},
81 | "output_type": "execute_result"
82 | }
83 | ],
84 | "source": [
85 | "# split the text into sentences\n",
86 | "sent_tokens = sent_tokenize(text)\n",
87 | "sent_tokens"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 10,
93 | "metadata": {},
94 | "outputs": [
95 | {
96 | "data": {
97 | "text/plain": [
98 | "['Hi',\n",
99 | " 'Everyone',\n",
100 | " '!',\n",
101 | " 'This',\n",
102 | " 'is',\n",
103 | " 'Hackers',\n",
104 | " 'Realm',\n",
105 | " '.',\n",
106 | " 'We',\n",
107 | " 'are',\n",
108 | " 'learning',\n",
109 | " 'Natural',\n",
110 | " 'Language',\n",
111 | " 'Processing',\n",
112 | " '.',\n",
113 | " 'We',\n",
114 | " 'reached',\n",
115 | " '1000000',\n",
116 | " 'views',\n",
117 | " '.']"
118 | ]
119 | },
120 | "execution_count": 10,
121 | "metadata": {},
122 | "output_type": "execute_result"
123 | }
124 | ],
125 | "source": [
126 | "# split the text into words\n",
127 | "word_tokens = word_tokenize(text)\n",
128 | "word_tokens"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": null,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": []
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "# Stemming\n",
143 | "\n",
144 | "Stemming is the process of reducing a word to its root form, or stem. The stem need not be a valid dictionary word; it is simply an equal or shorter form of the original word."
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 13,
150 | "metadata": {},
151 | "outputs": [],
152 | "source": [
153 | "from nltk.stem import PorterStemmer, SnowballStemmer\n",
154 | "ps = PorterStemmer()"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": 17,
160 | "metadata": {},
161 | "outputs": [
162 | {
163 | "data": {
164 | "text/plain": [
165 | "'eat'"
166 | ]
167 | },
168 | "execution_count": 17,
169 | "metadata": {},
170 | "output_type": "execute_result"
171 | }
172 | ],
173 | "source": [
174 | "word = 'eats'\n",
175 | "ps.stem(word)"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 16,
181 | "metadata": {},
182 | "outputs": [
183 | {
184 | "data": {
185 | "text/plain": [
186 | "'eat'"
187 | ]
188 | },
189 | "execution_count": 16,
190 | "metadata": {},
191 | "output_type": "execute_result"
192 | }
193 | ],
194 | "source": [
195 | "word = 'eating'\n",
196 | "ps.stem(word)"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 18,
202 | "metadata": {},
203 | "outputs": [
204 | {
205 | "data": {
206 | "text/plain": [
207 | "'eaten'"
208 | ]
209 | },
210 | "execution_count": 18,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "word = 'eaten'\n",
217 | "ps.stem(word)"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 19,
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 20,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "word_tokens = word_tokenize(text)"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 21,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "data": {
245 | "text/plain": [
246 | "'Hi everyon ! thi is hacker realm . We are learn natur languag process . We reach 1000000 view .'"
247 | ]
248 | },
249 | "execution_count": 21,
250 | "metadata": {},
251 | "output_type": "execute_result"
252 | }
253 | ],
254 | "source": [
255 | "stemmed_sentence = \" \".join(ps.stem(word) for word in word_tokens)\n",
256 | "stemmed_sentence"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {},
263 | "outputs": [],
264 | "source": []
265 | },
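{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added sketch (not part of the original notebook run):* `SnowballStemmer` is imported above alongside `PorterStemmer` but never used, so the cell below stems the same tokens with the Snowball (Porter2) algorithm for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# stem the same word tokens with the Snowball (Porter2) stemmer for comparison\n",
"snowball = SnowballStemmer('english')\n",
"\" \".join(snowball.stem(word) for word in word_tokens)"
]
},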
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "# Lemmatization\n",
271 | "\n",
272 | "Lemmatization is the process of mapping a word to its dictionary form, called the lemma. Unlike stemming, it always produces a valid word, but it is computationally more expensive."
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 22,
278 | "metadata": {},
279 | "outputs": [],
280 | "source": [
281 | "from nltk.stem import WordNetLemmatizer\n",
282 | "lemmatizer = WordNetLemmatizer()"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 30,
288 | "metadata": {},
289 | "outputs": [
290 | {
291 | "data": {
292 | "text/plain": [
293 | "'worker'"
294 | ]
295 | },
296 | "execution_count": 30,
297 | "metadata": {},
298 | "output_type": "execute_result"
299 | }
300 | ],
301 | "source": [
302 | "lemmatizer.lemmatize('workers')"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": 31,
308 | "metadata": {},
309 | "outputs": [
310 | {
311 | "data": {
312 | "text/plain": [
313 | "'word'"
314 | ]
315 | },
316 | "execution_count": 31,
317 | "metadata": {},
318 | "output_type": "execute_result"
319 | }
320 | ],
321 | "source": [
322 | "lemmatizer.lemmatize('words')"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 37,
328 | "metadata": {},
329 | "outputs": [
330 | {
331 | "data": {
332 | "text/plain": [
333 | "'foot'"
334 | ]
335 | },
336 | "execution_count": 37,
337 | "metadata": {},
338 | "output_type": "execute_result"
339 | }
340 | ],
341 | "source": [
342 | "lemmatizer.lemmatize('feet')"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": 39,
348 | "metadata": {},
349 | "outputs": [
350 | {
351 | "data": {
352 | "text/plain": [
353 | "'strip'"
354 | ]
355 | },
356 | "execution_count": 39,
357 | "metadata": {},
358 | "output_type": "execute_result"
359 | }
360 | ],
361 | "source": [
362 | "lemmatizer.lemmatize('stripes', 'v')"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": 40,
368 | "metadata": {},
369 | "outputs": [
370 | {
371 | "data": {
372 | "text/plain": [
373 | "'stripe'"
374 | ]
375 | },
376 | "execution_count": 40,
377 | "metadata": {},
378 | "output_type": "execute_result"
379 | }
380 | ],
381 | "source": [
382 | "lemmatizer.lemmatize('stripes', 'n')"
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": 41,
388 | "metadata": {},
389 | "outputs": [],
390 | "source": [
391 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 42,
397 | "metadata": {},
398 | "outputs": [],
399 | "source": [
400 | "word_tokens = word_tokenize(text)"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": 44,
406 | "metadata": {},
407 | "outputs": [
408 | {
409 | "data": {
410 | "text/plain": [
411 | "'hi everyone ! this is hacker realm . we are learning natural language processing . we reached 1000000 view .'"
412 | ]
413 | },
414 | "execution_count": 44,
415 | "metadata": {},
416 | "output_type": "execute_result"
417 | }
418 | ],
419 | "source": [
420 | "lemmatized_sentence = \" \".join(lemmatizer.lemmatize(word.lower()) for word in word_tokens)\n",
421 | "lemmatized_sentence"
422 | ]
423 | },
424 | {
425 | "cell_type": "code",
426 | "execution_count": null,
427 | "metadata": {},
428 | "outputs": [],
429 | "source": []
430 | },
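{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Added note (a sketch, not part of the original run):* in the lemmatized sentence above, verbs such as 'learning' and 'reached' are left unchanged because `lemmatize` treats every word as a noun by default; passing the verb tag `'v'` changes the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the default POS is noun ('n'); pass 'v' so verb forms are reduced to their lemma\n",
"lemmatizer.lemmatize('learning'), lemmatizer.lemmatize('learning', 'v'), lemmatizer.lemmatize('reached', 'v')"
]
},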
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "# Part of Speech Tagging (POS)\n",
436 | "\n",
437 | "Part-of-speech tagging converts a sentence into a list of (word, tag) tuples, where each tag is a part-of-speech label indicating whether the word is a noun, adjective, verb, and so on.\n",
438 | "\n",
439 | "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": 45,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "from nltk import pos_tag"
449 | ]
450 | },
451 | {
452 | "cell_type": "code",
453 | "execution_count": 51,
454 | "metadata": {},
455 | "outputs": [
456 | {
457 | "data": {
458 | "text/plain": [
459 | "[('fighting', 'VBG')]"
460 | ]
461 | },
462 | "execution_count": 51,
463 | "metadata": {},
464 | "output_type": "execute_result"
465 | }
466 | ],
467 | "source": [
468 | "pos_tag(['fighting'])"
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": 46,
474 | "metadata": {},
475 | "outputs": [],
476 | "source": [
477 | "text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'"
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": 47,
483 | "metadata": {},
484 | "outputs": [],
485 | "source": [
486 | "word_tokens = word_tokenize(text)"
487 | ]
488 | },
489 | {
490 | "cell_type": "code",
491 | "execution_count": 52,
492 | "metadata": {},
493 | "outputs": [
494 | {
495 | "data": {
496 | "text/plain": [
497 | "[('Hi', 'NNP'),\n",
498 | " ('Everyone', 'NN'),\n",
499 | " ('!', '.'),\n",
500 | " ('This', 'DT'),\n",
501 | " ('is', 'VBZ'),\n",
502 | " ('Hackers', 'NNP'),\n",
503 | " ('Realm', 'NNP'),\n",
504 | " ('.', '.'),\n",
505 | " ('We', 'PRP'),\n",
506 | " ('are', 'VBP'),\n",
507 | " ('learning', 'VBG'),\n",
508 | " ('Natural', 'NNP'),\n",
509 | " ('Language', 'NNP'),\n",
510 | " ('Processing', 'NNP'),\n",
511 | " ('.', '.'),\n",
512 | " ('We', 'PRP'),\n",
513 | " ('reached', 'VBD'),\n",
514 | " ('1000000', 'CD'),\n",
515 | " ('views', 'NNS'),\n",
516 | " ('.', '.')]"
517 | ]
518 | },
519 | "execution_count": 52,
520 | "metadata": {},
521 | "output_type": "execute_result"
522 | }
523 | ],
524 | "source": [
525 | "pos_tag(word_tokens)"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": null,
531 | "metadata": {},
532 | "outputs": [],
533 | "source": []
534 | },
535 | {
536 | "cell_type": "markdown",
537 | "metadata": {},
538 | "source": [
539 | "# Text Preprocessing (Clean Data)"
540 | ]
541 | },
542 | {
543 | "cell_type": "code",
544 | "execution_count": 9,
545 | "metadata": {},
546 | "outputs": [
547 | {
548 | "data": {
596 | "text/plain": [
597 | " tweet\n",
598 | "0 @user when a father is dysfunctional and is s...\n",
599 | "1 @user @user thanks for #lyft credit i can't us...\n",
600 | "2 bihday your majesty\n",
601 | "3 #model i love u take with u all the time in ...\n",
602 | "4 factsguide: society now #motivation"
603 | ]
604 | },
605 | "execution_count": 9,
606 | "metadata": {},
607 | "output_type": "execute_result"
608 | }
609 | ],
610 | "source": [
611 | "import pandas as pd\n",
612 | "import string\n",
613 | "df = pd.read_csv('data/Twitter Sentiments.csv')\n",
614 | "# drop the columns\n",
615 | "df = df.drop(columns=['id', 'label'], axis=1)\n",
616 | "df.head()"
617 | ]
618 | },
619 | {
620 | "cell_type": "markdown",
621 | "metadata": {},
622 | "source": [
623 | "## Convert to lowercase"
624 | ]
625 | },
626 | {
627 | "cell_type": "code",
628 | "execution_count": 10,
629 | "metadata": {},
630 | "outputs": [
631 | {
632 | "data": {
686 | "text/plain": [
687 | " tweet \\\n",
688 | "0 @user when a father is dysfunctional and is s... \n",
689 | "1 @user @user thanks for #lyft credit i can't us... \n",
690 | "2 bihday your majesty \n",
691 | "3 #model i love u take with u all the time in ... \n",
692 | "4 factsguide: society now #motivation \n",
693 | "\n",
694 | " clean_text \n",
695 | "0 @user when a father is dysfunctional and is s... \n",
696 | "1 @user @user thanks for #lyft credit i can't us... \n",
697 | "2 bihday your majesty \n",
698 | "3 #model i love u take with u all the time in ... \n",
699 | "4 factsguide: society now #motivation "
700 | ]
701 | },
702 | "execution_count": 10,
703 | "metadata": {},
704 | "output_type": "execute_result"
705 | }
706 | ],
707 | "source": [
708 | "df['clean_text'] = df['tweet'].str.lower()\n",
709 | "df.head()"
710 | ]
711 | },
712 | {
713 | "cell_type": "markdown",
714 | "metadata": {},
715 | "source": [
716 | "## Removal of Punctuations"
717 | ]
718 | },
719 | {
720 | "cell_type": "code",
721 | "execution_count": 12,
722 | "metadata": {},
723 | "outputs": [
724 | {
725 | "data": {
726 | "text/plain": [
727 | "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'"
728 | ]
729 | },
730 | "execution_count": 12,
731 | "metadata": {},
732 | "output_type": "execute_result"
733 | }
734 | ],
735 | "source": [
736 | "string.punctuation"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 13,
742 | "metadata": {},
743 | "outputs": [],
744 | "source": [
745 | "def remove_punctuations(text):\n",
746 | " punctuations = string.punctuation\n",
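    "    # str.maketrans('', '', punctuations) maps every punctuation character to None, so translate() strips them all\n",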
747 | " return text.translate(str.maketrans('', '', punctuations))"
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": 14,
753 | "metadata": {},
754 | "outputs": [
755 | {
756 | "data": {
810 | "text/plain": [
811 | " tweet \\\n",
812 | "0 @user when a father is dysfunctional and is s... \n",
813 | "1 @user @user thanks for #lyft credit i can't us... \n",
814 | "2 bihday your majesty \n",
815 | "3 #model i love u take with u all the time in ... \n",
816 | "4 factsguide: society now #motivation \n",
817 | "\n",
818 | " clean_text \n",
819 | "0 user when a father is dysfunctional and is so... \n",
820 | "1 user user thanks for lyft credit i cant use ca... \n",
821 | "2 bihday your majesty \n",
822 | "3 model i love u take with u all the time in u... \n",
823 | "4 factsguide society now motivation "
824 | ]
825 | },
826 | "execution_count": 14,
827 | "metadata": {},
828 | "output_type": "execute_result"
829 | }
830 | ],
831 | "source": [
832 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_punctuations(x))\n",
833 | "df.head()"
834 | ]
835 | },
836 | {
837 | "cell_type": "markdown",
838 | "metadata": {},
839 | "source": [
840 | "## Removal of Stopwords"
841 | ]
842 | },
843 | {
844 | "cell_type": "code",
845 | "execution_count": 17,
846 | "metadata": {},
847 | "outputs": [
848 | {
849 | "data": {
850 | "text/plain": [
851 | "\"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't\""
852 | ]
853 | },
854 | "execution_count": 17,
855 | "metadata": {},
856 | "output_type": "execute_result"
857 | }
858 | ],
859 | "source": [
860 | "from nltk.corpus import stopwords\n",
861 | "\", \".join(stopwords.words('english'))"
862 | ]
863 | },
864 | {
865 | "cell_type": "code",
866 | "execution_count": 18,
867 | "metadata": {},
868 | "outputs": [],
869 | "source": [
870 | "STOPWORDS = set(stopwords.words('english'))\n",
871 | "def remove_stopwords(text):\n",
872 | " return \" \".join([word for word in text.split() if word not in STOPWORDS])"
873 | ]
874 | },
875 | {
876 | "cell_type": "code",
877 | "execution_count": 19,
878 | "metadata": {},
879 | "outputs": [
880 | {
881 | "data": {
935 | "text/plain": [
936 | " tweet \\\n",
937 | "0 @user when a father is dysfunctional and is s... \n",
938 | "1 @user @user thanks for #lyft credit i can't us... \n",
939 | "2 bihday your majesty \n",
940 | "3 #model i love u take with u all the time in ... \n",
941 | "4 factsguide: society now #motivation \n",
942 | "\n",
943 | " clean_text \n",
944 | "0 user father dysfunctional selfish drags kids d... \n",
945 | "1 user user thanks lyft credit cant use cause do... \n",
946 | "2 bihday majesty \n",
947 | "3 model love u take u time urð± ðððð... \n",
948 | "4 factsguide society motivation "
949 | ]
950 | },
951 | "execution_count": 19,
952 | "metadata": {},
953 | "output_type": "execute_result"
954 | }
955 | ],
956 | "source": [
957 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))\n",
958 | "df.head()"
959 | ]
960 | },
961 | {
962 | "cell_type": "markdown",
963 | "metadata": {},
964 | "source": [
965 | "## Removal of Frequent Words"
966 | ]
967 | },
968 | {
969 | "cell_type": "code",
970 | "execution_count": 23,
971 | "metadata": {},
972 | "outputs": [
973 | {
974 | "data": {
975 | "text/plain": [
976 | "[('user', 17473),\n",
977 | " ('love', 2647),\n",
978 | " ('day', 2198),\n",
979 | " ('happy', 1663),\n",
980 | " ('amp', 1582),\n",
981 | " ('im', 1139),\n",
982 | " ('u', 1136),\n",
983 | " ('time', 1110),\n",
984 | " ('life', 1086),\n",
985 | " ('like', 1042)]"
986 | ]
987 | },
988 | "execution_count": 23,
989 | "metadata": {},
990 | "output_type": "execute_result"
991 | }
992 | ],
993 | "source": [
994 | "from collections import Counter\n",
995 | "word_count = Counter()\n",
996 | "for text in df['clean_text']:\n",
997 | " for word in text.split():\n",
998 | " word_count[word] += 1\n",
999 | " \n",
1000 | "word_count.most_common(10)"
1001 | ]
1002 | },
1003 | {
1004 | "cell_type": "code",
1005 | "execution_count": 24,
1006 | "metadata": {},
1007 | "outputs": [],
1008 | "source": [
1009 | "FREQUENT_WORDS = set(word for (word, wc) in word_count.most_common(3))\n",
1010 | "def remove_freq_words(text):\n",
1011 | " return \" \".join([word for word in text.split() if word not in FREQUENT_WORDS])"
1012 | ]
1013 | },
1014 | {
1015 | "cell_type": "code",
1016 | "execution_count": 25,
1017 | "metadata": {},
1018 | "outputs": [
1019 | {
1020 | "data": {
1074 | "text/plain": [
1075 | " tweet \\\n",
1076 | "0 @user when a father is dysfunctional and is s... \n",
1077 | "1 @user @user thanks for #lyft credit i can't us... \n",
1078 | "2 bihday your majesty \n",
1079 | "3 #model i love u take with u all the time in ... \n",
1080 | "4 factsguide: society now #motivation \n",
1081 | "\n",
1082 | " clean_text \n",
1083 | "0 father dysfunctional selfish drags kids dysfun... \n",
1084 | "1 thanks lyft credit cant use cause dont offer w... \n",
1085 | "2 bihday majesty \n",
1086 | "3 model u take u time urð± ðððð ð... \n",
1087 | "4 factsguide society motivation "
1088 | ]
1089 | },
1090 | "execution_count": 25,
1091 | "metadata": {},
1092 | "output_type": "execute_result"
1093 | }
1094 | ],
1095 | "source": [
1096 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_freq_words(x))\n",
1097 | "df.head()"
1098 | ]
1099 | },
1100 | {
1101 | "cell_type": "markdown",
1102 | "metadata": {},
1103 | "source": [
1104 | "## Removal of Rare Words"
1105 | ]
1106 | },
1107 | {
1108 | "cell_type": "code",
1109 | "execution_count": 30,
1110 | "metadata": {},
1111 | "outputs": [
1112 | {
1113 | "data": {
1114 | "text/plain": [
1115 | "{'airwaves',\n",
1116 | " 'carnt',\n",
1117 | " 'chisolm',\n",
1118 | " 'ibizabringitonmallorcaholidayssummer',\n",
1119 | " 'isz',\n",
1120 | " 'mantle',\n",
1121 | " 'shirley',\n",
1122 | " 'youuuð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dð\\x9f\\x98\\x8dâ\\x9d¤ï¸\\x8f',\n",
1123 | " 'ð\\x9f\\x99\\x8fð\\x9f\\x8f¼ð\\x9f\\x8d¹ð\\x9f\\x98\\x8eð\\x9f\\x8eµ'}"
1124 | ]
1125 | },
1126 | "execution_count": 30,
1127 | "metadata": {},
1128 | "output_type": "execute_result"
1129 | }
1130 | ],
1131 | "source": [
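    "# most_common()[:-10:-1] picks the 9 least frequent words (the reversed tail of the frequency list)\n",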
1132 | "RARE_WORDS = set(word for (word, wc) in word_count.most_common()[:-10:-1])\n",
1133 | "RARE_WORDS"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "code",
1138 | "execution_count": 31,
1139 | "metadata": {},
1140 | "outputs": [],
1141 | "source": [
1142 | "def remove_rare_words(text):\n",
1143 | " return \" \".join([word for word in text.split() if word not in RARE_WORDS])"
1144 | ]
1145 | },
1146 | {
1147 | "cell_type": "code",
1148 | "execution_count": 32,
1149 | "metadata": {},
1150 | "outputs": [
1151 | {
1152 | "data": {
1206 | "text/plain": [
1207 | " tweet \\\n",
1208 | "0 @user when a father is dysfunctional and is s... \n",
1209 | "1 @user @user thanks for #lyft credit i can't us... \n",
1210 | "2 bihday your majesty \n",
1211 | "3 #model i love u take with u all the time in ... \n",
1212 | "4 factsguide: society now #motivation \n",
1213 | "\n",
1214 | " clean_text \n",
1215 | "0 father dysfunctional selfish drags kids dysfun... \n",
1216 | "1 thanks lyft credit cant use cause dont offer w... \n",
1217 | "2 bihday majesty \n",
1218 | "3 model u take u time urð± ðððð ð... \n",
1219 | "4 factsguide society motivation "
1220 | ]
1221 | },
1222 | "execution_count": 32,
1223 | "metadata": {},
1224 | "output_type": "execute_result"
1225 | }
1226 | ],
1227 | "source": [
1228 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_rare_words(x))\n",
1229 | "df.head()"
1230 | ]
1231 | },
1232 | {
1233 | "cell_type": "markdown",
1234 | "metadata": {},
1235 | "source": [
1236 | "## Removal of Special characters"
1237 | ]
1238 | },
1239 | {
1240 | "cell_type": "code",
1241 | "execution_count": 33,
1242 | "metadata": {},
1243 | "outputs": [],
1244 | "source": [
1245 | "import re\n",
1246 | "def remove_spl_chars(text):\n",
1247 |     "    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)\n",
1248 |     "    text = re.sub(r'\s+', ' ', text)\n",
1249 | " return text"
1250 | ]
1251 | },
1252 | {
1253 | "cell_type": "code",
1254 | "execution_count": 34,
1255 | "metadata": {},
1256 | "outputs": [
1257 | {
1258 | "data": {
1312 | "text/plain": [
1313 | " tweet \\\n",
1314 | "0 @user when a father is dysfunctional and is s... \n",
1315 | "1 @user @user thanks for #lyft credit i can't us... \n",
1316 | "2 bihday your majesty \n",
1317 | "3 #model i love u take with u all the time in ... \n",
1318 | "4 factsguide: society now #motivation \n",
1319 | "\n",
1320 | " clean_text \n",
1321 | "0 father dysfunctional selfish drags kids dysfun... \n",
1322 | "1 thanks lyft credit cant use cause dont offer w... \n",
1323 | "2 bihday majesty \n",
1324 | "3 model u take u time ur \n",
1325 | "4 factsguide society motivation "
1326 | ]
1327 | },
1328 | "execution_count": 34,
1329 | "metadata": {},
1330 | "output_type": "execute_result"
1331 | }
1332 | ],
1333 | "source": [
1334 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_spl_chars(x))\n",
1335 | "df.head()"
1336 | ]
1337 | },
1338 | {
1339 | "cell_type": "markdown",
1340 | "metadata": {},
1341 | "source": [
1342 | "## Stemming"
1343 | ]
1344 | },
1345 | {
1346 | "cell_type": "code",
1347 | "execution_count": 35,
1348 | "metadata": {},
1349 | "outputs": [],
1350 | "source": [
1351 | "from nltk.stem.porter import PorterStemmer\n",
1352 | "ps = PorterStemmer()\n",
1353 | "def stem_words(text):\n",
1354 | " return \" \".join([ps.stem(word) for word in text.split()])"
1355 | ]
1356 | },
1357 | {
1358 | "cell_type": "code",
1359 | "execution_count": 36,
1360 | "metadata": {},
1361 | "outputs": [
1362 | {
1363 | "data": {
1423 | "text/plain": [
1424 | " tweet \\\n",
1425 | "0 @user when a father is dysfunctional and is s... \n",
1426 | "1 @user @user thanks for #lyft credit i can't us... \n",
1427 | "2 bihday your majesty \n",
1428 | "3 #model i love u take with u all the time in ... \n",
1429 | "4 factsguide: society now #motivation \n",
1430 | "\n",
1431 | " clean_text \\\n",
1432 | "0 father dysfunctional selfish drags kids dysfun... \n",
1433 | "1 thanks lyft credit cant use cause dont offer w... \n",
1434 | "2 bihday majesty \n",
1435 | "3 model u take u time ur \n",
1436 | "4 factsguide society motivation \n",
1437 | "\n",
1438 | " stemmed_text \n",
1439 | "0 father dysfunct selfish drag kid dysfunct run \n",
1440 | "1 thank lyft credit cant use caus dont offer whe... \n",
1441 | "2 bihday majesti \n",
1442 | "3 model u take u time ur \n",
1443 | "4 factsguid societi motiv "
1444 | ]
1445 | },
1446 | "execution_count": 36,
1447 | "metadata": {},
1448 | "output_type": "execute_result"
1449 | }
1450 | ],
1451 | "source": [
1452 | "df['stemmed_text'] = df['clean_text'].apply(lambda x: stem_words(x))\n",
1453 | "df.head()"
1454 | ]
1455 | },
1456 | {
1457 | "cell_type": "markdown",
1458 | "metadata": {},
1459 | "source": [
1460 | "## Lemmatization & POS Tagging"
1461 | ]
1462 | },
1463 | {
1464 | "cell_type": "code",
1465 | "execution_count": 41,
1466 | "metadata": {},
1467 | "outputs": [],
1468 | "source": [
1469 | "from nltk import pos_tag\n",
1470 | "from nltk.corpus import wordnet\n",
1471 | "from nltk.stem import WordNetLemmatizer\n",
1472 | "\n",
1473 | "lemmatizer = WordNetLemmatizer()\n",
1474 | "wordnet_map = {\"N\":wordnet.NOUN, \"V\": wordnet.VERB, \"J\": wordnet.ADJ, \"R\": wordnet.ADV}\n",
1475 | "\n",
1476 | "def lemmatize_words(text):\n",
1477 | " # find pos tags\n",
1478 | " pos_text = pos_tag(text.split())\n",
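    "    # map the first letter of each Penn Treebank tag (e.g. 'VBG' -> 'V') to a WordNet POS, defaulting to noun\n",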
1479 | " return \" \".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_text])"
1480 | ]
1481 | },
1482 | {
1483 | "cell_type": "code",
1484 | "execution_count": 42,
1485 | "metadata": {},
1486 | "outputs": [
1487 | {
1488 | "data": {
1489 | "text/plain": [
1490 | "'n'"
1491 | ]
1492 | },
1493 | "execution_count": 42,
1494 | "metadata": {},
1495 | "output_type": "execute_result"
1496 | }
1497 | ],
1498 | "source": [
1499 | "wordnet.NOUN"
1500 | ]
1501 | },
1502 | {
1503 | "cell_type": "code",
1504 | "execution_count": 43,
1505 | "metadata": {},
1506 | "outputs": [
1507 | {
1508 | "data": {
1574 | "text/plain": [
1575 | " tweet \\\n",
1576 | "0 @user when a father is dysfunctional and is s... \n",
1577 | "1 @user @user thanks for #lyft credit i can't us... \n",
1578 | "2 bihday your majesty \n",
1579 | "3 #model i love u take with u all the time in ... \n",
1580 | "4 factsguide: society now #motivation \n",
1581 | "\n",
1582 | " clean_text \\\n",
1583 | "0 father dysfunctional selfish drags kids dysfun... \n",
1584 | "1 thanks lyft credit cant use cause dont offer w... \n",
1585 | "2 bihday majesty \n",
1586 | "3 model u take u time ur \n",
1587 | "4 factsguide society motivation \n",
1588 | "\n",
1589 | " stemmed_text \\\n",
1590 | "0 father dysfunct selfish drag kid dysfunct run \n",
1591 | "1 thank lyft credit cant use caus dont offer whe... \n",
1592 | "2 bihday majesti \n",
1593 | "3 model u take u time ur \n",
1594 | "4 factsguid societi motiv \n",
1595 | "\n",
1596 | " lemmatized_text \n",
1597 | "0 father dysfunctional selfish drag kid dysfunct... \n",
1598 | "1 thanks lyft credit cant use cause dont offer w... \n",
1599 | "2 bihday majesty \n",
1600 | "3 model u take u time ur \n",
1601 | "4 factsguide society motivation "
1602 | ]
1603 | },
1604 | "execution_count": 43,
1605 | "metadata": {},
1606 | "output_type": "execute_result"
1607 | }
1608 | ],
1609 | "source": [
1610 | "df['lemmatized_text'] = df['clean_text'].apply(lambda x: lemmatize_words(x))\n",
1611 | "df.head()"
1612 | ]
1613 | },
1614 | {
1615 | "cell_type": "code",
1616 | "execution_count": 44,
1617 | "metadata": {},
1618 | "outputs": [
1619 | {
1620 | "data": {
1721 | "text/plain": [
1722 | " tweet \\\n",
1723 | "21468 @user for real now: we will be playing @user ... \n",
1724 | "9568 dear america. please don't let this influence ... \n",
1725 | "19804 finally... now on to other suppos~ #leagueof... \n",
1726 | "22323 @user @user @user @user feeling #worried. \n",
1727 | "20171 i am valued. #i_am #positive #affirmation \n",
1728 | "29669 fathers day selfie â¤ï¸ #grandad #selfie #... \n",
1729 | "4360 when 8th #graders say they're for high #school \n",
1730 | "15915 current mood ðð¦ #alone #anxiety #rain ... \n",
1731 | "92 yes! received my acceptance letter for my mast... \n",
1732 | "18745 @user @user this so made me smile \n",
1733 | "\n",
1734 | " clean_text \\\n",
1735 | "21468 real playing czech republic china championship... \n",
1736 | "9568 dear america please dont let influence vote tr... \n",
1737 | "19804 finally suppos leagueoflegends \n",
1738 | "22323 feeling worried \n",
1739 | "20171 valued iam positive affirmation \n",
1740 | "29669 fathers selfie grandad selfie fathersday bless... \n",
1741 | "4360 8th graders say theyre high school \n",
1742 | "15915 current mood alone anxiety rain thistooshallpass \n",
1743 | "92 yes received acceptance letter masters back oc... \n",
1744 | "18745 made smile \n",
1745 | "\n",
1746 | " stemmed_text \\\n",
1747 | "21468 real play czech republ china championship wugc... \n",
1748 | "9568 dear america pleas dont let influenc vote trum... \n",
1749 | "19804 final suppo leagueoflegend \n",
1750 | "22323 feel worri \n",
1751 | "20171 valu iam posit affirm \n",
1752 | "29669 father selfi grandad selfi fathersday bless su... \n",
1753 | "4360 8th grader say theyr high school \n",
1754 | "15915 current mood alon anxieti rain thistooshallpass \n",
1755 | "92 ye receiv accept letter master back octob good... \n",
1756 | "18745 made smile \n",
1757 | "\n",
1758 | " lemmatized_text \n",
1759 | "21468 real play czech republic china championship wu... \n",
1760 | "9568 dear america please dont let influence vote tr... \n",
1761 | "19804 finally suppos leagueoflegends \n",
1762 | "22323 feel worry \n",
1763 | "20171 value iam positive affirmation \n",
1764 | "29669 father selfie grandad selfie fathersday bless ... \n",
1765 | "4360 8th grader say theyre high school \n",
1766 | "15915 current mood alone anxiety rain thistooshallpass \n",
1767 | "92 yes receive acceptance letter master back octo... \n",
1768 | "18745 make smile "
1769 | ]
1770 | },
1771 | "execution_count": 44,
1772 | "metadata": {},
1773 | "output_type": "execute_result"
1774 | }
1775 | ],
1776 | "source": [
1777 | "df.sample(frac=1).head(10)"
1778 | ]
1779 | },
1780 | {
1781 | "cell_type": "markdown",
1782 | "metadata": {},
1783 | "source": [
1784 | "## Removal of URLs"
1785 | ]
1786 | },
1787 | {
1788 | "cell_type": "code",
1789 | "execution_count": 53,
1790 | "metadata": {},
1791 | "outputs": [],
1792 | "source": [
1793 | "text = \"https://www.hackersrealm.net is the URL of the channel Hackers Realm\""
1794 | ]
1795 | },
1796 | {
1797 | "cell_type": "code",
1798 | "execution_count": 54,
1799 | "metadata": {},
1800 | "outputs": [],
1801 | "source": [
1802 | "def remove_url(text):\n",
1803 | " return re.sub(r'https?://\\S+|www\\.\\S+', '', text)"
1804 | ]
1805 | },
1806 | {
1807 | "cell_type": "code",
1808 | "execution_count": 55,
1809 | "metadata": {},
1810 | "outputs": [
1811 | {
1812 | "data": {
1813 | "text/plain": [
1814 | "' is the URL of the channel Hackers Realm'"
1815 | ]
1816 | },
1817 | "execution_count": 55,
1818 | "metadata": {},
1819 | "output_type": "execute_result"
1820 | }
1821 | ],
1822 | "source": [
1823 | "remove_url(text)"
1824 | ]
1825 | },
1826 | {
1827 | "cell_type": "markdown",
1828 | "metadata": {},
1829 | "source": [
1830 | "## Removal of HTML Tags"
1831 | ]
1832 | },
1833 | {
1834 | "cell_type": "code",
1835 | "execution_count": 56,
1836 | "metadata": {},
1837 | "outputs": [],
1838 | "source": [
1839 |     "text = \"<div> <h1>Hackers Realm</h1> <p>This is NLP text preprocessing tutorial</p> </div>\""
1840 | ]
1841 | },
1842 | {
1843 | "cell_type": "code",
1844 | "execution_count": 57,
1845 | "metadata": {},
1846 | "outputs": [],
1847 | "source": [
1848 | "def remove_html_tags(text):\n",
1849 | " return re.sub(r'<.*?>', '', text)"
1850 | ]
1851 | },
1852 | {
1853 | "cell_type": "code",
1854 | "execution_count": 58,
1855 | "metadata": {},
1856 | "outputs": [
1857 | {
1858 | "data": {
1859 | "text/plain": [
1860 | "' Hackers Realm This is NLP text preprocessing tutorial '"
1861 | ]
1862 | },
1863 | "execution_count": 58,
1864 | "metadata": {},
1865 | "output_type": "execute_result"
1866 | }
1867 | ],
1868 | "source": [
1869 | "remove_html_tags(text)"
1870 | ]
1871 | },
1872 | {
1873 | "cell_type": "markdown",
1874 | "metadata": {},
1875 | "source": [
1876 | "## Spelling Correction"
1877 | ]
1878 | },
1879 | {
1880 | "cell_type": "code",
1881 | "execution_count": 64,
1882 | "metadata": {},
1883 | "outputs": [],
1884 | "source": [
1885 | "!pip install pyspellchecker"
1886 | ]
1887 | },
1888 | {
1889 | "cell_type": "code",
1890 | "execution_count": 7,
1891 | "metadata": {},
1892 | "outputs": [],
1893 | "source": [
1894 | "text = 'natur is a beuty'"
1895 | ]
1896 | },
1897 | {
1898 | "cell_type": "code",
1899 | "execution_count": 8,
1900 | "metadata": {},
1901 | "outputs": [],
1902 | "source": [
1903 | "from spellchecker import SpellChecker\n",
1904 | "spell = SpellChecker()\n",
1905 | "\n",
1906 | "def correct_spellings(text):\n",
1907 | " corrected_text = []\n",
1908 | " misspelled_text = spell.unknown(text.split())\n",
1909 | " # print(misspelled_text)\n",
1910 | " for word in text.split():\n",
1911 | " if word in misspelled_text:\n",
1912 | " corrected_text.append(spell.correction(word))\n",
1913 | " else:\n",
1914 | " corrected_text.append(word)\n",
1915 | " \n",
1916 | " return \" \".join(corrected_text)"
1917 | ]
1918 | },
1919 | {
1920 | "cell_type": "code",
1921 | "execution_count": 9,
1922 | "metadata": {},
1923 | "outputs": [
1924 | {
1925 | "data": {
1926 | "text/plain": [
1927 | "'nature is a beauty'"
1928 | ]
1929 | },
1930 | "execution_count": 9,
1931 | "metadata": {},
1932 | "output_type": "execute_result"
1933 | }
1934 | ],
1935 | "source": [
1936 | "correct_spellings(text)"
1937 | ]
1938 | },
1939 | {
1940 | "cell_type": "code",
1941 | "execution_count": null,
1942 | "metadata": {},
1943 | "outputs": [],
1944 | "source": []
1945 | },
1946 | {
1947 | "cell_type": "markdown",
1948 | "metadata": {},
1949 | "source": [
1950 | "# Feature Extraction from Text Data"
1951 | ]
1952 | },
1953 | {
1954 | "cell_type": "markdown",
1955 | "metadata": {},
1956 | "source": [
1957 | "## Bag of Words\n",
1958 | "\n",
1959 |     "A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words, and a measure of the presence of those known words."
1960 | ]
1961 | },
1962 | {
1963 | "cell_type": "code",
1964 | "execution_count": 5,
1965 | "metadata": {},
1966 | "outputs": [],
1967 | "source": [
1968 | "text_data = ['I am interested in NLP', 'This is a good tutorial with good topic', 'Feature extraction is very important topic']"
1969 | ]
1970 | },
1971 | {
1972 | "cell_type": "code",
1973 | "execution_count": 6,
1974 | "metadata": {},
1975 | "outputs": [],
1976 | "source": [
1977 | "from sklearn.feature_extraction.text import CountVectorizer\n",
1978 | "bow = CountVectorizer(stop_words='english')"
1979 | ]
1980 | },
1981 | {
1982 | "cell_type": "code",
1983 | "execution_count": 7,
1984 | "metadata": {},
1985 | "outputs": [
1986 | {
1987 | "data": {
1988 | "text/plain": [
1989 | "CountVectorizer(stop_words='english')"
1990 | ]
1991 | },
1992 | "execution_count": 7,
1993 | "metadata": {},
1994 | "output_type": "execute_result"
1995 | }
1996 | ],
1997 | "source": [
1998 | "# fit the data\n",
1999 | "bow.fit(text_data)"
2000 | ]
2001 | },
2002 | {
2003 | "cell_type": "code",
2004 | "execution_count": 8,
2005 | "metadata": {},
2006 | "outputs": [
2007 | {
2008 | "data": {
2009 | "text/plain": [
2010 | "['extraction',\n",
2011 | " 'feature',\n",
2012 | " 'good',\n",
2013 | " 'important',\n",
2014 | " 'interested',\n",
2015 | " 'nlp',\n",
2016 | " 'topic',\n",
2017 | " 'tutorial']"
2018 | ]
2019 | },
2020 | "execution_count": 8,
2021 | "metadata": {},
2022 | "output_type": "execute_result"
2023 | }
2024 | ],
2025 | "source": [
2026 | "# get the vocabulary list\n",
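    "# note: in scikit-learn >= 1.0 this is available as get_feature_names_out()\n",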
2027 | "bow.get_feature_names()"
2028 | ]
2029 | },
2030 | {
2031 | "cell_type": "code",
2032 | "execution_count": 9,
2033 | "metadata": {},
2034 | "outputs": [
2035 | {
2036 | "data": {
2037 | "text/plain": [
2038 |        "<3x8 sparse matrix of type '<class 'numpy.int64'>'\n",
2039 | "\twith 9 stored elements in Compressed Sparse Row format>"
2040 | ]
2041 | },
2042 | "execution_count": 9,
2043 | "metadata": {},
2044 | "output_type": "execute_result"
2045 | }
2046 | ],
2047 | "source": [
2048 | "bow_features = bow.transform(text_data)\n",
2049 | "bow_features"
2050 | ]
2051 | },
2052 | {
2053 | "cell_type": "code",
2054 | "execution_count": 10,
2055 | "metadata": {},
2056 | "outputs": [
2057 | {
2058 | "data": {
2059 | "text/plain": [
2060 | "array([[0, 0, 0, 0, 1, 1, 0, 0],\n",
2061 | " [0, 0, 2, 0, 0, 0, 1, 1],\n",
2062 | " [1, 1, 0, 1, 0, 0, 1, 0]], dtype=int64)"
2063 | ]
2064 | },
2065 | "execution_count": 10,
2066 | "metadata": {},
2067 | "output_type": "execute_result"
2068 | }
2069 | ],
2070 | "source": [
2071 | "bow_feature_array = bow_features.toarray()\n",
2072 | "bow_feature_array"
2073 | ]
2074 | },
2075 | {
2076 | "cell_type": "code",
2077 | "execution_count": 11,
2078 | "metadata": {},
2079 | "outputs": [
2080 | {
2081 | "name": "stdout",
2082 | "output_type": "stream",
2083 | "text": [
2084 | "['extraction', 'feature', 'good', 'important', 'interested', 'nlp', 'topic', 'tutorial']\n",
2085 | "I am interested in NLP\n",
2086 | "[0 0 0 0 1 1 0 0]\n",
2087 | "This is a good tutorial with good topic\n",
2088 | "[0 0 2 0 0 0 1 1]\n",
2089 | "Feature extraction is very important topic\n",
2090 | "[1 1 0 1 0 0 1 0]\n"
2091 | ]
2092 | }
2093 | ],
2094 | "source": [
2095 | "print(bow.get_feature_names())\n",
2096 | "for sentence, feature in zip(text_data, bow_feature_array):\n",
2097 | " print(sentence)\n",
2098 | " print(feature)"
2099 | ]
2100 | },
2101 | {
2102 | "cell_type": "code",
2103 | "execution_count": null,
2104 | "metadata": {},
2105 | "outputs": [],
2106 | "source": []
2107 | },
2108 | {
2109 | "cell_type": "markdown",
2110 | "metadata": {},
2111 | "source": [
2112 | "## TF-IDF (Term Frequency/Inverse Document Frequency)\n",
2113 | "\n",
2114 |     "TF-IDF stands for term frequency-inverse document frequency. It is a measure, used in the fields of information retrieval (IR) and machine learning, that quantifies the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document amongst a collection of documents."
2115 | ]
2116 | },
2117 | {
2118 | "cell_type": "code",
2119 | "execution_count": 12,
2120 | "metadata": {},
2121 | "outputs": [],
2122 | "source": [
2123 | "text_data = ['I am interested in NLP', 'This is a good tutorial with good topic', 'Feature extraction is very important topic']"
2124 | ]
2125 | },
2126 | {
2127 | "cell_type": "code",
2128 | "execution_count": 13,
2129 | "metadata": {},
2130 | "outputs": [],
2131 | "source": [
2132 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
2133 | "tfidf = TfidfVectorizer(stop_words='english')"
2134 | ]
2135 | },
2136 | {
2137 | "cell_type": "code",
2138 | "execution_count": 14,
2139 | "metadata": {},
2140 | "outputs": [
2141 | {
2142 | "data": {
2143 | "text/plain": [
2144 | "TfidfVectorizer(stop_words='english')"
2145 | ]
2146 | },
2147 | "execution_count": 14,
2148 | "metadata": {},
2149 | "output_type": "execute_result"
2150 | }
2151 | ],
2152 | "source": [
2153 | "# fit the data\n",
2154 | "tfidf.fit(text_data)"
2155 | ]
2156 | },
2157 | {
2158 | "cell_type": "code",
2159 | "execution_count": 15,
2160 | "metadata": {},
2161 | "outputs": [
2162 | {
2163 | "data": {
2164 | "text/plain": [
2165 | "{'interested': 4,\n",
2166 | " 'nlp': 5,\n",
2167 | " 'good': 2,\n",
2168 | " 'tutorial': 7,\n",
2169 | " 'topic': 6,\n",
2170 | " 'feature': 1,\n",
2171 | " 'extraction': 0,\n",
2172 | " 'important': 3}"
2173 | ]
2174 | },
2175 | "execution_count": 15,
2176 | "metadata": {},
2177 | "output_type": "execute_result"
2178 | }
2179 | ],
2180 | "source": [
2181 | "# get the vocabulary list\n",
2182 | "tfidf.vocabulary_"
2183 | ]
2184 | },
2185 | {
2186 | "cell_type": "code",
2187 | "execution_count": 16,
2188 | "metadata": {},
2189 | "outputs": [
2190 | {
2191 | "data": {
2192 | "text/plain": [
2193 |        "<3x8 sparse matrix of type '<class 'numpy.float64'>'\n",
2194 | "\twith 9 stored elements in Compressed Sparse Row format>"
2195 | ]
2196 | },
2197 | "execution_count": 16,
2198 | "metadata": {},
2199 | "output_type": "execute_result"
2200 | }
2201 | ],
2202 | "source": [
2203 | "tfidf_features = tfidf.transform(text_data)\n",
2204 | "tfidf_features"
2205 | ]
2206 | },
2207 | {
2208 | "cell_type": "code",
2209 | "execution_count": 17,
2210 | "metadata": {},
2211 | "outputs": [
2212 | {
2213 | "data": {
2214 | "text/plain": [
2215 | "array([[0. , 0. , 0. , 0. , 0.70710678,\n",
2216 | " 0.70710678, 0. , 0. ],\n",
2217 | " [0. , 0. , 0.84678897, 0. , 0. ,\n",
2218 | " 0. , 0.32200242, 0.42339448],\n",
2219 | " [0.52863461, 0.52863461, 0. , 0.52863461, 0. ,\n",
2220 | " 0. , 0.40204024, 0. ]])"
2221 | ]
2222 | },
2223 | "execution_count": 17,
2224 | "metadata": {},
2225 | "output_type": "execute_result"
2226 | }
2227 | ],
2228 | "source": [
2229 | "tfidf_feature_array = tfidf_features.toarray()\n",
2230 | "tfidf_feature_array"
2231 | ]
2232 | },
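  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check (not part of the original notebook): with scikit-learn's defaults (smooth_idf=True, norm='l2'), each entry is the raw term count times ln((1 + n) / (1 + df)) + 1, and each row is then L2-normalised. The sketch below reproduces the 0.7071 values of the first row, where both terms occur in exactly one of the three documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "n_docs = 3\n",
    "doc_freq = 1  # 'interested' and 'nlp' each appear in only one document\n",
    "idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1\n",
    "weights = np.array([idf, idf])  # raw counts are 1 for both terms\n",
    "weights / np.linalg.norm(weights)  # ~ [0.7071, 0.7071], matching the first row above"
   ]
  },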
2233 | {
2234 | "cell_type": "code",
2235 | "execution_count": 19,
2236 | "metadata": {},
2237 | "outputs": [
2238 | {
2239 | "name": "stdout",
2240 | "output_type": "stream",
2241 | "text": [
2242 | "I am interested in NLP\n",
2243 | " (0, 5)\t0.7071067811865476\n",
2244 | " (0, 4)\t0.7071067811865476\n",
2245 | "This is a good tutorial with good topic\n",
2246 | " (0, 7)\t0.42339448341195934\n",
2247 | " (0, 6)\t0.3220024178194947\n",
2248 | " (0, 2)\t0.8467889668239187\n",
2249 | "Feature extraction is very important topic\n",
2250 | " (0, 6)\t0.4020402441612698\n",
2251 | " (0, 3)\t0.5286346066596935\n",
2252 | " (0, 1)\t0.5286346066596935\n",
2253 | " (0, 0)\t0.5286346066596935\n"
2254 | ]
2255 | }
2256 | ],
2257 | "source": [
2258 | "for sentence, feature in zip(text_data, tfidf_features):\n",
2259 | " print(sentence)\n",
2260 | " print(feature)"
2261 | ]
2262 | },
2263 | {
2264 | "cell_type": "code",
2265 | "execution_count": null,
2266 | "metadata": {},
2267 | "outputs": [],
2268 | "source": []
2269 | },
2270 | {
2271 | "cell_type": "markdown",
2272 | "metadata": {},
2273 | "source": [
2274 | "## Word2vec\n",
2275 | "\n",
2276 | "The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence."
2277 | ]
2278 | },
2279 | {
2280 | "cell_type": "code",
2281 | "execution_count": 20,
2282 | "metadata": {},
2283 | "outputs": [],
2284 | "source": [
2285 | "from gensim.test.utils import common_texts\n",
2286 | "from gensim.models import Word2Vec"
2287 | ]
2288 | },
2289 | {
2290 | "cell_type": "code",
2291 | "execution_count": 21,
2292 | "metadata": {},
2293 | "outputs": [
2294 | {
2295 | "data": {
2296 | "text/plain": [
2297 | "[['human', 'interface', 'computer'],\n",
2298 | " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
2299 | " ['eps', 'user', 'interface', 'system'],\n",
2300 | " ['system', 'human', 'system', 'eps'],\n",
2301 | " ['user', 'response', 'time'],\n",
2302 | " ['trees'],\n",
2303 | " ['graph', 'trees'],\n",
2304 | " ['graph', 'minors', 'trees'],\n",
2305 | " ['graph', 'minors', 'survey']]"
2306 | ]
2307 | },
2308 | "execution_count": 21,
2309 | "metadata": {},
2310 | "output_type": "execute_result"
2311 | }
2312 | ],
2313 | "source": [
2314 | "# text data\n",
2315 | "common_texts"
2316 | ]
2317 | },
2318 | {
2319 | "cell_type": "code",
2320 | "execution_count": 23,
2321 | "metadata": {},
2322 | "outputs": [],
2323 | "source": [
2324 | "# initialize and fit the data\n",
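    "# note: in gensim >= 4.0 the 'size' parameter is named 'vector_size'\n",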
2325 | "model = Word2Vec(common_texts, size=100, min_count=1)"
2326 | ]
2327 | },
2328 | {
2329 | "cell_type": "code",
2330 | "execution_count": 25,
2331 | "metadata": {},
2332 | "outputs": [
2333 | {
2334 | "data": {
2335 | "text/plain": [
2336 | "array([-0.00042112, 0.00126945, -0.00348724, 0.00373327, 0.00387501,\n",
2337 | " -0.00306736, -0.00138952, -0.00139083, 0.00334137, 0.00413064,\n",
2338 | " 0.00045129, -0.00390373, -0.00159695, -0.00369461, -0.00036086,\n",
2339 | " 0.00444261, -0.00391653, 0.00447466, -0.00032617, 0.00056412,\n",
2340 | " -0.00017338, -0.00464378, 0.00039338, -0.00353649, 0.0040346 ,\n",
2341 | " 0.00179682, -0.00186994, -0.00121431, -0.00370716, 0.00039535,\n",
2342 | " -0.00117291, 0.00498948, -0.00243317, 0.00480749, -0.00128626,\n",
2343 | " -0.0018426 , -0.00086148, -0.00347201, -0.0025697 , -0.00409948,\n",
2344 | " 0.00433477, -0.00424404, 0.00389087, 0.0024296 , 0.0009781 ,\n",
2345 | " -0.00267652, -0.00039598, 0.00188174, -0.00141169, 0.00143257,\n",
2346 | " 0.00363962, -0.00445332, 0.00499313, -0.00013036, 0.00411159,\n",
2347 | " 0.00307077, -0.00048517, 0.00491026, -0.00315512, -0.00091287,\n",
2348 | " 0.00465486, 0.00034458, 0.00097905, 0.00187424, -0.00452135,\n",
2349 | " -0.00365111, 0.00260027, 0.00464861, -0.00243504, -0.00425601,\n",
2350 | " -0.00265299, -0.00108813, 0.00284521, -0.00437486, -0.0015496 ,\n",
2351 | " -0.00054869, 0.00228153, 0.00360572, 0.00255484, -0.00357945,\n",
2352 | " -0.00235164, 0.00220505, -0.0016885 , 0.00294839, -0.00337972,\n",
2353 | " 0.00291201, 0.00250298, 0.00447992, -0.00129002, 0.0025 ,\n",
2354 | " -0.00430755, -0.00419162, -0.00029911, 0.00166961, 0.00417119,\n",
2355 | " -0.00209666, 0.00452041, 0.00010931, -0.00115822, -0.00154263],\n",
2356 | " dtype=float32)"
2357 | ]
2358 | },
2359 | "execution_count": 25,
2360 | "metadata": {},
2361 | "output_type": "execute_result"
2362 | }
2363 | ],
2364 | "source": [
2365 | "model.wv['graph']"
2366 | ]
2367 | },
2368 | {
2369 | "cell_type": "code",
2370 | "execution_count": 26,
2371 | "metadata": {},
2372 | "outputs": [
2373 | {
2374 | "data": {
2375 | "text/plain": [
2376 | "[('interface', 0.1710839718580246),\n",
2377 | " ('user', 0.08987751603126526),\n",
2378 | " ('trees', 0.07364125549793243),\n",
2379 | " ('minors', 0.045832667499780655),\n",
2380 | " ('computer', 0.025292515754699707),\n",
2381 | " ('system', 0.012846874073147774),\n",
2382 | " ('human', -0.03873271495103836),\n",
2383 | " ('survey', -0.06853737682104111),\n",
2384 | " ('time', -0.07515352964401245),\n",
2385 | " ('eps', -0.07798048853874207)]"
2386 | ]
2387 | },
2388 | "execution_count": 26,
2389 | "metadata": {},
2390 | "output_type": "execute_result"
2391 | }
2392 | ],
2393 | "source": [
2394 | "model.wv.most_similar('graph')"
2395 | ]
2396 | },
2397 | {
2398 | "cell_type": "code",
2399 | "execution_count": null,
2400 | "metadata": {},
2401 | "outputs": [],
2402 | "source": []
2403 | },
2404 | {
2405 | "cell_type": "markdown",
2406 | "metadata": {},
2407 | "source": [
2408 |     "## Word Embedding using GloVe\n",
2409 | "\n",
2410 |     "GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.\n",
2411 | "\n",
2412 | "Download link: https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt"
2413 | ]
2414 | },
2415 | {
2416 | "cell_type": "code",
2417 | "execution_count": 28,
2418 | "metadata": {},
2419 | "outputs": [
2420 | {
2421 | "data": {
2475 | "text/plain": [
2476 | " tweet \\\n",
2477 | "0 @user when a father is dysfunctional and is s... \n",
2478 | "1 @user @user thanks for #lyft credit i can't us... \n",
2479 | "2 bihday your majesty \n",
2480 | "3 #model i love u take with u all the time in ... \n",
2481 | "4 factsguide: society now #motivation \n",
2482 | "\n",
2483 | " clean_text \n",
2484 | "0 user father dysfunctional selfish drags kids ... \n",
2485 | "1 user user thanks lyft credit can t use cause ... \n",
2486 | "2 bihday majesty \n",
2487 | "3 model love u take u time ur \n",
2488 | "4 factsguide society motivation "
2489 | ]
2490 | },
2491 | "execution_count": 28,
2492 | "metadata": {},
2493 | "output_type": "execute_result"
2494 | }
2495 | ],
2496 | "source": [
2497 | "import pandas as pd\n",
2498 | "import string\n",
2499 | "from nltk.corpus import stopwords\n",
2500 | "df = pd.read_csv('data/Twitter Sentiments.csv')\n",
2501 | "# drop the columns\n",
2502 | "df = df.drop(columns=['id', 'label'], axis=1)\n",
2503 | "\n",
2504 | "df['clean_text'] = df['tweet'].str.lower()\n",
2505 | "\n",
2506 | "STOPWORDS = set(stopwords.words('english'))\n",
2507 | "def remove_stopwords(text):\n",
2508 | " return \" \".join([word for word in text.split() if word not in STOPWORDS])\n",
2509 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))\n",
2510 | "\n",
2511 | "import re\n",
2512 | "def remove_spl_chars(text):\n",
2513 |     "    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)\n",
2514 |     "    text = re.sub(r'\s+', ' ', text)\n",
2515 | " return text\n",
2516 | "df['clean_text'] = df['clean_text'].apply(lambda x: remove_spl_chars(x))\n",
2517 | "\n",
2518 | "df.head()"
2519 | ]
2520 | },
2521 | {
2522 | "cell_type": "code",
2523 | "execution_count": 34,
2524 | "metadata": {},
2525 | "outputs": [],
2526 | "source": [
2527 | "from keras.preprocessing.text import Tokenizer\n",
2528 | "from keras.preprocessing.sequence import pad_sequences\n",
2529 | "import numpy as np"
2530 | ]
2531 | },
2532 | {
2533 | "cell_type": "code",
2534 | "execution_count": 30,
2535 | "metadata": {},
2536 | "outputs": [
2537 | {
2538 | "data": {
2539 | "text/plain": [
2540 | "39085"
2541 | ]
2542 | },
2543 | "execution_count": 30,
2544 | "metadata": {},
2545 | "output_type": "execute_result"
2546 | }
2547 | ],
2548 | "source": [
2549 | "# tokenize text\n",
2550 | "tokenizer = Tokenizer()\n",
2551 | "tokenizer.fit_on_texts(df['clean_text'])\n",
2552 | "\n",
2553 | "word_index = tokenizer.word_index\n",
2554 | "vocab_size = len(word_index)\n",
2555 | "vocab_size"
2556 | ]
2557 | },
2558 | {
2559 | "cell_type": "code",
2560 | "execution_count": 40,
2561 | "metadata": {},
2562 | "outputs": [],
2563 | "source": [
2564 | "# word_index"
2565 | ]
2566 | },
2567 | {
2568 | "cell_type": "code",
2569 | "execution_count": 31,
2570 | "metadata": {},
2571 | "outputs": [
2572 | {
2573 | "data": {
2574 | "text/plain": [
2575 | "131"
2576 | ]
2577 | },
2578 | "execution_count": 31,
2579 | "metadata": {},
2580 | "output_type": "execute_result"
2581 | }
2582 | ],
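    "# len(data) counts characters, not words, so 131 is a safe upper bound for the token maxlen used below\n",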
2583 | "source": [
2584 | "max(len(data) for data in df['clean_text'])"
2585 | ]
2586 | },
2587 | {
2588 | "cell_type": "code",
2589 | "execution_count": 32,
2590 | "metadata": {},
2591 | "outputs": [],
2592 | "source": [
2593 | "# padding text data\n",
2594 | "sequences = tokenizer.texts_to_sequences(df['clean_text'])\n",
2595 | "padded_seq = pad_sequences(sequences, maxlen=131, padding='post', truncating='post')"
2596 | ]
2597 | },
2598 | {
2599 | "cell_type": "code",
2600 | "execution_count": 33,
2601 | "metadata": {},
2602 | "outputs": [
2603 | {
2604 | "data": {
2605 | "text/plain": [
2606 | "array([ 1, 28, 15330, 2630, 6365, 184, 7786, 385, 0,\n",
2607 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2608 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2609 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2610 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2611 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2612 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2613 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2614 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2615 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2616 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2617 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2618 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2619 | " 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
2620 | " 0, 0, 0, 0, 0])"
2621 | ]
2622 | },
2623 | "execution_count": 33,
2624 | "metadata": {},
2625 | "output_type": "execute_result"
2626 | }
2627 | ],
2628 | "source": [
2629 | "padded_seq[0]"
2630 | ]
2631 | },
2632 | {
2633 | "cell_type": "code",
2634 | "execution_count": 35,
2635 | "metadata": {},
2636 | "outputs": [],
2637 | "source": [
2638 | "# create embedding index\n",
2639 | "embedding_index = {}\n",
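    "# each line of glove.6B.100d.txt is a word followed by its 100-dimensional vector\n",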
2640 | "with open('glove.6B.100d.txt', encoding='utf-8') as f:\n",
2641 | " for line in f:\n",
2642 | " values = line.split()\n",
2643 | " word = values[0]\n",
2644 | " coefs = np.asarray(values[1:], dtype='float32')\n",
2645 | " embedding_index[word] = coefs"
2646 | ]
2647 | },
2648 | {
2649 | "cell_type": "code",
2650 | "execution_count": 36,
2651 | "metadata": {},
2652 | "outputs": [
2653 | {
2654 | "data": {
2655 | "text/plain": [
2656 | "array([-0.030769 , 0.11993 , 0.53909 , -0.43696 , -0.73937 ,\n",
2657 | " -0.15345 , 0.081126 , -0.38559 , -0.68797 , -0.41632 ,\n",
2658 | " -0.13183 , -0.24922 , 0.441 , 0.085919 , 0.20871 ,\n",
2659 | " -0.063582 , 0.062228 , -0.051234 , -0.13398 , 1.1418 ,\n",
2660 | " 0.036526 , 0.49029 , -0.24567 , -0.412 , 0.12349 ,\n",
2661 | " 0.41336 , -0.48397 , -0.54243 , -0.27787 , -0.26015 ,\n",
2662 | " -0.38485 , 0.78656 , 0.1023 , -0.20712 , 0.40751 ,\n",
2663 | " 0.32026 , -0.51052 , 0.48362 , -0.0099498, -0.38685 ,\n",
2664 | " 0.034975 , -0.167 , 0.4237 , -0.54164 , -0.30323 ,\n",
2665 | " -0.36983 , 0.082836 , -0.52538 , -0.064531 , -1.398 ,\n",
2666 | " -0.14873 , -0.35327 , -0.1118 , 1.0912 , 0.095864 ,\n",
2667 | " -2.8129 , 0.45238 , 0.46213 , 1.6012 , -0.20837 ,\n",
2668 | " -0.27377 , 0.71197 , -1.0754 , -0.046974 , 0.67479 ,\n",
2669 | " -0.065839 , 0.75824 , 0.39405 , 0.15507 , -0.64719 ,\n",
2670 | " 0.32796 , -0.031748 , 0.52899 , -0.43886 , 0.67405 ,\n",
2671 | " 0.42136 , -0.11981 , -0.21777 , -0.29756 , -0.1351 ,\n",
2672 | " 0.59898 , 0.46529 , -0.58258 , -0.02323 , -1.5442 ,\n",
2673 | " 0.01901 , -0.015877 , 0.024499 , -0.58017 , -0.67659 ,\n",
2674 | " -0.040379 , -0.44043 , 0.083292 , 0.20035 , -0.75499 ,\n",
2675 | " 0.16918 , -0.26573 , -0.52878 , 0.17584 , 1.065 ],\n",
2676 | " dtype=float32)"
2677 | ]
2678 | },
2679 | "execution_count": 36,
2680 | "metadata": {},
2681 | "output_type": "execute_result"
2682 | }
2683 | ],
2684 | "source": [
2685 | "embedding_index['good']"
2686 | ]
2687 | },
2688 | {
2689 | "cell_type": "code",
2690 | "execution_count": 41,
2691 | "metadata": {},
2692 | "outputs": [],
2693 | "source": [
2694 | "# create embedding matrix\n",
2695 | "embedding_matrix = np.zeros((vocab_size+1, 100))\n",
2696 | "for word, i in word_index.items():\n",
2697 | " embedding_vector = embedding_index.get(word)\n",
2698 | " if embedding_vector is not None:\n",
2699 | " embedding_matrix[i] = embedding_vector"
2700 | ]
2701 | },
2702 | {
2703 | "cell_type": "code",
2704 | "execution_count": 42,
2705 | "metadata": {},
2706 | "outputs": [
2707 | {
2708 | "data": {
2709 | "text/plain": [
2710 | "(39086, 100)"
2711 | ]
2712 | },
2713 | "execution_count": 42,
2714 | "metadata": {},
2715 | "output_type": "execute_result"
2716 | }
2717 | ],
2718 | "source": [
2719 | "embedding_matrix.shape"
2720 | ]
2721 | },
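2722 | {
2723 | "cell_type": "markdown",
2724 | "metadata": {},
2725 | "source": [
2726 | "The embedding matrix is typically used to initialize a non-trainable `Embedding` layer, so each token index is looked up as its GloVe vector. A minimal sketch, assuming the same Keras 2.x style API as the imports above (argument names differ in Keras 3):"
2727 | ]
2728 | },
2729 | {
2730 | "cell_type": "code",
2731 | "execution_count": null,
2732 | "metadata": {},
2733 | "outputs": [],
2734 | "source": [
2735 | "# minimal sketch: plug the GloVe matrix into an Embedding layer (Keras 2.x style API assumed)\n",
2736 | "from keras.models import Sequential\n",
2737 | "from keras.layers import Embedding\n",
2738 | "\n",
2739 | "model = Sequential()\n",
2740 | "model.add(Embedding(input_dim=vocab_size+1,      # +1 because word_index starts at 1\n",
2741 | "                    output_dim=100,              # GloVe 100d vectors\n",
2742 | "                    weights=[embedding_matrix],  # initialize with the pretrained matrix\n",
2743 | "                    input_length=131,            # same maxlen used for padding\n",
2744 | "                    trainable=False))            # keep the pretrained vectors frozen\n",
2745 | "model.summary()"
2746 | ]
2747 | },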
2722 | {
2723 | "cell_type": "code",
2724 | "execution_count": null,
2725 | "metadata": {},
2726 | "outputs": [],
2727 | "source": []
2728 | },
2729 | {
2730 | "cell_type": "markdown",
2731 | "metadata": {},
2732 | "source": [
2733 | "# Named Entity Recognition"
2734 | ]
2735 | },
2736 | {
2737 | "cell_type": "code",
2738 | "execution_count": null,
2739 | "metadata": {
2740 | "id": "PCnOhgdCcked"
2741 | },
2742 | "outputs": [],
2743 | "source": [
2744 | "# !pip install -U pip setuptools wheel\n",
2745 | "# !pip install -U spacy\n",
2746 | "# !python -m spacy download en_core_web_sm"
2747 | ]
2748 | },
2749 | {
2750 | "cell_type": "code",
2751 | "execution_count": null,
2752 | "metadata": {
2753 | "id": "9LfSlz4Ye9SD"
2754 | },
2755 | "outputs": [],
2756 | "source": [
2757 | "import spacy\n",
2758 | "from spacy import displacy"
2759 | ]
2760 | },
2761 | {
2762 | "cell_type": "code",
2763 | "execution_count": null,
2764 | "metadata": {
2765 | "id": "nD0z2jmtfGfk"
2766 | },
2767 | "outputs": [],
2768 | "source": [
2769 | "NER = spacy.load('en_core_web_sm')"
2770 | ]
2771 | },
2772 | {
2773 | "cell_type": "code",
2774 | "execution_count": null,
2775 | "metadata": {
2776 | "id": "Q5OJPYA2fyiK"
2777 | },
2778 | "outputs": [],
2779 | "source": [
2780 | "text = 'Mark Zuckerberg is one of the founders of Facebook, a company from the United States'"
2781 | ]
2782 | },
2783 | {
2784 | "cell_type": "code",
2785 | "execution_count": null,
2786 | "metadata": {
2787 | "id": "pYE8UjWsgf6S"
2788 | },
2789 | "outputs": [],
2790 | "source": [
2791 | "ner_text = NER(text)"
2792 | ]
2793 | },
2794 | {
2795 | "cell_type": "code",
2796 | "execution_count": null,
2797 | "metadata": {
2798 | "colab": {
2799 | "base_uri": "https://localhost:8080/"
2800 | },
2801 | "id": "kpWcdBlWgf38",
2802 | "outputId": "b8cac9aa-02e2-42a3-dc01-fc3970b7c7e5"
2803 | },
2804 | "outputs": [
2805 | {
2806 | "name": "stdout",
2807 | "output_type": "stream",
2808 | "text": [
2809 | "Mark Zuckerberg PERSON\n",
2810 | "one CARDINAL\n",
2811 | "Facebook ORG\n",
2812 | "the United States GPE\n"
2813 | ]
2814 | }
2815 | ],
2816 | "source": [
2817 | "for word in ner_text.ents:\n",
2818 | " print(word.text, word.label_)"
2819 | ]
2820 | },
2821 | {
2822 | "cell_type": "code",
2823 | "execution_count": null,
2824 | "metadata": {
2825 | "colab": {
2826 | "base_uri": "https://localhost:8080/",
2827 | "height": 35
2828 | },
2829 | "id": "iFBCYIvDgrL6",
2830 | "outputId": "5e0291e3-8c6d-4082-f3c0-473ad0bdac43"
2831 | },
2832 | "outputs": [
2833 | {
2834 | "data": {
2835 | "application/vnd.google.colaboratory.intrinsic+json": {
2836 | "type": "string"
2837 | },
2838 | "text/plain": [
2839 | "'Countries, cities, states'"
2840 | ]
2841 | },
2842 | "execution_count": 13,
2843 | "metadata": {},
2844 | "output_type": "execute_result"
2845 | }
2846 | ],
2847 | "source": [
2848 | "spacy.explain('GPE')"
2849 | ]
2850 | },
2851 | {
2852 | "cell_type": "code",
2853 | "execution_count": null,
2854 | "metadata": {
2855 | "colab": {
2856 | "base_uri": "https://localhost:8080/",
2857 | "height": 35
2858 | },
2859 | "id": "vkzMb7Bwg1Fi",
2860 | "outputId": "4b6c9ed6-1270-4b9f-a35a-28e24122d7d3"
2861 | },
2862 | "outputs": [
2863 | {
2864 | "data": {
2865 | "application/vnd.google.colaboratory.intrinsic+json": {
2866 | "type": "string"
2867 | },
2868 | "text/plain": [
2869 | "'Numerals that do not fall under another type'"
2870 | ]
2871 | },
2872 | "execution_count": 14,
2873 | "metadata": {},
2874 | "output_type": "execute_result"
2875 | }
2876 | ],
2877 | "source": [
2878 | "spacy.explain('CARDINAL')"
2879 | ]
2880 | },
2881 | {
2882 | "cell_type": "code",
2883 | "execution_count": null,
2884 | "metadata": {
2885 | "colab": {
2886 | "base_uri": "https://localhost:8080/",
2887 | "height": 52
2888 | },
2889 | "id": "LBPSsLT5g9nS",
2890 | "outputId": "69a16d27-bf86-4b7f-e5cd-b3e56adeb149"
2891 | },
2892 | "outputs": [
2893 | {
2894 | "data": {
2895 | "text/html": [
2896 | "\n",
2897 | "\n",
2898 | " Mark Zuckerberg\n",
2899 | " PERSON\n",
2900 | "\n",
2901 | " is \n",
2902 | "\n",
2903 | " one\n",
2904 | " CARDINAL\n",
2905 | "\n",
2906 | " of the founders of \n",
2907 | "\n",
2908 | " Facebook\n",
2909 | " ORG\n",
2910 | "\n",
2911 | ", a company from \n",
2912 | "\n",
2913 | " the United States\n",
2914 | " GPE\n",
2915 | "\n",
2916 | "
"
2917 | ],
2918 | "text/plain": [
2919 | ""
2920 | ]
2921 | },
2922 | "metadata": {},
2923 | "output_type": "display_data"
2924 | }
2925 | ],
2926 | "source": [
2927 | "displacy.render(ner_text, style='ent', jupyter=True)"
2928 | ]
2929 | },
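2930 | {
2931 | "cell_type": "markdown",
2932 | "metadata": {},
2933 | "source": [
2934 | "Beyond printing, the entity spans can be collected into a structured form; character offsets are available on each span as `start_char` and `end_char`. A small illustrative sketch using the `ner_text` doc from above:"
2935 | ]
2936 | },
2937 | {
2938 | "cell_type": "code",
2939 | "execution_count": null,
2940 | "metadata": {},
2941 | "outputs": [],
2942 | "source": [
2943 | "# collect each entity with its label and character offsets\n",
2944 | "entities = [{'text': ent.text, 'label': ent.label_, 'start': ent.start_char, 'end': ent.end_char}\n",
2945 | "            for ent in ner_text.ents]\n",
2946 | "entities"
2947 | ]
2948 | },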
2930 | {
2931 | "cell_type": "code",
2932 | "execution_count": null,
2933 | "metadata": {},
2934 | "outputs": [],
2935 | "source": []
2936 | },
2937 | {
2938 | "cell_type": "markdown",
2939 | "metadata": {},
2940 | "source": [
2941 | "# Data Augmentation for Text"
2942 | ]
2943 | },
2944 | {
2945 | "cell_type": "code",
2946 | "execution_count": null,
2947 | "metadata": {},
2948 | "outputs": [],
2949 | "source": [
2950 | "# uses\n",
2951 | "# 1. increase the dataset size by creating more samples\n",
2952 | "# 2. reduce overfitting\n",
2953 | "# 3. improve model generalization\n",
2954 | "# 4. handling imbalance dataset"
2955 | ]
2956 | },
2957 | {
2958 | "cell_type": "code",
2959 | "execution_count": null,
2960 | "metadata": {},
2961 | "outputs": [],
2962 | "source": [
2963 | "!pip install nlpaug\n",
2964 | "!pip install sacremoses"
2965 | ]
2966 | },
2967 | {
2968 | "cell_type": "code",
2969 | "execution_count": 2,
2970 | "metadata": {},
2971 | "outputs": [],
2972 | "source": [
2973 | "import nlpaug.augmenter.word as naw"
2974 | ]
2975 | },
2976 | {
2977 | "cell_type": "code",
2978 | "execution_count": 3,
2979 | "metadata": {},
2980 | "outputs": [],
2981 | "source": [
2982 | "text = 'The quick brown fox jumps over a lazy dog'"
2983 | ]
2984 | },
2985 | {
2986 | "cell_type": "markdown",
2987 | "metadata": {},
2988 | "source": [
2989 | "### Synonym Replacement"
2990 | ]
2991 | },
2992 | {
2993 | "cell_type": "code",
2994 | "execution_count": 10,
2995 | "metadata": {},
2996 | "outputs": [
2997 | {
2998 | "name": "stdout",
2999 | "output_type": "stream",
3000 | "text": [
3001 | "Synonym Text: ['The flying brownness fox jumps over a lazy andiron']\n"
3002 | ]
3003 | }
3004 | ],
3005 | "source": [
3006 | "syn_aug = naw.synonym.SynonymAug(aug_src='wordnet')\n",
3007 | "synonym_text = syn_aug.augment(text)\n",
3008 | "print('Synonym Text:', synonym_text)"
3009 | ]
3010 | },
3011 | {
3012 | "cell_type": "markdown",
3013 | "metadata": {},
3014 | "source": [
3015 | "### Random Substitution"
3016 | ]
3017 | },
3018 | {
3019 | "cell_type": "code",
3020 | "execution_count": 11,
3021 | "metadata": {},
3022 | "outputs": [
3023 | {
3024 | "name": "stdout",
3025 | "output_type": "stream",
3026 | "text": [
3027 | "Substituted Text: ['_ _ brown fox jumps _ a lazy dog']\n"
3028 | ]
3029 | }
3030 | ],
3031 | "source": [
3032 | "sub_aug = naw.random.RandomWordAug(action='substitute')\n",
3033 | "substituted_text = sub_aug.augment(text)\n",
3034 | "print('Substituted Text:', substituted_text)"
3035 | ]
3036 | },
3037 | {
3038 | "cell_type": "markdown",
3039 | "metadata": {},
3040 | "source": [
3041 | "### Random Deletion"
3042 | ]
3043 | },
3044 | {
3045 | "cell_type": "code",
3046 | "execution_count": 12,
3047 | "metadata": {},
3048 | "outputs": [
3049 | {
3050 | "name": "stdout",
3051 | "output_type": "stream",
3052 | "text": [
3053 | "Deletion Text: ['Quick brown jumps over a lazy dog']\n"
3054 | ]
3055 | }
3056 | ],
3057 | "source": [
3058 | "del_aug = naw.random.RandomWordAug(action='delete')\n",
3059 | "deletion_text = del_aug.augment(text)\n",
3060 | "print('Deletion Text:', deletion_text)"
3061 | ]
3062 | },
3063 | {
3064 | "cell_type": "markdown",
3065 | "metadata": {},
3066 | "source": [
3067 | "### Random Swap"
3068 | ]
3069 | },
3070 | {
3071 | "cell_type": "code",
3072 | "execution_count": 13,
3073 | "metadata": {},
3074 | "outputs": [
3075 | {
3076 | "name": "stdout",
3077 | "output_type": "stream",
3078 | "text": [
3079 | "Swap Text: ['The quick brown jumps fox a lazy over dog']\n"
3080 | ]
3081 | }
3082 | ],
3083 | "source": [
3084 | "swap_aug = naw.random.RandomWordAug(action='swap')\n",
3085 | "swap_text = swap_aug.augment(text)\n",
3086 | "print('Swap Text:', swap_text)"
3087 | ]
3088 | },
3089 | {
3090 | "cell_type": "markdown",
3091 | "metadata": {},
3092 | "source": [
3093 | "### Back Translation"
3094 | ]
3095 | },
3096 | {
3097 | "cell_type": "code",
3098 | "execution_count": 15,
3099 | "metadata": {},
3100 | "outputs": [
3101 | {
3102 | "name": "stdout",
3103 | "output_type": "stream",
3104 | "text": [
3105 | "Back Translated Text: ['The speedy brown fox jumps over a lazy dog']\n"
3106 | ]
3107 | }
3108 | ],
3109 | "source": [
3110 | "# translate original text to other language (german) and convert back to english language\n",
3111 | "back_trans_aug = naw.back_translation.BackTranslationAug()\n",
3112 | "back_trans_text = back_trans_aug.augment(text)\n",
3113 | "print('Back Translated Text:', back_trans_text)"
3114 | ]
3115 | },
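3116 | {
3117 | "cell_type": "markdown",
3118 | "metadata": {},
3119 | "source": [
3120 | "The translation pair used for the round trip can also be set explicitly rather than relying on the defaults. The model names below are an assumption for illustration (a commonly used WMT19 English-German pair on Hugging Face), not something fixed by this notebook:"
3121 | ]
3122 | },
3123 | {
3124 | "cell_type": "code",
3125 | "execution_count": null,
3126 | "metadata": {},
3127 | "outputs": [],
3128 | "source": [
3129 | "# pick the translation models for the en -> de -> en round trip explicitly\n",
3130 | "# (model names are assumptions; any compatible Hugging Face translation pair should work)\n",
3131 | "back_trans_aug = naw.back_translation.BackTranslationAug(\n",
3132 | "    from_model_name='facebook/wmt19-en-de',\n",
3133 | "    to_model_name='facebook/wmt19-de-en')\n",
3134 | "print('Back Translated Text:', back_trans_aug.augment(text))"
3135 | ]
3136 | },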
3116 | {
3117 | "cell_type": "code",
3118 | "execution_count": null,
3119 | "metadata": {},
3120 | "outputs": [],
3121 | "source": []
3122 | },
3123 | {
3124 | "cell_type": "code",
3125 | "execution_count": null,
3126 | "metadata": {},
3127 | "outputs": [],
3128 | "source": []
3129 | }
3130 | ],
3131 | "metadata": {
3132 | "kernelspec": {
3133 | "display_name": "Python 3 (ipykernel)",
3134 | "language": "python",
3135 | "name": "python3"
3136 | },
3137 | "language_info": {
3138 | "codemirror_mode": {
3139 | "name": "ipython",
3140 | "version": 3
3141 | },
3142 | "file_extension": ".py",
3143 | "mimetype": "text/x-python",
3144 | "name": "python",
3145 | "nbconvert_exporter": "python",
3146 | "pygments_lexer": "ipython3",
3147 | "version": "3.11.5"
3148 | }
3149 | },
3150 | "nbformat": 4,
3151 | "nbformat_minor": 4
3152 | }
3153 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Science Concepts
2 |
3 | This repository contains machine learning, deep learning, and NLP concepts, with examples in Python.
4 |
5 | 💻 Machine learning concepts playlist: http://bit.ly/mlconcepts
6 |
7 | ✍🏼 Natural Language Processing (NLP) concepts playlist: http://bit.ly/nlpconcepts
8 |
9 |
10 | ## Machine Learning
11 |
12 | 1. Normalize data using Max Absolute & Min Max Scaling - https://youtu.be/wSgWf-lUdDU
13 | 2. Standardize data using Z-Score/Standard Scaler - https://youtu.be/AmCkjGPmdvI
14 | 3. Detect and Remove Outliers in the Data - https://youtu.be/Cw2IvmWRcXs
15 | 4. Label Encoding for Categorical Attributes - https://youtu.be/YuzLkF7Ymf4
16 | 5. One Hot Encoding for Categorical Attributes - https://youtu.be/LqMHkc_F1WA
17 | 6. Target/Mean Encoding for Categorical Attributes - https://youtu.be/nd7vc4MZQz4
18 | 7. Frequency Encoding & Binary Encoding - https://youtu.be/2oCfBpnWQws
19 | 8. Extract Features from Datetime Attribute - https://youtu.be/PbyHFUVuqn8
20 | 9. How to Fill Missing Values in Dataset - https://youtu.be/FEQpdgoH_pM
21 | 10. Feature Selection using Correlation Matrix (Numerical) - https://youtu.be/1fFVt4tQjRE
22 | 11. Feature Selection using Chi Square (Category) - https://youtu.be/6N9H9KxdZdk
23 | 12. Feature Selection using Recursive Feature Elimination (RFE) - https://youtu.be/vxdVKbAv6as
24 | 13. Repeated Stratified KFold Cross Validation - https://youtu.be/cChWbibT-JI
25 | 14. How to handle Imbalanced Classes in Dataset - https://youtu.be/rVuUqpyPwEs
26 | 15. Ensemble Techniques to improve Model Performance - https://youtu.be/qPN-S5Ltbm4
27 | 16. Dimensionality Reduction using PCA vs LDA vs t-SNE vs UMAP - https://youtu.be/gk7ntPrxy-k
28 | 17. Handle Large Data using pandas - https://youtu.be/bd_1T2JCr4M
29 |
30 | ## Natural Language Processing
31 |
32 | 1. Tokenization - https://youtu.be/ivCcY8JCxeY
33 | 2. Stemming | Extract Root Words - https://youtu.be/O-SaH_dnb9A
34 | 3. Lemmatization - https://youtu.be/uvKKEkYZcdw
35 | 4. Part of Speech Tagging (POS) - https://youtu.be/n6j-T3_F9dI
36 | 5. Text Preprocessing in NLP - https://youtu.be/Br5dmsa49wo
37 | 6. Bag of Words (BOW) - https://youtu.be/dSce20oYIPY
38 | 7. Term Frequency - Inverse Document Frequency (TF-IDF) - https://youtu.be/O9aAwvk6SNI
39 | 8. Word2Vec - https://youtu.be/4DoJcQblpGQ
40 | 9. Word Embedding | GloVe - https://youtu.be/6uxVtUMtqtk
41 |
42 | ## Deep Learning
43 |
44 | 1.
--------------------------------------------------------------------------------