├── 1. NLP Basics ├── 1.1. what is NLP.ipynb ├── 1.2. reading in text data & why do we need cleaning.ipynb ├── 1.3. How to explore a dataset.ipynb ├── 1.4. learning how to use regular expressions.ipynb ├── 1.5. implementing a pipeline to clean text.ipynb ├── SMSSpamCollection.tsv └── SMSSpamCollection_cleaned.tsv ├── 2. Data Cleaning ├── 2.1. stemming.ipynb ├── 2.2. lemmatizing.ipynb └── SMSSpamCollection.tsv ├── 3. Vectorizing Raw Data ├── 3.1. count vectoriztion.ipynb ├── 3.2. N_grams.ipynb ├── 3.3. TF-IDF.ipynb └── SMSSpamCollection.tsv ├── 4. Feature Engineering ├── 4.1. Feature Creation.ipynb ├── 4.2. Transformation.ipynb └── SMSSpamCollection.tsv ├── 5. Building Machine Learning Classifiers ├── 5.1. Building a basic Random Forest Model.ipynb ├── 5.2. Random Forest on a holdout test set.ipynb ├── 5.3. Explore Random Forest Model with Grid-Search.ipynb ├── 5.4. Evaluate Random Forest with GridSearchCV.ipynb ├── 5.5. Explore Gradient Boosting model with Grid-Search.ipynb ├── 5.6. Evaluate Gradient Boosting with GridSearchCV.ipynb ├── 5.7. Model Selection.ipynb ├── SMSSpamCollection.tsv └── empty ├── LICENSE ├── README.md ├── page.html └── test output ├── empty ├── giphy.gif ├── output_1.png └── output_2.png /1. NLP Basics/1.1. what is NLP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: What is Natural Language Processing & the Natural Language Toolkit?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### How to install NLTK on your local machine\n", 15 | "\n", 16 | "Both sets of instructions below assume you already have Python installed. These instructions are taken directly from [http://www.nltk.org/install.html](http://www.nltk.org/install.html).\n", 17 | "\n", 18 | "**Mac/Unix**\n", 19 | "\n", 20 | "From the terminal:\n", 21 | "1. Install NLTK: run `pip install -U nltk`\n", 22 | "2. Test installation: run `python` then type `import nltk`\n", 23 | "\n", 24 | "**Windows**\n", 25 | "\n", 26 | "1. Install NLTK: [http://pypi.python.org/pypi/nltk](http://pypi.python.org/pypi/nltk)\n", 27 | "2. 
Test installation: `Start>Python35`, then type `import nltk`" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "### Download NLTK data" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 9, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "name": "stdout", 44 | "output_type": "stream", 45 | "text": [ 46 | "showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml\n" 47 | ] 48 | }, 49 | { 50 | "data": { 51 | "text/plain": [ 52 | "True" 53 | ] 54 | }, 55 | "execution_count": 9, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "import nltk\n", 62 | "nltk.download()" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 10, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "data": { 72 | "text/plain": [ 73 | "['AbstractLazySequence',\n", 74 | " 'AffixTagger',\n", 75 | " 'AlignedSent',\n", 76 | " 'Alignment',\n", 77 | " 'AnnotationTask',\n", 78 | " 'ApplicationExpression',\n", 79 | " 'Assignment',\n", 80 | " 'BigramAssocMeasures',\n", 81 | " 'BigramCollocationFinder',\n", 82 | " 'BigramTagger',\n", 83 | " 'BinaryMaxentFeatureEncoding',\n", 84 | " 'BlanklineTokenizer',\n", 85 | " 'BllipParser',\n", 86 | " 'BottomUpChartParser',\n", 87 | " 'BottomUpLeftCornerChartParser',\n", 88 | " 'BottomUpProbabilisticChartParser',\n", 89 | " 'Boxer',\n", 90 | " 'BrillTagger',\n", 91 | " 'BrillTaggerTrainer',\n", 92 | " 'CFG',\n", 93 | " 'CRFTagger',\n", 94 | " 'CfgReadingCommand',\n", 95 | " 'ChartParser',\n", 96 | " 'ChunkParserI',\n", 97 | " 'ChunkScore',\n", 98 | " 'ClassifierBasedPOSTagger',\n", 99 | " 'ClassifierBasedTagger',\n", 100 | " 'ClassifierI',\n", 101 | " 'ConcordanceIndex',\n", 102 | " 'ConditionalExponentialClassifier',\n", 103 | " 'ConditionalFreqDist',\n", 104 | " 'ConditionalProbDist',\n", 105 | " 'ConditionalProbDistI',\n", 106 | " 'ConfusionMatrix',\n", 107 | " 'ContextIndex',\n", 108 | " 'ContextTagger',\n", 109 | " 'ContingencyMeasures',\n", 110 | " 'CoreNLPDependencyParser',\n", 111 | " 'CoreNLPParser',\n", 112 | " 'Counter',\n", 113 | " 'CrossValidationProbDist',\n", 114 | " 'DRS',\n", 115 | " 'DecisionTreeClassifier',\n", 116 | " 'DefaultTagger',\n", 117 | " 'DependencyEvaluator',\n", 118 | " 'DependencyGrammar',\n", 119 | " 'DependencyGraph',\n", 120 | " 'DependencyProduction',\n", 121 | " 'DictionaryConditionalProbDist',\n", 122 | " 'DictionaryProbDist',\n", 123 | " 'DiscourseTester',\n", 124 | " 'DrtExpression',\n", 125 | " 'DrtGlueReadingCommand',\n", 126 | " 'ELEProbDist',\n", 127 | " 'EarleyChartParser',\n", 128 | " 'Expression',\n", 129 | " 'FStructure',\n", 130 | " 'FeatDict',\n", 131 | " 'FeatList',\n", 132 | " 'FeatStruct',\n", 133 | " 'FeatStructReader',\n", 134 | " 'Feature',\n", 135 | " 'FeatureBottomUpChartParser',\n", 136 | " 'FeatureBottomUpLeftCornerChartParser',\n", 137 | " 'FeatureChartParser',\n", 138 | " 'FeatureEarleyChartParser',\n", 139 | " 'FeatureIncrementalBottomUpChartParser',\n", 140 | " 'FeatureIncrementalBottomUpLeftCornerChartParser',\n", 141 | " 'FeatureIncrementalChartParser',\n", 142 | " 'FeatureIncrementalTopDownChartParser',\n", 143 | " 'FeatureTopDownChartParser',\n", 144 | " 'FreqDist',\n", 145 | " 'HTTPPasswordMgrWithDefaultRealm',\n", 146 | " 'HeldoutProbDist',\n", 147 | " 'HiddenMarkovModelTagger',\n", 148 | " 'HiddenMarkovModelTrainer',\n", 149 | " 'HunposTagger',\n", 150 | " 'IBMModel',\n", 151 | " 'IBMModel1',\n", 152 | " 'IBMModel2',\n", 153 | " 'IBMModel3',\n", 154 | " 
'IBMModel4',\n", 155 | " 'IBMModel5',\n", 156 | " 'ISRIStemmer',\n", 157 | " 'ImmutableMultiParentedTree',\n", 158 | " 'ImmutableParentedTree',\n", 159 | " 'ImmutableProbabilisticMixIn',\n", 160 | " 'ImmutableProbabilisticTree',\n", 161 | " 'ImmutableTree',\n", 162 | " 'IncrementalBottomUpChartParser',\n", 163 | " 'IncrementalBottomUpLeftCornerChartParser',\n", 164 | " 'IncrementalChartParser',\n", 165 | " 'IncrementalLeftCornerChartParser',\n", 166 | " 'IncrementalTopDownChartParser',\n", 167 | " 'Index',\n", 168 | " 'InsideChartParser',\n", 169 | " 'JSONTaggedDecoder',\n", 170 | " 'JSONTaggedEncoder',\n", 171 | " 'KneserNeyProbDist',\n", 172 | " 'LancasterStemmer',\n", 173 | " 'LaplaceProbDist',\n", 174 | " 'LazyConcatenation',\n", 175 | " 'LazyEnumerate',\n", 176 | " 'LazyIteratorList',\n", 177 | " 'LazyMap',\n", 178 | " 'LazySubsequence',\n", 179 | " 'LazyZip',\n", 180 | " 'LeftCornerChartParser',\n", 181 | " 'LidstoneProbDist',\n", 182 | " 'LineTokenizer',\n", 183 | " 'LogicalExpressionException',\n", 184 | " 'LongestChartParser',\n", 185 | " 'MLEProbDist',\n", 186 | " 'MWETokenizer',\n", 187 | " 'Mace',\n", 188 | " 'MaceCommand',\n", 189 | " 'MaltParser',\n", 190 | " 'MaxentClassifier',\n", 191 | " 'Model',\n", 192 | " 'MultiClassifierI',\n", 193 | " 'MultiParentedTree',\n", 194 | " 'MutableProbDist',\n", 195 | " 'NaiveBayesClassifier',\n", 196 | " 'NaiveBayesDependencyScorer',\n", 197 | " 'NgramAssocMeasures',\n", 198 | " 'NgramTagger',\n", 199 | " 'NonprojectiveDependencyParser',\n", 200 | " 'Nonterminal',\n", 201 | " 'OrderedDict',\n", 202 | " 'PCFG',\n", 203 | " 'Paice',\n", 204 | " 'ParallelProverBuilder',\n", 205 | " 'ParallelProverBuilderCommand',\n", 206 | " 'ParentedTree',\n", 207 | " 'ParserI',\n", 208 | " 'PerceptronTagger',\n", 209 | " 'PhraseTable',\n", 210 | " 'PorterStemmer',\n", 211 | " 'PositiveNaiveBayesClassifier',\n", 212 | " 'ProbDistI',\n", 213 | " 'ProbabilisticDependencyGrammar',\n", 214 | " 'ProbabilisticMixIn',\n", 215 | " 'ProbabilisticNonprojectiveParser',\n", 216 | " 'ProbabilisticProduction',\n", 217 | " 'ProbabilisticProjectiveDependencyParser',\n", 218 | " 'ProbabilisticTree',\n", 219 | " 'Production',\n", 220 | " 'ProjectiveDependencyParser',\n", 221 | " 'Prover9',\n", 222 | " 'Prover9Command',\n", 223 | " 'ProxyBasicAuthHandler',\n", 224 | " 'ProxyDigestAuthHandler',\n", 225 | " 'ProxyHandler',\n", 226 | " 'PunktSentenceTokenizer',\n", 227 | " 'QuadgramCollocationFinder',\n", 228 | " 'RSLPStemmer',\n", 229 | " 'RTEFeatureExtractor',\n", 230 | " 'RUS_PICKLE',\n", 231 | " 'RandomChartParser',\n", 232 | " 'RangeFeature',\n", 233 | " 'ReadingCommand',\n", 234 | " 'RecursiveDescentParser',\n", 235 | " 'RegexpChunkParser',\n", 236 | " 'RegexpParser',\n", 237 | " 'RegexpStemmer',\n", 238 | " 'RegexpTagger',\n", 239 | " 'RegexpTokenizer',\n", 240 | " 'ReppTokenizer',\n", 241 | " 'ResolutionProver',\n", 242 | " 'ResolutionProverCommand',\n", 243 | " 'SExprTokenizer',\n", 244 | " 'SLASH',\n", 245 | " 'Senna',\n", 246 | " 'SennaChunkTagger',\n", 247 | " 'SennaNERTagger',\n", 248 | " 'SennaTagger',\n", 249 | " 'SequentialBackoffTagger',\n", 250 | " 'ShiftReduceParser',\n", 251 | " 'SimpleGoodTuringProbDist',\n", 252 | " 'SklearnClassifier',\n", 253 | " 'SlashFeature',\n", 254 | " 'SnowballStemmer',\n", 255 | " 'SpaceTokenizer',\n", 256 | " 'StackDecoder',\n", 257 | " 'StanfordNERTagger',\n", 258 | " 'StanfordPOSTagger',\n", 259 | " 'StanfordSegmenter',\n", 260 | " 'StanfordTagger',\n", 261 | " 'StemmerI',\n", 262 | " 'SteppingChartParser',\n", 263 | " 
'SteppingRecursiveDescentParser',\n", 264 | " 'SteppingShiftReduceParser',\n", 265 | " 'TYPE',\n", 266 | " 'TabTokenizer',\n", 267 | " 'TableauProver',\n", 268 | " 'TableauProverCommand',\n", 269 | " 'TaggerI',\n", 270 | " 'TestGrammar',\n", 271 | " 'Text',\n", 272 | " 'TextCat',\n", 273 | " 'TextCollection',\n", 274 | " 'TextTilingTokenizer',\n", 275 | " 'TnT',\n", 276 | " 'TokenSearcher',\n", 277 | " 'ToktokTokenizer',\n", 278 | " 'TopDownChartParser',\n", 279 | " 'TransitionParser',\n", 280 | " 'Tree',\n", 281 | " 'TreebankWordTokenizer',\n", 282 | " 'Trie',\n", 283 | " 'TrigramAssocMeasures',\n", 284 | " 'TrigramCollocationFinder',\n", 285 | " 'TrigramTagger',\n", 286 | " 'TweetTokenizer',\n", 287 | " 'TypedMaxentFeatureEncoding',\n", 288 | " 'Undefined',\n", 289 | " 'UniformProbDist',\n", 290 | " 'UnigramTagger',\n", 291 | " 'UnsortedChartParser',\n", 292 | " 'Valuation',\n", 293 | " 'Variable',\n", 294 | " 'ViterbiParser',\n", 295 | " 'WekaClassifier',\n", 296 | " 'WhitespaceTokenizer',\n", 297 | " 'WittenBellProbDist',\n", 298 | " 'WordNetLemmatizer',\n", 299 | " 'WordPunctTokenizer',\n", 300 | " '__author__',\n", 301 | " '__author_email__',\n", 302 | " '__builtins__',\n", 303 | " '__cached__',\n", 304 | " '__classifiers__',\n", 305 | " '__copyright__',\n", 306 | " '__doc__',\n", 307 | " '__file__',\n", 308 | " '__keywords__',\n", 309 | " '__license__',\n", 310 | " '__loader__',\n", 311 | " '__longdescr__',\n", 312 | " '__maintainer__',\n", 313 | " '__maintainer_email__',\n", 314 | " '__name__',\n", 315 | " '__package__',\n", 316 | " '__path__',\n", 317 | " '__spec__',\n", 318 | " '__url__',\n", 319 | " '__version__',\n", 320 | " 'absolute_import',\n", 321 | " 'accuracy',\n", 322 | " 'add_logs',\n", 323 | " 'agreement',\n", 324 | " 'align',\n", 325 | " 'alignment_error_rate',\n", 326 | " 'aline',\n", 327 | " 'api',\n", 328 | " 'app',\n", 329 | " 'apply_features',\n", 330 | " 'approxrand',\n", 331 | " 'arity',\n", 332 | " 'association',\n", 333 | " 'bigrams',\n", 334 | " 'binary_distance',\n", 335 | " 'binary_search_file',\n", 336 | " 'binding_ops',\n", 337 | " 'bisect',\n", 338 | " 'blankline_tokenize',\n", 339 | " 'bleu',\n", 340 | " 'bleu_score',\n", 341 | " 'bllip',\n", 342 | " 'boolean_ops',\n", 343 | " 'boxer',\n", 344 | " 'bracket_parse',\n", 345 | " 'breadth_first',\n", 346 | " 'brill',\n", 347 | " 'brill_trainer',\n", 348 | " 'build_opener',\n", 349 | " 'call_megam',\n", 350 | " 'casual',\n", 351 | " 'casual_tokenize',\n", 352 | " 'ccg',\n", 353 | " 'chain',\n", 354 | " 'chart',\n", 355 | " 'chat',\n", 356 | " 'choose',\n", 357 | " 'chunk',\n", 358 | " 'class_types',\n", 359 | " 'classify',\n", 360 | " 'clause',\n", 361 | " 'clean_html',\n", 362 | " 'clean_url',\n", 363 | " 'cluster',\n", 364 | " 'collections',\n", 365 | " 'collocations',\n", 366 | " 'combinations',\n", 367 | " 'compat',\n", 368 | " 'config_java',\n", 369 | " 'config_megam',\n", 370 | " 'config_weka',\n", 371 | " 'conflicts',\n", 372 | " 'confusionmatrix',\n", 373 | " 'conllstr2tree',\n", 374 | " 'conlltags2tree',\n", 375 | " 'corenlp',\n", 376 | " 'corpus',\n", 377 | " 'crf',\n", 378 | " 'custom_distance',\n", 379 | " 'data',\n", 380 | " 'decisiontree',\n", 381 | " 'decorator',\n", 382 | " 'decorators',\n", 383 | " 'defaultdict',\n", 384 | " 'demo',\n", 385 | " 'dependencygraph',\n", 386 | " 'deque',\n", 387 | " 'discourse',\n", 388 | " 'distance',\n", 389 | " 'download',\n", 390 | " 'download_gui',\n", 391 | " 'download_shell',\n", 392 | " 'downloader',\n", 393 | " 'draw',\n", 394 | " 'drt',\n", 395 | " 
'earleychart',\n", 396 | " 'edit_distance',\n", 397 | " 'elementtree_indent',\n", 398 | " 'entropy',\n", 399 | " 'equality_preds',\n", 400 | " 'evaluate',\n", 401 | " 'evaluate_sents',\n", 402 | " 'everygrams',\n", 403 | " 'extract_rels',\n", 404 | " 'extract_test_sentences',\n", 405 | " 'f_measure',\n", 406 | " 'featstruct',\n", 407 | " 'featurechart',\n", 408 | " 'filestring',\n", 409 | " 'find',\n", 410 | " 'flatten',\n", 411 | " 'fractional_presence',\n", 412 | " 'getproxies',\n", 413 | " 'ghd',\n", 414 | " 'glue',\n", 415 | " 'grammar',\n", 416 | " 'guess_encoding',\n", 417 | " 'help',\n", 418 | " 'hmm',\n", 419 | " 'hunpos',\n", 420 | " 'ibm1',\n", 421 | " 'ibm2',\n", 422 | " 'ibm3',\n", 423 | " 'ibm4',\n", 424 | " 'ibm5',\n", 425 | " 'ibm_model',\n", 426 | " 'ieerstr2tree',\n", 427 | " 'improved_close_quote_regex',\n", 428 | " 'improved_open_quote_regex',\n", 429 | " 'improved_punct_regex',\n", 430 | " 'in_idle',\n", 431 | " 'induce_pcfg',\n", 432 | " 'inference',\n", 433 | " 'infile',\n", 434 | " 'inspect',\n", 435 | " 'install_opener',\n", 436 | " 'internals',\n", 437 | " 'interpret_sents',\n", 438 | " 'interval_distance',\n", 439 | " 'invert_dict',\n", 440 | " 'invert_graph',\n", 441 | " 'is_rel',\n", 442 | " 'islice',\n", 443 | " 'isri',\n", 444 | " 'jaccard_distance',\n", 445 | " 'json_tags',\n", 446 | " 'jsontags',\n", 447 | " 'lancaster',\n", 448 | " 'lazyimport',\n", 449 | " 'lfg',\n", 450 | " 'line_tokenize',\n", 451 | " 'linearlogic',\n", 452 | " 'load',\n", 453 | " 'load_parser',\n", 454 | " 'locale',\n", 455 | " 'log_likelihood',\n", 456 | " 'logic',\n", 457 | " 'mace',\n", 458 | " 'malt',\n", 459 | " 'map_tag',\n", 460 | " 'mapping',\n", 461 | " 'masi_distance',\n", 462 | " 'maxent',\n", 463 | " 'megam',\n", 464 | " 'memoize',\n", 465 | " 'metrics',\n", 466 | " 'misc',\n", 467 | " 'mwe',\n", 468 | " 'naivebayes',\n", 469 | " 'ne_chunk',\n", 470 | " 'ne_chunk_sents',\n", 471 | " 'ngrams',\n", 472 | " 'nonprojectivedependencyparser',\n", 473 | " 'nonterminals',\n", 474 | " 'numpy',\n", 475 | " 'os',\n", 476 | " 'pad_sequence',\n", 477 | " 'paice',\n", 478 | " 'parse',\n", 479 | " 'parse_sents',\n", 480 | " 'pchart',\n", 481 | " 'perceptron',\n", 482 | " 'pk',\n", 483 | " 'porter',\n", 484 | " 'pos_tag',\n", 485 | " 'pos_tag_sents',\n", 486 | " 'positivenaivebayes',\n", 487 | " 'pprint',\n", 488 | " 'pr',\n", 489 | " 'precision',\n", 490 | " 'presence',\n", 491 | " 'print_function',\n", 492 | " 'print_string',\n", 493 | " 'probability',\n", 494 | " 'projectivedependencyparser',\n", 495 | " 'prover9',\n", 496 | " 'punkt',\n", 497 | " 'py25',\n", 498 | " 'py26',\n", 499 | " 'py27',\n", 500 | " 'pydoc',\n", 501 | " 'python_2_unicode_compatible',\n", 502 | " 'raise_unorderable_types',\n", 503 | " 'ranks_from_scores',\n", 504 | " 'ranks_from_sequence',\n", 505 | " 're',\n", 506 | " 're_show',\n", 507 | " 'read_grammar',\n", 508 | " 'read_logic',\n", 509 | " 'read_valuation',\n", 510 | " 'recall',\n", 511 | " 'recursivedescent',\n", 512 | " 'regexp',\n", 513 | " 'regexp_span_tokenize',\n", 514 | " 'regexp_tokenize',\n", 515 | " 'register_tag',\n", 516 | " 'relextract',\n", 517 | " 'repp',\n", 518 | " 'resolution',\n", 519 | " 'ribes',\n", 520 | " 'ribes_score',\n", 521 | " 'root_semrep',\n", 522 | " 'rslp',\n", 523 | " 'rte_classifier',\n", 524 | " 'rte_classify',\n", 525 | " 'rte_features',\n", 526 | " 'rtuple',\n", 527 | " 'scikitlearn',\n", 528 | " 'scores',\n", 529 | " 'segmentation',\n", 530 | " 'sem',\n", 531 | " 'senna',\n", 532 | " 'sent_tokenize',\n", 533 | " 
'sequential',\n", 534 | " 'set2rel',\n", 535 | " 'set_proxy',\n", 536 | " 'sexpr',\n", 537 | " 'sexpr_tokenize',\n", 538 | " 'shiftreduce',\n", 539 | " 'simple',\n", 540 | " 'sinica_parse',\n", 541 | " 'skipgrams',\n", 542 | " 'skolemize',\n", 543 | " 'slice_bounds',\n", 544 | " 'snowball',\n", 545 | " 'spearman',\n", 546 | " 'spearman_correlation',\n", 547 | " 'stack_decoder',\n", 548 | " 'stanford',\n", 549 | " 'stanford_segmenter',\n", 550 | " 'stem',\n", 551 | " 'str2tuple',\n", 552 | " 'string_span_tokenize',\n", 553 | " 'string_types',\n", 554 | " 'subprocess',\n", 555 | " 'subsumes',\n", 556 | " 'sum_logs',\n", 557 | " 'sys',\n", 558 | " 'tableau',\n", 559 | " 'tadm',\n", 560 | " 'tag',\n", 561 | " 'tagset_mapping',\n", 562 | " 'tagstr2tree',\n", 563 | " 'tbl',\n", 564 | " 'text',\n", 565 | " 'text_type',\n", 566 | " 'textcat',\n", 567 | " 'texttiling',\n", 568 | " 'textwrap',\n", 569 | " 'tkinter',\n", 570 | " 'tnt',\n", 571 | " 'tokenize',\n", 572 | " 'tokenwrap',\n", 573 | " 'toktok',\n", 574 | " 'toolbox',\n", 575 | " 'total_ordering',\n", 576 | " 'transitionparser',\n", 577 | " 'transitive_closure',\n", 578 | " 'translate',\n", 579 | " 'tree',\n", 580 | " 'tree2conllstr',\n", 581 | " 'tree2conlltags',\n", 582 | " 'treebank',\n", 583 | " 'treetransforms',\n", 584 | " 'trigrams',\n", 585 | " 'tuple2str',\n", 586 | " 'types',\n", 587 | " 'unify',\n", 588 | " 'unique_list',\n", 589 | " 'untag',\n", 590 | " 'usage',\n", 591 | " 'util',\n", 592 | " 'version_file',\n", 593 | " 'version_info',\n", 594 | " 'viterbi',\n", 595 | " 'weka',\n", 596 | " 'windowdiff',\n", 597 | " 'word_tokenize',\n", 598 | " 'wordnet',\n", 599 | " 'wordpunct_tokenize',\n", 600 | " 'wsd']" 601 | ] 602 | }, 603 | "execution_count": 10, 604 | "metadata": {}, 605 | "output_type": "execute_result" 606 | } 607 | ], 608 | "source": [ 609 | "dir(nltk)" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "### What can you do with NLTK?" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 12, 622 | "metadata": {}, 623 | "outputs": [ 624 | { 625 | "data": { 626 | "text/plain": [ 627 | "['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']" 628 | ] 629 | }, 630 | "execution_count": 12, 631 | "metadata": {}, 632 | "output_type": "execute_result" 633 | } 634 | ], 635 | "source": [ 636 | "from nltk.corpus import stopwords\n", 637 | "\n", 638 | "stopwords.words('english')[0:500:25]" 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": null, 644 | "metadata": { 645 | "collapsed": true, 646 | "jupyter": { 647 | "outputs_hidden": true 648 | } 649 | }, 650 | "outputs": [], 651 | "source": [] 652 | } 653 | ], 654 | "metadata": { 655 | "kernelspec": { 656 | "display_name": "Python 3 (ipykernel)", 657 | "language": "python", 658 | "name": "python3" 659 | }, 660 | "language_info": { 661 | "codemirror_mode": { 662 | "name": "ipython", 663 | "version": 3 664 | }, 665 | "file_extension": ".py", 666 | "mimetype": "text/x-python", 667 | "name": "python", 668 | "nbconvert_exporter": "python", 669 | "pygments_lexer": "ipython3", 670 | "version": "3.11.0" 671 | } 672 | }, 673 | "nbformat": 4, 674 | "nbformat_minor": 4 675 | } 676 | -------------------------------------------------------------------------------- /1. NLP Basics/1.2. 
reading in text data & why do we need cleaning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Reading in text data & why do we need to clean the text?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in semi-structured text data" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/plain": [ 25 | "\"ham\\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\\nspam\\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\\nham\\tNah I don't think he goes to usf, he lives around here though\\nham\\tEven my brother is not like to speak with me. They treat me like aid\"" 26 | ] 27 | }, 28 | "execution_count": 2, 29 | "metadata": {}, 30 | "output_type": "execute_result" 31 | } 32 | ], 33 | "source": [ 34 | "# Read in the raw text\n", 35 | "rawData = open(\"SMSSpamCollection.tsv\").read()\n", 36 | "\n", 37 | "# Print the raw data\n", 38 | "rawData[0:500]" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": { 45 | "tags": [] 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "parsedData = rawData.replace('\\t', '\\n').split('\\n')" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 4, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['ham',\n", 61 | " \"I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\",\n", 62 | " 'spam',\n", 63 | " \"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\",\n", 64 | " 'ham']" 65 | ] 66 | }, 67 | "execution_count": 4, 68 | "metadata": {}, 69 | "output_type": "execute_result" 70 | } 71 | ], 72 | "source": [ 73 | "parsedData[0:5]" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 5, 79 | "metadata": { 80 | "tags": [] 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "labelList = parsedData[0::2]\n", 85 | "textList = parsedData[1::2]" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 6, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "['ham', 'spam', 'ham', 'ham', 'ham']\n", 98 | "[\"I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\", \"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\", \"Nah I don't think he goes to usf, he lives around here though\", 'Even my brother is not like to speak with me. 
They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "print(labelList[0:5])\n", 104 | "print(textList[0:5])" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 12, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "ename": "ValueError", 114 | "evalue": "All arrays must be of the same length", 115 | "output_type": "error", 116 | "traceback": [ 117 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 118 | "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", 119 | "Cell \u001b[1;32mIn[12], line 3\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mpandas\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mpd\u001b[39;00m\n\u001b[1;32m----> 3\u001b[0m fullCorpus \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mDataFrame\u001b[49m\u001b[43m(\u001b[49m\u001b[43m{\u001b[49m\n\u001b[0;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mlabel\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43mlabelList\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mbody_list\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43mtextList\u001b[49m\n\u001b[0;32m 6\u001b[0m \u001b[43m}\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 8\u001b[0m fullCorpus\u001b[38;5;241m.\u001b[39mhead()\n", 120 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\frame.py:663\u001b[0m, in \u001b[0;36mDataFrame.__init__\u001b[1;34m(self, data, index, columns, dtype, copy)\u001b[0m\n\u001b[0;32m 657\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_mgr(\n\u001b[0;32m 658\u001b[0m data, axes\u001b[38;5;241m=\u001b[39m{\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m\"\u001b[39m: index, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m\"\u001b[39m: columns}, dtype\u001b[38;5;241m=\u001b[39mdtype, copy\u001b[38;5;241m=\u001b[39mcopy\n\u001b[0;32m 659\u001b[0m )\n\u001b[0;32m 661\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, \u001b[38;5;28mdict\u001b[39m):\n\u001b[0;32m 662\u001b[0m \u001b[38;5;66;03m# GH#38939 de facto copy defaults to False only in non-dict cases\u001b[39;00m\n\u001b[1;32m--> 663\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[43mdict_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmanager\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 664\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, ma\u001b[38;5;241m.\u001b[39mMaskedArray):\n\u001b[0;32m 665\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m 
\u001b[38;5;21;01mnumpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mma\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmrecords\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mmrecords\u001b[39;00m\n", 121 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\internals\\construction.py:493\u001b[0m, in \u001b[0;36mdict_to_mgr\u001b[1;34m(data, index, columns, dtype, typ, copy)\u001b[0m\n\u001b[0;32m 489\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 490\u001b[0m \u001b[38;5;66;03m# dtype check to exclude e.g. range objects, scalars\u001b[39;00m\n\u001b[0;32m 491\u001b[0m arrays \u001b[38;5;241m=\u001b[39m [x\u001b[38;5;241m.\u001b[39mcopy() \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(x, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdtype\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m arrays]\n\u001b[1;32m--> 493\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marrays_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtyp\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconsolidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n", 122 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\internals\\construction.py:118\u001b[0m, in \u001b[0;36marrays_to_mgr\u001b[1;34m(arrays, columns, index, dtype, verify_integrity, typ, consolidate)\u001b[0m\n\u001b[0;32m 115\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m verify_integrity:\n\u001b[0;32m 116\u001b[0m \u001b[38;5;66;03m# figure out the index, if necessary\u001b[39;00m\n\u001b[0;32m 117\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m index \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m--> 118\u001b[0m index \u001b[38;5;241m=\u001b[39m \u001b[43m_extract_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 119\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 120\u001b[0m index \u001b[38;5;241m=\u001b[39m ensure_index(index)\n", 123 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\internals\\construction.py:666\u001b[0m, in \u001b[0;36m_extract_index\u001b[1;34m(data)\u001b[0m\n\u001b[0;32m 664\u001b[0m lengths \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(raw_lengths))\n\u001b[0;32m 665\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(lengths) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m--> 666\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAll arrays must be of the same length\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 668\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m have_dicts:\n\u001b[0;32m 669\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[0;32m 670\u001b[0m 
\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMixing dicts with non-Series may lead to ambiguous ordering.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 671\u001b[0m )\n", 124 | "\u001b[1;31mValueError\u001b[0m: All arrays must be of the same length" 125 | ] 126 | } 127 | ], 128 | "source": [ 129 | "import pandas as pd\n", 130 | "\n", 131 | "fullCorpus = pd.DataFrame({\n", 132 | " 'label': labelList,\n", 133 | " 'body_list': textList\n", 134 | "})\n", 135 | "\n", 136 | "fullCorpus.head()" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 8, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "5571\n", 149 | "5570\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "print(len(labelList))\n", 155 | "print(len(textList))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 9, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "['ham', 'ham', 'ham', 'ham', '']\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "print(labelList[-5:])" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 10, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "data": { 182 | "text/html": [ 183 | "
\n", 184 | "\n", 197 | "\n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_list); same rows as the text/plain output below]
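The length mismatch printed earlier (5571 labels vs. 5570 messages) comes from the file's trailing newline: splitting the raw string on `\n` leaves an empty final element, which lands in `labelList`. Dropping that last label works, but as an aside (not an original notebook cell) a parse that splits each line on its first tab avoids the off-by-one altogether. A minimal sketch, assuming the same one-record-per-line `label<TAB>message` layout of SMSSpamCollection.tsv:

```python
# Sketch: defensive manual parse of the TSV, splitting each line on the
# first tab and skipping the blank line produced by the trailing newline.
import pandas as pd

records = []
with open("SMSSpamCollection.tsv", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue                       # skip the empty final line
        label, text = line.split("\t", 1)  # split on the first tab only
        records.append((label, text))

fullCorpus = pd.DataFrame(records, columns=["label", "body_list"])
print(fullCorpus.shape)
```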
" 234 | ], 235 | "text/plain": [ 236 | " label body_list\n", 237 | "0 ham I've been searching for the right words to tha...\n", 238 | "1 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 239 | "2 ham Nah I don't think he goes to usf, he lives aro...\n", 240 | "3 ham Even my brother is not like to speak with me. ...\n", 241 | "4 ham I HAVE A DATE ON SUNDAY WITH WILL!!" 242 | ] 243 | }, 244 | "execution_count": 10, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "fullCorpus = pd.DataFrame({\n", 251 | " 'label': labelList[:-1],\n", 252 | " 'body_list': textList\n", 253 | "})\n", 254 | "\n", 255 | "fullCorpus.head()" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 11, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "text/html": [ 266 | "
\n", 267 | "\n", 280 | "\n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | "
[pandas DataFrame rendered as an HTML table (columns: 0, 1); same rows as the text/plain output below]
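A small usage note, not from the original notebook: `pd.read_csv` can also assign the column names at read time via its `names` parameter, instead of working with the default `0`/`1` columns and renaming them afterwards. A sketch under the same file assumption:

```python
# Sketch: same tab-separated read as above, but naming the columns up front.
import pandas as pd

dataset = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None,
                      names=["label", "body_text"])
print(dataset.head())
```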
" 317 | ], 318 | "text/plain": [ 319 | " 0 1\n", 320 | "0 ham I've been searching for the right words to tha...\n", 321 | "1 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 322 | "2 ham Nah I don't think he goes to usf, he lives aro...\n", 323 | "3 ham Even my brother is not like to speak with me. ...\n", 324 | "4 ham I HAVE A DATE ON SUNDAY WITH WILL!!" 325 | ] 326 | }, 327 | "execution_count": 11, 328 | "metadata": {}, 329 | "output_type": "execute_result" 330 | } 331 | ], 332 | "source": [ 333 | "dataset = pd.read_csv(\"SMSSpamCollection.tsv\", sep=\"\\t\", header=None)\n", 334 | "dataset.head(" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": { 341 | "collapsed": true, 342 | "jupyter": { 343 | "outputs_hidden": true 344 | } 345 | }, 346 | "outputs": [], 347 | "source": [] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "Python 3 (ipykernel)", 353 | "language": "python", 354 | "name": "python3" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.11.0" 367 | } 368 | }, 369 | "nbformat": 4, 370 | "nbformat_minor": 4 371 | } 372 | -------------------------------------------------------------------------------- /1. NLP Basics/1.3. How to explore a dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Exploring the dataset" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in text data" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/html": [ 25 | "
\n", 26 | "\n", 39 | "\n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_text); same rows as the text/plain output below]
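The exploration cells below report the shape, the spam/ham split, and the missing values using boolean filtering. As a hedged aside, the same numbers can be pulled with a few pandas one-liners; a sketch assuming the corpus is loaded the same way:

```python
# Sketch: compact equivalents of the exploration steps in this notebook.
import pandas as pd

fullCorpus = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None,
                         names=["label", "body_text"])

print(fullCorpus.shape)                    # (rows, columns)
print(fullCorpus["label"].value_counts())  # counts of ham vs. spam
print(fullCorpus.isnull().sum())           # missing values per column
```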
" 76 | ], 77 | "text/plain": [ 78 | " label body_text\n", 79 | "0 ham I've been searching for the right words to tha...\n", 80 | "1 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 81 | "2 ham Nah I don't think he goes to usf, he lives aro...\n", 82 | "3 ham Even my brother is not like to speak with me. ...\n", 83 | "4 ham I HAVE A DATE ON SUNDAY WITH WILL!!" 84 | ] 85 | }, 86 | "execution_count": 1, 87 | "metadata": {}, 88 | "output_type": "execute_result" 89 | } 90 | ], 91 | "source": [ 92 | "import pandas as pd\n", 93 | "\n", 94 | "fullCorpus = pd.read_csv('SMSSpamCollection.tsv', sep='\\t', header=None)\n", 95 | "fullCorpus.columns = ['label', 'body_text']\n", 96 | "\n", 97 | "fullCorpus.head()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### Explore the dataset" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 2, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "Input data has 5568 rows and 2 columns\n" 117 | ] 118 | } 119 | ], 120 | "source": [ 121 | "# What is the shape of the dataset?\n", 122 | "\n", 123 | "print(\"Input data has {} rows and {} columns\".format(len(fullCorpus), len(fullCorpus.columns)))" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "name": "stdout", 133 | "output_type": "stream", 134 | "text": [ 135 | "Out of 5568 rows, 746 are spam, 4822 are ham\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "# How many spam/ham are there?\n", 141 | "\n", 142 | "print(\"Out of {} rows, {} are spam, {} are ham\".format(len(fullCorpus),\n", 143 | " len(fullCorpus[fullCorpus['label']=='spam']),\n", 144 | " len(fullCorpus[fullCorpus['label']=='ham'])))" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 6, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "Number of null in label: 0\n", 157 | "Number of null in text: 0\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "# How much missing data is there?\n", 163 | "\n", 164 | "print(\"Number of null in label: {}\".format(fullCorpus['label'].isnull().sum()))\n", 165 | "print(\"Number of null in text: {}\".format(fullCorpus['body_text'].isnull().sum()))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": { 172 | "collapsed": true, 173 | "jupyter": { 174 | "outputs_hidden": true 175 | } 176 | }, 177 | "outputs": [], 178 | "source": [] 179 | } 180 | ], 181 | "metadata": { 182 | "kernelspec": { 183 | "display_name": "Python 3 (ipykernel)", 184 | "language": "python", 185 | "name": "python3" 186 | }, 187 | "language_info": { 188 | "codemirror_mode": { 189 | "name": "ipython", 190 | "version": 3 191 | }, 192 | "file_extension": ".py", 193 | "mimetype": "text/x-python", 194 | "name": "python", 195 | "nbconvert_exporter": "python", 196 | "pygments_lexer": "ipython3", 197 | "version": "3.11.0" 198 | } 199 | }, 200 | "nbformat": 4, 201 | "nbformat_minor": 4 202 | } 203 | -------------------------------------------------------------------------------- /1. NLP Basics/1.4. 
learning how to use regular expressions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Learning how to use regular expressions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Using regular expressions in Python\n", 15 | "\n", 16 | "Python's `re` package is the most commonly used regex resource. More details can be found [here](https://docs.python.org/3/library/re.html)." 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": { 23 | "collapsed": true, 24 | "jupyter": { 25 | "outputs_hidden": true 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "import re\n", 31 | "\n", 32 | "re_test = 'This is a made up string to test 2 different regex methods'\n", 33 | "re_test_messy = 'This is a made up string to test 2 different regex methods'\n", 34 | "re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2\"\"\"\"\"\"different~regex-methods'" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Splitting a sentence into a list of words" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "text/plain": [ 52 | "['This',\n", 53 | " 'is',\n", 54 | " 'a',\n", 55 | " 'made',\n", 56 | " 'up',\n", 57 | " 'string',\n", 58 | " 'to',\n", 59 | " 'test',\n", 60 | " '2',\n", 61 | " 'different',\n", 62 | " 'regex',\n", 63 | " 'methods']" 64 | ] 65 | }, 66 | "execution_count": 3, 67 | "metadata": {}, 68 | "output_type": "execute_result" 69 | } 70 | ], 71 | "source": [ 72 | "re.split('\\s', re_test)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 4, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/plain": [ 83 | "['This',\n", 84 | " '',\n", 85 | " '',\n", 86 | " '',\n", 87 | " '',\n", 88 | " '',\n", 89 | " 'is',\n", 90 | " 'a',\n", 91 | " 'made',\n", 92 | " 'up',\n", 93 | " '',\n", 94 | " '',\n", 95 | " '',\n", 96 | " '',\n", 97 | " 'string',\n", 98 | " 'to',\n", 99 | " 'test',\n", 100 | " '2',\n", 101 | " '',\n", 102 | " '',\n", 103 | " '',\n", 104 | " 'different',\n", 105 | " 'regex',\n", 106 | " 'methods']" 107 | ] 108 | }, 109 | "execution_count": 4, 110 | "metadata": {}, 111 | "output_type": "execute_result" 112 | } 113 | ], 114 | "source": [ 115 | "re.split('\\s', re_test_messy)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 5, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "['This',\n", 127 | " 'is',\n", 128 | " 'a',\n", 129 | " 'made',\n", 130 | " 'up',\n", 131 | " 'string',\n", 132 | " 'to',\n", 133 | " 'test',\n", 134 | " '2',\n", 135 | " 'different',\n", 136 | " 'regex',\n", 137 | " 'methods']" 138 | ] 139 | }, 140 | "execution_count": 5, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "re.split('\\s+', re_test_messy)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 6, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "['This-is-a-made/up.string*to>>>>test----2\"\"\"\"\"\"different~regex-methods']" 158 | ] 159 | }, 160 | "execution_count": 6, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "re.split('\\s+', re_test_messy1)" 167 | 
] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 7, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "['This',\n", 178 | " 'is',\n", 179 | " 'a',\n", 180 | " 'made',\n", 181 | " 'up',\n", 182 | " 'string',\n", 183 | " 'to',\n", 184 | " 'test',\n", 185 | " '2',\n", 186 | " 'different',\n", 187 | " 'regex',\n", 188 | " 'methods']" 189 | ] 190 | }, 191 | "execution_count": 7, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "re.split('\\W+', re_test_messy1)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 10, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "data": { 207 | "text/plain": [ 208 | "['This',\n", 209 | " 'is',\n", 210 | " 'a',\n", 211 | " 'made',\n", 212 | " 'up',\n", 213 | " 'string',\n", 214 | " 'to',\n", 215 | " 'test',\n", 216 | " '2',\n", 217 | " 'different',\n", 218 | " 'regex',\n", 219 | " 'methods']" 220 | ] 221 | }, 222 | "execution_count": 10, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "re.findall('\\S+', re_test)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 11, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "['This',\n", 240 | " 'is',\n", 241 | " 'a',\n", 242 | " 'made',\n", 243 | " 'up',\n", 244 | " 'string',\n", 245 | " 'to',\n", 246 | " 'test',\n", 247 | " '2',\n", 248 | " 'different',\n", 249 | " 'regex',\n", 250 | " 'methods']" 251 | ] 252 | }, 253 | "execution_count": 11, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "re.findall('\\S+', re_test_messy)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 12, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/plain": [ 270 | "['This-is-a-made/up.string*to>>>>test----2\"\"\"\"\"\"different~regex-methods']" 271 | ] 272 | }, 273 | "execution_count": 12, 274 | "metadata": {}, 275 | "output_type": "execute_result" 276 | } 277 | ], 278 | "source": [ 279 | "re.findall('\\S+', re_test_messy1)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 13, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/plain": [ 290 | "['This',\n", 291 | " 'is',\n", 292 | " 'a',\n", 293 | " 'made',\n", 294 | " 'up',\n", 295 | " 'string',\n", 296 | " 'to',\n", 297 | " 'test',\n", 298 | " '2',\n", 299 | " 'different',\n", 300 | " 'regex',\n", 301 | " 'methods']" 302 | ] 303 | }, 304 | "execution_count": 13, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "re.findall('\\w+', re_test_messy1)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "### Replacing a specific string" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 3, 323 | "metadata": { 324 | "collapsed": true, 325 | "jupyter": { 326 | "outputs_hidden": true 327 | } 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "pep8_test = 'I try to follow PEP8 guidelines'\n", 332 | "pep7_test = 'I try to follow PEP7 guidelines'\n", 333 | "peep8_test = 'I try to follow PEEP8 guidelines'" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 4, 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/plain": [ 344 | "['try', 'to', 'follow', 'guidelines']" 345 | ] 346 | }, 347 | 
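An aside on the character classes used in the surrounding cells: ranges such as `[a-z]` are case-sensitive, which is why the search above returns only the lowercase words and drops 'I' and 'PEP8'. Combining ranges recovers every token; a short sketch (not an original cell):

```python
# Sketch: case-sensitive character class vs. a combined letters-and-digits class.
import re

pep8_test = 'I try to follow PEP8 guidelines'

print(re.findall(r'[a-z]+', pep8_test))        # ['try', 'to', 'follow', 'guidelines']
print(re.findall(r'[A-Za-z0-9]+', pep8_test))  # ['I', 'try', 'to', 'follow', 'PEP8', 'guidelines']
```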
"execution_count": 4, 348 | "metadata": {}, 349 | "output_type": "execute_result" 350 | } 351 | ], 352 | "source": [ 353 | "import re\n", 354 | "\n", 355 | "re.findall('[a-z]+', pep8_test)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 5, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "['I', 'PEP']" 367 | ] 368 | }, 369 | "execution_count": 5, 370 | "metadata": {}, 371 | "output_type": "execute_result" 372 | } 373 | ], 374 | "source": [ 375 | "re.findall('[A-Z]+', pep8_test)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 8, 381 | "metadata": {}, 382 | "outputs": [ 383 | { 384 | "data": { 385 | "text/plain": [ 386 | "['PEEP8']" 387 | ] 388 | }, 389 | "execution_count": 8, 390 | "metadata": {}, 391 | "output_type": "execute_result" 392 | } 393 | ], 394 | "source": [ 395 | "re.findall('[A-Z]+[0-9]+', peep8_test)" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 11, 401 | "metadata": {}, 402 | "outputs": [ 403 | { 404 | "data": { 405 | "text/plain": [ 406 | "'I try to follow PEP8 Python Styleguide guidelines'" 407 | ] 408 | }, 409 | "execution_count": 11, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8_test)" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "### Other examples of regex methods\n", 423 | "\n", 424 | "- re.search()\n", 425 | "- re.match()\n", 426 | "- re.fullmatch()\n", 427 | "- re.finditer()\n", 428 | "- re.escape()" 429 | ] 430 | } 431 | ], 432 | "metadata": { 433 | "kernelspec": { 434 | "display_name": "Python 3 (ipykernel)", 435 | "language": "python", 436 | "name": "python3" 437 | }, 438 | "language_info": { 439 | "codemirror_mode": { 440 | "name": "ipython", 441 | "version": 3 442 | }, 443 | "file_extension": ".py", 444 | "mimetype": "text/x-python", 445 | "name": "python", 446 | "nbconvert_exporter": "python", 447 | "pygments_lexer": "ipython3", 448 | "version": "3.11.0" 449 | } 450 | }, 451 | "nbformat": 4, 452 | "nbformat_minor": 4 453 | } 454 | -------------------------------------------------------------------------------- /1. NLP Basics/1.5. implementing a pipeline to clean text.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Implementing a pipeline to clean text" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Pre-processing text data\n", 15 | "\n", 16 | "Cleaning up the text data is necessary to highlight attributes that you're going to want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:\n", 17 | "1. **Remove punctuation**\n", 18 | "2. **Tokenization**\n", 19 | "3. **Remove stopwords**\n", 20 | "4. Lemmatize/Stem\n", 21 | "\n", 22 | "The first three steps are covered in this chapter as they're implemented in pretty much any text cleaning pipeline. Lemmatizing and stemming are covered in the next chapter as they're helpful but not critical." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": {}, 29 | "outputs": [ 30 | { 31 | "data": { 32 | "text/html": [ 33 | "
\n", 34 | "\n", 47 | "\n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_text); same rows as the text/plain output below]
" 84 | ], 85 | "text/plain": [ 86 | " label \\\n", 87 | "0 ham \n", 88 | "1 spam \n", 89 | "2 ham \n", 90 | "3 ham \n", 91 | "4 ham \n", 92 | "\n", 93 | " body_text \n", 94 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 95 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 96 | "2 Nah I don't think he goes to usf, he lives around here though \n", 97 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 98 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! " 99 | ] 100 | }, 101 | "execution_count": 1, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "import pandas as pd\n", 108 | "pd.set_option('display.max_colwidth', 100)\n", 109 | "\n", 110 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t', header=None)\n", 111 | "data.columns = ['label', 'body_text']\n", 112 | "\n", 113 | "data.head()" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 2, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/html": [ 124 | "
\n", 125 | "\n", 138 | "\n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_text, body_text_nostop); same rows as the text/plain output below]
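The "Remove punctuation" and "Tokenization" sections that follow build the body_text_nostop column shown above one step at a time. As a rough end-to-end sketch (an aside, not the notebook's own code, and assuming NLTK's English stopword list has already been fetched with `nltk.download('stopwords')`), the whole pipeline amounts to:

```python
# Sketch of the full cleaning pipeline: strip punctuation, tokenize on
# non-word characters, lowercase, and drop English stopwords.
import re
import string

import nltk
import pandas as pd

stopword_list = nltk.corpus.stopwords.words('english')

def clean_text(text):
    text = "".join(ch for ch in text if ch not in string.punctuation)   # remove punctuation
    tokens = re.split(r'\W+', text.lower())                             # tokenize
    return [tok for tok in tokens if tok and tok not in stopword_list]  # remove stopwords

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None,
                   names=['label', 'body_text'])
data['body_text_nostop'] = data['body_text'].apply(clean_text)
print(data.head())
```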
" 181 | ], 182 | "text/plain": [ 183 | " label \\\n", 184 | "0 ham \n", 185 | "1 spam \n", 186 | "2 ham \n", 187 | "3 ham \n", 188 | "4 ham \n", 189 | "\n", 190 | " body_text \\\n", 191 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 192 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 193 | "2 Nah I don't think he goes to usf, he lives around here though \n", 194 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 195 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 196 | "\n", 197 | " body_text_nostop \n", 198 | "0 ['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '... \n", 199 | "1 ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005... \n", 200 | "2 ['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though'] \n", 201 | "3 ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent'] \n", 202 | "4 ['date', 'sunday'] " 203 | ] 204 | }, 205 | "execution_count": 2, 206 | "metadata": {}, 207 | "output_type": "execute_result" 208 | } 209 | ], 210 | "source": [ 211 | "# What does the cleaned version look like?\n", 212 | "data_cleaned = pd.read_csv(\"SMSSpamCollection_cleaned.tsv\", sep='\\t')\n", 213 | "data_cleaned.head()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "### Remove punctuation" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 3, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" 232 | ] 233 | }, 234 | "execution_count": 3, 235 | "metadata": {}, 236 | "output_type": "execute_result" 237 | } 238 | ], 239 | "source": [ 240 | "import string\n", 241 | "string.punctuation" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 4, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/plain": [ 252 | "False" 253 | ] 254 | }, 255 | "execution_count": 4, 256 | "metadata": {}, 257 | "output_type": "execute_result" 258 | } 259 | ], 260 | "source": [ 261 | "\"I like NLP.\" == \"I like NLP\"" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 5, 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "data": { 271 | "text/html": [ 272 | "
\n", 273 | "\n", 286 | "\n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | "
 | label | body_text | body_text_clean
0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your...
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL
\n", 328 | "
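The body_text_clean column in the table above is produced by dropping every character found in string.punctuation. A standalone sketch of that step (same idea as the notebook's remove_punct; the str.translate variant is an equivalent, slightly faster alternative):

import string

def remove_punct(text):
    # Keep only characters that are not in string.punctuation.
    return "".join(char for char in text if char not in string.punctuation)

def remove_punct_fast(text):
    # Equivalent approach using a translation table.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punct("I like NLP."))       # I like NLP
print(remove_punct_fast("I like NLP."))  # I like NLP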
" 329 | ], 330 | "text/plain": [ 331 | " label \\\n", 332 | "0 ham \n", 333 | "1 spam \n", 334 | "2 ham \n", 335 | "3 ham \n", 336 | "4 ham \n", 337 | "\n", 338 | " body_text \\\n", 339 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 340 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 341 | "2 Nah I don't think he goes to usf, he lives around here though \n", 342 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 343 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 344 | "\n", 345 | " body_text_clean \n", 346 | "0 Ive been searching for the right words to thank you for this breather I promise i wont take your... \n", 347 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... \n", 348 | "2 Nah I dont think he goes to usf he lives around here though \n", 349 | "3 Even my brother is not like to speak with me They treat me like aids patent \n", 350 | "4 I HAVE A DATE ON SUNDAY WITH WILL " 351 | ] 352 | }, 353 | "execution_count": 5, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "def remove_punct(text):\n", 360 | " text_nopunct = \"\".join([char for char in text if char not in string.punctuation])\n", 361 | " return text_nopunct\n", 362 | "\n", 363 | "data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punct(x))\n", 364 | "\n", 365 | "data.head()" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "### Tokenization" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 6, 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "data": { 382 | "text/html": [ 383 | "
\n", 384 | "\n", 397 | "\n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | "
 | label | body_text | body_text_clean | body_text_tokenized
0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...
2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will]
\n", 445 | "
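Tokenization here is nothing more than a regex split on runs of non-word characters. A standalone sketch; the raw-string prefix r'\W+' avoids the invalid-escape-sequence warning that newer Python versions emit for the bare '\W+' used in the notebook, and the behaviour is otherwise identical:

import re

def tokenize(text):
    # Split on one or more non-word characters (anything outside [a-zA-Z0-9_]).
    return re.split(r'\W+', text)

print(tokenize("i have a date on sunday with will"))
# ['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']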
" 446 | ], 447 | "text/plain": [ 448 | " label \\\n", 449 | "0 ham \n", 450 | "1 spam \n", 451 | "2 ham \n", 452 | "3 ham \n", 453 | "4 ham \n", 454 | "\n", 455 | " body_text \\\n", 456 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 457 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 458 | "2 Nah I don't think he goes to usf, he lives around here though \n", 459 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 460 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 461 | "\n", 462 | " body_text_clean \\\n", 463 | "0 Ive been searching for the right words to thank you for this breather I promise i wont take your... \n", 464 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... \n", 465 | "2 Nah I dont think he goes to usf he lives around here though \n", 466 | "3 Even my brother is not like to speak with me They treat me like aids patent \n", 467 | "4 I HAVE A DATE ON SUNDAY WITH WILL \n", 468 | "\n", 469 | " body_text_tokenized \n", 470 | "0 [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... \n", 471 | "1 [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... \n", 472 | "2 [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] \n", 473 | "3 [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] \n", 474 | "4 [i, have, a, date, on, sunday, with, will] " 475 | ] 476 | }, 477 | "execution_count": 6, 478 | "metadata": {}, 479 | "output_type": "execute_result" 480 | } 481 | ], 482 | "source": [ 483 | "import re\n", 484 | "\n", 485 | "def tokenize(text):\n", 486 | " tokens = re.split('\\W+', text)\n", 487 | " return tokens\n", 488 | "\n", 489 | "data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))\n", 490 | "\n", 491 | "data.head()" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 7, 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "data": { 501 | "text/plain": [ 502 | "False" 503 | ] 504 | }, 505 | "execution_count": 7, 506 | "metadata": {}, 507 | "output_type": "execute_result" 508 | } 509 | ], 510 | "source": [ 511 | "'NLP' == 'nlp'" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "### Remove stopwords" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 8, 524 | "metadata": { 525 | "collapsed": true, 526 | "jupyter": { 527 | "outputs_hidden": true 528 | } 529 | }, 530 | "outputs": [], 531 | "source": [ 532 | "import nltk\n", 533 | "\n", 534 | "stopword = nltk.corpus.stopwords.words('english')" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 9, 540 | "metadata": {}, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/html": [ 545 | "
\n", 546 | "\n", 559 | "\n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | "
 | label | body_text | body_text_clean | body_text_tokenized | body_text_nostop
0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... | [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] | [nah, dont, think, goes, usf, lives, around, though]
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] | [even, brother, like, speak, treat, like, aids, patent]
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] | [date, sunday]
\n", 613 | "
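Stopword removal depends on NLTK's English stopword list, which has to be downloaded once per machine before nltk.corpus.stopwords can be used. A self-contained sketch:

import nltk

nltk.download('stopwords', quiet=True)  # one-time download of the word list
stopword = nltk.corpus.stopwords.words('english')

def remove_stopwords(tokens):
    # Keep only tokens that are not common English function words.
    return [word for word in tokens if word not in stopword]

print(remove_stopwords(['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']))
# ['date', 'sunday'] -- 'will' is dropped too, because it is on the stopword list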
" 614 | ], 615 | "text/plain": [ 616 | " label \\\n", 617 | "0 ham \n", 618 | "1 spam \n", 619 | "2 ham \n", 620 | "3 ham \n", 621 | "4 ham \n", 622 | "\n", 623 | " body_text \\\n", 624 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 625 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 626 | "2 Nah I don't think he goes to usf, he lives around here though \n", 627 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 628 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 629 | "\n", 630 | " body_text_clean \\\n", 631 | "0 Ive been searching for the right words to thank you for this breather I promise i wont take your... \n", 632 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... \n", 633 | "2 Nah I dont think he goes to usf he lives around here though \n", 634 | "3 Even my brother is not like to speak with me They treat me like aids patent \n", 635 | "4 I HAVE A DATE ON SUNDAY WITH WILL \n", 636 | "\n", 637 | " body_text_tokenized \\\n", 638 | "0 [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... \n", 639 | "1 [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... \n", 640 | "2 [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] \n", 641 | "3 [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] \n", 642 | "4 [i, have, a, date, on, sunday, with, will] \n", 643 | "\n", 644 | " body_text_nostop \n", 645 | "0 [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom... \n", 646 | "1 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 647 | "2 [nah, dont, think, goes, usf, lives, around, though] \n", 648 | "3 [even, brother, like, speak, treat, like, aids, patent] \n", 649 | "4 [date, sunday] " 650 | ] 651 | }, 652 | "execution_count": 9, 653 | "metadata": {}, 654 | "output_type": "execute_result" 655 | } 656 | ], 657 | "source": [ 658 | "def remove_stopwords(tokenized_list):\n", 659 | " text = [word for word in tokenized_list if word not in stopword]\n", 660 | " return text\n", 661 | "\n", 662 | "data['body_text_nostop'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))\n", 663 | "\n", 664 | "data.head()" 665 | ] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "execution_count": null, 670 | "metadata": { 671 | "collapsed": true, 672 | "jupyter": { 673 | "outputs_hidden": true 674 | } 675 | }, 676 | "outputs": [], 677 | "source": [] 678 | } 679 | ], 680 | "metadata": { 681 | "kernelspec": { 682 | "display_name": "Python 3 (ipykernel)", 683 | "language": "python", 684 | "name": "python3" 685 | }, 686 | "language_info": { 687 | "codemirror_mode": { 688 | "name": "ipython", 689 | "version": 3 690 | }, 691 | "file_extension": ".py", 692 | "mimetype": "text/x-python", 693 | "name": "python", 694 | "nbconvert_exporter": "python", 695 | "pygments_lexer": "ipython3", 696 | "version": "3.11.0" 697 | } 698 | }, 699 | "nbformat": 4, 700 | "nbformat_minor": 4 701 | } 702 | -------------------------------------------------------------------------------- /2. Data Cleaning/2.1. 
stemming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Supplemental Data Cleaning: Using Stemming" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Test out Porter stemmer" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 10, 20 | "metadata": { 21 | "collapsed": true, 22 | "jupyter": { 23 | "outputs_hidden": true 24 | } 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import nltk\n", 29 | "\n", 30 | "ps = nltk.PorterStemmer()" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 12, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "text/plain": [ 41 | "['MARTIN_EXTENSIONS',\n", 42 | " 'NLTK_EXTENSIONS',\n", 43 | " 'ORIGINAL_ALGORITHM',\n", 44 | " '__abstractmethods__',\n", 45 | " '__class__',\n", 46 | " '__delattr__',\n", 47 | " '__dict__',\n", 48 | " '__dir__',\n", 49 | " '__doc__',\n", 50 | " '__eq__',\n", 51 | " '__format__',\n", 52 | " '__ge__',\n", 53 | " '__getattribute__',\n", 54 | " '__gt__',\n", 55 | " '__hash__',\n", 56 | " '__init__',\n", 57 | " '__init_subclass__',\n", 58 | " '__le__',\n", 59 | " '__lt__',\n", 60 | " '__module__',\n", 61 | " '__ne__',\n", 62 | " '__new__',\n", 63 | " '__reduce__',\n", 64 | " '__reduce_ex__',\n", 65 | " '__repr__',\n", 66 | " '__setattr__',\n", 67 | " '__sizeof__',\n", 68 | " '__str__',\n", 69 | " '__subclasshook__',\n", 70 | " '__unicode__',\n", 71 | " '__weakref__',\n", 72 | " '_abc_cache',\n", 73 | " '_abc_negative_cache',\n", 74 | " '_abc_negative_cache_version',\n", 75 | " '_abc_registry',\n", 76 | " '_apply_rule_list',\n", 77 | " '_contains_vowel',\n", 78 | " '_ends_cvc',\n", 79 | " '_ends_double_consonant',\n", 80 | " '_has_positive_measure',\n", 81 | " '_is_consonant',\n", 82 | " '_measure',\n", 83 | " '_replace_suffix',\n", 84 | " '_step1a',\n", 85 | " '_step1b',\n", 86 | " '_step1c',\n", 87 | " '_step2',\n", 88 | " '_step3',\n", 89 | " '_step4',\n", 90 | " '_step5a',\n", 91 | " '_step5b',\n", 92 | " 'mode',\n", 93 | " 'pool',\n", 94 | " 'stem',\n", 95 | " 'unicode_repr',\n", 96 | " 'vowels']" 97 | ] 98 | }, 99 | "execution_count": 12, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "dir(ps)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 13, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "grow\n", 118 | "grow\n", 119 | "grow\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "print(ps.stem('grows'))\n", 125 | "print(ps.stem('growing'))\n", 126 | "print(ps.stem('grow'))" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 14, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "name": "stdout", 136 | "output_type": "stream", 137 | "text": [ 138 | "run\n", 139 | "run\n", 140 | "runner\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "print(ps.stem('run'))\n", 146 | "print(ps.stem('running'))\n", 147 | "print(ps.stem('runner'))" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "### Read in raw text" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 15, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "data": { 164 | "text/html": [ 165 | "
\n", 166 | "\n", 179 | "\n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | "
 | label | body_text
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1 | ham | Nah I don't think he goes to usf, he lives around here though
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent.
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!!
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...
\n", 215 | "
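One thing worth flagging about the pd.read_csv call in this cell: it is made without header=None, so the first SMS in the file appears to be consumed as a header row and then overwritten by data.columns = [...], which is why this head() starts at the "Free entry in 2 a wkly comp..." message while the first notebook's data started at "I've been searching...". A hedged sketch of the variant that keeps every record:

import pandas as pd

# header=None marks the file as having no header line, so the first SMS is kept;
# names= assigns the column labels directly.
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None, names=['label', 'body_text'])
print(data.shape)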
" 216 | ], 217 | "text/plain": [ 218 | " label \\\n", 219 | "0 spam \n", 220 | "1 ham \n", 221 | "2 ham \n", 222 | "3 ham \n", 223 | "4 ham \n", 224 | "\n", 225 | " body_text \n", 226 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 227 | "1 Nah I don't think he goes to usf, he lives around here though \n", 228 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 229 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 230 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... " 231 | ] 232 | }, 233 | "execution_count": 15, 234 | "metadata": {}, 235 | "output_type": "execute_result" 236 | } 237 | ], 238 | "source": [ 239 | "import pandas as pd\n", 240 | "import re\n", 241 | "import string\n", 242 | "pd.set_option('display.max_colwidth', 100)\n", 243 | "\n", 244 | "stopwords = nltk.corpus.stopwords.words('english')\n", 245 | "\n", 246 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 247 | "data.columns = ['label', 'body_text']\n", 248 | "\n", 249 | "data.head()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "### Clean up text" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 16, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "data": { 266 | "text/html": [ 267 | "
\n", 268 | "\n", 281 | "\n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | "
 | label | body_text | body_text_nostop
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr...
\n", 323 | "
" 324 | ], 325 | "text/plain": [ 326 | " label \\\n", 327 | "0 spam \n", 328 | "1 ham \n", 329 | "2 ham \n", 330 | "3 ham \n", 331 | "4 ham \n", 332 | "\n", 333 | " body_text \\\n", 334 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 335 | "1 Nah I don't think he goes to usf, he lives around here though \n", 336 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 337 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 338 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 339 | "\n", 340 | " body_text_nostop \n", 341 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 342 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 343 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 344 | "3 [date, sunday] \n", 345 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... " 346 | ] 347 | }, 348 | "execution_count": 16, 349 | "metadata": {}, 350 | "output_type": "execute_result" 351 | } 352 | ], 353 | "source": [ 354 | "def clean_text(text):\n", 355 | " text = \"\".join([word for word in text if word not in string.punctuation])\n", 356 | " tokens = re.split('\\W+', text)\n", 357 | " text = [word for word in tokens if word not in stopwords]\n", 358 | " return text\n", 359 | "\n", 360 | "data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))\n", 361 | "\n", 362 | "data.head()" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "### Stem text" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 17, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "data": { 379 | "text/html": [ 380 | "
\n", 381 | "\n", 394 | "\n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | "
 | label | body_text | body_text_nostop | body_text_stemmed
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... | [free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though] | [nah, dont, think, goe, usf, live, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] | [even, brother, like, speak, treat, like, aid, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... | [per, request, mell, mell, oru, minnaminungint, nurungu, vettam, set, callertun, caller, press, ...
\n", 442 | "
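A compact standalone version of the stemming step. Porter stemming is a rule-based suffix stripper, so stems such as entri, goe and wkli in the table above are not dictionary words; that is expected behaviour, not a bug:

import nltk

ps = nltk.PorterStemmer()

def stemming(tokens):
    # Reduce each token to its Porter stem; stems need not be real words.
    return [ps.stem(word) for word in tokens]

print(stemming(['entry', 'goes', 'lives', 'wkly']))
# ['entri', 'goe', 'live', 'wkli']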
" 443 | ], 444 | "text/plain": [ 445 | " label \\\n", 446 | "0 spam \n", 447 | "1 ham \n", 448 | "2 ham \n", 449 | "3 ham \n", 450 | "4 ham \n", 451 | "\n", 452 | " body_text \\\n", 453 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 454 | "1 Nah I don't think he goes to usf, he lives around here though \n", 455 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 456 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 457 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 458 | "\n", 459 | " body_text_nostop \\\n", 460 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 461 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 462 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 463 | "3 [date, sunday] \n", 464 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... \n", 465 | "\n", 466 | " body_text_stemmed \n", 467 | "0 [free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,... \n", 468 | "1 [nah, dont, think, goe, usf, live, around, though] \n", 469 | "2 [even, brother, like, speak, treat, like, aid, patent] \n", 470 | "3 [date, sunday] \n", 471 | "4 [per, request, mell, mell, oru, minnaminungint, nurungu, vettam, set, callertun, caller, press, ... " 472 | ] 473 | }, 474 | "execution_count": 17, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "def stemming(tokenized_text):\n", 481 | " text = [ps.stem(word) for word in tokenized_text]\n", 482 | " return text\n", 483 | "\n", 484 | "data['body_text_stemmed'] = data['body_text_nostop'].apply(lambda x: stemming(x))\n", 485 | "\n", 486 | "data.head()" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": { 493 | "collapsed": true, 494 | "jupyter": { 495 | "outputs_hidden": true 496 | } 497 | }, 498 | "outputs": [], 499 | "source": [] 500 | } 501 | ], 502 | "metadata": { 503 | "kernelspec": { 504 | "display_name": "Python 3 (ipykernel)", 505 | "language": "python", 506 | "name": "python3" 507 | }, 508 | "language_info": { 509 | "codemirror_mode": { 510 | "name": "ipython", 511 | "version": 3 512 | }, 513 | "file_extension": ".py", 514 | "mimetype": "text/x-python", 515 | "name": "python", 516 | "nbconvert_exporter": "python", 517 | "pygments_lexer": "ipython3", 518 | "version": "3.11.0" 519 | } 520 | }, 521 | "nbformat": 4, 522 | "nbformat_minor": 4 523 | } 524 | -------------------------------------------------------------------------------- /2. Data Cleaning/2.2. 
lemmatizing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Supplemental Data Cleaning: Using a Lemmatizer" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "tags": [] 14 | }, 15 | "source": [ 16 | "### Test out WordNet lemmatizer (read more about WordNet [here](https://wordnet.princeton.edu/))" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": { 23 | "collapsed": true, 24 | "jupyter": { 25 | "outputs_hidden": true 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "import nltk\n", 31 | "\n", 32 | "wn = nltk.WordNetLemmatizer()\n", 33 | "ps = nltk.PorterStemmer()" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "data": { 43 | "text/plain": [ 44 | "['__class__',\n", 45 | " '__delattr__',\n", 46 | " '__dict__',\n", 47 | " '__dir__',\n", 48 | " '__doc__',\n", 49 | " '__eq__',\n", 50 | " '__format__',\n", 51 | " '__ge__',\n", 52 | " '__getattribute__',\n", 53 | " '__gt__',\n", 54 | " '__hash__',\n", 55 | " '__init__',\n", 56 | " '__init_subclass__',\n", 57 | " '__le__',\n", 58 | " '__lt__',\n", 59 | " '__module__',\n", 60 | " '__ne__',\n", 61 | " '__new__',\n", 62 | " '__reduce__',\n", 63 | " '__reduce_ex__',\n", 64 | " '__repr__',\n", 65 | " '__setattr__',\n", 66 | " '__sizeof__',\n", 67 | " '__str__',\n", 68 | " '__subclasshook__',\n", 69 | " '__unicode__',\n", 70 | " '__weakref__',\n", 71 | " 'lemmatize',\n", 72 | " 'unicode_repr']" 73 | ] 74 | }, 75 | "execution_count": 3, 76 | "metadata": {}, 77 | "output_type": "execute_result" 78 | } 79 | ], 80 | "source": [ 81 | "dir(wn)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 4, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "mean\n", 94 | "mean\n" 95 | ] 96 | } 97 | ], 98 | "source": [ 99 | "print(ps.stem('meanness'))\n", 100 | "print(ps.stem('meaning'))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 5, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "name": "stdout", 110 | "output_type": "stream", 111 | "text": [ 112 | "meanness\n", 113 | "meaning\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "print(wn.lemmatize('meanness'))\n", 119 | "print(wn.lemmatize('meaning'))" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 6, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "goos\n", 132 | "gees\n" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "print(ps.stem('goose'))\n", 138 | "print(ps.stem('geese'))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 8, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "goose\n", 151 | "goose\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "print(wn.lemmatize('goose'))\n", 157 | "print(wn.lemmatize('geese'))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "### Read in raw text" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 9, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "text/html": [ 175 | "
\n", 176 | "\n", 189 | "\n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | "
 | label | body_text
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1 | ham | Nah I don't think he goes to usf, he lives around here though
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent.
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!!
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...
\n", 225 | "
" 226 | ], 227 | "text/plain": [ 228 | " label \\\n", 229 | "0 spam \n", 230 | "1 ham \n", 231 | "2 ham \n", 232 | "3 ham \n", 233 | "4 ham \n", 234 | "\n", 235 | " body_text \n", 236 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 237 | "1 Nah I don't think he goes to usf, he lives around here though \n", 238 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 239 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 240 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... " 241 | ] 242 | }, 243 | "execution_count": 9, 244 | "metadata": {}, 245 | "output_type": "execute_result" 246 | } 247 | ], 248 | "source": [ 249 | "import pandas as pd\n", 250 | "import re\n", 251 | "import string\n", 252 | "pd.set_option('display.max_colwidth', 100)\n", 253 | "\n", 254 | "stopwords = nltk.corpus.stopwords.words('english')\n", 255 | "\n", 256 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 257 | "data.columns = ['label', 'body_text']\n", 258 | "\n", 259 | "data.head()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "### Clean up text" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 10, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "data": { 276 | "text/html": [ 277 | "
\n", 278 | "\n", 291 | "\n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | "
 | label | body_text | body_text_nostop
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr...
\n", 333 | "
" 334 | ], 335 | "text/plain": [ 336 | " label \\\n", 337 | "0 spam \n", 338 | "1 ham \n", 339 | "2 ham \n", 340 | "3 ham \n", 341 | "4 ham \n", 342 | "\n", 343 | " body_text \\\n", 344 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 345 | "1 Nah I don't think he goes to usf, he lives around here though \n", 346 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 347 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 348 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 349 | "\n", 350 | " body_text_nostop \n", 351 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 352 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 353 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 354 | "3 [date, sunday] \n", 355 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... " 356 | ] 357 | }, 358 | "execution_count": 10, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "def clean_text(text):\n", 365 | " text = \"\".join([word for word in text if word not in string.punctuation])\n", 366 | " tokens = re.split('\\W+', text)\n", 367 | " text = [word for word in tokens if word not in stopwords]\n", 368 | " return text\n", 369 | "\n", 370 | "data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))\n", 371 | "\n", 372 | "data.head()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "### Lemmatize text" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 11, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/html": [ 390 | "
\n", 391 | "\n", 404 | "\n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | "
 | label | body_text | body_text_nostop | body_text_lemmatized
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though] | [nah, dont, think, go, usf, life, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] | [even, brother, like, speak, treat, like, aid, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, caller, pre...
5 | spam | WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c... | [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170... | [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170...
6 | spam | Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came... | [mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile... | [mobile, 11, month, u, r, entitled, update, latest, colour, mobile, camera, free, call, mobile, ...
7 | ham | I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ... | [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] | [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today]
8 | spam | SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ... | [six, chances, win, cash, 100, 20000, pounds, txt, csh11, send, 87575, cost, 150pday, 6days, 16,... | [six, chance, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6days, 16, t...
9 | spam | URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM... | [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk... | [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk...
\n", 487 | "
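Unlike the stemmer, WordNetLemmatizer only changes a word when it can map it to a WordNet entry, and it treats every word as a noun by default (pos='n'), which is why lives becomes life in the table above. A short sketch; the WordNet corpus has to be downloaded once:

import nltk

nltk.download('wordnet', quiet=True)  # one-time download of the WordNet corpus
wn = nltk.WordNetLemmatizer()

def lemmatizing(tokens):
    # Default pos='n': each token is lemmatized as a noun.
    return [wn.lemmatize(word) for word in tokens]

print(lemmatizing(['goes', 'lives', 'aids']))  # ['go', 'life', 'aid']
print(wn.lemmatize('lives', pos='v'))          # 'live' when treated as a verb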
" 488 | ], 489 | "text/plain": [ 490 | " label \\\n", 491 | "0 spam \n", 492 | "1 ham \n", 493 | "2 ham \n", 494 | "3 ham \n", 495 | "4 ham \n", 496 | "5 spam \n", 497 | "6 spam \n", 498 | "7 ham \n", 499 | "8 spam \n", 500 | "9 spam \n", 501 | "\n", 502 | " body_text \\\n", 503 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 504 | "1 Nah I don't think he goes to usf, he lives around here though \n", 505 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 506 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 507 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 508 | "5 WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c... \n", 509 | "6 Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came... \n", 510 | "7 I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ... \n", 511 | "8 SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ... \n", 512 | "9 URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM... \n", 513 | "\n", 514 | " body_text_nostop \\\n", 515 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 516 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 517 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 518 | "3 [date, sunday] \n", 519 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... \n", 520 | "5 [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170... \n", 521 | "6 [mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile... \n", 522 | "7 [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] \n", 523 | "8 [six, chances, win, cash, 100, 20000, pounds, txt, csh11, send, 87575, cost, 150pday, 6days, 16,... \n", 524 | "9 [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk... \n", 525 | "\n", 526 | " body_text_lemmatized \n", 527 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 528 | "1 [nah, dont, think, go, usf, life, around, though] \n", 529 | "2 [even, brother, like, speak, treat, like, aid, patent] \n", 530 | "3 [date, sunday] \n", 531 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, caller, pre... \n", 532 | "5 [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170... \n", 533 | "6 [mobile, 11, month, u, r, entitled, update, latest, colour, mobile, camera, free, call, mobile, ... \n", 534 | "7 [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] \n", 535 | "8 [six, chance, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6days, 16, t... \n", 536 | "9 [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk... 
" 537 | ] 538 | }, 539 | "execution_count": 11, 540 | "metadata": {}, 541 | "output_type": "execute_result" 542 | } 543 | ], 544 | "source": [ 545 | "def lemmatizing(tokenized_text):\n", 546 | " text = [wn.lemmatize(word) for word in tokenized_text]\n", 547 | " return text\n", 548 | "\n", 549 | "data['body_text_lemmatized'] = data['body_text_nostop'].apply(lambda x: lemmatizing(x))\n", 550 | "\n", 551 | "data.head(10)" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": { 558 | "collapsed": true, 559 | "jupyter": { 560 | "outputs_hidden": true 561 | } 562 | }, 563 | "outputs": [], 564 | "source": [] 565 | } 566 | ], 567 | "metadata": { 568 | "kernelspec": { 569 | "display_name": "Python 3 (ipykernel)", 570 | "language": "python", 571 | "name": "python3" 572 | }, 573 | "language_info": { 574 | "codemirror_mode": { 575 | "name": "ipython", 576 | "version": 3 577 | }, 578 | "file_extension": ".py", 579 | "mimetype": "text/x-python", 580 | "name": "python", 581 | "nbconvert_exporter": "python", 582 | "pygments_lexer": "ipython3", 583 | "version": "3.11.0" 584 | } 585 | }, 586 | "nbformat": 4, 587 | "nbformat_minor": 4 588 | } 589 | -------------------------------------------------------------------------------- /4. Feature Engineering/4.1. Feature Creation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Feature Engineering: Feature Creation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true, 22 | "jupyter": { 23 | "outputs_hidden": true 24 | } 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import pandas as pd\n", 29 | "\n", 30 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 31 | "data.columns = ['label', 'body_text']" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "### Create feature for text message length" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "data": { 48 | "text/html": [ 49 | "
\n", 50 | "\n", 63 | "\n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | "
 | label | body_text | body_len
0 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 128
1 | ham | Nah I don't think he goes to usf, he lives aro... | 49
2 | ham | Even my brother is not like to speak with me. ... | 62
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28
4 | ham | As per your request 'Melle Melle (Oru Minnamin... | 135
\n", 105 | "
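The body_len feature above is simply the number of non-space characters in a message. A minimal sketch, with a sanity check against one of the rows shown:

def body_len(text):
    # Character count excluding spaces.
    return len(text) - text.count(" ")

print(body_len("I HAVE A DATE ON SUNDAY WITH WILL!!"))  # 28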
" 106 | ], 107 | "text/plain": [ 108 | " label body_text body_len\n", 109 | "0 spam Free entry in 2 a wkly comp to win FA Cup fina... 128\n", 110 | "1 ham Nah I don't think he goes to usf, he lives aro... 49\n", 111 | "2 ham Even my brother is not like to speak with me. ... 62\n", 112 | "3 ham I HAVE A DATE ON SUNDAY WITH WILL!! 28\n", 113 | "4 ham As per your request 'Melle Melle (Oru Minnamin... 135" 114 | ] 115 | }, 116 | "execution_count": 2, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 123 | "\n", 124 | "data.head()" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "### Create feature for % of text that is punctuation" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 3, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/html": [ 142 | "
\n", 143 | "\n", 156 | "\n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | "
 | label | body_text | body_len | punct%
0 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 128 | 4.7
1 | ham | Nah I don't think he goes to usf, he lives aro... | 49 | 4.1
2 | ham | Even my brother is not like to speak with me. ... | 62 | 3.2
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28 | 7.1
4 | ham | As per your request 'Melle Melle (Oru Minnamin... | 135 | 4.4
\n", 204 | "
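The punct% feature is the share of non-space characters that are punctuation, rounded and expressed as a percentage. A standalone sketch; note also that the histogram cells further down call pyplot.hist with normed=True, which newer Matplotlib releases have removed; density=True is the current equivalent:

import string

def count_punct(text):
    # Percentage of non-space characters that are punctuation marks.
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / (len(text) - text.count(" ")), 3) * 100

print(count_punct("I HAVE A DATE ON SUNDAY WITH WILL!!"))  # about 7.1, matching the row above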
" 205 | ], 206 | "text/plain": [ 207 | " label body_text body_len punct%\n", 208 | "0 spam Free entry in 2 a wkly comp to win FA Cup fina... 128 4.7\n", 209 | "1 ham Nah I don't think he goes to usf, he lives aro... 49 4.1\n", 210 | "2 ham Even my brother is not like to speak with me. ... 62 3.2\n", 211 | "3 ham I HAVE A DATE ON SUNDAY WITH WILL!! 28 7.1\n", 212 | "4 ham As per your request 'Melle Melle (Oru Minnamin... 135 4.4" 213 | ] 214 | }, 215 | "execution_count": 3, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "import string\n", 222 | "\n", 223 | "def count_punct(text):\n", 224 | " count = sum([1 for char in text if char in string.punctuation])\n", 225 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 226 | "\n", 227 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 228 | "\n", 229 | "data.head()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "### Evaluate created features" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 4, 242 | "metadata": { 243 | "collapsed": true, 244 | "jupyter": { 245 | "outputs_hidden": true 246 | } 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "from matplotlib import pyplot\n", 251 | "import numpy as np\n", 252 | "%matplotlib inline" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 5, 258 | "metadata": {}, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAFSdJREFUeJzt3X+M3PWd3/Hn2z+wkxZMz7gRsYE1\nBU62szIExyYqnGQlOHYS4lyAxrTobAUFXYrTwokEfFEQJXe9QNq6VwXlQs4oBNHgK/nlCF84UpM0\nrYDYBnz2hgMW8JU9U+IY5COAwTbv/jHftcbD7s6sdz2zu5/nQ7L2O5/5fHfe853xaz/zmc98JzIT\nSVIZJnW6AElS+xj6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIJM6XQBjU455ZTs\n6urqdBmSNK5s3779N5k5q1m/MRf6XV1dbNu2rdNlSNK4EhF/30o/p3ckqSCGviQVxNCXpIKMuTn9\ngRw8eJC+vj4OHDjQ6VLabvr06cyZM4epU6d2uhRJE8C4CP2+vj5OPPFEurq6iIhOl9M2mcm+ffvo\n6+tj7ty5nS5H0gQwLqZ3Dhw4wMyZM4sKfICIYObMmUW+wpF0fIyL0AeKC/x+pd5vScfHuAl9SdLI\njYs5/UbrH3x6VH/fdRefM6q/T5LGqnEZ+pKaG2pw5ECnXE7vtOi1117jYx/7GAsXLuR973sfGzdu\npKurixtuuIHFixezePFient7Afjxj3/MkiVLOO+88/jwhz/MSy+9BMDNN9/M6tWrWbZsGV1dXXz/\n+9/ni1/8It3d3SxfvpyDBw928i5KKoCh36Kf/OQnvPe972XHjh3s2rWL5cuXA3DSSSfxy1/+krVr\n13LttdcCcOGFF/LII4/w+OOPs2rVKm677bYjv+fZZ5/l/vvv50c/+hFXXnklS5cuZefOnbzrXe/i\n/vvv78h9k1QOQ79F3d3d/PSnP+WGG27gF7/4BTNmzADgiiuuOPLz4YcfBmqfK/jIRz5Cd3c3X/va\n1+jp6Tnye1asWMHUqVPp7u7m8OHDR/54dHd3s3v37vbeKUnFMfRbdM4557B9+3a6u7tZt24dt9xy\nC3D0ksr+7c9//vOsXbuWnTt38s1vfvOodfbTpk0DYNKkSUydOvXIPpMmTeLQoUPtujuSCmXot2jP\nnj28+93v5sorr+T666/nscceA2Djxo1Hfn7wgx8EYP/+/cyePRuAu+66qzMFS9IAxuXqnU6sPNi5\ncydf+MIXjozQv/GNb3DZZZfx5ptvsmTJEt5++22++93vArU3bC+//HJmz57NBRdcwPPPP9/2eiVp\nIJGZna7hKIsWLcrGL1F58sknmTdvXocqGlz/F76ccsopx/V2xur919jmks2yRMT2zFzUrJ/TO5JU\nkHE5vTNWuNpG0njjSF+SCtJS6EfE8oh4KiJ6I+LGAa6fFhEbq+sfjYiuhutPj4jfRsT1o1O2JOlY\nNA39iJgM3A6sAOYDV0TE/IZuVwGvZOZZwHrg1obr1wN/PfJyJUkj0cpIfzHQm5nPZeZbwL3AyoY+\nK4H+Ben3AR+K6lNHEfFJ4DmgB0lSR7XyRu5s4IW6y33AksH6ZOahiNgPzIyIN4AbgIuB0ZvaeejP\nRu1XAbB0XdMuu3fv5uMf/zi7du0a3duWpDZqZaQ/0Fc3NS7uH6zPfwDWZ+Zvh7yBiKsjYltEbNu7\nd28LJUmSjkUrod8HnFZ3eQ6wZ7A+ETEFmAG8TO0VwW0RsRu4FvjjiFjbeAOZeUdmLsrMRbNmzRr2\nnWiXw4cP89nPfpYFCxawbNky3njjDb71rW/xgQ98gIULF3LppZfy+uuvA7BmzRo+97nPsXTpUs48\n80x+/vOf85nPfIZ58+axZs2azt4RScVqJfS3AmdHxNyIOAFYBWxq6LMJWF1tXwZsyZ
qLMrMrM7uA\n/wr8x8z8+ijV3nbPPPMM11xzDT09PZx88sl873vf41Of+hRbt25lx44dzJs3jw0bNhzp/8orr7Bl\nyxbWr1/PJZdcwnXXXUdPTw87d+7kiSee6OA9kVSqpqGfmYeAtcADwJPAX2VmT0TcEhGfqLptoDaH\n3wv8EfCOZZ0Twdy5czn33HMBOP/889m9eze7du3ioosuoru7m3vuueeo0yhfcsklRATd3d285z3v\nobu7m0mTJrFgwQI/2CWpI1r6RG5mbgY2N7TdVLd9ALi8ye+4+RjqG1P6T4sMMHnyZN544w3WrFnD\nD3/4QxYuXMi3v/1tfvazn72j/6RJk47a19MoS+oUP5E7Qq+++iqnnnoqBw8e5J577ul0OZI0pPF5\n7p0Wlli2y1e+8hWWLFnCGWecQXd3N6+++mqnS5KkQXlq5XGg9PuvY+OplcviqZUlSe9g6EtSQcZN\n6I+1aah2KfV+Szo+xkXoT58+nX379hUXgJnJvn37mD59eqdLkTRBjIvVO3PmzKGvr48Sz8szffp0\n5syZ0+kyJE0Q4yL0p06dyty5cztdhiSNe+NiekeSNDoMfUkqiKEvSQUx9CWpIIa+JBXE0Jekghj6\nklQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSqIoS9J\nBTH0Jakghr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+JBXE0JekgrQU+hGx\nPCKeiojeiLhxgOunRcTG6vpHI6Kral8cEU9U/3ZExO+PbvmSpOFoGvoRMRm4HVgBzAeuiIj5Dd2u\nAl7JzLOA9cCtVfsuYFFmngssB74ZEVNGq3hJ0vC0EsCLgd7MfA4gIu4FVgK/quuzEri52r4P+HpE\nRGa+XtdnOpAjrlgSAOsffLrTJWgcamV6ZzbwQt3lvqptwD6ZeQjYD8wEiIglEdED7AT+sLpektQB\nrYR+DNDWOGIftE9mPpqZC4APAOsiYvo7biDi6ojYFhHb9u7d20JJkqRj0Uro9wGn1V2eA+wZrE81\nZz8DeLm+Q2Y+CbwGvK/xBjLzjsxclJmLZs2a1Xr1kqRhaSX0twJnR8TciDgBWAVsauizCVhdbV8G\nbMnMrPaZAhARZwC/C+welcolScPW9I3czDwUEWuBB4DJwJ2Z2RMRtwDbMnMTsAG4OyJ6qY3wV1W7\nXwjcGBEHgbeBf5uZvzked0SS1FxLyyczczOwuaHtprrtA8DlA+x3N3D3CGuUJI0SP5ErSQUx9CWp\nIIa+JBXE0Jekghj6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIIY+pJUEENfkgpi\n6EtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+\nJBXE0Jekghj6klQQQ1+SCmLoS1JBpnS6AEkDW//g050uQROQI31JKoihL0kFMfQlqSCGviQVxNCX\npIIY+pJUEENfkgrS0jr9iFgO/DkwGfjLzPxqw/XTgO8A5wP7gE9n5u6IuBj4KnAC8BbwhczcMor1\njy0P/dnQ1y9d1546JGkQTUf6ETEZuB1YAcwHroiI+Q3drgJeycyzgPXArVX7b4BLMrMbWA3cPVqF\nS5KGr5XpncVAb2Y+l5lvAfcCKxv6rATuqrbvAz4UEZGZj2fmnqq9B5hevSqQJHVAK6E/G3ih7nJf\n1TZgn8w8BOwHZjb0uRR4PDPfPLZSJUkj1cqcfgzQlsPpExELqE35LBvwBiKuBq4GOP3001soSZJ0\nLFoZ6fcBp9VdngPsGaxPREwBZgAvV5fnAD8A/iAznx3oBjLzjsxclJmLZs2aNbx7IElqWSsj/a3A\n2RExF/gHYBXwrxv6bKL2Ru3DwGXAlszMiDgZuB9Yl5n/Z/TKHqeGWt3jyh5JbdA09DPzUESsBR6g\ntmTzzszsiYhbgG2ZuQnYANwdEb3URvirqt3XAmcBX46IL1dtyzLz16N9RyS1rtlpm6+7+Jw2VaJ2\na2mdfmZuBjY3tN1Ut30AuHyA/f4E+JMR1ihJGiV+IleSCmLoS1JBDH1JKoihL0kFMfQlqSCGviQV\nxNCXpIK0tE5fbeC5+CW1gaE/XvhHQdIocHpHkgpi6EtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SC\nuE5/OJqtlZekMc6RviQVxJG+1CHNvqdWOh4c6UtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SCGPqS\nVBBDX5IKYuhLUkEMfUkqiKdhkI4TT7OgsciRviQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSqIoS9J\nBXGd/kTR7Evbl65rTx2SxjRH+pJUkJZCPyKWR8RTEdEbETcOcP20iNhYXf9oRHRV7TMj4qGI+G1E\nfH10S5ckDVfT0I+IycDtwApgPnBFRMxv6HYV8EpmngWsB26t2g8AXwauH7WKJUnHrJWR/mKgNzOf\ny8y3gHuBlQ19VgJ3Vdv3AR+KiMjM1zLzf1MLf0lSh7US+rOBF+ou91VtA/bJzEPAfmDmaBQoSRo9\nrYR+DNCWx9Bn8BuIuDoitkXEtr1797a6myRpmFoJ/T7gtLrLc4A9g/WJiCnADODlVovIzDsyc1Fm\nLpo1a1aru0mShqmV0N8KnB0RcyPiBGAVsKmhzyZgdbV9GbAlM1se6UuS2qPph7My81BErAUeACYD\nd2ZmT0TcAmzLzE3ABuDuiOilNsJf1b9/ROwGTgJOiIhPAssy81ejf1ckSc209InczNwMbG5ou6lu\n+wBw+SD7do2gPknSKPI0DKXwNA2S8DQMklQUQ1+SCmLoS1JBnNNXTbM5/6H4foA0bjjSl6SCGPqS\nVBBDX5IKYuhLUkEMfUkqiKEvSQVxyaZGzlM8SOOGI31JKogjfWkI6x98esjrr7v4nDZVIo0OQ18a\ngWZ/FKSxxukdSSqII30dfx18o9fpGelojvQlqSCGviQVxOkddZ7r/KW2MfRVNFffqDSGvjRGXfB/\n7xjy+kdOv7pNlWgiMfQ19g01/ePUjzQshr40Th3PVwIudZ24XL0jSQVxpK+mHn5u35DXf/DMmW2q\n5J2ajUidF5eOZuhrXGsW6iPdv9kfhaH2H8m+0vFi6EtDGEkwG+oai5zTl6SCGPqSVBCnd+o1Ox2A\nJI1zjvQlqSCO9KUJaiQrizRxGfrjxFheK3+8NbvvklpXVug7Zz8mGepS+5QV+jouDO3yeG6e8cvQ\nlwrk6SnKNfFCv9ApnJHO+Ttal8ow8UJ/DBsqWCfyG7GaeJqfYuI/taUODV9LoR8Ry4E/ByYDf5mZ\nX224fhrwHeB8YB/w6czcXV23DrgKOAz8u8x8YNSqn0AcaWsicc5/7Goa+hExGbgduBjoA7ZGxKbM\n/FVdt6uAVzLzrIhYBdwKfDoi5gOrgAXAe
4GfRsQ5mXl4tO9IO5S8bFJl8WRxE1crI/3FQG9mPgcQ\nEfcCK4H60F8J3Fxt3wd8PSKiar83M98Eno+I3ur3PTw65Usai5r+0XhoBAOksfwVmc3eUxwDtbcS\n+rOBF+ou9wFLBuuTmYciYj8ws2p/pGHf2cdc7XE2kadYJvJ90/gzove3mgRrJ1+RN71tOv9HoZXQ\njwHassU+rexLRFwN9K8R+21EPNVCXYM5BfjNCPY/XqxreKxreKxreMZoXX88krrOaKVTK6HfB5xW\nd3kOsGeQPn0RMQWYAbzc4r5k5h3AqEwiRsS2zFw0Gr9rNFnX8FjX8FjX8JRcVytn2dwKnB0RcyPi\nBGpvzG5q6LMJWF1tXwZsycys2ldFxLSImAucDfxydEqXJA1X05F+NUe/FniA2pLNOzOzJyJuAbZl\n5iZgA3B39Ubty9T+MFD1+ytqb/oeAq4Zryt3JGkiaGmdfmZuBjY3tN1Ut30AuHyQff8U+NMR1Dhc\nY3WtmXUNj3UNj3UNT7F1RW0WRpJUAr85S5IKMmFCPyKWR8RTEdEbETd2sI7TIuKhiHgyInoi4t9X\n7TdHxD9ExBPVv492oLbdEbGzuv1tVdvvRMSDEfFM9fOftbmm3607Jk9ExD9GxLWdOF4RcWdE/Doi\ndtW1DXh8oua/Vc+3v42I97e5rq9FxN9Vt/2DiDi5au+KiDfqjttftLmuQR+3iFhXHa+nIuIjba5r\nY11NuyPiiaq9ncdrsGxo73MsM8f9P2pvMD8LnAmcAOwA5neollOB91fbJwJPA/OpfWL5+g4fp93A\nKQ1ttwE3Vts3Ard2+HH8f9TWG7f9eAG/B7wf2NXs+AAfBf6a2mdRLgAebXNdy4Ap1fatdXV11ffr\nwPEa8HGr/g/sAKYBc6v/r5PbVVfD9f8ZuKkDx2uwbGjrc2yijPSPnCoiM98C+k8V0XaZ+WJmPlZt\nvwo8yRj+FDK143RXtX0X8MkO1vIh4NnM/PtO3Hhm/i9qq8/qDXZ8VgLfyZpHgJMj4tR21ZWZf5OZ\nh6qLj1D7DExbDXK8BnPklCyZ+TzQf0qWttYVEQH8K+C7x+O2hzJENrT1OTZRQn+gU0V0PGgjogs4\nD3i0alpbvUy7s93TKJUE/iYitkftU9AA78nMF6H2pAT+eQfq6reKo/8zdvp4weDHZyw95z5DbUTY\nb25EPB4RP4+IizpQz0CP21g5XhcBL2XmM3VtbT9eDdnQ1ufYRAn9lk730E4R8U+B7wHXZuY/At8A\n/gVwLvAitZeY7fYvM/P9wArgmoj4vQ7UMKCoffDvE8D/qJrGwvEayph4zkXEl6h9BuaequlF4PTM\nPA/4I+C/R8RJbSxpsMdtTBwv4AqOHli0/XgNkA2Ddh2gbcTHbKKEfkune2iXiJhK7UG9JzO/D5CZ\nL2Xm4cx8G/gWx+ml7VAyc0/189fAD6oaXup/yVj9/HW766qsAB7LzJeqGjt+vCqDHZ+OP+ciYjXw\nceDfZDUJXE2f7Ku2t1ObO2/byeuHeNzGwvGaAnwK2Njf1u7jNVA20Obn2EQJ/VZOFdEW1ZzhBuDJ\nzPwvde31c3G/D+xq3Pc41/VPIuLE/m1qbwTu4uhTaKwGftTOuuocNQLr9PGqM9jx2QT8QbXC4gJg\nf/9L9HaI2hcb3QB8IjNfr2ufFbXvwCAizqR26pPn2ljXYI/bWDgly4eBv8vMvv6Gdh6vwbKBdj/H\n2vGudTv+UXun+2lqf6m/1ME6LqT2EuxvgSeqfx8F7gZ2Vu2bgFPbXNeZ1FZP7AB6+o8RtVNg/0/g\nmern73TgmL2b2jeuzahra/vxovZH50XgILVR1lWDHR9qL71vr55vO4FFba6rl9p8b/9z7C+qvpdW\nj+8O4DHgkjbXNejjBnypOl5PASvaWVfV/m3gDxv6tvN4DZYNbX2O+YlcSSrIRJnekSS1wNCXpIIY\n+pJUEENfkgpi6EtSQQx9SSqIoS9JBTH0Jakg/x8m8I8uRmS2CQAAAABJRU5ErkJggg==\n", 263 | "text/plain": [ 264 | "" 265 | ] 266 | }, 267 | "metadata": {}, 268 | "output_type": "display_data" 269 | } 270 | ], 271 | "source": [ 272 | "bins = np.linspace(0, 200, 40)\n", 273 | "\n", 274 | "pyplot.hist(data[data['label']=='spam']['body_len'], bins, alpha=0.5, normed=True, label='spam')\n", 275 | "pyplot.hist(data[data['label']=='ham']['body_len'], bins, alpha=0.5, normed=True, label='ham')\n", 276 | "pyplot.legend(loc='upper left')\n", 277 | "pyplot.show()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 6, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "data": { 287 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAGI9JREFUeJzt3X+Q1PWd5/Hnix+CF6NGnFjKQGYs\nsQrIRLOOg9aqF0xChovKVoQLZK2FixXuspLbuBsVUndocFOJyd6yW6WVkkRPYjTgGbMh51yIiuel\ntlAH/DWMrHEkHHRIKUHiagzCwPv+6C9c0xno78z0TDP9eT2qKPr7+X6+335/yvbVXz797U8rIjAz\nszSMqnUBZmY2fBz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQsbUuoBy\nZ555ZjQ1NdW6DDOzEWXz5s2/jYiGSv1OuNBvampi06ZNtS7DzGxEkfR/8/Tz9I6ZWUIc+mZmCXHo\nm5kl5ISb0zczy+PAgQMUCgX27dtX61KG1fjx42lsbGTs2LEDOt6hb2YjUqFQ4P3vfz9NTU1IqnU5\nwyIi2LNnD4VCgebm5gGdw9M7ZjYi7du3jwkTJiQT+ACSmDBhwqD+dZMr9CW1S3pFUo+kpX3sv0LS\nc5J6Jc0t2zdZ0s8lbZX0sqSmAVdrZlYipcA/bLBjrhj6kkYDdwGzgWnAAknTyrrtABYBD/Zxiu8D\n346IqUAb8MZgCjYzs4HLM6ffBvRExDYASWuAOcDLhztExPZs36HSA7M3hzER8VjW753qlG1mdrSV\nj/2yque78ZPnV/V8J4o8oT8R2FmyXQBm5Dz/+cDvJD0CNAOPA0sj4mC/qhwmlV409foiMLN05JnT\n72sCKXKefwxwOfAV4GLgXIrTQEc/gbRY0iZJm3bv3p3z1GZmtfX73/+eT3/601xwwQV8+MMfZu3a\ntTQ1NXHLLbfQ1tZGW1sbPT09APz0pz9lxowZfPSjH+UTn/gEr7/+OgC33XYbCxcuZNasWTQ1NfHI\nI49w880309LSQnt7OwcOHKhqzXlCvwBMKtluBHblPH8BeD4itkVEL/BPwJ+Ud4qIVRHRGhGtDQ0V\n1wsyMzsh/OxnP+Occ87hxRdfZMuWLbS3twNw6qmn8uyzz7JkyRK+/OUvA3DZZZfx9NNP8/zzzzN/\n/ny+9a1vHTnPa6+9xqOPPspPfvITrrvuOmbOnElXVxcnn3wyjz76aFVrzhP6ncAUSc2STgLmA+ty\nnr8T+ICkw0l+JSWfBZiZjWQtLS08/vjj3HLLLfziF7/gtNNOA2DBggVH/t64cSNQ/F7Bpz71KVpa\nWvj2t79Nd3f3kfPMnj2bsWPH0tLSwsGDB4+8ebS0tLB9+/aq1lwx9LMr9CXAemAr8FBEdEtaIeka\nAEkXSyoA84C7JXVnxx6kOLXzhKQuilNF363qCMzMauT8889n8+bNtLS0sGzZMlasWAEcfVvl4cdf\n+tKXWLJkCV1dXdx9991H3Ws/btw4AEaNGsXYsWOPHDNq1Ch6e3urWnOub+RGRAfQUda2vORxJ8Vp\nn76OfQz4yCBqNDM7Ie3atYszzjiD6667jlNOOYX77rsPgLVr17J06VLWrl3LpZdeCsBbb73FxIkT\nAVi9enWtSvYyDGZWH2pxd11XVxc33XTTkSv073znO8ydO5f33nuPGTNmcOjQIX74wx8CxQ9s582b\nx8SJE7nkkkv41a9+Nez1Aigi7404w6O1tTVq9SMqvmXTbOTYunUrU6dOrXUZf+TwD0GdeeaZQ/Yc\nfY1d0uaIaK10rNfeMTNLiKd3zMyqqNp321Sbr/TNzBLi0DczS4hD38wsIQ59M7OE+INcM6sPT36j\nuuebuaxil+3bt3PVVVexZcuW6j73EPKVvplZQhz6ZmaDcPDgQb7whS8wffp0Zs2axR/+8Ae++93v\ncvHFF3PBBRdw7bXX8u677wKwaNEivvjFLzJz5kzOPfdcnnrqKT7/+c8zdepUFi1aNCz1OvTNzAbh\n1Vdf5YYbbqC7u5vTTz+dH/3oR3zmM5+hs7OTF198kalTp3LPPfcc6b937142bNjAypUrufrqq7nx\nxhvp7u6mq6uLF154YcjrdeibmQ1Cc3MzF154IQAXXXQR27dvZ8uWLVx++eW0tLTwwAMPHLWM8tVX\nX40kWlpaOOuss2hpaWHUqFFMnz59WL7Y5dA3MxuEw8siA4wePZre3l4WLVrEnXfeSVdXF7feeusx\nl1EuPXYollHui0PfzKzK3n77bc4++2wOHDjAAw88UOtyjuJbNs2sPuS4xXK43H777cyYMYMPfehD\ntLS08Pbbb9e6pCO8tHIJL61sNnKcqEsrD4chX1pZUrukVyT1SFrax/4rJD0nqVfS3D72nyrp15Lu\nzPN8ZmY2NCqGvqTRwF3AbGAasEDStLJuO4BFwIPHOM3twFMDL9PMzKohz5V+G9ATEdsiYj+wBphT\n2iEitkfES8Ch8oMlXQScBfy8CvWamR1xok1PD4fBjjlP6E8EdpZsF7K2iiSNAv4bcFP/SzMzO7bx\n48ezZ8+epII/ItizZw/jx48f8Dny3L2jvp475/n/EuiIiJ1SX6fJnkBaDCwGmDx5cs5Tm1nKGhsb\nKRQK7N69u9alDKvx48fT2Ng44OPzhH4BmFSy3Qjsynn+S4HLJf0lcApwkqR3IuKoD4MjYhWwCop3\n7+Q8t5klbOzYsTQ3N9e6jBEnT+h3AlMkNQO/BuYDn8tz8oj488OPJS0CWssD38zMhk/FOf2I6AWW\nAOuBrcBDEdEtaYWkawAkXSypAMwD7pbUfewzmplZreT6Rm5EdAAdZW3LSx53Upz2Od457gPu63eF\nZmZWNV57x8wsIQ59M7OEOPTNzBLiVTaryAu2mdmJzlf6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJ\nceibmSXEoW9mlhCHvplZQvzlrH6o9OUrM7MTna/0zcwS4tA3M0tIUtM7np4xs9T5St/MLCG5Ql9S\nu6RXJPVI+qPfuJV0haTnJPVKmlvSfqGkjZK6Jb0k6bPVLN7MzPqnYuhLGg3cBcwGpgELJE0r67YD\nWAQ8WNb+LvAXETEdaAf+QdLpgy3azMwGJs+cfhvQExHbACStAeYALx/uEBHbs32HSg+MiF+WPN4l\n6Q2gAfjdoCs3M7N+yzO9MxHYWbJdyNr6RVIbcBLwWh/7FkvaJGnT7t27+3tqMzPLKU/oq4+26M+T\nSDobuB/4DxFxqHx/RKyKiNaIaG1oaOjPqc3MrB/yhH4BmFSy3QjsyvsEkk4FHgX+S0Q83b/yzMys\nmvKEficwRVKzpJOA+cC6PCfP+v8Y+H5E/I+Bl2lmZtVQMfQjohdYAqwHtgIPRUS3pBWSrgGQdLGk\nAjAPuFtSd3b4vweuABZJeiH7c+GQjMTMzCrK9Y3ciOgA
Osralpc87qQ47VN+3A+AHwyyRjMzqxJ/\nI9fMLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCFJrac/WJfsWHXc/U9PXjxMlZiZDYyv9M3MEuLQNzNL\niEPfzCwhDn0zs4Q49M3MEuLQNzNLiEPfzCwhvk+/RKX78M3MRjpf6ZuZJcShb2aWEIe+mVlCcoW+\npHZJr0jqkbS0j/1XSHpOUq+kuWX7Fkp6NfuzsFqFm5lZ/1UMfUmjgbuA2cA0YIGkaWXddgCLgAfL\njj0DuBWYAbQBt0r6wODLNjOzgchzpd8G9ETEtojYD6wB5pR2iIjtEfEScKjs2E8Bj0XEmxGxF3gM\naK9C3WZmNgB5Qn8isLNku5C15ZHrWEmLJW2StGn37t05T21mZv2VJ/TVR1vkPH+uYyNiVUS0RkRr\nQ0NDzlObmVl/5Qn9AjCpZLsR2JXz/IM51szMqixP6HcCUyQ1SzoJmA+sy3n+9cAsSR/IPsCdlbWZ\nmVkNVAz9iOgFllAM663AQxHRLWmFpGsAJF0sqQDMA+6W1J0d+yZwO8U3jk5gRdZmZmY1kGvtnYjo\nADrK2paXPO6kOHXT17H3AvcOokYzM6sSfyPXzCwhDn0zs4Q49M3MEuLQNzNLiEPfzCwhDn0zs4Q4\n9M3MEuLQNzNLiEPfzCwhDn0zs4TkWobBqmPlY7885r4bP3n+MFZiZqnylb6ZWUIc+mZmCfH0ThVd\nsmPVcfc/PXnxMFViZtY3X+mbmSXEoW9mlpBcoS+pXdIrknokLe1j/zhJa7P9z0hqytrHSlotqUvS\nVknLqlu+mZn1R8XQlzQauAuYDUwDFkiaVtbtemBvRJwHrATuyNrnAeMiogW4CPiPh98QzMxs+OW5\n0m8DeiJiW0TsB9YAc8r6zAFWZ48fBj4uSUAA75M0BjgZ2A/8a1UqNzOzfssT+hOBnSXbhaytzz7Z\nD6m/BUyg+Abwe+A3wA7g7/zD6GZmtZMn9NVHW+Ts0wYcBM4BmoG/kXTuHz2BtFjSJkmbdu/enaMk\nMzMbiDz36ReASSXbjcCuY/QpZFM5pwFvAp8DfhYRB4A3JP0z0ApsKz04IlYBqwBaW1vL31D658lv\nHGfntYM6tZnZSJfnSr8TmCKpWdJJwHxgXVmfdcDC7PFcYENEBMUpnStV9D7gEuBfqlO6mZn1V8XQ\nz+bolwDrga3AQxHRLWmFpGuybvcAEyT1AH8NHL6t8y7gFGALxTeP/x4RL1V5DGZmllOuZRgiogPo\nKGtbXvJ4H8XbM8uPe6evdjMzqw1/I9fMLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwS\n4tA3M0uIQ9/MLCEOfTOzhORahsGq45Idq46z9++GrQ4zS5ev9M3MEuLQNzNLiEPfzCwhDn0zs4Q4\n9M3MEuLQNzNLSK7Ql9Qu6RVJPZKW9rF/nKS12f5nJDWV7PuIpI2SuiV1SRpfvfLNzKw/Koa+pNEU\nf+t2NjANWCBpWlm364G9EXEesBK4Izt2DPAD4D9FxHTgY8CBqlVvZmb9kudKvw3oiYhtEbEfWAPM\nKeszB1idPX4Y+LgkAbOAlyLiRYCI2BMRB6tTupmZ9Vee0J8I7CzZLmRtffaJiF7gLWACcD4QktZL\nek7SzYMv2czMBirPMgzqoy1y9hkDXAZcDLwLPCFpc0Q8cdTB0mJgMcDkyZNzlGRmZgOR50q/AEwq\n2W4Edh2rTzaPfxrwZtb+VET8NiLeBTqAPyl/gohYFRGtEdHa0NDQ/1GYmVkueUK/E5giqVnSScB8\nYF1Zn3XAwuzxXGBDRASwHviIpH+TvRn8W+Dl6pRuZmb9VXF6JyJ6JS2hGOCjgXsjolvSCmBTRKwD\n7gHul9RD8Qp/fnbsXkl/T/GNI4COiHh0iMYysj35jePvn7lseOows7qWa2nliOigODVT2ra85PE+\nYN4xjv0Bxds2zcysxvyNXDOzhDj0zcwS4tA3M0uIQ9/MLCFJ/Ubu8X+j1sys/vlK38wsIQ59M7OE\nOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIUl9I3dE83r7ZlYFDv0TxMZte467\n/9JzJxz/BH5TMLMcPL1jZpaQXKEvqV3SK5J6JC3tY/84SWuz/c9IairbP1nSO5K+Up2yzcxsICqG\nvqTRwF3AbGAasEDStLJu1wN7I+I8YCVwR9n+lcD/Gny5ZmY2GHmu9NuAnojYFhH7gTXAnLI+c4DV\n2eOHgY9LEoCkPwO2Ad3VKdnMzAYqT+hPBHaWbBeytj77REQv8BYwQdL7gFuArx3vCSQtlrRJ0qbd\nu3fnrd3MzPopT+irj7bI2edrwMqIeOd4TxARqyKiNSJaGxoacpRkZmYDkeeWzQIwqWS7Edh1jD4F\nSWOA04A3gRnAXEnfAk4HDknaFxF3DrryY6h066OZWcryhH4nMEVSM/BrYD7wubI+64CFwEZgLrAh\nIgK4/HAHSbcB7wxl4JuZ2fFVDP2I6JW0BFgPjAbujYhuSSuATRGxDrgHuF9SD8Ur/PlDWbSZmQ1M\nrm/kRkQH0FHWtrzk8T5gXoVz3DaA+iwz6G/smpnhb+SamSXFoW9mlhCHvplZQhz6ZmYJceibmSXE\noW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJ\nyfUjKpLagX+k+MtZ34uIb5btHwd8H7gI2AN8NiK2S/ok8E3gJGA/cFNEbKhi/ZbXk984/v6Zy4an\nDjOrqYpX+pJGA3cBs4FpwAJJ08q6XQ/sjYjzgJXAHVn7b4GrI6KF4m/o3l+tws3MrP/yTO+0AT0R\nsS0i9gNrgDllfeYAq7PHDwMfl6SIeD4idmXt3cD47F8FZmZWA3lCfyKws2S7kLX12ScieoG3gPIf\nbb0WeD4i3htYqWZmNlh55vTVR1v0p4+k6RSnfGb1+QTSYmAxwOTJk3OUZGZmA5HnSr8ATCrZbgR2\nHauPpDHAacCb2XYj8GPgLyLitb6eICJWRURrRLQ2NDT0bwRmZpZbniv9TmCKpGbg18B84HNlfdZR\n/KB2IzAX2BARIel04FFgWUT8c/XKtqrz3T1mSah4pZ/N0S8B1gNbgYciolvSCknXZN3uASZI6gH+\nGliatS8BzgP+q6QXsj8frPoozMwsl1z36UdEB9BR1ra85PE+YF4fx/0t8LeDrNHMzKokV+ibDYqn\njsxOGA59y+d4we3QNhsxHPo2eJWu5M3shOEF18zMEuLQNzNLiEPfzCwhntOvExu37Tnu/kvPLV8K\nycxS5Ct9M7OEOPTNzBLi0DczS4hD38wsIf4g12rPyzSYDRtf6ZuZJcShb2aWEE/vJKLSffyV1PQ+\n/8FM/3jqyOwoDn0b+bzgm1lunt4xM0tIrit9Se3APwKjge9FxDfL9o8Dvg9cBOwBPhsR27N9y4Dr\ngYPAf46I9VW
r3mywBvuvBE8P2QhTMfQljQbuAj4JFIBOSesi4uWSbtcDeyPiPEnzgTuAz0qaRvGH\n1KcD5wCPSzo/Ig5WeyBmNTGYN41Kbxj+PMKGQJ4r/TagJyK2AUhaA8wBSkN/DnBb9vhh4E5JytrX\nRMR7wK+yH05vAzZWp3wbLoP9IPh4Kn1IPJjnruuF5k7kf6X4DeuElSf0JwI7S7YLwIxj9YmIXklv\nAROy9qfLjp044GrN6smJ/AF0rUN7MD/PWevaj+cEqC1P6KuPtsjZJ8+xSFoMLM4235H0So66juVM\n4LeDOH4kSm3MqY0XhmTMX63RsbmPP8aYh+W5a+Srg/nv/KE8nfKEfgGYVLLdCOw6Rp+CpDHAacCb\nOY8lIlYBq/IUXImkTRHRWo1zjRSpjTm18YLHnIrhGHOeWzY7gSmSmiWdRPGD2XVlfdYBC7PHc4EN\nERFZ+3xJ4yQ1A1OAZ6tTupmZ9VfFK/1sjn4JsJ7iLZv3RkS3pBXApohYB9wD3J99UPsmxTcGsn4P\nUfzQtxe4wXfumJnVTq779COiA+goa1te8ngfMO8Yx34d+PogauyvqkwTjTCpjTm18YLHnIohH7OK\nszBmZpYCL8NgZpaQugl9Se2SXpHUI2lpresZCpLulfSGpC0lbWdIekzSq9nfH6hljdUmaZKkJyVt\nldQt6a+y9rodt6Txkp6V9GI25q9l7c2SnsnGvDa7saJuSBot6XlJ/zPbruvxAkjaLqlL0guSNmVt\nQ/rarovQL1kqYjYwDViQLQFRb+4D2svalgJPRMQU4Ilsu570An8TEVOBS4Absv+29Tzu94ArI+IC\n4EKgXdIlFJc3WZmNeS/F5U/qyV8BW0u26328h82MiAtLbtUc0td2XYQ+JUtFRMR+4PBSEXUlIv4P\nxbujSs0BVmePVwN/NqxFDbGI+E1EPJc9fptiKEykjscdRe9km2OzPwFcSXGZE6izMUtqBD4NfC/b\nFnU83gqG9LVdL6Hf11IRqSz3cFZE/AaKAQl8sMb1DBlJTcBHgWeo83FnUx0vAG8AjwGvAb+LiN6s\nS729xv8BuBk4lG1PoL7He1gAP5e0OVuZAIb4tV0vP6KSa7kHG7kknQL8CPhyRPxr8UKwfmXfZ7lQ\n0unAj4GpfXUb3qqGhqSrgDciYrOkjx1u7qNrXYy3zJ9GxC5JHwQek/QvQ/2E9XKln2u5hzr1uqSz\nAbK/36hxPVUnaSzFwH8gIh7Jmut+3AAR8Tvgf1P8POP0bJkTqK/X+J8C10jaTnFq9kqKV/71Ot4j\nImJX9vcbFN/c2xji13a9hH6epSLqVekSGAuBn9SwlqrL5nbvAbZGxN+X7KrbcUtqyK7wkXQy8AmK\nn2U8SXGZE6ijMUfEsohojIgmiv/vboiIP6dOx3uYpPdJev/hx8AsYAtD/Nqumy9nSfp3FK8ODi8V\nMZzfAh4Wkn4IfIzi6oOvA7cC/wQ8BEwGdgDzIqL8w94RS9JlwC+ALv7/fO9XKc7r1+W4JX2E4gd4\noylemD0UESsknUvxSvgM4Hnguuy3KupGNr3zlYi4qt7Hm43vx9nmGODBiPi6pAkM4Wu7bkLfzMwq\nq5fpHTMzy8Ghb2aWEIe+mVlCHPpmZglx6JuZJcShb2aWEIe+mVlCHPpmZgn5fyjZgnDU1A4AAAAA\nAElFTkSuQmCC\n", 288 | "text/plain": [ 289 | "" 290 | ] 291 | }, 292 | "metadata": {}, 293 | "output_type": "display_data" 294 | } 295 | ], 296 | "source": [ 297 | "bins = np.linspace(0, 50, 40)\n", 298 | "\n", 299 | "pyplot.hist(data[data['label']=='spam']['punct%'], bins, alpha=0.5, normed=True, label='spam')\n", 300 | "pyplot.hist(data[data['label']=='ham']['punct%'], bins, alpha=0.5, normed=True, label='ham')\n", 301 | "pyplot.legend(loc='upper right')\n", 302 | "pyplot.show()" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "collapsed": true, 310 | "jupyter": { 311 | "outputs_hidden": true 312 | } 313 | }, 314 | "outputs": [], 315 | "source": [] 316 | } 317 | ], 318 | "metadata": { 319 | "kernelspec": { 320 | "display_name": "Python 3 (ipykernel)", 321 | "language": "python", 322 | "name": "python3" 323 | }, 324 | "language_info": { 325 | "codemirror_mode": { 326 | "name": "ipython", 327 | "version": 3 328 | }, 329 | "file_extension": ".py", 330 | "mimetype": "text/x-python", 331 | "name": "python", 332 | "nbconvert_exporter": "python", 333 | "pygments_lexer": "ipython3", 334 | "version": "3.11.0" 335 | } 336 | }, 337 | "nbformat": 4, 338 | "nbformat_minor": 4 339 | } 340 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.1. 
Building a basic Random Forest Model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Building a basic Random Forest model" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in & clean text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/html": [ 25 | "
\n", 26 | "\n", 39 | "\n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01284.70.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1494.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2623.20.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3287.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41354.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 189 | "

5 rows × 8106 columns

\n", 190 | "
" 191 | ], 192 | "text/plain": [ 193 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 194 | "0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 195 | "1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 196 | "2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 197 | "3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 198 | "4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 199 | "\n", 200 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 201 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 202 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 203 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 204 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 205 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 206 | "\n", 207 | "[5 rows x 8106 columns]" 208 | ] 209 | }, 210 | "execution_count": 2, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "import nltk\n", 217 | "import pandas as pd\n", 218 | "import re\n", 219 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 220 | "import string\n", 221 | "\n", 222 | "stopwords = nltk.corpus.stopwords.words('english')\n", 223 | "ps = nltk.PorterStemmer()\n", 224 | "\n", 225 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 226 | "data.columns = ['label', 'body_text']\n", 227 | "\n", 228 | "def count_punct(text):\n", 229 | " count = sum([1 for char in text if char in string.punctuation])\n", 230 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 231 | "\n", 232 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 233 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 234 | "\n", 235 | "def clean_text(text):\n", 236 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 237 | " tokens = re.split('\\W+', text)\n", 238 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 239 | " return text\n", 240 | "\n", 241 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 242 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 243 | "\n", 244 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 245 | "X_features.head()" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### Explore RandomForestClassifier Attributes & Hyperparameters" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 5, 258 | "metadata": { 259 | "collapsed": true, 260 | "jupyter": { 261 | "outputs_hidden": true 262 | } 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "from sklearn.ensemble import RandomForestClassifier" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 6, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_estimator_type', '_get_param_names', '_make_estimator', '_set_oob_score', '_validate_X_predict', 
'_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params']\n", 279 | "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", 280 | " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", 281 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 282 | " min_samples_leaf=1, min_samples_split=2,\n", 283 | " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", 284 | " oob_score=False, random_state=None, verbose=0,\n", 285 | " warm_start=False)\n" 286 | ] 287 | } 288 | ], 289 | "source": [ 290 | "print(dir(RandomForestClassifier))\n", 291 | "print(RandomForestClassifier())" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "### Explore RandomForestClassifier through Cross-Validation" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 11, 304 | "metadata": { 305 | "collapsed": true, 306 | "jupyter": { 307 | "outputs_hidden": true 308 | } 309 | }, 310 | "outputs": [], 311 | "source": [ 312 | "from sklearn.model_selection import KFold, cross_val_score" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 12, 318 | "metadata": {}, 319 | "outputs": [ 320 | { 321 | "data": { 322 | "text/plain": [ 323 | "array([ 0.96947935, 0.97486535, 0.97124888, 0.95507637, 0.96855346])" 324 | ] 325 | }, 326 | "execution_count": 12, 327 | "metadata": {}, 328 | "output_type": "execute_result" 329 | } 330 | ], 331 | "source": [ 332 | "rf = RandomForestClassifier(n_jobs=-1)\n", 333 | "k_fold = KFold(n_splits=5)\n", 334 | "cross_val_score(rf, X_features, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": { 341 | "collapsed": true, 342 | "jupyter": { 343 | "outputs_hidden": true 344 | } 345 | }, 346 | "outputs": [], 347 | "source": [] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "Python 3 (ipykernel)", 353 | "language": "python", 354 | "name": "python3" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.11.0" 367 | } 368 | }, 369 | "nbformat": 4, 370 | "nbformat_minor": 4 371 | } 372 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.2. Random Forest on a holdout test set.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Random Forest on a holdout test set" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in & clean text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 7, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/html": [ 25 | "
\n", 26 | "\n", 39 | "\n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01284.70.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1494.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2623.20.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3287.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41354.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 189 | "

5 rows × 8106 columns

\n", 190 | "
" 191 | ], 192 | "text/plain": [ 193 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 194 | "0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 195 | "1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 196 | "2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 197 | "3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 198 | "4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 199 | "\n", 200 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 201 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 202 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 203 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 204 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 205 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 206 | "\n", 207 | "[5 rows x 8106 columns]" 208 | ] 209 | }, 210 | "execution_count": 7, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "import nltk\n", 217 | "import pandas as pd\n", 218 | "import re\n", 219 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 220 | "import string\n", 221 | "\n", 222 | "stopwords = nltk.corpus.stopwords.words('english')\n", 223 | "ps = nltk.PorterStemmer()\n", 224 | "\n", 225 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 226 | "data.columns = ['label', 'body_text']\n", 227 | "\n", 228 | "def count_punct(text):\n", 229 | " count = sum([1 for char in text if char in string.punctuation])\n", 230 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 231 | "\n", 232 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 233 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 234 | "\n", 235 | "def clean_text(text):\n", 236 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 237 | " tokens = re.split('\\W+', text)\n", 238 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 239 | " return text\n", 240 | "\n", 241 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 242 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 243 | "\n", 244 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 245 | "X_features.head()" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### Explore RandomForestClassifier through Holdout Set" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 8, 258 | "metadata": { 259 | "collapsed": true, 260 | "jupyter": { 261 | "outputs_hidden": true 262 | } 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 267 | "from sklearn.model_selection import train_test_split" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 9, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 12, 282 | "metadata": { 283 | "collapsed": true, 284 | "jupyter": { 285 | "outputs_hidden": true 286 | } 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "from sklearn.ensemble import RandomForestClassifier\n", 291 | "\n", 292 | "rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)\n", 293 | "rf_model = rf.fit(X_train, y_train)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 14, 299 | 
"metadata": {}, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "[(0.071067778644078275, 'body_len'),\n", 305 | " (0.040562335897847433, 7350),\n", 306 | " (0.035736155950968088, 3134),\n", 307 | " (0.025830800898315055, 2031),\n", 308 | " (0.020706891454006282, 1881),\n", 309 | " (0.020667459644832679, 5724),\n", 310 | " (0.020246234600271286, 4796),\n", 311 | " (0.016709671666146234, 5988),\n", 312 | " (0.016333631268556359, 1803),\n", 313 | " (0.015520152981795897, 2171)]" 314 | ] 315 | }, 316 | "execution_count": 14, 317 | "metadata": {}, 318 | "output_type": "execute_result" 319 | } 320 | ], 321 | "source": [ 322 | "sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 15, 328 | "metadata": { 329 | "collapsed": true, 330 | "jupyter": { 331 | "outputs_hidden": true 332 | } 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "y_pred = rf_model.predict(X_test)\n", 337 | "precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 16, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | "Precision: 1.0 / Recall: 0.552 / Accuracy: 0.934\n" 350 | ] 351 | } 352 | ], 353 | "source": [ 354 | "print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),\n", 355 | " round(recall, 3),\n", 356 | " round((y_pred==y_test).sum() / len(y_pred),3)))" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "metadata": { 363 | "collapsed": true, 364 | "jupyter": { 365 | "outputs_hidden": true 366 | } 367 | }, 368 | "outputs": [], 369 | "source": [] 370 | } 371 | ], 372 | "metadata": { 373 | "kernelspec": { 374 | "display_name": "Python 3 (ipykernel)", 375 | "language": "python", 376 | "name": "python3" 377 | }, 378 | "language_info": { 379 | "codemirror_mode": { 380 | "name": "ipython", 381 | "version": 3 382 | }, 383 | "file_extension": ".py", 384 | "mimetype": "text/x-python", 385 | "name": "python", 386 | "nbconvert_exporter": "python", 387 | "pygments_lexer": "ipython3", 388 | "version": "3.11.0" 389 | } 390 | }, 391 | "nbformat": 4, 392 | "nbformat_minor": 4 393 | } 394 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.3. Explore Random Forest Model with Grid-Search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Explore Random Forest model with grid-search" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### Read in & clean text" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
\n", 33 | "\n", 46 | "\n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01280.0470.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1490.0410.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2620.0320.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3280.0710.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41350.0440.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 196 | "

5 rows × 8106 columns

\n", 197 | "
" 198 | ], 199 | "text/plain": [ 200 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 201 | "0 128 0.047 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 202 | "1 49 0.041 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 203 | "2 62 0.032 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 204 | "3 28 0.071 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 205 | "4 135 0.044 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 206 | "\n", 207 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 208 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 209 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 210 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 211 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 212 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 213 | "\n", 214 | "[5 rows x 8106 columns]" 215 | ] 216 | }, 217 | "execution_count": 1, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "import nltk\n", 224 | "import pandas as pd\n", 225 | "import re\n", 226 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 227 | "import string\n", 228 | "\n", 229 | "stopwords = nltk.corpus.stopwords.words('english')\n", 230 | "ps = nltk.PorterStemmer()\n", 231 | "\n", 232 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 233 | "data.columns = ['label', 'body_text']\n", 234 | "\n", 235 | "def count_punct(text):\n", 236 | " count = sum([1 for char in text if char in string.punctuation])\n", 237 | " return round(count/(len(text) - text.count(\" \")), 3)\n", 238 | "\n", 239 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 240 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 241 | "\n", 242 | "def clean_text(text):\n", 243 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 244 | " tokens = re.split('\\W+', text)\n", 245 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 246 | " return text\n", 247 | "\n", 248 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 249 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 250 | "\n", 251 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 252 | "X_features.head()" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "### Build our own Grid-search" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 2, 265 | "metadata": { 266 | "collapsed": true, 267 | "jupyter": { 268 | "outputs_hidden": true 269 | } 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "from sklearn.ensemble import RandomForestClassifier\n", 274 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 275 | "from sklearn.model_selection import train_test_split" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 3, 281 | "metadata": { 282 | "collapsed": true, 283 | "jupyter": { 284 | "outputs_hidden": true 285 | } 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 4, 295 | "metadata": { 296 | "collapsed": true, 297 | "jupyter": { 298 | "outputs_hidden": true 299 | } 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "def train_RF(n_est, depth):\n", 304 | " rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)\n", 305 | " rf_model = rf.fit(X_train, 
y_train)\n", 306 | " y_pred = rf_model.predict(X_test)\n", 307 | " precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 308 | " print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 309 | " n_est, depth, round(precision, 3), round(recall, 3),\n", 310 | " round((y_pred==y_test).sum() / len(y_pred), 3)))" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 5, 316 | "metadata": {}, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "Est: 10 / Depth: 10 ---- Precision: 1.0 / Recall: 0.216 / Accuracy: 0.892\n", 323 | "Est: 10 / Depth: 20 ---- Precision: 0.975 / Recall: 0.516 / Accuracy: 0.932\n", 324 | "Est: 10 / Depth: 30 ---- Precision: 1.0 / Recall: 0.647 / Accuracy: 0.952\n", 325 | "Est: 10 / Depth: None ---- Precision: 0.984 / Recall: 0.784 / Accuracy: 0.969\n", 326 | "Est: 50 / Depth: 10 ---- Precision: 1.0 / Recall: 0.235 / Accuracy: 0.895\n", 327 | "Est: 50 / Depth: 20 ---- Precision: 1.0 / Recall: 0.562 / Accuracy: 0.94\n", 328 | "Est: 50 / Depth: 30 ---- Precision: 1.0 / Recall: 0.667 / Accuracy: 0.954\n", 329 | "Est: 50 / Depth: None ---- Precision: 0.985 / Recall: 0.843 / Accuracy: 0.977\n", 330 | "Est: 100 / Depth: 10 ---- Precision: 1.0 / Recall: 0.242 / Accuracy: 0.896\n", 331 | "Est: 100 / Depth: 20 ---- Precision: 1.0 / Recall: 0.601 / Accuracy: 0.945\n", 332 | "Est: 100 / Depth: 30 ---- Precision: 0.981 / Recall: 0.686 / Accuracy: 0.955\n", 333 | "Est: 100 / Depth: None ---- Precision: 1.0 / Recall: 0.83 / Accuracy: 0.977\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "for n_est in [10, 50, 100]:\n", 339 | " for depth in [10, 20, 30, None]:\n", 340 | " train_RF(n_est, depth)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": { 347 | "collapsed": true, 348 | "jupyter": { 349 | "outputs_hidden": true 350 | } 351 | }, 352 | "outputs": [], 353 | "source": [] 354 | } 355 | ], 356 | "metadata": { 357 | "kernelspec": { 358 | "display_name": "Python 3 (ipykernel)", 359 | "language": "python", 360 | "name": "python3" 361 | }, 362 | "language_info": { 363 | "codemirror_mode": { 364 | "name": "ipython", 365 | "version": 3 366 | }, 367 | "file_extension": ".py", 368 | "mimetype": "text/x-python", 369 | "name": "python", 370 | "nbconvert_exporter": "python", 371 | "pygments_lexer": "ipython3", 372 | "version": "3.11.0" 373 | } 374 | }, 375 | "nbformat": 4, 376 | "nbformat_minor": 4 377 | } 378 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.5. Explore Gradient Boosting model with Grid-Search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Explore Gradient Boosting model with grid-search" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### Read in & clean text" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
\n", 33 | "\n", 46 | "\n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01284.70.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1494.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2623.20.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3287.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41354.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 196 | "

5 rows × 8106 columns

\n", 197 | "
" 198 | ], 199 | "text/plain": [ 200 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 201 | "0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 202 | "1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 203 | "2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 204 | "3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 205 | "4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 206 | "\n", 207 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 208 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 209 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 210 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 211 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 212 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 213 | "\n", 214 | "[5 rows x 8106 columns]" 215 | ] 216 | }, 217 | "execution_count": 1, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "import nltk\n", 224 | "import pandas as pd\n", 225 | "import re\n", 226 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 227 | "import string\n", 228 | "\n", 229 | "stopwords = nltk.corpus.stopwords.words('english')\n", 230 | "ps = nltk.PorterStemmer()\n", 231 | "\n", 232 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 233 | "data.columns = ['label', 'body_text']\n", 234 | "\n", 235 | "def count_punct(text):\n", 236 | " count = sum([1 for char in text if char in string.punctuation])\n", 237 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 238 | "\n", 239 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 240 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 241 | "\n", 242 | "def clean_text(text):\n", 243 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 244 | " tokens = re.split('\\W+', text)\n", 245 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 246 | " return text\n", 247 | "\n", 248 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 249 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 250 | "\n", 251 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 252 | "X_features.head()" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "### Explore GradientBoostingClassifier Attributes & Hyperparameters" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 2, 265 | "metadata": { 266 | "collapsed": true, 267 | "jupyter": { 268 | "outputs_hidden": true 269 | } 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "from sklearn.ensemble import GradientBoostingClassifier" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 3, 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "name": "stdout", 283 | "output_type": "stream", 284 | "text": [ 285 | "['_SUPPORTED_LOSS', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_check_initialized', '_check_params', '_clear_state', '_decision_function', '_estimator_type', 
'_fit_stage', '_fit_stages', '_get_param_names', '_init_decision_function', '_init_state', '_is_initialized', '_make_estimator', '_resize_state', '_staged_decision_function', '_validate_estimator', '_validate_y', 'apply', 'decision_function', 'feature_importances_', 'fit', 'fit_transform', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 'staged_decision_function', 'staged_predict', 'staged_predict_proba', 'transform']\n", 286 | "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", 287 | " learning_rate=0.1, loss='deviance', max_depth=3,\n", 288 | " max_features=None, max_leaf_nodes=None,\n", 289 | " min_impurity_split=1e-07, min_samples_leaf=1,\n", 290 | " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", 291 | " n_estimators=100, presort='auto', random_state=None,\n", 292 | " subsample=1.0, verbose=0, warm_start=False)\n" 293 | ] 294 | } 295 | ], 296 | "source": [ 297 | "print(dir(GradientBoostingClassifier))\n", 298 | "print(GradientBoostingClassifier())" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "### Build our own Grid-search" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 4, 311 | "metadata": { 312 | "collapsed": true, 313 | "jupyter": { 314 | "outputs_hidden": true 315 | } 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 320 | "from sklearn.model_selection import train_test_split" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 5, 326 | "metadata": { 327 | "collapsed": true, 328 | "jupyter": { 329 | "outputs_hidden": true 330 | } 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 6, 340 | "metadata": { 341 | "collapsed": true, 342 | "jupyter": { 343 | "outputs_hidden": true 344 | } 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "def train_GB(est, max_depth, lr):\n", 349 | " gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)\n", 350 | " gb_model = gb.fit(X_train, y_train)\n", 351 | " y_pred = gb_model.predict(X_test)\n", 352 | " precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 353 | " print('Est: {} / Depth: {} / LR: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 354 | " est, max_depth, lr, round(precision, 3), round(recall, 3), \n", 355 | " round((y_pred==y_test).sum()/len(y_pred), 3)))" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 7, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stderr", 365 | "output_type": "stream", 366 | "text": [ 367 | "/Users/djedamski/.pyenv/versions/3.5.3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples.\n", 368 | " 'precision', 'predicted', average, warn_for)\n" 369 | ] 370 | }, 371 | { 372 | "name": "stdout", 373 | "output_type": "stream", 374 | "text": [ 375 | "Est: 50 / Depth: 3 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868\n", 376 | "Est: 50 / Depth: 3 / LR: 0.1 ---- Precision: 1.0 / Recall: 0.687 / Accuracy: 0.959\n", 377 | "Est: 50 / Depth: 3 / LR: 1 ---- Precision: 0.88 / Recall: 0.796 / Accuracy: 0.959\n", 378 | "Est: 
50 / Depth: 7 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868\n", 379 | "Est: 50 / Depth: 7 / LR: 0.1 ---- Precision: 0.968 / Recall: 0.83 / Accuracy: 0.974\n", 380 | "Est: 50 / Depth: 7 / LR: 1 ---- Precision: 0.917 / Recall: 0.823 / Accuracy: 0.967\n", 381 | "Est: 50 / Depth: 11 / LR: 0.01 ---- Precision: 1.0 / Recall: 0.027 / Accuracy: 0.872\n", 382 | "Est: 50 / Depth: 11 / LR: 0.1 ---- Precision: 0.962 / Recall: 0.871 / Accuracy: 0.978\n", 383 | "Est: 50 / Depth: 11 / LR: 1 ---- Precision: 0.926 / Recall: 0.85 / Accuracy: 0.971\n", 384 | "Est: 50 / Depth: 15 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868\n", 385 | "Est: 50 / Depth: 15 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.857 / Accuracy: 0.978\n", 386 | "Est: 50 / Depth: 15 / LR: 1 ---- Precision: 0.919 / Recall: 0.85 / Accuracy: 0.97\n", 387 | "Est: 100 / Depth: 3 / LR: 0.01 ---- Precision: 0.987 / Recall: 0.51 / Accuracy: 0.934\n", 388 | "Est: 100 / Depth: 3 / LR: 0.1 ---- Precision: 0.991 / Recall: 0.776 / Accuracy: 0.969\n", 389 | "Est: 100 / Depth: 3 / LR: 1 ---- Precision: 0.901 / Recall: 0.803 / Accuracy: 0.962\n", 390 | "Est: 100 / Depth: 7 / LR: 0.01 ---- Precision: 0.989 / Recall: 0.612 / Accuracy: 0.948\n", 391 | "Est: 100 / Depth: 7 / LR: 0.1 ---- Precision: 0.985 / Recall: 0.871 / Accuracy: 0.981\n", 392 | "Est: 100 / Depth: 7 / LR: 1 ---- Precision: 0.922 / Recall: 0.81 / Accuracy: 0.966\n", 393 | "Est: 100 / Depth: 11 / LR: 0.01 ---- Precision: 0.991 / Recall: 0.741 / Accuracy: 0.965\n", 394 | "Est: 100 / Depth: 11 / LR: 0.1 ---- Precision: 0.984 / Recall: 0.864 / Accuracy: 0.98\n", 395 | "Est: 100 / Depth: 11 / LR: 1 ---- Precision: 0.912 / Recall: 0.844 / Accuracy: 0.969\n", 396 | "Est: 100 / Depth: 15 / LR: 0.01 ---- Precision: 0.992 / Recall: 0.796 / Accuracy: 0.972\n", 397 | "Est: 100 / Depth: 15 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.871 / Accuracy: 0.98\n", 398 | "Est: 100 / Depth: 15 / LR: 1 ---- Precision: 0.932 / Recall: 0.844 / Accuracy: 0.971\n", 399 | "Est: 150 / Depth: 3 / LR: 0.01 ---- Precision: 0.988 / Recall: 0.537 / Accuracy: 0.938\n", 400 | "Est: 150 / Depth: 3 / LR: 0.1 ---- Precision: 0.992 / Recall: 0.81 / Accuracy: 0.974\n", 401 | "Est: 150 / Depth: 3 / LR: 1 ---- Precision: 0.902 / Recall: 0.816 / Accuracy: 0.964\n", 402 | "Est: 150 / Depth: 7 / LR: 0.01 ---- Precision: 0.99 / Recall: 0.687 / Accuracy: 0.958\n", 403 | "Est: 150 / Depth: 7 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.857 / Accuracy: 0.978\n", 404 | "Est: 150 / Depth: 7 / LR: 1 ---- Precision: 0.937 / Recall: 0.81 / Accuracy: 0.968\n", 405 | "Est: 150 / Depth: 11 / LR: 0.01 ---- Precision: 0.983 / Recall: 0.796 / Accuracy: 0.971\n", 406 | "Est: 150 / Depth: 11 / LR: 0.1 ---- Precision: 0.985 / Recall: 0.871 / Accuracy: 0.981\n", 407 | "Est: 150 / Depth: 11 / LR: 1 ---- Precision: 0.904 / Recall: 0.837 / Accuracy: 0.967\n", 408 | "Est: 150 / Depth: 15 / LR: 0.01 ---- Precision: 0.975 / Recall: 0.796 / Accuracy: 0.97\n", 409 | "Est: 150 / Depth: 15 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.864 / Accuracy: 0.979\n", 410 | "Est: 150 / Depth: 15 / LR: 1 ---- Precision: 0.913 / Recall: 0.857 / Accuracy: 0.97\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "for n_est in [50, 100, 150]:\n", 416 | " for max_depth in [3, 7, 11, 15]:\n", 417 | " for lr in [0.01, 0.1, 1]:\n", 418 | " train_GB(n_est, max_depth, lr)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": { 425 | "collapsed": true, 426 | "jupyter": { 427 | "outputs_hidden": true 428 | } 
429 | }, 430 | "outputs": [], 431 | "source": [] 432 | } 433 | ], 434 | "metadata": { 435 | "kernelspec": { 436 | "display_name": "Python 3 (ipykernel)", 437 | "language": "python", 438 | "name": "python3" 439 | }, 440 | "language_info": { 441 | "codemirror_mode": { 442 | "name": "ipython", 443 | "version": 3 444 | }, 445 | "file_extension": ".py", 446 | "mimetype": "text/x-python", 447 | "name": "python", 448 | "nbconvert_exporter": "python", 449 | "pygments_lexer": "ipython3", 450 | "version": "3.11.0" 451 | } 452 | }, 453 | "nbformat": 4, 454 | "nbformat_minor": 4 455 | } 456 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.7. Model Selection.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Model selection" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in & clean text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true, 22 | "jupyter": { 23 | "outputs_hidden": true 24 | } 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import nltk\n", 29 | "import pandas as pd\n", 30 | "import re\n", 31 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 32 | "import string\n", 33 | "\n", 34 | "stopwords = nltk.corpus.stopwords.words('english')\n", 35 | "ps = nltk.PorterStemmer()\n", 36 | "\n", 37 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 38 | "data.columns = ['label', 'body_text']\n", 39 | "\n", 40 | "def count_punct(text):\n", 41 | " count = sum([1 for char in text if char in string.punctuation])\n", 42 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 43 | "\n", 44 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 45 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 46 | "\n", 47 | "def clean_text(text):\n", 48 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 49 | " tokens = re.split('\\W+', text)\n", 50 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 51 | " return text" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Split into train/test" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": { 65 | "collapsed": true, 66 | "jupyter": { 67 | "outputs_hidden": true 68 | } 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "from sklearn.model_selection import train_test_split\n", 73 | "\n", 74 | "X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### Vectorize text" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/html": [ 92 | "
\n", 93 | "\n", 106 | "\n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | "
body_lenpunct%01234567...7153715471557156715771587159716071617162
0190.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
11153.50.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
21062.80.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3293.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41524.60.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 256 | "

5 rows × 7165 columns

\n", 257 | "
" 258 | ], 259 | "text/plain": [ 260 | " body_len punct% 0 1 2 3 4 5 6 7 ... 7153 7154 \\\n", 261 | "0 19 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 262 | "1 115 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 263 | "2 106 2.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 264 | "3 29 3.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 265 | "4 152 4.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 266 | "\n", 267 | " 7155 7156 7157 7158 7159 7160 7161 7162 \n", 268 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 269 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 270 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 271 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 272 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 273 | "\n", 274 | "[5 rows x 7165 columns]" 275 | ] 276 | }, 277 | "execution_count": 3, 278 | "metadata": {}, 279 | "output_type": "execute_result" 280 | } 281 | ], 282 | "source": [ 283 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 284 | "tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])\n", 285 | "\n", 286 | "tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])\n", 287 | "tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])\n", 288 | "\n", 289 | "X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), \n", 290 | " pd.DataFrame(tfidf_train.toarray())], axis=1)\n", 291 | "X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), \n", 292 | " pd.DataFrame(tfidf_test.toarray())], axis=1)\n", 293 | "\n", 294 | "X_train_vect.head()" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "### Final evaluation of models" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 4, 307 | "metadata": { 308 | "collapsed": true, 309 | "jupyter": { 310 | "outputs_hidden": true 311 | } 312 | }, 313 | "outputs": [], 314 | "source": [ 315 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", 316 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 317 | "import time" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 5, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | "Fit time: 1.782 / Predict time: 0.213 ---- Precision: 1.0 / Recall: 0.81 / Accuracy: 0.975\n" 330 | ] 331 | } 332 | ], 333 | "source": [ 334 | "rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)\n", 335 | "\n", 336 | "start = time.time()\n", 337 | "rf_model = rf.fit(X_train_vect, y_train)\n", 338 | "end = time.time()\n", 339 | "fit_time = (end - start)\n", 340 | "\n", 341 | "start = time.time()\n", 342 | "y_pred = rf_model.predict(X_test_vect)\n", 343 | "end = time.time()\n", 344 | "pred_time = (end - start)\n", 345 | "\n", 346 | "precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 347 | "print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 348 | " round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 6, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "Fit time: 186.61 / Predict time: 0.135 ---- Precision: 0.889 / Recall: 0.816 / Accuracy: 0.962\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 
| "gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)\n", 366 | "\n", 367 | "start = time.time()\n", 368 | "gb_model = gb.fit(X_train_vect, y_train)\n", 369 | "end = time.time()\n", 370 | "fit_time = (end - start)\n", 371 | "\n", 372 | "start = time.time()\n", 373 | "y_pred = gb_model.predict(X_test_vect)\n", 374 | "end = time.time()\n", 375 | "pred_time = (end - start)\n", 376 | "\n", 377 | "precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 378 | "print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 379 | " round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "collapsed": true, 387 | "jupyter": { 388 | "outputs_hidden": true 389 | } 390 | }, 391 | "outputs": [], 392 | "source": [] 393 | } 394 | ], 395 | "metadata": { 396 | "kernelspec": { 397 | "display_name": "Python 3 (ipykernel)", 398 | "language": "python", 399 | "name": "python3" 400 | }, 401 | "language_info": { 402 | "codemirror_mode": { 403 | "name": "ipython", 404 | "version": 3 405 | }, 406 | "file_extension": ".py", 407 | "mimetype": "text/x-python", 408 | "name": "python", 409 | "nbconvert_exporter": "python", 410 | "pygments_lexer": "ipython3", 411 | "version": "3.9.13" 412 | } 413 | }, 414 | "nbformat": 4, 415 | "nbformat_minor": 4 416 | } 417 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/empty: -------------------------------------------------------------------------------- 1 | hi 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Kshitiz Pandya 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /page.html: -------------------------------------------------------------------------------- 1 | hi 2 | -------------------------------------------------------------------------------- /test output/empty: -------------------------------------------------------------------------------- 1 | empty 2 | -------------------------------------------------------------------------------- /test output/giphy.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KshitizPandya/Natural-Language-Processing-with-Machine-Learning/896b6ca491e7f41fe8155846e6f79bc55a467280/test output/giphy.gif -------------------------------------------------------------------------------- /test output/output_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KshitizPandya/Natural-Language-Processing-with-Machine-Learning/896b6ca491e7f41fe8155846e6f79bc55a467280/test output/output_1.png -------------------------------------------------------------------------------- /test output/output_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KshitizPandya/Natural-Language-Processing-with-Machine-Learning/896b6ca491e7f41fe8155846e6f79bc55a467280/test output/output_2.png --------------------------------------------------------------------------------