├── 1. NLP Basics ├── 1.1. what is NLP.ipynb ├── 1.2. reading in text data & why do we need cleaning.ipynb ├── 1.3. How to explore a dataset.ipynb ├── 1.4. learning how to use regular expressions.ipynb ├── 1.5. implementing a pipeline to clean text.ipynb ├── SMSSpamCollection.tsv └── SMSSpamCollection_cleaned.tsv ├── 2. Data Cleaning ├── 2.1. stemming.ipynb ├── 2.2. lemmatizing.ipynb └── SMSSpamCollection.tsv ├── 3. Vectorizing Raw Data ├── 3.1. count vectoriztion.ipynb ├── 3.2. N_grams.ipynb ├── 3.3. TF-IDF.ipynb └── SMSSpamCollection.tsv ├── 4. Feature Engineering ├── 4.1. Feature Creation.ipynb ├── 4.2. Transformation.ipynb └── SMSSpamCollection.tsv ├── 5. Building Machine Learning Classifiers ├── 5.1. Building a basic Random Forest Model.ipynb ├── 5.2. Random Forest on a holdout test set.ipynb ├── 5.3. Explore Random Forest Model with Grid-Search.ipynb ├── 5.4. Evaluate Random Forest with GridSearchCV.ipynb ├── 5.5. Explore Gradient Boosting model with Grid-Search.ipynb ├── 5.6. Evaluate Gradient Boosting with GridSearchCV.ipynb ├── 5.7. Model Selection.ipynb ├── SMSSpamCollection.tsv └── empty ├── LICENSE ├── README.md ├── page.html └── test output ├── empty ├── giphy.gif ├── output_1.png └── output_2.png /1. NLP Basics/1.1. what is NLP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: What is Natural Language Processing & the Natural Language Toolkit?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### How to install NLTK on your local machine\n", 15 | "\n", 16 | "Both sets of instructions below assume you already have Python installed. These instructions are taken directly from [http://www.nltk.org/install.html](http://www.nltk.org/install.html).\n", 17 | "\n", 18 | "**Mac/Unix**\n", 19 | "\n", 20 | "From the terminal:\n", 21 | "1. Install NLTK: run `pip install -U nltk`\n", 22 | "2. Test installation: run `python` then type `import nltk`\n", 23 | "\n", 24 | "**Windows**\n", 25 | "\n", 26 | "1. Install NLTK: [http://pypi.python.org/pypi/nltk](http://pypi.python.org/pypi/nltk)\n", 27 | "2. 
Test installation: `Start>Python35`, then type `import nltk`" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "### Download NLTK data" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 9, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "name": "stdout", 44 | "output_type": "stream", 45 | "text": [ 46 | "showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml\n" 47 | ] 48 | }, 49 | { 50 | "data": { 51 | "text/plain": [ 52 | "True" 53 | ] 54 | }, 55 | "execution_count": 9, 56 | "metadata": {}, 57 | "output_type": "execute_result" 58 | } 59 | ], 60 | "source": [ 61 | "import nltk\n", 62 | "nltk.download()" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 10, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "data": { 72 | "text/plain": [ 73 | "['AbstractLazySequence',\n", 74 | " 'AffixTagger',\n", 75 | " 'AlignedSent',\n", 76 | " 'Alignment',\n", 77 | " 'AnnotationTask',\n", 78 | " 'ApplicationExpression',\n", 79 | " 'Assignment',\n", 80 | " 'BigramAssocMeasures',\n", 81 | " 'BigramCollocationFinder',\n", 82 | " 'BigramTagger',\n", 83 | " 'BinaryMaxentFeatureEncoding',\n", 84 | " 'BlanklineTokenizer',\n", 85 | " 'BllipParser',\n", 86 | " 'BottomUpChartParser',\n", 87 | " 'BottomUpLeftCornerChartParser',\n", 88 | " 'BottomUpProbabilisticChartParser',\n", 89 | " 'Boxer',\n", 90 | " 'BrillTagger',\n", 91 | " 'BrillTaggerTrainer',\n", 92 | " 'CFG',\n", 93 | " 'CRFTagger',\n", 94 | " 'CfgReadingCommand',\n", 95 | " 'ChartParser',\n", 96 | " 'ChunkParserI',\n", 97 | " 'ChunkScore',\n", 98 | " 'ClassifierBasedPOSTagger',\n", 99 | " 'ClassifierBasedTagger',\n", 100 | " 'ClassifierI',\n", 101 | " 'ConcordanceIndex',\n", 102 | " 'ConditionalExponentialClassifier',\n", 103 | " 'ConditionalFreqDist',\n", 104 | " 'ConditionalProbDist',\n", 105 | " 'ConditionalProbDistI',\n", 106 | " 'ConfusionMatrix',\n", 107 | " 'ContextIndex',\n", 108 | " 'ContextTagger',\n", 109 | " 'ContingencyMeasures',\n", 110 | " 'CoreNLPDependencyParser',\n", 111 | " 'CoreNLPParser',\n", 112 | " 'Counter',\n", 113 | " 'CrossValidationProbDist',\n", 114 | " 'DRS',\n", 115 | " 'DecisionTreeClassifier',\n", 116 | " 'DefaultTagger',\n", 117 | " 'DependencyEvaluator',\n", 118 | " 'DependencyGrammar',\n", 119 | " 'DependencyGraph',\n", 120 | " 'DependencyProduction',\n", 121 | " 'DictionaryConditionalProbDist',\n", 122 | " 'DictionaryProbDist',\n", 123 | " 'DiscourseTester',\n", 124 | " 'DrtExpression',\n", 125 | " 'DrtGlueReadingCommand',\n", 126 | " 'ELEProbDist',\n", 127 | " 'EarleyChartParser',\n", 128 | " 'Expression',\n", 129 | " 'FStructure',\n", 130 | " 'FeatDict',\n", 131 | " 'FeatList',\n", 132 | " 'FeatStruct',\n", 133 | " 'FeatStructReader',\n", 134 | " 'Feature',\n", 135 | " 'FeatureBottomUpChartParser',\n", 136 | " 'FeatureBottomUpLeftCornerChartParser',\n", 137 | " 'FeatureChartParser',\n", 138 | " 'FeatureEarleyChartParser',\n", 139 | " 'FeatureIncrementalBottomUpChartParser',\n", 140 | " 'FeatureIncrementalBottomUpLeftCornerChartParser',\n", 141 | " 'FeatureIncrementalChartParser',\n", 142 | " 'FeatureIncrementalTopDownChartParser',\n", 143 | " 'FeatureTopDownChartParser',\n", 144 | " 'FreqDist',\n", 145 | " 'HTTPPasswordMgrWithDefaultRealm',\n", 146 | " 'HeldoutProbDist',\n", 147 | " 'HiddenMarkovModelTagger',\n", 148 | " 'HiddenMarkovModelTrainer',\n", 149 | " 'HunposTagger',\n", 150 | " 'IBMModel',\n", 151 | " 'IBMModel1',\n", 152 | " 'IBMModel2',\n", 153 | " 'IBMModel3',\n", 154 | " 
'IBMModel4',\n", 155 | " 'IBMModel5',\n", 156 | " 'ISRIStemmer',\n", 157 | " 'ImmutableMultiParentedTree',\n", 158 | " 'ImmutableParentedTree',\n", 159 | " 'ImmutableProbabilisticMixIn',\n", 160 | " 'ImmutableProbabilisticTree',\n", 161 | " 'ImmutableTree',\n", 162 | " 'IncrementalBottomUpChartParser',\n", 163 | " 'IncrementalBottomUpLeftCornerChartParser',\n", 164 | " 'IncrementalChartParser',\n", 165 | " 'IncrementalLeftCornerChartParser',\n", 166 | " 'IncrementalTopDownChartParser',\n", 167 | " 'Index',\n", 168 | " 'InsideChartParser',\n", 169 | " 'JSONTaggedDecoder',\n", 170 | " 'JSONTaggedEncoder',\n", 171 | " 'KneserNeyProbDist',\n", 172 | " 'LancasterStemmer',\n", 173 | " 'LaplaceProbDist',\n", 174 | " 'LazyConcatenation',\n", 175 | " 'LazyEnumerate',\n", 176 | " 'LazyIteratorList',\n", 177 | " 'LazyMap',\n", 178 | " 'LazySubsequence',\n", 179 | " 'LazyZip',\n", 180 | " 'LeftCornerChartParser',\n", 181 | " 'LidstoneProbDist',\n", 182 | " 'LineTokenizer',\n", 183 | " 'LogicalExpressionException',\n", 184 | " 'LongestChartParser',\n", 185 | " 'MLEProbDist',\n", 186 | " 'MWETokenizer',\n", 187 | " 'Mace',\n", 188 | " 'MaceCommand',\n", 189 | " 'MaltParser',\n", 190 | " 'MaxentClassifier',\n", 191 | " 'Model',\n", 192 | " 'MultiClassifierI',\n", 193 | " 'MultiParentedTree',\n", 194 | " 'MutableProbDist',\n", 195 | " 'NaiveBayesClassifier',\n", 196 | " 'NaiveBayesDependencyScorer',\n", 197 | " 'NgramAssocMeasures',\n", 198 | " 'NgramTagger',\n", 199 | " 'NonprojectiveDependencyParser',\n", 200 | " 'Nonterminal',\n", 201 | " 'OrderedDict',\n", 202 | " 'PCFG',\n", 203 | " 'Paice',\n", 204 | " 'ParallelProverBuilder',\n", 205 | " 'ParallelProverBuilderCommand',\n", 206 | " 'ParentedTree',\n", 207 | " 'ParserI',\n", 208 | " 'PerceptronTagger',\n", 209 | " 'PhraseTable',\n", 210 | " 'PorterStemmer',\n", 211 | " 'PositiveNaiveBayesClassifier',\n", 212 | " 'ProbDistI',\n", 213 | " 'ProbabilisticDependencyGrammar',\n", 214 | " 'ProbabilisticMixIn',\n", 215 | " 'ProbabilisticNonprojectiveParser',\n", 216 | " 'ProbabilisticProduction',\n", 217 | " 'ProbabilisticProjectiveDependencyParser',\n", 218 | " 'ProbabilisticTree',\n", 219 | " 'Production',\n", 220 | " 'ProjectiveDependencyParser',\n", 221 | " 'Prover9',\n", 222 | " 'Prover9Command',\n", 223 | " 'ProxyBasicAuthHandler',\n", 224 | " 'ProxyDigestAuthHandler',\n", 225 | " 'ProxyHandler',\n", 226 | " 'PunktSentenceTokenizer',\n", 227 | " 'QuadgramCollocationFinder',\n", 228 | " 'RSLPStemmer',\n", 229 | " 'RTEFeatureExtractor',\n", 230 | " 'RUS_PICKLE',\n", 231 | " 'RandomChartParser',\n", 232 | " 'RangeFeature',\n", 233 | " 'ReadingCommand',\n", 234 | " 'RecursiveDescentParser',\n", 235 | " 'RegexpChunkParser',\n", 236 | " 'RegexpParser',\n", 237 | " 'RegexpStemmer',\n", 238 | " 'RegexpTagger',\n", 239 | " 'RegexpTokenizer',\n", 240 | " 'ReppTokenizer',\n", 241 | " 'ResolutionProver',\n", 242 | " 'ResolutionProverCommand',\n", 243 | " 'SExprTokenizer',\n", 244 | " 'SLASH',\n", 245 | " 'Senna',\n", 246 | " 'SennaChunkTagger',\n", 247 | " 'SennaNERTagger',\n", 248 | " 'SennaTagger',\n", 249 | " 'SequentialBackoffTagger',\n", 250 | " 'ShiftReduceParser',\n", 251 | " 'SimpleGoodTuringProbDist',\n", 252 | " 'SklearnClassifier',\n", 253 | " 'SlashFeature',\n", 254 | " 'SnowballStemmer',\n", 255 | " 'SpaceTokenizer',\n", 256 | " 'StackDecoder',\n", 257 | " 'StanfordNERTagger',\n", 258 | " 'StanfordPOSTagger',\n", 259 | " 'StanfordSegmenter',\n", 260 | " 'StanfordTagger',\n", 261 | " 'StemmerI',\n", 262 | " 'SteppingChartParser',\n", 263 | " 
'SteppingRecursiveDescentParser',\n", 264 | " 'SteppingShiftReduceParser',\n", 265 | " 'TYPE',\n", 266 | " 'TabTokenizer',\n", 267 | " 'TableauProver',\n", 268 | " 'TableauProverCommand',\n", 269 | " 'TaggerI',\n", 270 | " 'TestGrammar',\n", 271 | " 'Text',\n", 272 | " 'TextCat',\n", 273 | " 'TextCollection',\n", 274 | " 'TextTilingTokenizer',\n", 275 | " 'TnT',\n", 276 | " 'TokenSearcher',\n", 277 | " 'ToktokTokenizer',\n", 278 | " 'TopDownChartParser',\n", 279 | " 'TransitionParser',\n", 280 | " 'Tree',\n", 281 | " 'TreebankWordTokenizer',\n", 282 | " 'Trie',\n", 283 | " 'TrigramAssocMeasures',\n", 284 | " 'TrigramCollocationFinder',\n", 285 | " 'TrigramTagger',\n", 286 | " 'TweetTokenizer',\n", 287 | " 'TypedMaxentFeatureEncoding',\n", 288 | " 'Undefined',\n", 289 | " 'UniformProbDist',\n", 290 | " 'UnigramTagger',\n", 291 | " 'UnsortedChartParser',\n", 292 | " 'Valuation',\n", 293 | " 'Variable',\n", 294 | " 'ViterbiParser',\n", 295 | " 'WekaClassifier',\n", 296 | " 'WhitespaceTokenizer',\n", 297 | " 'WittenBellProbDist',\n", 298 | " 'WordNetLemmatizer',\n", 299 | " 'WordPunctTokenizer',\n", 300 | " '__author__',\n", 301 | " '__author_email__',\n", 302 | " '__builtins__',\n", 303 | " '__cached__',\n", 304 | " '__classifiers__',\n", 305 | " '__copyright__',\n", 306 | " '__doc__',\n", 307 | " '__file__',\n", 308 | " '__keywords__',\n", 309 | " '__license__',\n", 310 | " '__loader__',\n", 311 | " '__longdescr__',\n", 312 | " '__maintainer__',\n", 313 | " '__maintainer_email__',\n", 314 | " '__name__',\n", 315 | " '__package__',\n", 316 | " '__path__',\n", 317 | " '__spec__',\n", 318 | " '__url__',\n", 319 | " '__version__',\n", 320 | " 'absolute_import',\n", 321 | " 'accuracy',\n", 322 | " 'add_logs',\n", 323 | " 'agreement',\n", 324 | " 'align',\n", 325 | " 'alignment_error_rate',\n", 326 | " 'aline',\n", 327 | " 'api',\n", 328 | " 'app',\n", 329 | " 'apply_features',\n", 330 | " 'approxrand',\n", 331 | " 'arity',\n", 332 | " 'association',\n", 333 | " 'bigrams',\n", 334 | " 'binary_distance',\n", 335 | " 'binary_search_file',\n", 336 | " 'binding_ops',\n", 337 | " 'bisect',\n", 338 | " 'blankline_tokenize',\n", 339 | " 'bleu',\n", 340 | " 'bleu_score',\n", 341 | " 'bllip',\n", 342 | " 'boolean_ops',\n", 343 | " 'boxer',\n", 344 | " 'bracket_parse',\n", 345 | " 'breadth_first',\n", 346 | " 'brill',\n", 347 | " 'brill_trainer',\n", 348 | " 'build_opener',\n", 349 | " 'call_megam',\n", 350 | " 'casual',\n", 351 | " 'casual_tokenize',\n", 352 | " 'ccg',\n", 353 | " 'chain',\n", 354 | " 'chart',\n", 355 | " 'chat',\n", 356 | " 'choose',\n", 357 | " 'chunk',\n", 358 | " 'class_types',\n", 359 | " 'classify',\n", 360 | " 'clause',\n", 361 | " 'clean_html',\n", 362 | " 'clean_url',\n", 363 | " 'cluster',\n", 364 | " 'collections',\n", 365 | " 'collocations',\n", 366 | " 'combinations',\n", 367 | " 'compat',\n", 368 | " 'config_java',\n", 369 | " 'config_megam',\n", 370 | " 'config_weka',\n", 371 | " 'conflicts',\n", 372 | " 'confusionmatrix',\n", 373 | " 'conllstr2tree',\n", 374 | " 'conlltags2tree',\n", 375 | " 'corenlp',\n", 376 | " 'corpus',\n", 377 | " 'crf',\n", 378 | " 'custom_distance',\n", 379 | " 'data',\n", 380 | " 'decisiontree',\n", 381 | " 'decorator',\n", 382 | " 'decorators',\n", 383 | " 'defaultdict',\n", 384 | " 'demo',\n", 385 | " 'dependencygraph',\n", 386 | " 'deque',\n", 387 | " 'discourse',\n", 388 | " 'distance',\n", 389 | " 'download',\n", 390 | " 'download_gui',\n", 391 | " 'download_shell',\n", 392 | " 'downloader',\n", 393 | " 'draw',\n", 394 | " 'drt',\n", 395 | " 
'earleychart',\n", 396 | " 'edit_distance',\n", 397 | " 'elementtree_indent',\n", 398 | " 'entropy',\n", 399 | " 'equality_preds',\n", 400 | " 'evaluate',\n", 401 | " 'evaluate_sents',\n", 402 | " 'everygrams',\n", 403 | " 'extract_rels',\n", 404 | " 'extract_test_sentences',\n", 405 | " 'f_measure',\n", 406 | " 'featstruct',\n", 407 | " 'featurechart',\n", 408 | " 'filestring',\n", 409 | " 'find',\n", 410 | " 'flatten',\n", 411 | " 'fractional_presence',\n", 412 | " 'getproxies',\n", 413 | " 'ghd',\n", 414 | " 'glue',\n", 415 | " 'grammar',\n", 416 | " 'guess_encoding',\n", 417 | " 'help',\n", 418 | " 'hmm',\n", 419 | " 'hunpos',\n", 420 | " 'ibm1',\n", 421 | " 'ibm2',\n", 422 | " 'ibm3',\n", 423 | " 'ibm4',\n", 424 | " 'ibm5',\n", 425 | " 'ibm_model',\n", 426 | " 'ieerstr2tree',\n", 427 | " 'improved_close_quote_regex',\n", 428 | " 'improved_open_quote_regex',\n", 429 | " 'improved_punct_regex',\n", 430 | " 'in_idle',\n", 431 | " 'induce_pcfg',\n", 432 | " 'inference',\n", 433 | " 'infile',\n", 434 | " 'inspect',\n", 435 | " 'install_opener',\n", 436 | " 'internals',\n", 437 | " 'interpret_sents',\n", 438 | " 'interval_distance',\n", 439 | " 'invert_dict',\n", 440 | " 'invert_graph',\n", 441 | " 'is_rel',\n", 442 | " 'islice',\n", 443 | " 'isri',\n", 444 | " 'jaccard_distance',\n", 445 | " 'json_tags',\n", 446 | " 'jsontags',\n", 447 | " 'lancaster',\n", 448 | " 'lazyimport',\n", 449 | " 'lfg',\n", 450 | " 'line_tokenize',\n", 451 | " 'linearlogic',\n", 452 | " 'load',\n", 453 | " 'load_parser',\n", 454 | " 'locale',\n", 455 | " 'log_likelihood',\n", 456 | " 'logic',\n", 457 | " 'mace',\n", 458 | " 'malt',\n", 459 | " 'map_tag',\n", 460 | " 'mapping',\n", 461 | " 'masi_distance',\n", 462 | " 'maxent',\n", 463 | " 'megam',\n", 464 | " 'memoize',\n", 465 | " 'metrics',\n", 466 | " 'misc',\n", 467 | " 'mwe',\n", 468 | " 'naivebayes',\n", 469 | " 'ne_chunk',\n", 470 | " 'ne_chunk_sents',\n", 471 | " 'ngrams',\n", 472 | " 'nonprojectivedependencyparser',\n", 473 | " 'nonterminals',\n", 474 | " 'numpy',\n", 475 | " 'os',\n", 476 | " 'pad_sequence',\n", 477 | " 'paice',\n", 478 | " 'parse',\n", 479 | " 'parse_sents',\n", 480 | " 'pchart',\n", 481 | " 'perceptron',\n", 482 | " 'pk',\n", 483 | " 'porter',\n", 484 | " 'pos_tag',\n", 485 | " 'pos_tag_sents',\n", 486 | " 'positivenaivebayes',\n", 487 | " 'pprint',\n", 488 | " 'pr',\n", 489 | " 'precision',\n", 490 | " 'presence',\n", 491 | " 'print_function',\n", 492 | " 'print_string',\n", 493 | " 'probability',\n", 494 | " 'projectivedependencyparser',\n", 495 | " 'prover9',\n", 496 | " 'punkt',\n", 497 | " 'py25',\n", 498 | " 'py26',\n", 499 | " 'py27',\n", 500 | " 'pydoc',\n", 501 | " 'python_2_unicode_compatible',\n", 502 | " 'raise_unorderable_types',\n", 503 | " 'ranks_from_scores',\n", 504 | " 'ranks_from_sequence',\n", 505 | " 're',\n", 506 | " 're_show',\n", 507 | " 'read_grammar',\n", 508 | " 'read_logic',\n", 509 | " 'read_valuation',\n", 510 | " 'recall',\n", 511 | " 'recursivedescent',\n", 512 | " 'regexp',\n", 513 | " 'regexp_span_tokenize',\n", 514 | " 'regexp_tokenize',\n", 515 | " 'register_tag',\n", 516 | " 'relextract',\n", 517 | " 'repp',\n", 518 | " 'resolution',\n", 519 | " 'ribes',\n", 520 | " 'ribes_score',\n", 521 | " 'root_semrep',\n", 522 | " 'rslp',\n", 523 | " 'rte_classifier',\n", 524 | " 'rte_classify',\n", 525 | " 'rte_features',\n", 526 | " 'rtuple',\n", 527 | " 'scikitlearn',\n", 528 | " 'scores',\n", 529 | " 'segmentation',\n", 530 | " 'sem',\n", 531 | " 'senna',\n", 532 | " 'sent_tokenize',\n", 533 | " 
'sequential',\n", 534 | " 'set2rel',\n", 535 | " 'set_proxy',\n", 536 | " 'sexpr',\n", 537 | " 'sexpr_tokenize',\n", 538 | " 'shiftreduce',\n", 539 | " 'simple',\n", 540 | " 'sinica_parse',\n", 541 | " 'skipgrams',\n", 542 | " 'skolemize',\n", 543 | " 'slice_bounds',\n", 544 | " 'snowball',\n", 545 | " 'spearman',\n", 546 | " 'spearman_correlation',\n", 547 | " 'stack_decoder',\n", 548 | " 'stanford',\n", 549 | " 'stanford_segmenter',\n", 550 | " 'stem',\n", 551 | " 'str2tuple',\n", 552 | " 'string_span_tokenize',\n", 553 | " 'string_types',\n", 554 | " 'subprocess',\n", 555 | " 'subsumes',\n", 556 | " 'sum_logs',\n", 557 | " 'sys',\n", 558 | " 'tableau',\n", 559 | " 'tadm',\n", 560 | " 'tag',\n", 561 | " 'tagset_mapping',\n", 562 | " 'tagstr2tree',\n", 563 | " 'tbl',\n", 564 | " 'text',\n", 565 | " 'text_type',\n", 566 | " 'textcat',\n", 567 | " 'texttiling',\n", 568 | " 'textwrap',\n", 569 | " 'tkinter',\n", 570 | " 'tnt',\n", 571 | " 'tokenize',\n", 572 | " 'tokenwrap',\n", 573 | " 'toktok',\n", 574 | " 'toolbox',\n", 575 | " 'total_ordering',\n", 576 | " 'transitionparser',\n", 577 | " 'transitive_closure',\n", 578 | " 'translate',\n", 579 | " 'tree',\n", 580 | " 'tree2conllstr',\n", 581 | " 'tree2conlltags',\n", 582 | " 'treebank',\n", 583 | " 'treetransforms',\n", 584 | " 'trigrams',\n", 585 | " 'tuple2str',\n", 586 | " 'types',\n", 587 | " 'unify',\n", 588 | " 'unique_list',\n", 589 | " 'untag',\n", 590 | " 'usage',\n", 591 | " 'util',\n", 592 | " 'version_file',\n", 593 | " 'version_info',\n", 594 | " 'viterbi',\n", 595 | " 'weka',\n", 596 | " 'windowdiff',\n", 597 | " 'word_tokenize',\n", 598 | " 'wordnet',\n", 599 | " 'wordpunct_tokenize',\n", 600 | " 'wsd']" 601 | ] 602 | }, 603 | "execution_count": 10, 604 | "metadata": {}, 605 | "output_type": "execute_result" 606 | } 607 | ], 608 | "source": [ 609 | "dir(nltk)" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "### What can you do with NLTK?" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 12, 622 | "metadata": {}, 623 | "outputs": [ 624 | { 625 | "data": { 626 | "text/plain": [ 627 | "['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']" 628 | ] 629 | }, 630 | "execution_count": 12, 631 | "metadata": {}, 632 | "output_type": "execute_result" 633 | } 634 | ], 635 | "source": [ 636 | "from nltk.corpus import stopwords\n", 637 | "\n", 638 | "stopwords.words('english')[0:500:25]" 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": null, 644 | "metadata": { 645 | "collapsed": true, 646 | "jupyter": { 647 | "outputs_hidden": true 648 | } 649 | }, 650 | "outputs": [], 651 | "source": [] 652 | } 653 | ], 654 | "metadata": { 655 | "kernelspec": { 656 | "display_name": "Python 3 (ipykernel)", 657 | "language": "python", 658 | "name": "python3" 659 | }, 660 | "language_info": { 661 | "codemirror_mode": { 662 | "name": "ipython", 663 | "version": 3 664 | }, 665 | "file_extension": ".py", 666 | "mimetype": "text/x-python", 667 | "name": "python", 668 | "nbconvert_exporter": "python", 669 | "pygments_lexer": "ipython3", 670 | "version": "3.11.0" 671 | } 672 | }, 673 | "nbformat": 4, 674 | "nbformat_minor": 4 675 | } 676 | -------------------------------------------------------------------------------- /1. NLP Basics/1.2. 
reading in text data & why do we need cleaning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Reading in text data & why do we need to clean the text?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in semi-structured text data" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/plain": [ 25 | "\"ham\\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\\nspam\\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\\nham\\tNah I don't think he goes to usf, he lives around here though\\nham\\tEven my brother is not like to speak with me. They treat me like aid\"" 26 | ] 27 | }, 28 | "execution_count": 2, 29 | "metadata": {}, 30 | "output_type": "execute_result" 31 | } 32 | ], 33 | "source": [ 34 | "# Read in the raw text\n", 35 | "rawData = open(\"SMSSpamCollection.tsv\").read()\n", 36 | "\n", 37 | "# Print the raw data\n", 38 | "rawData[0:500]" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": { 45 | "tags": [] 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "parsedData = rawData.replace('\\t', '\\n').split('\\n')" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 4, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['ham',\n", 61 | " \"I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\",\n", 62 | " 'spam',\n", 63 | " \"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\",\n", 64 | " 'ham']" 65 | ] 66 | }, 67 | "execution_count": 4, 68 | "metadata": {}, 69 | "output_type": "execute_result" 70 | } 71 | ], 72 | "source": [ 73 | "parsedData[0:5]" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 5, 79 | "metadata": { 80 | "tags": [] 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "labelList = parsedData[0::2]\n", 85 | "textList = parsedData[1::2]" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 6, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "['ham', 'spam', 'ham', 'ham', 'ham']\n", 98 | "[\"I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\", \"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\", \"Nah I don't think he goes to usf, he lives around here though\", 'Even my brother is not like to speak with me. 
They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "print(labelList[0:5])\n", 104 | "print(textList[0:5])" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 12, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "ename": "ValueError", 114 | "evalue": "All arrays must be of the same length", 115 | "output_type": "error", 116 | "traceback": [ 117 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 118 | "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", 119 | "Cell \u001b[1;32mIn[12], line 3\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mpandas\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mpd\u001b[39;00m\n\u001b[1;32m----> 3\u001b[0m fullCorpus \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mDataFrame\u001b[49m\u001b[43m(\u001b[49m\u001b[43m{\u001b[49m\n\u001b[0;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mlabel\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43mlabelList\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mbody_list\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43mtextList\u001b[49m\n\u001b[0;32m 6\u001b[0m \u001b[43m}\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 8\u001b[0m fullCorpus\u001b[38;5;241m.\u001b[39mhead()\n", 120 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\frame.py:663\u001b[0m, in \u001b[0;36mDataFrame.__init__\u001b[1;34m(self, data, index, columns, dtype, copy)\u001b[0m\n\u001b[0;32m 657\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_mgr(\n\u001b[0;32m 658\u001b[0m data, axes\u001b[38;5;241m=\u001b[39m{\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mindex\u001b[39m\u001b[38;5;124m\"\u001b[39m: index, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns\u001b[39m\u001b[38;5;124m\"\u001b[39m: columns}, dtype\u001b[38;5;241m=\u001b[39mdtype, copy\u001b[38;5;241m=\u001b[39mcopy\n\u001b[0;32m 659\u001b[0m )\n\u001b[0;32m 661\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, \u001b[38;5;28mdict\u001b[39m):\n\u001b[0;32m 662\u001b[0m \u001b[38;5;66;03m# GH#38939 de facto copy defaults to False only in non-dict cases\u001b[39;00m\n\u001b[1;32m--> 663\u001b[0m mgr \u001b[38;5;241m=\u001b[39m \u001b[43mdict_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmanager\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 664\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, ma\u001b[38;5;241m.\u001b[39mMaskedArray):\n\u001b[0;32m 665\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m 
\u001b[38;5;21;01mnumpy\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mma\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmrecords\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mmrecords\u001b[39;00m\n", 121 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\internals\\construction.py:493\u001b[0m, in \u001b[0;36mdict_to_mgr\u001b[1;34m(data, index, columns, dtype, typ, copy)\u001b[0m\n\u001b[0;32m 489\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 490\u001b[0m \u001b[38;5;66;03m# dtype check to exclude e.g. range objects, scalars\u001b[39;00m\n\u001b[0;32m 491\u001b[0m arrays \u001b[38;5;241m=\u001b[39m [x\u001b[38;5;241m.\u001b[39mcopy() \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(x, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdtype\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m arrays]\n\u001b[1;32m--> 493\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marrays_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtyp\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconsolidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n", 122 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\internals\\construction.py:118\u001b[0m, in \u001b[0;36marrays_to_mgr\u001b[1;34m(arrays, columns, index, dtype, verify_integrity, typ, consolidate)\u001b[0m\n\u001b[0;32m 115\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m verify_integrity:\n\u001b[0;32m 116\u001b[0m \u001b[38;5;66;03m# figure out the index, if necessary\u001b[39;00m\n\u001b[0;32m 117\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m index \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m--> 118\u001b[0m index \u001b[38;5;241m=\u001b[39m \u001b[43m_extract_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 119\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 120\u001b[0m index \u001b[38;5;241m=\u001b[39m ensure_index(index)\n", 123 | "File \u001b[1;32m~\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\core\\internals\\construction.py:666\u001b[0m, in \u001b[0;36m_extract_index\u001b[1;34m(data)\u001b[0m\n\u001b[0;32m 664\u001b[0m lengths \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(raw_lengths))\n\u001b[0;32m 665\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(lengths) \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m--> 666\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAll arrays must be of the same length\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 668\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m have_dicts:\n\u001b[0;32m 669\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[0;32m 670\u001b[0m 
\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMixing dicts with non-Series may lead to ambiguous ordering.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 671\u001b[0m )\n", 124 | "\u001b[1;31mValueError\u001b[0m: All arrays must be of the same length" 125 | ] 126 | } 127 | ], 128 | "source": [ 129 | "import pandas as pd\n", 130 | "\n", 131 | "fullCorpus = pd.DataFrame({\n", 132 | " 'label': labelList,\n", 133 | " 'body_list': textList\n", 134 | "})\n", 135 | "\n", 136 | "fullCorpus.head()" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 8, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "5571\n", 149 | "5570\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "print(len(labelList))\n", 155 | "print(len(textList))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 9, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "['ham', 'ham', 'ham', 'ham', '']\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "print(labelList[-5:])" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 10, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "data": { 182 | "text/html": [ 183 | "
\n", 184 | "\n", 197 | "\n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_list); same rows as the text/plain output below]
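The length mismatch printed earlier (5571 labels vs. 5570 messages) comes from the file's trailing newline: splitting the raw string on `\n` leaves an empty final element, which lands in `labelList`. Dropping that last label works, but as an aside (not an original notebook cell) a parse that splits each line on its first tab avoids the off-by-one altogether. A minimal sketch, assuming the same one-record-per-line `label<TAB>message` layout of SMSSpamCollection.tsv:

```python
# Sketch: defensive manual parse of the TSV, splitting each line on the
# first tab and skipping the blank line produced by the trailing newline.
import pandas as pd

records = []
with open("SMSSpamCollection.tsv", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue                       # skip the empty final line
        label, text = line.split("\t", 1)  # split on the first tab only
        records.append((label, text))

fullCorpus = pd.DataFrame(records, columns=["label", "body_list"])
print(fullCorpus.shape)
```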
" 234 | ], 235 | "text/plain": [ 236 | " label body_list\n", 237 | "0 ham I've been searching for the right words to tha...\n", 238 | "1 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 239 | "2 ham Nah I don't think he goes to usf, he lives aro...\n", 240 | "3 ham Even my brother is not like to speak with me. ...\n", 241 | "4 ham I HAVE A DATE ON SUNDAY WITH WILL!!" 242 | ] 243 | }, 244 | "execution_count": 10, 245 | "metadata": {}, 246 | "output_type": "execute_result" 247 | } 248 | ], 249 | "source": [ 250 | "fullCorpus = pd.DataFrame({\n", 251 | " 'label': labelList[:-1],\n", 252 | " 'body_list': textList\n", 253 | "})\n", 254 | "\n", 255 | "fullCorpus.head()" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 11, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "text/html": [ 266 | "
\n", 267 | "\n", 280 | "\n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | "
[pandas DataFrame rendered as an HTML table (columns: 0, 1); same rows as the text/plain output below]
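A small usage note, not from the original notebook: `pd.read_csv` can also assign the column names at read time via its `names` parameter, instead of working with the default `0`/`1` columns and renaming them afterwards. A sketch under the same file assumption:

```python
# Sketch: same tab-separated read as above, but naming the columns up front.
import pandas as pd

dataset = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None,
                      names=["label", "body_text"])
print(dataset.head())
```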
" 317 | ], 318 | "text/plain": [ 319 | " 0 1\n", 320 | "0 ham I've been searching for the right words to tha...\n", 321 | "1 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 322 | "2 ham Nah I don't think he goes to usf, he lives aro...\n", 323 | "3 ham Even my brother is not like to speak with me. ...\n", 324 | "4 ham I HAVE A DATE ON SUNDAY WITH WILL!!" 325 | ] 326 | }, 327 | "execution_count": 11, 328 | "metadata": {}, 329 | "output_type": "execute_result" 330 | } 331 | ], 332 | "source": [ 333 | "dataset = pd.read_csv(\"SMSSpamCollection.tsv\", sep=\"\\t\", header=None)\n", 334 | "dataset.head(" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": { 341 | "collapsed": true, 342 | "jupyter": { 343 | "outputs_hidden": true 344 | } 345 | }, 346 | "outputs": [], 347 | "source": [] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "Python 3 (ipykernel)", 353 | "language": "python", 354 | "name": "python3" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.11.0" 367 | } 368 | }, 369 | "nbformat": 4, 370 | "nbformat_minor": 4 371 | } 372 | -------------------------------------------------------------------------------- /1. NLP Basics/1.3. How to explore a dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Exploring the dataset" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in text data" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/html": [ 25 | "
\n", 26 | "\n", 39 | "\n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_text); same rows as the text/plain output below]
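The exploration cells below report the shape, the spam/ham split, and the missing values using boolean filtering. As a hedged aside, the same numbers can be pulled with a few pandas one-liners; a sketch assuming the corpus is loaded the same way:

```python
# Sketch: compact equivalents of the exploration steps in this notebook.
import pandas as pd

fullCorpus = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None,
                         names=["label", "body_text"])

print(fullCorpus.shape)                    # (rows, columns)
print(fullCorpus["label"].value_counts())  # counts of ham vs. spam
print(fullCorpus.isnull().sum())           # missing values per column
```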
" 76 | ], 77 | "text/plain": [ 78 | " label body_text\n", 79 | "0 ham I've been searching for the right words to tha...\n", 80 | "1 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 81 | "2 ham Nah I don't think he goes to usf, he lives aro...\n", 82 | "3 ham Even my brother is not like to speak with me. ...\n", 83 | "4 ham I HAVE A DATE ON SUNDAY WITH WILL!!" 84 | ] 85 | }, 86 | "execution_count": 1, 87 | "metadata": {}, 88 | "output_type": "execute_result" 89 | } 90 | ], 91 | "source": [ 92 | "import pandas as pd\n", 93 | "\n", 94 | "fullCorpus = pd.read_csv('SMSSpamCollection.tsv', sep='\\t', header=None)\n", 95 | "fullCorpus.columns = ['label', 'body_text']\n", 96 | "\n", 97 | "fullCorpus.head()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### Explore the dataset" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 2, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "Input data has 5568 rows and 2 columns\n" 117 | ] 118 | } 119 | ], 120 | "source": [ 121 | "# What is the shape of the dataset?\n", 122 | "\n", 123 | "print(\"Input data has {} rows and {} columns\".format(len(fullCorpus), len(fullCorpus.columns)))" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "name": "stdout", 133 | "output_type": "stream", 134 | "text": [ 135 | "Out of 5568 rows, 746 are spam, 4822 are ham\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "# How many spam/ham are there?\n", 141 | "\n", 142 | "print(\"Out of {} rows, {} are spam, {} are ham\".format(len(fullCorpus),\n", 143 | " len(fullCorpus[fullCorpus['label']=='spam']),\n", 144 | " len(fullCorpus[fullCorpus['label']=='ham'])))" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 6, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "Number of null in label: 0\n", 157 | "Number of null in text: 0\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "# How much missing data is there?\n", 163 | "\n", 164 | "print(\"Number of null in label: {}\".format(fullCorpus['label'].isnull().sum()))\n", 165 | "print(\"Number of null in text: {}\".format(fullCorpus['body_text'].isnull().sum()))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": { 172 | "collapsed": true, 173 | "jupyter": { 174 | "outputs_hidden": true 175 | } 176 | }, 177 | "outputs": [], 178 | "source": [] 179 | } 180 | ], 181 | "metadata": { 182 | "kernelspec": { 183 | "display_name": "Python 3 (ipykernel)", 184 | "language": "python", 185 | "name": "python3" 186 | }, 187 | "language_info": { 188 | "codemirror_mode": { 189 | "name": "ipython", 190 | "version": 3 191 | }, 192 | "file_extension": ".py", 193 | "mimetype": "text/x-python", 194 | "name": "python", 195 | "nbconvert_exporter": "python", 196 | "pygments_lexer": "ipython3", 197 | "version": "3.11.0" 198 | } 199 | }, 200 | "nbformat": 4, 201 | "nbformat_minor": 4 202 | } 203 | -------------------------------------------------------------------------------- /1. NLP Basics/1.4. 
learning how to use regular expressions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Learning how to use regular expressions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Using regular expressions in Python\n", 15 | "\n", 16 | "Python's `re` package is the most commonly used regex resource. More details can be found [here](https://docs.python.org/3/library/re.html)." 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": { 23 | "collapsed": true, 24 | "jupyter": { 25 | "outputs_hidden": true 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "import re\n", 31 | "\n", 32 | "re_test = 'This is a made up string to test 2 different regex methods'\n", 33 | "re_test_messy = 'This is a made up string to test 2 different regex methods'\n", 34 | "re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2\"\"\"\"\"\"different~regex-methods'" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### Splitting a sentence into a list of words" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "data": { 51 | "text/plain": [ 52 | "['This',\n", 53 | " 'is',\n", 54 | " 'a',\n", 55 | " 'made',\n", 56 | " 'up',\n", 57 | " 'string',\n", 58 | " 'to',\n", 59 | " 'test',\n", 60 | " '2',\n", 61 | " 'different',\n", 62 | " 'regex',\n", 63 | " 'methods']" 64 | ] 65 | }, 66 | "execution_count": 3, 67 | "metadata": {}, 68 | "output_type": "execute_result" 69 | } 70 | ], 71 | "source": [ 72 | "re.split('\\s', re_test)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 4, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/plain": [ 83 | "['This',\n", 84 | " '',\n", 85 | " '',\n", 86 | " '',\n", 87 | " '',\n", 88 | " '',\n", 89 | " 'is',\n", 90 | " 'a',\n", 91 | " 'made',\n", 92 | " 'up',\n", 93 | " '',\n", 94 | " '',\n", 95 | " '',\n", 96 | " '',\n", 97 | " 'string',\n", 98 | " 'to',\n", 99 | " 'test',\n", 100 | " '2',\n", 101 | " '',\n", 102 | " '',\n", 103 | " '',\n", 104 | " 'different',\n", 105 | " 'regex',\n", 106 | " 'methods']" 107 | ] 108 | }, 109 | "execution_count": 4, 110 | "metadata": {}, 111 | "output_type": "execute_result" 112 | } 113 | ], 114 | "source": [ 115 | "re.split('\\s', re_test_messy)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 5, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "['This',\n", 127 | " 'is',\n", 128 | " 'a',\n", 129 | " 'made',\n", 130 | " 'up',\n", 131 | " 'string',\n", 132 | " 'to',\n", 133 | " 'test',\n", 134 | " '2',\n", 135 | " 'different',\n", 136 | " 'regex',\n", 137 | " 'methods']" 138 | ] 139 | }, 140 | "execution_count": 5, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "re.split('\\s+', re_test_messy)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 6, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "['This-is-a-made/up.string*to>>>>test----2\"\"\"\"\"\"different~regex-methods']" 158 | ] 159 | }, 160 | "execution_count": 6, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "re.split('\\s+', re_test_messy1)" 167 | 
] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 7, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "['This',\n", 178 | " 'is',\n", 179 | " 'a',\n", 180 | " 'made',\n", 181 | " 'up',\n", 182 | " 'string',\n", 183 | " 'to',\n", 184 | " 'test',\n", 185 | " '2',\n", 186 | " 'different',\n", 187 | " 'regex',\n", 188 | " 'methods']" 189 | ] 190 | }, 191 | "execution_count": 7, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "re.split('\\W+', re_test_messy1)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 10, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "data": { 207 | "text/plain": [ 208 | "['This',\n", 209 | " 'is',\n", 210 | " 'a',\n", 211 | " 'made',\n", 212 | " 'up',\n", 213 | " 'string',\n", 214 | " 'to',\n", 215 | " 'test',\n", 216 | " '2',\n", 217 | " 'different',\n", 218 | " 'regex',\n", 219 | " 'methods']" 220 | ] 221 | }, 222 | "execution_count": 10, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "re.findall('\\S+', re_test)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 11, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "['This',\n", 240 | " 'is',\n", 241 | " 'a',\n", 242 | " 'made',\n", 243 | " 'up',\n", 244 | " 'string',\n", 245 | " 'to',\n", 246 | " 'test',\n", 247 | " '2',\n", 248 | " 'different',\n", 249 | " 'regex',\n", 250 | " 'methods']" 251 | ] 252 | }, 253 | "execution_count": 11, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "re.findall('\\S+', re_test_messy)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 12, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/plain": [ 270 | "['This-is-a-made/up.string*to>>>>test----2\"\"\"\"\"\"different~regex-methods']" 271 | ] 272 | }, 273 | "execution_count": 12, 274 | "metadata": {}, 275 | "output_type": "execute_result" 276 | } 277 | ], 278 | "source": [ 279 | "re.findall('\\S+', re_test_messy1)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 13, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/plain": [ 290 | "['This',\n", 291 | " 'is',\n", 292 | " 'a',\n", 293 | " 'made',\n", 294 | " 'up',\n", 295 | " 'string',\n", 296 | " 'to',\n", 297 | " 'test',\n", 298 | " '2',\n", 299 | " 'different',\n", 300 | " 'regex',\n", 301 | " 'methods']" 302 | ] 303 | }, 304 | "execution_count": 13, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "re.findall('\\w+', re_test_messy1)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "### Replacing a specific string" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 3, 323 | "metadata": { 324 | "collapsed": true, 325 | "jupyter": { 326 | "outputs_hidden": true 327 | } 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "pep8_test = 'I try to follow PEP8 guidelines'\n", 332 | "pep7_test = 'I try to follow PEP7 guidelines'\n", 333 | "peep8_test = 'I try to follow PEEP8 guidelines'" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 4, 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "text/plain": [ 344 | "['try', 'to', 'follow', 'guidelines']" 345 | ] 346 | }, 347 | 
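An aside on the character classes used in the surrounding cells: ranges such as `[a-z]` are case-sensitive, which is why the search above returns only the lowercase words and drops 'I' and 'PEP8'. Combining ranges recovers every token; a short sketch (not an original cell):

```python
# Sketch: case-sensitive character class vs. a combined letters-and-digits class.
import re

pep8_test = 'I try to follow PEP8 guidelines'

print(re.findall(r'[a-z]+', pep8_test))        # ['try', 'to', 'follow', 'guidelines']
print(re.findall(r'[A-Za-z0-9]+', pep8_test))  # ['I', 'try', 'to', 'follow', 'PEP8', 'guidelines']
```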
"execution_count": 4, 348 | "metadata": {}, 349 | "output_type": "execute_result" 350 | } 351 | ], 352 | "source": [ 353 | "import re\n", 354 | "\n", 355 | "re.findall('[a-z]+', pep8_test)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 5, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "['I', 'PEP']" 367 | ] 368 | }, 369 | "execution_count": 5, 370 | "metadata": {}, 371 | "output_type": "execute_result" 372 | } 373 | ], 374 | "source": [ 375 | "re.findall('[A-Z]+', pep8_test)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 8, 381 | "metadata": {}, 382 | "outputs": [ 383 | { 384 | "data": { 385 | "text/plain": [ 386 | "['PEEP8']" 387 | ] 388 | }, 389 | "execution_count": 8, 390 | "metadata": {}, 391 | "output_type": "execute_result" 392 | } 393 | ], 394 | "source": [ 395 | "re.findall('[A-Z]+[0-9]+', peep8_test)" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 11, 401 | "metadata": {}, 402 | "outputs": [ 403 | { 404 | "data": { 405 | "text/plain": [ 406 | "'I try to follow PEP8 Python Styleguide guidelines'" 407 | ] 408 | }, 409 | "execution_count": 11, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8_test)" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "### Other examples of regex methods\n", 423 | "\n", 424 | "- re.search()\n", 425 | "- re.match()\n", 426 | "- re.fullmatch()\n", 427 | "- re.finditer()\n", 428 | "- re.escape()" 429 | ] 430 | } 431 | ], 432 | "metadata": { 433 | "kernelspec": { 434 | "display_name": "Python 3 (ipykernel)", 435 | "language": "python", 436 | "name": "python3" 437 | }, 438 | "language_info": { 439 | "codemirror_mode": { 440 | "name": "ipython", 441 | "version": 3 442 | }, 443 | "file_extension": ".py", 444 | "mimetype": "text/x-python", 445 | "name": "python", 446 | "nbconvert_exporter": "python", 447 | "pygments_lexer": "ipython3", 448 | "version": "3.11.0" 449 | } 450 | }, 451 | "nbformat": 4, 452 | "nbformat_minor": 4 453 | } 454 | -------------------------------------------------------------------------------- /1. NLP Basics/1.5. implementing a pipeline to clean text.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# NLP Basics: Implementing a pipeline to clean text" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Pre-processing text data\n", 15 | "\n", 16 | "Cleaning up the text data is necessary to highlight attributes that you're going to want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:\n", 17 | "1. **Remove punctuation**\n", 18 | "2. **Tokenization**\n", 19 | "3. **Remove stopwords**\n", 20 | "4. Lemmatize/Stem\n", 21 | "\n", 22 | "The first three steps are covered in this chapter as they're implemented in pretty much any text cleaning pipeline. Lemmatizing and stemming are covered in the next chapter as they're helpful but not critical." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": {}, 29 | "outputs": [ 30 | { 31 | "data": { 32 | "text/html": [ 33 | "
\n", 34 | "\n", 47 | "\n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_text); same rows as the text/plain output below]
" 84 | ], 85 | "text/plain": [ 86 | " label \\\n", 87 | "0 ham \n", 88 | "1 spam \n", 89 | "2 ham \n", 90 | "3 ham \n", 91 | "4 ham \n", 92 | "\n", 93 | " body_text \n", 94 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 95 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 96 | "2 Nah I don't think he goes to usf, he lives around here though \n", 97 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 98 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! " 99 | ] 100 | }, 101 | "execution_count": 1, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "import pandas as pd\n", 108 | "pd.set_option('display.max_colwidth', 100)\n", 109 | "\n", 110 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t', header=None)\n", 111 | "data.columns = ['label', 'body_text']\n", 112 | "\n", 113 | "data.head()" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 2, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/html": [ 124 | "
\n", 125 | "\n", 138 | "\n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | "
[pandas DataFrame rendered as an HTML table (columns: label, body_text, body_text_nostop); same rows as the text/plain output below]
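The "Remove punctuation" and "Tokenization" sections that follow build the body_text_nostop column shown above one step at a time. As a rough end-to-end sketch (an aside, not the notebook's own code, and assuming NLTK's English stopword list has already been fetched with `nltk.download('stopwords')`), the whole pipeline amounts to:

```python
# Sketch of the full cleaning pipeline: strip punctuation, tokenize on
# non-word characters, lowercase, and drop English stopwords.
import re
import string

import nltk
import pandas as pd

stopword_list = nltk.corpus.stopwords.words('english')

def clean_text(text):
    text = "".join(ch for ch in text if ch not in string.punctuation)   # remove punctuation
    tokens = re.split(r'\W+', text.lower())                             # tokenize
    return [tok for tok in tokens if tok and tok not in stopword_list]  # remove stopwords

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None,
                   names=['label', 'body_text'])
data['body_text_nostop'] = data['body_text'].apply(clean_text)
print(data.head())
```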
" 181 | ], 182 | "text/plain": [ 183 | " label \\\n", 184 | "0 ham \n", 185 | "1 spam \n", 186 | "2 ham \n", 187 | "3 ham \n", 188 | "4 ham \n", 189 | "\n", 190 | " body_text \\\n", 191 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 192 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 193 | "2 Nah I don't think he goes to usf, he lives around here though \n", 194 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 195 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 196 | "\n", 197 | " body_text_nostop \n", 198 | "0 ['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '... \n", 199 | "1 ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005... \n", 200 | "2 ['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though'] \n", 201 | "3 ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent'] \n", 202 | "4 ['date', 'sunday'] " 203 | ] 204 | }, 205 | "execution_count": 2, 206 | "metadata": {}, 207 | "output_type": "execute_result" 208 | } 209 | ], 210 | "source": [ 211 | "# What does the cleaned version look like?\n", 212 | "data_cleaned = pd.read_csv(\"SMSSpamCollection_cleaned.tsv\", sep='\\t')\n", 213 | "data_cleaned.head()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "### Remove punctuation" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 3, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" 232 | ] 233 | }, 234 | "execution_count": 3, 235 | "metadata": {}, 236 | "output_type": "execute_result" 237 | } 238 | ], 239 | "source": [ 240 | "import string\n", 241 | "string.punctuation" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 4, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/plain": [ 252 | "False" 253 | ] 254 | }, 255 | "execution_count": 4, 256 | "metadata": {}, 257 | "output_type": "execute_result" 258 | } 259 | ], 260 | "source": [ 261 | "\"I like NLP.\" == \"I like NLP\"" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 5, 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "data": { 271 | "text/html": [ 272 | "
\n", 273 | "\n", 286 | "\n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | "
 | label | body_text | body_text_clean
0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your...
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL
\n", 328 | "
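The body_text_clean column in the table above is produced by dropping every character found in string.punctuation. A standalone sketch of that step (same idea as the notebook's remove_punct; the str.translate variant is an equivalent, slightly faster alternative):

import string

def remove_punct(text):
    # Keep only characters that are not in string.punctuation.
    return "".join(char for char in text if char not in string.punctuation)

def remove_punct_fast(text):
    # Equivalent approach using a translation table.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punct("I like NLP."))       # I like NLP
print(remove_punct_fast("I like NLP."))  # I like NLP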
" 329 | ], 330 | "text/plain": [ 331 | " label \\\n", 332 | "0 ham \n", 333 | "1 spam \n", 334 | "2 ham \n", 335 | "3 ham \n", 336 | "4 ham \n", 337 | "\n", 338 | " body_text \\\n", 339 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 340 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 341 | "2 Nah I don't think he goes to usf, he lives around here though \n", 342 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 343 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 344 | "\n", 345 | " body_text_clean \n", 346 | "0 Ive been searching for the right words to thank you for this breather I promise i wont take your... \n", 347 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... \n", 348 | "2 Nah I dont think he goes to usf he lives around here though \n", 349 | "3 Even my brother is not like to speak with me They treat me like aids patent \n", 350 | "4 I HAVE A DATE ON SUNDAY WITH WILL " 351 | ] 352 | }, 353 | "execution_count": 5, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "def remove_punct(text):\n", 360 | " text_nopunct = \"\".join([char for char in text if char not in string.punctuation])\n", 361 | " return text_nopunct\n", 362 | "\n", 363 | "data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punct(x))\n", 364 | "\n", 365 | "data.head()" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "### Tokenization" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 6, 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "data": { 382 | "text/html": [ 383 | "
\n", 384 | "\n", 397 | "\n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | "
 | label | body_text | body_text_clean | body_text_tokenized
0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...
2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will]
\n", 445 | "
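Tokenization here is nothing more than a regex split on runs of non-word characters. A standalone sketch; the raw-string prefix r'\W+' avoids the invalid-escape-sequence warning that newer Python versions emit for the bare '\W+' used in the notebook, and the behaviour is otherwise identical:

import re

def tokenize(text):
    # Split on one or more non-word characters (anything outside [a-zA-Z0-9_]).
    return re.split(r'\W+', text)

print(tokenize("i have a date on sunday with will"))
# ['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']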
" 446 | ], 447 | "text/plain": [ 448 | " label \\\n", 449 | "0 ham \n", 450 | "1 spam \n", 451 | "2 ham \n", 452 | "3 ham \n", 453 | "4 ham \n", 454 | "\n", 455 | " body_text \\\n", 456 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 457 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 458 | "2 Nah I don't think he goes to usf, he lives around here though \n", 459 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 460 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 461 | "\n", 462 | " body_text_clean \\\n", 463 | "0 Ive been searching for the right words to thank you for this breather I promise i wont take your... \n", 464 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... \n", 465 | "2 Nah I dont think he goes to usf he lives around here though \n", 466 | "3 Even my brother is not like to speak with me They treat me like aids patent \n", 467 | "4 I HAVE A DATE ON SUNDAY WITH WILL \n", 468 | "\n", 469 | " body_text_tokenized \n", 470 | "0 [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... \n", 471 | "1 [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... \n", 472 | "2 [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] \n", 473 | "3 [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] \n", 474 | "4 [i, have, a, date, on, sunday, with, will] " 475 | ] 476 | }, 477 | "execution_count": 6, 478 | "metadata": {}, 479 | "output_type": "execute_result" 480 | } 481 | ], 482 | "source": [ 483 | "import re\n", 484 | "\n", 485 | "def tokenize(text):\n", 486 | " tokens = re.split('\\W+', text)\n", 487 | " return tokens\n", 488 | "\n", 489 | "data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))\n", 490 | "\n", 491 | "data.head()" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 7, 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "data": { 501 | "text/plain": [ 502 | "False" 503 | ] 504 | }, 505 | "execution_count": 7, 506 | "metadata": {}, 507 | "output_type": "execute_result" 508 | } 509 | ], 510 | "source": [ 511 | "'NLP' == 'nlp'" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "### Remove stopwords" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 8, 524 | "metadata": { 525 | "collapsed": true, 526 | "jupyter": { 527 | "outputs_hidden": true 528 | } 529 | }, 530 | "outputs": [], 531 | "source": [ 532 | "import nltk\n", 533 | "\n", 534 | "stopword = nltk.corpus.stopwords.words('english')" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 9, 540 | "metadata": {}, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/html": [ 545 | "
\n", 546 | "\n", 559 | "\n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | "
 | label | body_text | body_text_clean | body_text_tokenized | body_text_nostop
0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... | [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...
1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] | [nah, dont, think, goes, usf, lives, around, though]
3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] | [even, brother, like, speak, treat, like, aids, patent]
4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] | [date, sunday]
\n", 613 | "
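Stopword removal depends on NLTK's English stopword list, which has to be downloaded once per machine before nltk.corpus.stopwords can be used. A self-contained sketch:

import nltk

nltk.download('stopwords', quiet=True)  # one-time download of the word list
stopword = nltk.corpus.stopwords.words('english')

def remove_stopwords(tokens):
    # Keep only tokens that are not common English function words.
    return [word for word in tokens if word not in stopword]

print(remove_stopwords(['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']))
# ['date', 'sunday'] -- 'will' is dropped too, because it is on the stopword list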
" 614 | ], 615 | "text/plain": [ 616 | " label \\\n", 617 | "0 ham \n", 618 | "1 spam \n", 619 | "2 ham \n", 620 | "3 ham \n", 621 | "4 ham \n", 622 | "\n", 623 | " body_text \\\n", 624 | "0 I've been searching for the right words to thank you for this breather. I promise i wont take yo... \n", 625 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 626 | "2 Nah I don't think he goes to usf, he lives around here though \n", 627 | "3 Even my brother is not like to speak with me. They treat me like aids patent. \n", 628 | "4 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 629 | "\n", 630 | " body_text_clean \\\n", 631 | "0 Ive been searching for the right words to thank you for this breather I promise i wont take your... \n", 632 | "1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... \n", 633 | "2 Nah I dont think he goes to usf he lives around here though \n", 634 | "3 Even my brother is not like to speak with me They treat me like aids patent \n", 635 | "4 I HAVE A DATE ON SUNDAY WITH WILL \n", 636 | "\n", 637 | " body_text_tokenized \\\n", 638 | "0 [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... \n", 639 | "1 [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... \n", 640 | "2 [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] \n", 641 | "3 [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] \n", 642 | "4 [i, have, a, date, on, sunday, with, will] \n", 643 | "\n", 644 | " body_text_nostop \n", 645 | "0 [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom... \n", 646 | "1 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 647 | "2 [nah, dont, think, goes, usf, lives, around, though] \n", 648 | "3 [even, brother, like, speak, treat, like, aids, patent] \n", 649 | "4 [date, sunday] " 650 | ] 651 | }, 652 | "execution_count": 9, 653 | "metadata": {}, 654 | "output_type": "execute_result" 655 | } 656 | ], 657 | "source": [ 658 | "def remove_stopwords(tokenized_list):\n", 659 | " text = [word for word in tokenized_list if word not in stopword]\n", 660 | " return text\n", 661 | "\n", 662 | "data['body_text_nostop'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))\n", 663 | "\n", 664 | "data.head()" 665 | ] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "execution_count": null, 670 | "metadata": { 671 | "collapsed": true, 672 | "jupyter": { 673 | "outputs_hidden": true 674 | } 675 | }, 676 | "outputs": [], 677 | "source": [] 678 | } 679 | ], 680 | "metadata": { 681 | "kernelspec": { 682 | "display_name": "Python 3 (ipykernel)", 683 | "language": "python", 684 | "name": "python3" 685 | }, 686 | "language_info": { 687 | "codemirror_mode": { 688 | "name": "ipython", 689 | "version": 3 690 | }, 691 | "file_extension": ".py", 692 | "mimetype": "text/x-python", 693 | "name": "python", 694 | "nbconvert_exporter": "python", 695 | "pygments_lexer": "ipython3", 696 | "version": "3.11.0" 697 | } 698 | }, 699 | "nbformat": 4, 700 | "nbformat_minor": 4 701 | } 702 | -------------------------------------------------------------------------------- /2. Data Cleaning/2.1. 
stemming.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Supplemental Data Cleaning: Using Stemming" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Test out Porter stemmer" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 10, 20 | "metadata": { 21 | "collapsed": true, 22 | "jupyter": { 23 | "outputs_hidden": true 24 | } 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import nltk\n", 29 | "\n", 30 | "ps = nltk.PorterStemmer()" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 12, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "text/plain": [ 41 | "['MARTIN_EXTENSIONS',\n", 42 | " 'NLTK_EXTENSIONS',\n", 43 | " 'ORIGINAL_ALGORITHM',\n", 44 | " '__abstractmethods__',\n", 45 | " '__class__',\n", 46 | " '__delattr__',\n", 47 | " '__dict__',\n", 48 | " '__dir__',\n", 49 | " '__doc__',\n", 50 | " '__eq__',\n", 51 | " '__format__',\n", 52 | " '__ge__',\n", 53 | " '__getattribute__',\n", 54 | " '__gt__',\n", 55 | " '__hash__',\n", 56 | " '__init__',\n", 57 | " '__init_subclass__',\n", 58 | " '__le__',\n", 59 | " '__lt__',\n", 60 | " '__module__',\n", 61 | " '__ne__',\n", 62 | " '__new__',\n", 63 | " '__reduce__',\n", 64 | " '__reduce_ex__',\n", 65 | " '__repr__',\n", 66 | " '__setattr__',\n", 67 | " '__sizeof__',\n", 68 | " '__str__',\n", 69 | " '__subclasshook__',\n", 70 | " '__unicode__',\n", 71 | " '__weakref__',\n", 72 | " '_abc_cache',\n", 73 | " '_abc_negative_cache',\n", 74 | " '_abc_negative_cache_version',\n", 75 | " '_abc_registry',\n", 76 | " '_apply_rule_list',\n", 77 | " '_contains_vowel',\n", 78 | " '_ends_cvc',\n", 79 | " '_ends_double_consonant',\n", 80 | " '_has_positive_measure',\n", 81 | " '_is_consonant',\n", 82 | " '_measure',\n", 83 | " '_replace_suffix',\n", 84 | " '_step1a',\n", 85 | " '_step1b',\n", 86 | " '_step1c',\n", 87 | " '_step2',\n", 88 | " '_step3',\n", 89 | " '_step4',\n", 90 | " '_step5a',\n", 91 | " '_step5b',\n", 92 | " 'mode',\n", 93 | " 'pool',\n", 94 | " 'stem',\n", 95 | " 'unicode_repr',\n", 96 | " 'vowels']" 97 | ] 98 | }, 99 | "execution_count": 12, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "dir(ps)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 13, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "grow\n", 118 | "grow\n", 119 | "grow\n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "print(ps.stem('grows'))\n", 125 | "print(ps.stem('growing'))\n", 126 | "print(ps.stem('grow'))" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 14, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "name": "stdout", 136 | "output_type": "stream", 137 | "text": [ 138 | "run\n", 139 | "run\n", 140 | "runner\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "print(ps.stem('run'))\n", 146 | "print(ps.stem('running'))\n", 147 | "print(ps.stem('runner'))" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "### Read in raw text" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 15, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "data": { 164 | "text/html": [ 165 | "
\n", 166 | "\n", 179 | "\n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | "
 | label | body_text
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1 | ham | Nah I don't think he goes to usf, he lives around here though
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent.
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!!
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...
\n", 215 | "
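One thing worth flagging about the pd.read_csv call in this cell: it is made without header=None, so the first SMS in the file appears to be consumed as a header row and then overwritten by data.columns = [...], which is why this head() starts at the "Free entry in 2 a wkly comp..." message while the first notebook's data started at "I've been searching...". A hedged sketch of the variant that keeps every record:

import pandas as pd

# header=None marks the file as having no header line, so the first SMS is kept;
# names= assigns the column labels directly.
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None, names=['label', 'body_text'])
print(data.shape)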
" 216 | ], 217 | "text/plain": [ 218 | " label \\\n", 219 | "0 spam \n", 220 | "1 ham \n", 221 | "2 ham \n", 222 | "3 ham \n", 223 | "4 ham \n", 224 | "\n", 225 | " body_text \n", 226 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 227 | "1 Nah I don't think he goes to usf, he lives around here though \n", 228 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 229 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 230 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... " 231 | ] 232 | }, 233 | "execution_count": 15, 234 | "metadata": {}, 235 | "output_type": "execute_result" 236 | } 237 | ], 238 | "source": [ 239 | "import pandas as pd\n", 240 | "import re\n", 241 | "import string\n", 242 | "pd.set_option('display.max_colwidth', 100)\n", 243 | "\n", 244 | "stopwords = nltk.corpus.stopwords.words('english')\n", 245 | "\n", 246 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 247 | "data.columns = ['label', 'body_text']\n", 248 | "\n", 249 | "data.head()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "### Clean up text" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 16, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "data": { 266 | "text/html": [ 267 | "
\n", 268 | "\n", 281 | "\n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | "
 | label | body_text | body_text_nostop
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr...
\n", 323 | "
" 324 | ], 325 | "text/plain": [ 326 | " label \\\n", 327 | "0 spam \n", 328 | "1 ham \n", 329 | "2 ham \n", 330 | "3 ham \n", 331 | "4 ham \n", 332 | "\n", 333 | " body_text \\\n", 334 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 335 | "1 Nah I don't think he goes to usf, he lives around here though \n", 336 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 337 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 338 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 339 | "\n", 340 | " body_text_nostop \n", 341 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 342 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 343 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 344 | "3 [date, sunday] \n", 345 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... " 346 | ] 347 | }, 348 | "execution_count": 16, 349 | "metadata": {}, 350 | "output_type": "execute_result" 351 | } 352 | ], 353 | "source": [ 354 | "def clean_text(text):\n", 355 | " text = \"\".join([word for word in text if word not in string.punctuation])\n", 356 | " tokens = re.split('\\W+', text)\n", 357 | " text = [word for word in tokens if word not in stopwords]\n", 358 | " return text\n", 359 | "\n", 360 | "data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))\n", 361 | "\n", 362 | "data.head()" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "### Stem text" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 17, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "data": { 379 | "text/html": [ 380 | "
\n", 381 | "\n", 394 | "\n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | "
 | label | body_text | body_text_nostop | body_text_stemmed
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... | [free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though] | [nah, dont, think, goe, usf, live, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] | [even, brother, like, speak, treat, like, aid, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... | [per, request, mell, mell, oru, minnaminungint, nurungu, vettam, set, callertun, caller, press, ...
\n", 442 | "
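A compact standalone version of the stemming step. Porter stemming is a rule-based suffix stripper, so stems such as entri, goe and wkli in the table above are not dictionary words; that is expected behaviour, not a bug:

import nltk

ps = nltk.PorterStemmer()

def stemming(tokens):
    # Reduce each token to its Porter stem; stems need not be real words.
    return [ps.stem(word) for word in tokens]

print(stemming(['entry', 'goes', 'lives', 'wkly']))
# ['entri', 'goe', 'live', 'wkli']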
" 443 | ], 444 | "text/plain": [ 445 | " label \\\n", 446 | "0 spam \n", 447 | "1 ham \n", 448 | "2 ham \n", 449 | "3 ham \n", 450 | "4 ham \n", 451 | "\n", 452 | " body_text \\\n", 453 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 454 | "1 Nah I don't think he goes to usf, he lives around here though \n", 455 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 456 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 457 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 458 | "\n", 459 | " body_text_nostop \\\n", 460 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 461 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 462 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 463 | "3 [date, sunday] \n", 464 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... \n", 465 | "\n", 466 | " body_text_stemmed \n", 467 | "0 [free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,... \n", 468 | "1 [nah, dont, think, goe, usf, live, around, though] \n", 469 | "2 [even, brother, like, speak, treat, like, aid, patent] \n", 470 | "3 [date, sunday] \n", 471 | "4 [per, request, mell, mell, oru, minnaminungint, nurungu, vettam, set, callertun, caller, press, ... " 472 | ] 473 | }, 474 | "execution_count": 17, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "def stemming(tokenized_text):\n", 481 | " text = [ps.stem(word) for word in tokenized_text]\n", 482 | " return text\n", 483 | "\n", 484 | "data['body_text_stemmed'] = data['body_text_nostop'].apply(lambda x: stemming(x))\n", 485 | "\n", 486 | "data.head()" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": { 493 | "collapsed": true, 494 | "jupyter": { 495 | "outputs_hidden": true 496 | } 497 | }, 498 | "outputs": [], 499 | "source": [] 500 | } 501 | ], 502 | "metadata": { 503 | "kernelspec": { 504 | "display_name": "Python 3 (ipykernel)", 505 | "language": "python", 506 | "name": "python3" 507 | }, 508 | "language_info": { 509 | "codemirror_mode": { 510 | "name": "ipython", 511 | "version": 3 512 | }, 513 | "file_extension": ".py", 514 | "mimetype": "text/x-python", 515 | "name": "python", 516 | "nbconvert_exporter": "python", 517 | "pygments_lexer": "ipython3", 518 | "version": "3.11.0" 519 | } 520 | }, 521 | "nbformat": 4, 522 | "nbformat_minor": 4 523 | } 524 | -------------------------------------------------------------------------------- /2. Data Cleaning/2.2. 
lemmatizing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Supplemental Data Cleaning: Using a Lemmatizer" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": { 13 | "tags": [] 14 | }, 15 | "source": [ 16 | "### Test out WordNet lemmatizer (read more about WordNet [here](https://wordnet.princeton.edu/))" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": { 23 | "collapsed": true, 24 | "jupyter": { 25 | "outputs_hidden": true 26 | } 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "import nltk\n", 31 | "\n", 32 | "wn = nltk.WordNetLemmatizer()\n", 33 | "ps = nltk.PorterStemmer()" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "data": { 43 | "text/plain": [ 44 | "['__class__',\n", 45 | " '__delattr__',\n", 46 | " '__dict__',\n", 47 | " '__dir__',\n", 48 | " '__doc__',\n", 49 | " '__eq__',\n", 50 | " '__format__',\n", 51 | " '__ge__',\n", 52 | " '__getattribute__',\n", 53 | " '__gt__',\n", 54 | " '__hash__',\n", 55 | " '__init__',\n", 56 | " '__init_subclass__',\n", 57 | " '__le__',\n", 58 | " '__lt__',\n", 59 | " '__module__',\n", 60 | " '__ne__',\n", 61 | " '__new__',\n", 62 | " '__reduce__',\n", 63 | " '__reduce_ex__',\n", 64 | " '__repr__',\n", 65 | " '__setattr__',\n", 66 | " '__sizeof__',\n", 67 | " '__str__',\n", 68 | " '__subclasshook__',\n", 69 | " '__unicode__',\n", 70 | " '__weakref__',\n", 71 | " 'lemmatize',\n", 72 | " 'unicode_repr']" 73 | ] 74 | }, 75 | "execution_count": 3, 76 | "metadata": {}, 77 | "output_type": "execute_result" 78 | } 79 | ], 80 | "source": [ 81 | "dir(wn)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 4, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "mean\n", 94 | "mean\n" 95 | ] 96 | } 97 | ], 98 | "source": [ 99 | "print(ps.stem('meanness'))\n", 100 | "print(ps.stem('meaning'))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 5, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "name": "stdout", 110 | "output_type": "stream", 111 | "text": [ 112 | "meanness\n", 113 | "meaning\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "print(wn.lemmatize('meanness'))\n", 119 | "print(wn.lemmatize('meaning'))" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 6, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "goos\n", 132 | "gees\n" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "print(ps.stem('goose'))\n", 138 | "print(ps.stem('geese'))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 8, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "goose\n", 151 | "goose\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "print(wn.lemmatize('goose'))\n", 157 | "print(wn.lemmatize('geese'))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "### Read in raw text" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 9, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "text/html": [ 175 | "
\n", 176 | "\n", 189 | "\n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | "
 | label | body_text
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1 | ham | Nah I don't think he goes to usf, he lives around here though
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent.
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!!
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...
\n", 225 | "
" 226 | ], 227 | "text/plain": [ 228 | " label \\\n", 229 | "0 spam \n", 230 | "1 ham \n", 231 | "2 ham \n", 232 | "3 ham \n", 233 | "4 ham \n", 234 | "\n", 235 | " body_text \n", 236 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 237 | "1 Nah I don't think he goes to usf, he lives around here though \n", 238 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 239 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 240 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... " 241 | ] 242 | }, 243 | "execution_count": 9, 244 | "metadata": {}, 245 | "output_type": "execute_result" 246 | } 247 | ], 248 | "source": [ 249 | "import pandas as pd\n", 250 | "import re\n", 251 | "import string\n", 252 | "pd.set_option('display.max_colwidth', 100)\n", 253 | "\n", 254 | "stopwords = nltk.corpus.stopwords.words('english')\n", 255 | "\n", 256 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 257 | "data.columns = ['label', 'body_text']\n", 258 | "\n", 259 | "data.head()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "### Clean up text" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 10, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "data": { 276 | "text/html": [ 277 | "
\n", 278 | "\n", 291 | "\n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | "
 | label | body_text | body_text_nostop
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr...
\n", 333 | "
" 334 | ], 335 | "text/plain": [ 336 | " label \\\n", 337 | "0 spam \n", 338 | "1 ham \n", 339 | "2 ham \n", 340 | "3 ham \n", 341 | "4 ham \n", 342 | "\n", 343 | " body_text \\\n", 344 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 345 | "1 Nah I don't think he goes to usf, he lives around here though \n", 346 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 347 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 348 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 349 | "\n", 350 | " body_text_nostop \n", 351 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 352 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 353 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 354 | "3 [date, sunday] \n", 355 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... " 356 | ] 357 | }, 358 | "execution_count": 10, 359 | "metadata": {}, 360 | "output_type": "execute_result" 361 | } 362 | ], 363 | "source": [ 364 | "def clean_text(text):\n", 365 | " text = \"\".join([word for word in text if word not in string.punctuation])\n", 366 | " tokens = re.split('\\W+', text)\n", 367 | " text = [word for word in tokens if word not in stopwords]\n", 368 | " return text\n", 369 | "\n", 370 | "data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))\n", 371 | "\n", 372 | "data.head()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "### Lemmatize text" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 11, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/html": [ 390 | "
\n", 391 | "\n", 404 | "\n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | "
 | label | body_text | body_text_nostop | body_text_lemmatized
0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...
1 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, dont, think, goes, usf, lives, around, though] | [nah, dont, think, go, usf, life, around, though]
2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, brother, like, speak, treat, like, aids, patent] | [even, brother, like, speak, treat, like, aid, patent]
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [date, sunday] | [date, sunday]
4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... | [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, caller, pre...
5 | spam | WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c... | [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170... | [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170...
6 | spam | Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came... | [mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile... | [mobile, 11, month, u, r, entitled, update, latest, colour, mobile, camera, free, call, mobile, ...
7 | ham | I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ... | [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] | [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today]
8 | spam | SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ... | [six, chances, win, cash, 100, 20000, pounds, txt, csh11, send, 87575, cost, 150pday, 6days, 16,... | [six, chance, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6days, 16, t...
9 | spam | URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM... | [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk... | [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk...
\n", 487 | "
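Unlike the stemmer, WordNetLemmatizer only changes a word when it can map it to a WordNet entry, and it treats every word as a noun by default (pos='n'), which is why lives becomes life in the table above. A short sketch; the WordNet corpus has to be downloaded once:

import nltk

nltk.download('wordnet', quiet=True)  # one-time download of the WordNet corpus
wn = nltk.WordNetLemmatizer()

def lemmatizing(tokens):
    # Default pos='n': each token is lemmatized as a noun.
    return [wn.lemmatize(word) for word in tokens]

print(lemmatizing(['goes', 'lives', 'aids']))  # ['go', 'life', 'aid']
print(wn.lemmatize('lives', pos='v'))          # 'live' when treated as a verb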
" 488 | ], 489 | "text/plain": [ 490 | " label \\\n", 491 | "0 spam \n", 492 | "1 ham \n", 493 | "2 ham \n", 494 | "3 ham \n", 495 | "4 ham \n", 496 | "5 spam \n", 497 | "6 spam \n", 498 | "7 ham \n", 499 | "8 spam \n", 500 | "9 spam \n", 501 | "\n", 502 | " body_text \\\n", 503 | "0 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... \n", 504 | "1 Nah I don't think he goes to usf, he lives around here though \n", 505 | "2 Even my brother is not like to speak with me. They treat me like aids patent. \n", 506 | "3 I HAVE A DATE ON SUNDAY WITH WILL!! \n", 507 | "4 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... \n", 508 | "5 WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c... \n", 509 | "6 Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came... \n", 510 | "7 I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ... \n", 511 | "8 SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ... \n", 512 | "9 URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM... \n", 513 | "\n", 514 | " body_text_nostop \\\n", 515 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 516 | "1 [nah, dont, think, goes, usf, lives, around, though] \n", 517 | "2 [even, brother, like, speak, treat, like, aids, patent] \n", 518 | "3 [date, sunday] \n", 519 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr... \n", 520 | "5 [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170... \n", 521 | "6 [mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile... \n", 522 | "7 [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] \n", 523 | "8 [six, chances, win, cash, 100, 20000, pounds, txt, csh11, send, 87575, cost, 150pday, 6days, 16,... \n", 524 | "9 [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk... \n", 525 | "\n", 526 | " body_text_lemmatized \n", 527 | "0 [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... \n", 528 | "1 [nah, dont, think, go, usf, life, around, though] \n", 529 | "2 [even, brother, like, speak, treat, like, aid, patent] \n", 530 | "3 [date, sunday] \n", 531 | "4 [per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, caller, pre... \n", 532 | "5 [winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170... \n", 533 | "6 [mobile, 11, month, u, r, entitled, update, latest, colour, mobile, camera, free, call, mobile, ... \n", 534 | "7 [im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today] \n", 535 | "8 [six, chance, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6days, 16, t... \n", 536 | "9 [urgent, 1, week, free, membership, 100000, prize, jackpot, txt, word, claim, 81010, tc, wwwdbuk... 
" 537 | ] 538 | }, 539 | "execution_count": 11, 540 | "metadata": {}, 541 | "output_type": "execute_result" 542 | } 543 | ], 544 | "source": [ 545 | "def lemmatizing(tokenized_text):\n", 546 | " text = [wn.lemmatize(word) for word in tokenized_text]\n", 547 | " return text\n", 548 | "\n", 549 | "data['body_text_lemmatized'] = data['body_text_nostop'].apply(lambda x: lemmatizing(x))\n", 550 | "\n", 551 | "data.head(10)" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": { 558 | "collapsed": true, 559 | "jupyter": { 560 | "outputs_hidden": true 561 | } 562 | }, 563 | "outputs": [], 564 | "source": [] 565 | } 566 | ], 567 | "metadata": { 568 | "kernelspec": { 569 | "display_name": "Python 3 (ipykernel)", 570 | "language": "python", 571 | "name": "python3" 572 | }, 573 | "language_info": { 574 | "codemirror_mode": { 575 | "name": "ipython", 576 | "version": 3 577 | }, 578 | "file_extension": ".py", 579 | "mimetype": "text/x-python", 580 | "name": "python", 581 | "nbconvert_exporter": "python", 582 | "pygments_lexer": "ipython3", 583 | "version": "3.11.0" 584 | } 585 | }, 586 | "nbformat": 4, 587 | "nbformat_minor": 4 588 | } 589 | -------------------------------------------------------------------------------- /4. Feature Engineering/4.1. Feature Creation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Feature Engineering: Feature Creation" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true, 22 | "jupyter": { 23 | "outputs_hidden": true 24 | } 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import pandas as pd\n", 29 | "\n", 30 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 31 | "data.columns = ['label', 'body_text']" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "### Create feature for text message length" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "data": { 48 | "text/html": [ 49 | "
\n", 50 | "\n", 63 | "\n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | "
 | label | body_text | body_len
0 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 128
1 | ham | Nah I don't think he goes to usf, he lives aro... | 49
2 | ham | Even my brother is not like to speak with me. ... | 62
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28
4 | ham | As per your request 'Melle Melle (Oru Minnamin... | 135
\n", 105 | "
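The body_len feature above is simply the number of non-space characters in a message. A minimal sketch, with a sanity check against one of the rows shown:

def body_len(text):
    # Character count excluding spaces.
    return len(text) - text.count(" ")

print(body_len("I HAVE A DATE ON SUNDAY WITH WILL!!"))  # 28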
" 106 | ], 107 | "text/plain": [ 108 | " label body_text body_len\n", 109 | "0 spam Free entry in 2 a wkly comp to win FA Cup fina... 128\n", 110 | "1 ham Nah I don't think he goes to usf, he lives aro... 49\n", 111 | "2 ham Even my brother is not like to speak with me. ... 62\n", 112 | "3 ham I HAVE A DATE ON SUNDAY WITH WILL!! 28\n", 113 | "4 ham As per your request 'Melle Melle (Oru Minnamin... 135" 114 | ] 115 | }, 116 | "execution_count": 2, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 123 | "\n", 124 | "data.head()" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "### Create feature for % of text that is punctuation" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 3, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/html": [ 142 | "
\n", 143 | "\n", 156 | "\n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | "
 | label | body_text | body_len | punct%
0 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 128 | 4.7
1 | ham | Nah I don't think he goes to usf, he lives aro... | 49 | 4.1
2 | ham | Even my brother is not like to speak with me. ... | 62 | 3.2
3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28 | 7.1
4 | ham | As per your request 'Melle Melle (Oru Minnamin... | 135 | 4.4
\n", 204 | "
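The punct% feature is the share of non-space characters that are punctuation, rounded and expressed as a percentage. A standalone sketch; note also that the histogram cells further down call pyplot.hist with normed=True, which newer Matplotlib releases have removed; density=True is the current equivalent:

import string

def count_punct(text):
    # Percentage of non-space characters that are punctuation marks.
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / (len(text) - text.count(" ")), 3) * 100

print(count_punct("I HAVE A DATE ON SUNDAY WITH WILL!!"))  # about 7.1, matching the row above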
" 205 | ], 206 | "text/plain": [ 207 | " label body_text body_len punct%\n", 208 | "0 spam Free entry in 2 a wkly comp to win FA Cup fina... 128 4.7\n", 209 | "1 ham Nah I don't think he goes to usf, he lives aro... 49 4.1\n", 210 | "2 ham Even my brother is not like to speak with me. ... 62 3.2\n", 211 | "3 ham I HAVE A DATE ON SUNDAY WITH WILL!! 28 7.1\n", 212 | "4 ham As per your request 'Melle Melle (Oru Minnamin... 135 4.4" 213 | ] 214 | }, 215 | "execution_count": 3, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "import string\n", 222 | "\n", 223 | "def count_punct(text):\n", 224 | " count = sum([1 for char in text if char in string.punctuation])\n", 225 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 226 | "\n", 227 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 228 | "\n", 229 | "data.head()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "### Evaluate created features" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 4, 242 | "metadata": { 243 | "collapsed": true, 244 | "jupyter": { 245 | "outputs_hidden": true 246 | } 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "from matplotlib import pyplot\n", 251 | "import numpy as np\n", 252 | "%matplotlib inline" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 5, 258 | "metadata": {}, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAFSdJREFUeJzt3X+M3PWd3/Hn2z+wkxZMz7gRsYE1\nBU62szIExyYqnGQlOHYS4lyAxrTobAUFXYrTwokEfFEQJXe9QNq6VwXlQs4oBNHgK/nlCF84UpM0\nrYDYBnz2hgMW8JU9U+IY5COAwTbv/jHftcbD7s6sdz2zu5/nQ7L2O5/5fHfe853xaz/zmc98JzIT\nSVIZJnW6AElS+xj6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIJM6XQBjU455ZTs\n6urqdBmSNK5s3779N5k5q1m/MRf6XV1dbNu2rdNlSNK4EhF/30o/p3ckqSCGviQVxNCXpIKMuTn9\ngRw8eJC+vj4OHDjQ6VLabvr06cyZM4epU6d2uhRJE8C4CP2+vj5OPPFEurq6iIhOl9M2mcm+ffvo\n6+tj7ty5nS5H0gQwLqZ3Dhw4wMyZM4sKfICIYObMmUW+wpF0fIyL0AeKC/x+pd5vScfHuAl9SdLI\njYs5/UbrH3x6VH/fdRefM6q/T5LGqnEZ+pKaG2pw5ECnXE7vtOi1117jYx/7GAsXLuR973sfGzdu\npKurixtuuIHFixezePFient7Afjxj3/MkiVLOO+88/jwhz/MSy+9BMDNN9/M6tWrWbZsGV1dXXz/\n+9/ni1/8It3d3SxfvpyDBw928i5KKoCh36Kf/OQnvPe972XHjh3s2rWL5cuXA3DSSSfxy1/+krVr\n13LttdcCcOGFF/LII4/w+OOPs2rVKm677bYjv+fZZ5/l/vvv50c/+hFXXnklS5cuZefOnbzrXe/i\n/vvv78h9k1QOQ79F3d3d/PSnP+WGG27gF7/4BTNmzADgiiuuOPLz4YcfBmqfK/jIRz5Cd3c3X/va\n1+jp6Tnye1asWMHUqVPp7u7m8OHDR/54dHd3s3v37vbeKUnFMfRbdM4557B9+3a6u7tZt24dt9xy\nC3D0ksr+7c9//vOsXbuWnTt38s1vfvOodfbTpk0DYNKkSUydOvXIPpMmTeLQoUPtujuSCmXot2jP\nnj28+93v5sorr+T666/nscceA2Djxo1Hfn7wgx8EYP/+/cyePRuAu+66qzMFS9IAxuXqnU6sPNi5\ncydf+MIXjozQv/GNb3DZZZfx5ptvsmTJEt5++22++93vArU3bC+//HJmz57NBRdcwPPPP9/2eiVp\nIJGZna7hKIsWLcrGL1F58sknmTdvXocqGlz/F76ccsopx/V2xur919jmks2yRMT2zFzUrJ/TO5JU\nkHE5vTNWuNpG0njjSF+SCtJS6EfE8oh4KiJ6I+LGAa6fFhEbq+sfjYiuhutPj4jfRsT1o1O2JOlY\nNA39iJgM3A6sAOYDV0TE/IZuVwGvZOZZwHrg1obr1wN/PfJyJUkj0cpIfzHQm5nPZeZbwL3AyoY+\nK4H+Ben3AR+K6lNHEfFJ4DmgB0lSR7XyRu5s4IW6y33AksH6ZOahiNgPzIyIN4AbgIuB0ZvaeejP\nRu1XAbB0XdMuu3fv5uMf/zi7du0a3duWpDZqZaQ/0Fc3NS7uH6zPfwDWZ+Zvh7yBiKsjYltEbNu7\nd28LJUmSjkUrod8HnFZ3eQ6wZ7A+ETEFmAG8TO0VwW0RsRu4FvjjiFjbeAOZeUdmLsrMRbNmzRr2\nnWiXw4cP89nPfpYFCxawbNky3njjDb71rW/xgQ98gIULF3LppZfy+uuvA7BmzRo+97nPsXTpUs48\n80x+/vOf85nPfIZ58+axZs2azt4RScVqJfS3AmdHxNyIOAFYBWxq6LMJWF1tXwZsyZ
qLMrMrM7uA\n/wr8x8z8+ijV3nbPPPMM11xzDT09PZx88sl873vf41Of+hRbt25lx44dzJs3jw0bNhzp/8orr7Bl\nyxbWr1/PJZdcwnXXXUdPTw87d+7kiSee6OA9kVSqpqGfmYeAtcADwJPAX2VmT0TcEhGfqLptoDaH\n3wv8EfCOZZ0Twdy5czn33HMBOP/889m9eze7du3ioosuoru7m3vuueeo0yhfcsklRATd3d285z3v\nobu7m0mTJrFgwQI/2CWpI1r6RG5mbgY2N7TdVLd9ALi8ye+4+RjqG1P6T4sMMHnyZN544w3WrFnD\nD3/4QxYuXMi3v/1tfvazn72j/6RJk47a19MoS+oUP5E7Qq+++iqnnnoqBw8e5J577ul0OZI0pPF5\n7p0Wlli2y1e+8hWWLFnCGWecQXd3N6+++mqnS5KkQXlq5XGg9PuvY+OplcviqZUlSe9g6EtSQcZN\n6I+1aah2KfV+Szo+xkXoT58+nX379hUXgJnJvn37mD59eqdLkTRBjIvVO3PmzKGvr48Sz8szffp0\n5syZ0+kyJE0Q4yL0p06dyty5cztdhiSNe+NiekeSNDoMfUkqiKEvSQUx9CWpIIa+JBXE0Jekghj6\nklQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSqIoS9J\nBTH0Jakghr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+JBXE0JekgrQU+hGx\nPCKeiojeiLhxgOunRcTG6vpHI6Kral8cEU9U/3ZExO+PbvmSpOFoGvoRMRm4HVgBzAeuiIj5Dd2u\nAl7JzLOA9cCtVfsuYFFmngssB74ZEVNGq3hJ0vC0EsCLgd7MfA4gIu4FVgK/quuzEri52r4P+HpE\nRGa+XtdnOpAjrlgSAOsffLrTJWgcamV6ZzbwQt3lvqptwD6ZeQjYD8wEiIglEdED7AT+sLpektQB\nrYR+DNDWOGIftE9mPpqZC4APAOsiYvo7biDi6ojYFhHb9u7d20JJkqRj0Uro9wGn1V2eA+wZrE81\nZz8DeLm+Q2Y+CbwGvK/xBjLzjsxclJmLZs2a1Xr1kqRhaSX0twJnR8TciDgBWAVsauizCVhdbV8G\nbMnMrPaZAhARZwC/C+welcolScPW9I3czDwUEWuBB4DJwJ2Z2RMRtwDbMnMTsAG4OyJ6qY3wV1W7\nXwjcGBEHgbeBf5uZvzked0SS1FxLyyczczOwuaHtprrtA8DlA+x3N3D3CGuUJI0SP5ErSQUx9CWp\nIIa+JBXE0Jekghj6klQQQ1+SCmLoS1JBDH1JKoihL0kFMfQlqSCGviQVxNCXpIIY+pJUEENfkgpi\n6EtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SCGPqSVBBDX5IKYuhLUkEMfUkqiKEvSQUx9CWpIIa+\nJBXE0Jekghj6klQQQ1+SCmLoS1JBpnS6AEkDW//g050uQROQI31JKoihL0kFMfQlqSCGviQVxNCX\npIIY+pJUEENfkgrS0jr9iFgO/DkwGfjLzPxqw/XTgO8A5wP7gE9n5u6IuBj4KnAC8BbwhczcMor1\njy0P/dnQ1y9d1546JGkQTUf6ETEZuB1YAcwHroiI+Q3drgJeycyzgPXArVX7b4BLMrMbWA3cPVqF\nS5KGr5XpncVAb2Y+l5lvAfcCKxv6rATuqrbvAz4UEZGZj2fmnqq9B5hevSqQJHVAK6E/G3ih7nJf\n1TZgn8w8BOwHZjb0uRR4PDPfPLZSJUkj1cqcfgzQlsPpExELqE35LBvwBiKuBq4GOP3001soSZJ0\nLFoZ6fcBp9VdngPsGaxPREwBZgAvV5fnAD8A/iAznx3oBjLzjsxclJmLZs2aNbx7IElqWSsj/a3A\n2RExF/gHYBXwrxv6bKL2Ru3DwGXAlszMiDgZuB9Yl5n/Z/TKHqeGWt3jyh5JbdA09DPzUESsBR6g\ntmTzzszsiYhbgG2ZuQnYANwdEb3URvirqt3XAmcBX46IL1dtyzLz16N9RyS1rtlpm6+7+Jw2VaJ2\na2mdfmZuBjY3tN1Ut30AuHyA/f4E+JMR1ihJGiV+IleSCmLoS1JBDH1JKoihL0kFMfQlqSCGviQV\nxNCXpIK0tE5fbeC5+CW1gaE/XvhHQdIocHpHkgpi6EtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SC\nuE5/OJqtlZekMc6RviQVxJG+1CHNvqdWOh4c6UtSQQx9SSqIoS9JBTH0Jakghr4kFcTQl6SCGPqS\nVBBDX5IKYuhLUkEMfUkqiKdhkI4TT7OgsciRviQVxNCXpIIY+pJUEENfkgpi6EtSQQx9SSqIoS9J\nBXGd/kTR7Evbl65rTx2SxjRH+pJUkJZCPyKWR8RTEdEbETcOcP20iNhYXf9oRHRV7TMj4qGI+G1E\nfH10S5ckDVfT0I+IycDtwApgPnBFRMxv6HYV8EpmngWsB26t2g8AXwauH7WKJUnHrJWR/mKgNzOf\ny8y3gHuBlQ19VgJ3Vdv3AR+KiMjM1zLzf1MLf0lSh7US+rOBF+ou91VtA/bJzEPAfmDmaBQoSRo9\nrYR+DNCWx9Bn8BuIuDoitkXEtr1797a6myRpmFoJ/T7gtLrLc4A9g/WJiCnADODlVovIzDsyc1Fm\nLpo1a1aru0mShqmV0N8KnB0RcyPiBGAVsKmhzyZgdbV9GbAlM1se6UuS2qPph7My81BErAUeACYD\nd2ZmT0TcAmzLzE3ABuDuiOilNsJf1b9/ROwGTgJOiIhPAssy81ejf1ckSc209InczNwMbG5ou6lu\n+wBw+SD7do2gPknSKPI0DKXwNA2S8DQMklQUQ1+SCmLoS1JBnNNXTbM5/6H4foA0bjjSl6SCGPqS\nVBBDX5IKYuhLUkEMfUkqiKEvSQVxyaZGzlM8SOOGI31JKogjfWkI6x98esjrr7v4nDZVIo0OQ18a\ngWZ/FKSxxukdSSqII30dfx18o9fpGelojvQlqSCGviQVxOkddZ7r/KW2MfRVNFffqDSGvjRGXfB/\n7xjy+kdOv7pNlWgiMfQ19g01/ePUjzQshr40Th3PVwIudZ24XL0jSQVxpK+mHn5u35DXf/DMmW2q\n5J2ajUidF5eOZuhrXGsW6iPdv9kfhaH2H8m+0vFi6EtDGEkwG+oai5zTl6SCGPqSVBCnd+o1Ox2A\nJI1zjvQlqSCO9KUJaiQrizRxGfrjxFheK3+8NbvvklpXVug7Zz8mGepS+5QV+jouDO3yeG6e8cvQ\nlwrk6SnKNfFCv9ApnJHO+Ttal8ow8UJ/DBsqWCfyG7GaeJqfYuI/taUODV9LoR8Ry4E/ByYDf5mZ\nX224fhrwHeB8YB/w6czcXV23DrgKOAz8u8x8YNSqn0AcaWsicc5/7Goa+hExGbgduBjoA7ZGxKbM\n/FVdt6uAVzLzrIhYBdwKfDoi5gOrgAXAe
4GfRsQ5mXl4tO9IO5S8bFJl8WRxE1crI/3FQG9mPgcQ\nEfcCK4H60F8J3Fxt3wd8PSKiar83M98Eno+I3ur3PTw65Usai5r+0XhoBAOksfwVmc3eUxwDtbcS\n+rOBF+ou9wFLBuuTmYciYj8ws2p/pGHf2cdc7XE2kadYJvJ90/gzove3mgRrJ1+RN71tOv9HoZXQ\njwHassU+rexLRFwN9K8R+21EPNVCXYM5BfjNCPY/XqxreKxreKxreMZoXX88krrOaKVTK6HfB5xW\nd3kOsGeQPn0RMQWYAbzc4r5k5h3AqEwiRsS2zFw0Gr9rNFnX8FjX8FjX8JRcVytn2dwKnB0RcyPi\nBGpvzG5q6LMJWF1tXwZsycys2ldFxLSImAucDfxydEqXJA1X05F+NUe/FniA2pLNOzOzJyJuAbZl\n5iZgA3B39Ubty9T+MFD1+ytqb/oeAq4Zryt3JGkiaGmdfmZuBjY3tN1Ut30AuHyQff8U+NMR1Dhc\nY3WtmXUNj3UNj3UNT7F1RW0WRpJUAr85S5IKMmFCPyKWR8RTEdEbETd2sI7TIuKhiHgyInoi4t9X\n7TdHxD9ExBPVv492oLbdEbGzuv1tVdvvRMSDEfFM9fOftbmm3607Jk9ExD9GxLWdOF4RcWdE/Doi\ndtW1DXh8oua/Vc+3v42I97e5rq9FxN9Vt/2DiDi5au+KiDfqjttftLmuQR+3iFhXHa+nIuIjba5r\nY11NuyPiiaq9ncdrsGxo73MsM8f9P2pvMD8LnAmcAOwA5neollOB91fbJwJPA/OpfWL5+g4fp93A\nKQ1ttwE3Vts3Ard2+HH8f9TWG7f9eAG/B7wf2NXs+AAfBf6a2mdRLgAebXNdy4Ap1fatdXV11ffr\nwPEa8HGr/g/sAKYBc6v/r5PbVVfD9f8ZuKkDx2uwbGjrc2yijPSPnCoiM98C+k8V0XaZ+WJmPlZt\nvwo8yRj+FDK143RXtX0X8MkO1vIh4NnM/PtO3Hhm/i9qq8/qDXZ8VgLfyZpHgJMj4tR21ZWZf5OZ\nh6qLj1D7DExbDXK8BnPklCyZ+TzQf0qWttYVEQH8K+C7x+O2hzJENrT1OTZRQn+gU0V0PGgjogs4\nD3i0alpbvUy7s93TKJUE/iYitkftU9AA78nMF6H2pAT+eQfq6reKo/8zdvp4weDHZyw95z5DbUTY\nb25EPB4RP4+IizpQz0CP21g5XhcBL2XmM3VtbT9eDdnQ1ufYRAn9lk730E4R8U+B7wHXZuY/At8A\n/gVwLvAitZeY7fYvM/P9wArgmoj4vQ7UMKCoffDvE8D/qJrGwvEayph4zkXEl6h9BuaequlF4PTM\nPA/4I+C/R8RJbSxpsMdtTBwv4AqOHli0/XgNkA2Ddh2gbcTHbKKEfkune2iXiJhK7UG9JzO/D5CZ\nL2Xm4cx8G/gWx+ml7VAyc0/189fAD6oaXup/yVj9/HW766qsAB7LzJeqGjt+vCqDHZ+OP+ciYjXw\nceDfZDUJXE2f7Ku2t1ObO2/byeuHeNzGwvGaAnwK2Njf1u7jNVA20Obn2EQJ/VZOFdEW1ZzhBuDJ\nzPwvde31c3G/D+xq3Pc41/VPIuLE/m1qbwTu4uhTaKwGftTOuuocNQLr9PGqM9jx2QT8QbXC4gJg\nf/9L9HaI2hcb3QB8IjNfr2ufFbXvwCAizqR26pPn2ljXYI/bWDgly4eBv8vMvv6Gdh6vwbKBdj/H\n2vGudTv+UXun+2lqf6m/1ME6LqT2EuxvgSeqfx8F7gZ2Vu2bgFPbXNeZ1FZP7AB6+o8RtVNg/0/g\nmern73TgmL2b2jeuzahra/vxovZH50XgILVR1lWDHR9qL71vr55vO4FFba6rl9p8b/9z7C+qvpdW\nj+8O4DHgkjbXNejjBnypOl5PASvaWVfV/m3gDxv6tvN4DZYNbX2O+YlcSSrIRJnekSS1wNCXpIIY\n+pJUEENfkgpi6EtSQQx9SSqIoS9JBTH0Jakg/x8m8I8uRmS2CQAAAABJRU5ErkJggg==\n", 263 | "text/plain": [ 264 | "" 265 | ] 266 | }, 267 | "metadata": {}, 268 | "output_type": "display_data" 269 | } 270 | ], 271 | "source": [ 272 | "bins = np.linspace(0, 200, 40)\n", 273 | "\n", 274 | "pyplot.hist(data[data['label']=='spam']['body_len'], bins, alpha=0.5, normed=True, label='spam')\n", 275 | "pyplot.hist(data[data['label']=='ham']['body_len'], bins, alpha=0.5, normed=True, label='ham')\n", 276 | "pyplot.legend(loc='upper left')\n", 277 | "pyplot.show()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 6, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "data": { 287 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAGI9JREFUeJzt3X+Q1PWd5/Hnix+CF6NGnFjKQGYs\nsQrIRLOOg9aqF0xChovKVoQLZK2FixXuspLbuBsVUndocFOJyd6yW6WVkkRPYjTgGbMh51yIiuel\ntlAH/DWMrHEkHHRIKUHiagzCwPv+6C9c0xno78z0TDP9eT2qKPr7+X6+335/yvbVXz797U8rIjAz\nszSMqnUBZmY2fBz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQsbUuoBy\nZ555ZjQ1NdW6DDOzEWXz5s2/jYiGSv1OuNBvampi06ZNtS7DzGxEkfR/8/Tz9I6ZWUIc+mZmCXHo\nm5kl5ISb0zczy+PAgQMUCgX27dtX61KG1fjx42lsbGTs2LEDOt6hb2YjUqFQ4P3vfz9NTU1IqnU5\nwyIi2LNnD4VCgebm5gGdw9M7ZjYi7du3jwkTJiQT+ACSmDBhwqD+dZMr9CW1S3pFUo+kpX3sv0LS\nc5J6Jc0t2zdZ0s8lbZX0sqSmAVdrZlYipcA/bLBjrhj6kkYDdwGzgWnAAknTyrrtABYBD/Zxiu8D\n346IqUAb8MZgCjYzs4HLM6ffBvRExDYASWuAOcDLhztExPZs36HSA7M3hzER8VjW753qlG1mdrSV\nj/2yque78ZPnV/V8J4o8oT8R2FmyXQBm5Dz/+cDvJD0CNAOPA0sj4mC/qhwmlV409foiMLN05JnT\n72sCKXKefwxwOfAV4GLgXIrTQEc/gbRY0iZJm3bv3p3z1GZmtfX73/+eT3/601xwwQV8+MMfZu3a\ntTQ1NXHLLbfQ1tZGW1sbPT09APz0pz9lxowZfPSjH+UTn/gEr7/+OgC33XYbCxcuZNasWTQ1NfHI\nI49w880309LSQnt7OwcOHKhqzXlCvwBMKtluBHblPH8BeD4itkVEL/BPwJ+Ud4qIVRHRGhGtDQ0V\n1wsyMzsh/OxnP+Occ87hxRdfZMuWLbS3twNw6qmn8uyzz7JkyRK+/OUvA3DZZZfx9NNP8/zzzzN/\n/ny+9a1vHTnPa6+9xqOPPspPfvITrrvuOmbOnElXVxcnn3wyjz76aFVrzhP6ncAUSc2STgLmA+ty\nnr8T+ICkw0l+JSWfBZiZjWQtLS08/vjj3HLLLfziF7/gtNNOA2DBggVH/t64cSNQ/F7Bpz71KVpa\nWvj2t79Nd3f3kfPMnj2bsWPH0tLSwsGDB4+8ebS0tLB9+/aq1lwx9LMr9CXAemAr8FBEdEtaIeka\nAEkXSyoA84C7JXVnxx6kOLXzhKQuilNF363qCMzMauT8889n8+bNtLS0sGzZMlasWAEcfVvl4cdf\n+tKXWLJkCV1dXdx9991H3Ws/btw4AEaNGsXYsWOPHDNq1Ch6e3urWnOub+RGRAfQUda2vORxJ8Vp\nn76OfQz4yCBqNDM7Ie3atYszzjiD6667jlNOOYX77rsPgLVr17J06VLWrl3LpZdeCsBbb73FxIkT\nAVi9enWtSvYyDGZWH2pxd11XVxc33XTTkSv073znO8ydO5f33nuPGTNmcOjQIX74wx8CxQ9s582b\nx8SJE7nkkkv41a9+Nez1Aigi7404w6O1tTVq9SMqvmXTbOTYunUrU6dOrXUZf+TwD0GdeeaZQ/Yc\nfY1d0uaIaK10rNfeMTNLiKd3zMyqqNp321Sbr/TNzBLi0DczS4hD38wsIQ59M7OE+INcM6sPT36j\nuuebuaxil+3bt3PVVVexZcuW6j73EPKVvplZQhz6ZmaDcPDgQb7whS8wffp0Zs2axR/+8Ae++93v\ncvHFF3PBBRdw7bXX8u677wKwaNEivvjFLzJz5kzOPfdcnnrqKT7/+c8zdepUFi1aNCz1OvTNzAbh\n1Vdf5YYbbqC7u5vTTz+dH/3oR3zmM5+hs7OTF198kalTp3LPPfcc6b937142bNjAypUrufrqq7nx\nxhvp7u6mq6uLF154YcjrdeibmQ1Cc3MzF154IQAXXXQR27dvZ8uWLVx++eW0tLTwwAMPHLWM8tVX\nX40kWlpaOOuss2hpaWHUqFFMnz59WL7Y5dA3MxuEw8siA4wePZre3l4WLVrEnXfeSVdXF7feeusx\nl1EuPXYollHui0PfzKzK3n77bc4++2wOHDjAAw88UOtyjuJbNs2sPuS4xXK43H777cyYMYMPfehD\ntLS08Pbbb9e6pCO8tHIJL61sNnKcqEsrD4chX1pZUrukVyT1SFrax/4rJD0nqVfS3D72nyrp15Lu\nzPN8ZmY2NCqGvqTRwF3AbGAasEDStLJuO4BFwIPHOM3twFMDL9PMzKohz5V+G9ATEdsiYj+wBphT\n2iEitkfES8Ch8oMlXQScBfy8CvWamR1xok1PD4fBjjlP6E8EdpZsF7K2iiSNAv4bcFP/SzMzO7bx\n48ezZ8+epII/ItizZw/jx48f8Dny3L2jvp475/n/EuiIiJ1SX6fJnkBaDCwGmDx5cs5Tm1nKGhsb\nKRQK7N69u9alDKvx48fT2Ng44OPzhH4BmFSy3Qjsynn+S4HLJf0lcApwkqR3IuKoD4MjYhWwCop3\n7+Q8t5klbOzYsTQ3N9e6jBEnT+h3AlMkNQO/BuYDn8tz8oj488OPJS0CWssD38zMhk/FOf2I6AWW\nAOuBrcBDEdEtaYWkawAkXSypAMwD7pbUfewzmplZreT6Rm5EdAAdZW3LSx53Upz2Od457gPu63eF\nZmZWNV57x8wsIQ59M7OEOPTNzBLiVTaryAu2mdmJzlf6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJ\nceibmSXEoW9mlhCHvplZQvzlrH6o9OUrM7MTna/0zcwS4tA3M0tIUtM7np4xs9T5St/MLCG5Ql9S\nu6RXJPVI+qPfuJV0haTnJPVKmlvSfqGkjZK6Jb0k6bPVLN7MzPqnYuhLGg3cBcwGpgELJE0r67YD\nWAQ8WNb+LvAXETEdaAf+QdLpgy3azMwGJs+cfhvQExHbACStAeYALx/uEBHbs32HSg+MiF+WPN4l\n6Q2gAfjdoCs3M7N+yzO9MxHYWbJdyNr6RVIbcBLwWh/7FkvaJGnT7t27+3tqMzPLKU/oq4+26M+T\nSDobuB/4DxFxqHx/RKyKiNaIaG1oaOjPqc3MrB/yhH4BmFSy3QjsyvsEkk4FHgX+S0Q83b/yzMys\nmvKEficwRVKzpJOA+cC6PCfP+v8Y+H5E/I+Bl2lmZtVQMfQjohdYAqwHtgIPRUS3pBWSrgGQdLGk\nAjAPuFtSd3b4vweuABZJeiH7c+GQjMTMzCrK9Y3ciOgA
Osralpc87qQ47VN+3A+AHwyyRjMzqxJ/\nI9fMLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCFJrac/WJfsWHXc/U9PXjxMlZiZDYyv9M3MEuLQNzNL\niEPfzCwhDn0zs4Q49M3MEuLQNzNLiEPfzCwhvk+/RKX78M3MRjpf6ZuZJcShb2aWEIe+mVlCcoW+\npHZJr0jqkbS0j/1XSHpOUq+kuWX7Fkp6NfuzsFqFm5lZ/1UMfUmjgbuA2cA0YIGkaWXddgCLgAfL\njj0DuBWYAbQBt0r6wODLNjOzgchzpd8G9ETEtojYD6wB5pR2iIjtEfEScKjs2E8Bj0XEmxGxF3gM\naK9C3WZmNgB5Qn8isLNku5C15ZHrWEmLJW2StGn37t05T21mZv2VJ/TVR1vkPH+uYyNiVUS0RkRr\nQ0NDzlObmVl/5Qn9AjCpZLsR2JXz/IM51szMqixP6HcCUyQ1SzoJmA+sy3n+9cAsSR/IPsCdlbWZ\nmVkNVAz9iOgFllAM663AQxHRLWmFpGsAJF0sqQDMA+6W1J0d+yZwO8U3jk5gRdZmZmY1kGvtnYjo\nADrK2paXPO6kOHXT17H3AvcOokYzM6sSfyPXzCwhDn0zs4Q49M3MEuLQNzNLiEPfzCwhDn0zs4Q4\n9M3MEuLQNzNLiEPfzCwhDn0zs4TkWobBqmPlY7885r4bP3n+MFZiZqnylb6ZWUIc+mZmCfH0ThVd\nsmPVcfc/PXnxMFViZtY3X+mbmSXEoW9mlpBcoS+pXdIrknokLe1j/zhJa7P9z0hqytrHSlotqUvS\nVknLqlu+mZn1R8XQlzQauAuYDUwDFkiaVtbtemBvRJwHrATuyNrnAeMiogW4CPiPh98QzMxs+OW5\n0m8DeiJiW0TsB9YAc8r6zAFWZ48fBj4uSUAA75M0BjgZ2A/8a1UqNzOzfssT+hOBnSXbhaytzz7Z\nD6m/BUyg+Abwe+A3wA7g7/zD6GZmtZMn9NVHW+Ts0wYcBM4BmoG/kXTuHz2BtFjSJkmbdu/enaMk\nMzMbiDz36ReASSXbjcCuY/QpZFM5pwFvAp8DfhYRB4A3JP0z0ApsKz04IlYBqwBaW1vL31D658lv\nHGfntYM6tZnZSJfnSr8TmCKpWdJJwHxgXVmfdcDC7PFcYENEBMUpnStV9D7gEuBfqlO6mZn1V8XQ\nz+bolwDrga3AQxHRLWmFpGuybvcAEyT1AH8NHL6t8y7gFGALxTeP/x4RL1V5DGZmllOuZRgiogPo\nKGtbXvJ4H8XbM8uPe6evdjMzqw1/I9fMLCEOfTOzhDj0zcwS4tA3M0uIQ9/MLCEOfTOzhDj0zcwS\n4tA3M0uIQ9/MLCEOfTOzhORahsGq45Idq46z9++GrQ4zS5ev9M3MEuLQNzNLiEPfzCwhDn0zs4Q4\n9M3MEuLQNzNLSK7Ql9Qu6RVJPZKW9rF/nKS12f5nJDWV7PuIpI2SuiV1SRpfvfLNzKw/Koa+pNEU\nf+t2NjANWCBpWlm364G9EXEesBK4Izt2DPAD4D9FxHTgY8CBqlVvZmb9kudKvw3oiYhtEbEfWAPM\nKeszB1idPX4Y+LgkAbOAlyLiRYCI2BMRB6tTupmZ9Vee0J8I7CzZLmRtffaJiF7gLWACcD4QktZL\nek7SzYMv2czMBirPMgzqoy1y9hkDXAZcDLwLPCFpc0Q8cdTB0mJgMcDkyZNzlGRmZgOR50q/AEwq\n2W4Edh2rTzaPfxrwZtb+VET8NiLeBTqAPyl/gohYFRGtEdHa0NDQ/1GYmVkueUK/E5giqVnSScB8\nYF1Zn3XAwuzxXGBDRASwHviIpH+TvRn8W+Dl6pRuZmb9VXF6JyJ6JS2hGOCjgXsjolvSCmBTRKwD\n7gHul9RD8Qp/fnbsXkl/T/GNI4COiHh0iMYysj35jePvn7lseOows7qWa2nliOigODVT2ra85PE+\nYN4xjv0Bxds2zcysxvyNXDOzhDj0zcwS4tA3M0uIQ9/MLCFJ/Ubu8X+j1sys/vlK38wsIQ59M7OE\nOPTNzBLi0DczS4hD38wsIQ59M7OEOPTNzBLi0DczS4hD38wsIUl9I3dE83r7ZlYFDv0TxMZte467\n/9JzJxz/BH5TMLMcPL1jZpaQXKEvqV3SK5J6JC3tY/84SWuz/c9IairbP1nSO5K+Up2yzcxsICqG\nvqTRwF3AbGAasEDStLJu1wN7I+I8YCVwR9n+lcD/Gny5ZmY2GHmu9NuAnojYFhH7gTXAnLI+c4DV\n2eOHgY9LEoCkPwO2Ad3VKdnMzAYqT+hPBHaWbBeytj77REQv8BYwQdL7gFuArx3vCSQtlrRJ0qbd\nu3fnrd3MzPopT+irj7bI2edrwMqIeOd4TxARqyKiNSJaGxoacpRkZmYDkeeWzQIwqWS7Edh1jD4F\nSWOA04A3gRnAXEnfAk4HDknaFxF3DrryY6h066OZWcryhH4nMEVSM/BrYD7wubI+64CFwEZgLrAh\nIgK4/HAHSbcB7wxl4JuZ2fFVDP2I6JW0BFgPjAbujYhuSSuATRGxDrgHuF9SD8Ur/PlDWbSZmQ1M\nrm/kRkQH0FHWtrzk8T5gXoVz3DaA+iwz6G/smpnhb+SamSXFoW9mlhCHvplZQhz6ZmYJceibmSXE\noW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJceibmSXEoW9mlhCHvplZQhz6ZmYJ\nyfUjKpLagX+k+MtZ34uIb5btHwd8H7gI2AN8NiK2S/ok8E3gJGA/cFNEbKhi/ZbXk984/v6Zy4an\nDjOrqYpX+pJGA3cBs4FpwAJJ08q6XQ/sjYjzgJXAHVn7b4GrI6KF4m/o3l+tws3MrP/yTO+0AT0R\nsS0i9gNrgDllfeYAq7PHDwMfl6SIeD4idmXt3cD47F8FZmZWA3lCfyKws2S7kLX12ScieoG3gPIf\nbb0WeD4i3htYqWZmNlh55vTVR1v0p4+k6RSnfGb1+QTSYmAxwOTJk3OUZGZmA5HnSr8ATCrZbgR2\nHauPpDHAacCb2XYj8GPgLyLitb6eICJWRURrRLQ2NDT0bwRmZpZbniv9TmCKpGbg18B84HNlfdZR\n/KB2IzAX2BARIel04FFgWUT8c/XKtqrz3T1mSah4pZ/N0S8B1gNbgYciolvSCknXZN3uASZI6gH+\nGliatS8BzgP+q6QXsj8frPoozMwsl1z36UdEB9BR1ra85PE+YF4fx/0t8LeDrNHMzKokV+ibDYqn\njsxOGA59y+d4we3QNhsxHPo2eJWu5M3shOEF18zMEuLQNzNLiEPfzCwhntOvExu37Tnu/kvPLV8K\nycxS5Ct9M7OEOPTNzBLi0DczS4hD38wsIf4g12rPyzSYDRtf6ZuZJcShb2aWEE/vJKLSffyV1PQ+\n/8FM/3jqyOwoDn0b+bzgm1lunt4xM0tIrit9Se3APwKjge9FxDfL9o8Dvg9cBOwBPhsR27N9y4Dr\ngYPAf46I9VW
r3mywBvuvBE8P2QhTMfQljQbuAj4JFIBOSesi4uWSbtcDeyPiPEnzgTuAz0qaRvGH\n1KcD5wCPSzo/Ig5WeyBmNTGYN41Kbxj+PMKGQJ4r/TagJyK2AUhaA8wBSkN/DnBb9vhh4E5JytrX\nRMR7wK+yH05vAzZWp3wbLoP9IPh4Kn1IPJjnruuF5k7kf6X4DeuElSf0JwI7S7YLwIxj9YmIXklv\nAROy9qfLjp044GrN6smJ/AF0rUN7MD/PWevaj+cEqC1P6KuPtsjZJ8+xSFoMLM4235H0So66juVM\n4LeDOH4kSm3MqY0XhmTMX63RsbmPP8aYh+W5a+Srg/nv/KE8nfKEfgGYVLLdCOw6Rp+CpDHAacCb\nOY8lIlYBq/IUXImkTRHRWo1zjRSpjTm18YLHnIrhGHOeWzY7gSmSmiWdRPGD2XVlfdYBC7PHc4EN\nERFZ+3xJ4yQ1A1OAZ6tTupmZ9VfFK/1sjn4JsJ7iLZv3RkS3pBXApohYB9wD3J99UPsmxTcGsn4P\nUfzQtxe4wXfumJnVTq779COiA+goa1te8ngfMO8Yx34d+PogauyvqkwTjTCpjTm18YLHnIohH7OK\nszBmZpYCL8NgZpaQugl9Se2SXpHUI2lpresZCpLulfSGpC0lbWdIekzSq9nfH6hljdUmaZKkJyVt\nldQt6a+y9rodt6Txkp6V9GI25q9l7c2SnsnGvDa7saJuSBot6XlJ/zPbruvxAkjaLqlL0guSNmVt\nQ/rarovQL1kqYjYwDViQLQFRb+4D2svalgJPRMQU4Ilsu570An8TEVOBS4Absv+29Tzu94ArI+IC\n4EKgXdIlFJc3WZmNeS/F5U/qyV8BW0u26328h82MiAtLbtUc0td2XYQ+JUtFRMR+4PBSEXUlIv4P\nxbujSs0BVmePVwN/NqxFDbGI+E1EPJc9fptiKEykjscdRe9km2OzPwFcSXGZE6izMUtqBD4NfC/b\nFnU83gqG9LVdL6Hf11IRqSz3cFZE/AaKAQl8sMb1DBlJTcBHgWeo83FnUx0vAG8AjwGvAb+LiN6s\nS729xv8BuBk4lG1PoL7He1gAP5e0OVuZAIb4tV0vP6KSa7kHG7kknQL8CPhyRPxr8UKwfmXfZ7lQ\n0unAj4GpfXUb3qqGhqSrgDciYrOkjx1u7qNrXYy3zJ9GxC5JHwQek/QvQ/2E9XKln2u5hzr1uqSz\nAbK/36hxPVUnaSzFwH8gIh7Jmut+3AAR8Tvgf1P8POP0bJkTqK/X+J8C10jaTnFq9kqKV/71Ot4j\nImJX9vcbFN/c2xji13a9hH6epSLqVekSGAuBn9SwlqrL5nbvAbZGxN+X7KrbcUtqyK7wkXQy8AmK\nn2U8SXGZE6ijMUfEsohojIgmiv/vboiIP6dOx3uYpPdJev/hx8AsYAtD/Nqumy9nSfp3FK8ODi8V\nMZzfAh4Wkn4IfIzi6oOvA7cC/wQ8BEwGdgDzIqL8w94RS9JlwC+ALv7/fO9XKc7r1+W4JX2E4gd4\noylemD0UESsknUvxSvgM4Hnguuy3KupGNr3zlYi4qt7Hm43vx9nmGODBiPi6pAkM4Wu7bkLfzMwq\nq5fpHTMzy8Ghb2aWEIe+mVlCHPpmZglx6JuZJcShb2aWEIe+mVlCHPpmZgn5fyjZgnDU1A4AAAAA\nAElFTkSuQmCC\n", 288 | "text/plain": [ 289 | "" 290 | ] 291 | }, 292 | "metadata": {}, 293 | "output_type": "display_data" 294 | } 295 | ], 296 | "source": [ 297 | "bins = np.linspace(0, 50, 40)\n", 298 | "\n", 299 | "pyplot.hist(data[data['label']=='spam']['punct%'], bins, alpha=0.5, normed=True, label='spam')\n", 300 | "pyplot.hist(data[data['label']=='ham']['punct%'], bins, alpha=0.5, normed=True, label='ham')\n", 301 | "pyplot.legend(loc='upper right')\n", 302 | "pyplot.show()" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "collapsed": true, 310 | "jupyter": { 311 | "outputs_hidden": true 312 | } 313 | }, 314 | "outputs": [], 315 | "source": [] 316 | } 317 | ], 318 | "metadata": { 319 | "kernelspec": { 320 | "display_name": "Python 3 (ipykernel)", 321 | "language": "python", 322 | "name": "python3" 323 | }, 324 | "language_info": { 325 | "codemirror_mode": { 326 | "name": "ipython", 327 | "version": 3 328 | }, 329 | "file_extension": ".py", 330 | "mimetype": "text/x-python", 331 | "name": "python", 332 | "nbconvert_exporter": "python", 333 | "pygments_lexer": "ipython3", 334 | "version": "3.11.0" 335 | } 336 | }, 337 | "nbformat": 4, 338 | "nbformat_minor": 4 339 | } 340 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.1. 
Building a basic Random Forest Model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Building a basic Random Forest model" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in & clean text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/html": [ 25 | "
\n", 26 | "\n", 39 | "\n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01284.70.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1494.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2623.20.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3287.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41354.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 189 | "

5 rows × 8106 columns

\n", 190 | "
" 191 | ], 192 | "text/plain": [ 193 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 194 | "0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 195 | "1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 196 | "2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 197 | "3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 198 | "4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 199 | "\n", 200 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 201 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 202 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 203 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 204 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 205 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 206 | "\n", 207 | "[5 rows x 8106 columns]" 208 | ] 209 | }, 210 | "execution_count": 2, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "import nltk\n", 217 | "import pandas as pd\n", 218 | "import re\n", 219 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 220 | "import string\n", 221 | "\n", 222 | "stopwords = nltk.corpus.stopwords.words('english')\n", 223 | "ps = nltk.PorterStemmer()\n", 224 | "\n", 225 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 226 | "data.columns = ['label', 'body_text']\n", 227 | "\n", 228 | "def count_punct(text):\n", 229 | " count = sum([1 for char in text if char in string.punctuation])\n", 230 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 231 | "\n", 232 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 233 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 234 | "\n", 235 | "def clean_text(text):\n", 236 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 237 | " tokens = re.split('\\W+', text)\n", 238 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 239 | " return text\n", 240 | "\n", 241 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 242 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 243 | "\n", 244 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 245 | "X_features.head()" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### Explore RandomForestClassifier Attributes & Hyperparameters" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 5, 258 | "metadata": { 259 | "collapsed": true, 260 | "jupyter": { 261 | "outputs_hidden": true 262 | } 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "from sklearn.ensemble import RandomForestClassifier" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 6, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_estimator_type', '_get_param_names', '_make_estimator', '_set_oob_score', '_validate_X_predict', 
'_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params']\n", 279 | "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", 280 | " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", 281 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 282 | " min_samples_leaf=1, min_samples_split=2,\n", 283 | " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", 284 | " oob_score=False, random_state=None, verbose=0,\n", 285 | " warm_start=False)\n" 286 | ] 287 | } 288 | ], 289 | "source": [ 290 | "print(dir(RandomForestClassifier))\n", 291 | "print(RandomForestClassifier())" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "### Explore RandomForestClassifier through Cross-Validation" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 11, 304 | "metadata": { 305 | "collapsed": true, 306 | "jupyter": { 307 | "outputs_hidden": true 308 | } 309 | }, 310 | "outputs": [], 311 | "source": [ 312 | "from sklearn.model_selection import KFold, cross_val_score" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 12, 318 | "metadata": {}, 319 | "outputs": [ 320 | { 321 | "data": { 322 | "text/plain": [ 323 | "array([ 0.96947935, 0.97486535, 0.97124888, 0.95507637, 0.96855346])" 324 | ] 325 | }, 326 | "execution_count": 12, 327 | "metadata": {}, 328 | "output_type": "execute_result" 329 | } 330 | ], 331 | "source": [ 332 | "rf = RandomForestClassifier(n_jobs=-1)\n", 333 | "k_fold = KFold(n_splits=5)\n", 334 | "cross_val_score(rf, X_features, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": { 341 | "collapsed": true, 342 | "jupyter": { 343 | "outputs_hidden": true 344 | } 345 | }, 346 | "outputs": [], 347 | "source": [] 348 | } 349 | ], 350 | "metadata": { 351 | "kernelspec": { 352 | "display_name": "Python 3 (ipykernel)", 353 | "language": "python", 354 | "name": "python3" 355 | }, 356 | "language_info": { 357 | "codemirror_mode": { 358 | "name": "ipython", 359 | "version": 3 360 | }, 361 | "file_extension": ".py", 362 | "mimetype": "text/x-python", 363 | "name": "python", 364 | "nbconvert_exporter": "python", 365 | "pygments_lexer": "ipython3", 366 | "version": "3.11.0" 367 | } 368 | }, 369 | "nbformat": 4, 370 | "nbformat_minor": 4 371 | } 372 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.2. Random Forest on a holdout test set.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Random Forest on a holdout test set" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in & clean text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 7, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/html": [ 25 | "
\n", 26 | "\n", 39 | "\n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01284.70.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1494.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2623.20.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3287.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41354.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 189 | "

5 rows × 8106 columns

\n", 190 | "
" 191 | ], 192 | "text/plain": [ 193 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 194 | "0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 195 | "1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 196 | "2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 197 | "3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 198 | "4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 199 | "\n", 200 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 201 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 202 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 203 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 204 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 205 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 206 | "\n", 207 | "[5 rows x 8106 columns]" 208 | ] 209 | }, 210 | "execution_count": 7, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "import nltk\n", 217 | "import pandas as pd\n", 218 | "import re\n", 219 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 220 | "import string\n", 221 | "\n", 222 | "stopwords = nltk.corpus.stopwords.words('english')\n", 223 | "ps = nltk.PorterStemmer()\n", 224 | "\n", 225 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 226 | "data.columns = ['label', 'body_text']\n", 227 | "\n", 228 | "def count_punct(text):\n", 229 | " count = sum([1 for char in text if char in string.punctuation])\n", 230 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 231 | "\n", 232 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 233 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 234 | "\n", 235 | "def clean_text(text):\n", 236 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 237 | " tokens = re.split('\\W+', text)\n", 238 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 239 | " return text\n", 240 | "\n", 241 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 242 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 243 | "\n", 244 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 245 | "X_features.head()" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### Explore RandomForestClassifier through Holdout Set" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 8, 258 | "metadata": { 259 | "collapsed": true, 260 | "jupyter": { 261 | "outputs_hidden": true 262 | } 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 267 | "from sklearn.model_selection import train_test_split" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 9, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 12, 282 | "metadata": { 283 | "collapsed": true, 284 | "jupyter": { 285 | "outputs_hidden": true 286 | } 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "from sklearn.ensemble import RandomForestClassifier\n", 291 | "\n", 292 | "rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)\n", 293 | "rf_model = rf.fit(X_train, y_train)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 14, 299 | 
"metadata": {}, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "[(0.071067778644078275, 'body_len'),\n", 305 | " (0.040562335897847433, 7350),\n", 306 | " (0.035736155950968088, 3134),\n", 307 | " (0.025830800898315055, 2031),\n", 308 | " (0.020706891454006282, 1881),\n", 309 | " (0.020667459644832679, 5724),\n", 310 | " (0.020246234600271286, 4796),\n", 311 | " (0.016709671666146234, 5988),\n", 312 | " (0.016333631268556359, 1803),\n", 313 | " (0.015520152981795897, 2171)]" 314 | ] 315 | }, 316 | "execution_count": 14, 317 | "metadata": {}, 318 | "output_type": "execute_result" 319 | } 320 | ], 321 | "source": [ 322 | "sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 15, 328 | "metadata": { 329 | "collapsed": true, 330 | "jupyter": { 331 | "outputs_hidden": true 332 | } 333 | }, 334 | "outputs": [], 335 | "source": [ 336 | "y_pred = rf_model.predict(X_test)\n", 337 | "precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 16, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | "Precision: 1.0 / Recall: 0.552 / Accuracy: 0.934\n" 350 | ] 351 | } 352 | ], 353 | "source": [ 354 | "print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),\n", 355 | " round(recall, 3),\n", 356 | " round((y_pred==y_test).sum() / len(y_pred),3)))" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "metadata": { 363 | "collapsed": true, 364 | "jupyter": { 365 | "outputs_hidden": true 366 | } 367 | }, 368 | "outputs": [], 369 | "source": [] 370 | } 371 | ], 372 | "metadata": { 373 | "kernelspec": { 374 | "display_name": "Python 3 (ipykernel)", 375 | "language": "python", 376 | "name": "python3" 377 | }, 378 | "language_info": { 379 | "codemirror_mode": { 380 | "name": "ipython", 381 | "version": 3 382 | }, 383 | "file_extension": ".py", 384 | "mimetype": "text/x-python", 385 | "name": "python", 386 | "nbconvert_exporter": "python", 387 | "pygments_lexer": "ipython3", 388 | "version": "3.11.0" 389 | } 390 | }, 391 | "nbformat": 4, 392 | "nbformat_minor": 4 393 | } 394 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.3. Explore Random Forest Model with Grid-Search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Explore Random Forest model with grid-search" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### Read in & clean text" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
\n", 33 | "\n", 46 | "\n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01280.0470.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1490.0410.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2620.0320.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3280.0710.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41350.0440.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 196 | "

5 rows × 8106 columns

\n", 197 | "
" 198 | ], 199 | "text/plain": [ 200 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 201 | "0 128 0.047 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 202 | "1 49 0.041 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 203 | "2 62 0.032 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 204 | "3 28 0.071 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 205 | "4 135 0.044 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 206 | "\n", 207 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 208 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 209 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 210 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 211 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 212 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 213 | "\n", 214 | "[5 rows x 8106 columns]" 215 | ] 216 | }, 217 | "execution_count": 1, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "import nltk\n", 224 | "import pandas as pd\n", 225 | "import re\n", 226 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 227 | "import string\n", 228 | "\n", 229 | "stopwords = nltk.corpus.stopwords.words('english')\n", 230 | "ps = nltk.PorterStemmer()\n", 231 | "\n", 232 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 233 | "data.columns = ['label', 'body_text']\n", 234 | "\n", 235 | "def count_punct(text):\n", 236 | " count = sum([1 for char in text if char in string.punctuation])\n", 237 | " return round(count/(len(text) - text.count(\" \")), 3)\n", 238 | "\n", 239 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 240 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 241 | "\n", 242 | "def clean_text(text):\n", 243 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 244 | " tokens = re.split('\\W+', text)\n", 245 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 246 | " return text\n", 247 | "\n", 248 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 249 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 250 | "\n", 251 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 252 | "X_features.head()" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "### Build our own Grid-search" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 2, 265 | "metadata": { 266 | "collapsed": true, 267 | "jupyter": { 268 | "outputs_hidden": true 269 | } 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "from sklearn.ensemble import RandomForestClassifier\n", 274 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 275 | "from sklearn.model_selection import train_test_split" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 3, 281 | "metadata": { 282 | "collapsed": true, 283 | "jupyter": { 284 | "outputs_hidden": true 285 | } 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 4, 295 | "metadata": { 296 | "collapsed": true, 297 | "jupyter": { 298 | "outputs_hidden": true 299 | } 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "def train_RF(n_est, depth):\n", 304 | " rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)\n", 305 | " rf_model = rf.fit(X_train, 
y_train)\n", 306 | " y_pred = rf_model.predict(X_test)\n", 307 | " precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 308 | " print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 309 | " n_est, depth, round(precision, 3), round(recall, 3),\n", 310 | " round((y_pred==y_test).sum() / len(y_pred), 3)))" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 5, 316 | "metadata": {}, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "Est: 10 / Depth: 10 ---- Precision: 1.0 / Recall: 0.216 / Accuracy: 0.892\n", 323 | "Est: 10 / Depth: 20 ---- Precision: 0.975 / Recall: 0.516 / Accuracy: 0.932\n", 324 | "Est: 10 / Depth: 30 ---- Precision: 1.0 / Recall: 0.647 / Accuracy: 0.952\n", 325 | "Est: 10 / Depth: None ---- Precision: 0.984 / Recall: 0.784 / Accuracy: 0.969\n", 326 | "Est: 50 / Depth: 10 ---- Precision: 1.0 / Recall: 0.235 / Accuracy: 0.895\n", 327 | "Est: 50 / Depth: 20 ---- Precision: 1.0 / Recall: 0.562 / Accuracy: 0.94\n", 328 | "Est: 50 / Depth: 30 ---- Precision: 1.0 / Recall: 0.667 / Accuracy: 0.954\n", 329 | "Est: 50 / Depth: None ---- Precision: 0.985 / Recall: 0.843 / Accuracy: 0.977\n", 330 | "Est: 100 / Depth: 10 ---- Precision: 1.0 / Recall: 0.242 / Accuracy: 0.896\n", 331 | "Est: 100 / Depth: 20 ---- Precision: 1.0 / Recall: 0.601 / Accuracy: 0.945\n", 332 | "Est: 100 / Depth: 30 ---- Precision: 0.981 / Recall: 0.686 / Accuracy: 0.955\n", 333 | "Est: 100 / Depth: None ---- Precision: 1.0 / Recall: 0.83 / Accuracy: 0.977\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "for n_est in [10, 50, 100]:\n", 339 | " for depth in [10, 20, 30, None]:\n", 340 | " train_RF(n_est, depth)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": { 347 | "collapsed": true, 348 | "jupyter": { 349 | "outputs_hidden": true 350 | } 351 | }, 352 | "outputs": [], 353 | "source": [] 354 | } 355 | ], 356 | "metadata": { 357 | "kernelspec": { 358 | "display_name": "Python 3 (ipykernel)", 359 | "language": "python", 360 | "name": "python3" 361 | }, 362 | "language_info": { 363 | "codemirror_mode": { 364 | "name": "ipython", 365 | "version": 3 366 | }, 367 | "file_extension": ".py", 368 | "mimetype": "text/x-python", 369 | "name": "python", 370 | "nbconvert_exporter": "python", 371 | "pygments_lexer": "ipython3", 372 | "version": "3.11.0" 373 | } 374 | }, 375 | "nbformat": 4, 376 | "nbformat_minor": 4 377 | } 378 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.5. Explore Gradient Boosting model with Grid-Search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Explore Gradient Boosting model with grid-search" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### Read in & clean text" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
\n", 33 | "\n", 46 | "\n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | "
body_lenpunct%01234567...8094809580968097809880998100810181028103
01284.70.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
1494.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
2623.20.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3287.10.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41354.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 196 | "

5 rows × 8106 columns

\n", 197 | "
" 198 | ], 199 | "text/plain": [ 200 | " body_len punct% 0 1 2 3 4 5 6 7 ... 8094 8095 \\\n", 201 | "0 128 4.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 202 | "1 49 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 203 | "2 62 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 204 | "3 28 7.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 205 | "4 135 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 206 | "\n", 207 | " 8096 8097 8098 8099 8100 8101 8102 8103 \n", 208 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 209 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 210 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 211 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 212 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 213 | "\n", 214 | "[5 rows x 8106 columns]" 215 | ] 216 | }, 217 | "execution_count": 1, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "import nltk\n", 224 | "import pandas as pd\n", 225 | "import re\n", 226 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 227 | "import string\n", 228 | "\n", 229 | "stopwords = nltk.corpus.stopwords.words('english')\n", 230 | "ps = nltk.PorterStemmer()\n", 231 | "\n", 232 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 233 | "data.columns = ['label', 'body_text']\n", 234 | "\n", 235 | "def count_punct(text):\n", 236 | " count = sum([1 for char in text if char in string.punctuation])\n", 237 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 238 | "\n", 239 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 240 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 241 | "\n", 242 | "def clean_text(text):\n", 243 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 244 | " tokens = re.split('\\W+', text)\n", 245 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 246 | " return text\n", 247 | "\n", 248 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 249 | "X_tfidf = tfidf_vect.fit_transform(data['body_text'])\n", 250 | "\n", 251 | "X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)\n", 252 | "X_features.head()" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "### Explore GradientBoostingClassifier Attributes & Hyperparameters" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 2, 265 | "metadata": { 266 | "collapsed": true, 267 | "jupyter": { 268 | "outputs_hidden": true 269 | } 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "from sklearn.ensemble import GradientBoostingClassifier" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 3, 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "name": "stdout", 283 | "output_type": "stream", 284 | "text": [ 285 | "['_SUPPORTED_LOSS', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_check_initialized', '_check_params', '_clear_state', '_decision_function', '_estimator_type', 
'_fit_stage', '_fit_stages', '_get_param_names', '_init_decision_function', '_init_state', '_is_initialized', '_make_estimator', '_resize_state', '_staged_decision_function', '_validate_estimator', '_validate_y', 'apply', 'decision_function', 'feature_importances_', 'fit', 'fit_transform', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 'staged_decision_function', 'staged_predict', 'staged_predict_proba', 'transform']\n", 286 | "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", 287 | " learning_rate=0.1, loss='deviance', max_depth=3,\n", 288 | " max_features=None, max_leaf_nodes=None,\n", 289 | " min_impurity_split=1e-07, min_samples_leaf=1,\n", 290 | " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", 291 | " n_estimators=100, presort='auto', random_state=None,\n", 292 | " subsample=1.0, verbose=0, warm_start=False)\n" 293 | ] 294 | } 295 | ], 296 | "source": [ 297 | "print(dir(GradientBoostingClassifier))\n", 298 | "print(GradientBoostingClassifier())" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "### Build our own Grid-search" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 4, 311 | "metadata": { 312 | "collapsed": true, 313 | "jupyter": { 314 | "outputs_hidden": true 315 | } 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 320 | "from sklearn.model_selection import train_test_split" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 5, 326 | "metadata": { 327 | "collapsed": true, 328 | "jupyter": { 329 | "outputs_hidden": true 330 | } 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 6, 340 | "metadata": { 341 | "collapsed": true, 342 | "jupyter": { 343 | "outputs_hidden": true 344 | } 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "def train_GB(est, max_depth, lr):\n", 349 | " gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)\n", 350 | " gb_model = gb.fit(X_train, y_train)\n", 351 | " y_pred = gb_model.predict(X_test)\n", 352 | " precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 353 | " print('Est: {} / Depth: {} / LR: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 354 | " est, max_depth, lr, round(precision, 3), round(recall, 3), \n", 355 | " round((y_pred==y_test).sum()/len(y_pred), 3)))" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 7, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stderr", 365 | "output_type": "stream", 366 | "text": [ 367 | "/Users/djedamski/.pyenv/versions/3.5.3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples.\n", 368 | " 'precision', 'predicted', average, warn_for)\n" 369 | ] 370 | }, 371 | { 372 | "name": "stdout", 373 | "output_type": "stream", 374 | "text": [ 375 | "Est: 50 / Depth: 3 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868\n", 376 | "Est: 50 / Depth: 3 / LR: 0.1 ---- Precision: 1.0 / Recall: 0.687 / Accuracy: 0.959\n", 377 | "Est: 50 / Depth: 3 / LR: 1 ---- Precision: 0.88 / Recall: 0.796 / Accuracy: 0.959\n", 378 | "Est: 
50 / Depth: 7 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868\n", 379 | "Est: 50 / Depth: 7 / LR: 0.1 ---- Precision: 0.968 / Recall: 0.83 / Accuracy: 0.974\n", 380 | "Est: 50 / Depth: 7 / LR: 1 ---- Precision: 0.917 / Recall: 0.823 / Accuracy: 0.967\n", 381 | "Est: 50 / Depth: 11 / LR: 0.01 ---- Precision: 1.0 / Recall: 0.027 / Accuracy: 0.872\n", 382 | "Est: 50 / Depth: 11 / LR: 0.1 ---- Precision: 0.962 / Recall: 0.871 / Accuracy: 0.978\n", 383 | "Est: 50 / Depth: 11 / LR: 1 ---- Precision: 0.926 / Recall: 0.85 / Accuracy: 0.971\n", 384 | "Est: 50 / Depth: 15 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868\n", 385 | "Est: 50 / Depth: 15 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.857 / Accuracy: 0.978\n", 386 | "Est: 50 / Depth: 15 / LR: 1 ---- Precision: 0.919 / Recall: 0.85 / Accuracy: 0.97\n", 387 | "Est: 100 / Depth: 3 / LR: 0.01 ---- Precision: 0.987 / Recall: 0.51 / Accuracy: 0.934\n", 388 | "Est: 100 / Depth: 3 / LR: 0.1 ---- Precision: 0.991 / Recall: 0.776 / Accuracy: 0.969\n", 389 | "Est: 100 / Depth: 3 / LR: 1 ---- Precision: 0.901 / Recall: 0.803 / Accuracy: 0.962\n", 390 | "Est: 100 / Depth: 7 / LR: 0.01 ---- Precision: 0.989 / Recall: 0.612 / Accuracy: 0.948\n", 391 | "Est: 100 / Depth: 7 / LR: 0.1 ---- Precision: 0.985 / Recall: 0.871 / Accuracy: 0.981\n", 392 | "Est: 100 / Depth: 7 / LR: 1 ---- Precision: 0.922 / Recall: 0.81 / Accuracy: 0.966\n", 393 | "Est: 100 / Depth: 11 / LR: 0.01 ---- Precision: 0.991 / Recall: 0.741 / Accuracy: 0.965\n", 394 | "Est: 100 / Depth: 11 / LR: 0.1 ---- Precision: 0.984 / Recall: 0.864 / Accuracy: 0.98\n", 395 | "Est: 100 / Depth: 11 / LR: 1 ---- Precision: 0.912 / Recall: 0.844 / Accuracy: 0.969\n", 396 | "Est: 100 / Depth: 15 / LR: 0.01 ---- Precision: 0.992 / Recall: 0.796 / Accuracy: 0.972\n", 397 | "Est: 100 / Depth: 15 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.871 / Accuracy: 0.98\n", 398 | "Est: 100 / Depth: 15 / LR: 1 ---- Precision: 0.932 / Recall: 0.844 / Accuracy: 0.971\n", 399 | "Est: 150 / Depth: 3 / LR: 0.01 ---- Precision: 0.988 / Recall: 0.537 / Accuracy: 0.938\n", 400 | "Est: 150 / Depth: 3 / LR: 0.1 ---- Precision: 0.992 / Recall: 0.81 / Accuracy: 0.974\n", 401 | "Est: 150 / Depth: 3 / LR: 1 ---- Precision: 0.902 / Recall: 0.816 / Accuracy: 0.964\n", 402 | "Est: 150 / Depth: 7 / LR: 0.01 ---- Precision: 0.99 / Recall: 0.687 / Accuracy: 0.958\n", 403 | "Est: 150 / Depth: 7 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.857 / Accuracy: 0.978\n", 404 | "Est: 150 / Depth: 7 / LR: 1 ---- Precision: 0.937 / Recall: 0.81 / Accuracy: 0.968\n", 405 | "Est: 150 / Depth: 11 / LR: 0.01 ---- Precision: 0.983 / Recall: 0.796 / Accuracy: 0.971\n", 406 | "Est: 150 / Depth: 11 / LR: 0.1 ---- Precision: 0.985 / Recall: 0.871 / Accuracy: 0.981\n", 407 | "Est: 150 / Depth: 11 / LR: 1 ---- Precision: 0.904 / Recall: 0.837 / Accuracy: 0.967\n", 408 | "Est: 150 / Depth: 15 / LR: 0.01 ---- Precision: 0.975 / Recall: 0.796 / Accuracy: 0.97\n", 409 | "Est: 150 / Depth: 15 / LR: 0.1 ---- Precision: 0.977 / Recall: 0.864 / Accuracy: 0.979\n", 410 | "Est: 150 / Depth: 15 / LR: 1 ---- Precision: 0.913 / Recall: 0.857 / Accuracy: 0.97\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "for n_est in [50, 100, 150]:\n", 416 | " for max_depth in [3, 7, 11, 15]:\n", 417 | " for lr in [0.01, 0.1, 1]:\n", 418 | " train_GB(n_est, max_depth, lr)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": { 425 | "collapsed": true, 426 | "jupyter": { 427 | "outputs_hidden": true 428 | } 
429 | }, 430 | "outputs": [], 431 | "source": [] 432 | } 433 | ], 434 | "metadata": { 435 | "kernelspec": { 436 | "display_name": "Python 3 (ipykernel)", 437 | "language": "python", 438 | "name": "python3" 439 | }, 440 | "language_info": { 441 | "codemirror_mode": { 442 | "name": "ipython", 443 | "version": 3 444 | }, 445 | "file_extension": ".py", 446 | "mimetype": "text/x-python", 447 | "name": "python", 448 | "nbconvert_exporter": "python", 449 | "pygments_lexer": "ipython3", 450 | "version": "3.11.0" 451 | } 452 | }, 453 | "nbformat": 4, 454 | "nbformat_minor": 4 455 | } 456 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/5.7. Model Selection.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building Machine Learning Classifiers: Model selection" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Read in & clean text" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "collapsed": true, 22 | "jupyter": { 23 | "outputs_hidden": true 24 | } 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "import nltk\n", 29 | "import pandas as pd\n", 30 | "import re\n", 31 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 32 | "import string\n", 33 | "\n", 34 | "stopwords = nltk.corpus.stopwords.words('english')\n", 35 | "ps = nltk.PorterStemmer()\n", 36 | "\n", 37 | "data = pd.read_csv(\"SMSSpamCollection.tsv\", sep='\\t')\n", 38 | "data.columns = ['label', 'body_text']\n", 39 | "\n", 40 | "def count_punct(text):\n", 41 | " count = sum([1 for char in text if char in string.punctuation])\n", 42 | " return round(count/(len(text) - text.count(\" \")), 3)*100\n", 43 | "\n", 44 | "data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(\" \"))\n", 45 | "data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))\n", 46 | "\n", 47 | "def clean_text(text):\n", 48 | " text = \"\".join([word.lower() for word in text if word not in string.punctuation])\n", 49 | " tokens = re.split('\\W+', text)\n", 50 | " text = [ps.stem(word) for word in tokens if word not in stopwords]\n", 51 | " return text" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Split into train/test" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": { 65 | "collapsed": true, 66 | "jupyter": { 67 | "outputs_hidden": true 68 | } 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "from sklearn.model_selection import train_test_split\n", 73 | "\n", 74 | "X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### Vectorize text" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/html": [ 92 | "
\n", 93 | "\n", 106 | "\n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | "
body_lenpunct%01234567...7153715471557156715771587159716071617162
0190.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
11153.50.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
21062.80.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
3293.40.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
41524.60.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 256 | "

5 rows × 7165 columns

\n", 257 | "
" 258 | ], 259 | "text/plain": [ 260 | " body_len punct% 0 1 2 3 4 5 6 7 ... 7153 7154 \\\n", 261 | "0 19 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 262 | "1 115 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 263 | "2 106 2.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 264 | "3 29 3.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 265 | "4 152 4.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n", 266 | "\n", 267 | " 7155 7156 7157 7158 7159 7160 7161 7162 \n", 268 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 269 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 270 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 271 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 272 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 273 | "\n", 274 | "[5 rows x 7165 columns]" 275 | ] 276 | }, 277 | "execution_count": 3, 278 | "metadata": {}, 279 | "output_type": "execute_result" 280 | } 281 | ], 282 | "source": [ 283 | "tfidf_vect = TfidfVectorizer(analyzer=clean_text)\n", 284 | "tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])\n", 285 | "\n", 286 | "tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])\n", 287 | "tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])\n", 288 | "\n", 289 | "X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), \n", 290 | " pd.DataFrame(tfidf_train.toarray())], axis=1)\n", 291 | "X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), \n", 292 | " pd.DataFrame(tfidf_test.toarray())], axis=1)\n", 293 | "\n", 294 | "X_train_vect.head()" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "### Final evaluation of models" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 4, 307 | "metadata": { 308 | "collapsed": true, 309 | "jupyter": { 310 | "outputs_hidden": true 311 | } 312 | }, 313 | "outputs": [], 314 | "source": [ 315 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", 316 | "from sklearn.metrics import precision_recall_fscore_support as score\n", 317 | "import time" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 5, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | "Fit time: 1.782 / Predict time: 0.213 ---- Precision: 1.0 / Recall: 0.81 / Accuracy: 0.975\n" 330 | ] 331 | } 332 | ], 333 | "source": [ 334 | "rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)\n", 335 | "\n", 336 | "start = time.time()\n", 337 | "rf_model = rf.fit(X_train_vect, y_train)\n", 338 | "end = time.time()\n", 339 | "fit_time = (end - start)\n", 340 | "\n", 341 | "start = time.time()\n", 342 | "y_pred = rf_model.predict(X_test_vect)\n", 343 | "end = time.time()\n", 344 | "pred_time = (end - start)\n", 345 | "\n", 346 | "precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 347 | "print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 348 | " round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 6, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "Fit time: 186.61 / Predict time: 0.135 ---- Precision: 0.889 / Recall: 0.816 / Accuracy: 0.962\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 
| "gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)\n", 366 | "\n", 367 | "start = time.time()\n", 368 | "gb_model = gb.fit(X_train_vect, y_train)\n", 369 | "end = time.time()\n", 370 | "fit_time = (end - start)\n", 371 | "\n", 372 | "start = time.time()\n", 373 | "y_pred = gb_model.predict(X_test_vect)\n", 374 | "end = time.time()\n", 375 | "pred_time = (end - start)\n", 376 | "\n", 377 | "precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')\n", 378 | "print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(\n", 379 | " round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "collapsed": true, 387 | "jupyter": { 388 | "outputs_hidden": true 389 | } 390 | }, 391 | "outputs": [], 392 | "source": [] 393 | } 394 | ], 395 | "metadata": { 396 | "kernelspec": { 397 | "display_name": "Python 3 (ipykernel)", 398 | "language": "python", 399 | "name": "python3" 400 | }, 401 | "language_info": { 402 | "codemirror_mode": { 403 | "name": "ipython", 404 | "version": 3 405 | }, 406 | "file_extension": ".py", 407 | "mimetype": "text/x-python", 408 | "name": "python", 409 | "nbconvert_exporter": "python", 410 | "pygments_lexer": "ipython3", 411 | "version": "3.9.13" 412 | } 413 | }, 414 | "nbformat": 4, 415 | "nbformat_minor": 4 416 | } 417 | -------------------------------------------------------------------------------- /5. Building Machine Learning Classifiers/empty: -------------------------------------------------------------------------------- 1 | hi 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Kshitiz Pandya 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /page.html: -------------------------------------------------------------------------------- 1 | hi 2 | -------------------------------------------------------------------------------- /test output/empty: -------------------------------------------------------------------------------- 1 | empty 2 | -------------------------------------------------------------------------------- /test output/giphy.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KshitizPandya/Natural-Language-Processing-with-Machine-Learning/896b6ca491e7f41fe8155846e6f79bc55a467280/test output/giphy.gif -------------------------------------------------------------------------------- /test output/output_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KshitizPandya/Natural-Language-Processing-with-Machine-Learning/896b6ca491e7f41fe8155846e6f79bc55a467280/test output/output_1.png -------------------------------------------------------------------------------- /test output/output_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KshitizPandya/Natural-Language-Processing-with-Machine-Learning/896b6ca491e7f41fe8155846e6f79bc55a467280/test output/output_2.png --------------------------------------------------------------------------------