├── .gitignore ├── LICENSE ├── README.md ├── nlp_pipeline_manager ├── README.md ├── __init__.py ├── nlp_preprocessor.py ├── nlpipe.py ├── pipeline_demo.ipynb ├── save_pipeline.mdl ├── supervised_nlp.py └── topic_modeling_nlp.py ├── prototyping └── building_an_nlp_pipeline_class.ipynb └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.ipynb_checkpoints* 2 | .idea/* 3 | *.egg-info/* 4 | build/* 5 | dist/* 6 | **~ 7 | lunch_and_learn_notes.md 8 | *.DS_Store* 9 | *.npz 10 | **/__pycache__/* 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Zach Miller 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # NLP Pipeline Manager 2 | 3 | The most frustrating part of doing NLP for me is keeping track of all the 4 | different combinations of cleaning functions, stemmers, lemmatizers, 5 | vectorizers, models, etc. I almost always resort to writing some awful 6 | function that hacks those bits together and then prints out some scoring 7 | piece. To help manage all of this better, I've developed a pipelining system 8 | that allows the user to load all of the pieces into a class and then let the 9 | class do the management for them. 10 | 11 | ### Installation 12 | 13 | Clone this repo. Go to the directory where it is cloned and run: 14 | 15 | ```bash 16 | python setup.py install 17 | ``` 18 | 19 | nlp_pipeline_manager will then install to your machine and be available. This project 20 | assumes python 3 and requires NLTK and SkLearn. 
21 | 22 | 23 | ### Examples of using the pipeline: 24 | 25 | ```python 26 | from nlp_pipeline_manager import nlp_preprocessor 27 | 28 | corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing'] 29 | nlp = nlp_preprocessor() 30 | nlp.fit(corpus) 31 | nlp.transform(corpus).toarray() 32 | ``` 33 | 34 | ```bash 35 | array([[1, 1, 0, 0, 0, 1, 0, 0], 36 | [0, 0, 0, 1, 1, 0, 0, 0], 37 | [0, 0, 1, 0, 0, 0, 1, 1]]) 38 | ``` 39 | 40 | Loading a stemmer into the pipeline (we actually pass in the stemming method): 41 | 42 | ```python 43 | from nltk.stem import PorterStemmer 44 | 45 | nlp = nlp_preprocessor(stemmer=PorterStemmer().stem) 46 | ``` 47 | 48 | The pipeline allows users to set: 49 | 50 | * Vectorizer (using SkLearn classes) 51 | * Cleaning Function (user can just provide a function name without the parens 52 | at the end) 53 | * Tokenizer (Either an NLTK tokenizer or a function that takes a string and 54 | returns a list of tokens) 55 | * Stemmer (Can be any function that takes in a word and returns a root form - be it a stemmer or a lemmatizer) 56 | 57 | If the user wants to provide a cleaning function, it must accept 3 arguments. 58 | The text, the tokenizer, and the stemmer. Here's an example extra cleaning 59 | function: 60 | 61 | ```python 62 | def clean_text(text, tokenizer, stemmer): 63 | """ 64 | A naive function to lowercase all words and clean them quickly 65 | with a stemmer if it exists. 66 | """ 67 | cleaned_text = [] 68 | for post in text: 69 | cleaned_words = [] 70 | for word in tokenizer(post): 71 | low_word = word.lower() 72 | if stemmer: 73 | low_word = stemmer(low_word) 74 | cleaned_words.append(low_word) 75 | cleaned_text.append(' '.join(cleaned_words)) 76 | return cleaned_text 77 | ``` 78 | 79 | # ML with the pipeline 80 | 81 | It's quick and easy to create modeling pipelines that wrap around the 82 | preprocessor. Two example pipes are shown, one for classification and one for 83 | topic modeling. 
Here's an example of using the classification pipe: 84 | 85 | ```python 86 | from nlp_pipeline_manager import supervised_nlp 87 | from nlp_pipeline_manager import nlp_preprocessor 88 | from sklearn.naive_bayes import MultinomialNB 89 | 90 | nlp = nlp_preprocessor() 91 | 92 | nlp_pipe = supervised_nlp(MultinomialNB(), nlp) 93 | nlp_pipe.fit(ng_train_data, ng_train_targets) 94 | nlp_pipe.score(ng_test_data, ng_test_targets) 95 | ``` 96 | 97 | 98 | -------------------------------------------------------------------------------- /nlp_pipeline_manager/README.md: -------------------------------------------------------------------------------- 1 | # NLP Pipe Manager 2 | 3 | **nlpipe.py** 4 | 5 | This file contains a parent class for file I/O to disk. This allows any pipeline using this as the parent class to 6 | save all attributes to disk in a pickle file for later use. 7 | 8 | **nlp_preprocessor.py** 9 | 10 | This is an implementation of the main pipeline. It allows a user to put in bits and pieces, which it then chains 11 | together so the user only needs to call `fit` and `transform` to get vectorized NLP data. 12 | 13 | **pipeline_demo.ipynb** 14 | 15 | A notebook that shows the pipeline in action. 16 | 17 | **save_pipeline.mdl** 18 | 19 | A saved model file from the pipeline_demo. This demonstrates the I/O capabilities of the pipeline. 20 | 21 | **supervised_nlp.py** 22 | 23 | This file implements a class for using the nlp preprocessor along with a model to do prediction (classification 24 | or regression). It assumes that the user will be providing an SkLearn model to work with. It simplifies the user API 25 | to just `fit`, `predict`, and `score`. 26 | 27 | **topic_modeling_nlp.py** 28 | 29 | This file implements a class for doing topic modeling with the `nlp_preprocessor`. The user must provide a 30 | topic modeling method like `TruncatedSVD` or `LatentDirichletAllocation` from SkLearn. This simplifies the user API 31 | to just `fit`, `transform`, and `print_topics`. 
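
The save-and-load behavior described under **nlpipe.py** can be sketched with the standard library alone. This is a minimal illustration, not the package's own code: `PicklablePipe` and `ToyPreprocessor` are hypothetical stand-ins, and unlike the original `save_pipe`, this sketch only appends the `.mdl` suffix when it is missing and closes its file handles with a context manager.

```python
import os
import pickle
import tempfile


class PicklablePipe:
    """Hypothetical stand-in for the nlpipe parent class (assumed, not the package's code)."""

    def save_pipe(self, filename):
        if not isinstance(filename, str):
            raise TypeError("filename must be a string")
        # Assumption: add the suffix only when it is missing.
        if not filename.endswith(".mdl"):
            filename += ".mdl"
        # The context manager closes the handle even if pickling fails.
        with open(filename, "wb") as handle:
            pickle.dump(self.__dict__, handle)
        return filename

    def load_pipe(self, filename):
        if not isinstance(filename, str):
            raise TypeError("filename must be a string")
        if not filename.endswith(".mdl"):
            filename += ".mdl"
        # Replace this instance's attributes with the saved ones.
        with open(filename, "rb") as handle:
            self.__dict__ = pickle.load(handle)


class ToyPreprocessor(PicklablePipe):
    """Hypothetical pipeline whose only trained state is a sorted vocabulary."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase
        self.vocabulary = None

    def fit(self, corpus):
        tokens = set()
        for document in corpus:
            for word in document.split():
                tokens.add(word.lower() if self.lowercase else word)
        self.vocabulary = sorted(tokens)


with tempfile.TemporaryDirectory() as tmp_dir:
    trained = ToyPreprocessor()
    trained.fit(["BOB the builder", "is a strange"])
    saved_path = trained.save_pipe(os.path.join(tmp_dir, "demo_pipeline"))

    # A fresh instance picks up the trained state from disk.
    restored = ToyPreprocessor()
    restored.load_pipe(saved_path)

print(restored.vocabulary)
```

Because the attributes are stored with `pickle`, a saved `.mdl` file should only be loaded when it comes from a trusted source; unpickling can execute arbitrary code.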
32 | -------------------------------------------------------------------------------- /nlp_pipeline_manager/__init__.py: -------------------------------------------------------------------------------- 1 | from .nlpipe import nlpipe 2 | from .nlp_preprocessor import nlp_preprocessor 3 | from .supervised_nlp import supervised_nlp 4 | from .topic_modeling_nlp import topic_modeling_nlp 5 | 6 | __all__ = ['nlpipe','nlp_preprocessor','supervised_nlp','topic_modeling_nlp'] -------------------------------------------------------------------------------- /nlp_pipeline_manager/nlp_preprocessor.py: -------------------------------------------------------------------------------- 1 | from sklearn.feature_extraction.text import CountVectorizer 2 | from nlp_pipeline_manager.nlpipe import nlpipe 3 | 4 | 5 | class nlp_preprocessor(nlpipe): 6 | 7 | def __init__(self, vectorizer=CountVectorizer(), tokenizer=None, cleaning_function=None, 8 | stemmer=None): 9 | """ 10 | A class for pipelining our data in NLP problems. The user provides a series of 11 | tools, and this class manages all of the training, transforming, and modification 12 | of the text data. 13 | --- 14 | Inputs: 15 | vectorizer: the model to use for vectorization of text data 16 | tokenizer: The tokenizer to use, if none defaults to split on spaces 17 | cleaning_function: how to clean the data, if None, defaults to the in built class 18 | stemmer: a function that returns a stemmed version of a token. For NLTK, this 19 | means getting a stemmer class, then providing the stemming function underneath it. 
20 | """ 21 | if not tokenizer: 22 | tokenizer = self.splitter 23 | if not cleaning_function: 24 | cleaning_function = self.clean_text 25 | self.stemmer = stemmer 26 | self.tokenizer = tokenizer 27 | self.cleaning_function = cleaning_function 28 | self.vectorizer = vectorizer 29 | self._is_fit = False 30 | 31 | def splitter(self, text): 32 | """ 33 | Default tokenizer that splits on spaces naively 34 | """ 35 | return text.split(' ') 36 | 37 | def clean_text(self, text, tokenizer, stemmer): 38 | """ 39 | A naive function to lowercase all words and clean them quickly. 40 | This is the default behavior if no other cleaning function is specified. 41 | """ 42 | cleaned_text = [] 43 | for post in text: 44 | cleaned_words = [] 45 | for word in tokenizer(post): 46 | low_word = word.lower() 47 | if stemmer: 48 | low_word = stemmer(low_word) 49 | cleaned_words.append(low_word) 50 | cleaned_text.append(' '.join(cleaned_words)) 51 | return cleaned_text 52 | 53 | def fit(self, text): 54 | """ 55 | Cleans the data and then fits the vectorizer with 56 | the user provided text 57 | """ 58 | clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer) 59 | self.vectorizer.fit(clean_text) 60 | self._is_fit = True 61 | 62 | def transform(self, text, return_clean_text=False): 63 | """ 64 | Cleans any provided data and then transforms the data into 65 | a vectorized format based on the fit function. 66 | If return_clean_text is set to True, it returns the cleaned 67 | form of the text. If it's set to False, it returns the 68 | vectorized form of the data. 
69 | """ 70 | if not self._is_fit: 71 | raise ValueError("Must fit the models before transforming!") 72 | clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer) 73 | if return_clean_text: 74 | return clean_text 75 | return self.vectorizer.transform(clean_text) 76 | 77 | 78 | if __name__ == '__main__': 79 | from nltk.stem import PorterStemmer 80 | 81 | ps = PorterStemmer() 82 | corpus = ['Testing the', 'class for', 'good behavior.'] 83 | nlp = nlp_preprocessor(stemmer=ps.stem) 84 | nlp.fit(corpus) 85 | print(nlp.transform(corpus, return_clean_text=True)) 86 | print(nlp.transform(corpus).toarray()) -------------------------------------------------------------------------------- /nlp_pipeline_manager/nlpipe.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | 4 | class nlpipe: 5 | 6 | def __init__(self): 7 | """ 8 | Empty parent class for nlp pipelines that contains 9 | shared file i/o that happens in every class. 10 | """ 11 | pass 12 | 13 | def save_pipe(self, filename): 14 | """ 15 | Writes the attributes of the pipeline to a file 16 | allowing a pipeline to be loaded later with the 17 | pre-trained pieces in place. 18 | """ 19 | if type(filename) != str: 20 | raise TypeError("filename must be a string") 21 | pickle.dump(self.__dict__, open(filename+".mdl",'wb')) 22 | 23 | def load_pipe(self, filename): 24 | """ 25 | Reads the attributes of a saved pipeline from a file 26 | and restores them on this instance, putting the 27 | pre-trained pieces back in place. 
28 | """ 29 | if type(filename) != str: 30 | raise TypeError("filename must be a string") 31 | if filename[-4:] != '.mdl': 32 | filename += '.mdl' 33 | self.__dict__ = pickle.load(open(filename,'rb')) -------------------------------------------------------------------------------- /nlp_pipeline_manager/pipeline_demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2018-08-16T13:58:45.891755Z", 9 | "start_time": "2018-08-16T13:58:45.456866Z" 10 | } 11 | }, 12 | "outputs": [], 13 | "source": [ 14 | "from nlp_preprocessor import nlp_preprocessor\n", 15 | "import pandas as pd" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "# Testing the preprocessor class" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 3, 28 | "metadata": { 29 | "ExecuteTime": { 30 | "end_time": "2018-08-16T13:58:59.678473Z", 31 | "start_time": "2018-08-16T13:58:57.722641Z" 32 | } 33 | }, 34 | "outputs": [ 35 | { 36 | "data": { 37 | "text/html": [ 38 | "
" 103 | ], 104 | "text/plain": [ 105 | " bob builder cartoon is strange the thing type\n", 106 | "0 1 1 0 0 0 1 0 0\n", 107 | "1 0 0 0 1 1 0 0 0\n", 108 | "2 0 0 1 0 0 0 1 1" 109 | ] 110 | }, 111 | "execution_count": 3, 112 | "metadata": {}, 113 | "output_type": "execute_result" 114 | } 115 | ], 116 | "source": [ 117 | "from nltk.stem import WordNetLemmatizer\n", 118 | "\n", 119 | "lemma = WordNetLemmatizer()\n", 120 | "\n", 121 | "corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']\n", 122 | "nlp = nlp_preprocessor(stemmer = lemma.lemmatize)\n", 123 | "nlp.fit(corpus)\n", 124 | "pd.DataFrame(nlp.transform(corpus).toarray(), columns=nlp.vectorizer.get_feature_names())" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 4, 130 | "metadata": { 131 | "ExecuteTime": { 132 | "end_time": "2018-08-16T13:59:03.953921Z", 133 | "start_time": "2018-08-16T13:59:03.948587Z" 134 | } 135 | }, 136 | "outputs": [ 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "array([[1, 1, 0, 0, 0, 1, 0, 0],\n", 141 | " [0, 0, 0, 1, 1, 0, 0, 0],\n", 142 | " [0, 0, 1, 0, 0, 0, 1, 1]])" 143 | ] 144 | }, 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "nlp.transform(corpus).toarray()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "# Testing the Supervised Learning Class" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 5, 164 | "metadata": { 165 | "ExecuteTime": { 166 | "end_time": "2018-08-16T13:59:06.378161Z", 167 | "start_time": "2018-08-16T13:59:06.372315Z" 168 | } 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "from supervised_nlp import supervised_nlp" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 6, 178 | "metadata": { 179 | "ExecuteTime": { 180 | "end_time": "2018-08-16T13:59:09.307480Z", 181 | "start_time": "2018-08-16T13:59:06.929183Z" 182 | } 
183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "from sklearn import datasets\n", 187 | "\n", 188 | "categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']\n", 189 | "ng_train = datasets.fetch_20newsgroups(subset='train', \n", 190 | " categories=categories, \n", 191 | " remove=('headers', \n", 192 | " 'footers', 'quotes'))\n", 193 | "ng_train_data = ng_train.data\n", 194 | "ng_train_targets = ng_train.target\n", 195 | "\n", 196 | "ng_test = datasets.fetch_20newsgroups(subset='test', \n", 197 | " categories=categories, \n", 198 | " remove=('headers', \n", 199 | " 'footers', 'quotes'))\n", 200 | "\n", 201 | "ng_test_data = ng_test.data\n", 202 | "ng_test_targets = ng_test.target" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 7, 208 | "metadata": { 209 | "ExecuteTime": { 210 | "end_time": "2018-08-16T13:59:13.076084Z", 211 | "start_time": "2018-08-16T13:59:09.309465Z" 212 | } 213 | }, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/plain": [ 218 | "0.9113122171945701" 219 | ] 220 | }, 221 | "execution_count": 7, 222 | "metadata": {}, 223 | "output_type": "execute_result" 224 | } 225 | ], 226 | "source": [ 227 | "from sklearn.naive_bayes import MultinomialNB\n", 228 | "\n", 229 | "nlp_pipe = supervised_nlp(MultinomialNB(), nlp)\n", 230 | "nlp_pipe.fit(ng_train_data, ng_train_targets)\n", 231 | "nlp_pipe.score(ng_test_data, ng_test_targets)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "# Testing the Topic Modeling Class" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 8, 244 | "metadata": { 245 | "ExecuteTime": { 246 | "end_time": "2018-08-16T13:59:38.430684Z", 247 | "start_time": "2018-08-16T13:59:35.472286Z" 248 | } 249 | }, 250 | "outputs": [ 251 | { 252 | "name": "stdout", 253 | "output_type": "stream", 254 | "text": [ 255 | "Topic #0: image jpeg file edu gif format color data pub ftp\n", 256 | "Topic #1: edu pub data 
graphics mail ftp ray send graphic com\n", 257 | "Topic #2: jesus god wa atheist matthew people ha atheism christian prophecy\n", 258 | "Topic #3: image data processing tool analysis software user available using sun\n", 259 | "Topic #4: jesus matthew prophecy wa psalm messiah day isaiah david prophet\n", 260 | "Topic #5: argument fallacy conclusion premise example true argumentum ad false valid\n", 261 | "Topic #6: data available ftp grass sgi vertex package model pci motecc\n", 262 | "Topic #7: wa game year team hit run don think good win\n", 263 | "Topic #8: posting response god subject typical information universe einstein wa bush\n", 264 | "Topic #9: den radius double theta sqrt pi sin rtheta pt pole\n", 265 | "Topic #10: program read think menu don bit change file want pressing\n", 266 | "Topic #11: program menu file read display game pressing change bit home\n", 267 | "Topic #12: won lost idle new york sox year san american chicago\n", 268 | "Topic #13: atheism alt faq send edu usenet news article answers newsgroup\n", 269 | "Topic #14: col int row value cub char imagewidth imageheight suck atheism\n" 270 | ] 271 | } 272 | ], 273 | "source": [ 274 | "from sklearn.decomposition import TruncatedSVD\n", 275 | "from sklearn.feature_extraction.text import CountVectorizer\n", 276 | "\n", 277 | "from topic_modeling_nlp import topic_modeling_nlp\n", 278 | "\n", 279 | "\n", 280 | "cv = CountVectorizer(stop_words='english', token_pattern='\\\\b[a-z][a-z]+\\\\b')\n", 281 | "cleaning_pipe = nlp_preprocessor(vectorizer=cv, stemmer=lemma.lemmatize)\n", 282 | "topic_chain = topic_modeling_nlp(TruncatedSVD(n_components=15), preprocessing_pipeline=cleaning_pipe)\n", 283 | "\n", 284 | "topic_chain.fit(ng_train_data)\n", 285 | "topic_chain.print_topics()" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "# Testing Saving and Loading a Pipeline" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 9, 
298 | "metadata": { 299 | "ExecuteTime": { 300 | "end_time": "2018-08-16T13:59:48.733671Z", 301 | "start_time": "2018-08-16T13:59:48.644948Z" 302 | } 303 | }, 304 | "outputs": [ 305 | { 306 | "data": { 307 | "text/plain": [ 308 | "method" 309 | ] 310 | }, 311 | "execution_count": 9, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "from nltk.stem import PorterStemmer\n", 318 | "\n", 319 | "nlp = nlp_preprocessor(stemmer=PorterStemmer().stem)\n", 320 | "nlp.save_pipe('save_pipeline')\n", 321 | "type(nlp.stemmer)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 10, 327 | "metadata": { 328 | "ExecuteTime": { 329 | "end_time": "2018-08-16T13:59:53.293256Z", 330 | "start_time": "2018-08-16T13:59:53.287874Z" 331 | } 332 | }, 333 | "outputs": [ 334 | { 335 | "data": { 336 | "text/plain": [ 337 | "NoneType" 338 | ] 339 | }, 340 | "execution_count": 10, 341 | "metadata": {}, 342 | "output_type": "execute_result" 343 | } 344 | ], 345 | "source": [ 346 | "nlp2 = nlp_preprocessor()\n", 347 | "type(nlp2.stemmer)" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 11, 353 | "metadata": { 354 | "ExecuteTime": { 355 | "end_time": "2018-08-16T13:59:56.710006Z", 356 | "start_time": "2018-08-16T13:59:56.688327Z" 357 | } 358 | }, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "method" 364 | ] 365 | }, 366 | "execution_count": 11, 367 | "metadata": {}, 368 | "output_type": "execute_result" 369 | } 370 | ], 371 | "source": [ 372 | "nlp2.load_pipe('save_pipeline')\n", 373 | "type(nlp2.stemmer)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [] 382 | } 383 | ], 384 | "metadata": { 385 | "kernelspec": { 386 | "display_name": "Python [default]", 387 | "language": "python", 388 | "name": "python3" 389 | }, 390 | "language_info": { 391 | "codemirror_mode": { 392 | 
"name": "ipython", 393 | "version": 3 394 | }, 395 | "file_extension": ".py", 396 | "mimetype": "text/x-python", 397 | "name": "python", 398 | "nbconvert_exporter": "python", 399 | "pygments_lexer": "ipython3", 400 | "version": "3.6.6" 401 | }, 402 | "toc": { 403 | "nav_menu": {}, 404 | "number_sections": true, 405 | "sideBar": true, 406 | "skip_h1_title": false, 407 | "toc_cell": false, 408 | "toc_position": {}, 409 | "toc_section_display": "block", 410 | "toc_window_display": false 411 | } 412 | }, 413 | "nbformat": 4, 414 | "nbformat_minor": 2 415 | } 416 | -------------------------------------------------------------------------------- /nlp_pipeline_manager/save_pipeline.mdl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ZWMiller/nlp_pipe_manager/5a73ed4c7afa345573bec38a59169f445416c3cf/nlp_pipeline_manager/save_pipeline.mdl -------------------------------------------------------------------------------- /nlp_pipeline_manager/supervised_nlp.py: -------------------------------------------------------------------------------- 1 | from nlp_pipeline_manager.nlp_preprocessor import nlp_preprocessor 2 | from nlp_pipeline_manager.nlpipe import nlpipe 3 | 4 | 5 | class supervised_nlp(nlpipe): 6 | 7 | def __init__(self, model, preprocessing_pipeline=None): 8 | """ 9 | A pipeline for doing supervised nlp. Expects a model and creates 10 | a preprocessing pipeline if one isn't provided. 11 | """ 12 | self.model = model 13 | self._is_fit = False 14 | if not preprocessing_pipeline: 15 | self.preprocessor = nlp_preprocessor() 16 | else: 17 | self.preprocessor = preprocessing_pipeline 18 | 19 | def fit(self, X, y): 20 | """ 21 | Trains the vectorizer and model together using the 22 | users input training data. 
23 | """ 24 | self.preprocessor.fit(X) 25 | train_data = self.preprocessor.transform(X) 26 | self.model.fit(train_data, y) 27 | self._is_fit = True 28 | 29 | def predict(self, X): 30 | """ 31 | Makes a prediction on the data provided by the user using the 32 | preprocessing pipeline and provided model. 33 | """ 34 | if not self._is_fit: 35 | raise ValueError("Must fit the models before predicting!") 36 | test_data = self.preprocessor.transform(X) 37 | preds = self.model.predict(test_data) 38 | return preds 39 | 40 | def score(self, X, y): 41 | """ 42 | Returns the model's score (accuracy for SkLearn classifiers, R^2 for 43 | regressors) after the trained preprocessing pipeline prepares the data. 44 | """ 45 | test_data = self.preprocessor.transform(X) 46 | return self.model.score(test_data, y) 47 | 48 | 49 | if __name__ == "__main__": 50 | from sklearn import datasets 51 | 52 | categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball'] 53 | ng_train = datasets.fetch_20newsgroups(subset='train', 54 | categories=categories, 55 | remove=('headers', 56 | 'footers', 'quotes')) 57 | ng_train_data = ng_train.data 58 | ng_train_targets = ng_train.target 59 | 60 | ng_test = datasets.fetch_20newsgroups(subset='test', 61 | categories=categories, 62 | remove=('headers', 63 | 'footers', 'quotes')) 64 | 65 | ng_test_data = ng_test.data 66 | ng_test_targets = ng_test.target 67 | 68 | from sklearn.naive_bayes import MultinomialNB 69 | 70 | nlp_pipe = supervised_nlp(MultinomialNB()) 71 | nlp_pipe.fit(ng_train_data, ng_train_targets) 72 | print("Accuracy: ", nlp_pipe.score(ng_test_data, ng_test_targets)) -------------------------------------------------------------------------------- /nlp_pipeline_manager/topic_modeling_nlp.py: -------------------------------------------------------------------------------- 1 | from nlp_pipeline_manager.nlp_preprocessor import nlp_preprocessor 2 | from nlp_pipeline_manager.nlpipe import nlpipe 3 | 4 | class topic_modeling_nlp(nlpipe): 5 | 6 | def __init__(self, model, 
preprocessing_pipeline=None): 7 | """ 8 | A pipeline for doing topic modeling nlp. Expects a model and creates 9 | a preprocessing pipeline if one isn't provided. 10 | """ 11 | self.model = model 12 | self._is_fit = False 13 | if not preprocessing_pipeline: 14 | self.preprocessor = nlp_preprocessor() 15 | else: 16 | self.preprocessor = preprocessing_pipeline 17 | 18 | def fit(self, X): 19 | """ 20 | Trains the vectorizer and model together using the 21 | user's input training data. 22 | """ 23 | self.preprocessor.fit(X) 24 | train_data = self.preprocessor.transform(X) 25 | self.model.fit(train_data) 26 | self._is_fit = True 27 | 28 | def transform(self, X): 29 | """ 30 | Projects the data provided by the user into topic space using the 31 | preprocessing pipeline and provided model. 32 | """ 33 | if not self._is_fit: 34 | raise ValueError("Must fit the models before transforming!") 35 | test_data = self.preprocessor.transform(X) 36 | preds = self.model.transform(test_data) 37 | return preds 38 | 39 | def print_topics(self, num_words=10): 40 | """ 41 | A function to print out the top words for each topic 42 | """ 43 | feat_names = self.preprocessor.vectorizer.get_feature_names() 44 | for topic_idx, topic in enumerate(self.model.components_): 45 | message = "Topic #%d: " % topic_idx 46 | message += " ".join([feat_names[i] 47 | for i in topic.argsort()[:-num_words - 1:-1]]) 48 | print(message) 49 | 50 | 51 | 52 | if __name__ == "__main__": 53 | from sklearn import datasets 54 | 55 | categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball'] 56 | ng_train = datasets.fetch_20newsgroups(subset='train', 57 | categories=categories, 58 | remove=('headers', 59 | 'footers', 'quotes')) 60 | ng_train_data = ng_train.data 61 | ng_train_targets = ng_train.target 62 | 63 | ng_test = datasets.fetch_20newsgroups(subset='test', 64 | categories=categories, 65 | remove=('headers', 66 | 'footers', 'quotes')) 67 | from sklearn.decomposition import TruncatedSVD 68 | from 
sklearn.feature_extraction.text import CountVectorizer 69 | 70 | from topic_modeling_nlp import topic_modeling_nlp 71 | 72 | 73 | cv = CountVectorizer(stop_words='english', token_pattern='\\b[a-z][a-z]+\\b') 74 | cleaning_pipe = nlp_preprocessor(vectorizer=cv) 75 | topic_chain = topic_modeling_nlp(TruncatedSVD(n_components=15), preprocessing_pipeline=cleaning_pipe) 76 | 77 | topic_chain.fit(ng_train_data) 78 | topic_chain.print_topics() -------------------------------------------------------------------------------- /prototyping/building_an_nlp_pipeline_class.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Building a class to manage our NLP pipelines" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Because it's such a pain to manage all the permutations of NLP cleaners/tokenizers/vectorizers/stemmers/etc, we're going to build a class that takes all of those pieces in and manages the pipelines for us." 
15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "ExecuteTime": { 22 | "end_time": "2018-08-12T19:36:53.819932Z", 23 | "start_time": "2018-08-12T19:36:52.882108Z" 24 | } 25 | }, 26 | "outputs": [ 27 | { 28 | "name": "stdout", 29 | "output_type": "stream", 30 | "text": [ 31 | "Python Version: 3.6.6 |Anaconda custom (64-bit)| (default, Jun 28 2018, 11:07:29) \n", 32 | "[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] \n", 33 | "\n", 34 | "Matplotlib Version: 2.2.2\n", 35 | "Numpy Version: 1.15.0\n", 36 | "Pandas Version: 0.23.3\n", 37 | "NLTK Version: 3.3\n", 38 | "sklearn Version: 0.19.1\n" 39 | ] 40 | } 41 | ], 42 | "source": [ 43 | "import numpy as np\n", 44 | "import sklearn\n", 45 | "import matplotlib\n", 46 | "import pandas as pd\n", 47 | "import sklearn\n", 48 | "import sys\n", 49 | "import nltk\n", 50 | "\n", 51 | "libraries = (('Matplotlib', matplotlib), ('Numpy', np), ('Pandas', pd), ('NLTK', nltk), ('sklearn',sklearn))\n", 52 | "\n", 53 | "print(\"Python Version:\", sys.version, '\\n')\n", 54 | "for lib in libraries:\n", 55 | " print('{0} Version: {1}'.format(lib[0], lib[1].__version__))" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "metadata": { 62 | "ExecuteTime": { 63 | "end_time": "2018-08-12T19:36:53.986272Z", 64 | "start_time": "2018-08-12T19:36:53.822664Z" 65 | } 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "import numpy as np\n", 70 | "import random\n", 71 | "import matplotlib.pyplot as plt\n", 72 | "import pandas as pd\n", 73 | "import math\n", 74 | "import scipy\n", 75 | "%matplotlib inline\n", 76 | "plt.style.use('seaborn')" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 3, 82 | "metadata": { 83 | "ExecuteTime": { 84 | "end_time": "2018-08-12T19:36:53.999203Z", 85 | "start_time": "2018-08-12T19:36:53.988652Z" 86 | } 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "from sklearn.feature_extraction.text import 
CountVectorizer\n", 91 | "import pickle\n", 92 | "\n", 93 | "class nlp_preprocessor:\n", 94 | " \n", 95 | " def __init__(self, vectorizer=CountVectorizer(), tokenizer=None, cleaning_function=None, \n", 96 | " stemmer=None, model=None):\n", 97 | " \"\"\"\n", 98 | " A class for pipelining our data in NLP problems. The user provides a series of \n", 99 | " tools, and this class manages all of the training, transforming, and modification\n", 100 | " of the text data.\n", 101 | " ---\n", 102 | " Inputs:\n", 103 | " vectorizer: the model to use for vectorization of text data\n", 104 | " tokenizer: The tokenizer to use, if none defaults to split on spaces\n", 105 | " cleaning_function: how to clean the data, if None, defaults to the in built class\n", 106 | " \"\"\"\n", 107 | " if not tokenizer:\n", 108 | " tokenizer = self.splitter\n", 109 | " if not cleaning_function:\n", 110 | " cleaning_function = self.clean_text\n", 111 | " self.stemmer = stemmer\n", 112 | " self.tokenizer = tokenizer\n", 113 | " self.model = model\n", 114 | " self.cleaning_function = cleaning_function\n", 115 | " self.vectorizer = vectorizer\n", 116 | " self._is_fit = False\n", 117 | " \n", 118 | " def splitter(self, text):\n", 119 | " \"\"\"\n", 120 | " Default tokenizer that splits on spaces naively\n", 121 | " \"\"\"\n", 122 | " return text.split(' ')\n", 123 | " \n", 124 | " def clean_text(self, text, tokenizer, stemmer):\n", 125 | " \"\"\"\n", 126 | " A naive function to lowercase all words and clean them quickly.\n", 127 | " This is the default behavior if no other cleaning function is specified\n", 128 | " \"\"\"\n", 129 | " cleaned_text = []\n", 130 | " for post in text:\n", 131 | " cleaned_words = []\n", 132 | " for word in tokenizer(post):\n", 133 | " low_word = word.lower()\n", 134 | " if stemmer:\n", 135 | " low_word = stemmer.stem(low_word)\n", 136 | " cleaned_words.append(low_word)\n", 137 | " cleaned_text.append(' '.join(cleaned_words))\n", 138 | " return cleaned_text\n", 139 | " \n", 
140 | " def fit(self, text):\n", 141 | " \"\"\"\n", 142 | " Cleans the data and then fits the vectorizer with\n", 143 | " the user provided text\n", 144 | " \"\"\"\n", 145 | " clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)\n", 146 | " self.vectorizer.fit(clean_text)\n", 147 | " self._is_fit = True\n", 148 | " \n", 149 | " def transform(self, text):\n", 150 | " \"\"\"\n", 151 | " Cleans any provided data and then transforms the data into\n", 152 | " a vectorized format based on the fit function. Returns the\n", 153 | " vectorized form of the data.\n", 154 | " \"\"\"\n", 155 | " if not self._is_fit:\n", 156 | " raise ValueError(\"Must fit the models before transforming!\")\n", 157 | " clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)\n", 158 | " return self.vectorizer.transform(clean_text)\n", 159 | " \n", 160 | " def save_pipe(self, filename):\n", 161 | " \"\"\"\n", 162 | " Writes the attributes of the pipeline to a file\n", 163 | " allowing a pipeline to be loaded later with the\n", 164 | " pre-trained pieces in place.\n", 165 | " \"\"\"\n", 166 | " if type(filename) != str:\n", 167 | " raise TypeError(\"filename must be a string\")\n", 168 | " pickle.dump(self.__dict__, open(filename+\".mdl\",'wb'))\n", 169 | " \n", 170 | " def load_pipe(self, filename):\n", 171 | " \"\"\"\n", 172 | " Reads the attributes of a saved pipeline from a file,\n", 173 | " allowing the pipeline to be restored with the\n", 174 | " pre-trained pieces in place.\n", 175 | " \"\"\"\n", 176 | " if type(filename) != str:\n", 177 | " raise TypeError(\"filename must be a string\")\n", 178 | " if filename[-4:] != '.mdl':\n", 179 | " filename += '.mdl'\n", 180 | " self.__dict__ = pickle.load(open(filename,'rb'))" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "## Now let's test the model with the defaults" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 4, 193 | "metadata": 
{ 194 | "ExecuteTime": { 195 | "end_time": "2018-08-12T19:36:54.006636Z", 196 | "start_time": "2018-08-12T19:36:54.001864Z" 197 | } 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 5, 207 | "metadata": { 208 | "ExecuteTime": { 209 | "end_time": "2018-08-12T19:36:54.015526Z", 210 | "start_time": "2018-08-12T19:36:54.011931Z" 211 | } 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "nlp = nlp_preprocessor()" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 6, 221 | "metadata": { 222 | "ExecuteTime": { 223 | "end_time": "2018-08-12T19:36:54.027433Z", 224 | "start_time": "2018-08-12T19:36:54.018577Z" 225 | } 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "nlp.fit(corpus)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 7, 235 | "metadata": { 236 | "ExecuteTime": { 237 | "end_time": "2018-08-12T19:36:54.036505Z", 238 | "start_time": "2018-08-12T19:36:54.029831Z" 239 | } 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "array([[1, 1, 0, 0, 0, 1, 0, 0],\n", 246 | " [0, 0, 0, 1, 1, 0, 0, 0],\n", 247 | " [0, 0, 1, 0, 0, 0, 1, 1]])" 248 | ] 249 | }, 250 | "execution_count": 7, 251 | "metadata": {}, 252 | "output_type": "execute_result" 253 | } 254 | ], 255 | "source": [ 256 | "nlp.transform(corpus).toarray()" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 8, 262 | "metadata": { 263 | "ExecuteTime": { 264 | "end_time": "2018-08-12T19:36:54.141882Z", 265 | "start_time": "2018-08-12T19:36:54.039121Z" 266 | } 267 | }, 268 | "outputs": [ 269 | { 270 | "data": { 271 | "text/html": [ 272 | "
\n", 273 | "\n", 286 | "\n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | "
   bob  builder  cartoon  is  strange  the  thing  type
0    1        1        0   0        0    1      0     0
1    0        0        0   1        1    0      0     0
2    0        0        1   0        0    0      1     1
\n", 336 | "
" 337 | ], 338 | "text/plain": [ 339 | " bob builder cartoon is strange the thing type\n", 340 | "0 1 1 0 0 0 1 0 0\n", 341 | "1 0 0 0 1 1 0 0 0\n", 342 | "2 0 0 1 0 0 0 1 1" 343 | ] 344 | }, 345 | "execution_count": 8, 346 | "metadata": {}, 347 | "output_type": "execute_result" 348 | } 349 | ], 350 | "source": [ 351 | "pd.DataFrame(nlp.transform(corpus).toarray(), columns=nlp.vectorizer.get_feature_names())" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "## What if we want to swap pieces in?" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 9, 364 | "metadata": { 365 | "ExecuteTime": { 366 | "end_time": "2018-08-12T19:36:54.151700Z", 367 | "start_time": "2018-08-12T19:36:54.144563Z" 368 | } 369 | }, 370 | "outputs": [], 371 | "source": [ 372 | "def new_clean_text(text, tokenizer, stemmer):\n", 373 | " \"\"\"\n", 374 | " A custom cleaning function that lowercases all words, drops the\n", 375 | " word 'builder', and applies the stemmer if one is provided.\n", 376 | " \"\"\"\n", 377 | " cleaned_text = []\n", 378 | " for post in text:\n", 379 | " cleaned_words = []\n", 380 | " for word in tokenizer(post):\n", 381 | " low_word = word.lower()\n", 382 | " if low_word in ['builder']: # remove the word builder\n", 383 | " continue\n", 384 | " if stemmer:\n", 385 | " low_word = stemmer.stem(low_word)\n", 386 | " cleaned_words.append(low_word)\n", 387 | " cleaned_text.append(' '.join(cleaned_words))\n", 388 | " return cleaned_text" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 10, 394 | "metadata": { 395 | "ExecuteTime": { 396 | "end_time": "2018-08-12T19:36:54.161583Z", 397 | "start_time": "2018-08-12T19:36:54.156314Z" 398 | } 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "from nltk.stem import PorterStemmer\n", 403 | "\n", 404 | "nlp2 = nlp_preprocessor(cleaning_function=new_clean_text, vectorizer=CountVectorizer(lowercase=False), \n", 405 
| " stemmer=PorterStemmer())" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 11, 411 | "metadata": { 412 | "ExecuteTime": { 413 | "end_time": "2018-08-12T19:36:54.174080Z", 414 | "start_time": "2018-08-12T19:36:54.164305Z" 415 | } 416 | }, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "['bob', 'cartoon', 'is', 'strang', 'the', 'thing', 'type']" 422 | ] 423 | }, 424 | "execution_count": 11, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "nlp2.fit(corpus)\n", 431 | "nlp2.vectorizer.get_feature_names()" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 12, 437 | "metadata": { 438 | "ExecuteTime": { 439 | "end_time": "2018-08-12T19:36:54.188085Z", 440 | "start_time": "2018-08-12T19:36:54.176340Z" 441 | } 442 | }, 443 | "outputs": [ 444 | { 445 | "data": { 446 | "text/html": [ 447 | "
\n", 448 | "\n", 461 | "\n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | "
   bob  cartoon  is  strang  the  thing  type
0    1        0   0       0    1      0     0
1    0        0   1       1    0      0     0
2    0        1   0       0    0      1     1
\n", 507 | "
" 508 | ], 509 | "text/plain": [ 510 | " bob cartoon is strang the thing type\n", 511 | "0 1 0 0 0 1 0 0\n", 512 | "1 0 0 1 1 0 0 0\n", 513 | "2 0 1 0 0 0 1 1" 514 | ] 515 | }, 516 | "execution_count": 12, 517 | "metadata": {}, 518 | "output_type": "execute_result" 519 | } 520 | ], 521 | "source": [ 522 | "pd.DataFrame(nlp2.transform(corpus).toarray(), columns=nlp2.vectorizer.get_feature_names())" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "## What about using TF-IDF instead?" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": 13, 535 | "metadata": { 536 | "ExecuteTime": { 537 | "end_time": "2018-08-12T19:36:54.194725Z", 538 | "start_time": "2018-08-12T19:36:54.190607Z" 539 | } 540 | }, 541 | "outputs": [], 542 | "source": [ 543 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 544 | "\n", 545 | "nlp3 = nlp_preprocessor(cleaning_function=new_clean_text, vectorizer=TfidfVectorizer(lowercase=False))" 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": 14, 551 | "metadata": { 552 | "ExecuteTime": { 553 | "end_time": "2018-08-12T19:36:54.214380Z", 554 | "start_time": "2018-08-12T19:36:54.197369Z" 555 | } 556 | }, 557 | "outputs": [ 558 | { 559 | "data": { 560 | "text/html": [ 561 | "
\n", 562 | "\n", 575 | "\n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | "
        bob  cartoon        is   strange       the    thing     type
0  0.707107  0.00000  0.000000  0.000000  0.707107  0.00000  0.00000
1  0.000000  0.00000  0.707107  0.707107  0.000000  0.00000  0.00000
2  0.000000  0.57735  0.000000  0.000000  0.000000  0.57735  0.57735
\n", 621 | "
" 622 | ], 623 | "text/plain": [ 624 | " bob cartoon is strange the thing type\n", 625 | "0 0.707107 0.00000 0.000000 0.000000 0.707107 0.00000 0.00000\n", 626 | "1 0.000000 0.00000 0.707107 0.707107 0.000000 0.00000 0.00000\n", 627 | "2 0.000000 0.57735 0.000000 0.000000 0.000000 0.57735 0.57735" 628 | ] 629 | }, 630 | "execution_count": 14, 631 | "metadata": {}, 632 | "output_type": "execute_result" 633 | } 634 | ], 635 | "source": [ 636 | "nlp3.fit(corpus)\n", 637 | "nlp3.vectorizer.get_feature_names()\n", 638 | "pd.DataFrame(nlp3.transform(corpus).toarray(), columns=nlp3.vectorizer.get_feature_names())" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "# So what? Let's use some real data to try some different modeling approaches" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": 15, 651 | "metadata": { 652 | "ExecuteTime": { 653 | "end_time": "2018-08-12T19:36:56.475435Z", 654 | "start_time": "2018-08-12T19:36:54.216987Z" 655 | } 656 | }, 657 | "outputs": [], 658 | "source": [ 659 | "from sklearn import datasets\n", 660 | "\n", 661 | "categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']\n", 662 | "ng_train = datasets.fetch_20newsgroups(subset='train', \n", 663 | " categories=categories, \n", 664 | " remove=('headers', \n", 665 | " 'footers', 'quotes'))\n", 666 | "ng_train_data = ng_train.data\n", 667 | "ng_train_targets = ng_train.target\n", 668 | "\n", 669 | "ng_test = datasets.fetch_20newsgroups(subset='test', \n", 670 | " categories=categories, \n", 671 | " remove=('headers', \n", 672 | " 'footers', 'quotes'))\n", 673 | "\n", 674 | "ng_test_data = ng_test.data\n", 675 | "ng_test_targets = ng_test.target" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": 16, 681 | "metadata": { 682 | "ExecuteTime": { 683 | "end_time": "2018-08-12T19:37:10.328227Z", 684 | "start_time": "2018-08-12T19:36:56.477401Z" 685 | } 686 | }, 687 | "outputs": [ 688 | 
{ 689 | "name": "stdout", 690 | "output_type": "stream", 691 | "text": [ 692 | "Chain 0: 0.9031674208144796\n", 693 | "Chain 1: 0.9076923076923077\n", 694 | "Chain 2: 0.8995475113122172\n" 695 | ] 696 | } 697 | ], 698 | "source": [ 699 | "from sklearn.naive_bayes import MultinomialNB\n", 700 | "from nltk.stem import PorterStemmer\n", 701 | "\n", 702 | "nlp = nlp_preprocessor(stemmer=PorterStemmer())\n", 703 | "nlp2 = nlp_preprocessor(vectorizer=CountVectorizer(lowercase=False))\n", 704 | "nlp3 = nlp_preprocessor(cleaning_function=new_clean_text, vectorizer=TfidfVectorizer(lowercase=False))\n", 705 | "nlp_chains = [nlp, nlp2, nlp3]\n", 706 | "\n", 707 | "for ix, chain in enumerate(nlp_chains):\n", 708 | " nb = MultinomialNB()\n", 709 | " chain.fit(ng_train_data)\n", 710 | " train_data = chain.transform(ng_train_data)\n", 711 | " test_data = chain.transform(ng_test_data)\n", 712 | " nb.fit(train_data, ng_train_targets)\n", 713 | " accuracy = nb.score(test_data, ng_test_targets)\n", 714 | " print(\"Chain {}: {}\".format(ix, accuracy))" 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "metadata": {}, 720 | "source": [ 721 | "## Summary" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": {}, 727 | "source": [ 728 | "This allows us to sweep all of the preprocessing into a class where we can control the pieces and parts that go in, and can see what comes out. If we wanted to, we could even add a model into the class as well and put the whole pipe into a single class that manages all of our challenges. In this case, we've left it outside for demo purposes. This also saves all of the pieces together, so we can just pickle a class object and that will keep the whole structure of our models together - such as the vectorizer and the stemmer we used, as well as the cleaning routine, so we don't lose any of the pieces if we want to run it on new data later." 
729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "metadata": {}, 734 | "source": [ 735 | "# Adding a model to the mix\n", 736 | "\n", 737 | "Depending on the type of model we want to build, we'll need to wrap the preprocessing class a little differently. For example, if we're doing supervised learning, we'll want a `predict` method. If we're doing topic modeling, we'll want a `transform` method. To make that happen, I'll show a few examples below that wrap the preprocessing class. " 738 | ] 739 | }, 740 | { 741 | "cell_type": "markdown", 742 | "metadata": {}, 743 | "source": [ 744 | "#### Supervised: Classification\n", 745 | "\n", 746 | "Here we'll write a wrapper class that predicts a label given the text of a document. " 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 17, 752 | "metadata": { 753 | "ExecuteTime": { 754 | "end_time": "2018-08-12T19:37:10.340551Z", 755 | "start_time": "2018-08-12T19:37:10.330954Z" 756 | } 757 | }, 758 | "outputs": [], 759 | "source": [ 760 | "class supervised_nlp:\n", 761 | " \n", 762 | " def __init__(self, model, preprocessing_pipeline=None):\n", 763 | " \"\"\"\n", 764 | " A pipeline for doing supervised nlp. 
Expects a model and creates\n", 765 | " a preprocessing pipeline if one isn't provided.\n", 766 | " \"\"\"\n", 767 | " self.model = model\n", 768 | " self._is_fit = False\n", 769 | " if not preprocessing_pipeline:\n", 770 | " self.preprocessor = nlp_preprocessor()\n", 771 | " else:\n", 772 | " self.preprocessor = preprocessing_pipeline\n", 773 | " \n", 774 | " def fit(self, X, y):\n", 775 | " \"\"\"\n", 776 | " Trains the vectorizer and model together using the \n", 777 | " user's input training data.\n", 778 | " \"\"\"\n", 779 | " self.preprocessor.fit(X)\n", 780 | " train_data = self.preprocessor.transform(X)\n", 781 | " self.model.fit(train_data, y)\n", 782 | " self._is_fit = True\n", 783 | " \n", 784 | " def predict(self, X):\n", 785 | " \"\"\"\n", 786 | " Makes a prediction on the data provided by the user using the \n", 787 | " preprocessing pipeline and provided model.\n", 788 | " \"\"\"\n", 789 | " if not self._is_fit:\n", 790 | " raise ValueError(\"Must fit the models before predicting!\")\n", 791 | " test_data = self.preprocessor.transform(X)\n", 792 | " preds = self.model.predict(test_data)\n", 793 | " return preds\n", 794 | " \n", 795 | " def score(self, X, y):\n", 796 | " \"\"\"\n", 797 | " Returns the accuracy for the model after using the trained\n", 798 | " preprocessing pipeline to prepare the data.\n", 799 | " \"\"\"\n", 800 | " test_data = self.preprocessor.transform(X)\n", 801 | " return self.model.score(test_data, y)\n", 802 | " \n", 803 | " def save_pipe(self, filename):\n", 804 | " \"\"\"\n", 805 | " Writes the attributes of the pipeline to a file\n", 806 | " allowing a pipeline to be loaded later with the\n", 807 | " pre-trained pieces in place.\n", 808 | " \"\"\"\n", 809 | " if type(filename) != str:\n", 810 | " raise TypeError(\"filename must be a string\")\n", 811 | " pickle.dump(self.__dict__, open(filename+\".mdl\",'wb'))\n", 812 | " \n", 813 | " def load_pipe(self, filename):\n", 814 | " \"\"\"\n", 815 | " Reads the attributes of 
a saved pipeline from a file,\n", 816 | " allowing the pipeline to be restored with the\n", 817 | " pre-trained pieces in place.\n", 818 | " \"\"\"\n", 819 | " if type(filename) != str:\n", 820 | " raise TypeError(\"filename must be a string\")\n", 821 | " if filename[-4:] != '.mdl':\n", 822 | " filename += '.mdl'\n", 823 | " self.__dict__ = pickle.load(open(filename,'rb'))" 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": 18, 829 | "metadata": { 830 | "ExecuteTime": { 831 | "end_time": "2018-08-12T19:37:22.978597Z", 832 | "start_time": "2018-08-12T19:37:10.343254Z" 833 | } 834 | }, 835 | "outputs": [ 836 | { 837 | "data": { 838 | "text/plain": [ 839 | "0.9031674208144796" 840 | ] 841 | }, 842 | "execution_count": 18, 843 | "metadata": {}, 844 | "output_type": "execute_result" 845 | } 846 | ], 847 | "source": [ 848 | "nlp_pipe = supervised_nlp(MultinomialNB(), nlp)\n", 849 | "nlp_pipe.fit(ng_train_data, ng_train_targets)\n", 850 | "nlp_pipe.score(ng_test_data, ng_test_targets)" 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "metadata": {}, 856 | "source": [ 857 | "Swap out the model for something different." 
858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 19, 863 | "metadata": { 864 | "ExecuteTime": { 865 | "end_time": "2018-08-12T19:37:35.730306Z", 866 | "start_time": "2018-08-12T19:37:22.981131Z" 867 | } 868 | }, 869 | "outputs": [ 870 | { 871 | "data": { 872 | "text/plain": [ 873 | "0.8262443438914027" 874 | ] 875 | }, 876 | "execution_count": 19, 877 | "metadata": {}, 878 | "output_type": "execute_result" 879 | } 880 | ], 881 | "source": [ 882 | "from sklearn.svm import LinearSVC\n", 883 | "\n", 884 | "nlp_pipe = supervised_nlp(LinearSVC(), nlp)\n", 885 | "nlp_pipe.fit(ng_train_data, ng_train_targets)\n", 886 | "nlp_pipe.score(ng_test_data, ng_test_targets)" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 | "metadata": {}, 892 | "source": [ 893 | "#### Unsupervised: Topic Modeling\n", 894 | "\n", 895 | "In this example we don't want to make a prediction; we just want to find topics and be able to cast our data from the \"word space\" into the \"topic space.\" With this in mind, we'll add a `transform` method and also the ability to print out the topics." 896 | ] 897 | }, 898 | { 899 | "cell_type": "code", 900 | "execution_count": 20, 901 | "metadata": { 902 | "ExecuteTime": { 903 | "end_time": "2018-08-12T19:37:35.740426Z", 904 | "start_time": "2018-08-12T19:37:35.732734Z" 905 | } 906 | }, 907 | "outputs": [], 908 | "source": [ 909 | "class topic_modeling_nlp:\n", 910 | " \n", 911 | " def __init__(self, model, preprocessing_pipeline=None):\n", 912 | " \"\"\"\n", 913 | " A pipeline for doing topic modeling. 
Expects a model and creates\n", 914 | " a preprocessing pipeline if one isn't provided.\n", 915 | " \"\"\"\n", 916 | " self.model = model\n", 917 | " self._is_fit = False\n", 918 | " if not preprocessing_pipeline:\n", 919 | " self.preprocessor = nlp_preprocessor()\n", 920 | " else:\n", 921 | " self.preprocessor = preprocessing_pipeline\n", 922 | " \n", 923 | " def fit(self, X):\n", 924 | " \"\"\"\n", 925 | " Trains the vectorizer and model together using the \n", 926 | " user's input training data.\n", 927 | " \"\"\"\n", 928 | " self.preprocessor.fit(X)\n", 929 | " train_data = self.preprocessor.transform(X)\n", 930 | " self.model.fit(train_data)\n", 931 | " self._is_fit = True\n", 932 | " \n", 933 | " def transform(self, X):\n", 934 | " \"\"\"\n", 935 | " Casts the data provided by the user into topic space using the \n", 936 | " preprocessing pipeline and provided model.\n", 937 | " \"\"\"\n", 938 | " if not self._is_fit:\n", 939 | " raise ValueError(\"Must fit the models before transforming!\")\n", 940 | " test_data = self.preprocessor.transform(X)\n", 941 | " preds = self.model.transform(test_data)\n", 942 | " return preds\n", 943 | " \n", 944 | " def print_topics(self, num_words=10):\n", 945 | " \"\"\"\n", 946 | " A function to print out the top words for each topic\n", 947 | " \"\"\"\n", 948 | " feat_names = self.preprocessor.vectorizer.get_feature_names()\n", 949 | " for topic_idx, topic in enumerate(self.model.components_):\n", 950 | " message = \"Topic #%d: \" % topic_idx\n", 951 | " message += \" \".join([feat_names[i]\n", 952 | " for i in topic.argsort()[:-num_words - 1:-1]])\n", 953 | " print(message)\n", 954 | " \n", 955 | " def save_pipe(self, filename):\n", 956 | " \"\"\"\n", 957 | " Writes the attributes of the pipeline to a file\n", 958 | " allowing a pipeline to be loaded later with the\n", 959 | " pre-trained pieces in place.\n", 960 | " \"\"\"\n", 961 | " if type(filename) != str:\n", 962 | " raise TypeError(\"filename must be a string\")\n", 963 
| " pickle.dump(self.__dict__, open(filename+\".mdl\",'wb'))\n", 964 | " \n", 965 | " def load_pipe(self, filename):\n", 966 | " \"\"\"\n", 967 | " Reads the attributes of a saved pipeline from a file,\n", 968 | " allowing the pipeline to be restored with the\n", 969 | " pre-trained pieces in place.\n", 970 | " \"\"\"\n", 971 | " if type(filename) != str:\n", 972 | " raise TypeError(\"filename must be a string\")\n", 973 | " if filename[-4:] != '.mdl':\n", 974 | " filename += '.mdl'\n", 975 | " self.__dict__ = pickle.load(open(filename,'rb'))" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": 21, 981 | "metadata": { 982 | "ExecuteTime": { 983 | "end_time": "2018-08-12T19:37:35.760623Z", 984 | "start_time": "2018-08-12T19:37:35.743417Z" 985 | } 986 | }, 987 | "outputs": [], 988 | "source": [ 989 | "from sklearn.decomposition import TruncatedSVD\n", 990 | "\n", 991 | "cv = CountVectorizer(stop_words='english', token_pattern='\\\\b[a-z][a-z]+\\\\b')\n", 992 | "cleaning_pipe = nlp_preprocessor(vectorizer=cv)\n", 993 | "topic_chain = topic_modeling_nlp(TruncatedSVD(n_components=15), preprocessing_pipeline=cleaning_pipe)" 994 | ] 995 | }, 996 | { 997 | "cell_type": "code", 998 | "execution_count": 22, 999 | "metadata": { 1000 | "ExecuteTime": { 1001 | "end_time": "2018-08-12T19:37:36.687204Z", 1002 | "start_time": "2018-08-12T19:37:35.762876Z" 1003 | } 1004 | }, 1005 | "outputs": [ 1006 | { 1007 | "data": { 1008 | "text/plain": [ 1009 | "(1661, 15)" 1010 | ] 1011 | }, 1012 | "execution_count": 22, 1013 | "metadata": {}, 1014 | "output_type": "execute_result" 1015 | } 1016 | ], 1017 | "source": [ 1018 | "topic_chain.fit(ng_train_data)\n", 1019 | "topic_chain.transform(ng_train_data).shape" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": 23, 1025 | "metadata": { 1026 | "ExecuteTime": { 1027 | "end_time": "2018-08-12T19:37:36.734831Z", 1028 | "start_time": "2018-08-12T19:37:36.689485Z" 1029 | } 1030 | }, 1031 | 
"outputs": [ 1032 | { 1033 | "name": "stdout", 1034 | "output_type": "stream", 1035 | "text": [ 1036 | "Topic #0: jpeg image edu file graphics gif images format color pub\n", 1037 | "Topic #1: edu graphics pub data mail ray ftp send com objects\n", 1038 | "Topic #2: jesus god atheists matthew people atheism does religious said religion\n", 1039 | "Topic #3: image data processing analysis software available display tools tool user\n", 1040 | "Topic #4: jesus matthew prophecy messiah psalm isaiah david said lord israel\n", 1041 | "Topic #5: argument fallacy conclusion example true argumentum ad premises false valid\n", 1042 | "Topic #6: data available ftp sgi grass vertex pci motecc model info\n", 1043 | "Topic #7: game year don good think hit won runs team home\n", 1044 | "Topic #8: god posting subject response typical information universe einstein bush evidence\n", 1045 | "Topic #9: den radius double theta sqrt pi sin rtheta pole pt\n", 1046 | "Topic #10: program read menu bits display change file pressing want don\n", 1047 | "Topic #11: lost program won cubs atheism game menu display bits home\n", 1048 | "Topic #12: game runs second hit run cubs graphics home win sunday\n", 1049 | "Topic #13: atheism alt faq send edu usenet news files otis answers\n", 1050 | "Topic #14: col int row value char imagewidth imageheight unsigned cubs pitcher\n" 1051 | ] 1052 | } 1053 | ], 1054 | "source": [ 1055 | "topic_chain.print_topics()" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "markdown", 1060 | "metadata": {}, 1061 | "source": [ 1062 | "Swap out the model for something different." 
1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": 24, 1068 | "metadata": { 1069 | "ExecuteTime": { 1070 | "end_time": "2018-08-12T19:37:36.741910Z", 1071 | "start_time": "2018-08-12T19:37:36.737246Z" 1072 | } 1073 | }, 1074 | "outputs": [], 1075 | "source": [ 1076 | "from sklearn.decomposition import LatentDirichletAllocation\n", 1077 | "topic_chain = topic_modeling_nlp(LatentDirichletAllocation(n_components=15), preprocessing_pipeline=cleaning_pipe)" 1078 | ] 1079 | }, 1080 | { 1081 | "cell_type": "code", 1082 | "execution_count": 25, 1083 | "metadata": { 1084 | "ExecuteTime": { 1085 | "end_time": "2018-08-12T19:37:42.475762Z", 1086 | "start_time": "2018-08-12T19:37:36.745876Z" 1087 | } 1088 | }, 1089 | "outputs": [ 1090 | { 1091 | "name": "stderr", 1092 | "output_type": "stream", 1093 | "text": [ 1094 | "/Users/zachariahmiller/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:536: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. 
This warning was introduced in 0.18.\n", 1095 | " DeprecationWarning)\n" 1096 | ] 1097 | }, 1098 | { 1099 | "name": "stdout", 1100 | "output_type": "stream", 1101 | "text": [ 1102 | "Topic #0: cornerstone pappas nis dualpage dollar tour karlin eastwick rawley mcnally\n", 1103 | "Topic #1: corel tune pens bone trimming dykstra reprints highway qcr packs\n", 1104 | "Topic #2: enviroleague youth mission organizations markus true adult max pointer abekas\n", 1105 | "Topic #3: image graphics edu file jpeg data files software use images\n", 1106 | "Topic #4: edu cobb hou jewish polygon problem illinois schwartz sigkids pay\n", 1107 | "Topic #5: graeme computer radiosity pp curves cubic bezier stephan generation dtax\n", 1108 | "Topic #6: god people atheism does atheists jesus religion don believe argument\n", 1109 | "Topic #7: sex sea tek said vice bronx com ico bob away\n", 1110 | "Topic #8: won cubs lost york new team edu san sox reds\n", 1111 | "Topic #9: den col int radius row war value bombing theta hussein\n", 1112 | "Topic #10: text copy texts septuagint masoretic parody ot passages various toilet\n", 1113 | "Topic #11: erickson cage animals temporarily syllogism spell cch cold products polygoon\n", 1114 | "Topic #12: alomar average league players baerga double player rbi ab slg\n", 1115 | "Topic #13: xxxx cruel anti rb germany paul mercedes woofing explaining girl\n", 1116 | "Topic #14: don think good just year like better time know game\n" 1117 | ] 1118 | } 1119 | ], 1120 | "source": [ 1121 | "topic_chain.fit(ng_train_data)\n", 1122 | "topic_chain.transform(ng_train_data).shape\n", 1123 | "topic_chain.print_topics()" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": null, 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [] 1132 | } 1133 | ], 1134 | "metadata": { 1135 | "kernelspec": { 1136 | "display_name": "Python [default]", 1137 | "language": "python", 1138 | "name": "python3" 1139 | }, 1140 | "language_info": { 
1141 | "codemirror_mode": { 1142 | "name": "ipython", 1143 | "version": 3 1144 | }, 1145 | "file_extension": ".py", 1146 | "mimetype": "text/x-python", 1147 | "name": "python", 1148 | "nbconvert_exporter": "python", 1149 | "pygments_lexer": "ipython3", 1150 | "version": "3.6.6" 1151 | }, 1152 | "toc": { 1153 | "nav_menu": {}, 1154 | "number_sections": true, 1155 | "sideBar": true, 1156 | "skip_h1_title": false, 1157 | "toc_cell": false, 1158 | "toc_position": {}, 1159 | "toc_section_display": "block", 1160 | "toc_window_display": false 1161 | } 1162 | }, 1163 | "nbformat": 4, 1164 | "nbformat_minor": 2 1165 | } 1166 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md") as fh: 4 | README_CONTENTS = fh.read() 5 | 6 | config = { 7 | 'name': 'NLPPipeManager', 8 | 'version': '0.1', 9 | 'author': 'ZWMiller', 10 | 'author_email': 'zach@notarealemail.com', 11 | 'long_description': README_CONTENTS, 12 | 'url': 'https://github.com/zwmiller/nlp_pipe_manager/', 13 | 'packages': setuptools.find_packages(), 14 | 'install_requires': ['scikit-learn','nltk'] 15 | } 16 | 17 | setuptools.setup(**config) 18 | --------------------------------------------------------------------------------