├── .gitignore
├── LICENSE
├── README.md
├── nlp_pipeline_manager
    ├── README.md
    ├── __init__.py
    ├── nlp_preprocessor.py
    ├── nlpipe.py
    ├── pipeline_demo.ipynb
    ├── save_pipeline.mdl
    ├── supervised_nlp.py
    └── topic_modeling_nlp.py
├── prototyping
    └── building_an_nlp_pipeline_class.ipynb
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | *.ipynb_checkpoints*
 2 | .idea/*
 3 | *.egg-info/*
 4 | build/*
 5 | dist/*
 6 | **~
 7 | lunch_and_learn_notes.md
 8 | *.DS_Store*
 9 | *.npz
10 | **/__pycache__/*
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 Zach Miller
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # NLP Pipeline Manager
 2 | 
 3 | The most frustrating part of doing NLP for me is keeping track of all the
 4 | different combinations of cleaning functions, stemmers, lemmatizers,
 5 | vectorizers, models, etc. I almost always resort to writing some awful
 6 | function that hacks those bits together and then prints out some scoring
 7 | piece. To help manage all of this better, I've developed a pipelining system
 8 | that allows the user to load all of the pieces into a class and then let the
 9 | class do the management for them. 
10 | 
11 | ### Installation
12 | 
13 | Clone this repo. Go to the directory where it is cloned and run:
14 | 
15 | ```bash
16 | python setup.py install
17 | ```
18 | 
 19 | nlp_pipeline_manager will then be installed on your machine and importable. This project
 20 | assumes Python 3 and requires NLTK and scikit-learn.
21 | 
22 | 
23 | ### Examples of using the pipeline:
24 | 
25 | ```python
26 | from nlp_pipeline_manager import nlp_preprocessor
27 | 
28 | corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']
29 | nlp = nlp_preprocessor()
30 | nlp.fit(corpus)
31 | nlp.transform(corpus).toarray()
32 | ```
33 | 
 34 | ```python
35 | array([[1, 1, 0, 0, 0, 1, 0, 0],
36 |        [0, 0, 0, 1, 1, 0, 0, 0],
37 |        [0, 0, 1, 0, 0, 0, 1, 1]])
38 | ```
39 | 
40 | Loading a stemmer into the pipeline (we actually pass in the stemming method):
41 | 
42 | ```python
43 | from nltk.stem import PorterStemmer
44 | 
45 | nlp = nlp_preprocessor(stemmer=PorterStemmer().stem)
46 | ```
47 | 
48 | The pipeline allows users to set:
49 | 
50 | * Vectorizer (using SkLearn classes)
51 | * Cleaning Function (user can just provide a function name without the parens
52 | at the end)
53 | * Tokenizer (Either an NLTK tokenizer or a function that takes a string and
54 | returns a list of tokens)
55 | * Stemmer (Can be any function that takes in a word and returns a root form - be it a stemmer or a lemmatizer)
56 | 
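As a rough illustration of the tokenizer and stemmer contracts listed above, the sketch below defines a regex tokenizer and a toy suffix-stripping "stemmer" using only the standard library. The names `regex_tokenizer` and `strip_plural` are hypothetical and are not part of nlp_pipeline_manager; they only show the expected shapes (string in, list of tokens out; word in, root form out).

```python
import re


def regex_tokenizer(text):
    """Hypothetical tokenizer: takes a string, returns a list of word tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def strip_plural(word):
    """Hypothetical toy stemmer: takes one word, returns a crude root form."""
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


tokens = regex_tokenizer("BOB the builders, fixing things!")
roots = [strip_plural(t.lower()) for t in tokens]
print(tokens)
print(roots)
```

Functions shaped like these could presumably be passed as `tokenizer=regex_tokenizer` and `stemmer=strip_plural` when constructing `nlp_preprocessor`.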
 57 | If the user wants to provide a cleaning function, it must accept three arguments:
 58 | the text (an iterable of documents), the tokenizer, and the stemmer (which may be
 59 | `None`). Here's an example cleaning function:
60 | 
61 | ```python
62 | def clean_text(text, tokenizer, stemmer):
 63 |     """
 64 |     A naive function to lowercase all words and clean them quickly
 65 |     with a stemmer if it exists.
 66 |     """
 67 |     cleaned_text = []
 68 |     for post in text:
 69 |         cleaned_words = []
 70 |         for word in tokenizer(post):
 71 |             low_word = word.lower()
 72 |             if stemmer:
 73 |                 low_word = stemmer(low_word)
 74 |             cleaned_words.append(low_word)
 75 |         cleaned_text.append(' '.join(cleaned_words))
 76 |     return cleaned_text
77 | ```
78 | 
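A slightly different cleaning function can follow the same three-argument contract. The sketch below is one possible variant that also strips leading and trailing punctuation; the name `clean_text_no_punct` is hypothetical, and `str.split` is used only as a stand-in tokenizer that splits on whitespace.

```python
import string


def clean_text_no_punct(text, tokenizer, stemmer):
    """
    Hypothetical cleaning function: lowercases each token, strips
    leading/trailing punctuation, and applies the stemmer if one is given.
    """
    cleaned_text = []
    for post in text:
        cleaned_words = []
        for word in tokenizer(post):
            low_word = word.lower().strip(string.punctuation)
            if not low_word:
                continue  # the token was pure punctuation
            if stemmer:
                low_word = stemmer(low_word)
            cleaned_words.append(low_word)
        cleaned_text.append(' '.join(cleaned_words))
    return cleaned_text


# str.split stands in for a tokenizer; no stemmer is supplied here.
result = clean_text_no_punct(['BOB, the builder!', 'is -- strange?'], str.split, None)
print(result)
```

A function like this would be handed to the pipeline without parentheses, for example `nlp_preprocessor(cleaning_function=clean_text_no_punct)`.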
79 | # ML with the pipeline
80 | 
81 | It's quick and easy to create modeling pipelines that wrap around the
82 | preprocessor. Two example pipes are shown, one for classification and one for
83 | topic modeling. Here's an example of using the classification pipe:
84 | 
85 | ```python
86 | from nlp_pipeline_manager import supervised_nlp
87 | from nlp_pipeline_manager import nlp_preprocessor
88 | from sklearn.naive_bayes import MultinomialNB
89 | 
90 | nlp = nlp_preprocessor()
91 | 
92 | nlp_pipe = supervised_nlp(MultinomialNB(), nlp)
93 | nlp_pipe.fit(ng_train_data, ng_train_targets)
94 | nlp_pipe.score(ng_test_data, ng_test_targets)
95 | ```
96 | 
97 | 
98 | 


--------------------------------------------------------------------------------
/nlp_pipeline_manager/README.md:
--------------------------------------------------------------------------------
 1 | # NLP Pipe Manager 
 2 | 
 3 | **nlpipe.py**
 4 | 
 5 | This file contains a parent class for file I/O to disk. This allows any pipeline using this as the parent class to 
 6 | save all attributes to disk in a pickle file for later use.
 7 | 
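The save/load idea can be sketched with the standard library alone: pickle the instance's attribute dictionary, then restore it into a fresh object. `TinyPipe` below is a hypothetical stand-in class, not part of this package, and the temporary directory is used only to keep the example self-contained.

```python
import os
import pickle
import tempfile


class TinyPipe:
    """Hypothetical stand-in for a pipeline class; not part of the package."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase
        self.vocabulary = None

    def save_pipe(self, filename):
        # Persist every attribute by pickling the instance dictionary.
        with open(filename + '.mdl', 'wb') as f:
            pickle.dump(self.__dict__, f)

    def load_pipe(self, filename):
        # Restore attributes, accepting the name with or without '.mdl'.
        if not filename.endswith('.mdl'):
            filename += '.mdl'
        with open(filename, 'rb') as f:
            self.__dict__ = pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_pipeline')
    trained = TinyPipe(lowercase=False)
    trained.vocabulary = {'bob': 0, 'builder': 1}
    trained.save_pipe(path)

    fresh = TinyPipe()
    fresh.load_pipe(path)

restored = (fresh.lowercase, fresh.vocabulary)
print(restored)
```

Because unpickling can execute arbitrary code, `.mdl` files should only be loaded from sources you trust.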
 8 | **nlp_preprocessor.py**
 9 | 
10 | This is an implementation of the main pipeline. It allows a user to put in bits and pieces, which it then chains
11 | together so the user only needs to call `fit` and `transform` to get vectorized NLP data.
12 | 
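To make the clean, fit, transform chain concrete, here is a minimal standard-library sketch of the same flow. `TinyPreprocessor` is a hypothetical simplification written for illustration; the real class delegates vectorization to a scikit-learn vectorizer rather than counting words itself.

```python
class TinyPreprocessor:
    """Hypothetical, simplified stand-in for the clean -> vectorize chain."""

    def __init__(self, tokenizer=str.split, stemmer=None):
        self.tokenizer = tokenizer
        self.stemmer = stemmer
        self.vocabulary = None

    def _clean(self, text):
        # Lowercase every token and optionally reduce it to a root form.
        cleaned = []
        for doc in text:
            words = [w.lower() for w in self.tokenizer(doc)]
            if self.stemmer:
                words = [self.stemmer(w) for w in words]
            cleaned.append(words)
        return cleaned

    def fit(self, text):
        # Learn a sorted vocabulary from the cleaned training text.
        words = sorted({w for doc in self._clean(text) for w in doc})
        self.vocabulary = {w: i for i, w in enumerate(words)}

    def transform(self, text):
        if self.vocabulary is None:
            raise ValueError("Must fit before transforming!")
        rows = []
        for doc in self._clean(text):
            row = [0] * len(self.vocabulary)
            for w in doc:
                if w in self.vocabulary:
                    row[self.vocabulary[w]] += 1
            rows.append(row)
        return rows


corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']
nlp = TinyPreprocessor()
nlp.fit(corpus)
matrix = nlp.transform(corpus)
print(matrix)
```

Unlike `CountVectorizer` with its default token pattern, this sketch keeps the single-character token `a`, so it produces nine columns instead of eight.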
13 | **pipeline_demo.ipynb**
14 | 
15 | A notebook that shows the pipeline in action.
16 | 
17 | **save_pipeline.mdl**
18 | 
 19 | A saved model file from the pipeline_demo. This demonstrates the I/O capabilities of the pipeline.
20 | 
21 | **supervised_nlp.py**
22 | 
23 | This file implements a class for using the nlp preprocessor along with a model to do prediction (classification
24 | or regression). It assumes that the user will be providing an SkLearn model to work with. Simplifies the user API
25 | to just `fit`, `predict`, and `score`.
26 | 
27 | **topic_modeling_nlp.py**
28 | 
 29 | This file implements a class for doing topic modeling with the `nlp_preprocessor`. The user must provide a 
30 | topic modeling method like `TruncatedSVD` or `LatentDirichletAllocation` from SkLearn. This simplifies the user API
31 | to just `fit`, `transform`, and `print_topics`.
32 | 
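The idea behind `print_topics` is to rank each topic's feature weights and print the names of the highest-weighted words. The sketch below shows that ranking in plain Python; the helper `top_words` and the weights are made up for illustration, whereas the real method reads `components_` from the fitted model and the feature names from the vectorizer.

```python
def top_words(weights, feature_names, num_words=3):
    """Return the num_words feature names with the largest weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in ranked[:num_words]]


# Made-up feature names and topic weights, for illustration only.
feature_names = ['ball', 'god', 'image', 'team', 'file']
components = [
    [0.1, 0.0, 0.9, 0.0, 0.7],
    [0.8, 0.1, 0.0, 0.6, 0.0],
]

for topic_idx, topic in enumerate(components):
    print("Topic #%d: %s" % (topic_idx, " ".join(top_words(topic, feature_names, 2))))
```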


--------------------------------------------------------------------------------
/nlp_pipeline_manager/__init__.py:
--------------------------------------------------------------------------------
1 | from .nlpipe import nlpipe
2 | from .nlp_preprocessor import nlp_preprocessor
3 | from .supervised_nlp import supervised_nlp
4 | from .topic_modeling_nlp import topic_modeling_nlp
5 | 
6 | __all__ = ['nlpipe','nlp_preprocessor','supervised_nlp','topic_modeling_nlp']


--------------------------------------------------------------------------------
/nlp_pipeline_manager/nlp_preprocessor.py:
--------------------------------------------------------------------------------
 1 | from sklearn.feature_extraction.text import CountVectorizer
 2 | from nlp_pipeline_manager.nlpipe import nlpipe
 3 | 
 4 | 
 5 | class nlp_preprocessor(nlpipe):
 6 | 
  7 |     def __init__(self, vectorizer=None, tokenizer=None, cleaning_function=None,
  8 |                  stemmer=None):
  9 |         """
 10 |         A class for pipelining our data in NLP problems. The user provides a series of 
 11 |         tools, and this class manages all of the training, transforming, and modification
 12 |         of the text data.
 13 |         ---
 14 |         Inputs:
 15 |         vectorizer: the model used to vectorize text; if None, a new CountVectorizer is created
 16 |         tokenizer: the tokenizer to use; if None, defaults to splitting on spaces
 17 |         cleaning_function: how to clean the data; if None, defaults to the built-in clean_text
 18 |         stemmer: a function that returns a stemmed version of a token. For NLTK, this
 19 |         means getting a stemmer class, then providing the stemming function underneath it.
 20 |         """
 21 |         if not tokenizer:
 22 |             tokenizer = self.splitter
 23 |         if not cleaning_function:
 24 |             cleaning_function = self.clean_text
 25 |         self.stemmer = stemmer
 26 |         self.tokenizer = tokenizer
 27 |         self.cleaning_function = cleaning_function
 28 |         self.vectorizer = vectorizer if vectorizer is not None else CountVectorizer()
29 |         self._is_fit = False
30 |         
31 |     def splitter(self, text):
32 |         """
33 |         Default tokenizer that splits on spaces naively
34 |         """
35 |         return text.split(' ')
36 |         
37 |     def clean_text(self, text, tokenizer, stemmer):
38 |         """
 39 |         A naive function to lowercase all words and clean them quickly.
40 |         This is the default behavior if no other cleaning function is specified
41 |         """
42 |         cleaned_text = []
43 |         for post in text:
44 |             cleaned_words = []
45 |             for word in tokenizer(post):
46 |                 low_word = word.lower()
47 |                 if stemmer:
48 |                     low_word = stemmer(low_word)
49 |                 cleaned_words.append(low_word)
50 |             cleaned_text.append(' '.join(cleaned_words))
51 |         return cleaned_text
52 |     
53 |     def fit(self, text):
54 |         """
55 |         Cleans the data and then fits the vectorizer with
56 |         the user provided text
57 |         """
58 |         clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
59 |         self.vectorizer.fit(clean_text)
60 |         self._is_fit = True
61 |         
62 |     def transform(self, text, return_clean_text=False):
63 |         """
64 |         Cleans any provided data and then transforms the data into
65 |         a vectorized format based on the fit function.
66 |         If return_clean_text is set to True, it returns the cleaned
67 |         form of the text. If it's set to False, it returns the
68 |         vectorized form of the data.
69 |         """
70 |         if not self._is_fit:
71 |             raise ValueError("Must fit the models before transforming!")
72 |         clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
73 |         if return_clean_text:
74 |             return clean_text
75 |         return self.vectorizer.transform(clean_text)
76 | 
77 | 
78 | if __name__ == '__main__':
79 |     from nltk.stem import PorterStemmer
80 | 
81 |     ps = PorterStemmer()
82 |     corpus = ['Testing the', 'class for', 'good behavior.']
83 |     nlp = nlp_preprocessor(stemmer=ps.stem)
84 |     nlp.fit(corpus)
85 |     print(nlp.transform(corpus,return_clean_text=True))
86 |     print(nlp.transform(corpus).toarray())


--------------------------------------------------------------------------------
/nlp_pipeline_manager/nlpipe.py:
--------------------------------------------------------------------------------
 1 | import pickle
 2 | 
 3 | 
 4 | class nlpipe:
 5 |     
 6 |     def __init__(self):
 7 |         """
 8 |         Empty parent class for nlp pipelines that contains
 9 |         shared file i/o that happens in every class.
10 |         """
11 |         pass
12 |     
 13 |     def save_pipe(self, filename):
 14 |         """
 15 |         Writes the attributes of the pipeline to filename + '.mdl' so the
 16 |         pipeline can be loaded later with the pre-trained pieces in place.
 17 |         """
 18 |         if not isinstance(filename, str):
 19 |             raise TypeError("filename must be a string")
 20 |         with open(filename + ".mdl", 'wb') as f:
 21 |             pickle.dump(self.__dict__, f)
 22 |         
 23 |     def load_pipe(self, filename):
 24 |         """
 25 |         Reads pipeline attributes from a file written by save_pipe and
 26 |         replaces this object's attributes with them.
 27 |         """
 28 |         if not isinstance(filename, str):
 29 |             raise TypeError("filename must be a string")
 30 |         if not filename.endswith('.mdl'):
 31 |             filename += '.mdl'
 32 |         with open(filename, 'rb') as f:
 33 |             self.__dict__ = pickle.load(f)


--------------------------------------------------------------------------------
/nlp_pipeline_manager/pipeline_demo.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {
  7 |     "ExecuteTime": {
  8 |      "end_time": "2018-08-16T13:58:45.891755Z",
  9 |      "start_time": "2018-08-16T13:58:45.456866Z"
 10 |     }
 11 |    },
 12 |    "outputs": [],
 13 |    "source": [
 14 |     "from nlp_preprocessor import nlp_preprocessor\n",
 15 |     "import pandas as pd"
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "markdown",
 20 |    "metadata": {},
 21 |    "source": [
 22 |     "# Testing the preprocessor class"
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "code",
 27 |    "execution_count": 3,
 28 |    "metadata": {
 29 |     "ExecuteTime": {
 30 |      "end_time": "2018-08-16T13:58:59.678473Z",
 31 |      "start_time": "2018-08-16T13:58:57.722641Z"
 32 |     }
 33 |    },
 34 |    "outputs": [
 35 |     {
 36 |      "data": {
 37 |       "text/html": [
 38 |        "<div>\n",
 39 |        "<style scoped>\n",
 40 |        "    .dataframe tbody tr th:only-of-type {\n",
 41 |        "        vertical-align: middle;\n",
 42 |        "    }\n",
 43 |        "\n",
 44 |        "    .dataframe tbody tr th {\n",
 45 |        "        vertical-align: top;\n",
 46 |        "    }\n",
 47 |        "\n",
 48 |        "    .dataframe thead th {\n",
 49 |        "        text-align: right;\n",
 50 |        "    }\n",
 51 |        "</style>\n",
 52 |        "<table border=\"1\" class=\"dataframe\">\n",
 53 |        "  <thead>\n",
 54 |        "    <tr style=\"text-align: right;\">\n",
 55 |        "      <th></th>\n",
 56 |        "      <th>bob</th>\n",
 57 |        "      <th>builder</th>\n",
 58 |        "      <th>cartoon</th>\n",
 59 |        "      <th>is</th>\n",
 60 |        "      <th>strange</th>\n",
 61 |        "      <th>the</th>\n",
 62 |        "      <th>thing</th>\n",
 63 |        "      <th>type</th>\n",
 64 |        "    </tr>\n",
 65 |        "  </thead>\n",
 66 |        "  <tbody>\n",
 67 |        "    <tr>\n",
 68 |        "      <th>0</th>\n",
 69 |        "      <td>1</td>\n",
 70 |        "      <td>1</td>\n",
 71 |        "      <td>0</td>\n",
 72 |        "      <td>0</td>\n",
 73 |        "      <td>0</td>\n",
 74 |        "      <td>1</td>\n",
 75 |        "      <td>0</td>\n",
 76 |        "      <td>0</td>\n",
 77 |        "    </tr>\n",
 78 |        "    <tr>\n",
 79 |        "      <th>1</th>\n",
 80 |        "      <td>0</td>\n",
 81 |        "      <td>0</td>\n",
 82 |        "      <td>0</td>\n",
 83 |        "      <td>1</td>\n",
 84 |        "      <td>1</td>\n",
 85 |        "      <td>0</td>\n",
 86 |        "      <td>0</td>\n",
 87 |        "      <td>0</td>\n",
 88 |        "    </tr>\n",
 89 |        "    <tr>\n",
 90 |        "      <th>2</th>\n",
 91 |        "      <td>0</td>\n",
 92 |        "      <td>0</td>\n",
 93 |        "      <td>1</td>\n",
 94 |        "      <td>0</td>\n",
 95 |        "      <td>0</td>\n",
 96 |        "      <td>0</td>\n",
 97 |        "      <td>1</td>\n",
 98 |        "      <td>1</td>\n",
 99 |        "    </tr>\n",
100 |        "  </tbody>\n",
101 |        "</table>\n",
102 |        "</div>"
103 |       ],
104 |       "text/plain": [
105 |        "   bob  builder  cartoon  is  strange  the  thing  type\n",
106 |        "0    1        1        0   0        0    1      0     0\n",
107 |        "1    0        0        0   1        1    0      0     0\n",
108 |        "2    0        0        1   0        0    0      1     1"
109 |       ]
110 |      },
111 |      "execution_count": 3,
112 |      "metadata": {},
113 |      "output_type": "execute_result"
114 |     }
115 |    ],
116 |    "source": [
117 |     "from nltk.stem import WordNetLemmatizer\n",
118 |     "\n",
119 |     "lemma = WordNetLemmatizer()\n",
120 |     "\n",
121 |     "corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']\n",
122 |     "nlp = nlp_preprocessor(stemmer = lemma.lemmatize)\n",
123 |     "nlp.fit(corpus)\n",
124 |     "pd.DataFrame(nlp.transform(corpus).toarray(), columns=nlp.vectorizer.get_feature_names())"
125 |    ]
126 |   },
127 |   {
128 |    "cell_type": "code",
129 |    "execution_count": 4,
130 |    "metadata": {
131 |     "ExecuteTime": {
132 |      "end_time": "2018-08-16T13:59:03.953921Z",
133 |      "start_time": "2018-08-16T13:59:03.948587Z"
134 |     }
135 |    },
136 |    "outputs": [
137 |     {
138 |      "data": {
139 |       "text/plain": [
140 |        "array([[1, 1, 0, 0, 0, 1, 0, 0],\n",
141 |        "       [0, 0, 0, 1, 1, 0, 0, 0],\n",
142 |        "       [0, 0, 1, 0, 0, 0, 1, 1]])"
143 |       ]
144 |      },
145 |      "execution_count": 4,
146 |      "metadata": {},
147 |      "output_type": "execute_result"
148 |     }
149 |    ],
150 |    "source": [
151 |     "nlp.transform(corpus).toarray()"
152 |    ]
153 |   },
154 |   {
155 |    "cell_type": "markdown",
156 |    "metadata": {},
157 |    "source": [
158 |     "# Testing the Supervised Learning Class"
159 |    ]
160 |   },
161 |   {
162 |    "cell_type": "code",
163 |    "execution_count": 5,
164 |    "metadata": {
165 |     "ExecuteTime": {
166 |      "end_time": "2018-08-16T13:59:06.378161Z",
167 |      "start_time": "2018-08-16T13:59:06.372315Z"
168 |     }
169 |    },
170 |    "outputs": [],
171 |    "source": [
172 |     "from supervised_nlp import supervised_nlp"
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "code",
177 |    "execution_count": 6,
178 |    "metadata": {
179 |     "ExecuteTime": {
180 |      "end_time": "2018-08-16T13:59:09.307480Z",
181 |      "start_time": "2018-08-16T13:59:06.929183Z"
182 |     }
183 |    },
184 |    "outputs": [],
185 |    "source": [
186 |     "from sklearn import datasets\n",
187 |     "\n",
188 |     "categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']\n",
189 |     "ng_train = datasets.fetch_20newsgroups(subset='train', \n",
190 |     "                                       categories=categories, \n",
191 |     "                                       remove=('headers', \n",
192 |     "                                               'footers', 'quotes'))\n",
193 |     "ng_train_data = ng_train.data\n",
194 |     "ng_train_targets = ng_train.target\n",
195 |     "\n",
196 |     "ng_test = datasets.fetch_20newsgroups(subset='test', \n",
197 |     "                                       categories=categories, \n",
198 |     "                                       remove=('headers', \n",
199 |     "                                               'footers', 'quotes'))\n",
200 |     "\n",
201 |     "ng_test_data = ng_test.data\n",
202 |     "ng_test_targets = ng_test.target"
203 |    ]
204 |   },
205 |   {
206 |    "cell_type": "code",
207 |    "execution_count": 7,
208 |    "metadata": {
209 |     "ExecuteTime": {
210 |      "end_time": "2018-08-16T13:59:13.076084Z",
211 |      "start_time": "2018-08-16T13:59:09.309465Z"
212 |     }
213 |    },
214 |    "outputs": [
215 |     {
216 |      "data": {
217 |       "text/plain": [
218 |        "0.9113122171945701"
219 |       ]
220 |      },
221 |      "execution_count": 7,
222 |      "metadata": {},
223 |      "output_type": "execute_result"
224 |     }
225 |    ],
226 |    "source": [
227 |     "from sklearn.naive_bayes import MultinomialNB\n",
228 |     "\n",
229 |     "nlp_pipe = supervised_nlp(MultinomialNB(), nlp)\n",
230 |     "nlp_pipe.fit(ng_train_data, ng_train_targets)\n",
231 |     "nlp_pipe.score(ng_test_data, ng_test_targets)"
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "markdown",
236 |    "metadata": {},
237 |    "source": [
238 |     "# Testing the Topic Modeling Class"
239 |    ]
240 |   },
241 |   {
242 |    "cell_type": "code",
243 |    "execution_count": 8,
244 |    "metadata": {
245 |     "ExecuteTime": {
246 |      "end_time": "2018-08-16T13:59:38.430684Z",
247 |      "start_time": "2018-08-16T13:59:35.472286Z"
248 |     }
249 |    },
250 |    "outputs": [
251 |     {
252 |      "name": "stdout",
253 |      "output_type": "stream",
254 |      "text": [
255 |       "Topic #0: image jpeg file edu gif format color data pub ftp\n",
256 |       "Topic #1: edu pub data graphics mail ftp ray send graphic com\n",
257 |       "Topic #2: jesus god wa atheist matthew people ha atheism christian prophecy\n",
258 |       "Topic #3: image data processing tool analysis software user available using sun\n",
259 |       "Topic #4: jesus matthew prophecy wa psalm messiah day isaiah david prophet\n",
260 |       "Topic #5: argument fallacy conclusion premise example true argumentum ad false valid\n",
261 |       "Topic #6: data available ftp grass sgi vertex package model pci motecc\n",
262 |       "Topic #7: wa game year team hit run don think good win\n",
263 |       "Topic #8: posting response god subject typical information universe einstein wa bush\n",
264 |       "Topic #9: den radius double theta sqrt pi sin rtheta pt pole\n",
265 |       "Topic #10: program read think menu don bit change file want pressing\n",
266 |       "Topic #11: program menu file read display game pressing change bit home\n",
267 |       "Topic #12: won lost idle new york sox year san american chicago\n",
268 |       "Topic #13: atheism alt faq send edu usenet news article answers newsgroup\n",
269 |       "Topic #14: col int row value cub char imagewidth imageheight suck atheism\n"
270 |      ]
271 |     }
272 |    ],
273 |    "source": [
274 |     "from sklearn.decomposition import TruncatedSVD\n",
275 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
276 |     "\n",
277 |     "from topic_modeling_nlp import topic_modeling_nlp\n",
278 |     "\n",
279 |     "\n",
280 |     "cv = CountVectorizer(stop_words='english', token_pattern='\\\\b[a-z][a-z]+\\\\b')\n",
281 |     "cleaning_pipe = nlp_preprocessor(vectorizer=cv, stemmer=lemma.lemmatize)\n",
282 |     "topic_chain = topic_modeling_nlp(TruncatedSVD(n_components=15), preprocessing_pipeline=cleaning_pipe)\n",
283 |     "\n",
284 |     "topic_chain.fit(ng_train_data)\n",
285 |     "topic_chain.print_topics()"
286 |    ]
287 |   },
288 |   {
289 |    "cell_type": "markdown",
290 |    "metadata": {},
291 |    "source": [
292 |     "# Testing Saving and Loading a Pipeline"
293 |    ]
294 |   },
295 |   {
296 |    "cell_type": "code",
297 |    "execution_count": 9,
298 |    "metadata": {
299 |     "ExecuteTime": {
300 |      "end_time": "2018-08-16T13:59:48.733671Z",
301 |      "start_time": "2018-08-16T13:59:48.644948Z"
302 |     }
303 |    },
304 |    "outputs": [
305 |     {
306 |      "data": {
307 |       "text/plain": [
308 |        "method"
309 |       ]
310 |      },
311 |      "execution_count": 9,
312 |      "metadata": {},
313 |      "output_type": "execute_result"
314 |     }
315 |    ],
316 |    "source": [
317 |     "from nltk.stem import PorterStemmer\n",
318 |     "\n",
319 |     "nlp = nlp_preprocessor(stemmer=PorterStemmer().stem)\n",
320 |     "nlp.save_pipe('save_pipeline')\n",
321 |     "type(nlp.stemmer)"
322 |    ]
323 |   },
324 |   {
325 |    "cell_type": "code",
326 |    "execution_count": 10,
327 |    "metadata": {
328 |     "ExecuteTime": {
329 |      "end_time": "2018-08-16T13:59:53.293256Z",
330 |      "start_time": "2018-08-16T13:59:53.287874Z"
331 |     }
332 |    },
333 |    "outputs": [
334 |     {
335 |      "data": {
336 |       "text/plain": [
337 |        "NoneType"
338 |       ]
339 |      },
340 |      "execution_count": 10,
341 |      "metadata": {},
342 |      "output_type": "execute_result"
343 |     }
344 |    ],
345 |    "source": [
346 |     "nlp2 = nlp_preprocessor()\n",
347 |     "type(nlp2.stemmer)"
348 |    ]
349 |   },
350 |   {
351 |    "cell_type": "code",
352 |    "execution_count": 11,
353 |    "metadata": {
354 |     "ExecuteTime": {
355 |      "end_time": "2018-08-16T13:59:56.710006Z",
356 |      "start_time": "2018-08-16T13:59:56.688327Z"
357 |     }
358 |    },
359 |    "outputs": [
360 |     {
361 |      "data": {
362 |       "text/plain": [
363 |        "method"
364 |       ]
365 |      },
366 |      "execution_count": 11,
367 |      "metadata": {},
368 |      "output_type": "execute_result"
369 |     }
370 |    ],
371 |    "source": [
372 |     "nlp2.load_pipe('save_pipeline')\n",
373 |     "type(nlp2.stemmer)"
374 |    ]
375 |   },
376 |   {
377 |    "cell_type": "code",
378 |    "execution_count": null,
379 |    "metadata": {},
380 |    "outputs": [],
381 |    "source": []
382 |   }
383 |  ],
384 |  "metadata": {
385 |   "kernelspec": {
386 |    "display_name": "Python [default]",
387 |    "language": "python",
388 |    "name": "python3"
389 |   },
390 |   "language_info": {
391 |    "codemirror_mode": {
392 |     "name": "ipython",
393 |     "version": 3
394 |    },
395 |    "file_extension": ".py",
396 |    "mimetype": "text/x-python",
397 |    "name": "python",
398 |    "nbconvert_exporter": "python",
399 |    "pygments_lexer": "ipython3",
400 |    "version": "3.6.6"
401 |   },
402 |   "toc": {
403 |    "nav_menu": {},
404 |    "number_sections": true,
405 |    "sideBar": true,
406 |    "skip_h1_title": false,
407 |    "toc_cell": false,
408 |    "toc_position": {},
409 |    "toc_section_display": "block",
410 |    "toc_window_display": false
411 |   }
412 |  },
413 |  "nbformat": 4,
414 |  "nbformat_minor": 2
415 | }
416 | 


--------------------------------------------------------------------------------
/nlp_pipeline_manager/save_pipeline.mdl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZWMiller/nlp_pipe_manager/5a73ed4c7afa345573bec38a59169f445416c3cf/nlp_pipeline_manager/save_pipeline.mdl


--------------------------------------------------------------------------------
/nlp_pipeline_manager/supervised_nlp.py:
--------------------------------------------------------------------------------
 1 | from nlp_pipeline_manager.nlp_preprocessor import nlp_preprocessor
 2 | from nlp_pipeline_manager.nlpipe import nlpipe
 3 | 
 4 | 
 5 | class supervised_nlp(nlpipe):
 6 |     
 7 |     def __init__(self, model, preprocessing_pipeline=None):
 8 |         """
 9 |         A pipeline for doing supervised nlp. Expects a model and creates
10 |         a preprocessing pipeline if one isn't provided.
11 |         """
12 |         self.model = model
13 |         self._is_fit = False
14 |         if not preprocessing_pipeline:
15 |             self.preprocessor = nlp_preprocessor()
16 |         else:
17 |             self.preprocessor = preprocessing_pipeline
18 |         
19 |     def fit(self, X, y):
20 |         """
21 |         Trains the vectorizer and model together using the 
 22 |         user's input training data.
23 |         """
24 |         self.preprocessor.fit(X)
25 |         train_data = self.preprocessor.transform(X)
26 |         self.model.fit(train_data, y)
27 |         self._is_fit = True
28 |     
29 |     def predict(self, X):
30 |         """
31 |         Makes a prediction on the data provided by the users using the 
32 |         preprocessing pipeline and provided model.
33 |         """
34 |         if not self._is_fit:
 35 |             raise ValueError("Must fit the models before predicting!")
36 |         test_data = self.preprocessor.transform(X)
37 |         preds = self.model.predict(test_data)
38 |         return preds
39 |     
40 |     def score(self, X, y):
41 |         """
42 |         Returns the accuracy for the model after using the trained
43 |         preprocessing pipeline to prepare the data.
44 |         """
45 |         test_data = self.preprocessor.transform(X)
46 |         return self.model.score(test_data, y)
47 | 
48 | 
49 | if __name__ == "__main__":
50 |     from sklearn import datasets
51 | 
52 |     categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']
53 |     ng_train = datasets.fetch_20newsgroups(subset='train',
54 |                                            categories=categories,
55 |                                            remove=('headers',
56 |                                                    'footers', 'quotes'))
57 |     ng_train_data = ng_train.data
58 |     ng_train_targets = ng_train.target
59 | 
60 |     ng_test = datasets.fetch_20newsgroups(subset='test',
61 |                                           categories=categories,
62 |                                           remove=('headers',
63 |                                                   'footers', 'quotes'))
64 | 
65 |     ng_test_data = ng_test.data
66 |     ng_test_targets = ng_test.target
67 | 
68 |     from sklearn.naive_bayes import MultinomialNB
69 | 
70 |     nlp_pipe = supervised_nlp(MultinomialNB())
71 |     nlp_pipe.fit(ng_train_data, ng_train_targets)
72 |     print("Accuracy: ", nlp_pipe.score(ng_test_data, ng_test_targets))


--------------------------------------------------------------------------------
/nlp_pipeline_manager/topic_modeling_nlp.py:
--------------------------------------------------------------------------------
 1 | from nlp_pipeline_manager.nlp_preprocessor import nlp_preprocessor
 2 | from nlp_pipeline_manager.nlpipe import nlpipe
 3 | 
 4 | class topic_modeling_nlp(nlpipe):
 5 |     
 6 |     def __init__(self, model, preprocessing_pipeline=None):
 7 |         """
  8 |         A pipeline for doing topic modeling. Expects a model and creates
  9 |         a preprocessing pipeline if one isn't provided.
10 |         """
11 |         self.model = model
12 |         self._is_fit = False
13 |         if not preprocessing_pipeline:
14 |             self.preprocessor = nlp_preprocessor()
15 |         else:
16 |             self.preprocessor = preprocessing_pipeline
17 |         
18 |     def fit(self, X):
19 |         """
20 |         Trains the vectorizer and model together using the 
 21 |         user's input training data.
22 |         """
23 |         self.preprocessor.fit(X)
24 |         train_data = self.preprocessor.transform(X)
25 |         self.model.fit(train_data)
26 |         self._is_fit = True
27 |     
28 |     def transform(self, X):
29 |         """
 30 |         Transforms the data provided by the user into topic space using the 
 31 |         preprocessing pipeline and provided model.
32 |         """
33 |         if not self._is_fit:
34 |             raise ValueError("Must fit the models before transforming!")
35 |         test_data = self.preprocessor.transform(X)
36 |         preds = self.model.transform(test_data)
37 |         return preds
38 |     
39 |     def print_topics(self, num_words=10):
40 |         """
41 |         A function to print out the top words for each topic
42 |         """
43 |         feat_names = self.preprocessor.vectorizer.get_feature_names()
44 |         for topic_idx, topic in enumerate(self.model.components_):
45 |             message = "Topic #%d: " % topic_idx
46 |             message += " ".join([feat_names[i]
47 |                                  for i in topic.argsort()[:-num_words - 1:-1]])
48 |             print(message)
49 | 
50 | 
51 | 
52 | if __name__ == "__main__":
53 |     from sklearn import datasets
54 | 
55 |     categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']
56 |     ng_train = datasets.fetch_20newsgroups(subset='train',
57 |                                            categories=categories,
58 |                                            remove=('headers',
59 |                                                    'footers', 'quotes'))
60 |     ng_train_data = ng_train.data
61 |     ng_train_targets = ng_train.target
62 | 
63 |     ng_test = datasets.fetch_20newsgroups(subset='test',
64 |                                           categories=categories,
65 |                                           remove=('headers',
66 |                                                   'footers', 'quotes'))
67 |     from sklearn.decomposition import TruncatedSVD
68 |     from sklearn.feature_extraction.text import CountVectorizer
69 | 
70 |     # topic_modeling_nlp is defined above in this module, so no self-import is needed.
71 | 
72 | 
73 |     cv = CountVectorizer(stop_words='english', token_pattern='\\b[a-z][a-z]+\\b')
74 |     cleaning_pipe = nlp_preprocessor(vectorizer=cv)
75 |     topic_chain = topic_modeling_nlp(TruncatedSVD(n_components=15), preprocessing_pipeline=cleaning_pipe)
76 | 
77 |     topic_chain.fit(ng_train_data)
78 |     topic_chain.print_topics()


--------------------------------------------------------------------------------
/prototyping/building_an_nlp_pipeline_class.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Building a class to manage our NLP pipelines"
   8 |    ]
   9 |   },
  10 |   {
  11 |    "cell_type": "markdown",
  12 |    "metadata": {},
  13 |    "source": [
  14 |     "Because it's such a pain to manage all the permutations of NLP cleaners/tokenizers/vectorizers/stemmers/etc, we're going to build a class that takes all of those pieces in and manages the pipelines for us."
  15 |    ]
  16 |   },
  17 |   {
  18 |    "cell_type": "code",
  19 |    "execution_count": 1,
  20 |    "metadata": {
  21 |     "ExecuteTime": {
  22 |      "end_time": "2018-08-12T19:36:53.819932Z",
  23 |      "start_time": "2018-08-12T19:36:52.882108Z"
  24 |     }
  25 |    },
  26 |    "outputs": [
  27 |     {
  28 |      "name": "stdout",
  29 |      "output_type": "stream",
  30 |      "text": [
  31 |       "Python Version: 3.6.6 |Anaconda custom (64-bit)| (default, Jun 28 2018, 11:07:29) \n",
  32 |       "[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] \n",
  33 |       "\n",
  34 |       "Matplotlib Version: 2.2.2\n",
  35 |       "Numpy Version: 1.15.0\n",
  36 |       "Pandas Version: 0.23.3\n",
  37 |       "NLTK Version: 3.3\n",
  38 |       "sklearn Version: 0.19.1\n"
  39 |      ]
  40 |     }
  41 |    ],
  42 |    "source": [
  43 |     "import numpy as np\n",
  44 |     "import sklearn\n",
  45 |     "import matplotlib\n",
  46 |     "import pandas as pd\n",
  47 |     "import sklearn\n",
  48 |     "import sys\n",
  49 |     "import nltk\n",
  50 |     "\n",
  51 |     "libraries = (('Matplotlib', matplotlib), ('Numpy', np), ('Pandas', pd), ('NLTK', nltk), ('sklearn',sklearn))\n",
  52 |     "\n",
  53 |     "print(\"Python Version:\", sys.version, '\\n')\n",
  54 |     "for lib in libraries:\n",
  55 |     "    print('{0} Version: {1}'.format(lib[0], lib[1].__version__))"
  56 |    ]
  57 |   },
  58 |   {
  59 |    "cell_type": "code",
  60 |    "execution_count": 2,
  61 |    "metadata": {
  62 |     "ExecuteTime": {
  63 |      "end_time": "2018-08-12T19:36:53.986272Z",
  64 |      "start_time": "2018-08-12T19:36:53.822664Z"
  65 |     }
  66 |    },
  67 |    "outputs": [],
  68 |    "source": [
  69 |     "import numpy as np\n",
  70 |     "import random\n",
  71 |     "import matplotlib.pyplot as plt\n",
  72 |     "import pandas as pd\n",
  73 |     "import math\n",
  74 |     "import scipy\n",
  75 |     "%matplotlib inline\n",
  76 |     "plt.style.use('seaborn')"
  77 |    ]
  78 |   },
  79 |   {
  80 |    "cell_type": "code",
  81 |    "execution_count": 3,
  82 |    "metadata": {
  83 |     "ExecuteTime": {
  84 |      "end_time": "2018-08-12T19:36:53.999203Z",
  85 |      "start_time": "2018-08-12T19:36:53.988652Z"
  86 |     }
  87 |    },
  88 |    "outputs": [],
  89 |    "source": [
  90 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
  91 |     "import pickle\n",
  92 |     "\n",
  93 |     "class nlp_preprocessor:\n",
  94 |     "   \n",
  95 |     "    def __init__(self, vectorizer=CountVectorizer(), tokenizer=None, cleaning_function=None, \n",
  96 |     "                 stemmer=None, model=None):\n",
  97 |     "        \"\"\"\n",
  98 |     "        A class for pipelining our data in NLP problems. The user provides a series of \n",
  99 |     "        tools, and this class manages all of the training, transforming, and modification\n",
 100 |     "        of the text data.\n",
 101 |     "        ---\n",
 102 |     "        Inputs:\n",
 103 |     "        vectorizer: the model to use for vectorization of text data\n",
 104 |     "        tokenizer: the tokenizer to use; if None, defaults to splitting on spaces\n",
 105 |     "        cleaning_function: how to clean the data; if None, defaults to the built-in clean_text method\n",
 106 |     "        \"\"\"\n",
 107 |     "        if not tokenizer:\n",
 108 |     "            tokenizer = self.splitter\n",
 109 |     "        if not cleaning_function:\n",
 110 |     "            cleaning_function = self.clean_text\n",
 111 |     "        self.stemmer = stemmer\n",
 112 |     "        self.tokenizer = tokenizer\n",
 113 |     "        self.model = model\n",
 114 |     "        self.cleaning_function = cleaning_function\n",
 115 |     "        self.vectorizer = vectorizer\n",
 116 |     "        self._is_fit = False\n",
 117 |     "        \n",
 118 |     "    def splitter(self, text):\n",
 119 |     "        \"\"\"\n",
 120 |     "        Default tokenizer that splits on spaces naively\n",
 121 |     "        \"\"\"\n",
 122 |     "        return text.split(' ')\n",
 123 |     "        \n",
 124 |     "    def clean_text(self, text, tokenizer, stemmer):\n",
 125 |     "        \"\"\"\n",
 126 |     "        A naive function to lowercase all words and clean them quickly.\n",
 127 |     "        This is the default behavior if no other cleaning function is specified\n",
 128 |     "        \"\"\"\n",
 129 |     "        cleaned_text = []\n",
 130 |     "        for post in text:\n",
 131 |     "            cleaned_words = []\n",
 132 |     "            for word in tokenizer(post):\n",
 133 |     "                low_word = word.lower()\n",
 134 |     "                if stemmer:\n",
 135 |     "                    low_word = stemmer.stem(low_word)\n",
 136 |     "                cleaned_words.append(low_word)\n",
 137 |     "            cleaned_text.append(' '.join(cleaned_words))\n",
 138 |     "        return cleaned_text\n",
 139 |     "    \n",
 140 |     "    def fit(self, text):\n",
 141 |     "        \"\"\"\n",
 142 |     "        Cleans the data and then fits the vectorizer with\n",
 143 |     "        the user provided text\n",
 144 |     "        \"\"\"\n",
 145 |     "        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)\n",
 146 |     "        self.vectorizer.fit(clean_text)\n",
 147 |     "        self._is_fit = True\n",
 148 |     "        \n",
 149 |     "    def transform(self, text):\n",
 150 |     "        \"\"\"\n",
 151 |     "        Cleans any provided data and then transforms the data into\n",
 152 |     "        a vectorized format based on the fit function. Returns the\n",
 153 |     "        vectorized form of the data.\n",
 154 |     "        \"\"\"\n",
 155 |     "        if not self._is_fit:\n",
 156 |     "            raise ValueError(\"Must fit the models before transforming!\")\n",
 157 |     "        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)\n",
 158 |     "        return self.vectorizer.transform(clean_text)\n",
 159 |     "    \n",
 160 |     "    def save_pipe(self, filename):\n",
 161 |     "        \"\"\"\n",
 162 |     "        Writes the attributes of the pipeline to a file\n",
 163 |     "        allowing a pipeline to be loaded later with the\n",
 164 |     "        pre-trained pieces in place.\n",
 165 |     "        \"\"\"\n",
 166 |     "        if type(filename) != str:\n",
 167 |     "            raise TypeError(\"filename must be a string\")\n",
 168 |     "        pickle.dump(self.__dict__, open(filename+\".mdl\",'wb'))\n",
 169 |     "        \n",
 170 |     "    def load_pipe(self, filename):\n",
 171 |     "        \"\"\"\n",
 172 |     "        Loads the attributes of a pipeline from a file,\n",
 173 |     "        restoring a previously saved pipeline with the\n",
 174 |     "        pre-trained pieces in place.\n",
 175 |     "        \"\"\"\n",
 176 |     "        if type(filename) != str:\n",
 177 |     "            raise TypeError(\"filename must be a string\")\n",
 178 |     "        if filename[-4:] != '.mdl':\n",
 179 |     "            filename += '.mdl'\n",
 180 |     "        self.__dict__ = pickle.load(open(filename,'rb'))"
 181 |    ]
 182 |   },
 183 |   {
 184 |    "cell_type": "markdown",
 185 |    "metadata": {},
 186 |    "source": [
 187 |     "## Now let's test the model with the defaults"
 188 |    ]
 189 |   },
 190 |   {
 191 |    "cell_type": "code",
 192 |    "execution_count": 4,
 193 |    "metadata": {
 194 |     "ExecuteTime": {
 195 |      "end_time": "2018-08-12T19:36:54.006636Z",
 196 |      "start_time": "2018-08-12T19:36:54.001864Z"
 197 |     }
 198 |    },
 199 |    "outputs": [],
 200 |    "source": [
 201 |     "corpus = ['BOB the builder', 'is a strange', 'caRtoon type thing']"
 202 |    ]
 203 |   },
 204 |   {
 205 |    "cell_type": "code",
 206 |    "execution_count": 5,
 207 |    "metadata": {
 208 |     "ExecuteTime": {
 209 |      "end_time": "2018-08-12T19:36:54.015526Z",
 210 |      "start_time": "2018-08-12T19:36:54.011931Z"
 211 |     }
 212 |    },
 213 |    "outputs": [],
 214 |    "source": [
 215 |     "nlp = nlp_preprocessor()"
 216 |    ]
 217 |   },
 218 |   {
 219 |    "cell_type": "code",
 220 |    "execution_count": 6,
 221 |    "metadata": {
 222 |     "ExecuteTime": {
 223 |      "end_time": "2018-08-12T19:36:54.027433Z",
 224 |      "start_time": "2018-08-12T19:36:54.018577Z"
 225 |     }
 226 |    },
 227 |    "outputs": [],
 228 |    "source": [
 229 |     "nlp.fit(corpus)"
 230 |    ]
 231 |   },
 232 |   {
 233 |    "cell_type": "code",
 234 |    "execution_count": 7,
 235 |    "metadata": {
 236 |     "ExecuteTime": {
 237 |      "end_time": "2018-08-12T19:36:54.036505Z",
 238 |      "start_time": "2018-08-12T19:36:54.029831Z"
 239 |     }
 240 |    },
 241 |    "outputs": [
 242 |     {
 243 |      "data": {
 244 |       "text/plain": [
 245 |        "array([[1, 1, 0, 0, 0, 1, 0, 0],\n",
 246 |        "       [0, 0, 0, 1, 1, 0, 0, 0],\n",
 247 |        "       [0, 0, 1, 0, 0, 0, 1, 1]])"
 248 |       ]
 249 |      },
 250 |      "execution_count": 7,
 251 |      "metadata": {},
 252 |      "output_type": "execute_result"
 253 |     }
 254 |    ],
 255 |    "source": [
 256 |     "nlp.transform(corpus).toarray()"
 257 |    ]
 258 |   },
 259 |   {
 260 |    "cell_type": "code",
 261 |    "execution_count": 8,
 262 |    "metadata": {
 263 |     "ExecuteTime": {
 264 |      "end_time": "2018-08-12T19:36:54.141882Z",
 265 |      "start_time": "2018-08-12T19:36:54.039121Z"
 266 |     }
 267 |    },
 268 |    "outputs": [
 269 |     {
 270 |      "data": {
 271 |       "text/html": [
 272 |        "<div>\n",
 273 |        "<style scoped>\n",
 274 |        "    .dataframe tbody tr th:only-of-type {\n",
 275 |        "        vertical-align: middle;\n",
 276 |        "    }\n",
 277 |        "\n",
 278 |        "    .dataframe tbody tr th {\n",
 279 |        "        vertical-align: top;\n",
 280 |        "    }\n",
 281 |        "\n",
 282 |        "    .dataframe thead th {\n",
 283 |        "        text-align: right;\n",
 284 |        "    }\n",
 285 |        "</style>\n",
 286 |        "<table border=\"1\" class=\"dataframe\">\n",
 287 |        "  <thead>\n",
 288 |        "    <tr style=\"text-align: right;\">\n",
 289 |        "      <th></th>\n",
 290 |        "      <th>bob</th>\n",
 291 |        "      <th>builder</th>\n",
 292 |        "      <th>cartoon</th>\n",
 293 |        "      <th>is</th>\n",
 294 |        "      <th>strange</th>\n",
 295 |        "      <th>the</th>\n",
 296 |        "      <th>thing</th>\n",
 297 |        "      <th>type</th>\n",
 298 |        "    </tr>\n",
 299 |        "  </thead>\n",
 300 |        "  <tbody>\n",
 301 |        "    <tr>\n",
 302 |        "      <th>0</th>\n",
 303 |        "      <td>1</td>\n",
 304 |        "      <td>1</td>\n",
 305 |        "      <td>0</td>\n",
 306 |        "      <td>0</td>\n",
 307 |        "      <td>0</td>\n",
 308 |        "      <td>1</td>\n",
 309 |        "      <td>0</td>\n",
 310 |        "      <td>0</td>\n",
 311 |        "    </tr>\n",
 312 |        "    <tr>\n",
 313 |        "      <th>1</th>\n",
 314 |        "      <td>0</td>\n",
 315 |        "      <td>0</td>\n",
 316 |        "      <td>0</td>\n",
 317 |        "      <td>1</td>\n",
 318 |        "      <td>1</td>\n",
 319 |        "      <td>0</td>\n",
 320 |        "      <td>0</td>\n",
 321 |        "      <td>0</td>\n",
 322 |        "    </tr>\n",
 323 |        "    <tr>\n",
 324 |        "      <th>2</th>\n",
 325 |        "      <td>0</td>\n",
 326 |        "      <td>0</td>\n",
 327 |        "      <td>1</td>\n",
 328 |        "      <td>0</td>\n",
 329 |        "      <td>0</td>\n",
 330 |        "      <td>0</td>\n",
 331 |        "      <td>1</td>\n",
 332 |        "      <td>1</td>\n",
 333 |        "    </tr>\n",
 334 |        "  </tbody>\n",
 335 |        "</table>\n",
 336 |        "</div>"
 337 |       ],
 338 |       "text/plain": [
 339 |        "   bob  builder  cartoon  is  strange  the  thing  type\n",
 340 |        "0    1        1        0   0        0    1      0     0\n",
 341 |        "1    0        0        0   1        1    0      0     0\n",
 342 |        "2    0        0        1   0        0    0      1     1"
 343 |       ]
 344 |      },
 345 |      "execution_count": 8,
 346 |      "metadata": {},
 347 |      "output_type": "execute_result"
 348 |     }
 349 |    ],
 350 |    "source": [
 351 |     "pd.DataFrame(nlp.transform(corpus).toarray(), columns=nlp.vectorizer.get_feature_names())"
 352 |    ]
 353 |   },
 354 |   {
 355 |    "cell_type": "markdown",
 356 |    "metadata": {},
 357 |    "source": [
 358 |     "## What if we want to swap pieces in?"
 359 |    ]
 360 |   },
 361 |   {
 362 |    "cell_type": "code",
 363 |    "execution_count": 9,
 364 |    "metadata": {
 365 |     "ExecuteTime": {
 366 |      "end_time": "2018-08-12T19:36:54.151700Z",
 367 |      "start_time": "2018-08-12T19:36:54.144563Z"
 368 |     }
 369 |    },
 370 |    "outputs": [],
 371 |    "source": [
 372 |     "def new_clean_text(text, tokenizer, stemmer):\n",
 373 |     "    \"\"\"\n",
 374 |     "    A variant of the default cleaner that lowercases all words and drops\n",
 375 |     "    the word 'builder', to show how a custom cleaning function is swapped in.\n",
 376 |     "    \"\"\"\n",
 377 |     "    cleaned_text = []\n",
 378 |     "    for post in text:\n",
 379 |     "        cleaned_words = []\n",
 380 |     "        for word in tokenizer(post):\n",
 381 |     "            low_word = word.lower()\n",
 382 |     "            if low_word in ['builder']: # remove the word builder\n",
 383 |     "                continue\n",
 384 |     "            if stemmer:\n",
 385 |     "                low_word = stemmer.stem(low_word)\n",
 386 |     "            cleaned_words.append(low_word)\n",
 387 |     "        cleaned_text.append(' '.join(cleaned_words))\n",
 388 |     "    return cleaned_text"
 389 |    ]
 390 |   },
 391 |   {
 392 |    "cell_type": "code",
 393 |    "execution_count": 10,
 394 |    "metadata": {
 395 |     "ExecuteTime": {
 396 |      "end_time": "2018-08-12T19:36:54.161583Z",
 397 |      "start_time": "2018-08-12T19:36:54.156314Z"
 398 |     }
 399 |    },
 400 |    "outputs": [],
 401 |    "source": [
 402 |     "from nltk.stem import PorterStemmer\n",
 403 |     "\n",
 404 |     "nlp2 = nlp_preprocessor(cleaning_function=new_clean_text, vectorizer=CountVectorizer(lowercase=False), \n",
 405 |     "                        stemmer=PorterStemmer())"
 406 |    ]
 407 |   },
 408 |   {
 409 |    "cell_type": "code",
 410 |    "execution_count": 11,
 411 |    "metadata": {
 412 |     "ExecuteTime": {
 413 |      "end_time": "2018-08-12T19:36:54.174080Z",
 414 |      "start_time": "2018-08-12T19:36:54.164305Z"
 415 |     }
 416 |    },
 417 |    "outputs": [
 418 |     {
 419 |      "data": {
 420 |       "text/plain": [
 421 |        "['bob', 'cartoon', 'is', 'strang', 'the', 'thing', 'type']"
 422 |       ]
 423 |      },
 424 |      "execution_count": 11,
 425 |      "metadata": {},
 426 |      "output_type": "execute_result"
 427 |     }
 428 |    ],
 429 |    "source": [
 430 |     "nlp2.fit(corpus)\n",
 431 |     "nlp2.vectorizer.get_feature_names()"
 432 |    ]
 433 |   },
 434 |   {
 435 |    "cell_type": "code",
 436 |    "execution_count": 12,
 437 |    "metadata": {
 438 |     "ExecuteTime": {
 439 |      "end_time": "2018-08-12T19:36:54.188085Z",
 440 |      "start_time": "2018-08-12T19:36:54.176340Z"
 441 |     }
 442 |    },
 443 |    "outputs": [
 444 |     {
 445 |      "data": {
 446 |       "text/html": [
 447 |        "<div>\n",
 448 |        "<style scoped>\n",
 449 |        "    .dataframe tbody tr th:only-of-type {\n",
 450 |        "        vertical-align: middle;\n",
 451 |        "    }\n",
 452 |        "\n",
 453 |        "    .dataframe tbody tr th {\n",
 454 |        "        vertical-align: top;\n",
 455 |        "    }\n",
 456 |        "\n",
 457 |        "    .dataframe thead th {\n",
 458 |        "        text-align: right;\n",
 459 |        "    }\n",
 460 |        "</style>\n",
 461 |        "<table border=\"1\" class=\"dataframe\">\n",
 462 |        "  <thead>\n",
 463 |        "    <tr style=\"text-align: right;\">\n",
 464 |        "      <th></th>\n",
 465 |        "      <th>bob</th>\n",
 466 |        "      <th>cartoon</th>\n",
 467 |        "      <th>is</th>\n",
 468 |        "      <th>strang</th>\n",
 469 |        "      <th>the</th>\n",
 470 |        "      <th>thing</th>\n",
 471 |        "      <th>type</th>\n",
 472 |        "    </tr>\n",
 473 |        "  </thead>\n",
 474 |        "  <tbody>\n",
 475 |        "    <tr>\n",
 476 |        "      <th>0</th>\n",
 477 |        "      <td>1</td>\n",
 478 |        "      <td>0</td>\n",
 479 |        "      <td>0</td>\n",
 480 |        "      <td>0</td>\n",
 481 |        "      <td>1</td>\n",
 482 |        "      <td>0</td>\n",
 483 |        "      <td>0</td>\n",
 484 |        "    </tr>\n",
 485 |        "    <tr>\n",
 486 |        "      <th>1</th>\n",
 487 |        "      <td>0</td>\n",
 488 |        "      <td>0</td>\n",
 489 |        "      <td>1</td>\n",
 490 |        "      <td>1</td>\n",
 491 |        "      <td>0</td>\n",
 492 |        "      <td>0</td>\n",
 493 |        "      <td>0</td>\n",
 494 |        "    </tr>\n",
 495 |        "    <tr>\n",
 496 |        "      <th>2</th>\n",
 497 |        "      <td>0</td>\n",
 498 |        "      <td>1</td>\n",
 499 |        "      <td>0</td>\n",
 500 |        "      <td>0</td>\n",
 501 |        "      <td>0</td>\n",
 502 |        "      <td>1</td>\n",
 503 |        "      <td>1</td>\n",
 504 |        "    </tr>\n",
 505 |        "  </tbody>\n",
 506 |        "</table>\n",
 507 |        "</div>"
 508 |       ],
 509 |       "text/plain": [
 510 |        "   bob  cartoon  is  strang  the  thing  type\n",
 511 |        "0    1        0   0       0    1      0     0\n",
 512 |        "1    0        0   1       1    0      0     0\n",
 513 |        "2    0        1   0       0    0      1     1"
 514 |       ]
 515 |      },
 516 |      "execution_count": 12,
 517 |      "metadata": {},
 518 |      "output_type": "execute_result"
 519 |     }
 520 |    ],
 521 |    "source": [
 522 |     "pd.DataFrame(nlp2.transform(corpus).toarray(), columns=nlp2.vectorizer.get_feature_names())"
 523 |    ]
 524 |   },
 525 |   {
 526 |    "cell_type": "markdown",
 527 |    "metadata": {},
 528 |    "source": [
 529 |     "## What about using TF-IDF instead?"
 530 |    ]
 531 |   },
 532 |   {
 533 |    "cell_type": "code",
 534 |    "execution_count": 13,
 535 |    "metadata": {
 536 |     "ExecuteTime": {
 537 |      "end_time": "2018-08-12T19:36:54.194725Z",
 538 |      "start_time": "2018-08-12T19:36:54.190607Z"
 539 |     }
 540 |    },
 541 |    "outputs": [],
 542 |    "source": [
 543 |     "from sklearn.feature_extraction.text import TfidfVectorizer\n",
 544 |     "\n",
 545 |     "nlp3 = nlp_preprocessor(cleaning_function=new_clean_text, vectorizer=TfidfVectorizer(lowercase=False))"
 546 |    ]
 547 |   },
 548 |   {
 549 |    "cell_type": "code",
 550 |    "execution_count": 14,
 551 |    "metadata": {
 552 |     "ExecuteTime": {
 553 |      "end_time": "2018-08-12T19:36:54.214380Z",
 554 |      "start_time": "2018-08-12T19:36:54.197369Z"
 555 |     }
 556 |    },
 557 |    "outputs": [
 558 |     {
 559 |      "data": {
 560 |       "text/html": [
 561 |        "<div>\n",
 562 |        "<style scoped>\n",
 563 |        "    .dataframe tbody tr th:only-of-type {\n",
 564 |        "        vertical-align: middle;\n",
 565 |        "    }\n",
 566 |        "\n",
 567 |        "    .dataframe tbody tr th {\n",
 568 |        "        vertical-align: top;\n",
 569 |        "    }\n",
 570 |        "\n",
 571 |        "    .dataframe thead th {\n",
 572 |        "        text-align: right;\n",
 573 |        "    }\n",
 574 |        "</style>\n",
 575 |        "<table border=\"1\" class=\"dataframe\">\n",
 576 |        "  <thead>\n",
 577 |        "    <tr style=\"text-align: right;\">\n",
 578 |        "      <th></th>\n",
 579 |        "      <th>bob</th>\n",
 580 |        "      <th>cartoon</th>\n",
 581 |        "      <th>is</th>\n",
 582 |        "      <th>strange</th>\n",
 583 |        "      <th>the</th>\n",
 584 |        "      <th>thing</th>\n",
 585 |        "      <th>type</th>\n",
 586 |        "    </tr>\n",
 587 |        "  </thead>\n",
 588 |        "  <tbody>\n",
 589 |        "    <tr>\n",
 590 |        "      <th>0</th>\n",
 591 |        "      <td>0.707107</td>\n",
 592 |        "      <td>0.00000</td>\n",
 593 |        "      <td>0.000000</td>\n",
 594 |        "      <td>0.000000</td>\n",
 595 |        "      <td>0.707107</td>\n",
 596 |        "      <td>0.00000</td>\n",
 597 |        "      <td>0.00000</td>\n",
 598 |        "    </tr>\n",
 599 |        "    <tr>\n",
 600 |        "      <th>1</th>\n",
 601 |        "      <td>0.000000</td>\n",
 602 |        "      <td>0.00000</td>\n",
 603 |        "      <td>0.707107</td>\n",
 604 |        "      <td>0.707107</td>\n",
 605 |        "      <td>0.000000</td>\n",
 606 |        "      <td>0.00000</td>\n",
 607 |        "      <td>0.00000</td>\n",
 608 |        "    </tr>\n",
 609 |        "    <tr>\n",
 610 |        "      <th>2</th>\n",
 611 |        "      <td>0.000000</td>\n",
 612 |        "      <td>0.57735</td>\n",
 613 |        "      <td>0.000000</td>\n",
 614 |        "      <td>0.000000</td>\n",
 615 |        "      <td>0.000000</td>\n",
 616 |        "      <td>0.57735</td>\n",
 617 |        "      <td>0.57735</td>\n",
 618 |        "    </tr>\n",
 619 |        "  </tbody>\n",
 620 |        "</table>\n",
 621 |        "</div>"
 622 |       ],
 623 |       "text/plain": [
 624 |        "        bob  cartoon        is   strange       the    thing     type\n",
 625 |        "0  0.707107  0.00000  0.000000  0.000000  0.707107  0.00000  0.00000\n",
 626 |        "1  0.000000  0.00000  0.707107  0.707107  0.000000  0.00000  0.00000\n",
 627 |        "2  0.000000  0.57735  0.000000  0.000000  0.000000  0.57735  0.57735"
 628 |       ]
 629 |      },
 630 |      "execution_count": 14,
 631 |      "metadata": {},
 632 |      "output_type": "execute_result"
 633 |     }
 634 |    ],
 635 |    "source": [
 636 |     "nlp3.fit(corpus)\n",
 637 |     "nlp3.vectorizer.get_feature_names()\n",
 638 |     "pd.DataFrame(nlp3.transform(corpus).toarray(), columns=nlp3.vectorizer.get_feature_names())"
 639 |    ]
 640 |   },
 641 |   {
 642 |    "cell_type": "markdown",
 643 |    "metadata": {},
 644 |    "source": [
 645 |     "# So what? Let's use some real data to try some different modeling approaches"
 646 |    ]
 647 |   },
 648 |   {
 649 |    "cell_type": "code",
 650 |    "execution_count": 15,
 651 |    "metadata": {
 652 |     "ExecuteTime": {
 653 |      "end_time": "2018-08-12T19:36:56.475435Z",
 654 |      "start_time": "2018-08-12T19:36:54.216987Z"
 655 |     }
 656 |    },
 657 |    "outputs": [],
 658 |    "source": [
 659 |     "from sklearn import datasets\n",
 660 |     "\n",
 661 |     "categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']\n",
 662 |     "ng_train = datasets.fetch_20newsgroups(subset='train', \n",
 663 |     "                                       categories=categories, \n",
 664 |     "                                       remove=('headers', \n",
 665 |     "                                               'footers', 'quotes'))\n",
 666 |     "ng_train_data = ng_train.data\n",
 667 |     "ng_train_targets = ng_train.target\n",
 668 |     "\n",
 669 |     "ng_test = datasets.fetch_20newsgroups(subset='test', \n",
 670 |     "                                       categories=categories, \n",
 671 |     "                                       remove=('headers', \n",
 672 |     "                                               'footers', 'quotes'))\n",
 673 |     "\n",
 674 |     "ng_test_data = ng_test.data\n",
 675 |     "ng_test_targets = ng_test.target"
 676 |    ]
 677 |   },
 678 |   {
 679 |    "cell_type": "code",
 680 |    "execution_count": 16,
 681 |    "metadata": {
 682 |     "ExecuteTime": {
 683 |      "end_time": "2018-08-12T19:37:10.328227Z",
 684 |      "start_time": "2018-08-12T19:36:56.477401Z"
 685 |     }
 686 |    },
 687 |    "outputs": [
 688 |     {
 689 |      "name": "stdout",
 690 |      "output_type": "stream",
 691 |      "text": [
 692 |       "Chain 0: 0.9031674208144796\n",
 693 |       "Chain 1: 0.9076923076923077\n",
 694 |       "Chain 2: 0.8995475113122172\n"
 695 |      ]
 696 |     }
 697 |    ],
 698 |    "source": [
 699 |     "from sklearn.naive_bayes import MultinomialNB\n",
 700 |     "from nltk.stem import PorterStemmer\n",
 701 |     "\n",
 702 |     "nlp = nlp_preprocessor(stemmer=PorterStemmer())\n",
 703 |     "nlp2 = nlp_preprocessor(vectorizer=CountVectorizer(lowercase=False))\n",
 704 |     "nlp3 = nlp_preprocessor(cleaning_function=new_clean_text, vectorizer=TfidfVectorizer(lowercase=False))\n",
 705 |     "nlp_chains = [nlp, nlp2, nlp3]\n",
 706 |     "\n",
 707 |     "for ix, chain in enumerate(nlp_chains):\n",
 708 |     "    nb = MultinomialNB()\n",
 709 |     "    chain.fit(ng_train_data)\n",
 710 |     "    train_data = chain.transform(ng_train_data)\n",
 711 |     "    test_data = chain.transform(ng_test_data)\n",
 712 |     "    nb.fit(train_data, ng_train_targets)\n",
 713 |     "    accuracy = nb.score(test_data, ng_test_targets)\n",
 714 |     "    print(\"Chain {}: {}\".format(ix, accuracy))"
 715 |    ]
 716 |   },
 717 |   {
 718 |    "cell_type": "markdown",
 719 |    "metadata": {},
 720 |    "source": [
 721 |     "## Summary"
 722 |    ]
 723 |   },
 724 |   {
 725 |    "cell_type": "markdown",
 726 |    "metadata": {},
 727 |    "source": [
 728 |     "This lets us sweep all of the preprocessing into one class where we control the pieces that go in and can inspect what comes out. If we wanted to, we could also add a model to the class and have a single object manage the whole pipeline; here we've left the model outside for demo purposes. Because the class keeps all of the pieces together, we can pickle one object and preserve the whole structure (the vectorizer, the stemmer, and the cleaning routine), so we don't lose any of the pieces if we want to run it on new data later."
 729 |    ]
 730 |   },
 731 |   {
 732 |    "cell_type": "markdown",
 733 |    "metadata": {},
 734 |    "source": [
 735 |     "# Adding a model to the mix\n",
 736 |     "\n",
 737 |     "Depending on the type of model we want to build, we'll need to wrap the preprocessing class a little bit differently for the specific case. For example, if we're doing supervised learning, we'll want a `predict` method. If we're doing topic modeling, we'll want a `transform` method. To make that happen, I'll show a few examples below that wrap around the preprocessing class to make the most of it. "
 738 |    ]
 739 |   },
 740 |   {
 741 |    "cell_type": "markdown",
 742 |    "metadata": {},
 743 |    "source": [
 744 |     "#### Supervised: Classification\n",
 745 |     "\n",
 746 |     "Here we'll write a class to predict a class given the text of the document. "
 747 |    ]
 748 |   },
 749 |   {
 750 |    "cell_type": "code",
 751 |    "execution_count": 17,
 752 |    "metadata": {
 753 |     "ExecuteTime": {
 754 |      "end_time": "2018-08-12T19:37:10.340551Z",
 755 |      "start_time": "2018-08-12T19:37:10.330954Z"
 756 |     }
 757 |    },
 758 |    "outputs": [],
 759 |    "source": [
 760 |     "class supervised_nlp:\n",
 761 |     "    \n",
 762 |     "    def __init__(self, model, preprocessing_pipeline=None):\n",
 763 |     "        \"\"\"\n",
 764 |     "        A pipeline for doing supervised nlp. Expects a model and creates\n",
 765 |     "        a preprocessing pipeline if one isn't provided.\n",
 766 |     "        \"\"\"\n",
 767 |     "        self.model = model\n",
 768 |     "        self._is_fit = False\n",
 769 |     "        if not preprocessing_pipeline:\n",
 770 |     "            self.preprocessor = nlp_preprocessor()\n",
 771 |     "        else:\n",
 772 |     "            self.preprocessor = preprocessing_pipeline\n",
 773 |     "        \n",
 774 |     "    def fit(self, X, y):\n",
 775 |     "        \"\"\"\n",
 776 |     "        Trains the vectorizer and model together using the\n",
 777 |     "        user's input training data.\n",
 778 |     "        \"\"\"\n",
 779 |     "        self.preprocessor.fit(X)\n",
 780 |     "        train_data = self.preprocessor.transform(X)\n",
 781 |     "        self.model.fit(train_data, y)\n",
 782 |     "        self._is_fit = True\n",
 783 |     "    \n",
 784 |     "    def predict(self, X):\n",
 785 |     "        \"\"\"\n",
 786 |     "        Makes a prediction on the data provided by the users using the \n",
 787 |     "        preprocessing pipeline and provided model.\n",
 788 |     "        \"\"\"\n",
 789 |     "        if not self._is_fit:\n",
 790 |     "            raise ValueError(\"Must fit the models before predicting!\")\n",
 791 |     "        test_data = self.preprocessor.transform(X)\n",
 792 |     "        preds = self.model.predict(test_data)\n",
 793 |     "        return preds\n",
 794 |     "    \n",
 795 |     "    def score(self, X, y):\n",
 796 |     "        \"\"\"\n",
 797 |     "        Returns the accuracy for the model after using the trained\n",
 798 |     "        preprocessing pipeline to prepare the data.\n",
 799 |     "        \"\"\"\n",
 800 |     "        test_data = self.preprocessor.transform(X)\n",
 801 |     "        return self.model.score(test_data, y)\n",
 802 |     "    \n",
 803 |     "    def save_pipe(self, filename):\n",
 804 |     "        \"\"\"\n",
 805 |     "        Writes the attributes of the pipeline to a file\n",
 806 |     "        allowing a pipeline to be loaded later with the\n",
 807 |     "        pre-trained pieces in place.\n",
 808 |     "        \"\"\"\n",
 809 |     "        if not isinstance(filename, str):\n",
 810 |     "            raise TypeError(\"filename must be a string\")\n",
 811 |     "        pickle.dump(self.__dict__, open(filename+\".mdl\",'wb'))\n",
 812 |     "        \n",
 813 |     "    def load_pipe(self, filename):\n",
 814 |     "        \"\"\"\n",
 815 |     "        Loads the attributes of a pipeline from a file\n",
 816 |     "        written by save_pipe, restoring the pipeline with\n",
 817 |     "        the pre-trained pieces in place.\n",
 818 |     "        \"\"\"\n",
 819 |     "        if not isinstance(filename, str):\n",
 820 |     "            raise TypeError(\"filename must be a string\")\n",
 821 |     "        if filename[-4:] != '.mdl':\n",
 822 |     "            filename += '.mdl'\n",
 823 |     "        self.__dict__ = pickle.load(open(filename,'rb'))"
 824 |    ]
 825 |   },
 826 |   {
 827 |    "cell_type": "code",
 828 |    "execution_count": 18,
 829 |    "metadata": {
 830 |     "ExecuteTime": {
 831 |      "end_time": "2018-08-12T19:37:22.978597Z",
 832 |      "start_time": "2018-08-12T19:37:10.343254Z"
 833 |     }
 834 |    },
 835 |    "outputs": [
 836 |     {
 837 |      "data": {
 838 |       "text/plain": [
 839 |        "0.9031674208144796"
 840 |       ]
 841 |      },
 842 |      "execution_count": 18,
 843 |      "metadata": {},
 844 |      "output_type": "execute_result"
 845 |     }
 846 |    ],
 847 |    "source": [
 848 |     "nlp_pipe = supervised_nlp(MultinomialNB(), nlp)\n",
 849 |     "nlp_pipe.fit(ng_train_data, ng_train_targets)\n",
 850 |     "nlp_pipe.score(ng_test_data, ng_test_targets)"
 851 |    ]
 852 |   },
 853 |   {
 854 |    "cell_type": "markdown",
 855 |    "metadata": {},
 856 |    "source": [
 857 |     "Swap out the model for something different."
 858 |    ]
 859 |   },
 860 |   {
 861 |    "cell_type": "code",
 862 |    "execution_count": 19,
 863 |    "metadata": {
 864 |     "ExecuteTime": {
 865 |      "end_time": "2018-08-12T19:37:35.730306Z",
 866 |      "start_time": "2018-08-12T19:37:22.981131Z"
 867 |     }
 868 |    },
 869 |    "outputs": [
 870 |     {
 871 |      "data": {
 872 |       "text/plain": [
 873 |        "0.8262443438914027"
 874 |       ]
 875 |      },
 876 |      "execution_count": 19,
 877 |      "metadata": {},
 878 |      "output_type": "execute_result"
 879 |     }
 880 |    ],
 881 |    "source": [
 882 |     "from sklearn.svm import LinearSVC\n",
 883 |     "\n",
 884 |     "nlp_pipe = supervised_nlp(LinearSVC(), nlp)\n",
 885 |     "nlp_pipe.fit(ng_train_data, ng_train_targets)\n",
 886 |     "nlp_pipe.score(ng_test_data, ng_test_targets)"
 887 |    ]
 888 |   },
 889 |   {
 890 |    "cell_type": "markdown",
 891 |    "metadata": {},
 892 |    "source": [
 893 |     "#### Unsupervised: Topic Modeling\n",
 894 |     "\n",
 895 |     "In this example we don't want to make predictions; we only want to find topics and be able to cast our data from the \"word space\" into the \"topic space.\" With this in mind, we'll add a `transform` method and the ability to print out the top words for each topic."
 896 |    ]
 897 |   },
 898 |   {
 899 |    "cell_type": "code",
 900 |    "execution_count": 20,
 901 |    "metadata": {
 902 |     "ExecuteTime": {
 903 |      "end_time": "2018-08-12T19:37:35.740426Z",
 904 |      "start_time": "2018-08-12T19:37:35.732734Z"
 905 |     }
 906 |    },
 907 |    "outputs": [],
 908 |    "source": [
 909 |     "class topic_modeling_nlp:\n",
 910 |     "    \n",
 911 |     "    def __init__(self, model, preprocessing_pipeline=None):\n",
 912 |     "        \"\"\"\n",
 913 |     "        A pipeline for doing topic modeling. Expects a model and creates\n",
 914 |     "        a preprocessing pipeline if one isn't provided.\n",
 915 |     "        \"\"\"\n",
 916 |     "        self.model = model\n",
 917 |     "        self._is_fit = False\n",
 918 |     "        if not preprocessing_pipeline:\n",
 919 |     "            self.preprocessor = nlp_preprocessor()\n",
 920 |     "        else:\n",
 921 |     "            self.preprocessor = preprocessing_pipeline\n",
 922 |     "        \n",
 923 |     "    def fit(self, X):\n",
 924 |     "        \"\"\"\n",
 925 |     "        Trains the vectorizer and model together using the\n",
 926 |     "        user's input training data.\n",
 927 |     "        \"\"\"\n",
 928 |     "        self.preprocessor.fit(X)\n",
 929 |     "        train_data = self.preprocessor.transform(X)\n",
 930 |     "        self.model.fit(train_data)\n",
 931 |     "        self._is_fit = True\n",
 932 |     "    \n",
 933 |     "    def transform(self, X):\n",
 934 |     "        \"\"\"\n",
 935 |     "        Casts the data provided by the user into the topic space using the\n",
 936 |     "        preprocessing pipeline and provided model.\n",
 937 |     "        \"\"\"\n",
 938 |     "        if not self._is_fit:\n",
 939 |     "            raise ValueError(\"Must fit the models before transforming!\")\n",
 940 |     "        test_data = self.preprocessor.transform(X)\n",
 941 |     "        preds = self.model.transform(test_data)\n",
 942 |     "        return preds\n",
 943 |     "    \n",
 944 |     "    def print_topics(self, num_words=10):\n",
 945 |     "        \"\"\"\n",
 946 |     "        A function to print out the top words for each topic\n",
 947 |     "        \"\"\"\n",
 948 |     "        feat_names = self.preprocessor.vectorizer.get_feature_names()\n",
 949 |     "        for topic_idx, topic in enumerate(self.model.components_):\n",
 950 |     "            message = \"Topic #%d: \" % topic_idx\n",
 951 |     "            message += \" \".join([feat_names[i]\n",
 952 |     "                                 for i in topic.argsort()[:-num_words - 1:-1]])\n",
 953 |     "            print(message)\n",
 954 |     "            \n",
 955 |     "    def save_pipe(self, filename):\n",
 956 |     "        \"\"\"\n",
 957 |     "        Writes the attributes of the pipeline to a file\n",
 958 |     "        allowing a pipeline to be loaded later with the\n",
 959 |     "        pre-trained pieces in place.\n",
 960 |     "        \"\"\"\n",
 961 |     "        if not isinstance(filename, str):\n",
 962 |     "            raise TypeError(\"filename must be a string\")\n",
 963 |     "        pickle.dump(self.__dict__, open(filename+\".mdl\",'wb'))\n",
 964 |     "        \n",
 965 |     "    def load_pipe(self, filename):\n",
 966 |     "        \"\"\"\n",
 967 |     "        Loads the attributes of a pipeline from a file\n",
 968 |     "        written by save_pipe, restoring the pipeline with\n",
 969 |     "        the pre-trained pieces in place.\n",
 970 |     "        \"\"\"\n",
 971 |     "        if not isinstance(filename, str):\n",
 972 |     "            raise TypeError(\"filename must be a string\")\n",
 973 |     "        if filename[-4:] != '.mdl':\n",
 974 |     "            filename += '.mdl'\n",
 975 |     "        self.__dict__ = pickle.load(open(filename,'rb'))"
 976 |    ]
 977 |   },
 978 |   {
 979 |    "cell_type": "code",
 980 |    "execution_count": 21,
 981 |    "metadata": {
 982 |     "ExecuteTime": {
 983 |      "end_time": "2018-08-12T19:37:35.760623Z",
 984 |      "start_time": "2018-08-12T19:37:35.743417Z"
 985 |     }
 986 |    },
 987 |    "outputs": [],
 988 |    "source": [
 989 |     "from sklearn.decomposition import TruncatedSVD\n",
 990 |     "\n",
 991 |     "cv = CountVectorizer(stop_words='english', token_pattern='\\\\b[a-z][a-z]+\\\\b')\n",
 992 |     "cleaning_pipe = nlp_preprocessor(vectorizer=cv)\n",
 993 |     "topic_chain = topic_modeling_nlp(TruncatedSVD(n_components=15), preprocessing_pipeline=cleaning_pipe)"
 994 |    ]
 995 |   },
 996 |   {
 997 |    "cell_type": "code",
 998 |    "execution_count": 22,
 999 |    "metadata": {
1000 |     "ExecuteTime": {
1001 |      "end_time": "2018-08-12T19:37:36.687204Z",
1002 |      "start_time": "2018-08-12T19:37:35.762876Z"
1003 |     }
1004 |    },
1005 |    "outputs": [
1006 |     {
1007 |      "data": {
1008 |       "text/plain": [
1009 |        "(1661, 15)"
1010 |       ]
1011 |      },
1012 |      "execution_count": 22,
1013 |      "metadata": {},
1014 |      "output_type": "execute_result"
1015 |     }
1016 |    ],
1017 |    "source": [
1018 |     "topic_chain.fit(ng_train_data)\n",
1019 |     "topic_chain.transform(ng_train_data).shape"
1020 |    ]
1021 |   },
1022 |   {
1023 |    "cell_type": "code",
1024 |    "execution_count": 23,
1025 |    "metadata": {
1026 |     "ExecuteTime": {
1027 |      "end_time": "2018-08-12T19:37:36.734831Z",
1028 |      "start_time": "2018-08-12T19:37:36.689485Z"
1029 |     }
1030 |    },
1031 |    "outputs": [
1032 |     {
1033 |      "name": "stdout",
1034 |      "output_type": "stream",
1035 |      "text": [
1036 |       "Topic #0: jpeg image edu file graphics gif images format color pub\n",
1037 |       "Topic #1: edu graphics pub data mail ray ftp send com objects\n",
1038 |       "Topic #2: jesus god atheists matthew people atheism does religious said religion\n",
1039 |       "Topic #3: image data processing analysis software available display tools tool user\n",
1040 |       "Topic #4: jesus matthew prophecy messiah psalm isaiah david said lord israel\n",
1041 |       "Topic #5: argument fallacy conclusion example true argumentum ad premises false valid\n",
1042 |       "Topic #6: data available ftp sgi grass vertex pci motecc model info\n",
1043 |       "Topic #7: game year don good think hit won runs team home\n",
1044 |       "Topic #8: god posting subject response typical information universe einstein bush evidence\n",
1045 |       "Topic #9: den radius double theta sqrt pi sin rtheta pole pt\n",
1046 |       "Topic #10: program read menu bits display change file pressing want don\n",
1047 |       "Topic #11: lost program won cubs atheism game menu display bits home\n",
1048 |       "Topic #12: game runs second hit run cubs graphics home win sunday\n",
1049 |       "Topic #13: atheism alt faq send edu usenet news files otis answers\n",
1050 |       "Topic #14: col int row value char imagewidth imageheight unsigned cubs pitcher\n"
1051 |      ]
1052 |     }
1053 |    ],
1054 |    "source": [
1055 |     "topic_chain.print_topics()"
1056 |    ]
1057 |   },
1058 |   {
1059 |    "cell_type": "markdown",
1060 |    "metadata": {},
1061 |    "source": [
1062 |     "Swap out the model for something different."
1063 |    ]
1064 |   },
1065 |   {
1066 |    "cell_type": "code",
1067 |    "execution_count": 24,
1068 |    "metadata": {
1069 |     "ExecuteTime": {
1070 |      "end_time": "2018-08-12T19:37:36.741910Z",
1071 |      "start_time": "2018-08-12T19:37:36.737246Z"
1072 |     }
1073 |    },
1074 |    "outputs": [],
1075 |    "source": [
1076 |     "from sklearn.decomposition import LatentDirichletAllocation\n",
1077 |     "topic_chain = topic_modeling_nlp(LatentDirichletAllocation(n_components=15), preprocessing_pipeline=cleaning_pipe)"
1078 |    ]
1079 |   },
1080 |   {
1081 |    "cell_type": "code",
1082 |    "execution_count": 25,
1083 |    "metadata": {
1084 |     "ExecuteTime": {
1085 |      "end_time": "2018-08-12T19:37:42.475762Z",
1086 |      "start_time": "2018-08-12T19:37:36.745876Z"
1087 |     }
1088 |    },
1089 |    "outputs": [
1090 |     {
1091 |      "name": "stderr",
1092 |      "output_type": "stream",
1093 |      "text": [
1094 |       "/Users/zachariahmiller/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:536: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.\n",
1095 |       "  DeprecationWarning)\n"
1096 |      ]
1097 |     },
1098 |     {
1099 |      "name": "stdout",
1100 |      "output_type": "stream",
1101 |      "text": [
1102 |       "Topic #0: cornerstone pappas nis dualpage dollar tour karlin eastwick rawley mcnally\n",
1103 |       "Topic #1: corel tune pens bone trimming dykstra reprints highway qcr packs\n",
1104 |       "Topic #2: enviroleague youth mission organizations markus true adult max pointer abekas\n",
1105 |       "Topic #3: image graphics edu file jpeg data files software use images\n",
1106 |       "Topic #4: edu cobb hou jewish polygon problem illinois schwartz sigkids pay\n",
1107 |       "Topic #5: graeme computer radiosity pp curves cubic bezier stephan generation dtax\n",
1108 |       "Topic #6: god people atheism does atheists jesus religion don believe argument\n",
1109 |       "Topic #7: sex sea tek said vice bronx com ico bob away\n",
1110 |       "Topic #8: won cubs lost york new team edu san sox reds\n",
1111 |       "Topic #9: den col int radius row war value bombing theta hussein\n",
1112 |       "Topic #10: text copy texts septuagint masoretic parody ot passages various toilet\n",
1113 |       "Topic #11: erickson cage animals temporarily syllogism spell cch cold products polygoon\n",
1114 |       "Topic #12: alomar average league players baerga double player rbi ab slg\n",
1115 |       "Topic #13: xxxx cruel anti rb germany paul mercedes woofing explaining girl\n",
1116 |       "Topic #14: don think good just year like better time know game\n"
1117 |      ]
1118 |     }
1119 |    ],
1120 |    "source": [
1121 |     "topic_chain.fit(ng_train_data)\n",
1122 |     "topic_chain.transform(ng_train_data).shape\n",
1123 |     "topic_chain.print_topics()"
1124 |    ]
1125 |   },
1126 |   {
1127 |    "cell_type": "code",
1128 |    "execution_count": null,
1129 |    "metadata": {},
1130 |    "outputs": [],
1131 |    "source": []
1132 |   }
1133 |  ],
1134 |  "metadata": {
1135 |   "kernelspec": {
1136 |    "display_name": "Python [default]",
1137 |    "language": "python",
1138 |    "name": "python3"
1139 |   },
1140 |   "language_info": {
1141 |    "codemirror_mode": {
1142 |     "name": "ipython",
1143 |     "version": 3
1144 |    },
1145 |    "file_extension": ".py",
1146 |    "mimetype": "text/x-python",
1147 |    "name": "python",
1148 |    "nbconvert_exporter": "python",
1149 |    "pygments_lexer": "ipython3",
1150 |    "version": "3.6.6"
1151 |   },
1152 |   "toc": {
1153 |    "nav_menu": {},
1154 |    "number_sections": true,
1155 |    "sideBar": true,
1156 |    "skip_h1_title": false,
1157 |    "toc_cell": false,
1158 |    "toc_position": {},
1159 |    "toc_section_display": "block",
1160 |    "toc_window_display": false
1161 |   }
1162 |  },
1163 |  "nbformat": 4,
1164 |  "nbformat_minor": 2
1165 | }
1166 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | import setuptools
 2 | 
 3 | with open("README.md") as fh:
 4 |   README_CONTENTS = fh.read()
 5 | 
 6 | config = {
 7 |   'name': 'NLPPipeManager',
 8 |   'version': '0.1',
 9 |   'author': 'ZWMiller',
10 |   'author_email': 'zach@notarealemail.com',
11 |   'long_description': README_CONTENTS,
12 |   'url': 'https://github.com/zwmiller/nlp_pipe_manager/',
13 |   'packages': setuptools.find_packages(),
14 |   'install_requires': ['scikit-learn','nltk']
15 | }
16 | 
17 | setuptools.setup(**config)
18 | 


--------------------------------------------------------------------------------