├── output_1_0.png
├── output_41_0.jpeg
└── README.md

/output_1_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aakinlalu/Crime-Classification-using-PySpark/HEAD/output_1_0.png
--------------------------------------------------------------------------------

/output_41_0.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aakinlalu/Crime-Classification-using-PySpark/HEAD/output_41_0.jpeg
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------

# __Crime Classification Model using PySpark__


```python
from IPython.display import Image
Image('spark_ml.png')
```


![png](output_1_0.png)


## 1. __Scope__
* We are interested in a system that can classify a crime description into one of several categories. Such a system could automatically assign a described crime to a category, which would help law enforcement route the right officers to a crime, or even assign officers automatically based on the classification.
* We use the San Francisco Crime dataset from Kaggle. Our task is to train a model on the 39 pre-defined categories, test the model's accuracy and deploy it into production. Given a new crime description, the system should assign it to one of the 39 categories.

* To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in PySpark.

* This is a multi-class text classification problem.

## __2. Setup Spark and load other libraries__


```python
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("clipper-pyspark").getOrCreate()

sc = spark.sparkContext
```


```python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.random.seed(60)
```

## __3. Data Extraction__


```sh
%%sh
# Let's see the first 5 rows
head -5 train.csv
```

    Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
    2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425891675136,37.7745985956747
    2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425891675136,37.7745985956747
    2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.42436302145,37.8004143219856
    2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.42699532676599,37.80087263276921



```python
# Read the data into a Spark dataframe
from pyspark.sql.functions import col, lower
df = spark.read.format('csv')\
        .option('header', 'true')\
        .option('inferSchema', 'true')\
        .load('train.csv')

data = df.select(lower(col('Category')), lower(col('Descript')))\
        .withColumnRenamed('lower(Category)', 'Category')\
        .withColumnRenamed('lower(Descript)', 'Description')
data.cache()
print('Dataframe Structure')
print('----------------------------------')
print(data.printSchema())
print(' ')
print('Dataframe preview')
print(data.show(5))
print(' ')
print('----------------------------------')
print('Total number of rows', df.count())
```

    Dataframe Structure
    ----------------------------------
    root
     |-- Category: string (nullable = true)
     |-- Description: string (nullable = true)
    
    None
     
    Dataframe preview
    +--------------+--------------------+
    |      Category|         Description|
    +--------------+--------------------+
    |      warrants|      warrant arrest|
    |other offenses|traffic violation...|
    |other offenses|traffic violation...|
    | larceny/theft|grand theft from ...|
    | larceny/theft|grand theft from ...|
    +--------------+--------------------+
    only showing top 5 rows
    
    None
     
    ----------------------------------
    Total number of rows 878049


**Explanation**: __To familiarize ourselves with the dataset, let's look at the most frequent crime categories and descriptions.__


```python
def top_n_list(df, var, N):
    '''
    Print the number of distinct values of column `var` and its top N most frequent values.
    '''
    print("Total number of unique value of"+' '+var+''+':'+' '+str(df.select(var).distinct().count()))
    print(' ')
    print('Top'+' '+str(N)+' '+'Crime'+' '+var)
    df.groupBy(var).count().withColumnRenamed('count', 'totalValue')\
      .orderBy(col('totalValue').desc()).show(N)


top_n_list(data, 'Category', 10)
print(' ')
print(' ')
top_n_list(data, 'Description', 10)
```

    Total number of unique value of Category: 39
     
    Top 10 Crime Category
    +--------------+----------+
    |      Category|totalValue|
    +--------------+----------+
    | larceny/theft|    174900|
    |other offenses|    126182|
    |  non-criminal|     92304|
    |       assault|     76876|
    | drug/narcotic|     53971|
    | vehicle theft|     53781|
    |     vandalism|     44725|
    |      warrants|     42214|
    |      burglary|     36755|
    |suspicious occ|     31414|
    +--------------+----------+
    only showing top 10 rows
    
     
     
    Total number of unique value of Description: 879
     
    Top 10 Crime Description
    +--------------------+----------+
    |         Description|totalValue|
    +--------------------+----------+
    |grand theft from ...|     60022|
    |       lost property|     31729|
    |             battery|     27441|
    |   stolen automobile|     26897|
    |drivers license, ...|     26839|
    |      warrant arrest|     23754|
    |suspicious occurr...|     21891|
    |aided case, menta...|     21497|
    |petty theft from ...|     19771|
    |malicious mischie...|     17789|
    +--------------------+----------+
    only showing top 10 rows
    

**Explanation**: __The Category feature will be our label (multi-class). How many classes does it have?__


```python
data.select('Category').distinct().count()
```




    39
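
The category counts above show a heavily skewed class distribution (larceny/theft alone accounts for roughly a fifth of all records). The seaborn/matplotlib imports from section 2 are otherwise unused, so here is a minimal sketch of how they could be used to visualise the imbalance; the aggregation happens in Spark and only the small summary is brought to pandas. The figure size and the choice of the top 15 categories are arbitrary.


```python
# A sketch: visualise the class imbalance with the seaborn/matplotlib imports from section 2.
# Aggregate in Spark, then convert only the small summary to pandas for plotting.
category_counts = data.groupBy('Category').count()\
    .orderBy(col('count').desc())\
    .toPandas()

plt.figure(figsize=(12, 6))
sns.barplot(x='count', y='Category', data=category_counts.head(15), color='steelblue')
plt.title('Top 15 crime categories by number of records')
plt.xlabel('Number of records')
plt.ylabel('Category')
plt.tight_layout()
plt.show()
```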

## __4. Partition the dataset into Training and Test dataset__


```python
training, test = data.randomSplit([0.7, 0.3], seed=60)
#trainingSet.cache()
print("Training Dataset Count:", training.count())
print("Test Dataset Count:", test.count())
```

    Training Dataset Count: 615417
    Test Dataset Count: 262632


## __5. Define Structure to build Pipeline__
__The process of preparing the dataset and building the pipeline involves:__
* __Define the tokenization function using RegexTokenizer__: RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter "pattern" (regex, default: "\s+") is used as delimiters to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes "tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result.

* __Define the stop words remover function using StopWordsRemover__: StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stop words is specified by the stopWords parameter.

* __Define the bag-of-words function for the Description variable using CountVectorizer__: CountVectorizer can be used as an estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.

* __Define the function to encode the values of the Category variable using StringIndexer__: StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. In our case, the label column (Category) will be encoded to label indices from 0 to 38; the most frequent label (LARCENY/THEFT) will be indexed as 0.

* __Define a pipeline to chain these stages together__: ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.


```python
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, OneHotEncoder, StringIndexer, VectorAssembler, HashingTF, IDF, Word2Vec
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, NaiveBayes

#----------------Define tokenizer with RegexTokenizer()------------------
regex_tokenizer = RegexTokenizer(pattern='\\W')\
    .setInputCol("Description")\
    .setOutputCol("tokens")

#----------------Define stop words with StopWordsRemover()---------------------
# Note: setStopWords replaces the default English stop-word list with this custom list.
extra_stopwords = ['http', 'amp', 'rt', 't', 'c', 'the']
stopwords_remover = StopWordsRemover()\
    .setInputCol('tokens')\
    .setOutputCol('filtered_words')\
    .setStopWords(extra_stopwords)


#----------Define bag of words using CountVectorizer()---------------------------
count_vectors = CountVectorizer(vocabSize=10000, minDF=5)\
    .setInputCol("filtered_words")\
    .setOutputCol("features")


#-----------Use TF-IDF to vectorise features instead of CountVectorizer-----------------
hashingTf = HashingTF(numFeatures=10000)\
    .setInputCol("filtered_words")\
    .setOutputCol("raw_features")

# Use minDocFreq to remove sparse terms
idf = IDF(minDocFreq=5)\
    .setInputCol("raw_features")\
    .setOutputCol("features")

#---------------Define bag of words using Word2Vec---------------------------
word2Vec = Word2Vec(vectorSize=1000, minCount=0)\
    .setInputCol("filtered_words")\
    .setOutputCol("features")

#-----------Encode the Category variable into label using StringIndexer-----------
label_string_idx = StringIndexer()\
    .setInputCol("Category")\
    .setOutputCol("label")

#-----------Define classifier structure for Logistic Regression--------------
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

#---------Define classifier structure for Naive Bayes----------
nb = NaiveBayes(smoothing=1)

def metrics_ev(labels, metrics):
    '''
    Print a set of performance metrics for a fitted multi-class model.
    '''
    # Confusion matrix
    print("---------Confusion matrix-----------------")
    print(metrics.confusionMatrix())
    print(' ')
    # Overall statistics
    print('----------Overall statistics-----------')
    print("Precision = %s" % metrics.precision())
    print("Recall = %s" % metrics.recall())
    print("F1 Score = %s" % metrics.fMeasure())
    print(' ')
    # Statistics by class
    print('----------Statistics by class----------')
    for label in sorted(labels):
        print("Class %s precision = %s" % (label, metrics.precision(label)))
        print("Class %s recall = %s" % (label, metrics.recall(label)))
        print("Class %s F1 Measure = %s" % (label, metrics.fMeasure(label, beta=1.0)))
    print(' ')
    # Weighted stats
    print('----------Weighted stats----------------')
    print("Weighted recall = %s" % metrics.weightedRecall)
    print("Weighted precision = %s" % metrics.weightedPrecision)
    print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
    print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
    print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)

```
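
The metrics_ev helper is never invoked in the rest of the notebook, so here is a minimal sketch of how it could be wired up once a fitted model's predictions are available (for example predictions_cv_lr, created in section 6 below). It assumes the Spark 2.x MulticlassMetrics API that metrics_ev relies on.


```python
# Sketch: feed a model's predictions into metrics_ev via MulticlassMetrics.
# Assumes predictions_cv_lr (built in section 6) and the Spark 2.x MulticlassMetrics API.
from pyspark.mllib.evaluation import MulticlassMetrics

preds_and_labels = predictions_cv_lr.select('prediction', 'label')\
    .rdd.map(lambda row: (float(row.prediction), float(row.label)))
metrics = MulticlassMetrics(preds_and_labels)

class_labels = [row.label for row in predictions_cv_lr.select('label').distinct().collect()]
metrics_ev(class_labels, metrics)
```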

## __6. Build Multi-Classification Models__
__The stages involved in performing the multi-class classification include:__
1. Model training and evaluation
   1. Build the baseline model
      1. Logistic regression using CountVectorizer features
   2. Build secondary models
      1. Naive Bayes using CountVectorizer features
      2. Logistic regression and Naive Bayes using TF-IDF features
      3. Logistic regression and Naive Bayes using Word2Vec features

### __(i) Baseline Model__
A baseline model should be quick, low cost and simple to set up, while still producing decent results. One reason to start with a baseline is that it lets you iterate very quickly while wasting minimal time. To further understand why and how to apply baselines, please refer to Emmanuel Ameisen's article: [Always start with a stupid model, no exceptions.](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa)

#### __(a). Apply Logistic Regression with Count Vector Features__
We will build a logistic regression model on the count-vector features, make predictions and score them on the test set. We will then look at the predictions with the highest probabilities, along with accuracy and other metrics, to evaluate the model.

Note: Fit the regex_tokenizer, stopwords_remover, count_vectors, label_string_idx and lr stages into a pipeline.


```python
pipeline_cv_lr = Pipeline().setStages([regex_tokenizer, stopwords_remover, count_vectors, label_string_idx, lr])
model_cv_lr = pipeline_cv_lr.fit(training)
predictions_cv_lr = model_cv_lr.transform(test)
```


```python
print('-----------------------------Check Top 5 predictions----------------------------------')
print(' ')
predictions_cv_lr.select('Description', 'Category', "probability", "label", "prediction")\
    .orderBy("probability", ascending=False)\
    .show(n=5, truncate=30)
```

    -----------------------------Check Top 5 predictions----------------------------------
     
    +------------------------------+-------------+------------------------------+-----+----------+
    |                   Description|     Category|                   probability|label|prediction|
    +------------------------------+-------------+------------------------------+-----+----------+
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    +------------------------------+-------------+------------------------------+-----+----------+
    only showing top 5 rows
    


```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Note: MulticlassClassificationEvaluator uses weighted F1 ('f1') as its default metric,
# so the value printed below is the F1 score rather than raw accuracy.
evaluator_cv_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_cv_lr)
print(' ')
print('------------------------------Accuracy----------------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_cv_lr))
```

     
    ------------------------------Accuracy----------------------------------
     
     accuracy:0.9721844116763713:


### __(ii). Secondary Models__
#### __(a). Apply Naive Bayes with Count Vector Features__
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The spark.ml implementation currently supports both multinomial naive Bayes and Bernoulli naive Bayes.

Fit the regex_tokenizer, stopwords_remover, count_vectors, label_string_idx and nb stages into a pipeline.


```python
### Secondary model using NaiveBayes
pipeline_cv_nb = Pipeline().setStages([regex_tokenizer, stopwords_remover, count_vectors, label_string_idx, nb])
model_cv_nb = pipeline_cv_nb.fit(training)
predictions_cv_nb = model_cv_nb.transform(test)
```


```python
evaluator_cv_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_cv_nb)
print(' ')
print('--------------------------Accuracy-----------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_cv_nb))
```

     
    --------------------------Accuracy-----------------------------
     
     accuracy:0.9933012222188159:

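
A note on the scores above: MulticlassClassificationEvaluator defaults to the weighted F1 metric, so the values labelled "accuracy" are F1 scores. If plain accuracy is wanted, the metric can be requested explicitly; a minimal sketch for the two models fitted so far:


```python
# Sketch: report plain accuracy explicitly (the cells above use the evaluator's default F1 metric).
acc_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
print('LR + CountVectorizer accuracy: {}'.format(acc_evaluator.evaluate(predictions_cv_lr)))
print('NB + CountVectorizer accuracy: {}'.format(acc_evaluator.evaluate(predictions_cv_nb)))
```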

#### __(b). Apply Logistic Regression Using TF-IDF Features__
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by _t_, a document by _d_, and the corpus by _D_. Term frequency _TF(t,d)_ is the number of times that term _t_ appears in document _d_, while document frequency _DF(t,D)_ is the number of documents that contain term _t_. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. "a", "the", and "of". If a term appears very often across the corpus, it means it doesn't carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:
$$IDF(t,D) = \log \frac{|D| + 1}{DF(t,D) + 1},$$
where $|D|$ is the total number of documents in the corpus. Since a logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
$$TFIDF(t,d,D) = TF(t,d) \cdot IDF(t,D).$$

There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.

Note: Fit the regex_tokenizer, stopwords_remover, hashingTf, idf, label_string_idx and lr stages into a pipeline.


```python
pipeline_idf_lr = Pipeline().setStages([regex_tokenizer, stopwords_remover, hashingTf, idf, label_string_idx, lr])
model_idf_lr = pipeline_idf_lr.fit(training)
predictions_idf_lr = model_idf_lr.transform(test)
```


```python
print('-----------------------------Check Top 5 predictions----------------------------------')
print(' ')
predictions_idf_lr.select('Description', 'Category', "probability", "label", "prediction")\
    .orderBy("probability", ascending=False)\
    .show(n=5, truncate=30)
```

    -----------------------------Check Top 5 predictions----------------------------------
     
    +------------------------------+-------------+------------------------------+-----+----------+
    |                   Description|     Category|                   probability|label|prediction|
    +------------------------------+-------------+------------------------------+-----+----------+
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    +------------------------------+-------------+------------------------------+-----+----------+
    only showing top 5 rows
    


```python
evaluator_idf_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_idf_lr)
print(' ')
print('-------------------------------Accuracy---------------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_idf_lr))
```

     
    -------------------------------Accuracy---------------------------------
     
     accuracy:0.9723359770202158:
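
Beyond aggregate metrics, it can be useful to spot-check a fitted pipeline on a brand-new description, which is how the deployed system will ultimately be used. The sketch below uses model_idf_lr and an arbitrary example string; it relies on StringIndexerModel skipping itself when its input column (Category) is absent at prediction time, so double-check that behaviour on your Spark version. To map the numeric prediction back to a category name, IndexToString (or the fitted StringIndexerModel's labels attribute) can be used.


```python
# Sketch: score one ad-hoc description with the fitted TF-IDF + logistic regression pipeline.
# The description text is an arbitrary example, not a row from the dataset.
# No 'Category' column is supplied; StringIndexerModel is expected to skip itself when its
# input column is missing, so the remaining stages still run at prediction time.
new_desc = spark.createDataFrame([('petty theft from a locked vehicle',)], ['Description'])
model_idf_lr.transform(new_desc)\
    .select('Description', 'prediction', 'probability')\
    .show(truncate=60)
```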

#### __(c). Apply Naive Bayes with TF-IDF Features__


```python
pipeline_idf_nb = Pipeline().setStages([regex_tokenizer, stopwords_remover, hashingTf, idf, label_string_idx, nb])
model_idf_nb = pipeline_idf_nb.fit(training)
predictions_idf_nb = model_idf_nb.transform(test)
```


```python
evaluator_idf_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_idf_nb)
print(' ')
print('-----------------------------Accuracy-----------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_idf_nb))
```

     
    -----------------------------Accuracy-----------------------------
     
     accuracy:0.9950758205262961:


#### __(d). Apply Logistic Regression Using Word2Vec Features__
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc.


```python
pipeline_wv_lr = Pipeline().setStages([regex_tokenizer, stopwords_remover, word2Vec, label_string_idx, lr])
model_wv_lr = pipeline_wv_lr.fit(training)
predictions_wv_lr = model_wv_lr.transform(test)
```


```python
evaluator_wv_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_wv_lr)
print('--------------------------Accuracy------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_wv_lr))
```

    --------------------------Accuracy------------
     
     accuracy:0.9073464410736654:


#### __(e). Apply Naive Bayes Using Word2Vec Features__
This combination is left commented out: Word2Vec features contain negative values, and Spark's multinomial Naive Bayes requires nonnegative feature values, so fitting this pipeline would fail.


```python
#pipeline_wv_nb = Pipeline().setStages([regex_tokenizer, stopwords_remover, word2Vec, label_string_idx, nb])
#model_wv_nb = pipeline_wv_nb.fit(training)
#predictions_wv_nb = model_wv_nb.transform(test)
```


```python
#evaluator_wv_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_wv_nb)
#print('--------Accuracy------------')
#print(' ')
#print('accuracy:{}%:'.format(round(evaluator_wv_nb *100), 2))
```

## 7. __Results:__
__The table below summarises the scores reported above (MulticlassClassificationEvaluator's default weighted F1 metric) for the models built with the different feature-extraction techniques.__

| Features           | Logistic Regression | Naive Bayes |
| -------------------|:-------------------:|------------:|
| Count Vectoriser   | 97.2%               | 99.3%       |
| TF-IDF             | 97.2%               | 99.5%       |
| Word2Vec           | 90.7%               |             |

**Explanation**: __As you can see, TF-IDF proves to be the best vectoriser for this dataset, and Naive Bayes performs better than logistic regression on this text-classification task.__

## __8. Deploy the Model__
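The Flask service below loads a saved PipelineModel from the path "spark-naive-bayes-model", but the notebook never shows that model being persisted. A minimal sketch of the missing step, assuming the TF-IDF + Naive Bayes pipeline (model_idf_nb, the best performer above) is the one being deployed:


```python
# Sketch: persist the fitted pipeline so the web service can load it later.
# Assumes model_idf_nb is the pipeline being deployed, saved under the path used below.
model_idf_nb.write().overwrite().save("spark-naive-bayes-model")
```
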
We will use Flask. To know more about Flask, check [Full Stack Python](https://www.fullstackpython.com/flask.html).


```python
Image('flask.jpg')
```


![jpeg](output_41_0.jpeg)


```python
from flask import Flask, request, jsonify
from pyspark.ml import PipelineModel
```


```python
app = Flask(__name__)
```


```python
# Load the saved pipeline model
MODEL = PipelineModel.load("spark-naive-bayes-model")
```


```python
HTTP_BAD_REQUEST = 400
```


```python
@app.route('/predict')
def predict():
    Description = request.args.get('Description', default=None, type=str)

    # Reject requests that have bad or missing values.
    if Description is None:
        # Provide the caller with feedback on why the record is unscorable.
        message = ('Record cannot be scored because of '
                   'missing or unacceptable values. '
                   'All values must be present and of type string.')
        response = jsonify(status='error',
                           error_message=message)
        # Set the status code to 400
        response.status_code = HTTP_BAD_REQUEST
        return response

    # The pipeline expects a DataFrame with a 'Description' column, not a plain Python list.
    features = spark.createDataFrame([(Description,)], ['Description'])
    predictions = MODEL.transform(features)
    # 'Category' does not exist at prediction time, so return the predicted label index and
    # the class probabilities, converted into JSON-friendly Python types.
    label_pred = predictions.select('Description', 'prediction', 'probability').first()
    return jsonify(status='complete',
                   description=label_pred['Description'],
                   prediction=label_pred['prediction'],
                   probability=label_pred['probability'].toArray().tolist())
```


```python
if __name__ == '__main__':
    app.run(debug=True)
```


```python
import requests
#response = requests.get('http://127.0.0.1:5000/predict?Description=arson')
#response.text
```
--------------------------------------------------------------------------------