├── output_1_0.png
├── output_41_0.jpeg
└── README.md

/output_1_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aakinlalu/Crime-Classification-using-PySpark/HEAD/output_1_0.png
--------------------------------------------------------------------------------

/output_41_0.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aakinlalu/Crime-Classification-using-PySpark/HEAD/output_41_0.jpeg
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------

# __Crime Classification Model using PySpark__


```python
from IPython.display import Image
Image('spark_ml.png')
```


![png](output_1_0.png)


## 1. __Scope__
* We are interested in a system that can classify a crime description into one of several categories. Such a system could automatically assign a described crime to a category, which would help law enforcement route the right officers to a crime, or even assign officers automatically based on the classification.
* We use the San Francisco Crime dataset from Kaggle. Our task is to train a model on the 39 pre-defined categories, test the model's accuracy and deploy it into production. Given a new crime description, the system should assign it to one of the 39 categories.

* To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in PySpark.

* This is a multi-class text classification problem.

## __2. Setup Spark and load other libraries__


```python
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("clipper-pyspark").getOrCreate()

sc = spark.sparkContext
```


```python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.random.seed(60)
```

## __3. Data Extraction__


```sh
%%sh
# Let's see the first 5 rows
head -5 train.csv
```

    Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
    2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425891675136,37.7745985956747
    2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425891675136,37.7745985956747
    2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.42436302145,37.8004143219856
    2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.42699532676599,37.80087263276921



```python
# Read the data into a Spark dataframe
from pyspark.sql.functions import col, lower
df = spark.read.format('csv')\
        .option('header', 'true')\
        .option('inferSchema', 'true')\
        .load('train.csv')

data = df.select(lower(col('Category')), lower(col('Descript')))\
        .withColumnRenamed('lower(Category)', 'Category')\
        .withColumnRenamed('lower(Descript)', 'Description')
data.cache()
print('Dataframe Structure')
print('----------------------------------')
print(data.printSchema())
print(' ')
print('Dataframe preview')
print(data.show(5))
print(' ')
print('----------------------------------')
print('Total number of rows', df.count())
```

    Dataframe Structure
    ----------------------------------
    root
     |-- Category: string (nullable = true)
     |-- Description: string (nullable = true)
    
    None
     
    Dataframe preview
    +--------------+--------------------+
    |      Category|         Description|
    +--------------+--------------------+
    |      warrants|      warrant arrest|
    |other offenses|traffic violation...|
    |other offenses|traffic violation...|
    | larceny/theft|grand theft from ...|
    | larceny/theft|grand theft from ...|
    +--------------+--------------------+
    only showing top 5 rows
    
    None
     
    ----------------------------------
    Total number of rows 878049


**Explanation**: __To familiarize ourselves with the dataset, let's look at the most frequent crime categories and descriptions.__


```python
def top_n_list(df, var, N):
    '''
    Print the number of distinct values of column `var` and its top N most frequent values.
    '''
    print("Total number of unique value of"+' '+var+''+':'+' '+str(df.select(var).distinct().count()))
    print(' ')
    print('Top'+' '+str(N)+' '+'Crime'+' '+var)
    df.groupBy(var).count().withColumnRenamed('count', 'totalValue')\
      .orderBy(col('totalValue').desc()).show(N)


top_n_list(data, 'Category', 10)
print(' ')
print(' ')
top_n_list(data, 'Description', 10)
```

    Total number of unique value of Category: 39
     
    Top 10 Crime Category
    +--------------+----------+
    |      Category|totalValue|
    +--------------+----------+
    | larceny/theft|    174900|
    |other offenses|    126182|
    |  non-criminal|     92304|
    |       assault|     76876|
    | drug/narcotic|     53971|
    | vehicle theft|     53781|
    |     vandalism|     44725|
    |      warrants|     42214|
    |      burglary|     36755|
    |suspicious occ|     31414|
    +--------------+----------+
    only showing top 10 rows
    
     
     
    Total number of unique value of Description: 879
     
    Top 10 Crime Description
    +--------------------+----------+
    |         Description|totalValue|
    +--------------------+----------+
    |grand theft from ...|     60022|
    |       lost property|     31729|
    |             battery|     27441|
    |   stolen automobile|     26897|
    |drivers license, ...|     26839|
    |      warrant arrest|     23754|
    |suspicious occurr...|     21891|
    |aided case, menta...|     21497|
    |petty theft from ...|     19771|
    |malicious mischie...|     17789|
    +--------------------+----------+
    only showing top 10 rows
    

**Explanation**: __The Category feature will be our label (multi-class). How many classes does it have?__


```python
data.select('Category').distinct().count()
```




    39
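
The category counts above show a heavily skewed class distribution (larceny/theft alone accounts for roughly a fifth of all records). The seaborn/matplotlib imports from section 2 are otherwise unused, so here is a minimal sketch of how they could be used to visualise the imbalance; the aggregation happens in Spark and only the small summary is brought to pandas. The figure size and the choice of the top 15 categories are arbitrary.


```python
# A sketch: visualise the class imbalance with the seaborn/matplotlib imports from section 2.
# Aggregate in Spark, then convert only the small summary to pandas for plotting.
category_counts = data.groupBy('Category').count()\
    .orderBy(col('count').desc())\
    .toPandas()

plt.figure(figsize=(12, 6))
sns.barplot(x='count', y='Category', data=category_counts.head(15), color='steelblue')
plt.title('Top 15 crime categories by number of records')
plt.xlabel('Number of records')
plt.ylabel('Category')
plt.tight_layout()
plt.show()
```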

## __4. Partition the dataset into Training and Test dataset__


```python
training, test = data.randomSplit([0.7, 0.3], seed=60)
#trainingSet.cache()
print("Training Dataset Count:", training.count())
print("Test Dataset Count:", test.count())
```

    Training Dataset Count: 615417
    Test Dataset Count: 262632


## __5. Define Structure to build Pipeline__
__The process of preparing the dataset and building the pipeline involves:__
* __Define the tokenization function using RegexTokenizer__: RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter "pattern" (regex, default: "\s+") is used as delimiters to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes "tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result.

* __Define the stop words remover function using StopWordsRemover__: StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stop words is specified by the stopWords parameter.

* __Define the bag-of-words function for the Description variable using CountVectorizer__: CountVectorizer can be used as an estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.

* __Define the function to encode the values of the Category variable using StringIndexer__: StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. In our case, the label column (Category) will be encoded to label indices from 0 to 38; the most frequent label (LARCENY/THEFT) will be indexed as 0.

* __Define a pipeline to chain these stages together__: ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.


```python
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, OneHotEncoder, StringIndexer, VectorAssembler, HashingTF, IDF, Word2Vec
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, NaiveBayes

#----------------Define tokenizer with RegexTokenizer()------------------
regex_tokenizer = RegexTokenizer(pattern='\\W')\
    .setInputCol("Description")\
    .setOutputCol("tokens")

#----------------Define stop words with StopWordsRemover()---------------------
# Note: setStopWords replaces the default English stop-word list with this custom list.
extra_stopwords = ['http', 'amp', 'rt', 't', 'c', 'the']
stopwords_remover = StopWordsRemover()\
    .setInputCol('tokens')\
    .setOutputCol('filtered_words')\
    .setStopWords(extra_stopwords)


#----------Define bag of words using CountVectorizer()---------------------------
count_vectors = CountVectorizer(vocabSize=10000, minDF=5)\
    .setInputCol("filtered_words")\
    .setOutputCol("features")


#-----------Use TF-IDF to vectorise features instead of CountVectorizer-----------------
hashingTf = HashingTF(numFeatures=10000)\
    .setInputCol("filtered_words")\
    .setOutputCol("raw_features")

# Use minDocFreq to remove sparse terms
idf = IDF(minDocFreq=5)\
    .setInputCol("raw_features")\
    .setOutputCol("features")

#---------------Define bag of words using Word2Vec---------------------------
word2Vec = Word2Vec(vectorSize=1000, minCount=0)\
    .setInputCol("filtered_words")\
    .setOutputCol("features")

#-----------Encode the Category variable into label using StringIndexer-----------
label_string_idx = StringIndexer()\
    .setInputCol("Category")\
    .setOutputCol("label")

#-----------Define classifier structure for Logistic Regression--------------
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

#---------Define classifier structure for Naive Bayes----------
nb = NaiveBayes(smoothing=1)

def metrics_ev(labels, metrics):
    '''
    Print a set of performance metrics for a fitted multi-class model.
    '''
    # Confusion matrix
    print("---------Confusion matrix-----------------")
    print(metrics.confusionMatrix())
    print(' ')
    # Overall statistics
    print('----------Overall statistics-----------')
    print("Precision = %s" % metrics.precision())
    print("Recall = %s" % metrics.recall())
    print("F1 Score = %s" % metrics.fMeasure())
    print(' ')
    # Statistics by class
    print('----------Statistics by class----------')
    for label in sorted(labels):
        print("Class %s precision = %s" % (label, metrics.precision(label)))
        print("Class %s recall = %s" % (label, metrics.recall(label)))
        print("Class %s F1 Measure = %s" % (label, metrics.fMeasure(label, beta=1.0)))
    print(' ')
    # Weighted stats
    print('----------Weighted stats----------------')
    print("Weighted recall = %s" % metrics.weightedRecall)
    print("Weighted precision = %s" % metrics.weightedPrecision)
    print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
    print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
    print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)

```
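
The metrics_ev helper is never invoked in the rest of the notebook, so here is a minimal sketch of how it could be wired up once a fitted model's predictions are available (for example predictions_cv_lr, created in section 6 below). It assumes the Spark 2.x MulticlassMetrics API that metrics_ev relies on.


```python
# Sketch: feed a model's predictions into metrics_ev via MulticlassMetrics.
# Assumes predictions_cv_lr (built in section 6) and the Spark 2.x MulticlassMetrics API.
from pyspark.mllib.evaluation import MulticlassMetrics

preds_and_labels = predictions_cv_lr.select('prediction', 'label')\
    .rdd.map(lambda row: (float(row.prediction), float(row.label)))
metrics = MulticlassMetrics(preds_and_labels)

class_labels = [row.label for row in predictions_cv_lr.select('label').distinct().collect()]
metrics_ev(class_labels, metrics)
```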

## __6. Build Multi-Classification Models__
__The stages involved in performing the multi-class classification include:__
1. Model training and evaluation
   1. Build the baseline model
      1. Logistic regression using CountVectorizer features
   2. Build secondary models
      1. Naive Bayes using CountVectorizer features
      2. Logistic regression and Naive Bayes using TF-IDF features
      3. Logistic regression and Naive Bayes using Word2Vec features

### __(i) Baseline Model__
A baseline model should be quick, low cost and simple to set up, while still producing decent results. One reason to start with a baseline is that it lets you iterate very quickly while wasting minimal time. To further understand why and how to apply baselines, please refer to Emmanuel Ameisen's article: [Always start with a stupid model, no exceptions.](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa)

#### __(a). Apply Logistic Regression with Count Vector Features__
We will build a logistic regression model on the count-vector features, make predictions and score them on the test set. We will then look at the predictions with the highest probabilities, along with accuracy and other metrics, to evaluate the model.

Note: Fit the regex_tokenizer, stopwords_remover, count_vectors, label_string_idx and lr stages into a pipeline.


```python
pipeline_cv_lr = Pipeline().setStages([regex_tokenizer, stopwords_remover, count_vectors, label_string_idx, lr])
model_cv_lr = pipeline_cv_lr.fit(training)
predictions_cv_lr = model_cv_lr.transform(test)
```


```python
print('-----------------------------Check Top 5 predictions----------------------------------')
print(' ')
predictions_cv_lr.select('Description', 'Category', "probability", "label", "prediction")\
    .orderBy("probability", ascending=False)\
    .show(n=5, truncate=30)
```

    -----------------------------Check Top 5 predictions----------------------------------
     
    +------------------------------+-------------+------------------------------+-----+----------+
    |                   Description|     Category|                   probability|label|prediction|
    +------------------------------+-------------+------------------------------+-----+----------+
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8726782249097988,0.02162...|  0.0|       0.0|
    +------------------------------+-------------+------------------------------+-----+----------+
    only showing top 5 rows
    


```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Note: MulticlassClassificationEvaluator uses weighted F1 ('f1') as its default metric,
# so the value printed below is the F1 score rather than raw accuracy.
evaluator_cv_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_cv_lr)
print(' ')
print('------------------------------Accuracy----------------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_cv_lr))
```

     
    ------------------------------Accuracy----------------------------------
     
     accuracy:0.9721844116763713:


### __(ii). Secondary Models__
#### __(a). Apply Naive Bayes with Count Vector Features__
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The spark.ml implementation currently supports both multinomial naive Bayes and Bernoulli naive Bayes.

Fit the regex_tokenizer, stopwords_remover, count_vectors, label_string_idx and nb stages into a pipeline.


```python
### Secondary model using NaiveBayes
pipeline_cv_nb = Pipeline().setStages([regex_tokenizer, stopwords_remover, count_vectors, label_string_idx, nb])
model_cv_nb = pipeline_cv_nb.fit(training)
predictions_cv_nb = model_cv_nb.transform(test)
```


```python
evaluator_cv_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_cv_nb)
print(' ')
print('--------------------------Accuracy-----------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_cv_nb))
```

     
    --------------------------Accuracy-----------------------------
     
     accuracy:0.9933012222188159:

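
A note on the scores above: MulticlassClassificationEvaluator defaults to the weighted F1 metric, so the values labelled "accuracy" are F1 scores. If plain accuracy is wanted, the metric can be requested explicitly; a minimal sketch for the two models fitted so far:


```python
# Sketch: report plain accuracy explicitly (the cells above use the evaluator's default F1 metric).
acc_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
print('LR + CountVectorizer accuracy: {}'.format(acc_evaluator.evaluate(predictions_cv_lr)))
print('NB + CountVectorizer accuracy: {}'.format(acc_evaluator.evaluate(predictions_cv_nb)))
```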

#### __(b). Apply Logistic Regression Using TF-IDF Features__
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by _t_, a document by _d_, and the corpus by _D_. Term frequency _TF(t,d)_ is the number of times that term _t_ appears in document _d_, while document frequency _DF(t,D)_ is the number of documents that contain term _t_. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. "a", "the", and "of". If a term appears very often across the corpus, it means it doesn't carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:
$$IDF(t,D) = \log \frac{|D| + 1}{DF(t,D) + 1},$$
where $|D|$ is the total number of documents in the corpus. Since a logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
$$TFIDF(t,d,D) = TF(t,d) \cdot IDF(t,D).$$

There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.

Note: Fit the regex_tokenizer, stopwords_remover, hashingTf, idf, label_string_idx and lr stages into a pipeline.


```python
pipeline_idf_lr = Pipeline().setStages([regex_tokenizer, stopwords_remover, hashingTf, idf, label_string_idx, lr])
model_idf_lr = pipeline_idf_lr.fit(training)
predictions_idf_lr = model_idf_lr.transform(test)
```


```python
print('-----------------------------Check Top 5 predictions----------------------------------')
print(' ')
predictions_idf_lr.select('Description', 'Category', "probability", "label", "prediction")\
    .orderBy("probability", ascending=False)\
    .show(n=5, truncate=30)
```

    -----------------------------Check Top 5 predictions----------------------------------
     
    +------------------------------+-------------+------------------------------+-----+----------+
    |                   Description|     Category|                   probability|label|prediction|
    +------------------------------+-------------+------------------------------+-----+----------+
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    |theft, bicycle, <$50, no se...|larceny/theft|[0.8745035002793186,0.02115...|  0.0|       0.0|
    +------------------------------+-------------+------------------------------+-----+----------+
    only showing top 5 rows
    


```python
evaluator_idf_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_idf_lr)
print(' ')
print('-------------------------------Accuracy---------------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_idf_lr))
```

     
    -------------------------------Accuracy---------------------------------
     
     accuracy:0.9723359770202158:
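
Beyond aggregate metrics, it can be useful to spot-check a fitted pipeline on a brand-new description, which is how the deployed system will ultimately be used. The sketch below uses model_idf_lr and an arbitrary example string; it relies on StringIndexerModel skipping itself when its input column (Category) is absent at prediction time, so double-check that behaviour on your Spark version. To map the numeric prediction back to a category name, IndexToString (or the fitted StringIndexerModel's labels attribute) can be used.


```python
# Sketch: score one ad-hoc description with the fitted TF-IDF + logistic regression pipeline.
# The description text is an arbitrary example, not a row from the dataset.
# No 'Category' column is supplied; StringIndexerModel is expected to skip itself when its
# input column is missing, so the remaining stages still run at prediction time.
new_desc = spark.createDataFrame([('petty theft from a locked vehicle',)], ['Description'])
model_idf_lr.transform(new_desc)\
    .select('Description', 'prediction', 'probability')\
    .show(truncate=60)
```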

#### __(c). Apply Naive Bayes with TF-IDF Features__


```python
pipeline_idf_nb = Pipeline().setStages([regex_tokenizer, stopwords_remover, hashingTf, idf, label_string_idx, nb])
model_idf_nb = pipeline_idf_nb.fit(training)
predictions_idf_nb = model_idf_nb.transform(test)
```


```python
evaluator_idf_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_idf_nb)
print(' ')
print('-----------------------------Accuracy-----------------------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_idf_nb))
```

     
    -----------------------------Accuracy-----------------------------
     
     accuracy:0.9950758205262961:


#### __(d). Apply Logistic Regression Using Word2Vec Features__
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc.


```python
pipeline_wv_lr = Pipeline().setStages([regex_tokenizer, stopwords_remover, word2Vec, label_string_idx, lr])
model_wv_lr = pipeline_wv_lr.fit(training)
predictions_wv_lr = model_wv_lr.transform(test)
```


```python
evaluator_wv_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_wv_lr)
print('--------------------------Accuracy------------')
print(' ')
print(' accuracy:{}:'.format(evaluator_wv_lr))
```

    --------------------------Accuracy------------
     
     accuracy:0.9073464410736654:


#### __(e). Apply Naive Bayes Using Word2Vec Features__
This combination is left commented out: Word2Vec features contain negative values, and Spark's multinomial Naive Bayes requires nonnegative feature values, so fitting this pipeline would fail.


```python
#pipeline_wv_nb = Pipeline().setStages([regex_tokenizer, stopwords_remover, word2Vec, label_string_idx, nb])
#model_wv_nb = pipeline_wv_nb.fit(training)
#predictions_wv_nb = model_wv_nb.transform(test)
```


```python
#evaluator_wv_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_wv_nb)
#print('--------Accuracy------------')
#print(' ')
#print('accuracy:{}%:'.format(round(evaluator_wv_nb *100), 2))
```

## 7. __Results:__
__The table below summarises the scores reported above (MulticlassClassificationEvaluator's default weighted F1 metric) for the models built with the different feature-extraction techniques.__

| Features           | Logistic Regression | Naive Bayes |
| -------------------|:-------------------:|------------:|
| Count Vectoriser   | 97.2%               | 99.3%       |
| TF-IDF             | 97.2%               | 99.5%       |
| Word2Vec           | 90.7%               |             |

**Explanation**: __As you can see, TF-IDF proves to be the best vectoriser for this dataset, and Naive Bayes performs better than logistic regression on this text-classification task.__

## __8. Deploy the Model__
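The Flask service below loads a saved PipelineModel from the path "spark-naive-bayes-model", but the notebook never shows that model being persisted. A minimal sketch of the missing step, assuming the TF-IDF + Naive Bayes pipeline (model_idf_nb, the best performer above) is the one being deployed:


```python
# Sketch: persist the fitted pipeline so the web service can load it later.
# Assumes model_idf_nb is the pipeline being deployed, saved under the path used below.
model_idf_nb.write().overwrite().save("spark-naive-bayes-model")
```
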
We will use Flask. To know more about Flask, check [Full Stack Python](https://www.fullstackpython.com/flask.html).


```python
Image('flask.jpg')
```


![jpeg](output_41_0.jpeg)


```python
from flask import Flask, request, jsonify
from pyspark.ml import PipelineModel
```


```python
app = Flask(__name__)
```


```python
# Load the saved pipeline model
MODEL = PipelineModel.load("spark-naive-bayes-model")
```


```python
HTTP_BAD_REQUEST = 400
```


```python
@app.route('/predict')
def predict():
    Description = request.args.get('Description', default=None, type=str)

    # Reject requests that have bad or missing values.
    if Description is None:
        # Provide the caller with feedback on why the record is unscorable.
        message = ('Record cannot be scored because of '
                   'missing or unacceptable values. '
                   'All values must be present and of type string.')
        response = jsonify(status='error',
                           error_message=message)
        # Set the status code to 400
        response.status_code = HTTP_BAD_REQUEST
        return response

    # The pipeline expects a DataFrame with a 'Description' column, not a plain Python list.
    features = spark.createDataFrame([(Description,)], ['Description'])
    predictions = MODEL.transform(features)
    # 'Category' does not exist at prediction time, so return the predicted label index and
    # the class probabilities, converted into JSON-friendly Python types.
    label_pred = predictions.select('Description', 'prediction', 'probability').first()
    return jsonify(status='complete',
                   description=label_pred['Description'],
                   prediction=label_pred['prediction'],
                   probability=label_pred['probability'].toArray().tolist())
```


```python
if __name__ == '__main__':
    app.run(debug=True)
```


```python
import requests
#response = requests.get('http://127.0.0.1:5000/predict?Description=arson')
#response.text
```
--------------------------------------------------------------------------------