├── .ipynb_checkpoints ├── Binary Tabular Data Classification with PySpark-checkpoint.ipynb ├── End-to-End Machine Learning Model using PySpark and MLlib (2)-checkpoint.ipynb ├── End-to-End Machine Learning Model using PySpark and MLlib-checkpoint.ipynb ├── Multi-class Text Classification Problem with PySpark and MLlib-checkpoint.ipynb ├── Multi-class classification using Decision Tree Problem with PySpark -checkpoint.ipynb ├── Predict Customer Churn using PySpark Machine Learning-checkpoint.ipynb ├── PySpark Dataframe Complete Guide (with COVID-19 Dataset)-checkpoint.ipynb └── PySpark and SparkSQL Complete Guide-checkpoint.ipynb ├── Binary Tabular Data Classification with PySpark.ipynb ├── End-to-End Machine Learning Model using PySpark and MLlib (2).ipynb ├── End-to-End Machine Learning Model using PySpark and MLlib.ipynb ├── Multi-class Text Classification Problem with PySpark and MLlib.ipynb ├── Multi-class classification using Decision Tree Problem with PySpark .ipynb ├── Predict Customer Churn using PySpark Machine Learning.ipynb ├── PySpark Dataframe Complete Guide (with COVID-19 Dataset).ipynb ├── PySpark and SparkSQL Complete Guide.ipynb ├── README.md ├── Setting up Fast Hyperparameter Search Framework with Pyspark.ipynb ├── [Advanced] 5 Spark Tips that will get you to another level.ipynb ├── [Advanced] Spark Know-How in Pratice .ipynb ├── data ├── .ipynb_checkpoints │ └── census-checkpoint.csv ├── 2013_SFO_Customer_Survey.csv ├── Case.csv ├── Region.csv ├── TimeProvince.csv ├── adult.data ├── census.csv ├── nyt2.json ├── winequality-red.csv └── winequality-white.csv └── img ├── hyper.png ├── input.png ├── parallel-coordinates-plot.png ├── shuffle.png ├── spark.png └── sparkpartition.png /.ipynb_checkpoints/Binary Tabular Data Classification with PySpark-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Binary Tabular Data Classification with PySpark" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook covers a classification problem in Machine Learning and go through a comprehensive guide to succesfully develop an End-to-End ML class prediction model using PySpark." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "**Classification Algorithms**\n", 22 | "In order to predict the class of certain samples, there are several classification algorithms that can be used. In fact, when developing our machine learning models, we will train and evaluate a certain number of them, and we will keep those with better predicting performance. \\\n", 23 | "\n", 24 | "A non-exhaustive list of some of the most used algorithms are:\n", 25 | "\n", 26 | "- Logistic Regression\n", 27 | "- Decision Trees\n", 28 | "- Random Forests\n", 29 | "- Support Vector Machines\n", 30 | "- K-Nearest Neighbors (KNN)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "**ROC**\n", 38 | "the metric that we will use in our project is the Reciever Operation Characteristic or ROC.\n", 39 | "The ROC curve tells us about how good the model can distinguish between two classes. It can get values from 0 to 1. The better the model is, the closer to 1 value it will be." 
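For reference, a minimal sketch of how this ROC metric is computed with Spark MLlib later in this notebook; the `predictions` DataFrame and the `>50K` label column are assumptions taken from the steps that follow, not part of this cell:

```python
# Minimal sketch: score a model's test-set predictions by area under the ROC curve.
# Assumes `predictions` = model.transform(test_data) and a binary '>50K' label column.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol=">50K",
    rawPredictionCol="rawPrediction",   # default column produced by MLlib classifiers
    metricName="areaUnderROC",          # closer to 1.0 means better class separation
)
print("Area under ROC:", evaluator.evaluate(predictions))
```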
40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "We will use a number of different supervised algorithms to precisely predict individuals’ income using data collected from the 1994 U.S. Census. \\\n", 47 | " \n", 48 | "We will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data.\n", 49 | "Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000. \\\n", 50 | "\n", 51 | "As from our previous research we have found out that the individuals who are most likely to donate money to a charity are the ones that make more than $50,000. \\\n", 52 | "\n", 53 | "Therefore, we are facing a binary classification problem, where we want to determine wether an individual makes more than $50K a year (class 1) or do not (class 0)." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 1, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "#we use the findspark library to locate spark on our local machine\n", 63 | "import findspark\n", 64 | "findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 2, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "import pandas as pd\n", 74 | "import numpy as np\n", 75 | "from datetime import date, timedelta, datetime\n", 76 | "import time\n", 77 | "\n", 78 | "import pyspark # only run this after findspark.init()\n", 79 | "from pyspark.sql import SparkSession, SQLContext\n", 80 | "from pyspark.context import SparkContext\n", 81 | "from pyspark.sql.functions import * \n", 82 | "from pyspark.sql.types import * " 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### 1. Load Data" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "The census dataset consists of approximately 45222 data points, with each datapoint having 13 features.\n", 97 | "\n", 98 | "The dataset for this project can be found from the [UCI Machine Learning Repo](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "# Initiate the Spark Session\n", 108 | "spark = SparkSession.builder.appName('imbalanced_binary_classification').getOrCreate()" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 4, 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "data": { 118 | "text/html": [ 119 | "\n", 120 | "
\n", 121 | "

SparkSession - in-memory

\n", 122 | " \n", 123 | "
\n", 124 | "

SparkContext

\n", 125 | "\n", 126 | "

Spark UI

\n", 127 | "\n", 128 | "
\n", 129 | "
Version
\n", 130 | "
v2.4.6
\n", 131 | "
Master
\n", 132 | "
local[*]
\n", 133 | "
AppName
\n", 134 | "
imbalanced_binary_classification
\n", 135 | "
\n", 136 | "
\n", 137 | " \n", 138 | "
\n", 139 | " " 140 | ], 141 | "text/plain": [ 142 | "" 143 | ] 144 | }, 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "spark" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 5, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "data": { 161 | "text/plain": [ 162 | "DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, income: string]" 163 | ] 164 | }, 165 | "metadata": {}, 166 | "output_type": "display_data" 167 | } 168 | ], 169 | "source": [ 170 | "# File location and type\n", 171 | "file_location = \"./data/census.csv\"\n", 172 | "file_type = \"csv\"\n", 173 | "\n", 174 | "# CSV options\n", 175 | "infer_schema = \"true\"\n", 176 | "first_row_is_header = \"False\"\n", 177 | "delimiter = \",\"\n", 178 | "\n", 179 | "# make sure to add column name as the CSV does not contain column name as default\n", 180 | "\n", 181 | "\n", 182 | "# The applied options are for CSV files. For other file types, these will be ignored.\n", 183 | "df = spark.read.format(file_type) \\\n", 184 | " .option(\"inferSchema\", infer_schema) \\\n", 185 | " .option(\"header\", first_row_is_header) \\\n", 186 | " .option(\"sep\", delimiter) \\\n", 187 | " .load(file_location) \\\n", 188 | " .toDF(\"age\", \"workClass\", \"fnlwgt\", \"education\", \"education-num\",\"marital-status\", \"occupation\", \"relationship\",\n", 189 | " \"race\", \"sex\", \"capital-gain\", \"capital-loss\", \"hours-per-week\", \"native-country\", \"income\")\n", 190 | "\n", 191 | "display(df)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 6, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "name": "stdout", 201 | "output_type": "stream", 202 | "text": [ 203 | "+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+\n", 204 | "|age| workClass|fnlwgt| education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n", 205 | "+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+\n", 206 | "| 39| State-gov| 77516| Bachelors| 13| Never-married| Adm-clerical|Not-in-family| White| Male| 2174| 0| 40| United-States| <=50K|\n", 207 | "| 50|Self-emp-not-inc| 83311| Bachelors| 13| Married-civ-spouse| Exec-managerial| Husband| White| Male| 0| 0| 13| United-States| <=50K|\n", 208 | "| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family| White| Male| 0| 0| 40| United-States| <=50K|\n", 209 | "| 53| Private|234721| 11th| 7| Married-civ-spouse|Handlers-cleaners| Husband| Black| Male| 0| 0| 40| United-States| <=50K|\n", 210 | "| 28| Private|338409| Bachelors| 13| Married-civ-spouse| Prof-specialty| Wife| Black|Female| 0| 0| 40| Cuba| <=50K|\n", 211 | "| 37| Private|284582| Masters| 14| Married-civ-spouse| Exec-managerial| Wife| White|Female| 0| 0| 40| United-States| <=50K|\n", 212 | "| 49| Private|160187| 9th| 5|Married-spouse-ab...| Other-service|Not-in-family| Black|Female| 0| 0| 16| Jamaica| <=50K|\n", 213 | "| 
52|Self-emp-not-inc|209642| HS-grad| 9| Married-civ-spouse| Exec-managerial| Husband| White| Male| 0| 0| 45| United-States| >50K|\n", 214 | "| 31| Private| 45781| Masters| 14| Never-married| Prof-specialty|Not-in-family| White|Female| 14084| 0| 50| United-States| >50K|\n", 215 | "| 42| Private|159449| Bachelors| 13| Married-civ-spouse| Exec-managerial| Husband| White| Male| 5178| 0| 40| United-States| >50K|\n", 216 | "| 37| Private|280464|Some-college| 10| Married-civ-spouse| Exec-managerial| Husband| Black| Male| 0| 0| 80| United-States| >50K|\n", 217 | "| 30| State-gov|141297| Bachelors| 13| Married-civ-spouse| Prof-specialty| Husband|Asian-Pac-Islander| Male| 0| 0| 40| India| >50K|\n", 218 | "| 23| Private|122272| Bachelors| 13| Never-married| Adm-clerical| Own-child| White|Female| 0| 0| 30| United-States| <=50K|\n", 219 | "| 32| Private|205019| Assoc-acdm| 12| Never-married| Sales|Not-in-family| Black| Male| 0| 0| 50| United-States| <=50K|\n", 220 | "| 40| Private|121772| Assoc-voc| 11| Married-civ-spouse| Craft-repair| Husband|Asian-Pac-Islander| Male| 0| 0| 40| ?| >50K|\n", 221 | "| 34| Private|245487| 7th-8th| 4| Married-civ-spouse| Transport-moving| Husband|Amer-Indian-Eskimo| Male| 0| 0| 45| Mexico| <=50K|\n", 222 | "| 25|Self-emp-not-inc|176756| HS-grad| 9| Never-married| Farming-fishing| Own-child| White| Male| 0| 0| 35| United-States| <=50K|\n", 223 | "| 32| Private|186824| HS-grad| 9| Never-married|Machine-op-inspct| Unmarried| White| Male| 0| 0| 40| United-States| <=50K|\n", 224 | "| 38| Private| 28887| 11th| 7| Married-civ-spouse| Sales| Husband| White| Male| 0| 0| 50| United-States| <=50K|\n", 225 | "| 43|Self-emp-not-inc|292175| Masters| 14| Divorced| Exec-managerial| Unmarried| White|Female| 0| 0| 45| United-States| >50K|\n", 226 | "+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+\n", 227 | "only showing top 20 rows\n", 228 | "\n" 229 | ] 230 | } 231 | ], 232 | "source": [ 233 | "df.show()" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### 2. 
Data Preprocessing" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 7, 246 | "metadata": {}, 247 | "outputs": [ 248 | { 249 | "data": { 250 | "text/plain": [ 251 | "['age',\n", 252 | " 'workClass',\n", 253 | " 'fnlwgt',\n", 254 | " 'education',\n", 255 | " 'education-num',\n", 256 | " 'marital-status',\n", 257 | " 'occupation',\n", 258 | " 'relationship',\n", 259 | " 'race',\n", 260 | " 'sex',\n", 261 | " 'capital-gain',\n", 262 | " 'capital-loss',\n", 263 | " 'hours-per-week',\n", 264 | " 'native-country',\n", 265 | " '>50K']" 266 | ] 267 | }, 268 | "execution_count": 7, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "# Import pyspark functions\n", 275 | "from pyspark.sql import functions as F\n", 276 | "# Create add new column to the dataset\n", 277 | "df = df.withColumn('>50K', F.when(df.income == '<=50K', 0).otherwise(1))\n", 278 | "# Drop the Income label\n", 279 | "df = df.drop('income')\n", 280 | "# Show dataset's columns\n", 281 | "df.columns" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "#### Vectorizing Numerical Features and One-Hot Encodin Categorical Features" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 8, 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "# Selecting categorical features\n", 298 | "categorical_columns = [\n", 299 | " 'workClass',\n", 300 | " 'education',\n", 301 | " 'marital-status',\n", 302 | " 'occupation',\n", 303 | " 'relationship',\n", 304 | " 'race',\n", 305 | " 'sex',\n", 306 | " 'hours-per-week',\n", 307 | " 'native-country',\n", 308 | " ]" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 9, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "from pyspark.ml import Pipeline\n", 318 | "from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler\n", 319 | "from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LogisticRegression)\n", 320 | "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n", 321 | "\n", 322 | "# The index of string values multiple columns\n", 323 | "indexers = [\n", 324 | " StringIndexer(inputCol=c, outputCol=\"{0}_indexed\".format(c))\n", 325 | " for c in categorical_columns]\n", 326 | "# The encode of indexed values multiple columns\n", 327 | "encoders = [OneHotEncoder(dropLast=False,inputCol=indexer.getOutputCol(),\n", 328 | " outputCol=\"{0}_encoded\".format(indexer.getOutputCol())) \n", 329 | " for indexer in indexers]" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "The above code basically indexes each categorical column using the StringIndexer, and then converts the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row." 
337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "#### Join the categorical encoded features with the numerical ones and make a vector with both of them" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 10, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "# Vectorizing encoded values\n", 353 | "categorical_encoded = [encoder.getOutputCol() for encoder in encoders]\n", 354 | "numerical_columns = ['age', 'education-num', 'capital-gain', 'capital-loss']\n", 355 | "inputcols = categorical_encoded + numerical_columns\n", 356 | "assembler = VectorAssembler(inputCols=inputcols, outputCol=\"features\")" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "#### Set up a pipeline to automatize this stages" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 11, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, >50K: int, workClass_indexed: double, education_indexed: double, marital-status_indexed: double, occupation_indexed: double, relationship_indexed: double, race_indexed: double, sex_indexed: double, hours-per-week_indexed: double, native-country_indexed: double, workClass_indexed_encoded: vector, education_indexed_encoded: vector, marital-status_indexed_encoded: vector, occupation_indexed_encoded: vector, relationship_indexed_encoded: vector, race_indexed_encoded: vector, sex_indexed_encoded: vector, hours-per-week_indexed_encoded: vector, native-country_indexed_encoded: vector, features: vector]" 375 | ] 376 | }, 377 | "metadata": {}, 378 | "output_type": "display_data" 379 | } 380 | ], 381 | "source": [ 382 | "pipeline = Pipeline(stages=indexers + encoders+[assembler])\n", 383 | "model = pipeline.fit(df)\n", 384 | "# Transform data\n", 385 | "transformed = model.transform(df)\n", 386 | "display(transformed)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "#### Finally, we will select a dataset only with the relevant features." 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 12, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "# Transform data\n", 403 | "final_data = transformed.select('features', '>50K')" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "### 3. 
Build a Model" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 13, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "# Initialize the classification models\n", 420 | "# Decision Trees\n", 421 | "# Random Forests\n", 422 | "# Gradient Boosted Trees\n", 423 | "\n", 424 | "dtc = DecisionTreeClassifier(labelCol='>50K', featuresCol='features')\n", 425 | "\n", 426 | "rfc = RandomForestClassifier(numTrees=150, labelCol='>50K', featuresCol='features')\n", 427 | "\n", 428 | "gbt = GBTClassifier(labelCol='>50K', featuresCol='features', maxIter=10)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 14, 434 | "metadata": {}, 435 | "outputs": [ 436 | { 437 | "name": "stdout", 438 | "output_type": "stream", 439 | "text": [ 440 | "39010\n", 441 | "9832\n" 442 | ] 443 | } 444 | ], 445 | "source": [ 446 | "# Split data\n", 447 | "# We will perform a classic 80/20 split between training and testing data.\n", 448 | "train_data, test_data = final_data.randomSplit([0.8,0.2], seed=623)\n", 449 | "print(train_data.count())\n", 450 | "print(test_data.count())" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "### 4. Start Training" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 15, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "dtc_model = dtc.fit(train_data)\n", 467 | "rfc_model = rfc.fit(train_data)\n", 468 | "gbt_model = gbt.fit(train_data)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### 5. Evaludate with Test-set" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 16, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "dtc_preds = dtc_model.transform(test_data)\n", 485 | "rfc_preds = rfc_model.transform(test_data)\n", 486 | "gbt_preds = gbt_model.transform(test_data)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "### 6. 
Evaluating Model’s Performance" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 17, 499 | "metadata": {}, 500 | "outputs": [], 501 | "source": [ 502 | "# our evaluator will be the ROC\n", 503 | "my_eval = BinaryClassificationEvaluator(labelCol='>50K')" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 18, 509 | "metadata": {}, 510 | "outputs": [ 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "DTC\n", 516 | "0.5849312593442992\n" 517 | ] 518 | } 519 | ], 520 | "source": [ 521 | "# Display Decision Tree evaluation metric\n", 522 | "print('DTC')\n", 523 | "print(my_eval.evaluate(dtc_preds))" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": 19, 529 | "metadata": {}, 530 | "outputs": [ 531 | { 532 | "name": "stdout", 533 | "output_type": "stream", 534 | "text": [ 535 | "RFC\n", 536 | "0.8914577709920453\n" 537 | ] 538 | } 539 | ], 540 | "source": [ 541 | "# Display Random Forest evaluation metric\n", 542 | "print('RFC')\n", 543 | "print(my_eval.evaluate(rfc_preds))" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 20, 549 | "metadata": {}, 550 | "outputs": [ 551 | { 552 | "name": "stdout", 553 | "output_type": "stream", 554 | "text": [ 555 | "GBT\n", 556 | "0.9044179860557597\n" 557 | ] 558 | } 559 | ], 560 | "source": [ 561 | "# Display Gradien Boosting Tree evaluation metric\n", 562 | "print('GBT')\n", 563 | "print(my_eval.evaluate(gbt_preds))" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "### 7. Improving Models Performance (Model Tuning)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "We will try to do this by performing the grid search cross validation technique. With it, we will evaluate the performance of the model with different combinations of previously sets of hyperparameter’s values.\n", 578 | "\n", 579 | "The hyperparameters that we will tune are:\n", 580 | "\n", 581 | "- Max Depth\n", 582 | "- Max Bins\n", 583 | "- Max Iterations" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 21, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "data": { 593 | "text/plain": [ 594 | "0.9143539096589867" 595 | ] 596 | }, 597 | "execution_count": 21, 598 | "metadata": {}, 599 | "output_type": "execute_result" 600 | } 601 | ], 602 | "source": [ 603 | "# Import libraries\n", 604 | "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", 605 | "\n", 606 | "# Set the Parameters grid\n", 607 | "paramGrid = (ParamGridBuilder()\n", 608 | " .addGrid(gbt.maxDepth, [2, 4, 6])\n", 609 | " .addGrid(gbt.maxBins, [20, 60])\n", 610 | " .addGrid(gbt.maxIter, [10, 20])\n", 611 | " .build())\n", 612 | "\n", 613 | "# Iinitializing the cross validator class\n", 614 | "cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=my_eval, numFolds=5)\n", 615 | "\n", 616 | "# Run cross validations. 
This can take about 6 minutes since it is training over 20 trees\n", 617 | "cvModel = cv.fit(train_data)\n", 618 | "gbt_predictions_2 = cvModel.transform(test_data)\n", 619 | "my_eval.evaluate(gbt_predictions_2)" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "#### We can also access the model's feature weights and intercepts easily" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "metadata": {}, 633 | "outputs": [], 634 | "source": [ 635 | "print('Model Intercept: ', cvModel.bestModel.intercept)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "weights = cvModel.bestModel.coefficients\n", 645 | "weights = [(float(w),) for w in weights] # convert numpy type to float, and to tuple\n", 646 | "weightsDF = sqlContext.createDataFrame(weights, [\"Feature Weight\"])\n", 647 | "display(weightsDF)" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "# View best model's predictions and probabilities of each prediction class\n", 657 | "selected = predictions.select(\"label\", \"prediction\", \"probability\", \"age\", \"occupation\")\n", 658 | "display(selected)" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": null, 664 | "metadata": {}, 665 | "outputs": [], 666 | "source": [ 667 | "# End Spark Session\n", 668 | "spark.stop()" 669 | ] 670 | } 671 | ], 672 | "metadata": { 673 | "kernelspec": { 674 | "display_name": "Python 3", 675 | "language": "python", 676 | "name": "python3" 677 | }, 678 | "language_info": { 679 | "codemirror_mode": { 680 | "name": "ipython", 681 | "version": 3 682 | }, 683 | "file_extension": ".py", 684 | "mimetype": "text/x-python", 685 | "name": "python", 686 | "nbconvert_exporter": "python", 687 | "pygments_lexer": "ipython3", 688 | "version": "3.7.6" 689 | } 690 | }, 691 | "nbformat": 4, 692 | "nbformat_minor": 4 693 | } 694 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Multi-class Text Classification Problem with PySpark and MLlib-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Multi-class Text Classification Problem with PySpark and MLlib" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Background\n", 15 | "Apache Spark is quickly gaining steam both in the headlines and real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for us to be able to stream and analyze it in real time" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "We use Spark Machine Learning Library (Spark MLlib) to solve this multi-class text classification problem" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 1. 
Load Data \n", 30 | "The dataset is from Kaggle [Link](https://www.kaggle.com/c/sf-crime/data)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "The problem is to classify 'Crime Description' into 33 categories" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 1, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import os\n", 47 | "import pandas as pd" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "#we use the findspark library to locate spark on our local machine\n", 57 | "import findspark\n", 58 | "findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "import pyspark # only run after findspark.init()\n", 68 | "from pyspark.sql import SparkSession, SQLContext\n", 69 | "from pyspark import SparkContext\n", 70 | "\n", 71 | "# Spark offers built-in packages to load CSV files\n", 72 | "sc =SparkContext()\n", 73 | "sqlContext = SQLContext(sc)\n", 74 | "data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('crime_train.csv')" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## 2. Data Preprocessing" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "Remove the columns we do not need and have a look the first five rows" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "drop_list = ['Dates', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']\n", 98 | "data = data.select([column for column in data.columns if column not in drop_list])" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 20, 104 | "metadata": {}, 105 | "outputs": [ 106 | { 107 | "name": "stdout", 108 | "output_type": "stream", 109 | "text": [ 110 | "root\n", 111 | " |-- Category: string (nullable = true)\n", 112 | " |-- Descript: string (nullable = true)\n", 113 | "\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "data.printSchema()" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 21, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "+--------------+--------------------+\n", 131 | "| Category| Descript|\n", 132 | "+--------------+--------------------+\n", 133 | "| WARRANTS| WARRANT ARREST|\n", 134 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|\n", 135 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|\n", 136 | "| LARCENY/THEFT|GRAND THEFT FROM ...|\n", 137 | "| LARCENY/THEFT|GRAND THEFT FROM ...|\n", 138 | "+--------------+--------------------+\n", 139 | "only showing top 5 rows\n", 140 | "\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "data.show(5)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 22, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "+--------------------+------+\n", 158 | "| Category| count|\n", 159 | "+--------------------+------+\n", 160 | "| LARCENY/THEFT|174900|\n", 161 | "| OTHER OFFENSES|126182|\n", 162 | "| NON-CRIMINAL| 92304|\n", 163 | "| ASSAULT| 76876|\n", 164 | "| DRUG/NARCOTIC| 53971|\n", 165 | "| VEHICLE 
THEFT| 53781|\n", 166 | "| VANDALISM| 44725|\n", 167 | "| WARRANTS| 42214|\n", 168 | "| BURGLARY| 36755|\n", 169 | "| SUSPICIOUS OCC| 31414|\n", 170 | "| MISSING PERSON| 25989|\n", 171 | "| ROBBERY| 23000|\n", 172 | "| FRAUD| 16679|\n", 173 | "|FORGERY/COUNTERFE...| 10609|\n", 174 | "| SECONDARY CODES| 9985|\n", 175 | "| WEAPON LAWS| 8555|\n", 176 | "| PROSTITUTION| 7484|\n", 177 | "| TRESPASS| 7326|\n", 178 | "| STOLEN PROPERTY| 4540|\n", 179 | "|SEX OFFENSES FORC...| 4388|\n", 180 | "+--------------------+------+\n", 181 | "only showing top 20 rows\n", 182 | "\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "# Top 20 crimes\n", 188 | "from pyspark.sql.functions import col\n", 189 | "data.groupBy(\"Category\") \\\n", 190 | " .count() \\\n", 191 | " .orderBy(col(\"count\").desc()) \\\n", 192 | " .show()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 23, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "+--------------------+-----+\n", 205 | "| Descript|count|\n", 206 | "+--------------------+-----+\n", 207 | "|GRAND THEFT FROM ...|60022|\n", 208 | "| LOST PROPERTY|31729|\n", 209 | "| BATTERY|27441|\n", 210 | "| STOLEN AUTOMOBILE|26897|\n", 211 | "|DRIVERS LICENSE, ...|26839|\n", 212 | "| WARRANT ARREST|23754|\n", 213 | "|SUSPICIOUS OCCURR...|21891|\n", 214 | "|AIDED CASE, MENTA...|21497|\n", 215 | "|PETTY THEFT FROM ...|19771|\n", 216 | "|MALICIOUS MISCHIE...|17789|\n", 217 | "| TRAFFIC VIOLATION|16471|\n", 218 | "|PETTY THEFT OF PR...|16196|\n", 219 | "|MALICIOUS MISCHIE...|15957|\n", 220 | "|THREATS AGAINST LIFE|14716|\n", 221 | "| FOUND PROPERTY|12146|\n", 222 | "|ENROUTE TO OUTSID...|11470|\n", 223 | "|GRAND THEFT OF PR...|11010|\n", 224 | "|POSSESSION OF NAR...|10050|\n", 225 | "|PETTY THEFT FROM ...|10029|\n", 226 | "|PETTY THEFT SHOPL...| 9571|\n", 227 | "+--------------------+-----+\n", 228 | "only showing top 20 rows\n", 229 | "\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "# Top 20 descriptions\n", 235 | "data.groupBy(\"Descript\") \\\n", 236 | " .count() \\\n", 237 | " .orderBy(col(\"count\").desc()) \\\n", 238 | " .show()" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "## 3. Create a Model\n", 246 | "### In Spark, we call create 'Model Pipeline', and we accomplish this in 5 steps\n", 247 | "1. regexTokenizer: Tokenization (with Regular Expression)\n", 248 | "2. stopwordsRemover: Remove Stop Words\n", 249 | "3. countVectors: Count vectors (“document-term vectors”)\n", 250 | "4. StringIndexer : encodes a string column of labels to a column of label indices. The indices are in (0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. \\\n", 251 | "(In our case, the label column (Category) will be encoded to label indices, from 0 to 32; the most frequent label (LARCENY/THEFT) will be indexed as 0)\n", 252 | "5. 
Split Train/Test data for making 'training-ready'" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 24, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, OneHotEncoder, StringIndexer, VectorAssembler, HashingTF, IDF\n", 262 | "from pyspark.ml.classification import LogisticRegression\n", 263 | "\n", 264 | "# Clean description using regular-expression tokenizer\n", 265 | "regexTokenizer = RegexTokenizer(inputCol=\"Descript\", outputCol=\"words\", pattern=\"\\\\W\")\n", 266 | "\n", 267 | "# exclue stop words\n", 268 | "add_stopwords = [\"http\",\"https\",\"amp\",\"rt\",\"t\",\"c\",\"the\"] \n", 269 | "stopwordsRemover = StopWordsRemover(inputCol=\"words\", outputCol=\"filtered\").setStopWords(add_stopwords)\n", 270 | "\n", 271 | "# bag-of-words count\n", 272 | "countVectors = CountVectorizer(inputCol=\"filtered\", outputCol=\"features\", vocabSize=10000, minDF=5)\n", 273 | "\n", 274 | "# Add Hashing\n", 275 | "# hashingTF = HashingTF(inputCol=\"filtered\", outputCol=\"rawFeatures\", numFeatures=10000)\n", 276 | "\n", 277 | "# TF and IDF\n", 278 | "# idf = IDF(inputCol=\"rawFeatures\", outputCol=\"features\", minDocFreq=5) #minDocFreq: remove sparse terms\n", 279 | "\n", 280 | "# StringIndexer\n", 281 | "label_stringIdx = StringIndexer(inputCol = \"Category\", outputCol = \"label\")" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 25, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "name": "stdout", 291 | "output_type": "stream", 292 | "text": [ 293 | "+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+\n", 294 | "| Category| Descript| words| filtered| rawFeatures| features|label|\n", 295 | "+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+\n", 296 | "| WARRANTS| WARRANT ARREST| [warrant, arrest]| [warrant, arrest]|(10000,[2279,3942...|(10000,[2279,3942...| 7.0|\n", 297 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|[traffic, violati...|[traffic, violati...|(10000,[604,3942,...|(10000,[604,3942,...| 1.0|\n", 298 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|[traffic, violati...|[traffic, violati...|(10000,[604,3942,...|(10000,[604,3942,...| 1.0|\n", 299 | "| LARCENY/THEFT|GRAND THEFT FROM ...|[grand, theft, fr...|[grand, theft, fr...|(10000,[274,713,3...|(10000,[274,713,3...| 0.0|\n", 300 | "| LARCENY/THEFT|GRAND THEFT FROM ...|[grand, theft, fr...|[grand, theft, fr...|(10000,[274,713,3...|(10000,[274,713,3...| 0.0|\n", 301 | "+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+\n", 302 | "only showing top 5 rows\n", 303 | "\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "from pyspark.ml import Pipeline\n", 309 | "\n", 310 | "# Put everything in pipeline (We use regexTokenizer, stopwordsRemover, hashingTF, idf, label_stringIdx)\n", 311 | "# you can use hasingTF and IDF alternatively than countVectors\n", 312 | "pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_stringIdx])\n", 313 | "\n", 314 | "# Fit the pipeline to training documents.\n", 315 | "pipelineFit = pipeline.fit(data)\n", 316 | "dataset = pipelineFit.transform(data)\n", 317 | "dataset.show(5)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 26, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 
326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | "Training Dataset Count: 658302\n", 330 | "Test Dataset Count: 219747\n" 331 | ] 332 | } 333 | ], 334 | "source": [ 335 | "# Split Train/Test data\n", 336 | "(trainingData, testData) = dataset.randomSplit([0.75, 0.25], seed = 623)\n", 337 | "print(\"Training Dataset Count: \" + str(trainingData.count()))\n", 338 | "print(\"Test Dataset Count: \" + str(testData.count()))" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## 4. Train a Model and Evaluation" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "Our model will make predictions and score on the test set \\\n", 353 | "And then we then look at the top 10 predictions from the highest probability." 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 31, 359 | "metadata": {}, 360 | "outputs": [ 361 | { 362 | "name": "stdout", 363 | "output_type": "stream", 364 | "text": [ 365 | "+------------------------------+-------------+------------------------------+-----+----------+\n", 366 | "| Descript| Category| probability|label|prediction|\n", 367 | "+------------------------------+-------------+------------------------------+-----+----------+\n", 368 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 369 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 370 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 371 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 372 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 373 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 374 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 375 | "|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.874103341787826,0.020162...| 0.0| 0.0|\n", 376 | "|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.874103341787826,0.020162...| 0.0| 0.0|\n", 377 | "|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.874103341787826,0.020162...| 0.0| 0.0|\n", 378 | "+------------------------------+-------------+------------------------------+-----+----------+\n", 379 | "only showing top 10 rows\n", 380 | "\n" 381 | ] 382 | } 383 | ], 384 | "source": [ 385 | "# We use Logistic-Regression model\n", 386 | "lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)\n", 387 | "\n", 388 | "lrModel = lr.fit(trainingData)\n", 389 | "predictions = lrModel.transform(testData)\n", 390 | "predictions.filter(predictions['prediction'] == 0) \\\n", 391 | " .select(\"Descript\",\"Category\",\"probability\",\"label\",\"prediction\") \\\n", 392 | " .orderBy(\"probability\", ascending=False) \\\n", 393 | " .show(n = 10, truncate = 30)" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 32, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "name": "stdout", 403 | "output_type": "stream", 404 | "text": [ 405 | "Test-set Accuracy is : 0.972745626745252\n" 406 | ] 407 | } 408 | ], 409 | "source": [ 410 | "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", 411 | "evaluator = MulticlassClassificationEvaluator(predictionCol=\"prediction\")\n", 412 | "print(\"Test-set Accuracy is : \", evaluator.evaluate(predictions))" 413 | ] 414 | }, 415 | { 
416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "## 5. Cross-Validation (hyper-parameters tuning)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "Let's try to improve our model by cross-validation, and we will tune the count vectors Logistic Regression" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 33, 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "# Same Pipeline step like above\n", 436 | "pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_stringIdx])\n", 437 | "pipelineFit = pipeline.fit(data)\n", 438 | "dataset = pipelineFit.transform(data)\n", 439 | "(trainingData, testData) = dataset.randomSplit([0.75, 0.25], seed = 623)\n", 440 | "\n", 441 | "# Create LR model\n", 442 | "lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": 34, 448 | "metadata": {}, 449 | "outputs": [], 450 | "source": [ 451 | "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", 452 | "\n", 453 | "# Create ParamGrid for Cross Validation\n", 454 | "paramGrid = (ParamGridBuilder()\n", 455 | " .addGrid(lr.regParam, [0.1, 0.3, 0.5]) # regularization parameter\n", 456 | " .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2]) # Elastic Net Parameter (Ridge = 0)\n", 457 | "# .addGrid(model.maxIter, [10, 20, 50]) #Number of iterations\n", 458 | "# .addGrid(idf.numFeatures, [10, 100, 1000]) # Number of features\n", 459 | " .build())\n", 460 | "\n", 461 | "# Create 5-fold CrossValidator\n", 462 | "cv = CrossValidator(estimator=lr, \\\n", 463 | " estimatorParamMaps=paramGrid, \\\n", 464 | " evaluator=evaluator, \\\n", 465 | " numFolds=5)\n", 466 | "cvModel = cv.fit(trainingData)\n", 467 | "\n", 468 | "predictions = cvModel.transform(testData)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 36, 474 | "metadata": {}, 475 | "outputs": [ 476 | { 477 | "name": "stdout", 478 | "output_type": "stream", 479 | "text": [ 480 | "Test-set Accuracy is : 0.9918902377560262\n" 481 | ] 482 | } 483 | ], 484 | "source": [ 485 | "# Evaluate best model\n", 486 | "evaluator = MulticlassClassificationEvaluator(predictionCol=\"prediction\")\n", 487 | "print(\"Test-set Accuracy is : \", evaluator.evaluate(predictions))" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "### We are able to acheive over 99% accuracy! 
" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "Reference\n", 502 | ">https://towardsdatascience.com/multi-class-text-classification-with-pyspark-7d78d022ed35" 503 | ] 504 | } 505 | ], 506 | "metadata": { 507 | "kernelspec": { 508 | "display_name": "Python 3", 509 | "language": "python", 510 | "name": "python3" 511 | }, 512 | "language_info": { 513 | "codemirror_mode": { 514 | "name": "ipython", 515 | "version": 3 516 | }, 517 | "file_extension": ".py", 518 | "mimetype": "text/x-python", 519 | "name": "python", 520 | "nbconvert_exporter": "python", 521 | "pygments_lexer": "ipython3", 522 | "version": "3.7.6" 523 | } 524 | }, 525 | "nbformat": 4, 526 | "nbformat_minor": 4 527 | } 528 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Multi-class classification using Decision Tree Problem with PySpark -checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Multi-class classification using Decision Tree Problem with PySpark " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Each year, San Francisco Airport (SFO) conducts a customer satisfaction survey to find out what they are doing well and where they can improve. The survey gauges satisfaction with SFO facilities, services, and amenities. SFO compares results to previous surveys to discover elements of the guest experience that are not satisfactory.\n", 15 | "\n", 16 | "The 2013 SFO Survey Results consists of customer responses to survey questions and an overall satisfaction rating with the airport. We investigated whether we could use machine learning to predict a customer's overall response given their responses to the individual questions. That in and of itself is not very useful because the customer has already provided an overall rating as well as individual ratings for various aspects of the airport such as parking, food quality and restroom cleanliness. However, we didn't stop at prediction instead we asked the question:\n", 17 | "\n", 18 | "What factors drove the customer to give the overall rating?" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Here is an outline of our data flow:\n", 26 | "\n", 27 | "- Load data: Load the data as a DataFrame\n", 28 | "- Understand the data: Compute statistics and create visualizations to get a better understanding of the data to see if we can use basic statistics to answer the question above.\n", 29 | "- Create Model On the training dataset:\n", 30 | "- Evaluate the model: Now look at the test dataset. 
Compare the initial model with the tuned model to see the benefit of tuning parameters.\n", 31 | "- Feature Importance: Determine the importance of each of the individual ratings in determining the overall rating by the customer" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "#we use the findspark library to locate spark on our local machine\n", 41 | "import findspark\n", 42 | "findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "import os\n", 52 | "import pandas as pd\n", 53 | "import numpy as np\n", 54 | "from datetime import date, timedelta, datetime\n", 55 | "import time\n", 56 | "\n", 57 | "import pyspark # only run this after findspark.init()\n", 58 | "from pyspark.sql import SparkSession, SQLContext\n", 59 | "from pyspark.context import SparkContext\n", 60 | "from pyspark.sql.functions import * \n", 61 | "from pyspark.sql.types import * " 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "## 1. Load Data \n", 69 | "This dataset is available as a public dataset from https://catalog.data.gov/dataset/2013-sfo-customer-survey-d3541." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# Initiate the Spark Session\n", 79 | "spark = SparkSession.builder.appName('Decision-Tree').getOrCreate()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 27, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/html": [ 90 | "\n", 91 | "
\n", 92 | "

SparkSession - in-memory

\n", 93 | " \n", 94 | "
\n", 95 | "

SparkContext

\n", 96 | "\n", 97 | "

Spark UI

\n", 98 | "\n", 99 | "
\n", 100 | "
Version
\n", 101 | "
v2.4.6
\n", 102 | "
Master
\n", 103 | "
local[*]
\n", 104 | "
AppName
\n", 105 | "
Decision-Tree
\n", 106 | "
\n", 107 | "
\n", 108 | " \n", 109 | "
\n", 110 | " " 111 | ], 112 | "text/plain": [ 113 | "" 114 | ] 115 | }, 116 | "execution_count": 27, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "spark" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 5, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "survey = spark.read.csv(\"./data/2013_SFO_Customer_Survey.csv\", header=\"true\", inferSchema=\"true\")" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 6, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/plain": [ 142 | "DataFrame[RESPNUM: int, CCGID: int, RUN: int, INTDATE: int, GATE: int, STRATA: int, PEAK: int, METHOD: int, AIRLINE: int, FLIGHT: int, DEST: int, DESTGEO: int, DESTMARK: int, ARRTIME: string, DEPTIME: string, Q2PURP1: int, Q2PURP2: int, Q2PURP3: int, Q2PURP4: int, Q2PURP5: int, Q2PURP6: string, Q3GETTO1: int, Q3GETTO2: int, Q3GETTO3: int, Q3GETTO4: int, Q3GETTO5: string, Q3GETTO6: string, Q3PARK: int, Q4BAGS: int, Q4BUY: int, Q4FOOD: int, Q4WIFI: int, Q5FLYPERYR: int, Q6TENURE: double, SAQ: int, Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, Q8COM1: int, Q8COM2: int, Q8COM3: int, Q9A_CLNBOARD: int, Q9B_CLNAIRTRAIN: int, Q9C_CLNRENT: int, Q9D_CLNFOOD: int, Q9E_CLNBATH: int, Q9F_CLNWHOLE: int, Q9COM1: int, Q9COM2: int, Q9COM3: int, Q10SAFE: int, Q10COM1: int, Q10COM2: int, Q10COM3: int, Q11A_USEWEB: int, Q11B_USESFOAPP: int, Q11C_USEOTHAPP: int, Q11D_USESOCMED: int, Q11E_USEWIFI: int, Q12COM1: int, Q12COM2: int, Q12COM3: int, Q13_WHEREDEPART: int, Q13_RATEGETTO: int, Q14A_FIND: int, Q14B_SECURITY: int, Q15_PROBLEMS: int, Q15COM1: int, Q15COM2: int, Q15COM3: int, Q16_REGION: int, Q17_CITY: string, Q17_ZIP: int, Q17_COUNTRY: string, HOME: int, Q18_AGE: int, Q19_SEX: int, Q20_INCOME: int, Q21_HIFLYER: int, Q22A_USESJC: int, Q22B_USEOAK: int, LANG: int, WEIGHT: double]" 143 | ] 144 | }, 145 | "metadata": {}, 146 | "output_type": "display_data" 147 | } 148 | ], 149 | "source": [ 150 | "display(survey)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 7, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "name": "stdout", 160 | "output_type": "stream", 161 | "text": [ 162 | "root\n", 163 | " |-- RESPNUM: integer (nullable = true)\n", 164 | " |-- CCGID: integer (nullable = true)\n", 165 | " |-- RUN: integer (nullable = true)\n", 166 | " |-- INTDATE: integer (nullable = true)\n", 167 | " |-- GATE: integer (nullable = true)\n", 168 | " |-- STRATA: integer (nullable = true)\n", 169 | " |-- PEAK: integer (nullable = true)\n", 170 | " |-- METHOD: integer (nullable = true)\n", 171 | " |-- AIRLINE: integer (nullable = true)\n", 172 | " |-- FLIGHT: integer (nullable = true)\n", 173 | " |-- DEST: integer (nullable = true)\n", 174 | " |-- DESTGEO: integer (nullable = true)\n", 175 | " |-- DESTMARK: integer (nullable = true)\n", 176 | " |-- ARRTIME: string (nullable = true)\n", 177 | " |-- DEPTIME: string (nullable = true)\n", 178 | " |-- Q2PURP1: integer (nullable = true)\n", 179 | " |-- Q2PURP2: integer (nullable = true)\n", 180 | " |-- Q2PURP3: integer (nullable = true)\n", 181 | " |-- Q2PURP4: integer (nullable = true)\n", 182 | " |-- Q2PURP5: integer (nullable = true)\n", 183 | " |-- Q2PURP6: string (nullable = true)\n", 184 | " |-- Q3GETTO1: integer 
(nullable = true)\n", 185 | " |-- Q3GETTO2: integer (nullable = true)\n", 186 | " |-- Q3GETTO3: integer (nullable = true)\n", 187 | " |-- Q3GETTO4: integer (nullable = true)\n", 188 | " |-- Q3GETTO5: string (nullable = true)\n", 189 | " |-- Q3GETTO6: string (nullable = true)\n", 190 | " |-- Q3PARK: integer (nullable = true)\n", 191 | " |-- Q4BAGS: integer (nullable = true)\n", 192 | " |-- Q4BUY: integer (nullable = true)\n", 193 | " |-- Q4FOOD: integer (nullable = true)\n", 194 | " |-- Q4WIFI: integer (nullable = true)\n", 195 | " |-- Q5FLYPERYR: integer (nullable = true)\n", 196 | " |-- Q6TENURE: double (nullable = true)\n", 197 | " |-- SAQ: integer (nullable = true)\n", 198 | " |-- Q7A_ART: integer (nullable = true)\n", 199 | " |-- Q7B_FOOD: integer (nullable = true)\n", 200 | " |-- Q7C_SHOPS: integer (nullable = true)\n", 201 | " |-- Q7D_SIGNS: integer (nullable = true)\n", 202 | " |-- Q7E_WALK: integer (nullable = true)\n", 203 | " |-- Q7F_SCREENS: integer (nullable = true)\n", 204 | " |-- Q7G_INFOARR: integer (nullable = true)\n", 205 | " |-- Q7H_INFODEP: integer (nullable = true)\n", 206 | " |-- Q7I_WIFI: integer (nullable = true)\n", 207 | " |-- Q7J_ROAD: integer (nullable = true)\n", 208 | " |-- Q7K_PARK: integer (nullable = true)\n", 209 | " |-- Q7L_AIRTRAIN: integer (nullable = true)\n", 210 | " |-- Q7M_LTPARK: integer (nullable = true)\n", 211 | " |-- Q7N_RENTAL: integer (nullable = true)\n", 212 | " |-- Q7O_WHOLE: integer (nullable = true)\n", 213 | " |-- Q8COM1: integer (nullable = true)\n", 214 | " |-- Q8COM2: integer (nullable = true)\n", 215 | " |-- Q8COM3: integer (nullable = true)\n", 216 | " |-- Q9A_CLNBOARD: integer (nullable = true)\n", 217 | " |-- Q9B_CLNAIRTRAIN: integer (nullable = true)\n", 218 | " |-- Q9C_CLNRENT: integer (nullable = true)\n", 219 | " |-- Q9D_CLNFOOD: integer (nullable = true)\n", 220 | " |-- Q9E_CLNBATH: integer (nullable = true)\n", 221 | " |-- Q9F_CLNWHOLE: integer (nullable = true)\n", 222 | " |-- Q9COM1: integer (nullable = true)\n", 223 | " |-- Q9COM2: integer (nullable = true)\n", 224 | " |-- Q9COM3: integer (nullable = true)\n", 225 | " |-- Q10SAFE: integer (nullable = true)\n", 226 | " |-- Q10COM1: integer (nullable = true)\n", 227 | " |-- Q10COM2: integer (nullable = true)\n", 228 | " |-- Q10COM3: integer (nullable = true)\n", 229 | " |-- Q11A_USEWEB: integer (nullable = true)\n", 230 | " |-- Q11B_USESFOAPP: integer (nullable = true)\n", 231 | " |-- Q11C_USEOTHAPP: integer (nullable = true)\n", 232 | " |-- Q11D_USESOCMED: integer (nullable = true)\n", 233 | " |-- Q11E_USEWIFI: integer (nullable = true)\n", 234 | " |-- Q12COM1: integer (nullable = true)\n", 235 | " |-- Q12COM2: integer (nullable = true)\n", 236 | " |-- Q12COM3: integer (nullable = true)\n", 237 | " |-- Q13_WHEREDEPART: integer (nullable = true)\n", 238 | " |-- Q13_RATEGETTO: integer (nullable = true)\n", 239 | " |-- Q14A_FIND: integer (nullable = true)\n", 240 | " |-- Q14B_SECURITY: integer (nullable = true)\n", 241 | " |-- Q15_PROBLEMS: integer (nullable = true)\n", 242 | " |-- Q15COM1: integer (nullable = true)\n", 243 | " |-- Q15COM2: integer (nullable = true)\n", 244 | " |-- Q15COM3: integer (nullable = true)\n", 245 | " |-- Q16_REGION: integer (nullable = true)\n", 246 | " |-- Q17_CITY: string (nullable = true)\n", 247 | " |-- Q17_ZIP: integer (nullable = true)\n", 248 | " |-- Q17_COUNTRY: string (nullable = true)\n", 249 | " |-- HOME: integer (nullable = true)\n", 250 | " |-- Q18_AGE: integer (nullable = true)\n", 251 | " |-- Q19_SEX: integer (nullable = true)\n", 
252 | " |-- Q20_INCOME: integer (nullable = true)\n", 253 | " |-- Q21_HIFLYER: integer (nullable = true)\n", 254 | " |-- Q22A_USESJC: integer (nullable = true)\n", 255 | " |-- Q22B_USEOAK: integer (nullable = true)\n", 256 | " |-- LANG: integer (nullable = true)\n", 257 | " |-- WEIGHT: double (nullable = true)\n", 258 | "\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "survey.printSchema()" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "As you can see above there are many questions in the survey including what airline the customer flew on, where do they live, etc. For the purposes of answering the above, focus on the Q7A, Q7B, Q7C .. Q7O questions since they directly related to customer satisfaction, which is what you want to measure. If you drill down on those variables you get the following:\n", 271 | "\n", 272 | "|Column Name|Data Type|Description|\n", 273 | "| --- | --- | --- |\n", 274 | "|Q7B_FOOD|INTEGER|Restaurants|\n", 275 | "|Q7C_SHOPS|INTEGER|Retail shops and concessions|\n", 276 | "|Q7D_SIGNS|INTEGER|Signs and Directions inside SFO|\n", 277 | "|Q7E_WALK|INTEGER|Escalators / elevators / moving walkways|\n", 278 | "|Q7F_SCREENS|INTEGER|Information on screens and monitors|\n", 279 | "|Q7G_INFOARR|INTEGER|Information booth near arrivals area|\n", 280 | "|Q7H_INFODEP|INTEGER|Information booth near departure areas|\n", 281 | "|Q7I_WIFI|INTEGER|Airport WiFi|\n", 282 | "|Q7J_ROAD|INTEGER|Signs and directions on SFO airport roadways|\n", 283 | "|Q7K_PARK|INTEGER|Airport parking facilities|\n", 284 | "|Q7L_AIRTRAIN|INTEGER|AirTrain|\n", 285 | "|Q7M_LTPARK|INTEGER|Long term parking lot shuttle|\n", 286 | "|Q7N_RENTAL|INTEGER|Airport rental car center|\n", 287 | "|Q7O_WHOLE|INTEGER|SFO Airport as a whole|\n", 288 | "\n", 289 | "Q7O_WHOLE is the target variable \n", 290 | "\n", 291 | "The possible values for the above are:\n", 292 | "\n", 293 | "**0 = no answer, 1 = Unacceptable, 2 = Below Average, 3 = Average, 4 = Good, 5 = Outstanding, 6 = Not visited or not applicable**\n", 294 | "\n", 295 | "Select only the fields we are interested in." 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 8, 301 | "metadata": {}, 302 | "outputs": [], 303 | "source": [ 304 | "dataset = survey.select(\"Q7A_ART\", \"Q7B_FOOD\", \"Q7C_SHOPS\", \"Q7D_SIGNS\", \"Q7E_WALK\", \"Q7F_SCREENS\", \"Q7G_INFOARR\", \"Q7H_INFODEP\", \"Q7I_WIFI\", \"Q7J_ROAD\", \"Q7K_PARK\", \"Q7L_AIRTRAIN\", \"Q7M_LTPARK\", \"Q7N_RENTAL\", \"Q7O_WHOLE\")" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "Let's get some basic statistics such as looking at the **average of each column**." 
312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 9, 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "data": { 321 | "text/plain": [ 322 | "\"'missingValues(Q7A_ART) Q7A_ART', 'missingValues(Q7B_FOOD) Q7B_FOOD', 'missingValues(Q7C_SHOPS) Q7C_SHOPS', 'missingValues(Q7D_SIGNS) Q7D_SIGNS', 'missingValues(Q7E_WALK) Q7E_WALK', 'missingValues(Q7F_SCREENS) Q7F_SCREENS', 'missingValues(Q7G_INFOARR) Q7G_INFOARR', 'missingValues(Q7H_INFODEP) Q7H_INFODEP', 'missingValues(Q7I_WIFI) Q7I_WIFI', 'missingValues(Q7J_ROAD) Q7J_ROAD', 'missingValues(Q7K_PARK) Q7K_PARK', 'missingValues(Q7L_AIRTRAIN) Q7L_AIRTRAIN', 'missingValues(Q7M_LTPARK) Q7M_LTPARK', 'missingValues(Q7N_RENTAL) Q7N_RENTAL', 'missingValues(Q7O_WHOLE) Q7O_WHOLE'\"" 323 | ] 324 | }, 325 | "execution_count": 9, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "a = map(lambda s: \"'missingValues(\" + s +\") \" + s + \"'\",[\"Q7A_ART\", \"Q7B_FOOD\", \"Q7C_SHOPS\", \"Q7D_SIGNS\", \"Q7E_WALK\", \"Q7F_SCREENS\", \"Q7G_INFOARR\", \"Q7H_INFODEP\", \"Q7I_WIFI\", \"Q7J_ROAD\", \"Q7K_PARK\", \"Q7L_AIRTRAIN\", \"Q7M_LTPARK\", \"Q7N_RENTAL\", \"Q7O_WHOLE\"])\n", 332 | "\", \".join(a)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "Let's start with the overall rating." 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 10, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/plain": [ 350 | "[Row(Q7O_WHOLE=3.8743988684582744)]" 351 | ] 352 | }, 353 | "execution_count": 10, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "from pyspark.sql.functions import *\n", 360 | "dataset.selectExpr('avg(Q7O_WHOLE) Q7O_WHOLE').take(1)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "The overall rating is only 3.87, so slightly above average. 
Let's get the averages of the constituent ratings:" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 11, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "data": { 377 | "text/plain": [ 378 | "DataFrame[Q7A_ART: double, Q7B_FOOD: double, Q7C_SHOPS: double, Q7D_SIGNS: double, Q7E_WALK: double, Q7F_SCREENS: double, Q7G_INFOARR: double, Q7H_INFODEP: double, Q7I_WIFI: double, Q7J_ROAD: double, Q7K_PARK: double, Q7L_AIRTRAIN: double, Q7M_LTPARK: double, Q7N_RENTAL: double]" 379 | ] 380 | }, 381 | "metadata": {}, 382 | "output_type": "display_data" 383 | } 384 | ], 385 | "source": [ 386 | "avgs = dataset.selectExpr('avg(Q7A_ART) Q7A_ART', 'avg(Q7B_FOOD) Q7B_FOOD', 'avg(Q7C_SHOPS) Q7C_SHOPS', 'avg(Q7D_SIGNS) Q7D_SIGNS', 'avg(Q7E_WALK) Q7E_WALK', 'avg(Q7F_SCREENS) Q7F_SCREENS', 'avg(Q7G_INFOARR) Q7G_INFOARR', 'avg(Q7H_INFODEP) Q7H_INFODEP', 'avg(Q7I_WIFI) Q7I_WIFI', 'avg(Q7J_ROAD) Q7J_ROAD', 'avg(Q7K_PARK) Q7K_PARK', 'avg(Q7L_AIRTRAIN) Q7L_AIRTRAIN', 'avg(Q7M_LTPARK) Q7M_LTPARK', 'avg(Q7N_RENTAL) Q7N_RENTAL')\n", 387 | "display(avgs)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 12, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/plain": [ 398 | "DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int]" 399 | ] 400 | }, 401 | "metadata": {}, 402 | "output_type": "display_data" 403 | } 404 | ], 405 | "source": [ 406 | "display(dataset)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "So basic statistics can't seem to answer the question: **What factors drove the customer to give the overall rating?**\n", 414 | "\n", 415 | "So let's try to use a predictive algorithm to see if these individual ratings can be used to predict an overall rating." 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "## 2. Create a Model" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "First need to treat responses of 0 = No Answer and 6 = Not Visited or Not Applicable as missing values. One of the ways you can do this is a technique called mean impute which is when we use the mean of the column as a replacement for the missing value. You can use a replace function to set all values of 0 or 6 to the average rating of 3. You also need a label column of type double so do that as well." 
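The next cell implements this with a constant: every 0 or 6 is replaced by 3 as a stand-in for the column mean. If a literal per-column mean impute is preferred, one possible sketch (my own illustration, not part of the notebook; it assumes the `dataset` DataFrame selected earlier and Spark 2.2+ for `Imputer`) is:

```python
# A hedged sketch, not part of the original notebook: a literal per-column mean impute.
# Treat 0 (no answer) and 6 (not visited / not applicable) as missing, then let
# Imputer fill each rating column with that column's own mean.
from pyspark.sql import functions as F
from pyspark.ml.feature import Imputer

rating_cols = dataset.columns                      # the Q7A_ART ... Q7O_WHOLE columns
with_nulls = dataset
for c in rating_cols:
    # null out the "missing" codes and cast to double, which Imputer expects
    with_nulls = with_nulls.withColumn(
        c, F.when(F.col(c).isin(0, 6), None).otherwise(F.col(c)).cast("double"))

imputer = Imputer(strategy="mean",
                  inputCols=rating_cols,
                  outputCols=[c + "_imp" for c in rating_cols])
imputed = imputer.fit(with_nulls).transform(with_nulls)
```

The rest of the notebook keeps the simpler constant replacement shown in the next cell.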
430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": 13, 435 | "metadata": {}, 436 | "outputs": [], 437 | "source": [ 438 | "training = dataset.withColumn(\"label\", dataset['Q7O_WHOLE']*1.0).na.replace(0,3).replace(6,3)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 14, 444 | "metadata": {}, 445 | "outputs": [ 446 | { 447 | "data": { 448 | "text/plain": [ 449 | "DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, label: double]" 450 | ] 451 | }, 452 | "metadata": {}, 453 | "output_type": "display_data" 454 | } 455 | ], 456 | "source": [ 457 | "display(training)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "##### Create 'Model Pipeline'" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 15, 470 | "metadata": {}, 471 | "outputs": [], 472 | "source": [ 473 | "from pyspark.ml import Pipeline\n", 474 | "from pyspark.ml.feature import VectorAssembler\n", 475 | "from pyspark.ml.regression import DecisionTreeRegressor\n", 476 | "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", 477 | "from pyspark.ml.evaluation import RegressionEvaluator\n", 478 | "\n", 479 | "inputCols = ['Q7A_ART', 'Q7B_FOOD', 'Q7C_SHOPS', 'Q7D_SIGNS', 'Q7E_WALK', 'Q7F_SCREENS', 'Q7G_INFOARR', 'Q7H_INFODEP', 'Q7I_WIFI', 'Q7J_ROAD', 'Q7K_PARK', 'Q7L_AIRTRAIN', 'Q7M_LTPARK', 'Q7N_RENTAL']\n", 480 | "va = VectorAssembler(inputCols=inputCols,outputCol=\"features\")\n", 481 | "dt = DecisionTreeRegressor(labelCol=\"label\", featuresCol=\"features\", maxDepth=4)\n", 482 | "evaluator = RegressionEvaluator(metricName = \"rmse\", labelCol=\"label\")\n", 483 | "grid = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 7, 10]).build()\n", 484 | "cv = CrossValidator(estimator=dt, estimatorParamMaps=grid, evaluator=evaluator, numFolds = 10)\n", 485 | "pipeline = Pipeline(stages=[va, dt])" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "## 3. 
Train a Model" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 16, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "model = pipeline.fit(training)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 17, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/plain": [ 512 | "DecisionTreeRegressionModel (uid=DecisionTreeRegressor_73754465424e) of depth 4 with 31 nodes" 513 | ] 514 | }, 515 | "metadata": {}, 516 | "output_type": "display_data" 517 | } 518 | ], 519 | "source": [ 520 | "display(model.stages[-1])" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 18, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/plain": [ 531 | "DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, label: double, features: vector, prediction: double]" 532 | ] 533 | }, 534 | "metadata": {}, 535 | "output_type": "display_data" 536 | } 537 | ], 538 | "source": [ 539 | "predictions = model.transform(training)\n", 540 | "display(predictions)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "## 4. Evaluate the model" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 19, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/plain": [ 558 | "0.555808023551782" 559 | ] 560 | }, 561 | "execution_count": 19, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "from pyspark.ml.evaluation import RegressionEvaluator\n", 568 | "\n", 569 | "evaluator = RegressionEvaluator()\n", 570 | "\n", 571 | "evaluator.evaluate(predictions, {evaluator.metricName: \"rmse\"})" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "## 5. Save the model" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": null, 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [ 587 | "import uuid\n", 588 | "model_save_path = f\"/tmp/sfo_survey_model/{str(uuid.uuid4())}\"\n", 589 | "model.write().overwrite().save(model_save_path)" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "## 6. Feature Importance\n", 597 | "Feature importance is a measure of information gain. It is scaled from 0.0 to 1.0. As an example, feature 1 in the example above is rated as 0.0826 or 8.26% of the total importance for all the features." 
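Because the importances are indexed in the same order as the `VectorAssembler` input columns, they can be paired with readable names. A hedged sketch (not from the original notebook; it assumes the fitted `model` pipeline and the `inputCols` list defined above, and wraps `zip` in `list()` because `zip` is lazy on Python 3):

```python
# A hedged sketch, not from the original notebook: pair each importance with the
# assembler input column it belongs to.
from pyspark.sql import functions as F

dt_model = model.stages[-1]   # the DecisionTreeRegressionModel at the end of the pipeline
pairs = list(zip(inputCols, dt_model.featureImportances.toArray().tolist()))
importance_df = spark.createDataFrame(pairs, ["Feature", "Importance"])
importance_df.orderBy(F.desc("Importance")).show(truncate=False)
```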
598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 21, 603 | "metadata": {}, 604 | "outputs": [ 605 | { 606 | "data": { 607 | "text/plain": [ 608 | "SparseVector(14, {0: 0.0653, 1: 0.1173, 2: 0.0099, 3: 0.5219, 4: 0.0052, 5: 0.2403, 8: 0.0028, 10: 0.0059, 13: 0.0314})" 609 | ] 610 | }, 611 | "execution_count": 21, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "model.stages[1].featureImportances" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 22, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "featureImportance = model.stages[1].featureImportances.toArray()\n", 627 | "featureNames = map(lambda s: s.name, dataset.schema.fields)\n", 628 | "featureImportanceMap = zip(featureImportance, featureNames)" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 23, 634 | "metadata": {}, 635 | "outputs": [ 636 | { 637 | "data": { 638 | "text/plain": [ 639 | "" 640 | ] 641 | }, 642 | "execution_count": 23, 643 | "metadata": {}, 644 | "output_type": "execute_result" 645 | } 646 | ], 647 | "source": [ 648 | "featureImportanceMap" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": null, 654 | "metadata": {}, 655 | "outputs": [], 656 | "source": [ 657 | "importancesDf = spark.createDataFrame(spark.parallelize(featureImportanceMap).map(lambda r: [r[1], float(r[0])]))\n", 658 | "\n", 659 | "importancesDf = importancesDf.withColumnRenamed(\"_1\", \"Feature\").withColumnRenamed(\"_2\", \"Importance\")" 660 | ] 661 | }, 662 | { 663 | "cell_type": "markdown", 664 | "metadata": {}, 665 | "source": [ 666 | "Let's convert this to a DataFrame so you can view it and save it so other users can rely on this information." 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [ 675 | "display(importancesDf.orderBy(desc(\"Importance\")))" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "As you can see below, the 3 most important features are:\n", 683 | "\n", 684 | "- Signs\n", 685 | "- Screens\n", 686 | "- Food\n", 687 | "\n", 688 | "This is useful information for the airport management. It means that people want to first know where they are going. Second, they check the airport screens and monitors so they can find their gate and be on time for their flight. Third, they like to have good quality food.\n", 689 | "\n", 690 | "This is especially interesting considering that taking the average of these feature variables told us nothing about the importance of the variables in determining the overall rating by the survey responder.\n", 691 | "\n", 692 | "These 3 features combine to make up **65**% of the overall rating." 
693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": {}, 699 | "outputs": [], 700 | "source": [ 701 | "importancesDf.orderBy(desc(\"Importance\")).limit(3).agg(sum(\"Importance\")).take(1)" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "metadata": {}, 708 | "outputs": [], 709 | "source": [ 710 | "# See it in Piechart\n", 711 | "display(importancesDf.orderBy(desc(\"Importance\")))" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "display(importancesDf.orderBy(desc(\"Importance\")).limit(5))" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "## 7. Conclusion\n", 728 | "So if you run SFO, artwork and shopping are nice-to-haves but signs, monitors, and food are what keep airport customers happy!" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [ 737 | "# delete saved model\n", 738 | "dbutils.fs.rm(model_save_path, True)" 739 | ] 740 | } 741 | ], 742 | "metadata": { 743 | "kernelspec": { 744 | "display_name": "Python 3", 745 | "language": "python", 746 | "name": "python3" 747 | }, 748 | "language_info": { 749 | "codemirror_mode": { 750 | "name": "ipython", 751 | "version": 3 752 | }, 753 | "file_extension": ".py", 754 | "mimetype": "text/x-python", 755 | "name": "python", 756 | "nbconvert_exporter": "python", 757 | "pygments_lexer": "ipython3", 758 | "version": "3.7.6" 759 | } 760 | }, 761 | "nbformat": 4, 762 | "nbformat_minor": 4 763 | } 764 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/PySpark Dataframe Complete Guide (with COVID-19 Dataset)-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# PySpark Dataframe Complete Guide (with COVID-19 Dataset)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Spark which is one of the most used tools when it comes to working with Big Data." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "While once upon a time Spark used to be heavily reliant on RDD manipulations, Spark has now provided a DataFrame API for us Data Scientists to work with. [Doc](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html#).\n", 22 | "\n", 23 | "**Yay~!**" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "In this notebook, We will learn standard Spark functionalities needed to work with DataFrames, and finally some tips to handle the inevitable errors you will face." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "I'm going to skip the Spark Installation part in th for the sake of the notebook, so please go to [Apache Spark Website](http://spark.apache.org/downloads.html) to install Spark that are right to your work setting." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "## Data\n", 45 | " We will be working with the Data Science for COVID-19 in South Korea, which is one of the most detailed datasets on the internet for COVID." 
46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "Data can be found in this kaggle URL [Link](https://www.kaggle.com/kimjihoo/coronavirusdataset)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "### 1. Basic Functions" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "#### [1] Load (Read) the data" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "cases = spark.read.load(\"./data/Case.csv\",\n", 76 | " format=\"csv\", \n", 77 | " sep=\",\", \n", 78 | " inferSchema=\"true\", \n", 79 | " header=\"true\")" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "# First few rows in the file\n", 89 | "cases.show()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "It looks ok right now, but sometimes as we the number of columns increases, the formatting becomes not too great. I have noticed that the following trick helps in displaying in pandas format in my Jupyter Notebook. \n", 97 | "\n", 98 | "The **.toPandas()** function converts a **Spark Dataframe** into a **Pandas Dataframe**, which is much easier to play with." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "cases.limit(10).toPandas()" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "#### [2] Change Column Names" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "To change a single column," 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "cases = cases.withColumnRenamed(\"infection_case\",\"infection_source\")" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "To change all columns," 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "cases = cases.toDF(*['case_id', 'province', 'city', 'group', 'infection_case', 'confirmed',\n", 147 | " 'latitude', 'longitude'])" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "#### [3] Change Column Names" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "We can select a subset of columns using the **select** " 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "cases = cases.select('province','city','infection_case','confirmed')\n", 171 | "cases.show()" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "#### [4] Sort by Column" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "# Simple sort\n", 188 | "cases.sort(\"confirmed\").show()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "# Descending Sort\n", 198 | "from pyspark.sql import 
functions as F\n", 199 | "\n", 200 | "cases.sort(F.desc(\"confirmed\")).show()" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "#### [5] Change Column Type" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "from pyspark.sql.types import DoubleType, IntegerType, StringType\n", 217 | "\n", 218 | "cases = cases.withColumn('confirmed', F.col('confirmed').cast(IntegerType()))\n", 219 | "cases = cases.withColumn('city', F.col('city').cast(StringType()))" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "#### [6] Filter " 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "We can filter a data frame using multiple conditions using AND(&), OR(|) and NOT(~) conditions. For example, we may want to find out all the different infection_case in Daegu with more than 10 confirmed cases." 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "cases.filter((cases.confirmed>10) & (cases.province=='Daegu')).show()" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "#### [7] GroupBy" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "from pyspark.sql import functions as F\n", 259 | "\n", 260 | "cases.groupBy([\"province\",\"city\"]).agg(F.sum(\"confirmed\") ,F.max(\"confirmed\")).show()" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "Or if we don’t like the new column names, we can use the **alias** keyword to rename columns in the agg command itself." 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "cases.groupBy([\"province\",\"city\"]).agg(\n", 277 | " F.sum(\"confirmed\").alias(\"TotalConfirmed\"),\\\n", 278 | " F.max(\"confirmed\").alias(\"MaxFromOneConfirmedCase\")\\\n", 279 | " ).show()" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "#### [8] Joins" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "Here, We will go with the region file which contains region information such as elementary_school_count, elderly_population_ratio, etc." 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "regions = spark.read.load(\"./data/Region.csv\",\n", 303 | " format=\"csv\", \n", 304 | " sep=\",\", \n", 305 | " inferSchema=\"true\", \n", 306 | " header=\"true\")\n", 307 | "\n", 308 | "regions.limit(10).toPandas()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# Left Join 'Case' with 'Region' on Province and City column\n", 318 | "cases = cases.join(regions, ['province','city'],how='left')\n", 319 | "cases.limit(10).toPandas()" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "### 2. 
Use SQL with DataFrames" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "We first register the cases dataframe to a temporary table cases_table on which we can run SQL operations. As you can see, the result of the SQL select statement is again a Spark Dataframe.\n", 334 | "\n", 335 | "All complex SQL queries like GROUP BY, HAVING, AND ORDER BY clauses can be applied in 'Sql' function" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "cases.registerTempTable('cases_table')\n", 345 | "newDF = sqlContext.sql('select * from cases_table where confirmed > 100')\n", 346 | "newDF.show()" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "### 3. Create New Columns" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "There are many ways that you can use to create a column in a PySpark Dataframe." 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "#### [1] Using Spark Native Functions" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "We can use .withcolumn along with PySpark SQL functions to create a new column. In essence, you can find String functions, Date functions, and Math functions already implemented using Spark functions. Our first function, the F.col function gives us access to the column. So if we wanted to add 100 to a column, we could use F.col as:" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [ 383 | "import pyspark.sql.functions as F\n", 384 | "\n", 385 | "casesWithNewConfirmed = cases.withColumn(\"NewConfirmed\", 100 + F.col(\"confirmed\"))\n", 386 | "casesWithNewConfirmed.show()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "We can also use math functions like F.exp function:" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "casesWithExpConfirmed = cases.withColumn(\"ExpConfirmed\", F.exp(\"confirmed\"))\n", 403 | "casesWithExpConfirmed.show()" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "#### [2] Using Spark UDFs" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "Sometimes we want to do complicated things to a column or multiple columns. This could be thought of as a map operation on a PySpark Dataframe to a single column or multiple columns. While Spark SQL functions do solve many use cases when it comes to column creation, I use Spark UDF whenever I need more matured Python functionality. \\\n", 418 | "\n", 419 | "To use Spark UDFs, we need to use the F.udf function to convert a regular python function to a Spark UDF. We also need to specify the return type of the function. 
In this example the return type is StringType()" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "import pyspark.sql.functions as F\n", 429 | "from pyspark.sql.types import *\n", 430 | "\n", 431 | "def casesHighLow(confirmed):\n", 432 | " if confirmed < 50: \n", 433 | " return 'low'\n", 434 | " else:\n", 435 | " return 'high'\n", 436 | " \n", 437 | "#convert to a UDF Function by passing in the function and return type of function\n", 438 | "casesHighLowUDF = F.udf(casesHighLow, StringType())\n", 439 | "CasesWithHighLow = cases.withColumn(\"HighLow\", casesHighLowUDF(\"confirmed\"))\n", 440 | "CasesWithHighLow.show()" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "#### [3] Using Pandas UDF" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "This allows you to use pandas functionality with Spark. I generally use it when I have to run a groupBy operation on a Spark dataframe or whenever I need to create rolling features\n", 455 | " \n", 456 | "The way we use it is by using the F.pandas_udf decorator. **We assume here that the input to the function will be a pandas data frame**\n", 457 | "\n", 458 | "The only complexity here is that we have to provide a schema for the output Dataframe. We can use the original schema of a dataframe to create the outSchema." 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "cases.printSchema()" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "from pyspark.sql.types import IntegerType, StringType, DoubleType, BooleanType\n", 477 | "from pyspark.sql.types import StructType, StructField\n", 478 | "\n", 479 | "# Declare the schema for the output of our function\n", 480 | "\n", 481 | "outSchema = StructType([StructField('case_id',IntegerType(),True),\n", 482 | " StructField('province',StringType(),True),\n", 483 | " StructField('city',StringType(),True),\n", 484 | " StructField('group',BooleanType(),True),\n", 485 | " StructField('infection_case',StringType(),True),\n", 486 | " StructField('confirmed',IntegerType(),True),\n", 487 | " StructField('latitude',StringType(),True),\n", 488 | " StructField('longitude',StringType(),True),\n", 489 | " StructField('normalized_confirmed',DoubleType(),True)\n", 490 | " ])\n", 491 | "# decorate our function with pandas_udf decorator\n", 492 | "@F.pandas_udf(outSchema, F.PandasUDFType.GROUPED_MAP)\n", 493 | "def subtract_mean(pdf):\n", 494 | " # pdf is a pandas.DataFrame\n", 495 | " v = pdf.confirmed\n", 496 | " v = v - v.mean()\n", 497 | " pdf['normalized_confirmed'] = v\n", 498 | " return pdf\n", 499 | "\n", 500 | "confirmed_groupwise_normalization = cases.groupby(\"infection_case\").apply(subtract_mean)\n", 501 | "\n", 502 | "confirmed_groupwise_normalization.limit(10).toPandas()" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "### 4. Spark Window Functions" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "We will simply look at some of the most important and useful window functions available." 
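Before the individual examples, it may help to see the pattern that every window call below shares: a window spec (grouping, ordering, and an optional frame) plus a function applied `.over()` it. A hedged illustration (my own, not from the notebook; `timeprovince` is only loaded in the next cell, so the usage line is left commented):

```python
# A hedged illustration, not from the original notebook: the three parts of a window spec.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window()
     .partitionBy("province")     # 1. how rows are grouped into windows
     .orderBy("date")             # 2. how rows are ordered inside each window
     .rowsBetween(-6, 0))         # 3. optional frame: the 7 rows ending at the current row

# `timeprovince` is loaded in the next cell, so the usage is shown as a comment:
# timeprovince.withColumn("roll_7", F.mean("confirmed").over(w)).show()
```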
517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": null, 522 | "metadata": {}, 523 | "outputs": [], 524 | "source": [ 525 | "timeprovince = spark.read.load(\"./data/TimeProvince.csv\",\n", 526 | " format=\"csv\", \n", 527 | " sep=\",\", \n", 528 | " inferSchema=\"true\", \n", 529 | " header=\"true\")\n", 530 | "\n", 531 | "timeprovince.show()" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "#### [1] Ranking" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "You can get rank as well as dense_rank on a group using this function. For example, you may want to have a column in your cases table that provides the rank of infection_case based on the number of infection_case in a province. We can do this by:" 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": null, 551 | "metadata": {}, 552 | "outputs": [], 553 | "source": [ 554 | "from pyspark.sql.window import Window\n", 555 | "windowSpec = Window().partitionBy(['province']).orderBy(F.desc('confirmed'))\n", 556 | "cases.withColumn(\"rank\",F.rank().over(windowSpec)).show()" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "#### [2] Lag Variables" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "Sometimes our data science models may need **lag based** features. For example, a model might have variables like the price last week or sales quantity the previous day. We can create such features using the lag function with window functions. \\\n", 571 | "\n", 572 | "Here I am trying to get the confirmed cases 7 days before. I am filtering to show the results as the first few days of corona cases were zeros. You can see here that the lag_7 day feature is shifted by 7 days." 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": null, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "from pyspark.sql.window import Window\n", 582 | "\n", 583 | "windowSpec = Window().partitionBy(['province']).orderBy('date')\n", 584 | "\n", 585 | "timeprovinceWithLag = timeprovince.withColumn(\"lag_7\",F.lag(\"confirmed\", 7).over(windowSpec))\n", 586 | "\n", 587 | "timeprovinceWithLag.filter(timeprovinceWithLag.date>'2020-03-10').show()" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "#### [3] Rolling Aggregations" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "For example, we might want to have a rolling 7-day sales sum/mean as a feature for our sales regression model. Let us calculate the rolling mean of confirmed cases for the last 7 days here. This is what a lot of the people are already doing with this dataset to see the real trends." 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [ 610 | "from pyspark.sql.window import Window\n", 611 | "\n", 612 | "# we only look at the past 7 days in a particular window including the current_day. \n", 613 | "# Here 0 specifies the current_row and -6 specifies the seventh row previous to current_row. 
\n", 614 | "# Remember we count starting from 0.\n", 615 | "\n", 616 | "# If we had used rowsBetween(-7,-1), we would just have looked at past 7 days of data and not the current_day\n", 617 | "windowSpec = Window().partitionBy(['province']).orderBy('date').rowsBetween(-6,0)\n", 618 | "\n", 619 | "timeprovinceWithRoll = timeprovince.withColumn(\"roll_7_confirmed\",F.mean(\"confirmed\").over(windowSpec))\n", 620 | "\n", 621 | "timeprovinceWithRoll.filter(timeprovinceWithLag.date>'2020-03-10').show()" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "One could also find a use for **rowsBetween(Window.unboundedPreceding, Window.currentRow)** function, where we take the rows between the first row in a window and the current_row to get running totals. I am calculating cumulative_confirmed here." 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": null, 634 | "metadata": {}, 635 | "outputs": [], 636 | "source": [ 637 | "from pyspark.sql.window import Window\n", 638 | "\n", 639 | "windowSpec = Window().partitionBy(['province']).orderBy('date').rowsBetween(Window.unboundedPreceding,Window.currentRow)\n", 640 | "\n", 641 | "timeprovinceWithRoll = timeprovince.withColumn(\"cumulative_confirmed\",F.sum(\"confirmed\").over(windowSpec))\n", 642 | "\n", 643 | "timeprovinceWithRoll.filter(timeprovinceWithLag.date>'2020-03-10').show()" 644 | ] 645 | }, 646 | { 647 | "cell_type": "markdown", 648 | "metadata": {}, 649 | "source": [ 650 | "### 5. Pivot DataFrames" 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "metadata": {}, 656 | "source": [ 657 | "Sometimes we may need to have the dataframe in flat format. This happens frequently in movie data where we may want to show genres as columns instead of rows. We can use pivot to do this. Here I am trying to get one row for each date and getting the province names as columns." 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": {}, 664 | "outputs": [], 665 | "source": [ 666 | "pivotedTimeprovince = timeprovince.groupBy('date').pivot('province') \\\n", 667 | ".agg(F.sum('confirmed').alias('confirmed') , F.sum('released').alias('released'))\n", 668 | "\n", 669 | "pivotedTimeprovince.limit(10).toPandas()" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": {}, 675 | "source": [ 676 | "### 6. Other Opertions" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "#### [1] Caching" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "Spark works on the lazy execution principle. What that means is that nothing really gets executed until you use an action function like the .count() on a dataframe. And if you do a .count function, it generally helps to cache at this step. So I have made it a point to cache() my dataframes whenever I do a .count() operation." 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "metadata": {}, 697 | "outputs": [], 698 | "source": [ 699 | "df.cache().count()" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": {}, 705 | "source": [ 706 | "#### [2] Save and Load from an intermediate step" 707 | ] 708 | }, 709 | { 710 | "cell_type": "markdown", 711 | "metadata": {}, 712 | "source": [ 713 | "When you work with Spark you will frequently run with memory and storage issues. 
While in some cases such issues might be resolved using techniques like broadcasting, salting or cache, sometimes just interrupting the workflow and saving and reloading the whole dataframe at a crucial step has helped me a lot. This helps spark to let go of a lot of memory that gets utilized for storing intermediate shuffle data and unused caches." 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": null, 719 | "metadata": {}, 720 | "outputs": [], 721 | "source": [ 722 | "df.write.parquet(\"data/df.parquet\")\n", 723 | "df.unpersist()\n", 724 | "spark.read.load(\"data/df.parquet\")" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": {}, 730 | "source": [ 731 | "#### [3] Repartitioning" 732 | ] 733 | }, 734 | { 735 | "cell_type": "markdown", 736 | "metadata": {}, 737 | "source": [ 738 | "You might want to repartition your data if you feel your data has been skewed while working with all the transformations and joins. The simplest way to do it is by using:" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "df = df.repartition(1000)" 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": {}, 753 | "source": [ 754 | "Sometimes you might also want to repartition by a known scheme as this scheme might be used by a certain join or aggregation operation later on. You can use multiple columns to repartition using:" 755 | ] 756 | }, 757 | { 758 | "cell_type": "code", 759 | "execution_count": null, 760 | "metadata": {}, 761 | "outputs": [], 762 | "source": [ 763 | "df = df.repartition('cola', 'colb','colc','cold')" 764 | ] 765 | }, 766 | { 767 | "cell_type": "markdown", 768 | "metadata": {}, 769 | "source": [ 770 | "Then, we can get the number of partitions in a data frame using:" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": null, 776 | "metadata": {}, 777 | "outputs": [], 778 | "source": [ 779 | "df.rdd.getNumPartitions()" 780 | ] 781 | }, 782 | { 783 | "cell_type": "markdown", 784 | "metadata": {}, 785 | "source": [ 786 | "You can also check out the distribution of records in a partition by using the glom function. This helps in understanding the skew in the data that happens while working with various transformations." 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": {}, 793 | "outputs": [], 794 | "source": [ 795 | "df.glom().map(len).collect()" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "metadata": {}, 801 | "source": [ 802 | "#### [4] Reading Parquet File in Local\n", 803 | "Sometimes you might want to read the parquet files in a system where Spark is not available. 
In such cases, I normally use the below code:" 804 | ] 805 | }, 806 | { 807 | "cell_type": "code", 808 | "execution_count": null, 809 | "metadata": {}, 810 | "outputs": [], 811 | "source": [ 812 | "from glob import glob\n", 813 | "def load_df_from_parquet(parquet_directory):\n", 814 | " df = pd.DataFrame()\n", 815 | " for file in glob(f\"{parquet_directory}/*\"):\n", 816 | " df = pd.concat([df,pd.read_parquet(file)])\n", 817 | " return df" 818 | ] 819 | } 820 | ], 821 | "metadata": { 822 | "kernelspec": { 823 | "display_name": "Python 3", 824 | "language": "python", 825 | "name": "python3" 826 | }, 827 | "language_info": { 828 | "codemirror_mode": { 829 | "name": "ipython", 830 | "version": 3 831 | }, 832 | "file_extension": ".py", 833 | "mimetype": "text/x-python", 834 | "name": "python", 835 | "nbconvert_exporter": "python", 836 | "pygments_lexer": "ipython3", 837 | "version": "3.7.6" 838 | } 839 | }, 840 | "nbformat": 4, 841 | "nbformat_minor": 4 842 | } 843 | -------------------------------------------------------------------------------- /Binary Tabular Data Classification with PySpark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Binary Tabular Data Classification with PySpark" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook covers a classification problem in Machine Learning and go through a comprehensive guide to succesfully develop an End-to-End ML class prediction model using PySpark." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "**Classification Algorithms**\n", 22 | "In order to predict the class of certain samples, there are several classification algorithms that can be used. In fact, when developing our machine learning models, we will train and evaluate a certain number of them, and we will keep those with better predicting performance. \\\n", 23 | "\n", 24 | "A non-exhaustive list of some of the most used algorithms are:\n", 25 | "\n", 26 | "- Logistic Regression\n", 27 | "- Decision Trees\n", 28 | "- Random Forests\n", 29 | "- Support Vector Machines\n", 30 | "- K-Nearest Neighbors (KNN)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "**ROC**\n", 38 | "the metric that we will use in our project is the Reciever Operation Characteristic or ROC.\n", 39 | "The ROC curve tells us about how good the model can distinguish between two classes. It can get values from 0 to 1. The better the model is, the closer to 1 value it will be." 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "We will use a number of different supervised algorithms to precisely predict individuals’ income using data collected from the 1994 U.S. Census. \\\n", 47 | " \n", 48 | "We will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data.\n", 49 | "Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000. \\\n", 50 | "\n", 51 | "As from our previous research we have found out that the individuals who are most likely to donate money to a charity are the ones that make more than $50,000. 
\\\n", 52 | "\n", 53 | "Therefore, we are facing a binary classification problem, where we want to determine wether an individual makes more than $50K a year (class 1) or do not (class 0)." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 1, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "#we use the findspark library to locate spark on our local machine\n", 63 | "import findspark\n", 64 | "findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 2, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "import pandas as pd\n", 74 | "import numpy as np\n", 75 | "from datetime import date, timedelta, datetime\n", 76 | "import time\n", 77 | "\n", 78 | "import pyspark # only run this after findspark.init()\n", 79 | "from pyspark.sql import SparkSession, SQLContext\n", 80 | "from pyspark.context import SparkContext\n", 81 | "from pyspark.sql.functions import * \n", 82 | "from pyspark.sql.types import * " 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### 1. Load Data" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "The census dataset consists of approximately 45222 data points, with each datapoint having 13 features.\n", 97 | "\n", 98 | "The dataset for this project can be found from the [UCI Machine Learning Repo](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)." 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 3, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "# Initiate the Spark Session\n", 108 | "spark = SparkSession.builder.appName('imbalanced_binary_classification').getOrCreate()" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 4, 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "data": { 118 | "text/html": [ 119 | "\n", 120 | "
\n", 121 | "

SparkSession - in-memory

\n", 122 | " \n", 123 | "
\n", 124 | "

SparkContext

\n", 125 | "\n", 126 | "

Spark UI

\n", 127 | "\n", 128 | "
\n", 129 | "
Version
\n", 130 | "
v2.4.6
\n", 131 | "
Master
\n", 132 | "
local[*]
\n", 133 | "
AppName
\n", 134 | "
imbalanced_binary_classification
\n", 135 | "
\n", 136 | "
\n", 137 | " \n", 138 | "
\n", 139 | " " 140 | ], 141 | "text/plain": [ 142 | "" 143 | ] 144 | }, 145 | "execution_count": 4, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "spark" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 5, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "data": { 161 | "text/plain": [ 162 | "DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, income: string]" 163 | ] 164 | }, 165 | "metadata": {}, 166 | "output_type": "display_data" 167 | } 168 | ], 169 | "source": [ 170 | "# File location and type\n", 171 | "file_location = \"./data/census.csv\"\n", 172 | "file_type = \"csv\"\n", 173 | "\n", 174 | "# CSV options\n", 175 | "infer_schema = \"true\"\n", 176 | "first_row_is_header = \"False\"\n", 177 | "delimiter = \",\"\n", 178 | "\n", 179 | "# make sure to add column name as the CSV does not contain column name as default\n", 180 | "\n", 181 | "\n", 182 | "# The applied options are for CSV files. For other file types, these will be ignored.\n", 183 | "df = spark.read.format(file_type) \\\n", 184 | " .option(\"inferSchema\", infer_schema) \\\n", 185 | " .option(\"header\", first_row_is_header) \\\n", 186 | " .option(\"sep\", delimiter) \\\n", 187 | " .load(file_location) \\\n", 188 | " .toDF(\"age\", \"workClass\", \"fnlwgt\", \"education\", \"education-num\",\"marital-status\", \"occupation\", \"relationship\",\n", 189 | " \"race\", \"sex\", \"capital-gain\", \"capital-loss\", \"hours-per-week\", \"native-country\", \"income\")\n", 190 | "\n", 191 | "display(df)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 6, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "name": "stdout", 201 | "output_type": "stream", 202 | "text": [ 203 | "+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+\n", 204 | "|age| workClass|fnlwgt| education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n", 205 | "+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+\n", 206 | "| 39| State-gov| 77516| Bachelors| 13| Never-married| Adm-clerical|Not-in-family| White| Male| 2174| 0| 40| United-States| <=50K|\n", 207 | "| 50|Self-emp-not-inc| 83311| Bachelors| 13| Married-civ-spouse| Exec-managerial| Husband| White| Male| 0| 0| 13| United-States| <=50K|\n", 208 | "| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family| White| Male| 0| 0| 40| United-States| <=50K|\n", 209 | "| 53| Private|234721| 11th| 7| Married-civ-spouse|Handlers-cleaners| Husband| Black| Male| 0| 0| 40| United-States| <=50K|\n", 210 | "| 28| Private|338409| Bachelors| 13| Married-civ-spouse| Prof-specialty| Wife| Black|Female| 0| 0| 40| Cuba| <=50K|\n", 211 | "| 37| Private|284582| Masters| 14| Married-civ-spouse| Exec-managerial| Wife| White|Female| 0| 0| 40| United-States| <=50K|\n", 212 | "| 49| Private|160187| 9th| 5|Married-spouse-ab...| Other-service|Not-in-family| Black|Female| 0| 0| 16| Jamaica| <=50K|\n", 213 | "| 
52|Self-emp-not-inc|209642| HS-grad| 9| Married-civ-spouse| Exec-managerial| Husband| White| Male| 0| 0| 45| United-States| >50K|\n", 214 | "| 31| Private| 45781| Masters| 14| Never-married| Prof-specialty|Not-in-family| White|Female| 14084| 0| 50| United-States| >50K|\n", 215 | "| 42| Private|159449| Bachelors| 13| Married-civ-spouse| Exec-managerial| Husband| White| Male| 5178| 0| 40| United-States| >50K|\n", 216 | "| 37| Private|280464|Some-college| 10| Married-civ-spouse| Exec-managerial| Husband| Black| Male| 0| 0| 80| United-States| >50K|\n", 217 | "| 30| State-gov|141297| Bachelors| 13| Married-civ-spouse| Prof-specialty| Husband|Asian-Pac-Islander| Male| 0| 0| 40| India| >50K|\n", 218 | "| 23| Private|122272| Bachelors| 13| Never-married| Adm-clerical| Own-child| White|Female| 0| 0| 30| United-States| <=50K|\n", 219 | "| 32| Private|205019| Assoc-acdm| 12| Never-married| Sales|Not-in-family| Black| Male| 0| 0| 50| United-States| <=50K|\n", 220 | "| 40| Private|121772| Assoc-voc| 11| Married-civ-spouse| Craft-repair| Husband|Asian-Pac-Islander| Male| 0| 0| 40| ?| >50K|\n", 221 | "| 34| Private|245487| 7th-8th| 4| Married-civ-spouse| Transport-moving| Husband|Amer-Indian-Eskimo| Male| 0| 0| 45| Mexico| <=50K|\n", 222 | "| 25|Self-emp-not-inc|176756| HS-grad| 9| Never-married| Farming-fishing| Own-child| White| Male| 0| 0| 35| United-States| <=50K|\n", 223 | "| 32| Private|186824| HS-grad| 9| Never-married|Machine-op-inspct| Unmarried| White| Male| 0| 0| 40| United-States| <=50K|\n", 224 | "| 38| Private| 28887| 11th| 7| Married-civ-spouse| Sales| Husband| White| Male| 0| 0| 50| United-States| <=50K|\n", 225 | "| 43|Self-emp-not-inc|292175| Masters| 14| Divorced| Exec-managerial| Unmarried| White|Female| 0| 0| 45| United-States| >50K|\n", 226 | "+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+\n", 227 | "only showing top 20 rows\n", 228 | "\n" 229 | ] 230 | } 231 | ], 232 | "source": [ 233 | "df.show()" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### 2. 
Data Preprocessing" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 7, 246 | "metadata": {}, 247 | "outputs": [ 248 | { 249 | "data": { 250 | "text/plain": [ 251 | "['age',\n", 252 | " 'workClass',\n", 253 | " 'fnlwgt',\n", 254 | " 'education',\n", 255 | " 'education-num',\n", 256 | " 'marital-status',\n", 257 | " 'occupation',\n", 258 | " 'relationship',\n", 259 | " 'race',\n", 260 | " 'sex',\n", 261 | " 'capital-gain',\n", 262 | " 'capital-loss',\n", 263 | " 'hours-per-week',\n", 264 | " 'native-country',\n", 265 | " '>50K']" 266 | ] 267 | }, 268 | "execution_count": 7, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "# Import pyspark functions\n", 275 | "from pyspark.sql import functions as F\n", 276 | "# Create add new column to the dataset\n", 277 | "df = df.withColumn('>50K', F.when(df.income == '<=50K', 0).otherwise(1))\n", 278 | "# Drop the Income label\n", 279 | "df = df.drop('income')\n", 280 | "# Show dataset's columns\n", 281 | "df.columns" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "#### Vectorizing Numerical Features and One-Hot Encodin Categorical Features" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 8, 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "# Selecting categorical features\n", 298 | "categorical_columns = [\n", 299 | " 'workClass',\n", 300 | " 'education',\n", 301 | " 'marital-status',\n", 302 | " 'occupation',\n", 303 | " 'relationship',\n", 304 | " 'race',\n", 305 | " 'sex',\n", 306 | " 'hours-per-week',\n", 307 | " 'native-country',\n", 308 | " ]" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 9, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "from pyspark.ml import Pipeline\n", 318 | "from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler\n", 319 | "from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LogisticRegression)\n", 320 | "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n", 321 | "\n", 322 | "# The index of string values multiple columns\n", 323 | "indexers = [\n", 324 | " StringIndexer(inputCol=c, outputCol=\"{0}_indexed\".format(c))\n", 325 | " for c in categorical_columns]\n", 326 | "# The encode of indexed values multiple columns\n", 327 | "encoders = [OneHotEncoder(dropLast=False,inputCol=indexer.getOutputCol(),\n", 328 | " outputCol=\"{0}_encoded\".format(indexer.getOutputCol())) \n", 329 | " for indexer in indexers]" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "The above code basically indexes each categorical column using the StringIndexer, and then converts the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row." 
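To make the indexing and encoding concrete, here is a small hedged illustration (a toy DataFrame of my own, not the census data) of what a single indexer/encoder pair produces:

```python
# A hedged illustration, not part of the original notebook: what one
# StringIndexer + OneHotEncoder pair does, shown on a small toy DataFrame.
from pyspark.ml.feature import StringIndexer, OneHotEncoder

toy = spark.createDataFrame([("Private",), ("State-gov",), ("Private",)], ["workClass"])
indexed = StringIndexer(inputCol="workClass",
                        outputCol="workClass_indexed").fit(toy).transform(toy)
encoded = OneHotEncoder(dropLast=False,
                        inputCol="workClass_indexed",
                        outputCol="workClass_indexed_encoded").transform(indexed)
encoded.show(truncate=False)
# "Private" (more frequent) gets index 0.0 -> (2,[0],[1.0]); "State-gov" -> (2,[1],[1.0])
```

In the real pipeline this happens once per categorical column, and the encoded vectors are what the `VectorAssembler` later stitches together with the numerical features.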
337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "#### Join the categorical encoded features with the numerical ones and make a vector with both of them" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 10, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "# Vectorizing encoded values\n", 353 | "categorical_encoded = [encoder.getOutputCol() for encoder in encoders]\n", 354 | "numerical_columns = ['age', 'education-num', 'capital-gain', 'capital-loss']\n", 355 | "inputcols = categorical_encoded + numerical_columns\n", 356 | "assembler = VectorAssembler(inputCols=inputcols, outputCol=\"features\")" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "#### Set up a pipeline to automatize this stages" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 11, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, >50K: int, workClass_indexed: double, education_indexed: double, marital-status_indexed: double, occupation_indexed: double, relationship_indexed: double, race_indexed: double, sex_indexed: double, hours-per-week_indexed: double, native-country_indexed: double, workClass_indexed_encoded: vector, education_indexed_encoded: vector, marital-status_indexed_encoded: vector, occupation_indexed_encoded: vector, relationship_indexed_encoded: vector, race_indexed_encoded: vector, sex_indexed_encoded: vector, hours-per-week_indexed_encoded: vector, native-country_indexed_encoded: vector, features: vector]" 375 | ] 376 | }, 377 | "metadata": {}, 378 | "output_type": "display_data" 379 | } 380 | ], 381 | "source": [ 382 | "pipeline = Pipeline(stages=indexers + encoders+[assembler])\n", 383 | "model = pipeline.fit(df)\n", 384 | "# Transform data\n", 385 | "transformed = model.transform(df)\n", 386 | "display(transformed)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "#### Finally, we will select a dataset only with the relevant features." 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 12, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "# Transform data\n", 403 | "final_data = transformed.select('features', '>50K')" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "### 3. 
Build a Model" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 13, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "# Initialize the classification models\n", 420 | "# Decision Trees\n", 421 | "# Random Forests\n", 422 | "# Gradient Boosted Trees\n", 423 | "\n", 424 | "dtc = DecisionTreeClassifier(labelCol='>50K', featuresCol='features')\n", 425 | "\n", 426 | "rfc = RandomForestClassifier(numTrees=150, labelCol='>50K', featuresCol='features')\n", 427 | "\n", 428 | "gbt = GBTClassifier(labelCol='>50K', featuresCol='features', maxIter=10)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 14, 434 | "metadata": {}, 435 | "outputs": [ 436 | { 437 | "name": "stdout", 438 | "output_type": "stream", 439 | "text": [ 440 | "39010\n", 441 | "9832\n" 442 | ] 443 | } 444 | ], 445 | "source": [ 446 | "# Split data\n", 447 | "# We will perform a classic 80/20 split between training and testing data.\n", 448 | "train_data, test_data = final_data.randomSplit([0.8,0.2], seed=623)\n", 449 | "print(train_data.count())\n", 450 | "print(test_data.count())" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "### 4. Start Training" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 15, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "dtc_model = dtc.fit(train_data)\n", 467 | "rfc_model = rfc.fit(train_data)\n", 468 | "gbt_model = gbt.fit(train_data)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### 5. Evaludate with Test-set" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 16, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "dtc_preds = dtc_model.transform(test_data)\n", 485 | "rfc_preds = rfc_model.transform(test_data)\n", 486 | "gbt_preds = gbt_model.transform(test_data)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "### 6. 
Evaluating Model’s Performance" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 17, 499 | "metadata": {}, 500 | "outputs": [], 501 | "source": [ 502 | "# our evaluator will be the ROC\n", 503 | "my_eval = BinaryClassificationEvaluator(labelCol='>50K')" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 18, 509 | "metadata": {}, 510 | "outputs": [ 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "DTC\n", 516 | "0.5849312593442992\n" 517 | ] 518 | } 519 | ], 520 | "source": [ 521 | "# Display Decision Tree evaluation metric\n", 522 | "print('DTC')\n", 523 | "print(my_eval.evaluate(dtc_preds))" 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": 19, 529 | "metadata": {}, 530 | "outputs": [ 531 | { 532 | "name": "stdout", 533 | "output_type": "stream", 534 | "text": [ 535 | "RFC\n", 536 | "0.8914577709920453\n" 537 | ] 538 | } 539 | ], 540 | "source": [ 541 | "# Display Random Forest evaluation metric\n", 542 | "print('RFC')\n", 543 | "print(my_eval.evaluate(rfc_preds))" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 20, 549 | "metadata": {}, 550 | "outputs": [ 551 | { 552 | "name": "stdout", 553 | "output_type": "stream", 554 | "text": [ 555 | "GBT\n", 556 | "0.9044179860557597\n" 557 | ] 558 | } 559 | ], 560 | "source": [ 561 | "# Display Gradien Boosting Tree evaluation metric\n", 562 | "print('GBT')\n", 563 | "print(my_eval.evaluate(gbt_preds))" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "### 7. Improving Models Performance (Model Tuning)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "We will try to do this by performing the grid search cross validation technique. With it, we will evaluate the performance of the model with different combinations of previously sets of hyperparameter’s values.\n", 578 | "\n", 579 | "The hyperparameters that we will tune are:\n", 580 | "\n", 581 | "- Max Depth\n", 582 | "- Max Bins\n", 583 | "- Max Iterations" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 21, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "data": { 593 | "text/plain": [ 594 | "0.9143539096589867" 595 | ] 596 | }, 597 | "execution_count": 21, 598 | "metadata": {}, 599 | "output_type": "execute_result" 600 | } 601 | ], 602 | "source": [ 603 | "# Import libraries\n", 604 | "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", 605 | "\n", 606 | "# Set the Parameters grid\n", 607 | "paramGrid = (ParamGridBuilder()\n", 608 | " .addGrid(gbt.maxDepth, [2, 4, 6])\n", 609 | " .addGrid(gbt.maxBins, [20, 60])\n", 610 | " .addGrid(gbt.maxIter, [10, 20])\n", 611 | " .build())\n", 612 | "\n", 613 | "# Iinitializing the cross validator class\n", 614 | "cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=my_eval, numFolds=5)\n", 615 | "\n", 616 | "# Run cross validations. 
This can take about 6 minutes since it is training over 20 trees\n", 617 | "cvModel = cv.fit(train_data)\n", 618 | "gbt_predictions_2 = cvModel.transform(test_data)\n", 619 | "my_eval.evaluate(gbt_predictions_2)" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": {}, 625 | "source": [ 626 | "#### We can also access the model's feature weights and intercepts easily" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "metadata": {}, 633 | "outputs": [], 634 | "source": [ 635 | "print('Model Intercept: ', cvModel.bestModel.intercept)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "weights = cvModel.bestModel.coefficients\n", 645 | "weights = [(float(w),) for w in weights] # convert numpy type to float, and to tuple\n", 646 | "weightsDF = sqlContext.createDataFrame(weights, [\"Feature Weight\"])\n", 647 | "display(weightsDF)" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "# View best model's predictions and probabilities of each prediction class\n", 657 | "selected = predictions.select(\"label\", \"prediction\", \"probability\", \"age\", \"occupation\")\n", 658 | "display(selected)" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": null, 664 | "metadata": {}, 665 | "outputs": [], 666 | "source": [ 667 | "# End Spark Session\n", 668 | "spark.stop()" 669 | ] 670 | } 671 | ], 672 | "metadata": { 673 | "kernelspec": { 674 | "display_name": "Python 3", 675 | "language": "python", 676 | "name": "python3" 677 | }, 678 | "language_info": { 679 | "codemirror_mode": { 680 | "name": "ipython", 681 | "version": 3 682 | }, 683 | "file_extension": ".py", 684 | "mimetype": "text/x-python", 685 | "name": "python", 686 | "nbconvert_exporter": "python", 687 | "pygments_lexer": "ipython3", 688 | "version": "3.7.6" 689 | } 690 | }, 691 | "nbformat": 4, 692 | "nbformat_minor": 4 693 | } 694 | -------------------------------------------------------------------------------- /Multi-class Text Classification Problem with PySpark and MLlib.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Multi-class Text Classification Problem with PySpark and MLlib" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Background\n", 15 | "Apache Spark is quickly gaining steam both in the headlines and real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for us to be able to stream and analyze it in real time" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "We use Spark Machine Learning Library (Spark MLlib) to solve this multi-class text classification problem" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 1. 
Load Data \n", 30 | "The dataset is from Kaggle [Link](https://www.kaggle.com/c/sf-crime/data)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "The problem is to classify 'Crime Description' into 33 categories" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 1, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import os\n", 47 | "import pandas as pd" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "#we use the findspark library to locate spark on our local machine\n", 57 | "import findspark\n", 58 | "findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "import pyspark # only run after findspark.init()\n", 68 | "from pyspark.sql import SparkSession, SQLContext\n", 69 | "from pyspark import SparkContext\n", 70 | "\n", 71 | "# Spark offers built-in packages to load CSV files\n", 72 | "sc =SparkContext()\n", 73 | "sqlContext = SQLContext(sc)\n", 74 | "data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('crime_train.csv')" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## 2. Data Preprocessing" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "Remove the columns we do not need and have a look the first five rows" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "drop_list = ['Dates', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']\n", 98 | "data = data.select([column for column in data.columns if column not in drop_list])" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 20, 104 | "metadata": {}, 105 | "outputs": [ 106 | { 107 | "name": "stdout", 108 | "output_type": "stream", 109 | "text": [ 110 | "root\n", 111 | " |-- Category: string (nullable = true)\n", 112 | " |-- Descript: string (nullable = true)\n", 113 | "\n" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "data.printSchema()" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 21, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "+--------------+--------------------+\n", 131 | "| Category| Descript|\n", 132 | "+--------------+--------------------+\n", 133 | "| WARRANTS| WARRANT ARREST|\n", 134 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|\n", 135 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|\n", 136 | "| LARCENY/THEFT|GRAND THEFT FROM ...|\n", 137 | "| LARCENY/THEFT|GRAND THEFT FROM ...|\n", 138 | "+--------------+--------------------+\n", 139 | "only showing top 5 rows\n", 140 | "\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "data.show(5)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 22, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "+--------------------+------+\n", 158 | "| Category| count|\n", 159 | "+--------------------+------+\n", 160 | "| LARCENY/THEFT|174900|\n", 161 | "| OTHER OFFENSES|126182|\n", 162 | "| NON-CRIMINAL| 92304|\n", 163 | "| ASSAULT| 76876|\n", 164 | "| DRUG/NARCOTIC| 53971|\n", 165 | "| VEHICLE 
THEFT| 53781|\n", 166 | "| VANDALISM| 44725|\n", 167 | "| WARRANTS| 42214|\n", 168 | "| BURGLARY| 36755|\n", 169 | "| SUSPICIOUS OCC| 31414|\n", 170 | "| MISSING PERSON| 25989|\n", 171 | "| ROBBERY| 23000|\n", 172 | "| FRAUD| 16679|\n", 173 | "|FORGERY/COUNTERFE...| 10609|\n", 174 | "| SECONDARY CODES| 9985|\n", 175 | "| WEAPON LAWS| 8555|\n", 176 | "| PROSTITUTION| 7484|\n", 177 | "| TRESPASS| 7326|\n", 178 | "| STOLEN PROPERTY| 4540|\n", 179 | "|SEX OFFENSES FORC...| 4388|\n", 180 | "+--------------------+------+\n", 181 | "only showing top 20 rows\n", 182 | "\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "# Top 20 crimes\n", 188 | "from pyspark.sql.functions import col\n", 189 | "data.groupBy(\"Category\") \\\n", 190 | " .count() \\\n", 191 | " .orderBy(col(\"count\").desc()) \\\n", 192 | " .show()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 23, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "+--------------------+-----+\n", 205 | "| Descript|count|\n", 206 | "+--------------------+-----+\n", 207 | "|GRAND THEFT FROM ...|60022|\n", 208 | "| LOST PROPERTY|31729|\n", 209 | "| BATTERY|27441|\n", 210 | "| STOLEN AUTOMOBILE|26897|\n", 211 | "|DRIVERS LICENSE, ...|26839|\n", 212 | "| WARRANT ARREST|23754|\n", 213 | "|SUSPICIOUS OCCURR...|21891|\n", 214 | "|AIDED CASE, MENTA...|21497|\n", 215 | "|PETTY THEFT FROM ...|19771|\n", 216 | "|MALICIOUS MISCHIE...|17789|\n", 217 | "| TRAFFIC VIOLATION|16471|\n", 218 | "|PETTY THEFT OF PR...|16196|\n", 219 | "|MALICIOUS MISCHIE...|15957|\n", 220 | "|THREATS AGAINST LIFE|14716|\n", 221 | "| FOUND PROPERTY|12146|\n", 222 | "|ENROUTE TO OUTSID...|11470|\n", 223 | "|GRAND THEFT OF PR...|11010|\n", 224 | "|POSSESSION OF NAR...|10050|\n", 225 | "|PETTY THEFT FROM ...|10029|\n", 226 | "|PETTY THEFT SHOPL...| 9571|\n", 227 | "+--------------------+-----+\n", 228 | "only showing top 20 rows\n", 229 | "\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "# Top 20 descriptions\n", 235 | "data.groupBy(\"Descript\") \\\n", 236 | " .count() \\\n", 237 | " .orderBy(col(\"count\").desc()) \\\n", 238 | " .show()" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "## 3. Create a Model\n", 246 | "### In Spark, we call create 'Model Pipeline', and we accomplish this in 5 steps\n", 247 | "1. regexTokenizer: Tokenization (with Regular Expression)\n", 248 | "2. stopwordsRemover: Remove Stop Words\n", 249 | "3. countVectors: Count vectors (“document-term vectors”)\n", 250 | "4. StringIndexer : encodes a string column of labels to a column of label indices. The indices are in (0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. \\\n", 251 | "(In our case, the label column (Category) will be encoded to label indices, from 0 to 32; the most frequent label (LARCENY/THEFT) will be indexed as 0)\n", 252 | "5. 
Split Train/Test data for making 'training-ready'" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 24, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, OneHotEncoder, StringIndexer, VectorAssembler, HashingTF, IDF\n", 262 | "from pyspark.ml.classification import LogisticRegression\n", 263 | "\n", 264 | "# Clean description using regular-expression tokenizer\n", 265 | "regexTokenizer = RegexTokenizer(inputCol=\"Descript\", outputCol=\"words\", pattern=\"\\\\W\")\n", 266 | "\n", 267 | "# exclue stop words\n", 268 | "add_stopwords = [\"http\",\"https\",\"amp\",\"rt\",\"t\",\"c\",\"the\"] \n", 269 | "stopwordsRemover = StopWordsRemover(inputCol=\"words\", outputCol=\"filtered\").setStopWords(add_stopwords)\n", 270 | "\n", 271 | "# bag-of-words count\n", 272 | "countVectors = CountVectorizer(inputCol=\"filtered\", outputCol=\"features\", vocabSize=10000, minDF=5)\n", 273 | "\n", 274 | "# Add Hashing\n", 275 | "# hashingTF = HashingTF(inputCol=\"filtered\", outputCol=\"rawFeatures\", numFeatures=10000)\n", 276 | "\n", 277 | "# TF and IDF\n", 278 | "# idf = IDF(inputCol=\"rawFeatures\", outputCol=\"features\", minDocFreq=5) #minDocFreq: remove sparse terms\n", 279 | "\n", 280 | "# StringIndexer\n", 281 | "label_stringIdx = StringIndexer(inputCol = \"Category\", outputCol = \"label\")" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 25, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "name": "stdout", 291 | "output_type": "stream", 292 | "text": [ 293 | "+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+\n", 294 | "| Category| Descript| words| filtered| rawFeatures| features|label|\n", 295 | "+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+\n", 296 | "| WARRANTS| WARRANT ARREST| [warrant, arrest]| [warrant, arrest]|(10000,[2279,3942...|(10000,[2279,3942...| 7.0|\n", 297 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|[traffic, violati...|[traffic, violati...|(10000,[604,3942,...|(10000,[604,3942,...| 1.0|\n", 298 | "|OTHER OFFENSES|TRAFFIC VIOLATION...|[traffic, violati...|[traffic, violati...|(10000,[604,3942,...|(10000,[604,3942,...| 1.0|\n", 299 | "| LARCENY/THEFT|GRAND THEFT FROM ...|[grand, theft, fr...|[grand, theft, fr...|(10000,[274,713,3...|(10000,[274,713,3...| 0.0|\n", 300 | "| LARCENY/THEFT|GRAND THEFT FROM ...|[grand, theft, fr...|[grand, theft, fr...|(10000,[274,713,3...|(10000,[274,713,3...| 0.0|\n", 301 | "+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+\n", 302 | "only showing top 5 rows\n", 303 | "\n" 304 | ] 305 | } 306 | ], 307 | "source": [ 308 | "from pyspark.ml import Pipeline\n", 309 | "\n", 310 | "# Put everything in pipeline (We use regexTokenizer, stopwordsRemover, hashingTF, idf, label_stringIdx)\n", 311 | "# you can use hasingTF and IDF alternatively than countVectors\n", 312 | "pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_stringIdx])\n", 313 | "\n", 314 | "# Fit the pipeline to training documents.\n", 315 | "pipelineFit = pipeline.fit(data)\n", 316 | "dataset = pipelineFit.transform(data)\n", 317 | "dataset.show(5)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 26, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 
326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | "Training Dataset Count: 658302\n", 330 | "Test Dataset Count: 219747\n" 331 | ] 332 | } 333 | ], 334 | "source": [ 335 | "# Split Train/Test data\n", 336 | "(trainingData, testData) = dataset.randomSplit([0.75, 0.25], seed = 623)\n", 337 | "print(\"Training Dataset Count: \" + str(trainingData.count()))\n", 338 | "print(\"Test Dataset Count: \" + str(testData.count()))" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "## 4. Train a Model and Evaluation" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "Our model will make predictions and score on the test set \\\n", 353 | "And then we then look at the top 10 predictions from the highest probability." 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 31, 359 | "metadata": {}, 360 | "outputs": [ 361 | { 362 | "name": "stdout", 363 | "output_type": "stream", 364 | "text": [ 365 | "+------------------------------+-------------+------------------------------+-----+----------+\n", 366 | "| Descript| Category| probability|label|prediction|\n", 367 | "+------------------------------+-------------+------------------------------+-----+----------+\n", 368 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 369 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 370 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 371 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 372 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 373 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 374 | "|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8741040478278337,0.02016...| 0.0| 0.0|\n", 375 | "|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.874103341787826,0.020162...| 0.0| 0.0|\n", 376 | "|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.874103341787826,0.020162...| 0.0| 0.0|\n", 377 | "|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.874103341787826,0.020162...| 0.0| 0.0|\n", 378 | "+------------------------------+-------------+------------------------------+-----+----------+\n", 379 | "only showing top 10 rows\n", 380 | "\n" 381 | ] 382 | } 383 | ], 384 | "source": [ 385 | "# We use Logistic-Regression model\n", 386 | "lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)\n", 387 | "\n", 388 | "lrModel = lr.fit(trainingData)\n", 389 | "predictions = lrModel.transform(testData)\n", 390 | "predictions.filter(predictions['prediction'] == 0) \\\n", 391 | " .select(\"Descript\",\"Category\",\"probability\",\"label\",\"prediction\") \\\n", 392 | " .orderBy(\"probability\", ascending=False) \\\n", 393 | " .show(n = 10, truncate = 30)" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 32, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "name": "stdout", 403 | "output_type": "stream", 404 | "text": [ 405 | "Test-set Accuracy is : 0.972745626745252\n" 406 | ] 407 | } 408 | ], 409 | "source": [ 410 | "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", 411 | "evaluator = MulticlassClassificationEvaluator(predictionCol=\"prediction\")\n", 412 | "print(\"Test-set Accuracy is : \", evaluator.evaluate(predictions))" 413 | ] 414 | }, 415 | { 
416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "## 5. Cross-Validation (hyper-parameters tuning)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "Let's try to improve our model by cross-validation, and we will tune the count vectors Logistic Regression" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 33, 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "# Same Pipeline step like above\n", 436 | "pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_stringIdx])\n", 437 | "pipelineFit = pipeline.fit(data)\n", 438 | "dataset = pipelineFit.transform(data)\n", 439 | "(trainingData, testData) = dataset.randomSplit([0.75, 0.25], seed = 623)\n", 440 | "\n", 441 | "# Create LR model\n", 442 | "lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": 34, 448 | "metadata": {}, 449 | "outputs": [], 450 | "source": [ 451 | "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", 452 | "\n", 453 | "# Create ParamGrid for Cross Validation\n", 454 | "paramGrid = (ParamGridBuilder()\n", 455 | " .addGrid(lr.regParam, [0.1, 0.3, 0.5]) # regularization parameter\n", 456 | " .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2]) # Elastic Net Parameter (Ridge = 0)\n", 457 | "# .addGrid(model.maxIter, [10, 20, 50]) #Number of iterations\n", 458 | "# .addGrid(idf.numFeatures, [10, 100, 1000]) # Number of features\n", 459 | " .build())\n", 460 | "\n", 461 | "# Create 5-fold CrossValidator\n", 462 | "cv = CrossValidator(estimator=lr, \\\n", 463 | " estimatorParamMaps=paramGrid, \\\n", 464 | " evaluator=evaluator, \\\n", 465 | " numFolds=5)\n", 466 | "cvModel = cv.fit(trainingData)\n", 467 | "\n", 468 | "predictions = cvModel.transform(testData)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 36, 474 | "metadata": {}, 475 | "outputs": [ 476 | { 477 | "name": "stdout", 478 | "output_type": "stream", 479 | "text": [ 480 | "Test-set Accuracy is : 0.9918902377560262\n" 481 | ] 482 | } 483 | ], 484 | "source": [ 485 | "# Evaluate best model\n", 486 | "evaluator = MulticlassClassificationEvaluator(predictionCol=\"prediction\")\n", 487 | "print(\"Test-set Accuracy is : \", evaluator.evaluate(predictions))" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "### We are able to acheive over 99% accuracy! 
" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "Reference\n", 502 | ">https://towardsdatascience.com/multi-class-text-classification-with-pyspark-7d78d022ed35" 503 | ] 504 | } 505 | ], 506 | "metadata": { 507 | "kernelspec": { 508 | "display_name": "Python 3", 509 | "language": "python", 510 | "name": "python3" 511 | }, 512 | "language_info": { 513 | "codemirror_mode": { 514 | "name": "ipython", 515 | "version": 3 516 | }, 517 | "file_extension": ".py", 518 | "mimetype": "text/x-python", 519 | "name": "python", 520 | "nbconvert_exporter": "python", 521 | "pygments_lexer": "ipython3", 522 | "version": "3.7.6" 523 | } 524 | }, 525 | "nbformat": 4, 526 | "nbformat_minor": 4 527 | } 528 | -------------------------------------------------------------------------------- /Multi-class classification using Decision Tree Problem with PySpark .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Multi-class classification using Decision Tree Problem with PySpark " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Each year, San Francisco Airport (SFO) conducts a customer satisfaction survey to find out what they are doing well and where they can improve. The survey gauges satisfaction with SFO facilities, services, and amenities. SFO compares results to previous surveys to discover elements of the guest experience that are not satisfactory.\n", 15 | "\n", 16 | "The 2013 SFO Survey Results consists of customer responses to survey questions and an overall satisfaction rating with the airport. We investigated whether we could use machine learning to predict a customer's overall response given their responses to the individual questions. That in and of itself is not very useful because the customer has already provided an overall rating as well as individual ratings for various aspects of the airport such as parking, food quality and restroom cleanliness. However, we didn't stop at prediction instead we asked the question:\n", 17 | "\n", 18 | "What factors drove the customer to give the overall rating?" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Here is an outline of our data flow:\n", 26 | "\n", 27 | "- Load data: Load the data as a DataFrame\n", 28 | "- Understand the data: Compute statistics and create visualizations to get a better understanding of the data to see if we can use basic statistics to answer the question above.\n", 29 | "- Create Model On the training dataset:\n", 30 | "- Evaluate the model: Now look at the test dataset. 
Compare the initial model with the tuned model to see the benefit of tuning parameters.\n", 31 | "- Feature Importance: Determine the importance of each of the individual ratings in determining the overall rating by the customer" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "#we use the findspark library to locate spark on our local machine\n", 41 | "import findspark\n", 42 | "findspark.init('C:/Users/bokhy/spark/spark-2.4.6-bin-hadoop2.7')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "import os\n", 52 | "import pandas as pd\n", 53 | "import numpy as np\n", 54 | "from datetime import date, timedelta, datetime\n", 55 | "import time\n", 56 | "\n", 57 | "import pyspark # only run this after findspark.init()\n", 58 | "from pyspark.sql import SparkSession, SQLContext\n", 59 | "from pyspark.context import SparkContext\n", 60 | "from pyspark.sql.functions import * \n", 61 | "from pyspark.sql.types import * " 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "## 1. Load Data \n", 69 | "This dataset is available as a public dataset from https://catalog.data.gov/dataset/2013-sfo-customer-survey-d3541." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# Initiate the Spark Session\n", 79 | "spark = SparkSession.builder.appName('Decision-Tree').getOrCreate()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 27, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/html": [ 90 | "\n", 91 | "
\n", 92 | "

SparkSession - in-memory

\n", 93 | " \n", 94 | "
\n", 95 | "

SparkContext

\n", 96 | "\n", 97 | "

Spark UI

\n", 98 | "\n", 99 | "
\n", 100 | "
Version
\n", 101 | "
v2.4.6
\n", 102 | "
Master
\n", 103 | "
local[*]
\n", 104 | "
AppName
\n", 105 | "
Decision-Tree
\n", 106 | "
\n", 107 | "
\n", 108 | " \n", 109 | "
\n", 110 | " " 111 | ], 112 | "text/plain": [ 113 | "" 114 | ] 115 | }, 116 | "execution_count": 27, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "spark" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 5, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "survey = spark.read.csv(\"./data/2013_SFO_Customer_Survey.csv\", header=\"true\", inferSchema=\"true\")" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 6, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/plain": [ 142 | "DataFrame[RESPNUM: int, CCGID: int, RUN: int, INTDATE: int, GATE: int, STRATA: int, PEAK: int, METHOD: int, AIRLINE: int, FLIGHT: int, DEST: int, DESTGEO: int, DESTMARK: int, ARRTIME: string, DEPTIME: string, Q2PURP1: int, Q2PURP2: int, Q2PURP3: int, Q2PURP4: int, Q2PURP5: int, Q2PURP6: string, Q3GETTO1: int, Q3GETTO2: int, Q3GETTO3: int, Q3GETTO4: int, Q3GETTO5: string, Q3GETTO6: string, Q3PARK: int, Q4BAGS: int, Q4BUY: int, Q4FOOD: int, Q4WIFI: int, Q5FLYPERYR: int, Q6TENURE: double, SAQ: int, Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, Q8COM1: int, Q8COM2: int, Q8COM3: int, Q9A_CLNBOARD: int, Q9B_CLNAIRTRAIN: int, Q9C_CLNRENT: int, Q9D_CLNFOOD: int, Q9E_CLNBATH: int, Q9F_CLNWHOLE: int, Q9COM1: int, Q9COM2: int, Q9COM3: int, Q10SAFE: int, Q10COM1: int, Q10COM2: int, Q10COM3: int, Q11A_USEWEB: int, Q11B_USESFOAPP: int, Q11C_USEOTHAPP: int, Q11D_USESOCMED: int, Q11E_USEWIFI: int, Q12COM1: int, Q12COM2: int, Q12COM3: int, Q13_WHEREDEPART: int, Q13_RATEGETTO: int, Q14A_FIND: int, Q14B_SECURITY: int, Q15_PROBLEMS: int, Q15COM1: int, Q15COM2: int, Q15COM3: int, Q16_REGION: int, Q17_CITY: string, Q17_ZIP: int, Q17_COUNTRY: string, HOME: int, Q18_AGE: int, Q19_SEX: int, Q20_INCOME: int, Q21_HIFLYER: int, Q22A_USESJC: int, Q22B_USEOAK: int, LANG: int, WEIGHT: double]" 143 | ] 144 | }, 145 | "metadata": {}, 146 | "output_type": "display_data" 147 | } 148 | ], 149 | "source": [ 150 | "display(survey)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 7, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "name": "stdout", 160 | "output_type": "stream", 161 | "text": [ 162 | "root\n", 163 | " |-- RESPNUM: integer (nullable = true)\n", 164 | " |-- CCGID: integer (nullable = true)\n", 165 | " |-- RUN: integer (nullable = true)\n", 166 | " |-- INTDATE: integer (nullable = true)\n", 167 | " |-- GATE: integer (nullable = true)\n", 168 | " |-- STRATA: integer (nullable = true)\n", 169 | " |-- PEAK: integer (nullable = true)\n", 170 | " |-- METHOD: integer (nullable = true)\n", 171 | " |-- AIRLINE: integer (nullable = true)\n", 172 | " |-- FLIGHT: integer (nullable = true)\n", 173 | " |-- DEST: integer (nullable = true)\n", 174 | " |-- DESTGEO: integer (nullable = true)\n", 175 | " |-- DESTMARK: integer (nullable = true)\n", 176 | " |-- ARRTIME: string (nullable = true)\n", 177 | " |-- DEPTIME: string (nullable = true)\n", 178 | " |-- Q2PURP1: integer (nullable = true)\n", 179 | " |-- Q2PURP2: integer (nullable = true)\n", 180 | " |-- Q2PURP3: integer (nullable = true)\n", 181 | " |-- Q2PURP4: integer (nullable = true)\n", 182 | " |-- Q2PURP5: integer (nullable = true)\n", 183 | " |-- Q2PURP6: string (nullable = true)\n", 184 | " |-- Q3GETTO1: integer 
(nullable = true)\n", 185 | " |-- Q3GETTO2: integer (nullable = true)\n", 186 | " |-- Q3GETTO3: integer (nullable = true)\n", 187 | " |-- Q3GETTO4: integer (nullable = true)\n", 188 | " |-- Q3GETTO5: string (nullable = true)\n", 189 | " |-- Q3GETTO6: string (nullable = true)\n", 190 | " |-- Q3PARK: integer (nullable = true)\n", 191 | " |-- Q4BAGS: integer (nullable = true)\n", 192 | " |-- Q4BUY: integer (nullable = true)\n", 193 | " |-- Q4FOOD: integer (nullable = true)\n", 194 | " |-- Q4WIFI: integer (nullable = true)\n", 195 | " |-- Q5FLYPERYR: integer (nullable = true)\n", 196 | " |-- Q6TENURE: double (nullable = true)\n", 197 | " |-- SAQ: integer (nullable = true)\n", 198 | " |-- Q7A_ART: integer (nullable = true)\n", 199 | " |-- Q7B_FOOD: integer (nullable = true)\n", 200 | " |-- Q7C_SHOPS: integer (nullable = true)\n", 201 | " |-- Q7D_SIGNS: integer (nullable = true)\n", 202 | " |-- Q7E_WALK: integer (nullable = true)\n", 203 | " |-- Q7F_SCREENS: integer (nullable = true)\n", 204 | " |-- Q7G_INFOARR: integer (nullable = true)\n", 205 | " |-- Q7H_INFODEP: integer (nullable = true)\n", 206 | " |-- Q7I_WIFI: integer (nullable = true)\n", 207 | " |-- Q7J_ROAD: integer (nullable = true)\n", 208 | " |-- Q7K_PARK: integer (nullable = true)\n", 209 | " |-- Q7L_AIRTRAIN: integer (nullable = true)\n", 210 | " |-- Q7M_LTPARK: integer (nullable = true)\n", 211 | " |-- Q7N_RENTAL: integer (nullable = true)\n", 212 | " |-- Q7O_WHOLE: integer (nullable = true)\n", 213 | " |-- Q8COM1: integer (nullable = true)\n", 214 | " |-- Q8COM2: integer (nullable = true)\n", 215 | " |-- Q8COM3: integer (nullable = true)\n", 216 | " |-- Q9A_CLNBOARD: integer (nullable = true)\n", 217 | " |-- Q9B_CLNAIRTRAIN: integer (nullable = true)\n", 218 | " |-- Q9C_CLNRENT: integer (nullable = true)\n", 219 | " |-- Q9D_CLNFOOD: integer (nullable = true)\n", 220 | " |-- Q9E_CLNBATH: integer (nullable = true)\n", 221 | " |-- Q9F_CLNWHOLE: integer (nullable = true)\n", 222 | " |-- Q9COM1: integer (nullable = true)\n", 223 | " |-- Q9COM2: integer (nullable = true)\n", 224 | " |-- Q9COM3: integer (nullable = true)\n", 225 | " |-- Q10SAFE: integer (nullable = true)\n", 226 | " |-- Q10COM1: integer (nullable = true)\n", 227 | " |-- Q10COM2: integer (nullable = true)\n", 228 | " |-- Q10COM3: integer (nullable = true)\n", 229 | " |-- Q11A_USEWEB: integer (nullable = true)\n", 230 | " |-- Q11B_USESFOAPP: integer (nullable = true)\n", 231 | " |-- Q11C_USEOTHAPP: integer (nullable = true)\n", 232 | " |-- Q11D_USESOCMED: integer (nullable = true)\n", 233 | " |-- Q11E_USEWIFI: integer (nullable = true)\n", 234 | " |-- Q12COM1: integer (nullable = true)\n", 235 | " |-- Q12COM2: integer (nullable = true)\n", 236 | " |-- Q12COM3: integer (nullable = true)\n", 237 | " |-- Q13_WHEREDEPART: integer (nullable = true)\n", 238 | " |-- Q13_RATEGETTO: integer (nullable = true)\n", 239 | " |-- Q14A_FIND: integer (nullable = true)\n", 240 | " |-- Q14B_SECURITY: integer (nullable = true)\n", 241 | " |-- Q15_PROBLEMS: integer (nullable = true)\n", 242 | " |-- Q15COM1: integer (nullable = true)\n", 243 | " |-- Q15COM2: integer (nullable = true)\n", 244 | " |-- Q15COM3: integer (nullable = true)\n", 245 | " |-- Q16_REGION: integer (nullable = true)\n", 246 | " |-- Q17_CITY: string (nullable = true)\n", 247 | " |-- Q17_ZIP: integer (nullable = true)\n", 248 | " |-- Q17_COUNTRY: string (nullable = true)\n", 249 | " |-- HOME: integer (nullable = true)\n", 250 | " |-- Q18_AGE: integer (nullable = true)\n", 251 | " |-- Q19_SEX: integer (nullable = true)\n", 
252 | " |-- Q20_INCOME: integer (nullable = true)\n", 253 | " |-- Q21_HIFLYER: integer (nullable = true)\n", 254 | " |-- Q22A_USESJC: integer (nullable = true)\n", 255 | " |-- Q22B_USEOAK: integer (nullable = true)\n", 256 | " |-- LANG: integer (nullable = true)\n", 257 | " |-- WEIGHT: double (nullable = true)\n", 258 | "\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "survey.printSchema()" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "As you can see above there are many questions in the survey including what airline the customer flew on, where do they live, etc. For the purposes of answering the above, focus on the Q7A, Q7B, Q7C .. Q7O questions since they directly related to customer satisfaction, which is what you want to measure. If you drill down on those variables you get the following:\n", 271 | "\n", 272 | "|Column Name|Data Type|Description|\n", 273 | "| --- | --- | --- |\n", 274 | "|Q7B_FOOD|INTEGER|Restaurants|\n", 275 | "|Q7C_SHOPS|INTEGER|Retail shops and concessions|\n", 276 | "|Q7D_SIGNS|INTEGER|Signs and Directions inside SFO|\n", 277 | "|Q7E_WALK|INTEGER|Escalators / elevators / moving walkways|\n", 278 | "|Q7F_SCREENS|INTEGER|Information on screens and monitors|\n", 279 | "|Q7G_INFOARR|INTEGER|Information booth near arrivals area|\n", 280 | "|Q7H_INFODEP|INTEGER|Information booth near departure areas|\n", 281 | "|Q7I_WIFI|INTEGER|Airport WiFi|\n", 282 | "|Q7J_ROAD|INTEGER|Signs and directions on SFO airport roadways|\n", 283 | "|Q7K_PARK|INTEGER|Airport parking facilities|\n", 284 | "|Q7L_AIRTRAIN|INTEGER|AirTrain|\n", 285 | "|Q7M_LTPARK|INTEGER|Long term parking lot shuttle|\n", 286 | "|Q7N_RENTAL|INTEGER|Airport rental car center|\n", 287 | "|Q7O_WHOLE|INTEGER|SFO Airport as a whole|\n", 288 | "\n", 289 | "Q7O_WHOLE is the target variable \n", 290 | "\n", 291 | "The possible values for the above are:\n", 292 | "\n", 293 | "**0 = no answer, 1 = Unacceptable, 2 = Below Average, 3 = Average, 4 = Good, 5 = Outstanding, 6 = Not visited or not applicable**\n", 294 | "\n", 295 | "Select only the fields we are interested in." 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 8, 301 | "metadata": {}, 302 | "outputs": [], 303 | "source": [ 304 | "dataset = survey.select(\"Q7A_ART\", \"Q7B_FOOD\", \"Q7C_SHOPS\", \"Q7D_SIGNS\", \"Q7E_WALK\", \"Q7F_SCREENS\", \"Q7G_INFOARR\", \"Q7H_INFODEP\", \"Q7I_WIFI\", \"Q7J_ROAD\", \"Q7K_PARK\", \"Q7L_AIRTRAIN\", \"Q7M_LTPARK\", \"Q7N_RENTAL\", \"Q7O_WHOLE\")" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "Let's get some basic statistics such as looking at the **average of each column**." 
312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 9, 317 | "metadata": {}, 318 | "outputs": [ 319 | { 320 | "data": { 321 | "text/plain": [ 322 | "\"'missingValues(Q7A_ART) Q7A_ART', 'missingValues(Q7B_FOOD) Q7B_FOOD', 'missingValues(Q7C_SHOPS) Q7C_SHOPS', 'missingValues(Q7D_SIGNS) Q7D_SIGNS', 'missingValues(Q7E_WALK) Q7E_WALK', 'missingValues(Q7F_SCREENS) Q7F_SCREENS', 'missingValues(Q7G_INFOARR) Q7G_INFOARR', 'missingValues(Q7H_INFODEP) Q7H_INFODEP', 'missingValues(Q7I_WIFI) Q7I_WIFI', 'missingValues(Q7J_ROAD) Q7J_ROAD', 'missingValues(Q7K_PARK) Q7K_PARK', 'missingValues(Q7L_AIRTRAIN) Q7L_AIRTRAIN', 'missingValues(Q7M_LTPARK) Q7M_LTPARK', 'missingValues(Q7N_RENTAL) Q7N_RENTAL', 'missingValues(Q7O_WHOLE) Q7O_WHOLE'\"" 323 | ] 324 | }, 325 | "execution_count": 9, 326 | "metadata": {}, 327 | "output_type": "execute_result" 328 | } 329 | ], 330 | "source": [ 331 | "a = map(lambda s: \"'missingValues(\" + s +\") \" + s + \"'\",[\"Q7A_ART\", \"Q7B_FOOD\", \"Q7C_SHOPS\", \"Q7D_SIGNS\", \"Q7E_WALK\", \"Q7F_SCREENS\", \"Q7G_INFOARR\", \"Q7H_INFODEP\", \"Q7I_WIFI\", \"Q7J_ROAD\", \"Q7K_PARK\", \"Q7L_AIRTRAIN\", \"Q7M_LTPARK\", \"Q7N_RENTAL\", \"Q7O_WHOLE\"])\n", 332 | "\", \".join(a)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "Let's start with the overall rating." 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 10, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/plain": [ 350 | "[Row(Q7O_WHOLE=3.8743988684582744)]" 351 | ] 352 | }, 353 | "execution_count": 10, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "from pyspark.sql.functions import *\n", 360 | "dataset.selectExpr('avg(Q7O_WHOLE) Q7O_WHOLE').take(1)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "The overall rating is only 3.87, so slightly above average. 
Let's get the averages of the constituent ratings:" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 11, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "data": { 377 | "text/plain": [ 378 | "DataFrame[Q7A_ART: double, Q7B_FOOD: double, Q7C_SHOPS: double, Q7D_SIGNS: double, Q7E_WALK: double, Q7F_SCREENS: double, Q7G_INFOARR: double, Q7H_INFODEP: double, Q7I_WIFI: double, Q7J_ROAD: double, Q7K_PARK: double, Q7L_AIRTRAIN: double, Q7M_LTPARK: double, Q7N_RENTAL: double]" 379 | ] 380 | }, 381 | "metadata": {}, 382 | "output_type": "display_data" 383 | } 384 | ], 385 | "source": [ 386 | "avgs = dataset.selectExpr('avg(Q7A_ART) Q7A_ART', 'avg(Q7B_FOOD) Q7B_FOOD', 'avg(Q7C_SHOPS) Q7C_SHOPS', 'avg(Q7D_SIGNS) Q7D_SIGNS', 'avg(Q7E_WALK) Q7E_WALK', 'avg(Q7F_SCREENS) Q7F_SCREENS', 'avg(Q7G_INFOARR) Q7G_INFOARR', 'avg(Q7H_INFODEP) Q7H_INFODEP', 'avg(Q7I_WIFI) Q7I_WIFI', 'avg(Q7J_ROAD) Q7J_ROAD', 'avg(Q7K_PARK) Q7K_PARK', 'avg(Q7L_AIRTRAIN) Q7L_AIRTRAIN', 'avg(Q7M_LTPARK) Q7M_LTPARK', 'avg(Q7N_RENTAL) Q7N_RENTAL')\n", 387 | "display(avgs)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 12, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/plain": [ 398 | "DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int]" 399 | ] 400 | }, 401 | "metadata": {}, 402 | "output_type": "display_data" 403 | } 404 | ], 405 | "source": [ 406 | "display(dataset)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "So basic statistics can't seem to answer the question: **What factors drove the customer to give the overall rating?**\n", 414 | "\n", 415 | "So let's try to use a predictive algorithm to see if these individual ratings can be used to predict an overall rating." 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "## 2. Create a Model" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "First need to treat responses of 0 = No Answer and 6 = Not Visited or Not Applicable as missing values. One of the ways you can do this is a technique called mean impute which is when we use the mean of the column as a replacement for the missing value. You can use a replace function to set all values of 0 or 6 to the average rating of 3. You also need a label column of type double so do that as well." 
430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": 13, 435 | "metadata": {}, 436 | "outputs": [], 437 | "source": [ 438 | "training = dataset.withColumn(\"label\", dataset['Q7O_WHOLE']*1.0).na.replace(0,3).replace(6,3)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 14, 444 | "metadata": {}, 445 | "outputs": [ 446 | { 447 | "data": { 448 | "text/plain": [ 449 | "DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, label: double]" 450 | ] 451 | }, 452 | "metadata": {}, 453 | "output_type": "display_data" 454 | } 455 | ], 456 | "source": [ 457 | "display(training)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "##### Create 'Model Pipeline'" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 15, 470 | "metadata": {}, 471 | "outputs": [], 472 | "source": [ 473 | "from pyspark.ml import Pipeline\n", 474 | "from pyspark.ml.feature import VectorAssembler\n", 475 | "from pyspark.ml.regression import DecisionTreeRegressor\n", 476 | "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", 477 | "from pyspark.ml.evaluation import RegressionEvaluator\n", 478 | "\n", 479 | "inputCols = ['Q7A_ART', 'Q7B_FOOD', 'Q7C_SHOPS', 'Q7D_SIGNS', 'Q7E_WALK', 'Q7F_SCREENS', 'Q7G_INFOARR', 'Q7H_INFODEP', 'Q7I_WIFI', 'Q7J_ROAD', 'Q7K_PARK', 'Q7L_AIRTRAIN', 'Q7M_LTPARK', 'Q7N_RENTAL']\n", 480 | "va = VectorAssembler(inputCols=inputCols,outputCol=\"features\")\n", 481 | "dt = DecisionTreeRegressor(labelCol=\"label\", featuresCol=\"features\", maxDepth=4)\n", 482 | "evaluator = RegressionEvaluator(metricName = \"rmse\", labelCol=\"label\")\n", 483 | "grid = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 7, 10]).build()\n", 484 | "cv = CrossValidator(estimator=dt, estimatorParamMaps=grid, evaluator=evaluator, numFolds = 10)\n", 485 | "pipeline = Pipeline(stages=[va, dt])" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "## 3. 
Train a Model" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": 16, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "model = pipeline.fit(training)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 17, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/plain": [ 512 | "DecisionTreeRegressionModel (uid=DecisionTreeRegressor_73754465424e) of depth 4 with 31 nodes" 513 | ] 514 | }, 515 | "metadata": {}, 516 | "output_type": "display_data" 517 | } 518 | ], 519 | "source": [ 520 | "display(model.stages[-1])" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 18, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/plain": [ 531 | "DataFrame[Q7A_ART: int, Q7B_FOOD: int, Q7C_SHOPS: int, Q7D_SIGNS: int, Q7E_WALK: int, Q7F_SCREENS: int, Q7G_INFOARR: int, Q7H_INFODEP: int, Q7I_WIFI: int, Q7J_ROAD: int, Q7K_PARK: int, Q7L_AIRTRAIN: int, Q7M_LTPARK: int, Q7N_RENTAL: int, Q7O_WHOLE: int, label: double, features: vector, prediction: double]" 532 | ] 533 | }, 534 | "metadata": {}, 535 | "output_type": "display_data" 536 | } 537 | ], 538 | "source": [ 539 | "predictions = model.transform(training)\n", 540 | "display(predictions)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "## 4. Evaluate the model" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 19, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/plain": [ 558 | "0.555808023551782" 559 | ] 560 | }, 561 | "execution_count": 19, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "from pyspark.ml.evaluation import RegressionEvaluator\n", 568 | "\n", 569 | "evaluator = RegressionEvaluator()\n", 570 | "\n", 571 | "evaluator.evaluate(predictions, {evaluator.metricName: \"rmse\"})" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "## 5. Save the model" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": null, 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [ 587 | "import uuid\n", 588 | "model_save_path = f\"/tmp/sfo_survey_model/{str(uuid.uuid4())}\"\n", 589 | "model.write().overwrite().save(model_save_path)" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "## 6. Feature Importance\n", 597 | "Feature importance is a measure of information gain. It is scaled from 0.0 to 1.0. As an example, feature 1 in the example above is rated as 0.0826 or 8.26% of the total importance for all the features." 
598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 21, 603 | "metadata": {}, 604 | "outputs": [ 605 | { 606 | "data": { 607 | "text/plain": [ 608 | "SparseVector(14, {0: 0.0653, 1: 0.1173, 2: 0.0099, 3: 0.5219, 4: 0.0052, 5: 0.2403, 8: 0.0028, 10: 0.0059, 13: 0.0314})" 609 | ] 610 | }, 611 | "execution_count": 21, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "model.stages[1].featureImportances" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 22, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "featureImportance = model.stages[1].featureImportances.toArray()\n", 627 | "featureNames = map(lambda s: s.name, dataset.schema.fields)\n", 628 | "featureImportanceMap = zip(featureImportance, featureNames)" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 23, 634 | "metadata": {}, 635 | "outputs": [ 636 | { 637 | "data": { 638 | "text/plain": [ 639 | "" 640 | ] 641 | }, 642 | "execution_count": 23, 643 | "metadata": {}, 644 | "output_type": "execute_result" 645 | } 646 | ], 647 | "source": [ 648 | "featureImportanceMap" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": null, 654 | "metadata": {}, 655 | "outputs": [], 656 | "source": [ 657 | "importancesDf = spark.createDataFrame(spark.parallelize(featureImportanceMap).map(lambda r: [r[1], float(r[0])]))\n", 658 | "\n", 659 | "importancesDf = importancesDf.withColumnRenamed(\"_1\", \"Feature\").withColumnRenamed(\"_2\", \"Importance\")" 660 | ] 661 | }, 662 | { 663 | "cell_type": "markdown", 664 | "metadata": {}, 665 | "source": [ 666 | "Let's convert this to a DataFrame so you can view it and save it so other users can rely on this information." 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [ 675 | "display(importancesDf.orderBy(desc(\"Importance\")))" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "As you can see below, the 3 most important features are:\n", 683 | "\n", 684 | "- Signs\n", 685 | "- Screens\n", 686 | "- Food\n", 687 | "\n", 688 | "This is useful information for the airport management. It means that people want to first know where they are going. Second, they check the airport screens and monitors so they can find their gate and be on time for their flight. Third, they like to have good quality food.\n", 689 | "\n", 690 | "This is especially interesting considering that taking the average of these feature variables told us nothing about the importance of the variables in determining the overall rating by the survey responder.\n", 691 | "\n", 692 | "These 3 features combine to make up **65**% of the overall rating." 
693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": {}, 699 | "outputs": [], 700 | "source": [ 701 | "importancesDf.orderBy(desc(\"Importance\")).limit(3).agg(sum(\"Importance\")).take(1)" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "metadata": {}, 708 | "outputs": [], 709 | "source": [ 710 | "# See it in Piechart\n", 711 | "display(importancesDf.orderBy(desc(\"Importance\")))" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": null, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "display(importancesDf.orderBy(desc(\"Importance\")).limit(5))" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "## 7. Conclusion\n", 728 | "So if you run SFO, artwork and shopping are nice-to-haves but signs, monitors, and food are what keep airport customers happy!" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "metadata": {}, 735 | "outputs": [], 736 | "source": [ 737 | "# delete saved model\n", 738 | "dbutils.fs.rm(model_save_path, True)" 739 | ] 740 | } 741 | ], 742 | "metadata": { 743 | "kernelspec": { 744 | "display_name": "Python 3", 745 | "language": "python", 746 | "name": "python3" 747 | }, 748 | "language_info": { 749 | "codemirror_mode": { 750 | "name": "ipython", 751 | "version": 3 752 | }, 753 | "file_extension": ".py", 754 | "mimetype": "text/x-python", 755 | "name": "python", 756 | "nbconvert_exporter": "python", 757 | "pygments_lexer": "ipython3", 758 | "version": "3.7.6" 759 | } 760 | }, 761 | "nbformat": 4, 762 | "nbformat_minor": 4 763 | } 764 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PySpark 2 | 3 | ![logo](./img/spark.png) 4 | 5 | Spark is a great language for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform. Spark is a must for anyone who is dealing with Big-Data. Using PySpark (which is a Python API for Spark) to process large amounts of data in a distributed fashion is a great way to manage large-scale data-heavy tasks and gain business insights while not sacrificing on developer efficiency. In a few words, PySpark is a fast and powerful framework to perform massive distributed processing over resilient sets of data. 6 | 7 |
8 | 9 | ## Motivation 10 | 11 | I felt that any organization that deals with big data and data warehouses needs some kind of distributed system. Being one of the most widely used distributed systems, Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. But most importantly, it's simple and fast. 12 | 13 | I thought data professionals could benefit from learning its logistics and actual usage. Spark also offers a Python API for easy data managing with Python (Jupyter). So, I have created this repository to show several examples of PySpark functions and utilities that can be used to build a complete ETL process for your data modeling. The posts are geared more towards people who are already familiar with Python and a bit of data analytics knowledge (so I often skip the environment set-up), but you can always follow the [Installation section](#Installation) if not familiar, and then you should be able to follow the notebooks with no big issues. PySpark allows us to use Data Scientists' favorite [Jupyter Notebook](https://jupyter.org/) with many pre-built functions to help process your data. The contents in this repo are an attempt to help you get up and running on PySpark in no time! 14 | 15 |
16 | 17 | ## Table of contents 18 | * [Installation](#Installation) 19 | * [PySpark Notebooks](#PySpark) 20 | * [Contact](#Contact) 21 | * [Reference](#Reference) 22 | 23 |
24 | 25 | ## Installation 26 | 27 | Downloading PySpark on your local machine can be a little bit tricky at first, but the steps below should get you there. 28 | 29 | First things first, make sure you have Jupyter Notebook installed. 30 | 31 | 1. Install Jupyter notebook 32 | 33 | ``` 34 | pip install jupyter notebook 35 | ``` 36 | 37 | 2. Install PySpark 38 | Make sure you have Java 8 or higher installed on your computer. 39 | Most likely Java 10 would throw an error, so the recommended solution is to install Java 8 (Spark 2.2.1 was having problems with Java 9 and beyond). 40 | 41 | Of course, you will also need Python (I recommend > Python 3.5 from Anaconda). 42 | Now visit the [Spark downloads page](http://spark.apache.org/downloads.html). Select the latest Spark release, a prebuilt package for Hadoop, and download it directly. 43 | 44 | 45 | 3. Set the environment variables: 46 | 47 | ``` 48 | SPARK_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7 49 | PATH += D:\Spark\spark-2.3.0-bin-hadoop2.7\bin 50 | ``` 51 | 52 | 4. For Windows users, 53 | - Download winutils.exe from here: https://github.com/steveloughran/winutils 54 | - Choose the same version as the package type you chose for the Spark .tgz file in step 2 “Spark: Download and Install” (in my case: hadoop-2.7.1) 55 | - You need to navigate inside the hadoop-X.X.X folder, and inside the bin folder you will find winutils.exe 56 | - If you chose the same version as me (hadoop-2.7.1) here is the direct link: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe 57 | - Move the winutils.exe file to the bin folder inside SPARK_HOME (i.e. C:\Spark\spark-2.3.0-bin-hadoop2.7\bin) 58 | - Set the following environment variable to be the same as SPARK_HOME: 59 | HADOOP_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7 60 | 61 | 62 | 5. Restart (or just re-source) your terminal and run the pyspark command to launch PySpark: 63 | ``` 64 | $ pyspark 65 | ``` 66 | 67 | For video instructions on installation for Windows/Mac/Ubuntu machines, please refer to the links below 68 | - Windows [Link](https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c) 69 | - Mac [Link](https://medium.com/@GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b) 70 | - Ubuntu [Link](https://medium.com/@GalarnykMichael/install-spark-on-ubuntu-pyspark-231c45677de0) 71 | 72 | Or see more blogs I found on installation steps 73 | - https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec 74 | - https://medium.com/big-data-engineering/how-to-install-apache-spark-2-x-in-your-pc-e2047246ffc3 75 | 76 | 77 |
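If everything above is set up, a quick smoke test like the sketch below should confirm the install. The findspark path is only an example from my machine; point it at your own SPARK_HOME (or skip the findspark lines if pyspark is already on your PYTHONPATH).

```
# Quick smoke test for a local PySpark install
# (the SPARK_HOME path below is an example -- use your own)
import findspark
findspark.init('C:/Spark/spark-2.3.0-bin-hadoop2.7')

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('install-check').getOrCreate()

# Create a tiny DataFrame and show it to confirm the session works
df = spark.createDataFrame([(1, 'ok'), (2, 'spark')], ['id', 'status'])
df.show()

spark.stop()
```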
78 | 79 | ## PySpark 80 | 81 | These examples display unique functionalities available in PySpark. They cover a broad range of topics with different methods that users can utilize inside PySpark. 82 | 83 | 84 | #### [PySpark and SparkSQL Complete Guide](https://github.com/hyunjoonbok/PySpark/blob/master/PySpark%20and%20SparkSQL%20Complete%20Guide.ipynb) 85 |

86 | Apache Spark is a cluster computing system that offers comprehensive libraries and APIs for developers, and SparkSQL is the module in Apache Spark for processing structured data with the help of the DataFrame API. In this notebook, we will cover the basics of how to run Spark jobs with PySpark (the Python API) and execute useful functions inside. If followed, you should be able to grasp a basic understanding of PySpark and its common functions. 87 |
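As a small taste of what the notebook covers, the sketch below builds a toy DataFrame (the table and column names are made up for illustration), registers it as a temporary view, and queries it with SparkSQL:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparksql-example').getOrCreate()

# Toy data standing in for anything you would normally load from CSV/JSON
cases = spark.createDataFrame(
    [('Seoul', 139), ('Busan', 12), ('Daegu', 4511)],
    ['province', 'confirmed'])

# Register the DataFrame as a temp view so it can be queried with plain SQL
cases.createOrReplaceTempView('cases')

spark.sql("""
    SELECT province, confirmed
    FROM cases
    WHERE confirmed > 100
    ORDER BY confirmed DESC
""").show()
```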

88 | 89 | #### [PySpark Dataframe Complete Guide (with COVID-19 Dataset)](https://github.com/hyunjoonbok/PySpark/blob/master/PySpark%20Dataframe%20Complete%20Guide%20(with%20COVID-19%20Dataset).ipynb) 90 |

91 | Spark is one of the most used tools when it comes to working with Big Data. Whereas Spark used to be heavily reliant on RDD manipulations, it now provides a DataFrame API for us Data Scientists to work with. So in this notebook, we will learn the standard Spark functionalities needed to work with DataFrames, and finally some tips to handle the inevitable errors you will face. 92 |
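For a taste of the DataFrame API, here is a minimal sketch that loads one of the bundled COVID-19 files (`data/Case.csv`) and runs a few common operations. The column names follow that file, and an active `SparkSession` named `spark` is assumed.

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('dataframe_basics').getOrCreate()

# read the CSV that ships with this repo into a DataFrame
cases = spark.read.csv('data/Case.csv', header=True, inferSchema=True)

cases.printSchema()   # inspect the inferred column types

# filter and select with the DataFrame API
cases.select('province', 'city', 'confirmed') \
     .filter(F.col('confirmed') > 100) \
     .show(5)

# aggregate confirmed cases per province
cases.groupBy('province') \
     .agg(F.sum('confirmed').alias('total_confirmed')) \
     .orderBy(F.desc('total_confirmed')) \
     .show(5)
```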

93 | 94 | #### [Binary Tabular Data Classification with PySpark (Tabular Data)](https://github.com/hyunjoonbok/PySpark/blob/master/Binary%20Tabular%20Data%20Classification%20with%20PySpark.ipynb) 95 |

96 | This notebook covers a classification problem in Machine Learning and goes through a comprehensive guide to successfully develop an end-to-end ML class prediction model using PySpark. We will use a number of different supervised algorithms to precisely predict individuals’ income using data collected from the 1994 U.S. Census. We will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data. Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000. 97 |
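The general PySpark workflow behind that notebook looks roughly like the sketch below. It is illustrative only: the feature column names are placeholders standing in for the ones engineered from the census data, and `df` is assumed to already contain them plus a 0/1 `label` column.

```
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# assemble the numeric feature columns into a single vector column
assembler = VectorAssembler(inputCols=['age', 'education_num', 'hours_per_week'],
                            outputCol='features')
data = assembler.transform(df).select('features', 'label')

train, test = data.randomSplit([0.8, 0.2], seed=42)

# train one candidate algorithm and score it on the hold-out split
lr = LogisticRegression(featuresCol='features', labelCol='label')
model = lr.fit(train)

evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')
print('AUC:', evaluator.evaluate(model.transform(test)))
```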

98 | 99 | #### [End-to-End Binary Classification ML Model with PySpark and MLlib (1)](https://github.com/hyunjoonbok/PySpark/blob/master/End-to-End%20Machine%20Learning%20Model%20using%20PySpark%20and%20MLlib.ipynb) 100 |

101 | In-memory computation and parallel processing are some of the major reasons that Apache Spark has become very popular in the big data industry for dealing with data products at large scale and performing faster analysis. We are going to use a real-world dataset from the Home Credit Default Risk competition on Kaggle. The target variable is either 0 (applicants who were able to pay back their loans) or 1 (applicants who were NOT able to pay back their loans). It is a binary classification problem with a highly imbalanced target label. 102 |
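Because the label is highly imbalanced, one common PySpark technique (shown here only as a hedged sketch, not necessarily the exact approach in the notebook) is to add a class-weight column and pass it to the classifier through `weightCol`.

```
from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegression

# `train` is an assumed DataFrame with a `features` vector and a 0/1 `label` column
positive_ratio = train.filter(F.col('label') == 1).count() / train.count()

# weight the rare positive class more heavily than the common negative class
weighted = train.withColumn(
    'class_weight',
    F.when(F.col('label') == 1, 1.0 - positive_ratio).otherwise(positive_ratio))

lr = LogisticRegression(featuresCol='features', labelCol='label',
                        weightCol='class_weight')
model = lr.fit(weighted)
```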

103 | 104 | #### [End-to-End Binary Classification ML Model with PySpark and MLlib (2)](https://github.com/hyunjoonbok/PySpark/blob/master/End-to-End%20Machine%20Learning%20Model%20using%20PySpark%20and%20MLlib%20(2).ipynb) 105 |

106 | Machine learning in the real world is messy. Data sources contain missing values, include redundant rows, or may not fit in memory. Feature engineering often requires domain expertise and can be tedious. Modeling too often mixes data science and systems engineering, requiring not only knowledge of algorithms but also of machine architecture and distributed systems. In this notebook, we build a model to predict the quality of Portuguese "Vinho Verde" wine based on the wine's physicochemical properties. It covers data importing, visualization, parallel hyperparameter computation, exploring the best-performing model in MLflow, etc. 107 |
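The parallel hyperparameter computation mentioned above is usually done with Spark ML's `CrossValidator`. The sketch below is illustrative only; `train` and the `quality_label` column are assumptions standing in for the wine data prepared in the notebook.

```
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol='features', labelCol='quality_label')

# candidate hyperparameter combinations to evaluate
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol='quality_label'),
                    numFolds=3,
                    parallelism=4)   # fit candidate models in parallel

cv_model = cv.fit(train)
print(cv_model.avgMetrics)           # cross-validated score per parameter combination
```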

108 | 109 | 110 | #### [Multi-class Text Classification Problem with PySpark and MLlib](https://github.com/hyunjoonbok/PySpark/blob/master/Multi-class%20Text%20Classification%20Problem%20with%20PySpark%20and%20MLlib.ipynb) 111 |

112 | Apache Spark is quickly gaining steam both in the headlines and in real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for us to be able to stream and analyze it in real time. We use the Spark Machine Learning Library (Spark MLlib) to solve a multi-class text classification problem. 113 |
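A typical MLlib text-classification pipeline looks like the following sketch. The column names (`text`, `category`) are placeholders rather than the exact ones used in the notebook.

```
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

tokenizer  = Tokenizer(inputCol='text', outputCol='words')
remover    = StopWordsRemover(inputCol='words', outputCol='filtered')
vectorizer = CountVectorizer(inputCol='filtered', outputCol='tf')
idf        = IDF(inputCol='tf', outputCol='features')
indexer    = StringIndexer(inputCol='category', outputCol='label')
lr         = LogisticRegression(featuresCol='features', labelCol='label')

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, idf, indexer, lr])

# `df` is an assumed DataFrame with `text` and `category` columns
model = pipeline.fit(df)
model.transform(df).select('category', 'prediction').show(5)
```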

114 | 115 | #### [Multi-class classification using Decision Tree Problem with PySpark](https://github.com/hyunjoonbok/PySpark/blob/master/Multi-class%20classification%20using%20Decision%20Tree%20Problem%20with%20PySpark%20.ipynb) 116 |

117 | This notebook covers a full multi-class classification problem using the Decision Tree method on the SFO airport survey data to predict the overall rating each customer gives. It covers a complete cycle of modeling (data loading, model creation, model evaluation, feature importance). 118 |
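In PySpark, the decision-tree part of that cycle looks roughly like this sketch; `train` and `test` are assumed DataFrames whose `features` vector and `label` column were prepared from the survey data.

```
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

dt = DecisionTreeClassifier(featuresCol='features', labelCol='label', maxDepth=5)
model = dt.fit(train)

predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol='label', metricName='accuracy')
print('accuracy:', evaluator.evaluate(predictions))

# which features mattered most for the predicted rating
print(model.featureImportances)
```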

119 | 120 | #### [Setting up Fast Hyperparameter Search Framework with Pyspark](https://github.com/hyunjoonbok/PySpark/blob/master/Setting%20up%20Fast%20Hyperparameter%20Search%20Framework%20with%20Pyspark.ipynb) 121 |

122 | In this notebook, we set up a hyperparameter tuning framework in PySpark using machine learning libraries like scikit-learn/xgboost/lightgbm. Manual tuning usually means changing a lot of parameters, and Hyperopt works on only one model at a time, so it was taking a lot of time to train each model. But what if we could parallelize the hyperparameter search process? We could load real data using Spark, but here we start by creating our own classification data to set up a minimal example to work with. 123 |
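The core idea can be sketched as follows: broadcast a small dataset to the executors and let Spark evaluate candidate parameter settings in parallel. This is a minimal illustration under the assumptions that scikit-learn is installed on the workers and that an active `SparkSession` named `spark` exists; it is not the exact framework built in the notebook.

```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# small synthetic dataset, shipped to every executor via a broadcast variable
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
data_bc = spark.sparkContext.broadcast((X, y))

param_grid = [{'n_estimators': n, 'max_depth': d}
              for n in (50, 100, 200) for d in (3, 5, None)]

def evaluate(params):
    X, y = data_bc.value
    clf = RandomForestClassifier(random_state=0, **params)
    return params, cross_val_score(clf, X, y, cv=3).mean()

# each Spark task trains and scores one candidate model
results = (spark.sparkContext
           .parallelize(param_grid, len(param_grid))
           .map(evaluate)
           .collect())
print(max(results, key=lambda r: r[1]))
```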

124 | 125 | #### [PySpark Know-How in Practice (Advanced)](https://github.com/hyunjoonbok/PySpark/blob/master/%5BAdvanced%5D%20Spark%20Know-How%20in%20Pratice%20.ipynb) 126 |

127 | This notebook introduces a number of advanced Spark tips that can be applied to speed up data processing. We go over two important methods to increase performance and reduce cost: repartitioning and coalescing. Parallelism lets millions of tasks be performed simultaneously and independently across the many machines in a cluster. Under the hood, each DataFrame (RDD) is stored in partitions on different cluster nodes. 128 |
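As a quick illustration of the two methods (a hedged sketch with a made-up DataFrame): `repartition` performs a full shuffle and can move to any number of partitions, while `coalesce` only merges existing partitions and avoids a shuffle, which makes it the cheaper way to reduce the partition count.

```
# assumed demo DataFrame on an active SparkSession named `spark`
df = spark.range(1000000)
print(df.rdd.getNumPartitions())     # current partition count

repartitioned = df.repartition(200)  # full shuffle; can increase or decrease partitions
coalesced = df.coalesce(4)           # no shuffle; can only decrease partitions

print(repartitioned.rdd.getNumPartitions(), coalesced.rdd.getNumPartitions())
```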

129 | 130 | #### [5 Spark Tips that will get you to another level (Advanced)](https://github.com/hyunjoonbok/PySpark/blob/master/%5BAdvanced%5D%205%20Spark%20Tips%20that%20will%20get%20you%20to%20another%20level.ipynb) 131 |

132 | The five advanced tips that will make you a master of Spark are here. Make sure not to skip any of them! 133 |

134 | 135 | 136 | 137 |
138 | 139 | ## Contact 140 | Created by [@hyunjoonbok](https://www.linkedin.com/in/hyunjoonbok/) - feel free to contact me! 141 | 142 | 143 | ## Resource 144 | - Ultimate PySpark Cheat Sheet [Blog](https://towardsdatascience.com/ultimate-pyspark-cheat-sheet-7d3938d13421) 145 | - Use Apache Arrow to Assist PySpark in Data Processing [Medium](https://medium.com/datadriveninvestor/use-apache-arrow-to-assist-pyspark-in-data-processing-6c1cce134306) 146 | - Real Python [Website](https://realpython.com/) 147 | - Luminousmen Blog [Blog](https://luminousmen.com/) 148 | 149 | 150 | ## Reference 151 | - Victor Romain's [Medium](https://towardsdatascience.com/@rromanss23?source=post_page-----485fb3c94e5e----------------------) 152 | - Official PySpark Documentation [Link](https://spark.apache.org/docs/latest/api/python/index.html) 153 | -------------------------------------------------------------------------------- /[Advanced] 5 Spark Tips that will get you to another level.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# [Advanced] 5 Spark Tips that will get you to another level" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "There are many different tools in the world, each of which solves a range of problems. Many of them are judged by how well and how correctly they solve a given problem, but there are tools that you simply like and want to use. They are properly designed and fit well in your hand; you do not need to dig into the documentation to figure out how to do this or that simple action. This series of posts is about one of those tools for me.\n", 15 | "\n", 16 | "[Reference](https://luminousmen.com/)" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## 1. DO NOT collect data on Local Driver" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "If your RDD/DataFrame is so large that all its elements will not fit into the driver machine memory, **DO NOT** do the following:" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "data = df.collect()" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "The collect action will try to move all data in the RDD/DataFrame to the driver machine, where it may run out of memory and crash. Instead, limit the number of items returned by calling 'take' or 'takeSample', or by filtering your RDD/DataFrame first." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## 2. Specify the schema" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "When reading CSV and JSON files, you get better performance by specifying the schema, instead of using the inference mechanism - specifying the schema reduces errors and is recommended for production code."
61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "from pyspark.sql.types import (StructType, StructField, \n", 70 | " DoubleType, IntegerType, StringType)\n", 71 | "\n", 72 | "schema = StructType([ \n", 73 | " StructField('A', IntegerType(), nullable=False), \n", 74 | " StructField('B', DoubleType(), nullable=False), \n", 75 | " StructField('C', StringType(), nullable=False)\n", 76 | "])\n", 77 | "\n", 78 | "df = spark.read.csv('/some/input/file.csv', schema=schema)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "Also, use the right file format. Avro has easy serialization/deserialization, which allows for efficient integration of ingestion processes. Meanwhile, Parquet allows you to work effectively when selecting specific columns and can be effective for storing intermediate files. But Parquet files are immutable and modifications require overwriting the whole data set, whereas Avro files can easily cope with frequent schema changes.\n", 86 | "Reference: https://luminousmen.com/post/big-data-file-formats" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## 3. Avoid reduceByKey when the input and output value types are different" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "If for any reason you have RDD-based jobs, use reduceByKey operations wisely.\n", 101 | "\n", 102 | "Consider the job of creating a set of strings for each key:" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "rdd.map(lambda p: (p[0], {p[1]})) \\\n", 112 | " .reduceByKey(lambda x, y: x | y) \\\n", 113 | " .collect()" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "Note that the input values are strings and the output values are sets. The map operation creates lots of temporary small objects. A better way to handle this scenario is to use aggregateByKey:" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "def merge_vals(xs, x):\n", 130 | " xs.add(x)\n", 131 | " return xs\n", 132 | "\n", 133 | "def combine(xs, ys):\n", 134 | " return xs | ys\n", 135 | "\n", 136 | "rdd.aggregateByKey(set(), merge_vals, combine).collect()" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "## 4. Don't use count when you don't need to return the exact number of rows" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "When you don't need to return the exact number of rows, it's more efficient to use" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "df = spark.read.json(...)\n", 160 | "if not len(df.take(1)):\n", 161 | " ..." 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "#### instead of" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "if not df.count():\n", 178 | " ..." 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## 5. Using bucketing in PySpark"
186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "Bucketing is an optimization method that breaks down data into more manageable parts (buckets) to determine the data partitioning while it is written out. The motivation for this method is to make successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. In our example, we can optimize the execution of join queries by avoiding shuffles (also known as exchanges) of the tables involved in the join. Using bucketing leads to a smaller number of exchanges (and, consequently, stages), because shuffling may not be required — both DataFrames may already be located in the same partitions.\n", 193 | "\n", 194 | "Bucketing is on by default. Spark uses the configuration property spark.sql.sources.bucketing.enabled to control whether or not it should be enabled and used to optimize requests.\n", 195 | "\n", 196 | "Bucketing determines the physical layout of the data, so we shuffle the data beforehand because we want to avoid such shuffling later in the process.\n", 197 | "\n", 198 | "Okay, do I really need to do an extra step if the shuffle is to be executed anyway?\n", 199 | "\n", 200 | "If you join several times, then yes. \n", 201 | "\n", 202 | "**The more times you join, the better the performance gains.**\n", 203 | "\n", 204 | "An example of how to create a bucketed table:" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "df.write\\\n", 214 | " .bucketBy(16, 'key') \\\n", 215 | " .sortBy('value') \\\n", 216 | " .saveAsTable('bucketed', format='parquet')" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Thus, here bucketBy distributes data to a fixed number of buckets (16 in our case) and can be used when the number of unique values is not limited. If the number of unique values is limited, it's better to use partitioning instead of bucketing." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "t2 = spark.table('bucketed')\n", 233 | "t3 = spark.table('bucketed')\n", 234 | "\n", 235 | "# bucketed - bucketed join. \n", 236 | "# Both sides have the same bucketing, and no shuffles are needed.\n", 237 | "t3.join(t2, 'key').explain()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "Apart from the single-stage sort-merge join, bucketing also supports quick data sampling. As of Spark 2.4, Spark SQL supports bucket pruning to optimize filtering on the bucketed column (by reducing the number of bucket files to scan)." 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "Bucketing works well when the number of unique values is unlimited. Columns that are often used in queries and provide high selectivity are a good choice for bucketing. Bucketed Spark tables store metadata about how they are bucketed and sorted, which helps optimize joins, aggregations, and queries for bucketed columns."
252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Reference\n", 259 | "https://spark.apache.org/docs/latest/tuning.html\n", 260 | "Uber Case Study: Choosing the Right HDFS File Format for Your Apache Spark Jobs\n", 261 | "https://luminousmen.com/post/the-5-minute-guide-to-using-bucketing-in-pyspark" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [] 270 | } 271 | ], 272 | "metadata": { 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 | "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.7.6" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 4 293 | } 294 | -------------------------------------------------------------------------------- /data/Case.csv: -------------------------------------------------------------------------------- 1 | case_id,province,city,group,infection_case,confirmed,latitude,longitude 2 | 1000001,Seoul,Yongsan-gu,TRUE,Itaewon Clubs,139,37.538621,126.992652 3 | 1000002,Seoul,Gwanak-gu,TRUE,Richway,119,37.48208,126.901384 4 | 1000003,Seoul,Guro-gu,TRUE,Guro-gu Call Center,95,37.508163,126.884387 5 | 1000004,Seoul,Yangcheon-gu,TRUE,Yangcheon Table Tennis Club,43,37.546061,126.874209 6 | 1000005,Seoul,Dobong-gu,TRUE,Day Care Center,43,37.679422,127.044374 7 | 1000006,Seoul,Guro-gu,TRUE,Manmin Central Church,41,37.481059,126.894343 8 | 1000007,Seoul,from other city,TRUE,SMR Newly Planted Churches Group,36,-,- 9 | 1000008,Seoul,Dongdaemun-gu,TRUE,Dongan Church,17,37.592888,127.056766 10 | 1000009,Seoul,from other city,TRUE,Coupang Logistics Center,25,-,- 11 | 1000010,Seoul,Gwanak-gu,TRUE,Wangsung Church,30,37.481735,126.930121 12 | 1000011,Seoul,Eunpyeong-gu,TRUE,Eunpyeong St. 
Mary's Hospital,14,37.63369,126.9165 13 | 1000012,Seoul,Seongdong-gu,TRUE,Seongdong-gu APT,13,37.55713,127.0403 14 | 1000013,Seoul,Jongno-gu,TRUE,Jongno Community Center,10,37.57681,127.006 15 | 1000014,Seoul,Gangnam-gu,TRUE,Samsung Medical Center,7,37.48825,127.08559 16 | 1000015,Seoul,Jung-gu,TRUE,Jung-gu Fashion Company,7,37.562405,126.984377 17 | 1000016,Seoul,Seodaemun-gu,TRUE,Yeonana News Class,5,37.558147,126.943799 18 | 1000017,Seoul,Jongno-gu,TRUE,Korea Campus Crusade of Christ,7,37.594782,126.968022 19 | 1000018,Seoul,Gangnam-gu,TRUE,Gangnam Yeoksam-dong gathering,6,-,- 20 | 1000019,Seoul,from other city,TRUE,Daejeon door-to-door sales,1,-,- 21 | 1000020,Seoul,Geumcheon-gu,TRUE,Geumcheon-gu rice milling machine manufacture,6,-,- 22 | 1000021,Seoul,from other city,TRUE,Shincheonji Church,8,-,- 23 | 1000022,Seoul,from other city,TRUE,Guri Collective Infection,5,-,- 24 | 1000023,Seoul,Jung-gu,TRUE,KB Life Insurance,13,37.560899,126.966998 25 | 1000024,Seoul,Yeongdeungpo-gu,TRUE,Yeongdeungpo Learning Institute,3,37.520846,126.931278 26 | 1000025,Seoul,Gangnam-gu,TRUE,Gangnam Dongin Church,1,37.522331,127.057388 27 | 1000026,Seoul,Yangcheon-gu,TRUE,Biblical Language study meeting,3,37.524623,126.843118 28 | 1000027,Seoul,Seocho-gu,TRUE,Seocho Family,5,-,- 29 | 1000028,Seoul,from other city,TRUE,Anyang Gunpo Pastors Group,1,-,- 30 | 1000029,Seoul,Gangnam-gu,TRUE,Samsung Fire & Marine Insurance,4,37.498279,127.030139 31 | 1000030,Seoul,Gangseo-gu,TRUE,SJ Investment Call Center,0,37.559649,126.835102 32 | 1000031,Seoul,from other city,TRUE,Yongin Brothers,4,-,- 33 | 1000032,Seoul,Jung-gu,TRUE,Seoul City Hall Station safety worker,3,37.565699,126.977079 34 | 1000033,Seoul,from other city,TRUE,Uiwang Logistics Center,2,-,- 35 | 1000034,Seoul,-,TRUE,Orange Life,1,-,- 36 | 1000035,Seoul,Guro-gu,TRUE,Daezayeon Korea,3,37.486837,126.893163 37 | 1000036,Seoul,-,FALSE,overseas inflow,298,-,- 38 | 1000037,Seoul,-,FALSE,contact with patient,162,-,- 39 | 1000038,Seoul,-,FALSE,etc,100,-,- 40 | 1100001,Busan,Dongnae-gu,TRUE,Onchun Church,39,35.21628,129.0771 41 | 1100002,Busan,from other city,TRUE,Shincheonji Church,12,-,- 42 | 1100003,Busan,Suyeong-gu,TRUE,Suyeong-gu Kindergarten,5,35.16708,129.1124 43 | 1100004,Busan,Haeundae-gu,TRUE,Haeundae-gu Catholic Church,6,35.20599,129.1256 44 | 1100005,Busan,Jin-gu,TRUE,Jin-gu Academy,4,35.17371,129.0633 45 | 1100006,Busan,from other city,TRUE,Itaewon Clubs,4,-,- 46 | 1100007,Busan,from other city,TRUE,Cheongdo Daenam Hospital,1,-,- 47 | 1100008,Busan,-,FALSE,overseas inflow,36,-,- 48 | 1100009,Busan,-,FALSE,contact with patient,19,-,- 49 | 1100010,Busan,-,FALSE,etc,30,-,- 50 | 1200001,Daegu,Nam-gu,TRUE,Shincheonji Church,4511,35.84008,128.5667 51 | 1200002,Daegu,Dalseong-gun,TRUE,Second Mi-Ju Hospital,196,35.857375,128.466651 52 | 1200003,Daegu,Seo-gu,TRUE,Hansarang Convalescent Hospital,124,35.885592,128.556649 53 | 1200004,Daegu,Dalseong-gun,TRUE,Daesil Convalescent Hospital,101,35.857393,128.466653 54 | 1200005,Daegu,Dong-gu,TRUE,Fatima Hospital,39,35.88395,128.624059 55 | 1200006,Daegu,from other city,TRUE,Itaewon Clubs,2,-,- 56 | 1200007,Daegu,from other city,TRUE,Cheongdo Daenam Hospital,2,-,- 57 | 1200008,Daegu,-,FALSE,overseas inflow,41,-,- 58 | 1200009,Daegu,-,FALSE,contact with patient,917,-,- 59 | 1200010,Daegu,-,FALSE,etc,747,-,- 60 | 1300001,Gwangju,Dong-gu,TRUE,Gwangneuksa Temple,5,35.136035,126.956405 61 | 1300002,Gwangju,from other city,TRUE,Shincheonji Church,9,-,- 62 | 1300003,Gwangju,-,FALSE,overseas inflow,23,-,- 63 | 
1300004,Gwangju,-,FALSE,contact with patient,5,-,- 64 | 1300005,Gwangju,-,FALSE,etc,1,-,- 65 | 1400001,Incheon,from other city,TRUE,Itaewon Clubs,53,-,- 66 | 1400002,Incheon,from other city,TRUE,Coupang Logistics Center,42,-,- 67 | 1400003,Incheon,from other city,TRUE,Guro-gu Call Center,20,-,- 68 | 1400004,Incheon,from other city,TRUE,Shincheonji Church,2,-,- 69 | 1400005,Incheon,-,FALSE,overseas inflow,68,-,- 70 | 1400006,Incheon,-,FALSE,contact with patient,6,-,- 71 | 1400007,Incheon,-,FALSE,etc,11,-,- 72 | 1500001,Daejeon,-,TRUE,Door-to-door sales in Daejeon,55,-,- 73 | 1500002,Daejeon,Seo-gu,TRUE,Dunsan Electronics Town,13,36.3400973,127.3927099 74 | 1500003,Daejeon,Seo-gu,TRUE,Orange Town,7,36.3398739,127.3819744 75 | 1500004,Daejeon,Seo-gu,TRUE,Dreaming Church,4,36.346869,127.368594 76 | 1500005,Daejeon,Seo-gu,TRUE,Korea Forest Engineer Institute,3,36.358123,127.388856 77 | 1500006,Daejeon,from other city,TRUE,Shincheonji Church,2,-,- 78 | 1500007,Daejeon,from other city,TRUE,Seosan-si Laboratory,2,-,- 79 | 1500008,Daejeon,-,FALSE,overseas inflow,15,-,- 80 | 1500009,Daejeon,-,FALSE,contact with patient,15,-,- 81 | 1500010,Daejeon,-,FALSE,etc,15,-,- 82 | 1600001,Ulsan,from other city,TRUE,Shincheonji Church,16,-,- 83 | 1600002,Ulsan,-,FALSE,overseas inflow,25,-,- 84 | 1600003,Ulsan,-,FALSE,contact with patient,3,-,- 85 | 1600004,Ulsan,-,FALSE,etc,7,-,- 86 | 1700001,Sejong,Sejong,TRUE,Ministry of Oceans and Fisheries,31,36.504713,127.265172 87 | 1700002,Sejong,Sejong,TRUE,gym facility in Sejong,8,36.48025,127.289 88 | 1700003,Sejong,from other city,TRUE,Shincheonji Church,1,-,- 89 | 1700004,Sejong,-,FALSE,overseas inflow,5,-,- 90 | 1700005,Sejong,-,FALSE,contact with patient,3,-,- 91 | 1700006,Sejong,-,FALSE,etc,1,-,- 92 | 2000001,Gyeonggi-do,Seongnam-si,TRUE,River of Grace Community Church,67,37.455687,127.161627 93 | 2000002,Gyeonggi-do,Bucheon-si,TRUE,Coupang Logistics Center,67,37.530579,126.775254 94 | 2000003,Gyeonggi-do,from other city,TRUE,Itaewon Clubs,59,-,- 95 | 2000004,Gyeonggi-do,from other city,TRUE,Richway,58,-,- 96 | 2000005,Gyeonggi-do,Uijeongbu-si,TRUE,Uijeongbu St. 
Mary’s Hospital,50,37.758635,127.077716 97 | 2000006,Gyeonggi-do,from other city,TRUE,Guro-gu Call Center,50,-,- 98 | 2000007,Gyeonggi-do,from other city,TRUE,Shincheonji Church,29,-,- 99 | 2000008,Gyeonggi-do,from other city,TRUE,Yangcheon Table Tennis Club,28,-,- 100 | 2000009,Gyeonggi-do,-,TRUE,SMR Newly Planted Churches Group,25,-,- 101 | 2000010,Gyeonggi-do,Seongnam-si,TRUE,Bundang Jesaeng Hospital,22,37.38833,127.1218 102 | 2000011,Gyeonggi-do,Anyang-si,TRUE,Anyang Gunpo Pastors Group,22,37.381784,126.93615 103 | 2000012,Gyeonggi-do,Suwon-si,TRUE,Lotte Confectionery logistics center,15,37.287356,127.013827 104 | 2000013,Gyeonggi-do,Anyang-si,TRUE,Lord Glory Church,17,37.403722,126.954939 105 | 2000014,Gyeonggi-do,Suwon-si,TRUE,Suwon Saeng Myeong Saem Church,10,37.2376,127.0517 106 | 2000015,Gyeonggi-do,from other city,TRUE,Korea Campus Crusade of Christ,7,-,- 107 | 2000016,Gyeonggi-do,from other city,TRUE,Geumcheon-gu rice milling machine manufacture,6,-,- 108 | 2000017,Gyeonggi-do,from other city,TRUE,Wangsung Church,6,-,- 109 | 2000018,Gyeonggi-do,from other city,TRUE,Seoul City Hall Station safety worker,5,-,- 110 | 2000019,Gyeonggi-do,Seongnam-si,TRUE,Seongnam neighbors gathering,5,-,- 111 | 2000020,Gyeonggi-do,-,FALSE,overseas inflow,305,-,- 112 | 2000021,Gyeonggi-do,-,FALSE,contact with patient,63,-,- 113 | 2000022,Gyeonggi-do,-,FALSE,etc,84,-,- 114 | 3000001,Gangwon-do,from other city,TRUE,Shincheonji Church,17,-,- 115 | 3000002,Gangwon-do,from other city,TRUE,Uijeongbu St. Mary’s Hospital,10,-,- 116 | 3000003,Gangwon-do,Wonju-si,TRUE,Wonju-si Apartments,4,37.342762,127.983815 117 | 3000004,Gangwon-do,from other city,TRUE,Richway,4,-,- 118 | 3000005,Gangwon-do,from other city,TRUE,Geumcheon-gu rice milling machine manufacture,4,-,- 119 | 3000006,Gangwon-do,-,FALSE,overseas inflow,16,-,- 120 | 3000007,Gangwon-do,-,FALSE,contact with patient,0,-,- 121 | 3000008,Gangwon-do,-,FALSE,etc,7,-,- 122 | 4000001,Chungcheongbuk-do,Goesan-gun,TRUE,Goesan-gun Jangyeon-myeon,11,36.82422,127.9552 123 | 4000002,Chungcheongbuk-do,from other city,TRUE,Itaewon Clubs,9,-,- 124 | 4000003,Chungcheongbuk-do,from other city,TRUE,Guro-gu Call Center,2,-,- 125 | 4000004,Chungcheongbuk-do,from other city,TRUE,Shincheonji Church,6,-,- 126 | 4000005,Chungcheongbuk-do,-,FALSE,overseas inflow,13,-,- 127 | 4000006,Chungcheongbuk-do,-,FALSE,contact with patient,8,-,- 128 | 4000007,Chungcheongbuk-do,-,FALSE,etc,11,-,- 129 | 4100001,Chungcheongnam-do,Cheonan-si,TRUE,gym facility in Cheonan,103,36.81503,127.1139 130 | 4100002,Chungcheongnam-do,from other city,TRUE,Door-to-door sales in Daejeon,10,-,- 131 | 4100003,Chungcheongnam-do,Seosan-si,TRUE,Seosan-si Laboratory,9,37.000354,126.354443 132 | 4100004,Chungcheongnam-do,from other city,TRUE,Richway,3,-,- 133 | 4100005,Chungcheongnam-do,from other city,TRUE,Eunpyeong-Boksagol culture center,3,-,- 134 | 4100006,Chungcheongnam-do,-,FALSE,overseas inflow,16,-,- 135 | 4100007,Chungcheongnam-do,-,FALSE,contact with patient,2,-,- 136 | 4100008,Chungcheongnam-do,-,FALSE,etc,12,-,- 137 | 5000001,Jeollabuk-do,from other city,TRUE,Itaewon Clubs,2,-,- 138 | 5000002,Jeollabuk-do,from other city,TRUE,Door-to-door sales in Daejeon,3,-,- 139 | 5000003,Jeollabuk-do,from other city,TRUE,Shincheonji Church,1,-,- 140 | 5000004,Jeollabuk-do,-,FALSE,overseas inflow,12,-,- 141 | 5000005,Jeollabuk-do,-,FALSE,etc,5,-,- 142 | 5100001,Jeollanam-do,Muan-gun,TRUE,Manmin Central Church,2,35.078825,126.316746 143 | 5100002,Jeollanam-do,from other city,TRUE,Shincheonji Church,1,-,- 144 | 
5100003,Jeollanam-do,-,FALSE,overseas inflow,14,-,- 145 | 5100004,Jeollanam-do,-,FALSE,contact with patient,4,-,- 146 | 5100005,Jeollanam-do,-,FALSE,etc,4,-,- 147 | 6000001,Gyeongsangbuk-do,from other city,TRUE,Shincheonji Church,566,-,- 148 | 6000002,Gyeongsangbuk-do,Cheongdo-gun,TRUE,Cheongdo Daenam Hospital,119,35.64887,128.7368 149 | 6000003,Gyeongsangbuk-do,Bonghwa-gun,TRUE,Bonghwa Pureun Nursing Home,68,36.92757,128.9099 150 | 6000004,Gyeongsangbuk-do,Gyeongsan-si,TRUE,Gyeongsan Seorin Nursing Home,66,35.782149,128.801498 151 | 6000005,Gyeongsangbuk-do,from other city,TRUE,Pilgrimage to Israel,41,-,- 152 | 6000006,Gyeongsangbuk-do,Yechun-gun,TRUE,Yechun-gun,40,36.646845,128.437416 153 | 6000007,Gyeongsangbuk-do,Chilgok-gun,TRUE,Milal Shelter,36,36.0581,128.4941 154 | 6000008,Gyeongsangbuk-do,Gyeongsan-si,TRUE,Gyeongsan Jeil Silver Town,17,35.84819,128.7621 155 | 6000009,Gyeongsangbuk-do,Gyeongsan-si,TRUE,Gyeongsan Cham Joeun Community Center,16,35.82558,128.7373 156 | 6000010,Gyeongsangbuk-do,Gumi-si,TRUE,Gumi Elim Church,10,-,- 157 | 6000011,Gyeongsangbuk-do,-,FALSE,overseas inflow,22,-,- 158 | 6000012,Gyeongsangbuk-do,-,FALSE,contact with patient,190,-,- 159 | 6000013,Gyeongsangbuk-do,-,FALSE,etc,133,-,- 160 | 6100001,Gyeongsangnam-do,from other city,TRUE,Shincheonji Church,32,-,- 161 | 6100002,Gyeongsangnam-do,Geochang-gun,TRUE,Geochang Church,10,35.68556,127.9127 162 | 6100003,Gyeongsangnam-do,Jinju-si,TRUE,Wings Tower,9,35.164845,128.126969 163 | 6100004,Gyeongsangnam-do,Geochang-gun,TRUE,Geochang-gun Woongyang-myeon,8,35.805681,127.917805 164 | 6100005,Gyeongsangnam-do,Changwon-si,TRUE,Hanmaeum Changwon Hospital,7,35.22115,128.6866 165 | 6100006,Gyeongsangnam-do,Changnyeong-gun,TRUE,Changnyeong Coin Karaoke,7,35.54127,128.5008 166 | 6100007,Gyeongsangnam-do,Yangsan-si,TRUE,Soso Seowon,3,35.338811,129.017508 167 | 6100008,Gyeongsangnam-do,from other city,TRUE,Itaewon Clubs,2,-,- 168 | 6100009,Gyeongsangnam-do,from other city,TRUE,Onchun Church,2,-,- 169 | 6100010,Gyeongsangnam-do,-,FALSE,overseas inflow,26,-,- 170 | 6100011,Gyeongsangnam-do,-,FALSE,contact with patient,6,-,- 171 | 6100012,Gyeongsangnam-do,-,FALSE,etc,20,-,- 172 | 7000001,Jeju-do,-,FALSE,overseas inflow,14,-,- 173 | 7000002,Jeju-do,-,FALSE,contact with patient,0,-,- 174 | 7000003,Jeju-do,-,FALSE,etc,4,-,- 175 | 7000004,Jeju-do,from other city,TRUE,Itaewon Clubs,1,-,- -------------------------------------------------------------------------------- /data/Region.csv: -------------------------------------------------------------------------------- 1 | code,province,city,latitude,longitude,elementary_school_count,kindergarten_count,university_count,academy_ratio,elderly_population_ratio,elderly_alone_ratio,nursing_home_count 2 | 10000,Seoul,Seoul,37.566953,126.977977,607,830,48,1.44,15.38,5.8,22739 3 | 10010,Seoul,Gangnam-gu,37.518421,127.047222,33,38,0,4.18,13.17,4.3,3088 4 | 10020,Seoul,Gangdong-gu,37.530492,127.123837,27,32,0,1.54,14.55,5.4,1023 5 | 10030,Seoul,Gangbuk-gu,37.639938,127.025508,14,21,0,0.67,19.49,8.5,628 6 | 10040,Seoul,Gangseo-gu,37.551166,126.849506,36,56,1,1.17,14.39,5.7,1080 7 | 10050,Seoul,Gwanak-gu,37.47829,126.951502,22,33,1,0.89,15.12,4.9,909 8 | 10060,Seoul,Gwangjin-gu,37.538712,127.082366,22,33,3,1.16,13.75,4.8,723 9 | 10070,Seoul,Guro-gu,37.495632,126.88765,26,34,3,1,16.21,5.7,741 10 | 10080,Seoul,Geumcheon-gu,37.456852,126.895229,18,19,0,0.96,16.15,6.7,475 11 | 10090,Seoul,Nowon-gu,37.654259,127.056294,42,66,6,1.39,15.4,7.4,952 12 | 
10100,Seoul,Dobong-gu,37.668952,127.047082,23,26,1,0.95,17.89,7.2,485 13 | 10110,Seoul,Dongdaemun-gu,37.574552,127.039721,21,31,4,1.06,17.26,6.7,832 14 | 10120,Seoul,Dongjak-gu,37.510571,126.963604,21,34,3,1.17,15.85,5.2,762 15 | 10130,Seoul,Mapo-gu,37.566283,126.901644,22,24,2,1.83,14.05,4.9,929 16 | 10140,Seoul,Seodaemun-gu,37.579428,126.936771,19,25,6,1.12,16.77,6.2,587 17 | 10150,Seoul,Seocho-gu,37.483804,127.032693,24,27,1,2.6,13.39,3.8,1465 18 | 10160,Seoul,Seongdong-gu,37.563277,127.036647,21,30,2,0.97,14.76,5.3,593 19 | 10170,Seoul,Seongbuk-gu,37.589562,127.0167,29,49,6,1.02,16.15,6,729 20 | 10180,Seoul,Songpa-gu,37.51462,127.106141,40,51,1,1.65,13.1,4.1,1527 21 | 10190,Seoul,Yangcheon-gu,37.517189,126.866618,30,43,0,2.26,13.55,5.5,816 22 | 10200,Seoul,Yeongdeungpo-gu,37.526505,126.89619,23,39,0,1.21,15.6,5.8,1001 23 | 10210,Seoul,Yongsan-gu,37.532768,126.990021,15,13,1,0.68,16.87,6.5,435 24 | 10220,Seoul,Eunpyeong-gu,37.603481,126.929173,31,44,1,1.09,17,6.5,874 25 | 10230,Seoul,Jongno-gu,37.572999,126.979189,13,17,3,1.71,18.27,6.8,668 26 | 10240,Seoul,Jung-gu,37.563988,126.99753,12,14,2,0.94,18.42,7.4,728 27 | 10250,Seoul,Jungnang-gu,37.606832,127.092656,23,31,1,0.7,16.65,6.9,689 28 | 11000,Busan,Busan,35.179884,129.074796,304,408,22,1.4,18.41,8.6,6752 29 | 11010,Busan,Gangseo-gu,35.212424,128.98068,17,21,0,1.43,11.84,5,147 30 | 11020,Busan,Geumjeong-gu,35.243053,129.092163,22,28,4,1.64,19.8,8.4,466 31 | 11030,Busan,Gijang-gun,35.244881,129.222253,21,35,0,1.31,15.45,7.4,229 32 | 11040,Busan,Nam-gu,35.136789,129.08414,21,27,4,1.24,19.13,7.9,475 33 | 11050,Busan,Dong-gu,35.129432,129.04554,7,9,0,0.83,25.42,13.8,239 34 | 11060,Busan,Dongnae-gu,35.20506,129.083673,22,31,0,1.98,17.53,7.7,608 35 | 11070,Busan,Busanjin-gu,35.163332,129.053058,32,43,3,1.62,19.01,8.7,986 36 | 11080,Busan,Buk-gu,35.197483,128.990224,27,37,1,1.35,16.04,8.4,465 37 | 11090,Busan,Sasang-gu,35.152777,128.991142,21,31,3,0.78,17,8.1,317 38 | 11100,Busan,Saha-gu,35.104642,128.974792,26,42,2,1.34,17.76,8.1,541 39 | 11110,Busan,Seo-gu,35.098135,129.024193,11,9,0,1.02,24.36,12.3,235 40 | 11120,Busan,Suyeong-gu,35.145805,129.113194,10,22,0,1.56,20.4,8.2,395 41 | 11130,Busan,Yeonje-gu,35.176406,129.079566,16,18,2,1.39,18.27,7.8,450 42 | 11140,Busan,Yeongdo-gu,35.091362,129.067884,14,12,2,0.69,26.13,13.3,203 43 | 11150,Busan,Jung-gu,35.106321,129.032256,4,4,0,1.36,26,13,182 44 | 11160,Busan,Haeundae-gu,35.16336,129.163594,33,39,1,1.63,16.53,7.9,814 45 | 12000,Daegu,Daegu,35.87215,128.601783,229,355,11,1.62,15.78,7.5,5083 46 | 12010,Daegu,Nam-gu,35.8463,128.597723,11,15,2,0.85,22.49,10.4,345 47 | 12020,Daegu,Dalseo-gu,35.830128,128.532635,55,78,3,1.72,13.56,6.5,1064 48 | 12030,Daegu,Dalseong-gun,35.77475,128.431314,32,47,1,1.51,12.11,5.4,361 49 | 12040,Daegu,Dong-gu,35.886836,128.635537,32,58,0,1.15,18.81,9,649 50 | 12050,Daegu,Buk-gu,35.892506,128.583779,38,73,4,1.43,13.95,6.3,748 51 | 12060,Daegu,Seo-gu,35.871993,128.559182,17,23,0,0.83,21.29,10.1,374 52 | 12070,Daegu,Suseong-gu,35.858395,128.630551,34,49,1,2.28,15,6.9,948 53 | 12080,Daegu,Jung-gu,35.869551,128.606184,10,12,0,4.03,20.29,9.6,594 54 | 13000,Gwangju,Gwangju,35.160467,126.851392,155,312,17,2.38,13.57,6.4,2852 55 | 13010,Gwangju,Gwangsan-gu,35.139941,126.7937,45,90,4,2.32,9.1,4.6,640 56 | 13020,Gwangju,Nam-gu,35.132982,126.902379,23,44,4,2.63,16.76,7.5,427 57 | 13030,Gwangju,Dong-gu,35.14626,126.923133,11,18,3,2.88,21.59,8.8,327 58 | 13040,Gwangju,Buk-gu,35.174434,126.912061,46,105,6,2.13,14.35,6.9,782 59 | 
13050,Gwangju,Seo-gu,35.152516,126.889614,30,55,0,2.5,13.52,6.1,676 60 | 14000,Incheon,Incheon,37.456188,126.70592,250,403,7,1.27,13.2,5.8,4497 61 | 14010,Incheon,Ganghwa-gun,37.747065,126.487777,20,19,1,0.68,31.97,13.9,107 62 | 14020,Incheon,Gyeyang-gu,37.537537,126.737712,26,36,2,1.25,11.81,5.2,477 63 | 14030,Incheon,Michuhol-gu,37.463572,126.65027,29,43,2,0.9,16.29,6.8,643 64 | 14040,Incheon,Namdong-gu,37.44727,126.731429,38,77,0,1.39,12.68,5.5,955 65 | 14050,Incheon,Dong-gu,37.474101,126.643266,8,12,1,0.68,21.62,10.5,122 66 | 14060,Incheon,Bupyeong-gu,37.507031,126.721804,42,66,0,1.27,13.83,6.2,847 67 | 14070,Incheon,Seo-gu,37.545557,126.675994,44,80,0,1.4,10.02,3.9,670 68 | 14080,Incheon,Yeonsu-gu,37.410262,126.678309,23,43,1,1.77,9.48,4,467 69 | 14090,Incheon,Ongjin-gun,37.446777,126.636747,6,10,0,0.19,25.35,11.1,32 70 | 14100,Incheon,Jung-gu,37.473761,126.621693,14,17,0,0.8,14.14,6.7,177 71 | 15000,Daejeon,Daejeon,36.350621,127.384744,148,260,15,1.49,13.65,5.8,2984 72 | 15010,Daejeon,Daedeok-gu,36.346713,127.415597,21,29,2,1.35,14.97,6.9,310 73 | 15020,Daejeon,Dong-gu,36.31204,127.454742,23,35,4,0.93,18.16,8.1,447 74 | 15030,Daejeon,Seo-gu,36.35553,127.383755,39,78,3,1.81,12.05,4.9,1123 75 | 15040,Daejeon,Yuseong-gu,36.362226,127.356153,38,75,5,1.54,9.04,3.3,586 76 | 15050,Daejeon,Jung-gu,36.325735,127.421317,27,43,1,1.43,18.39,8.2,518 77 | 16000,Ulsan,Ulsan,35.539797,129.311538,119,200,4,2.21,11.76,5.2,1801 78 | 16010,Ulsan,Nam-gu,35.543833,129.330047,30,44,1,3.23,11.42,4.8,765 79 | 16020,Ulsan,Dong-gu,35.504806,129.416575,16,25,1,1.73,11.68,4.4,242 80 | 16030,Ulsan,Buk-gu,35.582685,129.36124,22,45,0,1.73,7.69,3.4,211 81 | 16040,Ulsan,Ulju-gun,35.522126,129.242507,30,49,2,1.74,13.94,6.7,260 82 | 16050,Ulsan,Jung-gu,35.569468,129.332773,21,37,0,2.03,14.15,6.5,323 83 | 17000,Sejong,Sejong,36.480132,127.289021,48,60,3,1.78,9.48,3.8,491 84 | 20000,Gyeonggi-do,Gyeonggi-do,37.275119,127.009466,1277,2237,61,1.6,12.63,5.2,20491 85 | 20010,Gyeonggi-do,Gapyeong-gun,37.831297,127.509555,13,14,0,0.88,25.09,12,114 86 | 20020,Gyeonggi-do,Goyang-si,37.658363,126.831961,84,171,2,1.88,12.82,5.2,1608 87 | 20030,Gyeonggi-do,Gwacheon-si,37.42915,126.987616,4,7,0,1.36,13.98,5.7,92 88 | 20040,Gyeonggi-do,Gwangmyeong-si,37.478541,126.864648,25,48,0,1.73,13.25,5.8,530 89 | 20050,Gyeonggi-do,Gwangju-si,37.429322,127.255153,27,42,2,1.09,12.65,4,406 90 | 20060,Gyeonggi-do,Guri-si,37.594267,127.129549,16,30,0,1.94,12.88,5.1,418 91 | 20070,Gyeonggi-do,Gunpo-si,37.361653,126.935206,26,49,1,1.58,12.42,5.4,429 92 | 20080,Gyeonggi-do,Gimpo-si,37.615238,126.715601,43,90,2,1.74,12.1,4.4,604 93 | 20090,Gyeonggi-do,Namyangju-si,37.635791,127.216552,64,110,1,1.38,13.44,5,912 94 | 20100,Gyeonggi-do,Dongducheon-si,37.903568,127.060347,11,19,1,1.03,19.56,9.7,140 95 | 20110,Gyeonggi-do,Bucheon-si,37.503393,126.766049,64,124,4,1.51,12.77,5.4,1432 96 | 20120,Gyeonggi-do,Seongnam-si,37.42,127.126703,72,127,3,2.08,13.52,5.6,2095 97 | 20130,Gyeonggi-do,Suwon-si,37.263376,127.028613,99,192,4,1.72,10.5,4.5,2082 98 | 20140,Gyeonggi-do,Siheung-si,37.38011,126.803009,46,74,2,1.55,8.86,3.8,622 99 | 20150,Gyeonggi-do,Ansan-si,37.321863,126.83092,54,94,4,1.49,10.35,4.6,1024 100 | 20160,Gyeonggi-do,Anseong-si,37.008008,127.279763,35,51,3,1.27,16.95,7.2,271 101 | 20170,Gyeonggi-do,Anyang-si,37.394258,126.956752,41,79,4,1.86,12.88,5.1,1099 102 | 20180,Gyeonggi-do,Yangju-si,37.785253,127.045823,33,48,1,1.16,15.23,6.3,252 103 | 20190,Gyeonggi-do,Yangpyeong-gun,37.491744,127.48757,22,25,1,0.78,24.66,11.1,178 104 | 
20200,Gyeonggi-do,Yeoju-si,37.298209,127.637351,23,34,1,1,21.05,8.8,191 105 | 20210,Gyeonggi-do,Yeoncheon-gun,38.096409,127.075067,13,13,0,0.82,25.41,12.7,69 106 | 20220,Gyeonggi-do,Osan-si,37.149854,127.077461,23,50,2,1.25,9.09,3.6,314 107 | 20230,Gyeonggi-do,Yongin-si,37.240985,127.17805,103,169,7,1.82,12.77,4.1,1429 108 | 20240,Gyeonggi-do,Uiwang-si,37.344649,126.968299,14,24,1,1.09,12.99,5,189 109 | 20250,Gyeonggi-do,Uijeongbu-si,37.738058,127.033716,33,62,1,1.59,14.61,6.4,729 110 | 20260,Gyeonggi-do,Icheon-si,37.272169,127.434991,31,51,2,1.76,13.71,5.8,333 111 | 20270,Gyeonggi-do,Paju-si,37.759818,126.7799,57,98,1,1.34,13.52,5.3,573 112 | 20280,Gyeonggi-do,Pyeongtaek-si,36.992293,127.112709,58,108,3,1.39,12.13,5.6,765 113 | 20290,Gyeonggi-do,Pocheon-si,37.894881,127.200346,31,33,2,0.89,18.92,8.2,209 114 | 20300,Gyeonggi-do,Hanam-si,37.53926,127.214944,20,34,0,1.16,12.43,4.8,381 115 | 20310,Gyeonggi-do,Hwaseong-si,37.199536,126.83133,92,167,6,1.72,8.58,3.3,1001 116 | 30000,Gangwon-do,Gangwon-do,37.885369,127.729868,349,368,18,1.42,19.89,9.8,2519 117 | 30010,Gangwon-do,Gangneung-si,37.75197,128.875928,34,36,4,1.9,20.46,9.6,347 118 | 30020,Gangwon-do,Goseong-gun,38.380571,128.467827,13,11,1,0.48,28.53,14.6,41 119 | 30030,Gangwon-do,Donghae-si,37.524689,129.114239,14,17,0,1.6,19.41,10.1,140 120 | 30040,Gangwon-do,Samcheok-si,37.449899,129.165365,19,17,0,1.01,24.26,13,94 121 | 30050,Gangwon-do,Sokcho-si,38.207022,128.591861,12,11,0,1.54,18.33,9.7,165 122 | 30060,Gangwon-do,Yanggu-gun,38.110002,127.990092,10,9,0,1.01,20.38,9.9,33 123 | 30070,Gangwon-do,Yangyang-gun,38.075405,128.619125,13,15,0,0.69,28.92,14.6,37 124 | 30080,Gangwon-do,Yeongwol-gun,37.183694,128.461835,13,15,1,0.87,28.54,15.4,56 125 | 30090,Gangwon-do,Wonju-si,37.341963,127.919668,50,71,4,1.71,14.46,6.7,597 126 | 30100,Gangwon-do,Inje-gun,38.069682,128.170335,13,15,0,0.88,19.88,10.4,46 127 | 30110,Gangwon-do,Jeongseon-gun,37.38062,128.660873,16,15,0,0.67,26.79,13.9,53 128 | 30120,Gangwon-do,Cheorwon-gun,38.146693,127.3134,16,12,0,1.14,21.8,11.1,70 129 | 30130,Gangwon-do,Chuncheon-si,37.881281,127.730088,40,43,5,1.5,17.1,7.6,472 130 | 30140,Gangwon-do,Taebaek-si,37.16406,128.985736,12,12,1,1.16,23.88,12.8,64 131 | 30150,Gangwon-do,Pyeongchang-gun,37.370743,128.390335,19,16,0,0.81,27.04,13.7,84 132 | 30160,Gangwon-do,Hongcheon-gun,37.697037,127.888821,26,26,0,1.14,25.28,12.6,115 133 | 30170,Gangwon-do,Hwacheon-gun,38.106152,127.708163,13,14,0,0.72,21.75,11.2,36 134 | 30180,Gangwon-do,Hoengseong-gun,37.491702,127.985042,16,13,2,0.97,28.22,13.3,69 135 | 40000,Chungcheongbuk-do,Chungcheongbuk-do,36.63568,127.491384,259,328,17,1.39,17.28,8.5,2769 136 | 40010,Chungcheongbuk-do,Goesan-gun,36.81534,127.786651,14,15,1,0.36,33.01,16.5,64 137 | 40020,Chungcheongbuk-do,Danyang-gun,36.984657,128.365476,11,11,0,0.5,29.52,15.8,52 138 | 40030,Chungcheongbuk-do,Boeun-gun,36.489402,127.729503,15,16,0,0.67,33.46,18.4,74 139 | 40040,Chungcheongbuk-do,Yeongdong-gun,36.175017,127.783431,14,14,1,0.94,30.32,16.1,108 140 | 40050,Chungcheongbuk-do,Okcheon-gun,36.306367,127.57128,12,17,1,0.63,28.83,14.9,107 141 | 40060,Chungcheongbuk-do,Eumseong-gun,36.940237,127.690502,21,24,2,0.88,20.33,9.2,169 142 | 40070,Chungcheongbuk-do,Jecheon-si,37.132583,128.190946,24,35,2,1.31,20.92,10.5,245 143 | 40080,Chungcheongbuk-do,Jeungpyeong-gun,36.785301,127.581481,4,5,0,1.39,16.59,8.7,62 144 | 40090,Chungcheongbuk-do,Jincheon-gun,36.855374,127.435635,15,16,0,1.15,16.16,8.3,121 145 | 40100,Chungcheongbuk-do,Cheongju-si,36.641876,127.488759,92,131,8,1.67,12.82,6,1420 
146 | 40110,Chungcheongbuk-do,Chungju-si,36.991009,127.925946,37,44,2,1.39,19.07,9,347 147 | 41000,Chungcheongnam-do,Chungcheongnam-do,36.658976,126.673318,409,499,21,1.38,18.4,8.9,3641 148 | 41010,Chungcheongnam-do,Gyeryong-si,36.274481,127.248557,5,8,0,1.3,11.3,5.1,61 149 | 41020,Chungcheongnam-do,Gongju-si,36.446508,127.11903,28,34,2,1.16,25.2,11.7,221 150 | 41030,Chungcheongnam-do,Geumsan-gun,36.108887,127.487966,16,16,1,0.67,30.15,14.8,119 151 | 41040,Chungcheongnam-do,Nonsan-si,36.187076,127.098959,33,32,2,1.03,25.54,13.4,262 152 | 41050,Chungcheongnam-do,Dangjin-si,36.889777,126.645814,30,35,1,1.2,18.14,7.9,265 153 | 41060,Chungcheongnam-do,Boryeong-si,36.333317,126.612679,29,33,1,1.23,25.04,12.6,186 154 | 41070,Chungcheongnam-do,Buyeo-gun,36.275673,126.90979,24,27,1,0.79,33.3,17.4,137 155 | 41080,Chungcheongnam-do,Seosan-si,36.784751,126.450289,29,39,1,1.3,18.01,8.4,261 156 | 41090,Chungcheongnam-do,Seocheon-gun,36.080266,126.691394,18,18,0,0.97,35.25,18.9,116 157 | 41100,Chungcheongnam-do,Asan-si,36.789844,127.00242,46,60,3,1.29,12.89,6.1,435 158 | 41110,Chungcheongnam-do,Yesan-gun,36.680222,126.844687,24,26,0,0.95,30.27,14.5,156 159 | 41120,Chungcheongnam-do,Cheonan-si,36.81498,127.113868,75,112,6,1.91,10.42,4.5,1069 160 | 41130,Chungcheongnam-do,Cheongyang-gun,36.459129,126.802333,12,13,1,0.6,34.51,18,62 161 | 41140,Chungcheongnam-do,Taean-gun,36.745577,126.29795,19,20,0,0.83,29.98,13.8,112 162 | 41150,Chungcheongnam-do,Hongseong-gun,36.60126,126.660772,21,26,2,1.35,23.11,11.3,179 163 | 50000,Jeollabuk-do,Jeollabuk-do,35.820308,127.108791,419,519,19,2.12,20.6,10.9,3774 164 | 50010,Jeollabuk-do,Gochang-gun,35.435767,126.702152,21,22,0,0.95,33.33,20.6,123 165 | 50020,Jeollabuk-do,Gunsan-si,35.967631,126.737011,56,67,5,2.04,18.07,8.8,494 166 | 50030,Jeollabuk-do,Gimje-si,35.803506,126.880507,36,40,0,1.11,30.89,17.9,188 167 | 50040,Jeollabuk-do,Namwon-si,35.416349,127.390265,27,28,0,1.42,27.25,16.3,186 168 | 50050,Jeollabuk-do,Muju-gun,36.006777,127.660789,10,9,0,0.74,32.71,18.7,53 169 | 50060,Jeollabuk-do,Buan-gun,35.731805,126.733347,22,25,0,0.99,32.2,19.1,118 170 | 50070,Jeollabuk-do,Sunchang-gun,35.374421,127.137597,15,15,0,0.7,32.97,20.2,72 171 | 50080,Jeollabuk-do,Wanju-gun,35.904724,127.162209,29,37,3,1.52,22.38,9.8,170 172 | 50090,Jeollabuk-do,Iksan-si,35.948229,126.957792,60,83,2,2.18,18.85,9.3,556 173 | 50100,Jeollabuk-do,Imsil-gun,35.617891,127.288868,15,14,1,0.55,34.82,20.1,79 174 | 50110,Jeollabuk-do,Jangsu-gun,35.647281,127.521204,8,8,0,0.58,32.87,19.1,44 175 | 50120,Jeollabuk-do,Jeonju-si,35.824069,127.14805,73,120,7,3.02,14.4,6.8,1381 176 | 50130,Jeollabuk-do,Jeongeup-si,35.569768,126.856148,34,39,1,1.64,26.92,15.9,256 177 | 50140,Jeollabuk-do,Jinan-gun,35.791669,127.424806,13,12,0,0.39,33.8,19.5,54 178 | 51000,Jeollanam-do,Jeollanam-do,34.816095,126.463021,429,542,19,1.45,22.81,13.5,3389 179 | 51010,Jeollanam-do,Gangjin-gun,34.642006,126.767267,14,14,0,0.79,33.87,21.7,71 180 | 51020,Jeollanam-do,Goheung-gun,34.61117,127.285055,17,21,0,0.71,40.04,24.5,141 181 | 51030,Jeollanam-do,Gokseong-gun,35.282015,127.292107,8,8,1,0.87,35.16,22.5,70 182 | 51040,Jeollanam-do,Gwangyang-si,34.940611,127.696142,28,37,2,1.63,12.65,6.8,210 183 | 51050,Jeollanam-do,Gurye-gun,35.202521,127.462789,10,12,0,0.79,33.3,19.4,57 184 | 51060,Jeollanam-do,Naju-si,35.015818,126.710826,24,30,3,1.4,22.16,13.2,214 185 | 51070,Jeollanam-do,Damyang-gun,35.321144,126.988229,14,15,1,0.28,30.27,16.3,98 186 | 51080,Jeollanam-do,Mokpo-si,34.811809,126.392198,33,55,3,2.11,15.88,8.8,406 187 | 
51090,Jeollanam-do,Muan-gun,34.990368,126.481653,18,21,2,2.12,20.82,11.8,141 188 | 51100,Jeollanam-do,Boseong-gun,34.771453,127.080097,17,19,0,0.65,37.43,23.8,98 189 | 51110,Jeollanam-do,Suncheon-si,34.950592,127.487396,42,61,3,1.94,15.2,8.3,452 190 | 51120,Jeollanam-do,Sinan-gun,34.833564,126.351636,19,22,0,0.2,35.18,21,77 191 | 51130,Jeollanam-do,Yeosu-si,34.760421,127.662287,50,73,1,1.85,18.59,9.5,471 192 | 51140,Jeollanam-do,Yeonggwang-gun,35.277188,126.512087,13,15,1,1.24,28.4,17.3,116 193 | 51150,Jeollanam-do,Yeongam-gun,34.800192,126.696802,16,18,2,0.82,25.99,15.6,100 194 | 51160,Jeollanam-do,Wando-gun,34.310983,126.755076,21,24,0,0.65,31.55,17.9,96 195 | 51170,Jeollanam-do,Jangseong-gun,35.301826,126.784893,13,16,0,0.83,28.75,16.8,80 196 | 51180,Jeollanam-do,Jangheung-gun,34.681616,126.90708,15,16,0,0.67,33.21,21.6,80 197 | 51190,Jeollanam-do,Jindo-gun,34.486834,126.263554,10,13,0,0.78,32.97,22,68 198 | 51200,Jeollanam-do,Hampyeong-gun,35.065927,126.516688,11,12,0,0.49,35.45,22,76 199 | 51210,Jeollanam-do,Haenam-gun,34.573374,126.59925,20,21,0,0.91,31.45,20,135 200 | 51220,Jeollanam-do,Hwasun-gun,35.064456,126.986643,16,19,0,1.32,26.14,17.4,132 201 | 60000,Gyeongsangbuk-do,Gyeongsangbuk-do,36.576032,128.505599,471,707,33,1.33,20.85,11.1,4474 202 | 60010,Gyeongsangbuk-do,Gyeongsan-si,35.825056,128.741544,31,61,10,1.34,16.18,7,427 203 | 60020,Gyeongsangbuk-do,Gyeongju-si,35.856185,129.224796,43,61,4,1.34,21.66,11,407 204 | 60030,Gyeongsangbuk-do,Goryeong-gun,35.72615,128.26295,9,9,0,0.62,30.17,15.9,62 205 | 60040,Gyeongsangbuk-do,Gumi-si,36.119641,128.344295,50,104,3,1.96,9.08,4.3,616 206 | 60050,Gyeongsangbuk-do,Gunwi-gun,36.242926,128.572939,7,8,0,0.25,38.87,21.6,41 207 | 60060,Gyeongsangbuk-do,Gimcheon-si,36.139875,128.11365,27,37,2,0.89,22.35,12.2,212 208 | 60070,Gyeongsangbuk-do,Mungyeong-si,36.586831,128.18709,17,26,1,1.18,29.25,17.1,147 209 | 60080,Gyeongsangbuk-do,Bonghwa-gun,36.893099,128.732568,14,17,0,0.37,35.26,20,47 210 | 60090,Gyeongsangbuk-do,Sangju-si,36.410977,128.159024,28,35,0,1.03,30.09,17.3,186 211 | 60100,Gyeongsangbuk-do,Seongju-gun,35.919325,128.283237,12,15,0,0.59,30.89,15.9,84 212 | 60110,Gyeongsangbuk-do,Andong-si,36.568441,128.729551,30,42,3,1.57,23.95,12.4,290 213 | 60120,Gyeongsangbuk-do,Yeongdeok-gun,36.415059,129.366107,9,11,0,0.75,36.45,22.3,81 214 | 60130,Gyeongsangbuk-do,Yeongyang-gun,36.666664,129.112401,6,8,0,0.35,36.36,21.1,24 215 | 60140,Gyeongsangbuk-do,Yeongju-si,36.805702,128.623958,19,26,2,1.61,26.22,13.2,191 216 | 60150,Gyeongsangbuk-do,Yeongcheon-si,35.973268,128.938603,18,23,1,0.83,27.32,15.3,192 217 | 60160,Gyeongsangbuk-do,Yecheon-gun,36.646707,128.437435,12,15,1,1.09,29.79,18.4,97 218 | 60170,Gyeongsangbuk-do,Ulleung-gun,37.484438,130.905883,4,6,0,0.21,24.81,10.7,11 219 | 60180,Gyeongsangbuk-do,Uljin-gun,36.993078,129.400585,13,16,0,1.01,27.3,17.6,83 220 | 60190,Gyeongsangbuk-do,Uiseong-gun,36.352718,128.697088,16,16,0,0.32,40.26,23.7,108 221 | 60200,Gyeongsangbuk-do,Cheongdo-gun,35.647361,128.734382,11,14,0,0.63,36.55,21,85 222 | 60210,Gyeongsangbuk-do,Cheongsong-gun,36.436301,129.05708,8,8,0,0.51,36.08,19.4,55 223 | 60220,Gyeongsangbuk-do,Chilgok-gun,35.995529,128.401735,21,32,2,1.48,15.17,6.7,151 224 | 60230,Gyeongsangbuk-do,Pohang-si,36.019052,129.343645,66,117,4,1.51,16.44,8,877 225 | 61000,Gyeongsangnam-do,Gyeongsangnam-do,35.238294,128.692397,501,686,21,1.78,16.51,9.1,5364 226 | 61010,Gyeongsangnam-do,Geoje-si,34.880519,128.621216,38,58,1,1.76,10.22,4.7,317 227 | 
61020,Gyeongsangnam-do,Geochang-gun,35.686526,127.910021,17,16,2,1.25,27.01,17.4,127 228 | 61030,Gyeongsangnam-do,Goseong-gun,34.972986,128.322246,19,18,0,1.3,30.17,18.2,89 229 | 61040,Gyeongsangnam-do,Gimhae-si,35.228678,128.889352,58,91,4,2.1,10.74,5.3,711 230 | 61050,Gyeongsangnam-do,Namhae-gun,34.837688,127.892595,13,10,1,1.03,36.92,23.3,88 231 | 61060,Gyeongsangnam-do,Miryang-si,35.503827,128.746604,21,25,0,1.39,27.31,16.3,178 232 | 61070,Gyeongsangnam-do,Sacheon-si,35.003668,128.064272,18,20,0,1.84,21.17,11.7,191 233 | 61080,Gyeongsangnam-do,Sancheong-gun,35.415544,127.87349,13,11,0,0.65,35.42,20.6,72 234 | 61090,Gyeongsangnam-do,Yangsan-si,35.335016,129.037057,37,68,2,1.59,12.92,6.2,507 235 | 61100,Gyeongsangnam-do,Uiryeong-gun,35.322198,128.26174,13,10,0,0.63,35.95,24.2,60 236 | 61110,Gyeongsangnam-do,Jinju-si,35.180313,128.10875,45,53,6,2.49,16.27,8.6,597 237 | 61120,Gyeongsangnam-do,Changnyeong-gun,35.544603,128.49233,17,20,0,0.8,29.8,18.4,129 238 | 61130,Gyeongsangnam-do,Changwon-si,35.227992,128.681815,110,195,5,1.84,13.64,6.5,1701 239 | 61140,Gyeongsangnam-do,Tongyeong-si,34.854426,128.43321,20,29,0,1.7,18.47,9.8,230 240 | 61150,Gyeongsangnam-do,Hadong-gun,35.067224,127.751271,16,15,0,0.84,32.89,19.1,94 241 | 61160,Gyeongsangnam-do,Haman-gun,35.272481,128.40654,16,20,0,1.19,23.74,14.7,94 242 | 61170,Gyeongsangnam-do,Hamyang-gun,35.520541,127.725177,13,12,0,1.01,32.65,20.9,83 243 | 61180,Gyeongsangnam-do,Hapcheon-gun,35.566702,128.16587,17,15,0,0.71,38.44,24.7,96 244 | 70000,Jeju-do,Jeju-do,33.488936,126.500423,113,123,4,1.53,15.1,6.4,1245 245 | 80000,Korea,Korea,37.566953,126.977977,6087,8837,340,1.56,15.67,7.2,94865 -------------------------------------------------------------------------------- /img/hyper.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hyunjoonbok/PySpark/c0eb3879e31fc1475722b340d824fecf24e17cdc/img/hyper.png -------------------------------------------------------------------------------- /img/input.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hyunjoonbok/PySpark/c0eb3879e31fc1475722b340d824fecf24e17cdc/img/input.png -------------------------------------------------------------------------------- /img/parallel-coordinates-plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hyunjoonbok/PySpark/c0eb3879e31fc1475722b340d824fecf24e17cdc/img/parallel-coordinates-plot.png -------------------------------------------------------------------------------- /img/shuffle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hyunjoonbok/PySpark/c0eb3879e31fc1475722b340d824fecf24e17cdc/img/shuffle.png -------------------------------------------------------------------------------- /img/spark.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hyunjoonbok/PySpark/c0eb3879e31fc1475722b340d824fecf24e17cdc/img/spark.png -------------------------------------------------------------------------------- /img/sparkpartition.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hyunjoonbok/PySpark/c0eb3879e31fc1475722b340d824fecf24e17cdc/img/sparkpartition.png --------------------------------------------------------------------------------