├── Classification Evaluation.ipynb ├── Classification.ipynb ├── Clustering.ipynb ├── Cross Validation.ipynb ├── DAT202.3x-databricks.zip ├── Data Exploration.ipynb ├── LICENSE ├── Lab 1 - Exploring Data with Spark.pdf ├── Lab 2 - Building Supervised Learning Models.pdf ├── Lab 3 - Evaluating Supervised Learning Models.pdf ├── Lab 4 - Recommenders and Clustering.pdf ├── Lab 5 - Using the MML Spark Library.pdf ├── MMLSpark Classifier.ipynb ├── Parameter Tuning.ipynb ├── Pipeline.ipynb ├── README.md ├── Recommender.ipynb ├── Regression Evaluation.ipynb ├── Regression.ipynb ├── Setup.pdf └── Text Analysis.ipynb /Classification Evaluation.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Evaluating a Classification Model\n\nIn this exercise, you will create a pipeline for a classification model, and then apply commonly used metrics to evaluate the resulting classifier.\n\n### Prepare the Data\n\nFirst, import the libraries you will need and prepare the training and test data:"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler\nfrom pyspark.ml.evaluation import BinaryClassificationEvaluator\n\n# Load the source data\nflightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", StringType(), False),\n StructField(\"DestAirportID\", StringType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n StructField(\"Late\", IntegerType(), False),\n])\n\ndata = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', 
schema=flightSchema, header=True)\n\n# Split the data\nsplits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1]"],"metadata":{},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Define the Pipeline and Train the Model\nNow define a pipeline that creates a feature vector and trains a classification model"],"metadata":{}},{"cell_type":"code","source":["monthdayIndexer = StringIndexer(inputCol=\"DayofMonth\", outputCol=\"DayofMonthIdx\")\nweekdayIndexer = StringIndexer(inputCol=\"DayOfWeek\", outputCol=\"DayOfWeekIdx\")\ncarrierIndexer = StringIndexer(inputCol=\"Carrier\", outputCol=\"CarrierIdx\")\noriginIndexer = StringIndexer(inputCol=\"OriginAirportID\", outputCol=\"OriginAirportIdx\")\ndestIndexer = StringIndexer(inputCol=\"DestAirportID\", outputCol=\"DestAirportIdx\")\nnumVect = VectorAssembler(inputCols = [\"DepDelay\"], outputCol=\"numFeatures\")\nminMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol=\"normNums\")\nfeatVect = VectorAssembler(inputCols=[\"DayofMonthIdx\", \"DayOfWeekIdx\", \"CarrierIdx\", \"OriginAirportIdx\", \"DestAirportIdx\", \"normNums\"], outputCol=\"features\")\nlr = LogisticRegression(labelCol=\"Late\", featuresCol=\"features\")\npipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])\nmodel = pipeline.fit(train)"],"metadata":{"scrolled":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Test the Model\nNow you're ready to apply the model to the test data."],"metadata":{}},{"cell_type":"code","source":["prediction = model.transform(test)\npredicted = prediction.select(\"features\", col(\"prediction\").cast(\"Int\"), col(\"Late\").alias(\"trueLabel\"))\npredicted.show(100, truncate=False)"],"metadata":{"scrolled":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Compute Confusion Matrix Metrics\nClassifiers are typically evaluated by creating a 
*confusion matrix*, which indicates the number of:\n- True Positives\n- True Negatives\n- False Positives\n- False Negatives\n\nFrom these core measures, other evaluation metrics such as *precision* and *recall* can be calculated."],"metadata":{}},{"cell_type":"code","source":["tp = float(predicted.filter(\"prediction == 1.0 AND truelabel == 1\").count())\nfp = float(predicted.filter(\"prediction == 1.0 AND truelabel == 0\").count())\ntn = float(predicted.filter(\"prediction == 0.0 AND truelabel == 0\").count())\nfn = float(predicted.filter(\"prediction == 0.0 AND truelabel == 1\").count())\nmetrics = spark.createDataFrame([\n (\"TP\", tp),\n (\"FP\", fp),\n (\"TN\", tn),\n (\"FN\", fn),\n (\"Precision\", tp / (tp + fp)),\n (\"Recall\", tp / (tp + fn))],[\"metric\", \"value\"])\nmetrics.show()"],"metadata":{},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### View the Raw Prediction and Probability\nThe prediction is based on a raw prediction score that describes a labelled point in a logistic function. This raw prediction is then converted to a predicted label of 0 or 1 based on a probability vector that indicates the confidence for each possible label value (in this case, 0 and 1). The value with the highest confidence is selected as the prediction."],"metadata":{}},{"cell_type":"code","source":["prediction.select(\"rawPrediction\", \"probability\", \"prediction\", col(\"Late\").alias(\"trueLabel\")).show(100, truncate=False)"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["Note that the results include rows where the probability for 0 (the first value in the **probability** vector) is only slightly higher than the probability for 1 (the second value in the **probability** vector). 
The default *discrimination threshold* (the boundary that decides whether a probability is predicted as a 1 or a 0) is set to 0.5; so the prediction with the highest probability is always used, no matter how close to the threshold."],"metadata":{}},{"cell_type":"markdown","source":["### Review the Area Under ROC\nAnother way to assess the performance of a classification model is to measure the area under a *receiver operating characteristic (ROC) curve* for the model. The **spark.ml** library includes a **BinaryClassificationEvaluator** class that you can use to compute this. A ROC curve plots the True Positive and False Positive rates for varying threshold values (the probability value over which a class label is predicted). The area under this curve gives an overall indication of the model's accuracy as a value between 0 and 1. A value of 0.5 or less means that a binary classification model (which predicts one of two possible labels) is no better at predicting the right class than a random 50/50 guess."],"metadata":{}},{"cell_type":"code","source":["evaluator = BinaryClassificationEvaluator(labelCol=\"Late\", rawPredictionCol=\"rawPrediction\", metricName=\"areaUnderROC\")\nauc = evaluator.evaluate(prediction)\nprint (\"AUC = \", auc)"],"metadata":{},"outputs":[],"execution_count":13}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"mimetype":"text/x-python","name":"python","pygments_lexer":"ipython3","codemirror_mode":{"name":"ipython","version":"3"},"version":"3.6.5","nbconvert_exporter":"python","file_extension":".py"},"name":"Python Classification Evaluation","notebookId":374219277805893},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Classification.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Creating a Classification Model\n\nIn this 
exercise, you will implement a classification model that uses features of a flight to predict whether or not it will be late.\n\n### Import Spark SQL and Spark ML Libraries\n\nFirst, import the libraries you will need:"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.feature import StringIndexer, VectorAssembler"],"metadata":{},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Load Source Data\nThe data for this exercise is provided as a CSV file containing details of flights that has already been cleaned up for modeling. The data includes specific characteristics (or *features*) for each flight, as well as a *label* column indicating whether or not the flight was late (a flight with an arrival delay of more than 25 minutes is considered *late*).\n\nYou will load this data into a dataframe and display it."],"metadata":{}},{"cell_type":"code","source":["flightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", IntegerType(), False),\n StructField(\"DestAirportID\", IntegerType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n StructField(\"Late\", IntegerType(), False),\n])\n\ndata = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', schema=flightSchema, header=True)\ndata.show()"],"metadata":{"scrolled":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Split the Data\nIt is common practice when building supervised machine learning models to split the source data, using some of it to train the model and reserving some to test the trained model. 
In this exercise, you will use 70% of the data for training, and reserve 30% for testing."],"metadata":{}},{"cell_type":"code","source":["splits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1]\ntrain_rows = train.count()\ntest_rows = test.count()\nprint (\"Training Rows:\", train_rows, \" Testing Rows:\", test_rows)"],"metadata":{},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Prepare the Training Data\nTo train the classification model, you need a training data set that includes a vector of numeric features, and a label column. In this exercise, you will use the **StringIndexer** class to generate a numeric category for each discrete **Carrier** string value, use the **VectorAssembler** class to transform the numeric features that would be available for a flight that hasn't yet arrived into a vector, and rename the **Late** column to **label**, as this is what we're going to try to predict.\n\n*Note: This is a deliberately simple example. In reality you'd likely perform multiple data preparation steps, and later in this course we'll examine how to encapsulate these steps into a pipeline. 
For now, we'll just use the numeric features as they are to define the training dataset.*"],"metadata":{}},{"cell_type":"code","source":["# Carrier is a string, and we need our features to be numeric - so we'll generate a numeric index for each distinct carrier string, and transform the dataframe to add that as a column\ncarrierIndexer = StringIndexer(inputCol=\"Carrier\", outputCol=\"CarrierIdx\")\nnumTrain = carrierIndexer.fit(train).transform(train)\n\n# Now we'll assemble a vector of all the numeric feature columns (other than ArrDelay, which we wouldn't have for enroute flights)\nassembler = VectorAssembler(inputCols = [\"DayofMonth\", \"DayOfWeek\", \"CarrierIdx\", \"OriginAirportID\", \"DestAirportID\", \"DepDelay\"], outputCol=\"features\")\ntraining = assembler.transform(numTrain).select(col(\"features\"), col(\"Late\").alias(\"label\"))\ntraining.show()"],"metadata":{"scrolled":false},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Train a Classification Model\nNext, you need to train a classification model using the training data. To do this, create an instance of the classification algorithm you want to use and use its **fit** method to train a model based on the training dataframe. In this exercise, you will use a *Logistic Regression* classification algorithm - but you can use the same technique for any of the classification algorithms supported in the spark.ml API."],"metadata":{}},{"cell_type":"code","source":["lr = LogisticRegression(labelCol=\"label\",featuresCol=\"features\",maxIter=10,regParam=0.3)\nmodel = lr.fit(training)\nprint (\"Model trained!\")"],"metadata":{"scrolled":false},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### Prepare the Testing Data\nNow that you have a trained model, you can test it using the testing data you reserved previously. First, you need to prepare the testing data in the same way as you did the training data by transforming the feature columns into a vector. 
This time you'll rename the **Late** column to **trueLabel**."],"metadata":{}},{"cell_type":"code","source":["# Add the numeric carrier index to the test data, using an indexer fitted on the training data\n# (refitting on the test data could assign different index values to the same carrier)\nnumTest = carrierIndexer.fit(train).transform(test)\n\n# Generate the features vector and label\ntesting = assembler.transform(numTest).select(col(\"features\"), col(\"Late\").alias(\"trueLabel\"))\ntesting.show()"],"metadata":{},"outputs":[],"execution_count":12},{"cell_type":"markdown","source":["### Test the Model\nNow you're ready to use the **transform** method of the model to generate some predictions. You can use this approach to predict delay status for flights where the label is unknown; but in this case you are using the test data which includes a known true label value, so you can compare the predicted status to the actual status."],"metadata":{}},{"cell_type":"code","source":["prediction = model.transform(testing)\npredicted = prediction.select(\"features\", \"probability\", col(\"prediction\").cast(\"Int\"), \"trueLabel\")\npredicted.show(100, truncate=False)"],"metadata":{"scrolled":false},"outputs":[],"execution_count":14},{"cell_type":"markdown","source":["Looking at the result, the **prediction** column contains the predicted value for the label, and the **trueLabel** column contains the actual known value from the testing data. The **probability** column shows the probability score for each class (0 or 1). It looks like there is a mix of correct and incorrect predictions, and the ones that are incorrect tend to have fairly close probabilities for each class. 
Later in this course you'll learn how to measure the accuracy of a model."],"metadata":{}}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"mimetype":"text/x-python","name":"python","pygments_lexer":"ipython3","codemirror_mode":{"name":"ipython","version":"3"},"version":"3.6.5","nbconvert_exporter":"python","file_extension":".py"},"name":"Python Classification","notebookId":342710160787076},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Clustering.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Clustering\nIn this exercise, you will use K-Means clustering to segment customer data into five clusters.\n\n### Import the Libraries\nYou will use the **KMeans** class to create your model. This will require a vector of features, so you will also use the **VectorAssembler** class."],"metadata":{}},{"cell_type":"code","source":["from pyspark.ml.clustering import KMeans\nfrom pyspark.ml.feature import VectorAssembler"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Load Source Data\nThe source data for your clusters is in a comma-separated values (CSV) file, and includes the following features:\n- CustomerName: The customer's name\n- Age: The customer's age in years\n- MaritalStatus: The customer's marital status (1 = Married, 0 = Unmarried)\n- IncomeRange: The upper bound of the customer's income range (for example, a value of 25,000 means the customer earns up to 25,000)\n- Gender: A numeric value indicating gender (1 = female, 2 = male)\n- TotalChildren: The total number of children the customer has\n- ChildrenAtHome: The number of children the customer has living at home.\n- Education: A numeric value indicating the highest level of education the customer has attained (1=Started 
High School to 5=Post-Graduate Degree)\n- Occupation: A numeric value indicating the type of occupation of the customer (0=Unskilled manual work to 5=Professional)\n- HomeOwner: A numeric code to indicate home-ownership (1 = home owner, 0 = not a home owner)\n- Cars: The number of cars owned by the customer."],"metadata":{}},{"cell_type":"code","source":["customers = spark.read.csv('wasb://spark@.blob.core.windows.net/data/customers.csv', inferSchema=True, header=True)\ncustomers.show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Create the K-Means Model\nYou will use the features in the customer data to create a K-Means model with a k value of 5. This will be used to generate 5 clusters."],"metadata":{}},{"cell_type":"code","source":["assembler = VectorAssembler(inputCols = [\"Age\", \"MaritalStatus\", \"IncomeRange\", \"Gender\", \"TotalChildren\", \"ChildrenAtHome\", \"Education\", \"Occupation\", \"HomeOwner\", \"Cars\"], outputCol=\"features\")\ntrain = assembler.transform(customers)\n\nkmeans = KMeans(featuresCol=assembler.getOutputCol(), predictionCol=\"cluster\", k=5, seed=0)\nmodel = kmeans.fit(train)\nprint (\"Model Created!\")"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Get the Cluster Centers\nThe cluster centers are indicated as vector coordinates."],"metadata":{}},{"cell_type":"code","source":["centers = model.clusterCenters()\nprint(\"Cluster Centers: \")\nfor center in centers:\n print(center)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Predict Clusters\nNow that you have trained the model, you can use it to segment the customer data into 5 clusters and show each customer with their allocated cluster."],"metadata":{}},{"cell_type":"code","source":["prediction = 
model.transform(train)\nprediction.groupBy(\"cluster\").count().orderBy(\"cluster\").show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":10},{"cell_type":"code","source":["prediction.select(\"CustomerName\", \"cluster\").show(50)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":11}],"metadata":{"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":2,"name":"python"}},"name":"Python Clustering","notebookId":3378903555804643},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Cross Validation.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Using Cross Validation\n\nIn this exercise, you will use cross-validation to optimize parameters for a regression model.\n\n### Prepare the Data\n\nFirst, import the libraries you will need and prepare the training and test data:"],"metadata":{}},{"cell_type":"code","source":["# Import Spark SQL and Spark ML libraries\nfrom pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.regression import LinearRegression\nfrom pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler\nfrom pyspark.ml.tuning import ParamGridBuilder, CrossValidator\nfrom pyspark.ml.evaluation import RegressionEvaluator\n\n# Load the source data\nflightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", StringType(), False),\n StructField(\"DestAirportID\", StringType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n 
StructField(\"Late\", IntegerType(), False),\n])\n\ndata = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', schema=flightSchema, header=True)\ndata = data.select(\"DayofMonth\", \"DayOfWeek\", \"Carrier\", \"OriginAirportID\", \"DestAirportID\", \"DepDelay\", col(\"ArrDelay\").alias(\"label\"))\n\n# Split the data\nsplits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1]"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Define the Pipeline\nNow define a pipeline that creates a feature vector and trains a regression model"],"metadata":{}},{"cell_type":"code","source":["# Define the pipeline\nmonthdayIndexer = StringIndexer(inputCol=\"DayofMonth\", outputCol=\"DayofMonthIdx\")\nweekdayIndexer = StringIndexer(inputCol=\"DayOfWeek\", outputCol=\"DayOfWeekIdx\")\ncarrierIndexer = StringIndexer(inputCol=\"Carrier\", outputCol=\"CarrierIdx\")\noriginIndexer = StringIndexer(inputCol=\"OriginAirportID\", outputCol=\"OriginAirportIdx\")\ndestIndexer = StringIndexer(inputCol=\"DestAirportID\", outputCol=\"DestAirportIdx\")\nnumVect = VectorAssembler(inputCols = [\"DepDelay\"], outputCol=\"numFeatures\")\nminMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol=\"normNums\")\nfeatVect = VectorAssembler(inputCols=[\"DayofMonthIdx\", \"DayOfWeekIdx\", \"CarrierIdx\", \"OriginAirportIdx\", \"DestAirportIdx\", \"normNums\"], outputCol=\"features\")\nlr = LinearRegression(labelCol=\"label\", featuresCol=\"features\")\npipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Tune Parameters\nYou can tune parameters to find the best model for your data. 
To do this you can use the **CrossValidator** class to evaluate each combination of parameters defined in a **ParameterGrid** against multiple *folds* of the data split into training and validation datasets, in order to find the best performing parameters. Note that this can take a long time to run because every parameter combination is tried multiple times."],"metadata":{}},{"cell_type":"code","source":["paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.3, 0.01]).addGrid(lr.maxIter, [10, 5]).build()\ncv = CrossValidator(estimator=pipeline, evaluator=RegressionEvaluator(), estimatorParamMaps=paramGrid, numFolds=2)\n\nmodel = cv.fit(train)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Test the Model\nNow you're ready to apply the model to the test data."],"metadata":{}},{"cell_type":"code","source":["prediction = model.transform(test)\npredicted = prediction.select(\"features\", \"prediction\", \"label\")\npredicted.show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Examine the Predicted and Actual Values\nYou can plot the predicted values against the actual values to see how accurately the model has predicted. In a perfect model, the resulting scatter plot should form a perfect diagonal line with each predicted value being identical to the actual value - in practice, some variance is to be expected.\nRun the cells below to create a temporary table from the **predicted** DataFrame and then retrieve the predicted and actual label values using SQL. 
You can then display the results as a scatter plot, specifying **-** as the function to show the unaggregated values."],"metadata":{}},{"cell_type":"code","source":["predicted.createOrReplaceTempView(\"regressionPredictions\")"],"metadata":{"collapsed":false},"outputs":[],"execution_count":10},{"cell_type":"code","source":["%sql\nSELECT label, prediction FROM regressionPredictions"],"metadata":{"collapsed":false},"outputs":[],"execution_count":11},{"cell_type":"markdown","source":["### Retrieve the Root Mean Square Error (RMSE)\nThere are a number of metrics used to measure the variance between predicted and actual values. Of these, the root mean square error (RMSE) is a commonly used value that is measured in the same units as the predicted and actual values - so in this case, the RMSE indicates, roughly, the typical number of minutes by which predicted and actual flight delay values differ. You can use the **RegressionEvaluator** class to retrieve the RMSE."],"metadata":{}},{"cell_type":"code","source":["evaluator = RegressionEvaluator(labelCol=\"label\", predictionCol=\"prediction\", metricName=\"rmse\")\nrmse = evaluator.evaluate(prediction)\nprint (\"Root Mean Square Error (RMSE):\", rmse)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":13}],"metadata":{"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":"2","name":"python"}},"name":"Python Cross Validation","notebookId":374219277805981},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /DAT202.3x-databricks.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GraemeMalcolm/predictive-databricks/d3e7ca972a5e28e0427741f5394a7206f5d6dd5d/DAT202.3x-databricks.zip -------------------------------------------------------------------------------- 
/Data Exploration.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Exploring Data with Dataframes and Spark SQL\nIn this exercise, you will explore data using the Spark Dataframe API and Spark SQL."],"metadata":{}},{"cell_type":"markdown","source":["### Load Data Using an Explicit Schema\nNow you can load the data into a dataframe. If the structure of the data is known ahead of time, you can explicitly specify the schema for the dataframe.\n\nModify the code below to reflect your Azure blob storage account name, and then click the ► button at the top right of the cell to run it."],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nflightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", IntegerType(), False),\n StructField(\"DestAirportID\", IntegerType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n])\n\nflights = spark.read.csv('wasb://spark@.blob.core.windows.net/data/raw-flight-data.csv', schema=flightSchema, header=True)\nflights.show()"],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":["### Infer a Data Schema\nIf the structure of the data source is unknown, you can have Spark automatically infer the schema.\n\nIn this case, you will load data about airports without knowing the schema.\n\nModify the code below to reflect your Azure blob storage account name, and then run the cell."],"metadata":{}},{"cell_type":"code","source":["airports = spark.read.csv('wasb://spark@.blob.core.windows.net/data/airports.csv', header=True, inferSchema=True)\nairports.show()"],"metadata":{},"outputs":[],"execution_count":5},{"cell_type":"markdown","source":["### Use 
Dataframe Methods\nSpark DataFrames provide functions that you can use to extract and manipulate data. For example, you can use the **select** function to return a new dataframe containing columns selected from an existing dataframe."],"metadata":{}},{"cell_type":"code","source":["cities = airports.select(\"city\", \"name\")\ncities.show()"],"metadata":{},"outputs":[],"execution_count":7},{"cell_type":"markdown","source":["### Combine Operations\nYou can combine functions in a single statement to perform multiple operations on a dataframe. In this case, you will use the **join** function to combine the **flights** and **airports** dataframes, and then use the **groupBy** and **count** functions to return the number of flights originating from each city."],"metadata":{}},{"cell_type":"code","source":["flightsByOrigin = flights.join(airports, flights.OriginAirportID == airports.airport_id).groupBy(\"city\").count()\nflightsByOrigin.show()"],"metadata":{},"outputs":[],"execution_count":9},{"cell_type":"markdown","source":["### Count the Rows in a Dataframe\nNow that you're familiar with working with dataframes, a key task when building predictive solutions is to explore the data, determining statistics that will help you understand the data before building predictive models. For example, how many rows of flight data do you actually have?"],"metadata":{}},{"cell_type":"code","source":["flights.count()"],"metadata":{},"outputs":[],"execution_count":11},{"cell_type":"markdown","source":["### Determine the Presence of Duplicates\nThe data you have to work with won't always be perfect - often you'll want to *clean* the data; for example to detect and remove duplicates that might affect your model. 
You can use the **dropDuplicates** function to create a new dataframe with the duplicates removed, enabling you to determine how many rows are duplicates of other rows."],"metadata":{}},{"cell_type":"code","source":["flights.count() - flights.dropDuplicates().count()"],"metadata":{},"outputs":[],"execution_count":13},{"cell_type":"markdown","source":["### Identify Missing Values\nAs well as determining if duplicates exist in your data, you should detect missing values, and either remove rows containing missing data or replace the missing values with a suitable replacement. The **dropna** function creates a dataframe with any rows containing missing data removed - you can specify a subset of columns, and whether the row should be removed if *any* or *all* values are missing. You can then use this new dataframe to determine how many rows contain missing values."],"metadata":{}},{"cell_type":"code","source":["flights.count() - flights.dropDuplicates().dropna(how=\"any\", subset=[\"ArrDelay\", \"DepDelay\"]).count()"],"metadata":{},"outputs":[],"execution_count":15},{"cell_type":"markdown","source":["### Clean the Data\nNow that you've identified that there are duplicates and missing values, you can clean the data by removing the duplicates and replacing the missing values. The **fillna** function replaces missing values with a specified replacement value. In this case, you'll remove all duplicate rows and replace missing **ArrDelay** and **DepDelay** values with **0**."],"metadata":{}},{"cell_type":"code","source":["data=flights.dropDuplicates().fillna(value=0, subset=[\"ArrDelay\", \"DepDelay\"])\ndata.count()"],"metadata":{},"outputs":[],"execution_count":17},{"cell_type":"markdown","source":["## Explore the Data\nNow that you've cleaned the data, you can start to explore it and perform some basic analysis. Let's start by examining the lateness of a flight. The dataset includes the **ArrDelay** field, which tells you how many minutes behind schedule a flight arrived. 
However, if a flight is only a few minutes behind schedule, you might not consider it *late*. Let's make our definition of lateness such that flights that arrive within 25 minutes of their scheduled arrival time are considered on-time, but any flights that are more than 25 minutes behind schedule are classified as *late*. We'll add a column to indicate this classification:"],"metadata":{}},{"cell_type":"code","source":["data = data.select(\"DayofMonth\", \"DayOfWeek\", \"Carrier\", \"OriginAirportID\",\"DestAirportID\",\n \"DepDelay\", \"ArrDelay\", ((col(\"ArrDelay\") > 25).cast(\"Int\").alias(\"Late\")))\ndata.show()"],"metadata":{},"outputs":[],"execution_count":19},{"cell_type":"markdown","source":["### Explore Summary Statistics and Data Distribution\nPredictive modeling is based on statistics and probability, so we should take a look at the summary statistics for the columns in our data. The **describe** function returns a dataframe containing the **count**, **mean**, **standard deviation**, **minimum**, and **maximum** values for each numeric column."],"metadata":{}},{"cell_type":"code","source":["data.describe().show()"],"metadata":{},"outputs":[],"execution_count":21},{"cell_type":"markdown","source":["The *DayofMonth* is a value between 1 and 31, and the mean is around halfway between these values; which seems about right. The same is true for the *DayOfWeek* which is a value between 1 and 7. *Carrier* is a string, so there are no numeric statistics; and we can ignore the statistics for the airport IDs - they're just unique identifiers for the airports, not actually numeric values. The departure and arrival delays range from 63 or 94 minutes ahead of schedule to over 1,800 minutes behind schedule. The means are much closer to zero than this, and the standard deviation is quite large; so there's quite a bit of variance in the delays. 
The *Late* indicator is a 1 or a 0, but the mean is very close to 0; which implies that there are significantly fewer late flights than non-late flights.\n\nLet's verify that assumption by creating a table and using the **Spark SQL** API to run a SQL statement that counts the number of late and non-late flights:"],"metadata":{}},{"cell_type":"code","source":["data.createOrReplaceTempView(\"flightData\")\nspark.sql(\"SELECT Late, COUNT(*) AS Count FROM flightData GROUP BY Late\").show()"],"metadata":{},"outputs":[],"execution_count":23},{"cell_type":"markdown","source":["Yes, it looks like there are significantly more non-late flights than late ones - we can see this more clearly with a visualization, so let's use the inline **%sql** magic to query the table and bring back some results we can display as a chart:"],"metadata":{}},{"cell_type":"code","source":["%sql\nSELECT * FROM flightData"],"metadata":{},"outputs":[],"execution_count":25},{"cell_type":"markdown","source":["The query returns a table of data containing the first 1000 rows, which should be a big enough sample for us to explore. To see the distribution of *Late* classes (1 for late, 0 for on-time), in the visualization drop-down list under the table above, click **Bar**. Then click **Plot Options** and configure the visualization like this:\n- **Keys**: Late\n- **Series Groupings**: *none*\n- **Values**: <id>\n- **Aggregation**: Count\n- **Display type**: Bar chart\n- **Grouped**: Selected\n\nYou should be able to see that the sample includes significantly more on-time flights than late ones. This indicates that the dataset is *imbalanced*; which might adversely affect the accuracy of any machine learning model we train from this data.\n\nAdditionally, you observed earlier that there are some extremely high **DepDelay** and **ArrDelay** values that might be skewing the distribution of the data disproportionately because of a few *outliers*. Let's visualize the distribution of these columns to explore this. 
Change the **Plot Options** settings as follows:\n- **Keys**: *none*\n- **Series Groupings**: *none*\n- **Values**: DepDelay\n- **Aggregation**: Count\n- **Display Type**: Histogram plot\n- **Number of bins**: 20\n\nYou can drag the handle at the bottom right of the visualization to resize it. Note that the data is skewed such that most flights have a **DepDelay** value within 100 or so minutes of 0. However, there are a few flights with extremely high delays. Another way to view this distribution is a *box plot*. Change the **Plot Options** as follows:\n- **Keys**: *none*\n- **Series Groupings**: *none*\n- **Values**: DepDelay\n- **Aggregation**: Count\n- **Display Type**: Box plot\n\nThe box plot consists of a box with a line indicating the median departure delay, and *whiskers* extending from the box to show the first and fourth quartiles of the data, with statistical *outliers* shown as small circles. This confirms the extremely skewed distribution of **DepDelay** values seen in the histogram (and if you care to check, you'll find that the **ArrDelay** column has a similar distribution).\n\nLet's address the outliers and imbalanced classes in our data by removing rows with extreme delay values, and *undersampling* the more common on-time flights:"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.functions import rand\n\n# Remove outliers - let's make the cut-off 150 minutes.\ndata = data.filter(\"DepDelay < 150 AND ArrDelay < 150\")\n\n# Separate the late and on-time flights\npos = data.filter(\"Late = 1\")\nneg = data.filter(\"Late = 0\")\n\n# undersample the most prevalent class to get a roughly even distribution\nposCount = pos.count()\nnegCount = neg.count()\nif posCount > negCount:\n pos = pos.sample(True, negCount/(negCount + posCount))\nelse:\n neg = neg.sample(True, posCount/(negCount + posCount))\n \n# shuffle into random order (so a sample of the first 1000 has a mix of classes)\ndata = neg.union(pos).orderBy(rand())\n\n# Replace the 
temporary table so we can query and visualize the balanced dataset\ndata.createOrReplaceTempView(\"flightData\")\n\n# Show the statistics\ndata.describe().show()"],"metadata":{},"outputs":[],"execution_count":27},{"cell_type":"markdown","source":["Now the maximums for the **DepDelay** and **ArrDelay** are clipped at under 150, and the mean value for the binary *Late* class is nearer 0.5; indicating a more or less even number of each class. We removed some data to accomplish this balancing act, but there are still a substantial number of rows for us to train a machine learning model with, and now the data is more balanced. Let's visualize the data again to confirm this:"],"metadata":{}},{"cell_type":"code","source":["%sql\nSELECT * FROM flightData"],"metadata":{},"outputs":[],"execution_count":29},{"cell_type":"markdown","source":["Display the data as a bar chart to compare the distribution of the **Late** classes as you did previously. There should now be a more or less even number of each class. Then visualize the **DepDelay** field as a histogram and as a box plot to verify that the distribution, while still skewed, has fewer outliers."],"metadata":{}},{"cell_type":"markdown","source":["### Explore Relationships in the Data\nPredictive modeling is largely based on statistical relationships between fields in the data. To design a good model, you need to understand how the data points relate to one another.\n\nA common way to start exploring relationships is to create visualizations that compare two or more data values. For example, modify the **Plot Options** of the chart above to compare the arrival delays for each carrier:\n- **Keys**: Carrier\n- **Series Groupings**: *none*\n- **Values**: ArrDelay\n- **Aggregation**: Count\n- **Display Type**: Box plot\n\nYou may need to resize the plot to see the data clearly, but it should show that the median delay and the distribution of delays vary by carrier, with some carriers having a higher median delay than others. 
The same is true for other features, such as the day of the week and the destination airport. You may already suspect that there's likely to be a relationship between departure delay and arrival delay, so let's examine that next. Change the **Plot Options** as follows:\n- **Keys**: *none*\n- **Series Groupings**: *none*\n- **Values**: ArrDelay, DepDelay\n- **Aggregation**: Count\n- **Display Type**: Scatter plot\n- **Show LOESS**: Selected\n\nThe scatter plot shows the departure delay and corresponding arrival delay for each flight as a point in a two dimensional space. Note that the points form a diagonal line, which indicates a strong linear relationship between departure delay and arrival delay. This linear relationship shows a *correlation* between these two values, which we can measure statistically. The **corr** function calculates a correlation value between -1 and 1, indicating the strength of correlation between two fields. A strong positive correlation (near 1) indicates that high values for one column are often found with high values for the other, while a strong negative correlation (near -1) indicates that *low* values for one column are often found with *high* values for the other. 
A correlation near 0 indicates little apparent relationship between the fields."],"metadata":{}},{"cell_type":"code","source":["data.corr(\"DepDelay\", \"ArrDelay\")"],"metadata":{},"outputs":[],"execution_count":32},{"cell_type":"markdown","source":["In this notebook we've cleaned the flight data, and explored it to identify some potential relationships between features of the flights and their lateness."],"metadata":{}}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"mimetype":"text/x-python","name":"python","pygments_lexer":"ipython3","codemirror_mode":{"name":"ipython","version":"3"},"version":"3.6.5","nbconvert_exporter":"python","file_extension":".py"},"name":"Python Data Exploration","notebookId":4101983005670620},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Graeme Malcolm 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Lab 1 - Exploring Data with Spark.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GraemeMalcolm/predictive-databricks/d3e7ca972a5e28e0427741f5394a7206f5d6dd5d/Lab 1 - Exploring Data with Spark.pdf -------------------------------------------------------------------------------- /Lab 2 - Building Supervised Learning Models.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GraemeMalcolm/predictive-databricks/d3e7ca972a5e28e0427741f5394a7206f5d6dd5d/Lab 2 - Building Supervised Learning Models.pdf -------------------------------------------------------------------------------- /Lab 3 - Evaluating Supervised Learning Models.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GraemeMalcolm/predictive-databricks/d3e7ca972a5e28e0427741f5394a7206f5d6dd5d/Lab 3 - Evaluating Supervised Learning Models.pdf -------------------------------------------------------------------------------- /Lab 4 - Recommenders and Clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GraemeMalcolm/predictive-databricks/d3e7ca972a5e28e0427741f5394a7206f5d6dd5d/Lab 4 - Recommenders and Clustering.pdf -------------------------------------------------------------------------------- /Lab 5 - Using the MML Spark Library.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/GraemeMalcolm/predictive-databricks/d3e7ca972a5e28e0427741f5394a7206f5d6dd5d/Lab 5 - Using the MML Spark Library.pdf -------------------------------------------------------------------------------- /MMLSpark Classifier.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Getting Started with MMLSpark\nIn this exercise, you will use the Microsoft Machine Learning for Spark (MMLSpark) library to create a classifier.\n\n### Load the Data\nFirst, you'll load the flight delay data from your Azure storage account and create a dataframe with a **Late** column that will be the label your classifier predicts."],"metadata":{}},{"cell_type":"code","source":["import numpy as np\nimport pandas as pd\nimport mmlspark\nfrom pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\ncsv = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', inferSchema=True, header=True)\ndata = csv.select(\"DayofMonth\", \"DayOfWeek\", \"OriginAirportID\", \"DestAirportID\", \"DepDelay\", ((col(\"ArrDelay\") > 15).cast(\"Int\").alias(\"Late\")))\ndata.show()"],"metadata":{},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Split the Data for Training and Testing\nNow you'll split the data into two sets; one for training a classification model, the other for testing the trained model."],"metadata":{}},{"cell_type":"code","source":["train, test = data.randomSplit([0.7, 0.3])\ntrain_rows = train.count()\ntest_rows = test.count()\nprint (\"Training Rows:\", train_rows, \" Testing Rows:\", test_rows)"],"metadata":{},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Train a Classification Model\nThe steps so far have been identical to those used to prepare data for training using SparkML. Now we'll use the MMLSpark **TrainClassifier** function to initialize and fit a Logistic Regression model. 
This function abstracts the various SparkML classes used to do this, implicitly converting the data into the correct format for the algorithm."],"metadata":{}},{"cell_type":"code","source":["from mmlspark import TrainClassifier\nfrom pyspark.ml.classification import LogisticRegression\nmodel = TrainClassifier(model=LogisticRegression(), labelCol=\"Late\", numFeatures=256).fit(train)"],"metadata":{},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Evaluate the Model\nThe MMLSpark library also includes classes to calculate the performance metrics of a trained model. The following code calculates metrics for a classifier, and stores them in a table."],"metadata":{}},{"cell_type":"code","source":["from mmlspark import ComputeModelStatistics, TrainedClassifierModel\nprediction = model.transform(test)\nmetrics = ComputeModelStatistics().transform(prediction)\nmetrics.createOrReplaceTempView(\"classMetrics\")\nmetrics.show()"],"metadata":{},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["If the output above is too wide to view clearly, run the following cell to display the results as a scrollable table. The metrics include:\n- predicted_class_as_0.0_actual_is_0.0 (true negatives)\n- predicted_class_as_0.0_actual_is_1.0 (false negatives)\n- predicted_class_as_1.0_actual_is_0.0 (false positives)\n- predicted_class_as_1.0_actual_is_1.0 (true positives)\n- accuracy (proportion of correct predictions)\n- precision (proportion of predicted positives that are actually positive)\n- recall (proportion of actual positives correctly predicted by the model)\n- AUC (area under the ROC curve indicating true positive rate vs false positive rate for all thresholds)"],"metadata":{}},{"cell_type":"code","source":["%sql\nSELECT * FROM classMetrics"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### Learn More\nThis exercise has shown a simple example of using the MMLSpark library. 
The library really provides its greatest value when building deep learning models with the Microsoft cognitive toolkit (CNTK). To learn more about the MMLSpark library, see https://github.com/Azure/mmlspark."],"metadata":{}}],"metadata":{"name":"MMLSpark Classifier","notebookId":1882051270841879},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Parameter Tuning.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Tuning Model Parameters\n\nIn this exercise, you will optimise the parameters for a classification model.\n\n### Prepare the Data\n\nFirst, import the libraries you will need and prepare the training and test data:"],"metadata":{}},{"cell_type":"code","source":["# Import Spark SQL and Spark ML libraries\nfrom pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler\nfrom pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit\nfrom pyspark.ml.evaluation import BinaryClassificationEvaluator\n\n# Load the source data\nflightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", StringType(), False),\n StructField(\"DestAirportID\", StringType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n StructField(\"Late\", IntegerType(), False),\n])\n\ndata = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', schema=flightSchema, header=True)\ndata = data.select(\"DayofMonth\", \"DayOfWeek\", \"Carrier\", \"OriginAirportID\", \"DestAirportID\", \"DepDelay\", col(\"Late\").alias(\"label\"))\n\n# 
Split the data\nsplits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1]"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Define the Pipeline\nNow define a pipeline that creates a feature vector and trains a classification model"],"metadata":{}},{"cell_type":"code","source":["# Define the pipeline\nmonthdayIndexer = StringIndexer(inputCol=\"DayofMonth\", outputCol=\"DayofMonthIdx\")\nweekdayIndexer = StringIndexer(inputCol=\"DayOfWeek\", outputCol=\"DayOfWeekIdx\")\ncarrierIndexer = StringIndexer(inputCol=\"Carrier\", outputCol=\"CarrierIdx\")\noriginIndexer = StringIndexer(inputCol=\"OriginAirportID\", outputCol=\"OriginAirportIdx\")\ndestIndexer = StringIndexer(inputCol=\"DestAirportID\", outputCol=\"DestAirportIdx\")\nnumVect = VectorAssembler(inputCols = [\"DepDelay\"], outputCol=\"numFeatures\")\nminMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol=\"normNums\")\nfeatVect = VectorAssembler(inputCols=[\"DayofMonthIdx\", \"DayOfWeekIdx\", \"CarrierIdx\", \"OriginAirportIdx\", \"DestAirportIdx\", \"normNums\"], outputCol=\"features\")\nlr = LogisticRegression(labelCol=\"label\", featuresCol=\"features\")\npipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Tune Parameters\nYou can tune parameters to find the best model for your data. 
A simple way to do this is to use **TrainValidationSplit** to evaluate each combination of parameters defined in a parameter grid (built with **ParamGridBuilder**) against a subset of the training data in order to find the best performing parameters."],"metadata":{}},{"cell_type":"code","source":["paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001]).addGrid(lr.maxIter, [10, 5, 2]).build()\ntvs = TrainValidationSplit(estimator=pipeline, evaluator=BinaryClassificationEvaluator(), estimatorParamMaps=paramGrid, trainRatio=0.8)\n\nmodel = tvs.fit(train)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Test the Model\nNow you're ready to apply the model to the test data."],"metadata":{}},{"cell_type":"code","source":["prediction = model.transform(test)\npredicted = prediction.select(\"features\", \"prediction\", \"probability\", \"label\")\npredicted.show(100)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Compute Confusion Matrix Metrics\nClassifiers are typically evaluated by creating a *confusion matrix*, which indicates the number of:\n- True Positives\n- True Negatives\n- False Positives\n- False Negatives\n\nFrom these core measures, other evaluation metrics such as *precision* and *recall* can be calculated."],"metadata":{}},{"cell_type":"code","source":["tp = float(predicted.filter(\"prediction == 1.0 AND label == 1\").count())\nfp = float(predicted.filter(\"prediction == 1.0 AND label == 0\").count())\ntn = float(predicted.filter(\"prediction == 0.0 AND label == 0\").count())\nfn = float(predicted.filter(\"prediction == 0.0 AND label == 1\").count())\nmetrics = spark.createDataFrame([\n (\"TP\", tp),\n (\"FP\", fp),\n (\"TN\", tn),\n (\"FN\", fn),\n (\"Precision\", tp / (tp + fp)),\n (\"Recall\", tp / (tp + fn))],[\"metric\", 
\"value\"])\nmetrics.show()"],"metadata":{"collapsed":false},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### Review the Area Under ROC\nAnother way to assess the performance of a classification model is to measure the area under a ROC curve for the model. The spark.ml library includes a **BinaryClassificationEvaluator** class that you can use to compute this."],"metadata":{}},{"cell_type":"code","source":["# Use the rawPrediction column (continuous scores) so the ROC curve covers all thresholds\nevaluator = BinaryClassificationEvaluator(labelCol=\"label\", rawPredictionCol=\"rawPrediction\", metricName=\"areaUnderROC\")\nauc = evaluator.evaluate(prediction)\nprint (\"AUC = \", auc)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":12}],"metadata":{"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":"2","name":"python"}},"name":"Python Parameter Tuning","notebookId":374219277805995},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Pipeline.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Creating a Pipeline\n\nIn this exercise, you will implement a pipeline that includes multiple stages of *transformers* and *estimators* to prepare features and train a classification model. 
The resulting trained *PipelineModel* can then be used as a transformer to predict whether or not a flight will be late.\n\n### Import Spark SQL and Spark ML Libraries\n\nFirst, import the libraries you will need:"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Load Source Data\nThe data for this exercise is provided as a CSV file containing details of flights. The data includes specific characteristics (or *features*) for each flight, as well as a column indicating whether or not the flight was late.\n\nYou will load this data into a dataframe and display it."],"metadata":{}},{"cell_type":"code","source":["flightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", StringType(), False),\n StructField(\"DestAirportID\", StringType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n StructField(\"Late\", IntegerType(), False),\n])\n\ndata = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', schema=flightSchema, header=True)\ndata.show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Split the Data\nIt is common practice when building supervised machine learning models to split the source data, using some of it to train the model and reserving some to test the trained model. 
In this exercise, you will use 70% of the data for training, and reserve 30% for testing."],"metadata":{}},{"cell_type":"code","source":["splits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1]\ntrain_rows = train.count()\ntest_rows = test.count()\nprint (\"Training Rows:\", train_rows, \" Testing Rows:\", test_rows)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Define the Pipeline\nA predictive model often requires multiple stages of feature preparation. For example, it is common when using some algorithms to distinguish between continuous features (which have a calculable numeric value) and categorical features (which are numeric representations of discrete categories). It is also common to *normalize* continuous numeric features to use a common scale - for example, by scaling all numbers to a proportional decimal value between 0 and 1 (strictly speaking, it only really makes sense to do this when you have multiple numeric columns - normalizing them all to similar scales prevents a feature with particularly large values from dominating the training of the model - in this case, we only have one non-categorical numeric feature; but I've included this so you can see how it's done!).\n\nA pipeline consists of a series of *transformer* and *estimator* stages that typically prepare a dataframe for modeling and then train a predictive model. 
In this case, you will create a pipeline with nine stages:\n- A **StringIndexer** estimator for each categorical variable to generate numeric indexes for categorical features\n- A **VectorAssembler** that creates a vector of continuous numeric features\n- A **MinMaxScaler** that normalizes the vector of numeric features\n- A **VectorAssembler** that creates a vector of categorical and continuous features\n- A **LogisticRegression** algorithm that trains a classification model."],"metadata":{}},{"cell_type":"code","source":["monthdayIndexer = StringIndexer(inputCol=\"DayofMonth\", outputCol=\"DayofMonthIdx\")\nweekdayIndexer = StringIndexer(inputCol=\"DayOfWeek\", outputCol=\"DayOfWeekIdx\")\ncarrierIndexer = StringIndexer(inputCol=\"Carrier\", outputCol=\"CarrierIdx\")\noriginIndexer = StringIndexer(inputCol=\"OriginAirportID\", outputCol=\"OriginAirportIdx\")\ndestIndexer = StringIndexer(inputCol=\"DestAirportID\", outputCol=\"DestAirportIdx\")\nnumVect = VectorAssembler(inputCols = [\"DepDelay\"], outputCol=\"numFeatures\")\nminMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol=\"normNums\")\nfeatVect = VectorAssembler(inputCols=[\"DayofMonthIdx\", \"DayOfWeekIdx\", \"CarrierIdx\", \"OriginAirportIdx\", \"DestAirportIdx\", \"normNums\"], outputCol=\"features\")\nlr = LogisticRegression(labelCol=\"Late\", featuresCol=\"features\")\npipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])"],"metadata":{},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Run the Pipeline as an Estimator\nThe pipeline itself is an estimator, and so it has a **fit** method that you can call to run the pipeline on a specified dataframe. 
In this case, you will run the pipeline on the training data to train a model."],"metadata":{}},{"cell_type":"code","source":["piplineModel = pipeline.fit(train)\nprint (\"Pipeline complete!\")"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### Test the Pipeline Model\nThe model produced by the pipeline is a transformer that will apply all of the stages in the pipeline to a specified dataframe and apply the trained model to generate predictions. In this case, you will transform the **test** dataframe using the pipeline to generate label predictions."],"metadata":{}},{"cell_type":"code","source":["prediction = piplineModel.transform(test)\npredicted = prediction.select(\"features\", col(\"prediction\").cast(\"Int\"), col(\"Late\").alias(\"trueLabel\"))\npredicted.show(100, truncate=False)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":12},{"cell_type":"markdown","source":["The resulting dataframe is produced by applying all of the transformations in the pipeline to the test data. 
The **prediction** column contains the predicted value for the label, and the **trueLabel** column contains the actual known value from the testing data."],"metadata":{}}],"metadata":{"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":"2","name":"python"}},"name":"Python Pipeline","notebookId":374219277805845},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Implementing Predictive Analytics with Spark in Azure Databricks 2 | A few years ago, I wrote and recorded the edX course [Implementing Predictive Analytics with Spark in Azure HDInsight](https://www.edx.org/course/implementing-predictive-analytics-with-spark-in-azure-hdinsight), which teaches you how to use the Spark MLLib library to build machine learning solutions in a Spark Azure HDInsight cluster. 3 | 4 | Microsoft now also offers Spark capabilities in the **Azure Databricks** service. This repo contains versions of the lab files that have been modified to use Azure Databricks. 
5 | -------------------------------------------------------------------------------- /Recommender.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Collaborative Filtering\nCollaborative filtering is a machine learning technique that predicts ratings awarded to items by users.\n\n### Import the ALS class\nIn this exercise, you will use the Alternating Least Squares collaborative filtering algorithm to create a recommender."],"metadata":{}},{"cell_type":"code","source":["from pyspark.ml.recommendation import ALS"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Load Source Data\nThe source data for the recommender is in two files - one containing numeric IDs for movies and users, along with user ratings; and the other containing details of the movies."],"metadata":{}},{"cell_type":"code","source":["ratings = spark.read.csv('wasb://spark@.blob.core.windows.net/data/ratings.csv', inferSchema=True, header=True)\nmovies = spark.read.csv('wasb://spark@.blob.core.windows.net/data/movies.csv', inferSchema=True, header=True)\nratings.join(movies, \"movieId\").show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Prepare the Data\nTo prepare the data, split it into a training set and a test set."],"metadata":{}},{"cell_type":"code","source":["data = ratings.select(\"userId\", \"movieId\", \"rating\")\nsplits = data.randomSplit([0.7, 0.3])\ntrain = splits[0].withColumnRenamed(\"rating\", \"label\")\ntest = splits[1].withColumnRenamed(\"rating\", \"trueLabel\")\ntrain_rows = train.count()\ntest_rows = test.count()\nprint (\"Training Rows:\", train_rows, \" Testing Rows:\", test_rows)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Build the Recommender\nThe ALS class is an estimator, 
so you can use its **fit** method to train a model, or you can include it in a pipeline. Rather than specifying a feature vector and a label, the ALS algorithm requires a numeric user ID, item ID, and rating."],"metadata":{}},{"cell_type":"code","source":["als = ALS(maxIter=5, regParam=0.01, userCol=\"userId\", itemCol=\"movieId\", ratingCol=\"label\")\nmodel = als.fit(train)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Test the Recommender\nNow that you've trained the recommender, you can see how accurately it predicts known ratings in the test set."],"metadata":{}},{"cell_type":"code","source":["prediction = model.transform(test)\nprediction.join(movies, \"movieId\").select(\"userId\", \"title\", \"prediction\", \"trueLabel\").show(100, truncate=False)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["The data used in this exercise describes 5-star rating activity from [MovieLens](http://movielens.org), a movie recommendation service. It was created by GroupLens, a research group in the Department of Computer Science and Engineering at the University of Minnesota, and is used here with permission.\n\nThis dataset and other GroupLens data sets are publicly available for download at .\n\nFor more information, see F. Maxwell Harper and Joseph A. Konstan. 2015. [The MovieLens Datasets: History and Context. 
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015)](http://dx.doi.org/10.1145/2827872)"],"metadata":{}}],"metadata":{"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":2,"name":"python"}},"name":"Python Recommender","notebookId":3378903555804655},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Regression Evaluation.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Evaluating a Regression Model\n\nIn this exercise, you will create a pipeline for a linear regression model, and then test and evaluate the model.\n\n### Prepare the Data\n\nFirst, import the libraries you will need and prepare the training and test data:"],"metadata":{}},{"cell_type":"code","source":["# Import Spark SQL and Spark ML libraries\nfrom pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.regression import LinearRegression\nfrom pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler\nfrom pyspark.ml.tuning import ParamGridBuilder, CrossValidator\nfrom pyspark.ml.evaluation import RegressionEvaluator\n\n# Load the source data\nflightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", StringType(), False),\n StructField(\"DestAirportID\", StringType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n StructField(\"Late\", IntegerType(), False),\n])\n\ndata = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', schema=flightSchema, 
header=True)\ndata = data.select(\"DayofMonth\", \"DayOfWeek\", \"Carrier\", \"OriginAirportID\", \"DestAirportID\", \"DepDelay\", col(\"ArrDelay\").alias(\"label\"))\n\n# Split the data\nsplits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1]"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Define the Pipeline and Train the Model\nNow define a pipeline that creates a feature vector and trains a regression model"],"metadata":{}},{"cell_type":"code","source":["# Define the pipeline\nmonthdayIndexer = StringIndexer(inputCol=\"DayofMonth\", outputCol=\"DayofMonthIdx\")\nweekdayIndexer = StringIndexer(inputCol=\"DayOfWeek\", outputCol=\"DayOfWeekIdx\")\ncarrierIndexer = StringIndexer(inputCol=\"Carrier\", outputCol=\"CarrierIdx\")\noriginIndexer = StringIndexer(inputCol=\"OriginAirportID\", outputCol=\"OriginAirportIdx\")\ndestIndexer = StringIndexer(inputCol=\"DestAirportID\", outputCol=\"DestAirportIdx\")\nnumVect = VectorAssembler(inputCols = [\"DepDelay\"], outputCol=\"numFeatures\")\nminMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol=\"normNums\")\nfeatVect = VectorAssembler(inputCols=[\"DayofMonthIdx\", \"DayOfWeekIdx\", \"CarrierIdx\", \"OriginAirportIdx\", \"DestAirportIdx\", \"normNums\"], outputCol=\"features\")\nlr = LinearRegression(labelCol=\"label\", featuresCol=\"features\")\npipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])\n\n# Train the model\npiplineModel = pipeline.fit(train)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Test the Model\nNow you're ready to apply the model to the test data."],"metadata":{}},{"cell_type":"code","source":["prediction = piplineModel.transform(test)\npredicted = prediction.select(\"features\", \"prediction\", 
\"label\")\npredicted.show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Examine the Predicted and Actual Values\nYou can plot the predicted values against the actual values to see how accurately the model has predicted. In a perfect model, the resulting scatter plot should form a perfect diagonal line with each predicted value being identical to the actual value - in practice, some variance is to be expected.\nRun the cells below to create a temporary table from the **predicted** DataFrame and then retrieve the predicted and actual label values using SQL. You can then display the results as a scatter plot to see how well the predicted delay correlates to the actual delay."],"metadata":{}},{"cell_type":"code","source":["predicted.createOrReplaceTempView(\"regressionPredictions\")"],"metadata":{"collapsed":false},"outputs":[],"execution_count":8},{"cell_type":"code","source":["%sql\nSELECT label, prediction FROM regressionPredictions"],"metadata":{"collapsed":false},"outputs":[],"execution_count":9},{"cell_type":"markdown","source":["### Retrieve the Root Mean Square Error (RMSE)\nThere are a number of metrics used to measure the variance between predicted and actual values. Of these, the root mean square error (RMSE) is a commonly used value that is measured in the same units as the predicted and actual values - so in this case, the RMSE indicates the average number of minutes between predicted and actual flight delay values. 
You can use the **RegressionEvaluator** class to retrieve the RMSE."],"metadata":{}},{"cell_type":"code","source":["from pyspark.ml.evaluation import RegressionEvaluator\n\nevaluator = RegressionEvaluator(labelCol=\"label\", predictionCol=\"prediction\", metricName=\"rmse\")\nrmse = evaluator.evaluate(prediction)\nprint (\"Root Mean Square Error (RMSE):\", rmse)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":11}],"metadata":{"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":"2","name":"python"}},"name":"Python Regression Evaluation","notebookId":374219277806021,"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"widgets":{"state":{"d6eefac808724429915e0b32eea703c9":{"views":[{"cell_index":"8"}]},"3cc3b0402bc44842aed785220d307db0":{"views":[{"cell_index":"8"}]},"76bc6e0b425942c3bf1e45506e6ed087":{"views":[{"cell_index":"8"}]},"9a621f8e3bc242bb8ca31f76cc02b5da":{"views":[{"cell_index":"8"}]}},"version":"1.2.0"}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Regression.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Creating a Regression Model\n\nIn this exercise, you will implement a regression model that uses features of a flight to predict how late or early it will arrive.\n\n### Import Spark SQL and Spark ML Libraries\n\nFirst, import the libraries you will need:"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml.regression import LinearRegression\nfrom pyspark.ml.feature import StringIndexer, VectorAssembler"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Load Source Data\nThe data for this exercise is provided as a CSV file containing 
details of flights. The data includes specific characteristics (or *features*) for each flight, as well as a *label* column indicating how many minutes late or early the flight arrived.\n\nYou will load this data into a DataFrame and display it."],"metadata":{}},{"cell_type":"code","source":["flightSchema = StructType([\n StructField(\"DayofMonth\", IntegerType(), False),\n StructField(\"DayOfWeek\", IntegerType(), False),\n StructField(\"Carrier\", StringType(), False),\n StructField(\"OriginAirportID\", IntegerType(), False),\n StructField(\"DestAirportID\", IntegerType(), False),\n StructField(\"DepDelay\", IntegerType(), False),\n StructField(\"ArrDelay\", IntegerType(), False),\n StructField(\"Late\", IntegerType(), False),\n])\n\ndata = spark.read.csv('wasb://spark@.blob.core.windows.net/data/flights.csv', schema=flightSchema, header=True)\ndata.show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Split the Data\nIt is common practice when building supervised machine learning models to split the source data, using some of it to train the model and reserving some to test the trained model. In this exercise, you will use 70% of the data for training, and reserve 30% for testing."],"metadata":{}},{"cell_type":"code","source":["splits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1]\ntrain_rows = train.count()\ntest_rows = test.count()\nprint (\"Training Rows:\", train_rows, \" Testing Rows:\", test_rows)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Prepare the Training Data\nTo train the regression model, you need a training data set that includes a vector of numeric features, and a label column. 
In this exercise, you will use the **StringIndexer** class to generate a numeric category for each discrete **Carrier** string value, and then use the **VectorAssembler** class to transform the numeric features that would be available for a flight that hasn't yet arrived into a vector, and then rename the **ArrDelay** column to **label** as this is what we're going to try to predict.\n\n*Note: This is a deliberately simple example. In reality you'd likely perform multiple data preparation steps, and later in this course we'll examine how to encapsulate these steps into a pipeline. For now, we'll just use the numeric features as they are to define the training dataset.*"],"metadata":{}},{"cell_type":"code","source":["# Carrier is a string, and we need our features to be numeric - so we'll generate a numeric index for each distinct carrier string, and transform the dataframe to add that as a column\ncarrierIndexer = StringIndexer(inputCol=\"Carrier\", outputCol=\"CarrierIdx\")\nnumTrain = carrierIndexer.fit(train).transform(train)\n\n# Now we'll assemble a vector of all the numeric feature columns (other than ArrDelay, which we wouldn't have for enroute flights)\nassembler = VectorAssembler(inputCols = [\"DayofMonth\", \"DayOfWeek\", \"CarrierIdx\", \"OriginAirportID\", \"DestAirportID\", \"DepDelay\"], outputCol=\"features\")\ntraining = assembler.transform(numTrain).select(col(\"features\"), (col(\"ArrDelay\").cast(\"Int\").alias(\"label\")))\ntraining.show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Train a Regression Model\nNext, you need to train a regression model using the training data. To do this, create an instance of the regression algorithm you want to use and use its **fit** method to train a model based on the training DataFrame. 
In this exercise, you will use a *Linear Regression* algorithm - though you can use the same technique for any of the regression algorithms supported in the spark.ml API."],"metadata":{}},{"cell_type":"code","source":["lr = LinearRegression(labelCol=\"label\",featuresCol=\"features\", maxIter=10, regParam=0.3)\nmodel = lr.fit(training)\nprint (\"Model trained!\")"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### Prepare the Testing Data\nNow that you have a trained model, you can test it using the testing data you reserved previously. First, you need to prepare the testing data in the same way as you did the training data by transforming the feature columns into a vector. This time you'll rename the **ArrDelay** column to **trueLabel**."],"metadata":{}},{"cell_type":"code","source":["# Transform the test data to add the numeric carrier index, using an indexer fitted on the training data so the index values match those the model was trained on\nnumTest = carrierIndexer.fit(train).transform(test)\n\n# Generate the features vector and label\ntesting = assembler.transform(numTest).select(col(\"features\"), (col(\"ArrDelay\")).cast(\"Int\").alias(\"trueLabel\"))\ntesting.show()"],"metadata":{"collapsed":false},"outputs":[],"execution_count":12},{"cell_type":"markdown","source":["### Test the Model\nNow you're ready to use the **transform** method of the model to generate some predictions. 
You can use this approach to predict arrival delay for flights where the label is unknown; but in this case you are using the test data which includes a known true label value, so you can compare the predicted number of minutes late or early to the actual arrival delay."],"metadata":{}},{"cell_type":"code","source":["prediction = model.transform(testing)\npredicted = prediction.select(\"features\", \"prediction\", \"trueLabel\")\npredicted.show()"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":14},{"cell_type":"markdown","source":["Looking at the result, the **prediction** column contains the predicted value for the label, and the **trueLabel** column contains the actual known value from the testing data. It looks like there is some variance between the predictions and the actual values (the individual differences are referred to as *residuals*). Later in this course you'll learn how to measure the accuracy of a model."],"metadata":{}}],"metadata":{"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":"2","name":"python"}},"name":"Python Regression","notebookId":374219277805877},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Setup.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GraemeMalcolm/predictive-databricks/d3e7ca972a5e28e0427741f5394a7206f5d6dd5d/Setup.pdf -------------------------------------------------------------------------------- /Text Analysis.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["## Text Analysis\nIn this lab, you will create a classification model that performs sentiment analysis of tweets.\n### Import Spark SQL and Spark ML 
Libraries\n\nFirst, import the libraries you will need:"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.types import *\nfrom pyspark.sql.functions import *\n\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.feature import HashingTF, Tokenizer, StopWordsRemover"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":2},{"cell_type":"markdown","source":["### Load Source Data\nNow load the tweets data into a DataFrame. This data consists of tweets that have been previously captured and classified as positive or negative."],"metadata":{}},{"cell_type":"code","source":["tweets_csv = spark.read.csv('wasb://spark@.blob.core.windows.net/data/tweets.csv', inferSchema=True, header=True)\ntweets_csv.show(truncate = False)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":4},{"cell_type":"markdown","source":["### Prepare the Data\nThe features for the classification model will be derived from the tweet text. 
The label is the sentiment (1 for positive, 0 for negative)."],"metadata":{}},{"cell_type":"code","source":["data = tweets_csv.select(\"SentimentText\", col(\"Sentiment\").cast(\"Int\").alias(\"label\"))\ndata.show(truncate = False)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":["### Split the Data\nIn common with most classification modeling processes, you'll split the data into a set for training, and a set for testing the trained model."],"metadata":{}},{"cell_type":"code","source":["splits = data.randomSplit([0.7, 0.3])\ntrain = splits[0]\ntest = splits[1].withColumnRenamed(\"label\", \"trueLabel\")\ntrain_rows = train.count()\ntest_rows = test.count()\nprint (\"Training Rows:\", train_rows, \" Testing Rows:\", test_rows)"],"metadata":{"collapsed":false},"outputs":[],"execution_count":8},{"cell_type":"markdown","source":["### Define the Pipeline\nThe pipeline for the model consists of the following stages:\n- A Tokenizer to split the tweets into individual words.\n- A StopWordsRemover to remove common words such as \"a\" or \"the\" that have little predictive value.\n- A HashingTF class to generate numeric vectors from the text values.\n- A LogisticRegression algorithm to train a binary classification model."],"metadata":{}},{"cell_type":"code","source":["tokenizer = Tokenizer(inputCol=\"SentimentText\", outputCol=\"SentimentWords\")\nswr = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol=\"MeaningfulWords\")\nhashTF = HashingTF(inputCol=swr.getOutputCol(), outputCol=\"features\")\nlr = LogisticRegression(labelCol=\"label\", featuresCol=\"features\", maxIter=10, regParam=0.01)\npipeline = Pipeline(stages=[tokenizer, swr, hashTF, lr])"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":10},{"cell_type":"markdown","source":["### Run the Pipeline as an Estimator\nThe pipeline itself is an estimator, and so it has a **fit** method that you can call to run the pipeline on a 
specified DataFrame. In this case, you will run the pipeline on the training data to train a model."],"metadata":{}},{"cell_type":"code","source":["piplineModel = pipeline.fit(train)\nprint (\"Pipeline complete!\")"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":12},{"cell_type":"markdown","source":["### Test the Pipeline Model\nThe model produced by the pipeline is a transformer that will apply all of the stages in the pipeline to a specified DataFrame and apply the trained model to generate predictions. In this case, you will transform the **test** DataFrame using the pipeline to generate label predictions."],"metadata":{}},{"cell_type":"code","source":["prediction = piplineModel.transform(test)\npredicted = prediction.select(\"SentimentText\", \"prediction\", \"trueLabel\")\npredicted.show(100, truncate = False)"],"metadata":{"scrolled":false,"collapsed":false},"outputs":[],"execution_count":14}],"metadata":{"kernelspec":{"display_name":"PySpark","name":"pysparkkernel","language":""},"language_info":{"mimetype":"text/x-python","pygments_lexer":"python2","name":"pyspark","codemirror_mode":{"version":2,"name":"python"}},"name":"Python Text Analysis","notebookId":3378903555804445},"nbformat":4,"nbformat_minor":0} 2 | --------------------------------------------------------------------------------