├── README.md
└── Capstone-Lab.py.py

/README.md:
--------------------------------------------------------------------------------
# Azure-Databricks-Capstone
This repository hosts the capstone project for the Azure Databricks training. It contains the problem-statement notebook that attendees work through to produce a solution. The notebook includes all the references attendees need to get started on the capstone project.
--------------------------------------------------------------------------------

/Capstone-Lab.py.py:
--------------------------------------------------------------------------------
# Databricks notebook source
# MAGIC %md-sandbox
# MAGIC
# MAGIC
# MAGIC Databricks Learning
# MAGIC

# COMMAND ----------

# MAGIC %md-sandbox
# MAGIC # Data Science Capstone
# MAGIC
# MAGIC

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Airbnb has a host of public data sets for listings in cities throughout the world:
# MAGIC http://insideairbnb.com/get-the-data.html
# MAGIC
# MAGIC This challenge works with the data set for London.
# MAGIC
# MAGIC Split into teams of 2 or 3 to complete this challenge.
# MAGIC
# MAGIC **Metric:** We will be using RMSE to assess the effectiveness of our models.

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Airbnb Price Prediction Challenge
# MAGIC
# MAGIC 1. Configure the Classroom
# MAGIC    1. Add the data set to Databricks
# MAGIC    2. Read the Data
# MAGIC 2. Prepare the Data
# MAGIC 3. Define Preprocessing Models
# MAGIC 4. Split the Data for Model Development
# MAGIC 5. Prepare a Benchmark Model
# MAGIC 6. Iterate on the Benchmark Model
# MAGIC 7. Iterate on the Best Model
# MAGIC
# MAGIC A list of available regression models can be found here: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Configure Classroom
# MAGIC
# MAGIC Run the following cell to configure our "classroom."

# COMMAND ----------

# MAGIC %run "../Includes/Classroom Setup"

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Read the Data
# MAGIC - Load the data as a DataFrame
# MAGIC - Prepare a basic description of the data set including:
# MAGIC   - the schema with data types
# MAGIC   - the number of rows

# COMMAND ----------

try:
  sasToken = "?sv=2017-11-09&ss=bf&srt=co&sp=rl&se=2099-12-31T23:59:59Z" + \
             "&st=2018-01-01T00:00:00Z&spr=https&sig=di3x0sjVwmqIjO5ReQ%2Bwa54R9shTDZePtKHipkabqAg%3D"
  dbutils.fs.mount(
    source = "wasbs://class-453@airlift453.blob.core.windows.net/",
    mount_point = "/mnt/training-453",
    extra_configs = {"fs.azure.sas.class-453.airlift453.blob.core.windows.net": sasToken})
except Exception as e:
  if "Directory already mounted" in str(e):
    pass  # Ignore the error if the directory is already mounted.
  else:
    raise e
print("Success.")

# COMMAND ----------

filePath = "dbfs:/mnt/training-453/airbnb/listings/london-cleaned.csv"

initDF = (spark.read
  .option("multiline", True)
  .option("header", True)
  .option("inferSchema", True)
  .csv(filePath)
)

display(initDF)

# COMMAND ----------

initDF.printSchema()

# COMMAND ----------

initDF.count()

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Prepare the Data
# MAGIC - Count rows with null values
# MAGIC - Impute missing values for numerical fields
# MAGIC - Remove rows with null values for `zipcode`

# COMMAND ----------

# TODO: Count the number of rows in the `initDF` DataFrame

# COMMAND ----------

# TODO: Show a description of the `initDF` DataFrame that reports the number of non-null rows

# COMMAND ----------

# TODO: Impute missing values for numerical columns

# COMMAND ----------

# TODO: Remove rows with null values for zip code, naming the result `airbnbDF` (it is used in the split below)

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Prepare for the Competition
# MAGIC
# MAGIC - Using the same random seed, select 20% of the data to be used as a hold-out set for comparison between teams.
# MAGIC
# MAGIC `modelingDF` will be used to prepare your model. You are free to use this data any way that you see fit in order to prepare the best possible model. **You must not expose your model to `holdOutDF`**.
# MAGIC
# MAGIC `holdOutDF` will be used for comparison. You will submit scores for your model's performance on `holdOutDF` to the instructor for comparison.

# COMMAND ----------

seed = 273
(holdOutDF, modelingDF) = airbnbDF.randomSplit([0.2, 0.8], seed=seed)

print(holdOutDF.count(), modelingDF.count())

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Define Preprocessing Models
# MAGIC
# MAGIC Prepare the following models to be used in a modeling pipeline:
# MAGIC - Prepare a StringIndexer for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`
# MAGIC - Prepare a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`

# COMMAND ----------

from pyspark.ml.feature import StringIndexer

print(StringIndexer().explainParams())

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now *StringIndex* all categorical features (`neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`) and set `handleInvalid` to `skip`. Set the output columns to `cat_neighbourhood_cleansed`, `cat_room_type`, `cat_zipcode`, `cat_property_type` and `cat_bed_type`, respectively.

# COMMAND ----------

# TODO: Prepare a StringIndexer for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`

# COMMAND ----------

# TODO: Prepare a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Split the Data for Model Development
# MAGIC
# MAGIC Let's keep 80% for the training set and set aside 20% of our data for the test set.
# MAGIC
# MAGIC **NOTE:** The data is now split into three sets:
# MAGIC - `trainDF` - used for training a model
# MAGIC - `testDF` - used for internal validation of hyperparameters
# MAGIC - `holdOutDF` - used for the final assessment of the model and comparison to models prepared by other teams

# COMMAND ----------

# TODO: Perform a train-test split on `modelingDF`

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Prepare a Benchmark Model
# MAGIC
# MAGIC - Define a `list` (Python) or `Array` (Scala) containing the features to be used. It is recommended to use the following features:
# MAGIC
# MAGIC `"host_total_listings_count"`, `"accommodates"`, `"bathrooms"`, `"bedrooms"`, `"beds"`, `"minimum_nights"`, `"number_of_reviews"`, `"review_scores_rating"`, `"review_scores_accuracy"`, `"review_scores_cleanliness"`, `"review_scores_checkin"`, `"review_scores_communication"`, `"review_scores_location"`, `"review_scores_value"`, `"vec_neighborhood"`, `"vec_room_type"`, `"vec_zipcode"`, `"vec_property_type"`, `"vec_bed_type"`
# MAGIC - Build a Linear Regression pipeline that contains:
# MAGIC   - each of the StringIndexers
# MAGIC   - the OneHotEncoder
# MAGIC   - a VectorAssembler
# MAGIC   - a LinearRegression estimator
# MAGIC - Evaluate the performance of the benchmark model using a RegressionEvaluator

# COMMAND ----------

# TODO: Define a `list` (Python) or `Array` (Scala) containing the features to be used.

# COMMAND ----------

# TODO: Build a Linear Regression pipeline

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the benchmark model using a RegressionEvaluator on the internal testing set, `testDF`.

# COMMAND ----------

# TODO: Evaluate the performance of the benchmark model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the benchmark model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# TODO: Evaluate the performance of the benchmark model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Iterate on Benchmark Model
# MAGIC - Prepare a model to beat your benchmark model.
# MAGIC - Build a regression pipeline that contains:
# MAGIC   - each of the StringIndexers
# MAGIC   - the OneHotEncoder
# MAGIC   - a VectorAssembler
# MAGIC   - an improved regression estimator
# MAGIC - Evaluate the performance of the new model using a RegressionEvaluator on your internal testing set, `testDF`.
# MAGIC - Use the internal testing set to adjust the hyperparameters of your model.
# MAGIC - Evaluate the performance of the new model using a RegressionEvaluator on the class evaluation set, `holdOutDF`.
# MAGIC - When you have beaten the benchmark, share the results with your instructor.

# COMMAND ----------

# TODO: Build a better regression pipeline

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the better model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# TODO: Evaluate the performance of the better model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Use the internal test set to adjust the hyperparameters of your model.

# COMMAND ----------

# TODO: Tune the hyperparameters of your model using the internal test set

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the tuned model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# TODO: Evaluate the performance of the tuned model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the tuned model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# TODO: Evaluate the performance of the tuned model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# MAGIC %md-sandbox
# MAGIC © 2018 Databricks, Inc. All rights reserved.
# MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
# MAGIC
# MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------