├── README.md
└── Capstone-Lab.py.py

/README.md:
--------------------------------------------------------------------------------
# Azure-Databricks-Capstone
This repository hosts the capstone project for the Azure Databricks training. It contains the problem-statement notebook that attendees work through to produce a solution. The notebook includes all the references attendees need to get started on the capstone project.
--------------------------------------------------------------------------------

/Capstone-Lab.py.py:
--------------------------------------------------------------------------------
# Databricks notebook source
# MAGIC %md-sandbox
# MAGIC
# MAGIC
# MAGIC Databricks Learning
# MAGIC

# COMMAND ----------

# MAGIC %md-sandbox
# MAGIC # Data Science Capstone
# MAGIC
# MAGIC

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Airbnb has a host of public data sets for listings in cities throughout the world:
# MAGIC http://insideairbnb.com/get-the-data.html
# MAGIC
# MAGIC This challenge works with the data set for London.
# MAGIC
# MAGIC Split into teams of 2 or 3 to complete this challenge.
# MAGIC
# MAGIC **Metric:** We will be using RMSE to assess the effectiveness of our models.

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Airbnb Price Prediction Challenge
# MAGIC
# MAGIC 1. Configure the Classroom
# MAGIC    1. Add the data set to Databricks
# MAGIC    2. Read the Data
# MAGIC 2. Prepare the Data
# MAGIC 3. Define Preprocessing Models
# MAGIC 4. Split the Data for Model Development
# MAGIC 5. Prepare a Benchmark Model
# MAGIC 6. Iterate on the Benchmark Model
# MAGIC 7. Iterate on the Best Model
# MAGIC
# MAGIC A list of available regression models can be found here: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Configure Classroom
# MAGIC
# MAGIC Run the following cell to configure our "classroom."

# COMMAND ----------

# MAGIC %run "../Includes/Classroom Setup"

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Read the Data
# MAGIC - Load the data as a DataFrame
# MAGIC - Prepare a basic description of the data set including:
# MAGIC   - the schema with data types
# MAGIC   - the number of rows

# COMMAND ----------

try:
  sasToken = "?sv=2017-11-09&ss=bf&srt=co&sp=rl&se=2099-12-31T23:59:59Z" + \
             "&st=2018-01-01T00:00:00Z&spr=https&sig=di3x0sjVwmqIjO5ReQ%2Bwa54R9shTDZePtKHipkabqAg%3D"
  dbutils.fs.mount(
    source = "wasbs://class-453@airlift453.blob.core.windows.net/",
    mount_point = "/mnt/training-453",
    extra_configs = {"fs.azure.sas.class-453.airlift453.blob.core.windows.net": sasToken})
except Exception as e:
  if "Directory already mounted" in str(e):
    pass  # Ignore the error if the directory is already mounted.
  else:
    raise e
print("Success.")

# COMMAND ----------

filePath = "dbfs:/mnt/training-453/airbnb/listings/london-cleaned.csv"

initDF = (spark.read
  .option("multiline", True)
  .option("header", True)
  .option("inferSchema", True)
  .csv(filePath)
)

display(initDF)

# COMMAND ----------

initDF.printSchema()

# COMMAND ----------

initDF.count()

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Prepare the Data
# MAGIC - Count rows with null values
# MAGIC - Impute missing values for numerical fields
# MAGIC - Remove rows with null values for `zipcode`

# COMMAND ----------

# TODO: Count the number of rows in the `initDF` DataFrame

# COMMAND ----------

# TODO: Show a description of the `initDF` DataFrame that reports the number of non-null rows

# COMMAND ----------

# TODO: Impute missing values for numerical columns

# COMMAND ----------

# TODO: Remove rows with null values for zip code, naming the result `airbnbDF` (it is used in the split below)

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Prepare for the Competition
# MAGIC
# MAGIC - Using the same random seed, select 20% of the data to be used as a hold-out set for comparison between teams.
# MAGIC
# MAGIC `modelingDF` will be used to prepare your model. You are free to use this data any way that you see fit in order to prepare the best possible model. **You must not expose your model to `holdOutDF`**.
# MAGIC
# MAGIC `holdOutDF` will be used for comparison. You will submit scores for your model's performance on `holdOutDF` to the instructor for comparison.

# COMMAND ----------

seed = 273
(holdOutDF, modelingDF) = airbnbDF.randomSplit([0.2, 0.8], seed=seed)

print(holdOutDF.count(), modelingDF.count())

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Define Preprocessing Models
# MAGIC
# MAGIC Prepare the following models to be used in a modeling pipeline:
# MAGIC - Prepare a StringIndexer for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`
# MAGIC - Prepare a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`

# COMMAND ----------

from pyspark.ml.feature import StringIndexer

print(StringIndexer().explainParams())

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Now *StringIndex* all categorical features (`neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`) and set `handleInvalid` to `skip`. Set the output columns to `cat_neighbourhood_cleansed`, `cat_room_type`, `cat_zipcode`, `cat_property_type` and `cat_bed_type`, respectively.

# COMMAND ----------

# TODO: Prepare a StringIndexer for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`

# COMMAND ----------

# TODO: Prepare a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Split the Data for Model Development
# MAGIC
# MAGIC Let's keep 80% for the training set and set aside 20% of our data for the test set.
# MAGIC
# MAGIC **NOTE:** The data is now split into three sets:
# MAGIC - `trainDF` - used for training a model
# MAGIC - `testDF` - used for internal validation of hyperparameters
# MAGIC - `holdOutDF` - used for the final assessment of the model and comparison to models prepared by other teams

# COMMAND ----------

# TODO: Perform a train-test split on `modelingDF`

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Prepare a Benchmark Model
# MAGIC
# MAGIC - Define a `list` (Python) or `Array` (Scala) containing the features to be used. It is recommended to use the following features:
# MAGIC
# MAGIC `"host_total_listings_count"`, `"accommodates"`, `"bathrooms"`, `"bedrooms"`, `"beds"`, `"minimum_nights"`, `"number_of_reviews"`, `"review_scores_rating"`, `"review_scores_accuracy"`, `"review_scores_cleanliness"`, `"review_scores_checkin"`, `"review_scores_communication"`, `"review_scores_location"`, `"review_scores_value"`, `"vec_neighborhood"`, `"vec_room_type"`, `"vec_zipcode"`, `"vec_property_type"`, `"vec_bed_type"`
# MAGIC - Build a Linear Regression pipeline that contains:
# MAGIC   - each of the StringIndexers
# MAGIC   - the OneHotEncoder
# MAGIC   - a VectorAssembler
# MAGIC   - a LinearRegression estimator
# MAGIC - Evaluate the performance of the benchmark model using a RegressionEvaluator

# COMMAND ----------

# TODO: Define a `list` (Python) or `Array` (Scala) containing the features to be used.

# COMMAND ----------

# TODO: Build a Linear Regression pipeline

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the benchmark model using a RegressionEvaluator on the internal testing set, `testDF`.

# COMMAND ----------

# TODO: Evaluate the performance of the benchmark model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the benchmark model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# TODO: Evaluate the performance of the benchmark model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# MAGIC %md
# MAGIC ##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Iterate on Benchmark Model
# MAGIC - Prepare a model to beat your benchmark model.
# MAGIC - Build a regression pipeline that contains:
# MAGIC   - each of the StringIndexers
# MAGIC   - the OneHotEncoder
# MAGIC   - a VectorAssembler
# MAGIC   - an improved regression estimator
# MAGIC - Evaluate the performance of the new model using a RegressionEvaluator on your internal testing set, `testDF`.
# MAGIC - Use the internal testing set to adjust the hyperparameters of your model.
# MAGIC - Evaluate the performance of the new model using a RegressionEvaluator on the class evaluation set, `holdOutDF`.
# MAGIC - When you have beaten the benchmark, share the results with your instructor.

# COMMAND ----------

# TODO: Build a better regression pipeline

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the better model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# TODO: Evaluate the performance of the better model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# MAGIC %md
# MAGIC
# MAGIC Use the internal test set to adjust the hyperparameters of your model.

# COMMAND ----------

# TODO: Tune the hyperparameters of your model using the internal test set

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the tuned model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# TODO: Evaluate the performance of the tuned model using a RegressionEvaluator on the internal testing set, `testDF`

# COMMAND ----------

# MAGIC %md
# MAGIC Evaluate the performance of the tuned model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# TODO: Evaluate the performance of the tuned model using a RegressionEvaluator on the class evaluation set, `holdOutDF`

# COMMAND ----------

# MAGIC %md-sandbox
# MAGIC © 2018 Databricks, Inc. All rights reserved.
# MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
# MAGIC
# MAGIC Privacy Policy | Terms of Use | Support
--------------------------------------------------------------------------------