├── README.md
└── Capstone-Lab.py.py
/README.md:
--------------------------------------------------------------------------------
1 | # Azure-Databricks-Capstone
2 | This repository hosts the capstone project for the Azure Databricks training. It contains the problem-statement notebook, which attendees work through to build their own solution. The notebook includes all the references attendees need to get started on the capstone project.
3 |
--------------------------------------------------------------------------------
/Capstone-Lab.py.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
9 |
10 | # MAGIC %md-sandbox
11 | # MAGIC # Data Science Capstone
12 | # MAGIC
13 | # MAGIC
14 |
15 | # COMMAND ----------
16 |
17 | # MAGIC %md
18 | # MAGIC
19 | # MAGIC Inside Airbnb publishes public data sets for Airbnb listings in cities throughout the world:
20 | # MAGIC http://insideairbnb.com/get-the-data.html
21 | # MAGIC
22 | # MAGIC This challenge uses the data set for London.
23 | # MAGIC
24 | # MAGIC Split into teams of 2 or 3 to work on completing this challenge.
25 | # MAGIC
26 | # MAGIC **Metric:** We will use root-mean-square error (RMSE) to assess the effectiveness of our models.
27 |
28 | # COMMAND ----------
29 |
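# MAGIC %md
# MAGIC For reference, RMSE over `n` listings is `sqrt((1/n) * sum((actual_i - predicted_i)^2))`. It is expressed in the same units as the price, and lower is better. The plain-Python sketch below is only an illustration: the `rmse` helper and the sample prices are hypothetical, and in the notebook itself Spark's `RegressionEvaluator` would normally compute this metric.

# COMMAND ----------

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical nightly prices and model predictions, for illustration only.
example_actual = [100.0, 150.0, 80.0]
example_predicted = [110.0, 130.0, 80.0]
print(rmse(example_actual, example_predicted))
```

# COMMAND ----------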
30 | # MAGIC %md
31 | # MAGIC
32 | # MAGIC ## Airbnb Price Prediction Challenge
33 | # MAGIC
34 | # MAGIC 1. Configure Classroom
35 | # MAGIC 2. Add the data set to Databricks
36 | # MAGIC 3. Read the Data
37 | # MAGIC 4. Prepare the Data
38 | # MAGIC 5. Define Preprocessing Models
39 | # MAGIC 6. Split the Data for Model Development
40 | # MAGIC 7. Prepare a Benchmark Model
41 | # MAGIC 8. Iterate on Benchmark Model
42 | # MAGIC 9. Iterate on Best Model
43 | # MAGIC
44 | # MAGIC A list of available regression models can be found here: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html
45 |
46 | # COMMAND ----------
47 |
48 | # MAGIC %md
49 | # MAGIC ## Configure Classroom
50 | # MAGIC
51 | # MAGIC Run the following cell to configure our "classroom."
52 |
53 | # COMMAND ----------
54 |
55 | # MAGIC %run "../Includes/Classroom Setup"
56 |
57 | # COMMAND ----------
58 |
59 | # MAGIC %md
60 | # MAGIC ## Read the Data
61 | # MAGIC - Load the data as a DataFrame
62 | # MAGIC - Prepare a basic description of the data set including:
63 | # MAGIC   - the schema with data types
64 | # MAGIC   - the number of rows
65 |
66 | # COMMAND ----------
67 |
68 | try:
69 |   sasToken="?sv=2017-11-09&ss=bf&srt=co&sp=rl&se=2099-12-31T23:59:59Z"+\
70 |     "&st=2018-01-01T00:00:00Z&spr=https&sig=di3x0sjVwmqIjO5ReQ%2Bwa54R9shTDZePtKHipkabqAg%3D"
71 |   dbutils.fs.mount(
72 |     source = "wasbs://class-453@airlift453.blob.core.windows.net/",
73 |     mount_point = "/mnt/training-453",
74 |     extra_configs = {"fs.azure.sas.class-453.airlift453.blob.core.windows.net": sasToken})
75 | except Exception as e:
76 |   if "Directory already mounted" in str(e):
77 |     pass # Ignore error if already mounted.
78 |   else:
79 |     raise # Re-raise any other error with its original traceback.
80 | print("Success.")
81 |
82 | # COMMAND ----------
83 |
84 | filePath = "dbfs:/mnt/training-453/airbnb/listings/london-cleaned.csv"
85 |
86 | initDF = (spark.read
87 |   .option("multiline", True)
88 |   .option("header", True)
89 |   .option("inferSchema", True)
90 |   .csv(filePath)
91 | )
92 |
93 | display(initDF)
94 |
95 | # COMMAND ----------
96 |
97 | initDF.printSchema()
98 |
99 | # COMMAND ----------
100 |
101 | initDF.count()
102 |
103 | # COMMAND ----------
104 |
105 | # MAGIC %md
106 | # MAGIC ## Prepare the Data
107 | # MAGIC - Count rows with null values
108 | # MAGIC - Impute missing values for numerical fields
109 | # MAGIC - Remove rows with null values for `zipcode`
110 |
111 | # COMMAND ----------
112 |
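# MAGIC %md
# MAGIC One possible way to approach the three steps above is sketched below. It is not the reference solution. The helper names (`imputable_columns`, `count_nulls`, `prepare_data`), the median strategy, and the `price` label column are assumptions made for illustration. The Spark imports are deferred into the functions so the helpers can be defined before a Spark runtime is available.

# COMMAND ----------

```python
# Column types that Spark 2.x `Imputer` accepts. This is an assumption about
# the cluster's Spark version; newer releases may accept more numeric types.
IMPUTABLE_TYPES = ("double", "float")


def imputable_columns(dtypes, exclude=()):
    """Pick column names from `df.dtypes` pairs whose type can be imputed."""
    return [name for name, dtype in dtypes
            if dtype in IMPUTABLE_TYPES and name not in exclude]


def count_nulls(df):
    """Return a one-row DataFrame holding the null count of every column."""
    from pyspark.sql import functions as F  # deferred: needs a Spark runtime
    return df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                      for c in df.columns])


def prepare_data(df, label_col="price"):
    """Median-impute numeric feature columns, then drop rows without a zip code.

    `label_col` is a hypothetical name for the prediction target; it is
    excluded so that the target itself is never imputed.
    """
    from pyspark.ml.feature import Imputer  # deferred: needs a Spark runtime
    cols = imputable_columns(df.dtypes, exclude=(label_col,))
    if cols:
        imputer = Imputer(strategy="median", inputCols=cols, outputCols=cols)
        df = imputer.fit(df).transform(df)
    return df.na.drop(subset=["zipcode"])


# In the notebook this could be used as:
#   display(count_nulls(initDF))
#   airbnbDF = prepare_data(initDF)
```

# COMMAND ----------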
113 | # TODO: Count the number of rows in the `initDF` DataFrame that contain null values
114 |
115 | # COMMAND ----------
116 |
117 | # TODO: Show the description of the `initDF` DataFrame to show the number of non-null values in each column
118 |
119 | # COMMAND ----------
120 |
121 | # TODO: Impute missing values for numerical columns
122 |
123 | # COMMAND ----------
124 |
125 | # TODO: Remove rows with null values for zip code and store the cleaned result as `airbnbDF`
126 |
127 | # COMMAND ----------
128 |
129 | # MAGIC %md
130 | # MAGIC ## Prepare for the Competition
131 | # MAGIC
132 | # MAGIC - Using the same random seed, select 20% of the data to be used as a hold-out set for comparison between teams.
133 | # MAGIC
134 | # MAGIC `modelingDF` will be used to prepare your model. You are free to use this data in any way you see fit to prepare the best possible model. **You must not expose your model to `holdOutDF`**.
135 | # MAGIC
136 | # MAGIC `holdOutDF` will be used for comparison. You will submit scores for your model's performance on `holdOutDF` to the instructor for comparison.
137 |
138 | # COMMAND ----------
139 |
140 | seed = 273
141 | (holdOutDF, modelingDF) = airbnbDF.randomSplit([0.2, 0.8], seed=seed)
142 |
143 | print(holdOutDF.count(), modelingDF.count())
144 |
145 | # COMMAND ----------
146 |
147 | # MAGIC %md
148 | # MAGIC ## Define Preprocessing Models
149 | # MAGIC
150 | # MAGIC Prepare the following models to be used in a modeling pipeline:
151 | # MAGIC - Prepare a StringIndexer for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`
152 | # MAGIC - Prepare a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`
153 |
154 | # COMMAND ----------
155 |
156 | from pyspark.ml.feature import StringIndexer
157 |
158 | print(StringIndexer().explainParams())
159 |
160 | # COMMAND ----------
161 |
162 | # MAGIC %md
163 | # MAGIC
164 | # MAGIC Now *StringIndex* all categorical features (`neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`) and set `handleInvalid` to `skip`. Set the output columns to `cat_neighbourhood_cleansed`, `cat_room_type`, `cat_zipcode`, `cat_property_type` and `cat_bed_type`, respectively.
165 |
166 | # COMMAND ----------
167 |
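# MAGIC %md
# MAGIC The sketch below shows one way the indexing and encoding stages described above could be declared. It is an illustration, not the reference solution. The `cat_` names come from the instructions above and the `vec_` names from the recommended feature list later in this notebook. `CATEGORICAL_COLUMNS` and `build_preprocessing_stages` are hypothetical names, and one `OneHotEncoder` per column is used on the assumption that the single-column `inputCol`/`outputCol` form is available on the cluster's Spark version.

# COMMAND ----------

```python
# Source column -> (StringIndexer output column, OneHotEncoder output column).
CATEGORICAL_COLUMNS = {
    "neighbourhood_cleansed": ("cat_neighbourhood_cleansed", "vec_neighborhood"),
    "room_type": ("cat_room_type", "vec_room_type"),
    "zipcode": ("cat_zipcode", "vec_zipcode"),
    "property_type": ("cat_property_type", "vec_property_type"),
    "bed_type": ("cat_bed_type", "vec_bed_type"),
}


def build_preprocessing_stages(columns=CATEGORICAL_COLUMNS):
    """Return StringIndexer stages followed by OneHotEncoder stages."""
    # Deferred imports: these need a Spark runtime.
    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    indexers = [
        # handleInvalid="skip" drops rows whose category was unseen during fit.
        StringIndexer(inputCol=src, outputCol=cat, handleInvalid="skip")
        for src, (cat, _vec) in columns.items()
    ]
    encoders = [
        OneHotEncoder(inputCol=cat, outputCol=vec)
        for cat, vec in columns.values()
    ]
    return indexers + encoders
```

# COMMAND ----------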
168 | # TODO: Prepare a StringIndexer for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`
169 |
170 | # COMMAND ----------
171 |
172 | # TODO: Prepare a OneHotEncoder for `neighbourhood_cleansed`, `room_type`, `zipcode`, `property_type`, `bed_type`
173 |
174 | # COMMAND ----------
175 |
176 | # MAGIC %md
177 | # MAGIC ## Split the Data for Model Development
178 | # MAGIC
179 | # MAGIC Let's keep 80% for the training set and set aside 20% of our data for the test set.
180 | # MAGIC
181 | # MAGIC **NOTE:** The data is now split into three sets:
182 | # MAGIC - `trainDF` - used for training a model
183 | # MAGIC - `testDF` - used for internal validation of hyperparameters
184 | # MAGIC - `holdOutDF` - used for final assessment of the model and comparison to models prepared by other teams
185 |
186 | # COMMAND ----------
187 |
188 | # TODO: Perform a train-test split on `modelingDF`
189 |
190 | # COMMAND ----------
191 |
192 | # MAGIC %md
193 | # MAGIC ## Prepare a Benchmark Model
194 | # MAGIC
195 | # MAGIC - Define a `list` (Python) or `Array` (Scala) containing the features to be used. It is recommended to use the following features:
196 | # MAGIC
197 | # MAGIC `"host_total_listings_count"`, `"accommodates"`, `"bathrooms"`, `"bedrooms"`, `"beds"`, `"minimum_nights"`, `"number_of_reviews"`, `"review_scores_rating"`, `"review_scores_accuracy"`, `"review_scores_cleanliness"`, `"review_scores_checkin"`, `"review_scores_communication"`, `"review_scores_location"`, `"review_scores_value"`, `"vec_neighborhood"`, `"vec_room_type"`, `"vec_zipcode"`, `"vec_property_type"`, `"vec_bed_type"`
198 | # MAGIC - Build a Linear Regression pipeline that contains:
199 | # MAGIC   - each of the StringIndexers
200 | # MAGIC   - the OneHotEncoder
201 | # MAGIC   - a VectorAssembler
202 | # MAGIC   - a LinearRegression Estimator
203 | # MAGIC - Evaluate the performance of the Benchmark Model using a RegressionEvaluator
204 |
205 | # COMMAND ----------
206 |
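# MAGIC %md
# MAGIC A minimal sketch of how the benchmark steps above might fit together is shown below. It is not the reference solution. The feature names are copied from the list above. `build_benchmark_pipeline`, `evaluate_rmse`, the `features` and `prediction` column names, and the `price` label column are assumptions made for illustration. The preprocessing stages are passed in as a parameter, so any list of StringIndexer and OneHotEncoder stages can be supplied.

# COMMAND ----------

```python
FEATURE_COLUMNS = [
    "host_total_listings_count", "accommodates", "bathrooms", "bedrooms",
    "beds", "minimum_nights", "number_of_reviews", "review_scores_rating",
    "review_scores_accuracy", "review_scores_cleanliness",
    "review_scores_checkin", "review_scores_communication",
    "review_scores_location", "review_scores_value", "vec_neighborhood",
    "vec_room_type", "vec_zipcode", "vec_property_type", "vec_bed_type",
]


def build_benchmark_pipeline(preprocessing_stages,
                             feature_cols=FEATURE_COLUMNS,
                             label_col="price"):
    """Chain preprocessing, a VectorAssembler and a LinearRegression."""
    # Deferred imports: these need a Spark runtime.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    assembler = VectorAssembler(inputCols=list(feature_cols),
                                outputCol="features")
    regression = LinearRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=list(preprocessing_stages) + [assembler, regression])


def evaluate_rmse(model, df, label_col="price"):
    """Score a fitted pipeline model on `df` and return its RMSE."""
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(labelCol=label_col,
                                    predictionCol="prediction",
                                    metricName="rmse")
    return evaluator.evaluate(model.transform(df))


# In the notebook this could be used as (stage list is hypothetical):
#   trainDF, testDF = modelingDF.randomSplit([0.8, 0.2], seed=seed)
#   model = build_benchmark_pipeline(preprocessing_stages).fit(trainDF)
#   print(evaluate_rmse(model, testDF))
```

# COMMAND ----------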
207 | # TODO: Define a `list` (Python) or `Array` (Scala) containing the features to be used.
208 |
209 | # COMMAND ----------
210 |
211 | # TODO: Build a Linear Regression pipeline
212 |
213 | # COMMAND ----------
214 |
215 | # MAGIC %md
216 | # MAGIC Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the internal testing set, `testDF`.
217 |
218 | # COMMAND ----------
219 |
220 | # TODO: Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the internal testing set, `testDF`
221 |
222 | # COMMAND ----------
223 |
224 | # MAGIC %md
225 | # MAGIC Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the class evaluation set, `holdOutDF`.
226 |
227 | # COMMAND ----------
228 |
229 | # TODO: Evaluate the performance of the Benchmark Model using a RegressionEvaluator on the class evaluation set, `holdOutDF`
230 |
231 | # COMMAND ----------
232 |
233 | # MAGIC %md
234 | # MAGIC ## Iterate on Benchmark Model
235 | # MAGIC - Prepare a model to beat your benchmark model.
236 | # MAGIC - Build a regression pipeline that contains:
237 | # MAGIC   - each of the StringIndexers
238 | # MAGIC   - the OneHotEncoder
239 | # MAGIC   - a VectorAssembler
240 | # MAGIC   - an improved Regression Estimator
241 | # MAGIC - Evaluate the performance of the new Model using a RegressionEvaluator on your internal testing set, `testDF`.
242 | # MAGIC - Use the internal testing set to adjust the hyperparameters of your model.
243 | # MAGIC - Evaluate the performance of the new Model using a RegressionEvaluator on the class evaluation set, `holdOutDF`
244 | # MAGIC - When you have beaten the benchmark, share the results with your instructor.
245 |
246 | # COMMAND ----------
247 |
248 | # TODO: Build a better Regression pipeline
249 |
250 | # COMMAND ----------
251 |
252 | # MAGIC %md
253 | # MAGIC Evaluate the performance of the better Model using a RegressionEvaluator on the internal testing set, `testDF`.
254 |
255 | # COMMAND ----------
256 |
257 | # TODO: Evaluate the performance of the better Model using a RegressionEvaluator on the internal testing set, `testDF`
258 |
259 | # COMMAND ----------
260 |
261 | # MAGIC %md
262 | # MAGIC
263 | # MAGIC Use the internal test set to adjust the hyperparameters of your model.
264 |
265 | # COMMAND ----------
266 |
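# MAGIC %md
# MAGIC Adjusting hyperparameters against the internal test set amounts to trying several settings and keeping the one with the lowest RMSE. The sketch below shows that loop in plain Python. `grid_search` and `fit_and_score` are hypothetical names. In this notebook, `fit_and_score` would be a function you write that fits your pipeline on `trainDF` with the given settings and returns its RMSE on `testDF`. `holdOutDF` must stay out of this loop.

# COMMAND ----------

```python
import itertools


def grid_search(fit_and_score, param_grid):
    """Try every combination in `param_grid`; return (best_params, best_rmse).

    `fit_and_score` takes a dict of hyperparameters and returns an RMSE,
    where lower is better. `param_grid` maps each hyperparameter name to the
    list of values to try.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in scoring function so the loop can be run without Spark. In the
# notebook it would fit on `trainDF` and evaluate on `testDF` instead.
def toy_score(params):
    return abs(params["maxDepth"] - 5) + abs(params["numTrees"] - 20)


print(grid_search(toy_score, {"maxDepth": [2, 5, 10], "numTrees": [10, 20]}))
```

# COMMAND ----------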
267 | # TODO: Tune the hyperparameters of the better Regression pipeline using the internal testing set, `testDF`
268 |
269 | # COMMAND ----------
270 |
271 | # MAGIC %md
272 | # MAGIC Evaluate the performance of the tuned Model using a RegressionEvaluator on the internal testing set, `testDF`.
273 |
274 | # COMMAND ----------
275 |
276 | # TODO: Evaluate the performance of the tuned Model using a RegressionEvaluator on the internal testing set, `testDF`
277 |
278 | # COMMAND ----------
279 |
280 | # MAGIC %md
281 | # MAGIC Evaluate the performance of the tuned Model using a RegressionEvaluator on the class evaluation set, `holdOutDF`
282 |
283 | # COMMAND ----------
284 |
285 | # TODO: Evaluate the performance of the tuned Model using a RegressionEvaluator on the class evaluation set, `holdOutDF`
286 |
287 | # COMMAND ----------
288 |
289 | # MAGIC %md-sandbox
290 | # MAGIC © 2018 Databricks, Inc. All rights reserved.
291 | # MAGIC Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
294 |
--------------------------------------------------------------------------------