├── LICENSE
├── README.md
├── blog-part-1.ipynb
├── blog-part-2.ipynb
├── blog-part-3.ipynb
├── blog-part-4.ipynb
├── crime-event-demo-R.ipynb
└── crime-event-demo-meetup.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "{}"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright {yyyy} {name of copyright owner}
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
203 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | IPython (Jupyter) notebooks
2 | ===========================
3 |
4 | Some IPython notebooks I've created...
5 |
--------------------------------------------------------------------------------
/blog-part-2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:529afb8a95a960818c26b662d1724f5a60cb8236370d942a1724bca1fd5869fb"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "heading",
13 | "level": 1,
14 | "metadata": {},
15 | "source": [
16 | "Data Science with Hadoop - Predicting airline delays - part 2: Spark and ML-Lib"
17 | ]
18 | },
19 | {
20 | "cell_type": "heading",
21 | "level": 2,
22 | "metadata": {},
23 | "source": [
24 | "Introduction"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 |       "In this second part of the supplement to the second blog post on data science, we continue to demonstrate how to build a predictive model with Hadoop; this time we'll use [Apache Spark](https://spark.apache.org/) and [ML-Lib](http://spark.apache.org/docs/1.1.0/mllib-guide.html). \n",
32 | "\n",
33 | "In the context of our demo, we will show how to use Apache Spark via its Scala API to generate our feature matrix and also use ML-Lib (Spark's machine learning library) to build and evaluate our classification models.\n",
34 | "\n",
35 | "Recall from part 1 that we are constructing a predictive model for flight delays. Our source dataset resides [here](http://stat-computing.org/dataexpo/2009/the-data.html), and includes details about flights in the US from the years 1987-2008. We have also enriched the data with [weather information](http://www.ncdc.noaa.gov/cdo-web/datasets/), where we find daily temperatures (min/max), wind speed, snow conditions and precipitation. \n",
36 | "\n",
37 | "We will build a supervised learning model to predict flight delays for flights leaving O'Hare International airport (ORD). We will use the year 2007 data to build the model, and test its validity using data from 2008."
38 | ]
39 | },
40 | {
41 | "cell_type": "heading",
42 | "level": 1,
43 | "metadata": {},
44 | "source": [
45 | "Pre-processing with Hadoop and Spark"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "[Apache Spark](https://spark.apache.org/)'s basic data abstraction is that of an RDD (resilient distributed dataset), which is a fault-tolerant collection of elements that can be operated on in parallel across your Hadoop cluster. \n",
53 | "\n",
54 | "Spark's API (available in Scala, Python or Java) supports a variety of transformations such as map() and flatMap(), filter(), join(), and others to create and manipulate RDDs. For a full description of the API please check the [Spark API programming guide]( http://spark.apache.org/docs/1.1.0/programming-guide.html). "
55 | ]
56 | },
57 | {
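    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a quick illustration of this style (a minimal sketch, not part of the pipeline below; it assumes only the SparkContext `sc` and a hypothetical input file), a few transformations chained together to count delayed flights per origin might look like:\n",
      "\n",
      "```scala\n",
      "// Illustrative sketch: the input path and column layout are placeholders\n",
      "val delaysPerOrigin = sc.textFile(\"hdfs:///tmp/flights.csv\")\n",
      "  .map(line => line.split(\",\"))\n",
      "  .map(cols => (cols(0), cols(1).toInt))      // (origin, delay in minutes)\n",
      "  .filter { case (_, delay) => delay >= 15 }  // keep delayed flights only\n",
      "  .map { case (origin, _) => (origin, 1) }\n",
      "  .reduceByKey(_ + _)                         // number of delays per origin\n",
      "```"
     ]
    },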
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "Recall from [part 1](http://hortonworks.com/blog/data-science-apacheh-hadoop-predicting-airline-delays/) that in our first iteration we generated the following features for each flight:\n",
62 | "* **month**: winter months should have more delays than summer months\n",
63 | "* **day of month**: this is likely not a very predictive variable, but let's keep it in anyway\n",
64 | "* **day of week**: weekend vs. weekday\n",
65 | "* **hour of the day**: later hours tend to have more delays\n",
66 | "* **Carrier**: we might expect some carriers to be more prone to delays than others\n",
67 | "* **Destination airport**: we expect some airports to be more prone to delays than others; and\n",
68 | "* **Distance**: interesting to see if this variable is a good predictor of delay\n",
69 | "\n",
70 | "We will use Spark RDDs to perform the same pre-processing, transforming the raw flight delay dataset into the two feature matrices: data_2007 (our training set) and data_2008 (our test set).\n",
71 | "\n",
72 |       "The case class *DelayRec* encapsulates a flight delay record and represents the feature vector; its methods do most of the heavy lifting: \n",
73 | "1. to_date() is a helper method to convert year/month/day to a string\n",
74 |       "1. gen_features generates a key/value tuple from the record, where the key is the date string (output of *to_date*) and the value is the feature vector. We don't use the key in this iteration, but we will use it in the second iteration to join with the weather data.\n",
75 | "1. the get_hour() method extracts the 2-digit hour portion of the departure time\n",
76 |       "1. The days_from_nearest_holiday() method computes the minimum distance (in days) of the provided year/month/day from any holiday in the list *holidays*.\n",
77 | "\n",
78 | "With *DelayRec* in place, our processing takes on the following steps (in the function *prepFlightDelays*):\n",
79 | "1. We read the raw input file with Spark's *SparkContext.textFile* method, resulting in an RDD\n",
80 | "1. Each row is parsed with *CSVReader* into fields, and populated into a *DelayRec* object\n",
81 |       "1. We then perform a sequence of RDD transformations on the input RDD to drop the header row and keep only flights that were not cancelled and originated from ORD.\n",
82 | "\n",
83 | "Finally, we use the *gen_features* method to generate the final feature vector per row, as a set of doubles.\n"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "collapsed": false,
89 | "input": [
90 | "import org.apache.spark.rdd._\n",
91 | "import scala.collection.JavaConverters._\n",
92 | "import au.com.bytecode.opencsv.CSVReader\n",
93 | "\n",
94 | "import java.io._\n",
95 | "import org.joda.time._\n",
96 | "import org.joda.time.format._\n",
97 | "\n",
98 | "case class DelayRec(year: String,\n",
99 | " month: String,\n",
100 | " dayOfMonth: String,\n",
101 | " dayOfWeek: String,\n",
102 | " crsDepTime: String,\n",
103 | " depDelay: String,\n",
104 | " origin: String,\n",
105 | " distance: String,\n",
106 | " cancelled: String) {\n",
107 | "\n",
108 | " val holidays = List(\"01/01/2007\", \"01/15/2007\", \"02/19/2007\", \"05/28/2007\", \"06/07/2007\", \"07/04/2007\",\n",
109 | " \"09/03/2007\", \"10/08/2007\" ,\"11/11/2007\", \"11/22/2007\", \"12/25/2007\",\n",
110 | " \"01/01/2008\", \"01/21/2008\", \"02/18/2008\", \"05/22/2008\", \"05/26/2008\", \"07/04/2008\",\n",
111 | " \"09/01/2008\", \"10/13/2008\" ,\"11/11/2008\", \"11/27/2008\", \"12/25/2008\")\n",
112 | "\n",
113 | " def gen_features: (String, Array[Double]) = {\n",
114 | " val values = Array(\n",
115 | " depDelay.toDouble,\n",
116 | " month.toDouble,\n",
117 | " dayOfMonth.toDouble,\n",
118 | " dayOfWeek.toDouble,\n",
119 | " get_hour(crsDepTime).toDouble,\n",
120 | " distance.toDouble,\n",
121 | " days_from_nearest_holiday(year.toInt, month.toInt, dayOfMonth.toInt)\n",
122 | " )\n",
123 | " new Tuple2(to_date(year.toInt, month.toInt, dayOfMonth.toInt), values)\n",
124 | " }\n",
125 | "\n",
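      "  // zero-pad the departure time to 4 digits and keep the hour part, e.g. \"930\" -> \"0930\" -> \"09\"\n",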
126 | " def get_hour(depTime: String) : String = \"%04d\".format(depTime.toInt).take(2)\n",
127 | " def to_date(year: Int, month: Int, day: Int) = \"%04d%02d%02d\".format(year, month, day)\n",
128 | "\n",
129 | " def days_from_nearest_holiday(year:Int, month:Int, day:Int): Int = {\n",
130 | " val sampleDate = new DateTime(year, month, day, 0, 0)\n",
131 | "\n",
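      "    // fold over the holiday list, keeping the minimum absolute distance in days (3000 is just a large sentinel)\n",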
132 | " holidays.foldLeft(3000) { (r, c) =>\n",
133 | " val holiday = DateTimeFormat.forPattern(\"MM/dd/yyyy\").parseDateTime(c)\n",
134 | " val distance = Math.abs(Days.daysBetween(holiday, sampleDate).getDays)\n",
135 | " math.min(r, distance)\n",
136 | " }\n",
137 | " }\n",
138 | " }\n",
139 | "\n",
140 | "// function to do a preprocessing step for a given file\n",
141 | "def prepFlightDelays(infile: String): RDD[DelayRec] = {\n",
142 | " val data = sc.textFile(infile)\n",
143 | "\n",
144 | " data.map { line =>\n",
145 | " val reader = new CSVReader(new StringReader(line))\n",
146 | " reader.readAll().asScala.toList.map(rec => DelayRec(rec(0),rec(1),rec(2),rec(3),rec(5),rec(15),rec(16),rec(18),rec(21)))\n",
147 | " }.map(list => list(0))\n",
148 | " .filter(rec => rec.year != \"Year\")\n",
149 | " .filter(rec => rec.cancelled == \"0\")\n",
150 | " .filter(rec => rec.origin == \"ORD\")\n",
151 | "}\n",
152 | "\n",
153 | "val data_2007 = prepFlightDelays(\"airline/delay/2007.csv\").map(rec => rec.gen_features._2)\n",
154 | "val data_2008 = prepFlightDelays(\"airline/delay/2008.csv\").map(rec => rec.gen_features._2)\n",
155 | "data_2007.take(5).map(x => x mkString \",\").foreach(println)"
156 | ],
157 | "language": "python",
158 | "metadata": {},
159 | "outputs": [
160 | {
161 | "output_type": "stream",
162 | "stream": "stderr",
163 | "text": []
164 | },
165 | {
166 | "output_type": "stream",
167 | "stream": "stdout",
168 | "text": [
169 | "-8.0,1.0,25.0,4.0,11.0,719.0,10.0\n",
170 | "41.0,1.0,28.0,7.0,15.0,925.0,13.0\n",
171 | "45.0,1.0,29.0,1.0,20.0,316.0,14.0\n",
172 | "-9.0,1.0,17.0,3.0,19.0,719.0,2.0\n",
173 | "180.0,1.0,12.0,5.0,17.0,316.0,3.0\n"
174 | ]
175 | }
176 | ],
177 | "prompt_number": 1
178 | },
179 | {
180 | "cell_type": "heading",
181 | "level": 2,
182 | "metadata": {},
183 | "source": [
184 | "Modeling with Spark and ML-Lib"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "With the data_2007 dataset (which we'll use for training) and the data_2008 dataset (which we'll use for validation) as RDDs, we now build a predictive model using Spark's [ML-Lib](http://spark.apache.org/docs/1.1.0/mllib-guide.html) machine learning library.\n",
192 | "\n",
193 |       "ML-Lib is Spark\u2019s scalable machine learning library, which provides a variety of learning algorithms and utilities: classification, regression, clustering, collaborative filtering, dimensionality reduction, and others. \n",
194 | "\n",
195 | "If you compare ML-Lib to Scikit-learn, at the moment ML-Lib lacks a few important algorithms like Random Forest or Gradient Boosted Trees. Having said that, we see a strong pace of innovation from the ML-Lib community and expect more algorithms and other features to be added soon (for example, Random Forest is being actively [worked on](https://github.com/apache/spark/pull/2435), and will likely be available in the next release).\n",
196 | "\n",
197 |       "To use ML-Lib's machine learning algorithms, we first parse our feature matrices into RDDs of *LabeledPoint* objects (for both the training and test datasets). *LabeledPoint* is ML-Lib's abstraction for a feature vector accompanied by a label. We consider flight delays of 15 minutes or more as \"delays\" and mark them with a label of 1.0, and delays of under 15 minutes as \"non-delays\", marked with a label of 0.0. \n",
198 | "\n",
199 | "We also use ML-Lib's *StandardScaler* class to normalize our feature values for both training and validation sets. This is important because of ML-Lib's use of Stochastic Gradient Descent, which is known to perform best if feature vectors are normalized."
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "collapsed": false,
205 | "input": [
206 | "import org.apache.spark.mllib.regression.LabeledPoint\n",
207 | "import org.apache.spark.mllib.linalg.Vectors\n",
208 | "import org.apache.spark.mllib.feature.StandardScaler\n",
209 | "\n",
210 | "def parseData(vals: Array[Double]): LabeledPoint = {\n",
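      "  // label 1.0 = departure delay of 15+ minutes; features = the remaining columns\n",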
211 | " LabeledPoint(if (vals(0)>=15) 1.0 else 0.0, Vectors.dense(vals.drop(1)))\n",
212 | "}\n",
213 | "\n",
214 | "// Prepare training set\n",
215 | "val parsedTrainData = data_2007.map(parseData)\n",
216 | "parsedTrainData.cache\n",
217 | "val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedTrainData.map(x => x.features))\n",
218 | "val scaledTrainData = parsedTrainData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\n",
219 | "scaledTrainData.cache\n",
220 | "\n",
221 | "// Prepare test/validation set\n",
222 | "val parsedTestData = data_2008.map(parseData)\n",
223 | "parsedTestData.cache\n",
224 | "val scaledTestData = parsedTestData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\n",
225 | "scaledTestData.cache\n",
226 | "\n",
227 | "scaledTrainData.take(3).map(x => (x.label, x.features)).foreach(println)"
228 | ],
229 | "language": "python",
230 | "metadata": {},
231 | "outputs": [
232 | {
233 | "output_type": "stream",
234 | "stream": "stdout",
235 | "text": [
236 | "(0.0,[-1.6160463330366548,1.054927299466599,0.03217026353736381,-0.5189244175441321,0.034083933424313526,-0.2801683099466359])\n",
237 | "(1.0,[-1.6160463330366548,1.3961052168540333,1.5354307758475527,0.3624320984120952,0.43165511884343954,-0.023273887437334728])\n",
238 | "(1.0,[-1.6160463330366548,1.5098311893165113,-1.4710902487728252,1.4641277433573794,-0.7436888225169864,0.06235758673243232])\n"
239 | ]
240 | },
241 | {
242 | "output_type": "stream",
243 | "stream": "stderr",
244 | "text": []
245 | }
246 | ],
247 | "prompt_number": 2
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 | "Note that we use the RDD *cache* method to ensure that these computed RDDs (parsedTrainData, scaledTrainData, parsedTestData and scaledTestData) are cached in memory by Spark and not re-computed with each iteration of stochastic gradient descent.\n",
254 | "\n",
255 |       "We also define a helper function to compute our evaluation metrics: precision, recall, accuracy and the F1-measure:"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "collapsed": false,
261 | "input": [
262 | "// Function to compute evaluation metrics\n",
263 | "def eval_metrics(labelsAndPreds: RDD[(Double, Double)]) : Tuple2[Array[Double], Array[Double]] = {\n",
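      "  // each tuple is (prediction, label); the callers below build (pred, point.label) pairs\n",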
264 | " val tp = labelsAndPreds.filter(r => r._1==1 && r._2==1).count.toDouble\n",
265 | " val tn = labelsAndPreds.filter(r => r._1==0 && r._2==0).count.toDouble\n",
266 | " val fp = labelsAndPreds.filter(r => r._1==1 && r._2==0).count.toDouble\n",
267 | " val fn = labelsAndPreds.filter(r => r._1==0 && r._2==1).count.toDouble\n",
268 | "\n",
269 | " val precision = tp / (tp+fp)\n",
270 | " val recall = tp / (tp+fn)\n",
271 | " val F_measure = 2*precision*recall / (precision+recall)\n",
272 | " val accuracy = (tp+tn) / (tp+tn+fp+fn)\n",
273 | " new Tuple2(Array(tp, tn, fp, fn), Array(precision, recall, F_measure, accuracy))\n",
274 | "}"
275 | ],
276 | "language": "python",
277 | "metadata": {},
278 | "outputs": [
279 | {
280 | "output_type": "stream",
281 | "stream": "stdout",
282 | "text": []
283 | },
284 | {
285 | "output_type": "stream",
286 | "stream": "stderr",
287 | "text": []
288 | }
289 | ],
290 | "prompt_number": 3
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 |       "ML-Lib supports several algorithms for supervised learning, among them Linear Regression, Logistic Regression, Naive Bayes, Decision Trees and Linear SVMs. We will use Logistic Regression and SVM, both of which are implemented using [Stochastic Gradient Descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD).\n",
297 | "\n",
298 | "Let's see how to build these models with ML-Lib:"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "collapsed": false,
304 | "input": [
305 | "import org.apache.spark.mllib.classification.LogisticRegressionWithSGD\n",
306 | "\n",
307 | "// Build the Logistic Regression model\n",
308 | "val model_lr = LogisticRegressionWithSGD.train(scaledTrainData, numIterations=100)\n",
309 | "\n",
310 | "// Predict\n",
311 | "val labelsAndPreds_lr = scaledTestData.map { point =>\n",
312 | " val pred = model_lr.predict(point.features)\n",
313 | " (pred, point.label)\n",
314 | "}\n",
315 | "val m_lr = eval_metrics(labelsAndPreds_lr)._2\n",
316 | "println(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_lr(0), m_lr(1), m_lr(2), m_lr(3)))"
317 | ],
318 | "language": "python",
319 | "metadata": {},
320 | "outputs": [
321 | {
322 | "output_type": "stream",
323 | "stream": "stdout",
324 | "text": [
325 | "precision = 0.37, recall = 0.64, F1 = 0.47, accuracy = 0.59\n"
326 | ]
327 | },
328 | {
329 | "output_type": "stream",
330 | "stream": "stderr",
331 | "text": []
332 | }
333 | ],
334 | "prompt_number": 4
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 |       "We have built a Logistic Regression model with SGD over 100 iterations, and then used it to predict flight delays on the validation set, measuring performance via precision, recall, F1 and accuracy. \n",
341 | "\n",
342 | "Next, let's try a Support Vector Machine model:"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "collapsed": false,
348 | "input": [
349 | "import org.apache.spark.mllib.classification.SVMWithSGD\n",
350 | "\n",
351 | "// Build the SVM model\n",
352 | "val svmAlg = new SVMWithSGD()\n",
353 | "svmAlg.optimizer.setNumIterations(100)\n",
354 | " .setRegParam(1.0)\n",
355 | " .setStepSize(1.0)\n",
356 | "val model_svm = svmAlg.run(scaledTrainData)\n",
357 | "\n",
358 | "// Predict\n",
359 | "val labelsAndPreds_svm = scaledTestData.map { point =>\n",
360 | " val pred = model_svm.predict(point.features)\n",
361 | " (pred, point.label)\n",
362 | "}\n",
363 | "val m_svm = eval_metrics(labelsAndPreds_svm)._2\n",
364 | "println(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_svm(0), m_svm(1), m_svm(2), m_svm(3)))"
365 | ],
366 | "language": "python",
367 | "metadata": {},
368 | "outputs": [
369 | {
370 | "output_type": "stream",
371 | "stream": "stdout",
372 | "text": [
373 | "precision = 0.37, recall = 0.64, F1 = 0.47, accuracy = 0.59\n"
374 | ]
375 | },
376 | {
377 | "output_type": "stream",
378 | "stream": "stderr",
379 | "text": []
380 | }
381 | ],
382 | "prompt_number": 5
383 | },
384 | {
385 | "cell_type": "markdown",
386 | "metadata": {},
387 | "source": [
388 | "Since ML-Lib also has a strong Decision Tree implementation, let's use it here:"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "collapsed": false,
394 | "input": [
395 | "import org.apache.spark.mllib.tree.DecisionTree\n",
396 | "\n",
397 | "// Build the Decision Tree model\n",
398 | "val numClasses = 2\n",
399 | "val categoricalFeaturesInfo = Map[Int, Int]()\n",
400 | "val impurity = \"gini\"\n",
401 | "val maxDepth = 10\n",
402 | "val maxBins = 100\n",
403 | "val model_dt = DecisionTree.trainClassifier(parsedTrainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)\n",
404 | "\n",
405 | "// Predict\n",
406 | "val labelsAndPreds_dt = parsedTestData.map { point =>\n",
407 | " val pred = model_dt.predict(point.features)\n",
408 | " (pred, point.label)\n",
409 | "}\n",
410 | "val m_dt = eval_metrics(labelsAndPreds_dt)._2\n",
411 | "println(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_dt(0), m_dt(1), m_dt(2), m_dt(3)))"
412 | ],
413 | "language": "python",
414 | "metadata": {},
415 | "outputs": [
416 | {
417 | "output_type": "stream",
418 | "stream": "stdout",
419 | "text": [
420 | "precision = 0.41, recall = 0.24, F1 = 0.31, accuracy = 0.69\n"
421 | ]
422 | },
423 | {
424 | "output_type": "stream",
425 | "stream": "stderr",
426 | "text": []
427 | }
428 | ],
429 | "prompt_number": 6
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 |       "Note that overall accuracy is higher with the Decision Tree, but the precision/recall trade-off is different: precision is higher while recall is much lower. "
436 | ]
437 | },
438 | {
439 | "cell_type": "heading",
440 | "level": 2,
441 | "metadata": {},
442 | "source": [
443 |       "Building a richer model with flight delay and weather data using Apache Spark and ML-Lib"
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "Similar to our approach in the first part of this blog post, we now enrich the dataset by integrating weather data into our feature matrix, thus achieving better predictive performance overall for our model. \n",
451 | "\n",
452 |       "To accomplish this with Apache Spark, we write a *preprocess_spark* function that extracts the same base features from the flight delay dataset (reusing *prepFlightDelays* from above), and joins them with five variables from the weather dataset: minimum and maximum temperature for the day, precipitation, snow and wind speed. Let's see how this is accomplished."
453 | ]
454 | },
455 | {
456 | "cell_type": "code",
457 | "collapsed": false,
458 | "input": [
459 | "import org.apache.spark.SparkContext._\n",
460 | "import scala.collection.JavaConverters._\n",
461 | "import au.com.bytecode.opencsv.CSVReader\n",
462 | "import java.io._\n",
463 | "\n",
464 | "// function to do a preprocessing step for a given file\n",
465 | "\n",
466 | "def preprocess_spark(delay_file: String, weather_file: String): RDD[Array[Double]] = { \n",
467 |       "  // Read flight delay data and build (date, features) pairs\n",
468 | " val delayRecs = prepFlightDelays(delay_file).map{ rec => \n",
469 | " val features = rec.gen_features\n",
470 | " (features._1, features._2)\n",
471 | " }\n",
472 | "\n",
473 | " // Read weather data into RDDs\n",
474 | " val station_inx = 0\n",
475 | " val date_inx = 1\n",
476 | " val metric_inx = 2\n",
477 | " val value_inx = 3\n",
478 | "\n",
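      "  // helper: keep rows for a single weather metric, mapped to (date, value) pairs\n",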
479 | " def filterMap(wdata:RDD[Array[String]], metric:String):RDD[(String,Double)] = {\n",
480 | " wdata.filter(vals => vals(metric_inx) == metric).map(vals => (vals(date_inx), vals(value_inx).toDouble))\n",
481 | " }\n",
482 | "\n",
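      "  // USW00094846 is the NOAA GHCN-Daily weather station at Chicago O'Hare\n",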
483 | " val wdata = sc.textFile(weather_file).map(line => line.split(\",\"))\n",
484 | " .filter(vals => vals(station_inx) == \"USW00094846\")\n",
485 | " val w_tmin = filterMap(wdata,\"TMIN\")\n",
486 | " val w_tmax = filterMap(wdata,\"TMAX\")\n",
487 | " val w_prcp = filterMap(wdata,\"PRCP\")\n",
488 | " val w_snow = filterMap(wdata,\"SNOW\")\n",
489 | " val w_awnd = filterMap(wdata,\"AWND\")\n",
490 | "\n",
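      "  // successive joins on the date key, each appending one weather metric to the feature vector\n",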
491 | " delayRecs.join(w_tmin).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n",
492 | " .join(w_tmax).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n",
493 | " .join(w_prcp).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n",
494 | " .join(w_snow).map(vals => (vals._1, vals._2._1 ++ Array(vals._2._2)))\n",
495 | " .join(w_awnd).map(vals => vals._2._1 ++ Array(vals._2._2))\n",
496 | "}\n",
497 | "\n",
498 | "val data_2007 = preprocess_spark(\"airline/delay/2007.csv\", \"airline/weather/2007.csv\")\n",
499 | "val data_2008 = preprocess_spark(\"airline/delay/2008.csv\", \"airline/weather/2008.csv\")\n",
500 | "\n",
501 | "data_2007.take(5).map(x => x mkString \",\").foreach(println)"
502 | ],
503 | "language": "python",
504 | "metadata": {},
505 | "outputs": [
506 | {
507 | "output_type": "stream",
508 | "stream": "stdout",
509 | "text": [
510 | "63.0,2.0,14.0,3.0,15.0,316.0,5.0,-139.0,-61.0,8.0,36.0,53.0\n",
511 | "0.0,2.0,14.0,3.0,12.0,925.0,5.0,-139.0,-61.0,8.0,36.0,53.0\n",
512 | "105.0,2.0,14.0,3.0,17.0,316.0,5.0,-139.0,-61.0,8.0,36.0,53.0\n",
513 | "36.0,2.0,14.0,3.0,19.0,719.0,5.0,-139.0,-61.0,8.0,36.0,53.0\n",
514 | "35.0,2.0,14.0,3.0,18.0,719.0,5.0,-139.0,-61.0,8.0,36.0,53.0\n"
515 | ]
516 | },
517 | {
518 | "output_type": "stream",
519 | "stream": "stderr",
520 | "text": []
521 | }
522 | ],
523 | "prompt_number": 6
524 | },
525 | {
526 | "cell_type": "markdown",
527 | "metadata": {},
528 | "source": [
529 |       "Note that the minimum and maximum temperature variables from the weather dataset are reported in tenths of degrees Celsius; for example, -139.0 translates to -13.9 degrees Celsius."
530 | ]
531 | },
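    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a quick sanity check (an illustrative conversion, not part of the pipeline):\n",
      "\n",
      "```scala\n",
      "val tminCelsius = -139.0 / 10.0  // tenths of degrees Celsius -> -13.9\n",
      "```"
     ]
    },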
532 | {
533 | "cell_type": "heading",
534 | "level": 2,
535 | "metadata": {},
536 | "source": [
537 | "Modeling with weather data"
538 | ]
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 |       "We are going to repeat the SVM and Decision Tree models with our enriched feature set. As before, we create an RDD of *LabeledPoint* objects, and normalize our dataset with ML-Lib's *StandardScaler*:"
545 | ]
546 | },
547 | {
548 | "cell_type": "code",
549 | "collapsed": false,
550 | "input": [
551 | "import org.apache.spark.mllib.regression.LabeledPoint\n",
552 | "import org.apache.spark.mllib.linalg.Vectors\n",
553 | "import org.apache.spark.mllib.feature.StandardScaler\n",
554 | "\n",
555 | "def parseData(vals: Array[Double]): LabeledPoint = {\n",
556 | " LabeledPoint(if (vals(0)>=15) 1.0 else 0.0, Vectors.dense(vals.drop(1)))\n",
557 | "}\n",
558 | "\n",
559 | "// Prepare training set\n",
560 | "val parsedTrainData = data_2007.map(parseData)\n",
561 | "val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedTrainData.map(x => x.features))\n",
562 | "val scaledTrainData = parsedTrainData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\n",
563 | "parsedTrainData.cache\n",
564 | "scaledTrainData.cache\n",
565 | "\n",
566 | "// Prepare test/validation set\n",
567 | "val parsedTestData = data_2008.map(parseData)\n",
568 | "val scaledTestData = parsedTestData.map(x => LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))\n",
569 | "parsedTestData.cache\n",
570 | "scaledTestData.cache\n",
571 | "\n",
572 | "scaledTrainData.take(3).map(x => (x.label, x.features)).foreach(println)"
573 | ],
574 | "language": "python",
575 | "metadata": {},
576 | "outputs": [
577 | {
578 | "output_type": "stream",
579 | "stream": "stdout",
580 | "text": [
581 | "(1.0,[-1.3223160050229035,-0.19605839762065824,-0.4689165738993619,0.362432098412094,-0.7436888225169838,-0.7083256807954716,-1.8498937310721373,-1.7543558972509141,-0.24059125907894596,2.6196648266835743,0.5456137506493843])\n",
582 | "(0.0,[-1.3223160050229035,-0.19605839762065824,-0.4689165738993619,-0.2985852885550763,0.4316551188434456,-0.7083256807954716,-1.8498937310721373,-1.7543558972509141,-0.24059125907894596,2.6196648266835743,0.5456137506493843])\n",
583 | "(1.0,[-1.3223160050229035,-0.19605839762065824,-0.4689165738993619,0.8031103563902076,-0.7436888225169838,-0.7083256807954716,-1.8498937310721373,-1.7543558972509141,-0.24059125907894596,2.6196648266835743,0.5456137506493843])\n"
584 | ]
585 | },
586 | {
587 | "output_type": "stream",
588 | "stream": "stderr",
589 | "text": []
590 | }
591 | ],
592 | "prompt_number": 8
593 | },
594 | {
595 | "cell_type": "markdown",
596 | "metadata": {},
597 | "source": [
598 | "Next, let's build a Support Vector Machine model using this enriched feature matrix:"
599 | ]
600 | },
601 | {
602 | "cell_type": "code",
603 | "collapsed": false,
604 | "input": [
605 | "import org.apache.spark.mllib.classification.SVMWithSGD\n",
606 | "import org.apache.spark.mllib.optimization.L1Updater\n",
607 | "\n",
608 | "// Build the SVM model\n",
609 | "val svmAlg = new SVMWithSGD()\n",
610 | "svmAlg.optimizer.setNumIterations(100)\n",
611 | " .setRegParam(1.0)\n",
612 | " .setStepSize(1.0)\n",
613 | "val model_svm = svmAlg.run(scaledTrainData)\n",
614 | "\n",
615 | "// Predict\n",
616 | "val labelsAndPreds_svm = scaledTestData.map { point =>\n",
617 | " val pred = model_svm.predict(point.features)\n",
618 | " (pred, point.label)\n",
619 | "}\n",
620 | "val m_svm = eval_metrics(labelsAndPreds_svm)._2\n",
621 | "println(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_svm(0), m_svm(1), m_svm(2), m_svm(3)))"
622 | ],
623 | "language": "python",
624 | "metadata": {},
625 | "outputs": [
626 | {
627 | "output_type": "stream",
628 | "stream": "stdout",
629 | "text": [
630 | "precision = 0.39, recall = 0.68, F1 = 0.49, accuracy = 0.61\n"
631 | ]
632 | },
633 | {
634 | "output_type": "stream",
635 | "stream": "stderr",
636 | "text": []
637 | }
638 | ],
639 | "prompt_number": 9
640 | },
641 | {
642 | "cell_type": "markdown",
643 | "metadata": {},
644 | "source": [
645 | "And finally, let's implement a Decision Tree model:"
646 | ]
647 | },
648 | {
649 | "cell_type": "code",
650 | "collapsed": false,
651 | "input": [
652 | "import org.apache.spark.mllib.tree.DecisionTree\n",
653 | "\n",
654 | "// Build the Decision Tree model\n",
655 | "val numClasses = 2\n",
656 | "val categoricalFeaturesInfo = Map[Int, Int]()\n",
657 | "val impurity = \"gini\"\n",
658 | "val maxDepth = 10\n",
659 | "val maxBins = 100\n",
660 | "val model_dt = DecisionTree.trainClassifier(scaledTrainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)\n",
661 | "\n",
662 | "// Predict\n",
663 | "val labelsAndPreds_dt = scaledTestData.map { point =>\n",
664 | " val pred = model_dt.predict(point.features)\n",
665 | " (pred, point.label)\n",
666 | "}\n",
667 | "val m_dt = eval_metrics(labelsAndPreds_dt)._2\n",
668 | "println(\"precision = %.2f, recall = %.2f, F1 = %.2f, accuracy = %.2f\".format(m_dt(0), m_dt(1), m_dt(2), m_dt(3)))"
669 | ],
670 | "language": "python",
671 | "metadata": {},
672 | "outputs": [
673 | {
674 | "output_type": "stream",
675 | "stream": "stdout",
676 | "text": [
677 | "precision = 0.52, recall = 0.35, F1 = 0.42, accuracy = 0.72\n"
678 | ]
679 | },
680 | {
681 | "output_type": "stream",
682 | "stream": "stderr",
683 | "text": []
684 | }
685 | ],
686 | "prompt_number": 10
687 | },
688 | {
689 | "cell_type": "markdown",
690 | "metadata": {},
691 | "source": [
692 |       "As expected, the enriched feature set improved the accuracy of both the SVM and Decision Tree models."
693 | ]
694 | },
695 | {
696 | "cell_type": "heading",
697 | "level": 2,
698 | "metadata": {},
699 | "source": [
700 | "Summary"
701 | ]
702 | },
703 | {
704 | "cell_type": "markdown",
705 | "metadata": {},
706 | "source": [
707 | "In this IPython notebook we have demonstrated how to build a predictive model in Scala with Apache Hadoop, Apache Spark and its machine learning library: ML-Lib. \n",
708 | "\n",
709 |       "We used Apache Spark on our HDP cluster to perform various data pre-processing and feature engineering tasks. We then applied ML-Lib machine learning algorithms such as Support Vector Machines and Decision Trees to the resulting datasets, and showed how adding new and improved features over successive iterations results in better model performance.\n",
710 | "\n",
711 | "In the next part of this demo series we will show how to perform the same learning task with R."
712 | ]
713 | }
714 | ],
715 | "metadata": {}
716 | }
717 | ]
718 | }
719 |
--------------------------------------------------------------------------------
/blog-part-3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:ce8f54ded6e9f229ca2ea9615956dd4930c213fdf6de4c15e9372a2d01ad5bfd"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "heading",
13 | "level": 1,
14 | "metadata": {},
15 | "source": [
16 | "Data Science with Hadoop - Predicting airline delays - part 3: Scalding and R"
17 | ]
18 | },
19 | {
20 | "cell_type": "heading",
21 | "level": 1,
22 | "metadata": {},
23 | "source": [
24 | "Introduction"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 |       "In this third part of the blog post series on data science, we continue to demonstrate how to build a predictive model with Hadoop, highlighting various tools that can be used effectively in this process. This time we'll use Scalding for pre-processing and R for modeling.\n",
32 | "\n",
33 |       "[R](http://www.r-project.org/) is a language and environment for statistical computing and graphics. It is a GNU project, similar to the S language and environment developed at Bell Laboratories by John Chambers and colleagues. R is an open source project with more than 6000 packages available covering various topics in data science, including classification, regression, clustering, anomaly detection, market basket analysis, text processing and many others. It is an extremely powerful and mature environment for statistical analysis and data science.\n",
34 | "\n",
35 | "[Scalding](https://github.com/twitter/scalding) is a Scala library that makes it easy to specify Hadoop MapReduce jobs using higher level abstractions of a data pipeline. Scalding is built on top of [Cascading](http://www.cascading.org/), a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.\n",
36 | "\n",
37 | "Recall from the first blog post that we are constructing a predictive model for flight delays. Our source dataset resides here: http://stat-computing.org/dataexpo/2009/the-data.html, and includes details about flights in the US from the years 1987-2008. We will also enrich the data with weather information from: http://www.ncdc.noaa.gov/cdo-web/datasets/, where we find daily temperatures (min/max), wind speed, snow conditions and precipitation. \n",
38 | "\n",
39 | "We will build a supervised learning model to predict flight delays for flights leaving O'Hare International airport (ORD), using the year 2007 data to build the model, and 2008 data to test its validity."
40 | ]
41 | },
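    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To give a flavor of Scalding's pipeline style before we dive in, here is the canonical word-count job (a minimal sketch adapted from Scalding's documentation, not part of this tutorial's code):\n",
      "\n",
      "```scala\n",
      "import com.twitter.scalding._\n",
      "\n",
      "// A Scalding job is a Scala class; the pipeline reads as a chain of transformations\n",
      "class WordCountJob(args: Args) extends Job(args) {\n",
      "  TextLine(args(\"input\"))                       // one tuple per input line\n",
      "    .flatMap('line -> 'word) { line: String => line.split(\"\\\\s+\") }\n",
      "    .groupBy('word) { _.size }                  // count occurrences of each word\n",
      "    .write(Tsv(args(\"output\")))\n",
      "}\n",
      "```"
     ]
    },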
42 | {
43 | "cell_type": "heading",
44 | "level": 1,
45 | "metadata": {},
46 | "source": [
47 | "Data Exploration"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 |       "R is a fantastic environment for data exploration and is often used for this purpose. With a wealth of statistical and data manipulation functionality in core R, as well as powerful graphics packages such as ggplot2, performing data exploration in R is easy and fun.\n",
55 | "\n",
56 | "Let's first enable R cells in IPython:"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "collapsed": false,
62 | "input": [
63 | "%%capture\n",
64 | "%load_ext rpy2.ipython"
65 | ],
66 | "language": "python",
67 | "metadata": {},
68 | "outputs": [],
69 | "prompt_number": 1
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 |       "Before we start our data exploration, we load the packages we will need later. We will use the [RHDFS](https://github.com/RevolutionAnalytics/RHadoop/wiki) package from RHadoop to read files from HDFS; however, we need the ability to read a multi-part file from HDFS into R as a single data frame, so we define the *read_csv_from_hdfs* helper function in R:"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "collapsed": false,
81 | "input": [
82 | "%%capture\n",
83 | "%%R\n",
84 | "\n",
85 | "# load required R packages\n",
86 | "require(rhdfs)\n",
87 | "require(randomForest)\n",
88 | "require(gbm)\n",
89 | "require(plyr)\n",
90 | "require(data.table)\n",
91 | "\n",
92 | "# Initialize RHDFS package\n",
93 | "hdfs.init(hadoop='/usr/bin/hadoop')\n",
94 | "\n",
95 | "# Utility function to read a multi-part file from HDFS into an R data frame\n",
96 | "read_csv_from_hdfs <- function(filename, cols=NULL) {\n",
97 | " dir.list = hdfs.ls(filename)\n",
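      "  # keep only non-empty part files (zero-byte entries such as _SUCCESS markers are skipped)\n",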
98 | " list.condition <- sapply(dir.list$size, function(x) x > 0)\n",
99 | " file.list <- dir.list[list.condition,]\n",
100 | " tables <- lapply(file.list$file, function(f) {\n",
101 | " content <- paste(hdfs.read.text.file(f, n = 100000L, buffer=100000000L), collapse='\\n')\n",
102 | " if (length(cols)==0) {\n",
103 | " dt = fread(content, sep=\",\", colClasses=\"character\", stringsAsFactors=F, header=T) \n",
104 | " } else {\n",
105 | " dt = fread(content, sep=\",\", colClasses=\"character\", stringsAsFactors=F, header=F) \n",
106 | " setnames(dt, names(dt), cols) \n",
107 | " }\n",
108 | " dt\n",
109 | " })\n",
110 | " rbind.fill(tables)\n",
111 | "}"
112 | ],
113 | "language": "python",
114 | "metadata": {},
115 | "outputs": [],
116 | "prompt_number": 2
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "We now explore the 2007 delay dataset to determine which variables are reasonable to use for this prediction task. First let's load the data into an R dataframe:"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "collapsed": false,
128 | "input": [
129 | "%%R\n",
130 | "cols = c('year', 'month', 'day', 'dow', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime','Carrier', 'FlightNum', 'TailNum', \n",
131 | " 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', \n",
132 | " 'TaxiOut', 'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', \n",
133 | " 'LateAircraftDelay');\n",
134 | "flt_2007 = read_csv_from_hdfs('/user/demo/airline/delay/2007.csv', cols)\n",
135 | "\n",
136 | "print(dim(flt_2007))"
137 | ],
138 | "language": "python",
139 | "metadata": {},
140 | "outputs": [
141 | {
142 | "metadata": {},
143 | "output_type": "display_data",
144 | "text": [
167 | "Read 7453216 rows and 29 (of 29) columns from 0.655 GB file in 00:00:32\n",
168 | "[1] 7453216 29\n"
169 | ]
170 | }
171 | ],
172 | "prompt_number": 3
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "We have 7.4M+ flights in 2007 and 29 variables.\n",
179 | "\n",
180 |       "Our \"target\" variable will be *DepDelay* (departure delay in minutes). To build a classifier, we derive a binary target variable: a flight counts as a \"delay\" if it departs 15 minutes or more late, and a \"non-delay\" otherwise. We name this new binary variable *'DepDelayed'*.\n",
181 | "\n",
182 | "Let's look at some basic statistics of flights and delays (per our new definition), after limiting ourselves to flights originating from ORD:"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "collapsed": false,
188 | "input": [
189 | "%%R\n",
190 | "df1 = flt_2007[which(flt_2007$Origin == 'ORD' & !is.na(flt_2007$DepDelay)),]\n",
191 | "df1$DepDelay = sapply(df1$DepDelay, function(x) (if (as.numeric(x)>=15) 1 else 0))\n",
192 | "\n",
193 | "print(paste0(\"total flights: \", as.character(dim(df1)[1])))\n",
194 | "print(paste0(\"total delays: \", as.character(sum(df1$DepDelay))))"
195 | ],
196 | "language": "python",
197 | "metadata": {},
198 | "outputs": [
199 | {
200 | "metadata": {},
201 | "output_type": "display_data",
202 | "text": [
203 | "[1] \"total flights: 359169\"\n",
204 | "[1] \"total delays: 109346\"\n"
205 | ]
206 | }
207 | ],
208 | "prompt_number": 4
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 |       "The \"month\" variable is likely a good feature for modeling -- let's look at the distribution of delays (as a percentage of total flights) by month:"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "collapsed": false,
220 | "input": [
221 | "%%R\n",
222 | "df2 = df1[, c('DepDelay', 'month'), with=F]\n",
223 | "df2$month = as.numeric(df2$month)\n",
224 | "df2 <- ddply(df2, .(month), summarise, mean_delay=mean(DepDelay))\n",
225 | "barplot(df2$mean_delay, names.arg=df2$month, xlab=\"month\", ylab=\"% of delays\", col=\"blue\")"
226 | ],
227 | "language": "python",
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "metadata": {},
232 | "output_type": "display_data",
233 |       "png": "<base64 PNG data omitted: barplot of the percentage of delayed flights by month (ORD, 2007)>"
234 | }
235 | ],
236 | "prompt_number": 5
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "Similarly, one would expect that \"hour of day\" would be a good feature to predict flight delays, as later flights in the day may present more delays due to density effects. Let's look at that:"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "collapsed": false,
248 | "input": [
249 | "%%R\n",
250 | "\n",
251 | "# Extract hour of day from 3 or 4 digit time-of-day string\n",
252 | "get_hour <- function(x) { \n",
253 | " s = sprintf(\"%04d\", as.numeric(x))\n",
254 | " return(substr(s, 1, 2))\n",
255 | "}\n",
256 | "\n",
257 | "df2 = df1[, c('DepDelay', 'CRSDepTime'), with=F]\n",
258 | "df2$hour = as.numeric(sapply(df2$CRSDepTime, get_hour))\n",
259 | "df2$CRSDepTime <- NULL\n",
260 | "df2 <- ddply(df2, .(hour), summarise, mean_delay=mean(DepDelay))\n",
261 | "barplot(df2$mean_delay, names.arg=df2$hour, xlab=\"hour of day\", ylab=\"% of delays\", col=\"green\")"
262 | ],
263 | "language": "python",
264 | "metadata": {},
265 | "outputs": [
266 | {
267 | "metadata": {},
268 | "output_type": "display_data",
269 | "png": "iVBORw0KGgoAAAANSUhEUgAAAeAAAAHgCAIAAADytinCAAAgAElEQVR4nO3df1iUZb748XskxpZB\ncAQM0MjII0pGWKklpkCiSS6aVsdTLtZirVqphaak7G4aIWspUp2uLPcqMXf74SpSWqGF+8W2Ukst\nVCTTzRIQA4VRfjPfP6ZYo5HDMzjPfAber+tc1xnGuef+4Dy9d3yYGQxWq1UBAOTp5uoBAAD2EWgA\nEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQA\nCEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoA\nhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0A\nQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYA\noQg0AAhFoAFAKAINAEJd5uoBAHQJxcXF+/bt07TE09Nz4sSJBoPBSSPJZ7Bara6eAUDnd999920I\n2KDCtaz5k/ru8++uvPJKZ80kHs+gAejB19dXJSl1nZY1O1UXfwZJoAG01+uvv15aWqppSZ8+faZN\nm+akeTo9Ag2gvVasWFGYWahpSb8H+xFohxFoAO3Vr1+/wjHaAt23b18nDdMV8DI7ABCKQAOAUAQa\nAIQi0AAgFD8kBLqWM2fOaH1xsclkMhqNTpoHbSDQQBdSUlISHBys7tay5muVHJ/87LPPOmsmXByB\nBrqQhoYGNUup/9WypkB5vufprIHQJs5BA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQ\nikADgFAEGgCEItAAIBSBBgCh+LAkANIdPXr08ccfDwwMbP+S0tLSpUuXXn/99c6bSgcEGoB0hw8f\n3tJni5qvZc0G9d+F/02gHVFQUDBy5EiXbA3ALYUoFarl9ppuLJVrzkHfeuutLtkXANyIToE2mUyG\nCyilWi4AAOzSKdCfffbZjTfe+Pbbb1utVtvvQ2u5AACwS6dADx48eMeOHW+88cb8+fMbGhr02RQA\n3Jp+PyT09fXduHFjRkZGXFycbpsCnU9jY2N1dbXWVWaz2RnDwKl0fRVHt27dUlJSbr755h07dui5\nL9CZrFmz5uE/Pqxitax5W33++edDhw511kxwDhe8zC4mJiYmJub/vFl9ff25c+d+fX2PHj0uu4yX\nb6PrMhgMKkupe7WsSVF1dXXOGghO47LS5efnx8TEtPFzwu3bt7/22mutrvzhhx9uu+22pUuXOnc4\nABDAZYGOjo5u+1Uc8fHx8fHxra7csGHD2bNnnTkXAEjBhyUBgFA6BbqysjIlJSUsLMzHx8dkMoWF\nhS1YsKCqqkqf3QHAHekU6MTERIvFkpOTU1ZWVl5enpub6+HhkZiYqM/uAOCOdDoHXVBQsHHjRqPR\naPtywIAB6enpISEh+uwOAO5Ip2fQUVFRycnJRUVFNTU1tbW1xcXFKSkpkZGR+uwOAO5Ip0BnZ2d7\neXklJCQEBAT4+fnFx8c3NDSsX79en90BwB3pdIrDbDZnZGRkZGTosx0AdAK8zA4AhCLQACAUgQYA\noQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCE4ndHAS4wcuTIXWqX+o2WNf9PWWvb+h0X6HwINOAC\nPXv2VG8o5atlza3OGgZicYoDAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQ\nACAUgQYAoQg0AAhFoAFAKAINAELxcaOAgz799FOLxaJpSWhoaGhoqJPmQedDoAEH3XLLLWq5lgUn\nVezXsTt27HDWQOh0CDTgqJFKLdRy++9UYEqgs4ZBZ0SgAXRyBoNBjdGy4Lwa4zUmLy/PWQO1G4EG\n0NmNVEpTbM+q3rN7O2sYLXgVBwAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhF\noAFAKAINAELxWRzouk6dOrVmzRpPT8/2L6mvr58+fXpISIjzpgJaEGh0Xbt37079IFXN0bJmg7rm\nmmvuvfdeZ80EXIBAo2u7Q6m7tdy+wVmDAL/GOWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFA\nKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABBKp0B/8MEHffr0ufba\naz/55JOhQ4d6e3vfeuuthw8f1md3AHBHOgV63rx5zz///PLly0eOHDl58uSjR4/eddddM2bM0Gd3\nAHBHOgX65MmTkyZNGj16tNVqnTt37hVXXDFnzpyDBw/qszsAuCOdAh0cHPzxxx/36NGjurray8tL\nKbVz587Q0FB9dgcAd6RToFeuXHnPPfds27bN29tbKbVgwYLJkyevXLlSn90BwB1dps8248ePLy8v\nb2xstH358MMPp6WlGY1GfXYHAHekU6CVUt26dWspcr9+/XTbFwDclH6BbiU/Pz8mJsZqtV7sBnl5\nee+8806rK7/55pthw4Y5eTQAEMFlgY6Ojm6jzkqpqKioa665ptWVW7Zs8fDwcOZcACCFywL9f/Ly\n8vr1yzx69+599uxZl8wDADrT6VUclZWVKSkpYWFhPj4+JpMpLCxswYIFVVVV+uwOAO5Ip0AnJiZa\nLJacnJyysrLy8vLc3FwPD4/ExER9dgcAd6TTKY6CgoKNGze2vIpjwIAB6enpISEh+uwOAO5Ip2fQ\nUVFRycnJRUVFNTU1tbW1xcXFKSkpkZGR+uwOAO5Ip0BnZ2d7eXklJCQEBAT4+fnFx8c3NDSsX79e\nn90BwB3pdIrDbDZnZGRkZGTosx0AdAJ8YD8ACEWgAUAoAg0AQhFoABBK7lu9gfZ47bXX3nrrLdvn\njLdTRUXFtm3bPD09nTcVcEkQaLi3goKCbX/Ypq7TsuZBdf78eV9fX2fNBFwiBBruzWg0qlClNP32\ntEBnDQNcWpyDBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIxVu94XoL\nFy48c+aMpiVBQUF//vOfnTMOIAWBhutt3br165yvNS2JuDOCQKPTI9BwvZ49e2r7tCOlfHx8nDML\nIAiBxqWxb9++pqYmTUtCQ0PNZrOT5gE6AQKNS+DEiRNDxgxRM7Ss2acev/bx5557zlkzAe6PQOMS\nsFqt6h6llmtZU6CM7xmdNRDQKdh/md3q1avj4uIaGhri4uJ69er117/+VeexAAD2n0EvXbr0X//6\n1+bNm/39/Xft2hUbG/v73/9e58kAoIuz/wza09Ozrq7u9ddff+CBBzw8PBoaGnQeCwBgP9BpaWmj\nRo3y8PCIi4sbO3bssmXLdB4LAGD/FEd
SUlJSUpLt8vHjx/UbBwDwM/vPoFeuXHny5EmdRwEAXMh+\noAsLCyMiIuLi4l5//fWqqiqdZwIAqIsFeu3atT/88MOcOXM+/PDD/v37T5069d13362vr9d5OADo\nyi76caPdu3cfNmxYdHR0eHj41q1bly5detVVV23atEnP4QCgK7Mf6FWrVo0aNeraa68tKChITk4+\nderU559//u67786cOVPn+QCgy7L/Ko6DBw8++eSTsbGxRuN/3owbERHx0ksv6TUYAHR19p9Bv/LK\nK7fffrutzo2NjdOnT1dKeXp6Tp48WdfpAKALs/8MesWKFUuWLGn5qeC4ceN0HAmu0dzcfPbsWU1L\nDAZDz549nTQPAPuBfvbZZz/99NPMzMzly5dv27atoqJC57Ggv1deeWXmipnqBi1r3lO7d+6+6aab\nnDUT0LXZD3RdXV1ERMSoUaP27NnzwAMPREREzJ8/X+fJoLPm5ma1Wqk7tKxJUbW1tc4aCOjy7J+D\nDgkJWbVq1eDBg//2t78VFhaWlZXpPBYAwH6gly1blp2dPXTo0Pr6+qioqKeeekrnsQAA9k9xTJw4\nceLEiUqpd955R995AAA/ueg7CQEArvWLZ9AGg+Fit7Narc4fBgDwH78INBUGADk4xQEAQtkP9MmT\nJydMmNC7d+/vvvtu1qxZ58+f13ksAID9QC9cuHDYsGHl5eV+fn5HjhyZO3euzmMBAOwHeteuXbYo\nm0ymDRs25OTk6DsVAOAigbZYLN27d7dd7tGjh4eHh44jAQCUuligR48evWXLFqXUt99+O2fOnAkT\nJug7FQDgIoHOysrKzs42mUyjR4/29vZetWqVzmMBAOy/1TsoKCg3N1fnUQAAF+KdhJ3Kpk2bioqK\n2ngcf81kMj3yyCPOGwmAw+y/k3D16tX79u1bvny5UmrRokXR0dH6TwYHpKWl7U3Zq3y1rPkfRaAB\nmeyf4sjKyjpw4IDJZFJKvfDCC5GRkbZfSwjhAgMD1RilLdADnTUMgA6y/0PCM2fONDc32y43NTVV\nVlbqOBIAQKmLBXrs2LGzZs0qLS0tLS2dNWvW7bffrvNYAAD7gX7hhReUUtdee+11113n6emZlZWl\n71QAgIucg/bz81u/fr3OowAALsTHjQKAUAQaAIT6RaADAwNPnz6tlIqMjHTRPACAn/wi0DNmzBg4\ncKDBYNi/f7/hlzq4zZEjR6Kiovz9/R9++OHGxkallMVi6fjdAkAn9otAP/3006dPn7ZarRMnTrT+\nUge3SUpKmjx58qFDh5qamlJTUzt4bwDQFdg/B7158+ZLu81XX301e/bsgICAF1988YMPPjhy5Mil\nvX8A6HzsB7qiouL+++/v3bu3v7//9OnTO/5Owj59+uzfv18p5eHhkZGRMWPGjKampg7eJwB0bvYD\nPXfuXKPR+NVXXx08eNDT0/Oxxx7r4DZpaWlxcXEPPvigUiouLm7EiBHDhw/v4H0CQOdm/40qeXl5\nx48fv/zyy5VSzz//fGhoaAe3mTRpUmFh4bfffmv7Mj09/be//W1+fn4H7xYAOjH7gXaGkJCQkJAQ\n22WDwRAVFRUVFdXG7c+cOXP06NFWVx47dszb29tZIwKAJPYDHRcX9+ijj6alpSmlFi9eHBcXd8k3\nzs/Pj4mJaeP1IV988cWbb77Z6spvvvnm5ptvvuTDAIBA9gO9evXqefPmhYeHK6XGjx+/evXqS75x\ndHR026/ei42NjY2NbXXlhg0bzp49e8mHAQCB7Ae6V69e69at03kUAMCFdPosjsrKypSUlLCwMB8f\nH5PJFBYWtmDBgqqqKn12BwB3pFOgExMTLRZLTk5OWVlZeXl5bm6uh4dHYmKiPrsDgDvS6VUcBQUF\nGzduNBqNti8HDBiQnp7e8qIOAMCvtfUM+v3337/66quDgoLee++9Dm4TFRWVnJxcVFRUU1NTW1tb\nXFyckpLCZ+YBQBvaCvSMGTNWrly5devWuXPndnCb7OxsLy+vhISEgIAAPz+/+Pj4hoYGfmkLALSh\ndaD/9Kc/nTt37qc/6/bTn7b8hm+Hmc3mjIyMoqIii8Vy7ty54uLi5557ztfXt4N3CwCdWOtz0L/7\n3e8ee+yxKVOmjBs37tVXX33ooYdqa2vXrFnjkuEAoCtr/Qy6f//+L7/8cmlp6dy5cyMjI48fP15a\nWpqQkOCS4QCgK7PzKg6DwTB9+vTx48enpaUNGTJk+vTp/OoTANBf62fQmzdvDgwMvPrqq/fv3796\n9eqgoKAZM2YUFxe7ZDgA6MpaB/qxxx7btm3bypUrk5KSlFLjxo3Lysribd8AoL/Wgf712QyTybRs\n2TK95gEA/KT1OejMzMzx48f/5je/Wbt2rUsGAgDYtA50QkICr9kAAAl0+rAkAIBWBBoAhCLQACAU\ngQYAoQg0AAil0wf2o/3i4uLMZnP7b19bWxsVFbVw4ULnjQTAJQi0ONt/3K7e0rLgO3V52uXOmgaA\n6xBoeUxKaXgCrVS18vT0dNYwAFyHc9AAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi\n0AAgFIEGAKEINAAIxWdxXHpWq/XMmTNaV/n4+Hh4eDhjHgBuikBfejt27Ii7PU5N1rImV7217q27\n777bWTMBcEME+tKrq6tTTyu1SMual1RDQ4OzBgLgnjgHDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYA\noQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOA\nUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCXebqAYTKzMzcsGGDr69v+5eUlZXt2bPHaDQ6byoA\nXQqBtu/IkSO71+5W12lZc5+qqakh0AAuFU5xAIBQBBoAhCLQACAUgQYAoQg0AAhFoAFAKJ0CPXDg\nQIM9+uwOAO5Ip0AXFhYOHTo0NzfX+kv67A4A7kinQHt4eNx7770mk0mf7QCgE9DvnYTz5s3TbS8A\n6AT4ISEACOWyz+LIz8+PiYlp4zR0bm5uVlZWqytLS0vj4+OdPBoAiOCyQEdHR7f9Q8L4+PiRI0e2\nuvKdd95paGhw5lwAIIXcT7Pz8PAwm82trjSZTGfPnm3nPUydOrW5uVnTpmaz+eWXX9a0BACcRKdA\nV1ZW/uUvf/nHP/5RUlLS1NTUt2/fhISE1NRUHx8f52169OjRPR/u0bRkzD1jnDQMAGil0w8JExMT\nLRZLTk5OWVlZeXl5bm6uh4dHYmKiUze9/PLLlVlp+r/u3bs7dSQAaD+dnkEXFBRs3Lix5cPsBwwY\nkJ6eHhISos/uAOCOdHoGHRUVlZycXFRUVFNTU1tbW1xcnJKSEhkZqc/uAOCOdAp0dna2l5dXQkJC\nQECAn59ffHx8Q0PD+vXr9dkdANyRTqc4zGZzRkZGRkaGPtsBQCfAOwkBQCgCDQBCEWgAEIpAA4BQ
\nBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAo\nAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAU\ngQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCK\nQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhF\noAFAKAINAEIRaAAQikADgFAEGgCEItAAIJROga6srExJSQkLC/Px8TGZTGFhYQsWLKiqqtJndwBw\nRzoFOjEx0WKx5OTklJWVlZeX5+bmenh4JCYm6rM7ALijy/TZpqCgYOPGjUaj0fblgAED0tPTQ0JC\n9NkdANyRTs+go6KikpOTi4qKampqamtri4uLU1JSIiMj9dkdANyRToHOzs728vJKSEgICAjw8/OL\nj49vaGhYv369PrsDgDvS6RSH2WzOyMjIyMjQZzsA6AR0CrQDSkpKCgsLW1359ddf+/n5tfMejh8/\nrrZr2/T777+3XbBYLGq7UmVaFu+74HK+UjdpWbtdqSk/X/5aaRv7O3X69GnbxdLSUrVdKV8tyw//\n9P8bGxvVdqW6a1n7qVJ3/Hx5j8axC1Rtba3toisfqQKNY29XaszPl137SPXWsragaz9ShzWuPfuf\nR8q1XBbo/Pz8mJgYq9V6sRucOHFi7969ra5sbm4eOHBgO7dYtGiRZa9F01RXzLvCduH+++//r13/\nZdxrbP/amrtrvL29lVI33HDDkzc+6bPXp/1rqwZW3XzzzbbLSx9fqmnfxsbGwbMH2y7Pnz//2JFj\n3bppOHPV7YmfbhwTE/PH03/02uvV/rWWkZbw8HClVO/evVNvTzXtNbV/ba2hduw9Y22XHXikAlID\nbBc6+EgtGbLEe693+9da+liGDx9uu7wseZnnXs/2r+3gI2VYYLBdiI2NXXJK29hVo6oGDRqkOvxI\nPfroo2f2nmn/WqWU3+KfnlE59kgFBQWpDj9Szzz+TLe9Gv6qm5ubB80a1P7bO4+hjUQCAFyIdxIC\ngFC8kxAAhOKdhAAglE7noM1mc1lZWcs7CZVSVqs1JCTkxIkTOuwOAO6IdxICgFC8kxAAhOJldgAg\nFC+zAwChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCda1A\nR0VFGX42c+ZMrcsbGxtnz54dEBAQFRX1ww8/aFpr+JX2r925c2dkZGSPHj0iIyP/+c9/atr31KlT\n06ZNCwoK6tu370MPPVRdXa1puWOampp+/bt97V7p7H3ff//98PDwnj17hoeHf/jhh3pu3cGDzeF9\nO3KkdWRr3Y60nJycwYMH9+zZc9SoUUeOHLnYPJ2Etctobm7u1avX999/X11dXV1dXVNTo/UeVqxY\ncd999507d27+/PlJSUma1lZfIDU1deHChe1f27dv37feequ+vv7NN9+88sorNe17xx13LFmypK6u\nrqam5oknnnj88cc1LXdAZmbmsGHDWh1adq909r5NTU29evXavn17U1PT22+/HRwcrNvWHT/YHNvX\n2rEjrSNb63Ok/fvf//b29v7kk0/Onz+/YsWKESNGXGyezqGzfT9tKCkp8fb2vvHGG729vSdOnFhW\nVqb1HoYMGbJv3z6r1VpVVbVnzx7Hxjhw4MBtt93W0NDQ/iXh4eGvvPJKRUXFq6++OmjQIE3beXt7\nnzlzxna5oqLiqquu0rTcAR999FFubm6r/1TsXunsfevq6t57773m5uaqqqotW7aEh4frtnXHDzbH\n9r2QA0daR7bW50j7+OOPZ8yYYbt86tQpPz+/i83TOXS276cNX375ZUxMzJdffvnjjz8mJiZOnTpV\n6z306tVr4cKFZrP5xhtvPHDggAMz1NXVDRs2rLCwUNOq3bt3t/yLZ/fu3ZrWRkdHL1q0qLKysqys\nbM6cOUajUdNyh9n9T0WH/35+vYXt39oGg2HXrl26bd3xg82xfVs4dqR1ZGudj7TGxsaZM2fOnj37\nYvN0Dp3t+2mnkydPms1mrasuu+yyJ5544uTJk4sXLx4+fLgD+z7zzDOPPPKI1lWxsbG2fRcsWHDb\nbbdpWnv8+PH4+Hhvb+/Q0NDMzMzAwECtuztGTqCtVqvFYklLS7vpppv039rq6MHWwX0dO9I6srWe\nR1peXt6QIUMWLlzY6t8HBNqN7d27t+U51OnTpx04gIKCgk6ePGm1WktKSkwmk9bljY2NISEhxcXF\nWheaTKaSkhKr1Xr69Glvb29Na8vLy+vq6myX8/PzR48erXV3x0gI9LFjx+bPn2+7XFpa6sBD5vDW\nHT/YHNvXxuEjrSNb63OkNTc3L1q06NZbby0qKmp7ns6hC72K49y5c3feeeehQ4fq6+uXLVs2adIk\nrfcwbty41157ra6ubs2aNTfddJPW5R999NGVV17Zv39/rQsjIiLWrl1rsVjWrVt3/fXXa1r7xBNP\n/OEPf6iqqiopKVm0aNGcOXO07u6+goOD165du3PnTqvV+uabbw4ZMkS3rTt+sHWEw0daR+hzpH3y\nySebNm3asmVLcHCwxWKxWCzO2EUQV/8vhH6am5tffPHFa665xt/fPzEx8ezZs1rvoaSkZMyYMb6+\nvqNGjXLg6cm999771FNPaV1ltVoPHTo0YsQIb2/vESNGHDp0SNPa06dPJyQk+Pj4DBo0aM2aNQ7s\n7hi7h5YOx1urLfLz82+44Qaz2XzLLbdo/avryNYdP9gc29fG4SOtI1vrc6Q9/fTTbRSs8wWNXxoL\nAEJ1oVMcAOBeCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBC\nEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAIN6QwGgz4bpaamBgYGunwMoAW/NBbSGQw6\nHaX+/v6HDx/29/d37RhAC445SKdbGdveiEBDf5zigBt4/vnnIyIi/Pz8nnvuOaVURUVFYmJiUFBQ\ncHDw9OnTKyoqbDe78CxEy2WDwZCdnX3huQu7yydNmqSUioyMvPBm9913n5+fX//+/bOyslqu37Jl\nS2RkZM+ePYOCgp599lml1IMPPrhq1SrbnyYlJa1cudI5fw3ocgg03MD58+cPHDiQl5e3ZMkSpdS8\nefOMRuO333579OhRo9GYnJzc9vLPPvtsx44dLV/aXb5582al1L59+1puNnfuXKXUsWPH9u/f/8UX\nX7Rcn5qaOm3atB9//HHr1q2LFy9WSk2ZMmXTpk1Kqbq6upycnKlTp17C7x1dGf9qg3QGg6GqqqpH\njx7q5/MM/v7+Bw8e7N27t1KqrKwsIiKirKxM/fIsRMt
lg8Fw6tSpgICAljtsz3KllJ+fX2Fhoe2p\nd2lpaVBQkO1Pm5ubd+/eXVhYuHPnznXr1lmt1vr6+uDg4MLCws8//zwrKysvL0+vvxt0cjyDhhuw\n1flCF57BaGpqavWn1dXVF355YZ3bs9ymW7duv769Uuqee+5ZvXp1QEBAenq67Rqj0XjHHXds2bLl\n73//+7Rp09rzHQHtQaDhfsaPH7948eLa2tqamprFixfHx8fbru/evfvHH39stVpfeuklB5a3Eh8f\nP3/+/Orq6nPnzqWkpLRcn5eXt3jx4gkTJrz//vtKqcbGRqXUXXfd9cYbb+Tl5d15552X7PtEl0eg\n4X4yMzNramr69esXGhpaX1+fmZlpu/7pp5+eMmVKRETEFVdc4cDyVlatWtXc3NyvX7+IiIjRo0e3\nXP/MM89ER0dfd911P/7447hx45KSkpRScXFxe/fujY2N9fHxuXTfKLo6zkEDl8bw4cNTU1MnTJjg\n6kHQeVzm6gEAt9fQ0PDVV1+dOHFi7Nixrp4FnQqnOICOys3NHT9+/Isvvmg0Gl09CzoVTnEAgFA8\ngwYAoQg0AAhFoAFAKAINAEIRaAAQikADgNYmxUYAAAAuSURBVFAEGgCEItAAIBSBBgChCDQACEWg\nAUAoAg0AQhFoABCKQAOAUAQaAIT6/9UDl9sgUkleAAAAAElFTkSuQmCC\n"
270 | }
271 | ],
272 | "prompt_number": 6
273 | },
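As a quick check of the time parsing above: get_hour zero-pads 3- or 4-digit CRSDepTime strings before slicing off the hour. A couple of hypothetical values:

```r
# get_hour pads "630" to "0630" and returns the first two characters
get_hour("630")    # "06" -- a 6:30 AM departure
get_hour("1455")   # "14" -- a 2:55 PM departure
```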
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "In this demo we have not explored all the variables of course, just a couple to demonstrate R's capabilities in this area."
279 | ]
280 | },
281 | {
282 | "cell_type": "heading",
283 | "level": 2,
284 | "metadata": {},
285 | "source": [
286 | "Pre-processing - iteration 1"
287 | ]
288 | },
289 | {
290 | "cell_type": "markdown",
291 | "metadata": {},
292 | "source": [
293 | "After playing around with the data, and exploring some potential features -- we will now demonstrate how to use Scalding to preform some simple pre-processing on Hadoop, and create a feature matrix from the raw dataset. This is similar to the pre-processing shown in part 1 of the blog post (where we've used PIG for this purpose) and part 2 (where we've used Spark for this purpose).\n",
294 | "\n",
295 | "In our first iteration, we create the following features:\n",
296 | "\n",
297 | "* **month**: winter months should have more delays than summer months\n",
298 | "* **day of month**: this is likely not a very predictive variable, but let's keep it in anyway\n",
299 | "* **day of week**: weekend vs. weekday\n",
300 | "* **hour of the day**: later hours tend to have more delays\n",
301 | "* **Distance**: interesting to see if this variable is a good predictor of delay\n",
302 | "* **days_from_closest_holiday**: number of days from date of flight to closest US holiday\n",
303 | "\n",
304 | "Let's look at the Scalding code.\n",
305 | "Note that we write the code in the next cell using IPython's \"writefile\" magic command and then execute it later from that local file."
306 | ]
307 | },
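As an aside before the Scalding code: the nearest-holiday distance is easy to prototype in R. A minimal sketch with an abbreviated, hypothetical holiday list (the full list appears in the Scalding job below):

```r
# Days from a flight date to the nearest US holiday (abbreviated list)
days_from_nearest_holiday <- function(flight_date) {
  holidays <- as.Date(c("2007-01-01", "2007-07-04", "2007-12-25"))
  min(abs(as.numeric(as.Date(flight_date) - holidays)))
}
days_from_nearest_holiday("2007-07-01")   # 3
```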
308 | {
309 | "cell_type": "code",
310 | "collapsed": false,
311 | "input": [
312 | "%%writefile preprocess1.scala\n",
313 | "\n",
314 | "package com.hortonworks.datascience.demo1\n",
315 | "\n",
316 | "import com.twitter.scalding._\n",
317 | "import org.joda.time.format._\n",
318 | "import org.joda.time.{Days, DateTime}\n",
319 | "import com.hortonworks.datascience.demo1.ScaldingFlightDelays._\n",
320 | "\n",
321 | "/**\n",
322 | " * Pre-process flight delay data into feature matrix - iteration #1\n",
323 | " */\n",
324 | "class ScaldingFlightDelays(args: Args) extends Job(args) {\n",
325 | "\n",
326 | " val prepData = Csv(args(\"input\"), \",\", fields = inputSchema, skipHeader = true)\n",
327 | " .read\n",
328 | " .project(delaySchmea)\n",
329 | " .filter(('Origin,'Cancelled)) { x:(String,String) => x._1 == \"ORD\" && x._2 == \"0\"}\n",
330 | " .mapTo(featureSchmea -> outputSchema)(gen_features)\n",
331 | " .write(Csv(args(\"output\")))\n",
332 | "}\n",
333 | "\n",
334 | "object ScaldingFlightDelays {\n",
335 | " val inputSchema = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, \n",
336 | " 'DepTime, 'CRSDepTime, 'ArrTime, 'CRSArrTime, \n",
337 | " 'UniqueCarrier, 'FlightNum, 'TailNum, \n",
338 | " 'ActualElapsedTime, 'CRSElapsedTime, 'AirTime, 'ArrDelay, \n",
339 | " 'DepDelay, 'Origin, 'Dest, 'Distance, \n",
340 | " 'TaxiIn, 'TaxiOut, 'Cancelled, 'CancellationCode, \n",
341 | " 'Diverted, 'CarrierDelay, 'WeatherDelay, \n",
342 | " 'NASDelay, 'SecurityDelay, 'LateAircraftDelay)\n",
343 | " val delaySchmea = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, \n",
344 | " 'CRSDepTime, 'DepDelay, 'Origin, 'Distance, 'Cancelled)\n",
345 | " val featureSchmea = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, \n",
346 | " 'CRSDepTime, 'DepDelay, 'Distance)\n",
347 | " val outputSchema = List('flightDate,'y,'m,'dm,'dw,'crs,'dep,'dist)\n",
348 | "\n",
349 | " val holidays = List(\"01/01/2007\", \"01/15/2007\", \"02/19/2007\", \"05/28/2007\", \"06/07/2007\", \"07/04/2007\",\n",
350 | " \"09/03/2007\", \"10/08/2007\" ,\"11/11/2007\", \"11/22/2007\", \"12/25/2007\",\n",
351 | " \"01/01/2008\", \"01/21/2008\", \"02/18/2008\", \"05/22/2008\", \"05/26/2008\", \"07/04/2008\",\n",
352 | " \"09/01/2008\", \"10/13/2008\" ,\"11/11/2008\", \"11/27/2008\", \"12/25/2008\")\n",
353 | "\n",
354 | " def gen_features(tuple: (String,String,String,String,String,String,String)) = {\n",
355 | " val (year, month, dayOfMonth, dayOfWeek, crsDepTime, depDelay, distance) = tuple\n",
356 | " val date = to_date(year.toInt,month.toInt,dayOfMonth.toInt)\n",
357 | " val hour = get_hour(crsDepTime)\n",
358 | " val holidayDist = days_from_nearest_holiday(year.toInt,month.toInt,dayOfMonth.toInt)\n",
359 | "\n",
360 | " (date,depDelay,month,dayOfMonth,dayOfWeek,hour,distance,holidayDist.toString)\n",
361 | " }\n",
362 | "\n",
363 | " def get_hour(depTime: String) = \"%04d\".format(depTime.toInt).take(2)\n",
364 | "\n",
365 | " def to_date(year: Int, month: Int, day: Int) = \"%04d%02d%02d\".format(year, month, day)\n",
366 | "\n",
367 | " def days_from_nearest_holiday(year:Int, month:Int, day:Int) = {\n",
368 | " val sampleDate = new DateTime(year, month, day, 0, 0)\n",
369 | "\n",
370 | " holidays.foldLeft(3000) { (r, c) =>\n",
371 | " val holiday = DateTimeFormat.forPattern(\"MM/dd/yyyy\").parseDateTime(c)\n",
372 | " val distance = Math.abs(Days.daysBetween(holiday, sampleDate).getDays)\n",
373 | " math.min(r, distance)\n",
374 | " }\n",
375 | " }\n",
376 | "}\n"
377 | ],
378 | "language": "python",
379 | "metadata": {},
380 | "outputs": [
381 | {
382 | "output_type": "stream",
383 | "stream": "stdout",
384 | "text": [
385 | "Overwriting preprocess1.scala\n"
386 | ]
387 | }
388 | ],
389 | "prompt_number": 8
390 | },
391 | {
392 | "cell_type": "markdown",
393 | "metadata": {},
394 | "source": [
395 | "We execute this Scalding code using the standard \"scald.rb\" script, generating the feature matrix for both the 2007 dataset and 2008 dataset. Note we use IPython's \"%%capture\" magic command to capture the output of the Scalding script and print only error messages, if any (stderr)"
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "collapsed": false,
401 | "input": [
402 | "%%capture capt2007_1\n",
403 | "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess1.scala --input \"airline/delay/2007.csv\" --output \"airline/fm/ord_2007_sc_1\""
404 | ],
405 | "language": "python",
406 | "metadata": {},
407 | "outputs": [],
408 | "prompt_number": 1
409 | },
410 | {
411 | "cell_type": "code",
412 | "collapsed": false,
413 | "input": [
414 | "capt2007_1.stderr"
415 | ],
416 | "language": "python",
417 | "metadata": {},
418 | "outputs": [
419 | {
420 | "metadata": {},
421 | "output_type": "pyout",
422 | "prompt_number": 6,
423 | "text": [
424 | "''"
425 | ]
426 | }
427 | ],
428 | "prompt_number": 6
429 | },
430 | {
431 | "cell_type": "code",
432 | "collapsed": false,
433 | "input": [
434 | "%%capture capt2008_1\n",
435 | "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess1.scala --input \"airline/delay/2008.csv\" --output \"airline/fm/ord_2008_sc_1\""
436 | ],
437 | "language": "python",
438 | "metadata": {},
439 | "outputs": [],
440 | "prompt_number": 3
441 | },
442 | {
443 | "cell_type": "code",
444 | "collapsed": false,
445 | "input": [
446 | "capt2008_1.stderr"
447 | ],
448 | "language": "python",
449 | "metadata": {},
450 | "outputs": [
451 | {
452 | "metadata": {},
453 | "output_type": "pyout",
454 | "prompt_number": 4,
455 | "text": [
456 | "''"
457 | ]
458 | }
459 | ],
460 | "prompt_number": 4
461 | },
462 | {
463 | "cell_type": "heading",
464 | "level": 2,
465 | "metadata": {},
466 | "source": [
467 | "Modeling - iteration 1"
468 | ]
469 | },
470 | {
471 | "cell_type": "markdown",
472 | "metadata": {},
473 | "source": [
474 | "Now that we've generated our feature matrix using Scalding and Hadoop, let's turn to using R to build a predictive model for predicting airline delays. First we prepare our trainning set (using the 2007 data) and test set (using 2008 data):"
475 | ]
476 | },
477 | {
478 | "cell_type": "code",
479 | "collapsed": false,
480 | "input": [
481 | "%%R\n",
482 | "\n",
483 | "# Function to compute Precision, Recall and F1-Measure\n",
484 | "get_metrics <- function(predicted, actual) {\n",
485 | " tp = length(which(predicted == TRUE & actual == TRUE))\n",
486 | " tn = length(which(predicted == FALSE & actual == FALSE))\n",
487 | " fp = length(which(predicted == TRUE & actual == FALSE))\n",
488 | " fn = length(which(predicted == FALSE & actual == TRUE))\n",
489 | "\n",
490 | " precision = tp / (tp+fp)\n",
491 | " recall = tp / (tp+fn)\n",
492 | " F1 = 2*precision*recall / (precision+recall)\n",
493 | " accuracy = (tp+tn) / (tp+tn+fp+fn)\n",
494 | " \n",
495 | " v = c(precision, recall, F1, accuracy)\n",
496 | " v\n",
497 | "}\n",
498 | "\n",
499 | "# Read input files\n",
500 | "process_dataset <- function(filename) {\n",
501 | " cols = c('date', 'delay', 'month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday')\n",
502 | " \n",
503 | " data = read_csv_from_hdfs(filename, cols)\n",
504 | " data$delay = as.factor(as.numeric(data$delay) >= 15)\n",
505 | " data$month = as.factor(data$month)\n",
506 | " data$day = as.factor(data$day)\n",
507 | " data$dow = as.factor(data$dow)\n",
508 | " data$hour = as.numeric(data$hour)\n",
509 | " data$distance = as.numeric(data$distance)\n",
510 | " data$days_from_holiday = as.numeric(data$days_from_holiday)\n",
511 | " data\n",
512 | "}\n",
513 | "\n",
514 | "# Prepare training set and test/validation set\n",
515 | "\n",
516 | "data_2007 = process_dataset('/user/demo/airline/fm/ord_2007_sc_1')\n",
517 | "data_2008 = process_dataset('/user/demo/airline/fm/ord_2008_sc_1')\n",
518 | "\n",
519 | "fcols = setdiff(names(data_2007), c('date', 'delay'))\n",
520 | "train_x = data_2007[,fcols, with=FALSE]\n",
521 | "train_y = data_2007$delay\n",
522 | "\n",
523 | "test_x = data_2008[,fcols, with=FALSE]\n",
524 | "test_y = data_2008$delay"
525 | ],
526 | "language": "python",
527 | "metadata": {},
528 | "outputs": [],
529 | "prompt_number": 11
530 | },
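Before training any models, a quick sanity check of get_metrics on hypothetical values (2 true positives, 1 true negative, 1 false negative):

```r
get_metrics(predicted = c(TRUE, FALSE, TRUE, FALSE),
            actual    = c(TRUE, FALSE, TRUE, TRUE))
# returns c(precision, recall, F1, accuracy) = c(1.00, 0.67, 0.80, 0.75)
```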
531 | {
532 | "cell_type": "markdown",
533 | "metadata": {},
534 | "source": [
535 | "The *preprocess_data* function reads the data from HDFS into an R data frame. We use it first to read the feature matrix based on 2007 into the *data_2007* R dataframe (used as a training set), and then similarly to build *data_2008* as the testing set. \n",
536 | "\n",
537 | "In this cell, we also define a helper function *get_metrics* that we will use later to measure precision, recall, F1 and accuracy.\n",
538 | "\n",
539 | "Now let's run R's random forest algorithm and evaluate the results:"
540 | ]
541 | },
542 | {
543 | "cell_type": "code",
544 | "collapsed": false,
545 | "input": [
546 | "%%R\n",
547 | "\n",
548 | "rf.model = randomForest(train_x, train_y, ntree=40)\n",
549 | "rf.pr <- predict(rf.model, newdata=test_x)\n",
550 | "m.rf = get_metrics(as.logical(rf.pr), as.logical(test_y))\n",
551 | "print(sprintf(\"Random Forest: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", m.rf[1], m.rf[2], m.rf[3], m.rf[4]))"
552 | ],
553 | "language": "python",
554 | "metadata": {},
555 | "outputs": [
556 | {
557 | "metadata": {},
558 | "output_type": "display_data",
559 | "text": [
560 | "[1] \"Random Forest: precision=0.43, recall=0.30, F1=0.35, accuracy=0.69\"\n"
561 | ]
562 | }
563 | ],
564 | "prompt_number": 12
565 | },
566 | {
567 | "cell_type": "markdown",
568 | "metadata": {},
569 | "source": [
570 | "Let's also try R's Gradient Boosted Machines (GBM) modeling. \n",
571 | "GBM is an ensemble method that like random forest is typically robust to over-fitting."
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "collapsed": false,
577 | "input": [
578 | "%%R\n",
579 | "\n",
580 | "gbm.model <- gbm.fit(train_x, as.numeric(train_y)-1, n.trees=500, verbose=F, shrinkage=0.01, distribution=\"bernoulli\", \n",
581 | " interaction.depth=3, n.minobsinnode=30)\n",
582 | "gbm.pr <- predict(gbm.model, newdata=test_x, n.trees=500, type=\"response\")\n",
583 | "m.gbm = get_metrics(gbm.pr >= 0.5, as.logical(test_y))\n",
584 | "print(sprintf(\"Gradient Boosted Machines: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", m.gbm[1], m.gbm[2], m.gbm[3], m.gbm[4]))"
585 | ],
586 | "language": "python",
587 | "metadata": {},
588 | "outputs": [
589 | {
590 | "metadata": {},
591 | "output_type": "display_data",
592 | "text": [
593 | "[1] \"Gradient Boosted Machines: precision=0.53, recall=0.10, F1=0.17, accuracy=0.72\"\n"
594 | ]
595 | }
596 | ],
597 | "prompt_number": 13
598 | },
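GBM predicts a probability, so the 0.5 cutoff used above is a free parameter that trades precision against recall. A minimal sketch of sweeping it, reusing gbm.pr and test_y from the cells above:

```r
# Lower cutoffs favor recall, higher cutoffs favor precision
for (cutoff in c(0.3, 0.4, 0.5)) {
  m = get_metrics(gbm.pr >= cutoff, as.logical(test_y))
  print(sprintf("cutoff=%.1f: precision=%0.2f, recall=%0.2f, F1=%0.2f",
                cutoff, m[1], m[2], m[3]))
}
```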
599 | {
600 | "cell_type": "markdown",
601 | "metadata": {},
602 | "source": [
603 | "Using Random Foresta and Gradient Boosted Machines we get pretty good results for our predictive model using a simple set of features. Following the same iterative model from the previous parts of this blog post, we now use additional data sources to enrich our core dataset and create new features that would help us achieve better predictive performance. "
604 | ]
605 | },
606 | {
607 | "cell_type": "heading",
608 | "level": 2,
609 | "metadata": {},
610 | "source": [
611 | "Pre-processing - iteration 2"
612 | ]
613 | },
614 | {
615 | "cell_type": "markdown",
616 | "metadata": {},
617 | "source": [
618 | "As we have demonstrated in part 1 and 2 of this blog post, a way to improve accuracy for this model is to layer-in weather data and with it add more informative features to our model. We can get this data from a publicly available dataset here: http://www.ncdc.noaa.gov/cdo-web/datasets//\n",
619 | "\n",
620 | "We now add these additional weather-related features to our model: daily temperatures (min/max), wind speed, snow conditions and precipitation in the flight origin airport (ORD). \n",
621 | "\n",
622 | "So let's re-write our Scalding program to add these new features to our feature matrix:"
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "collapsed": false,
628 | "input": [
629 | "%%writefile preprocess2.scala\n",
630 | "\n",
631 | "package com.hortonworks.datascience.demo2\n",
632 | "\n",
633 | "import com.twitter.scalding._\n",
634 | "import org.joda.time.format._\n",
635 | "import org.joda.time.{Days, DateTime}\n",
636 | "import com.hortonworks.datascience.demo2.ScaldingFlightDelays._\n",
637 | "\n",
638 | "/**\n",
639 | " * pre-process flight and weather data into feature matrix - iteration #2\n",
640 | " */\n",
641 | "class ScaldingFlightDelays(args: Args) extends Job(args) {\n",
642 | "\n",
643 | " val delayData = Csv(args(\"delay\"), \",\", fields = delayInSchema, skipHeader = true)\n",
644 | " .read\n",
645 | " .project(delaySchema)\n",
646 | " .filter(('Origin,'Cancelled)) { x:(String,String) => x._1 == \"ORD\" && x._2 == \"0\"}\n",
647 | " .mapTo(filterSchema -> featureSchmea)(gen_features)\n",
648 | "\n",
649 | " val weatherData = Csv(args(\"weather\"),\",\", fields = weatherInSchema)\n",
650 | " .read\n",
651 | " .project(weatherSchema)\n",
652 | " .filter('Station){x:String => x == \"USW00094846\"}\n",
653 | " .filter('Metric){m:String => m == \"TMIN\" | m == \"TMAX\" | m == \"PRCP\" | m == \"SNOW\" | m == \"AWND\"}\n",
654 | " .mapTo(weatherSchema -> ('Date,'MM)){tuple:(String,String,String,String) => (tuple._2,tuple._3+\":\"+tuple._4)}\n",
655 | " .groupBy('Date){_.foldLeft('MM -> 'Measures)(Map[String,Double]()){\n",
656 | " (m,s:String) => {val kv = s.split(\":\"); m + (kv(0) -> kv(1).toDouble)}\n",
657 | " }\n",
658 | " }\n",
659 | "\n",
660 | " delayData.joinWithSmaller(('flightDate,'Date),weatherData)\n",
661 | " .project('delay,'m,'dm,'dw,'h,'dist,'holiday,'Measures)\n",
662 | " .mapTo(joinSchema -> outputSchema){x:(Double,Double,Double,Double,Double,Double,Double,Map[String,Double]) => {\n",
663 | " (x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8(\"TMIN\"), x._8(\"TMAX\"), x._8(\"PRCP\"), x._8(\"SNOW\"), x._8(\"AWND\"))\n",
664 | " }\n",
665 | " }\n",
666 | " .write(Csv(args(\"output\"),\",\"))\n",
667 | "}\n",
668 | "\n",
669 | "object ScaldingFlightDelays {\n",
670 | " val delayInSchema = List('Year, 'Month, 'DayofMonth, 'DayOfWeek,\n",
671 | " 'DepTime, 'CRSDepTime, 'ArrTime, 'CRSArrTime,\n",
672 | " 'UniqueCarrier, 'FlightNum, 'TailNum,\n",
673 | " 'ActualElapsedTime, 'CRSElapsedTime, 'AirTime,\n",
674 | " 'ArrDelay, 'DepDelay, 'Origin, 'Dest,\n",
675 | " 'Distance, 'TaxiIn, 'TaxiOut,\n",
676 | " 'Cancelled, 'CancellationCode, 'Diverted,\n",
677 | " 'CarrierDelay, 'WeatherDelay, 'NASDelay,\n",
678 | " 'SecurityDelay, 'LateAircraftDelay)\n",
679 | " val weatherInSchema = List('Station, 'Date, 'Metric, 'Measure, 'v1, 'v2, 'v3, 'v4)\n",
680 | " val delaySchema = List('Year, 'Month, 'DayofMonth, 'DayOfWeek,\n",
681 | " 'CRSDepTime, 'DepDelay, 'Origin, 'Distance, 'Cancelled);\n",
682 | " val weatherSchema = List('Station, 'Date, 'Metric, 'Measure)\n",
683 | " val filterSchema = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'DepDelay, 'Distance)\n",
684 | " val featureSchmea = List('flightDate,'delay,'m,'dm,'dw,'h,'dist,'holiday);\n",
685 | " val joinSchema = List('delay,'m,'dm,'dw,'h,'dist,'holiday,'Measures)\n",
686 | " val outputSchema = List('delay,'m,'dm,'dw,'h,'dist,'holiday,'tmin,'tmax,'prcp,'snow,'awnd)\n",
687 | "\n",
688 | " val holidays = List(\"01/01/2007\", \"01/15/2007\", \"02/19/2007\", \"05/28/2007\", \"06/07/2007\", \"07/04/2007\",\n",
689 | " \"09/03/2007\", \"10/08/2007\" ,\"11/11/2007\", \"11/22/2007\", \"12/25/2007\",\n",
690 | " \"01/01/2008\", \"01/21/2008\", \"02/18/2008\", \"05/22/2008\", \"05/26/2008\", \"07/04/2008\",\n",
691 | " \"09/01/2008\", \"10/13/2008\" ,\"11/11/2008\", \"11/27/2008\", \"12/25/2008\")\n",
692 | "\n",
693 | " def gen_features(tuple: (String,String,String,String,String,String,String)) = {\n",
694 | " val (year, month,dayOfMonth,dayOfWeek:String, crsDepTime:String,depDelay:String,distance:String) = tuple\n",
695 | " val date = to_date(year.toInt,month.toInt,dayOfMonth.toInt)\n",
696 | " val hour = get_hour(crsDepTime)\n",
697 | " val holidayDist = days_from_nearest_holiday(year.toInt,month.toInt,dayOfMonth.toInt)\n",
698 | "\n",
699 | " (date,depDelay,month,dayOfMonth,dayOfWeek,hour,distance,holidayDist.toString)\n",
700 | " }\n",
701 | "\n",
702 | " def get_hour(depTime: String) = \"%04d\".format(depTime.toInt).take(2)\n",
703 | "\n",
704 | " def to_date(year: Int, month: Int, day: Int) = \"%04d%02d%02d\".format(year, month, day)\n",
705 | "\n",
706 | " def days_from_nearest_holiday(year:Int, month:Int, day:Int) = {\n",
707 | " val sampleDate = new DateTime(year, month, day, 0, 0)\n",
708 | "\n",
709 | " holidays.foldLeft(3000) { (r, c) =>\n",
710 | " val holiday = DateTimeFormat.forPattern(\"MM/dd/yyyy\").parseDateTime(c)\n",
711 | " val distance = Math.abs(Days.daysBetween(holiday, sampleDate).getDays)\n",
712 | " math.min(r, distance)\n",
713 | " }\n",
714 | " }\n",
715 | "}\n"
716 | ],
717 | "language": "python",
718 | "metadata": {},
719 | "outputs": [
720 | {
721 | "output_type": "stream",
722 | "stream": "stdout",
723 | "text": [
724 | "Overwriting preprocess2.scala\n"
725 | ]
726 | }
727 | ],
728 | "prompt_number": 14
729 | },
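The groupBy/foldLeft over 'MM in the job above is essentially a long-to-wide pivot of the NOAA weather file: the per-metric rows for each date are collected into a single map. The same reshaping in R, assuming the reshape2 package is available (sample values are hypothetical):

```r
library(reshape2)

# Long format: one row per (date, metric) pair, as in the raw NOAA file
weather <- data.frame(Date    = c("20070101", "20070101", "20070101"),
                      Metric  = c("TMIN", "TMAX", "SNOW"),
                      Measure = c(-50, 30, 120))

# Wide format: one row per date, one column per metric -- the shape the join expects
dcast(weather, Date ~ Metric, value.var = "Measure")
```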
730 | {
731 | "cell_type": "code",
732 | "collapsed": false,
733 | "input": [
734 | "%%capture capt2007_2\n",
735 | "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess2.scala --delay \"airline/delay/2007.csv\" --weather \"airline/weather/2007.csv\" --output \"airline/fm/ord_2007_sc_2\""
736 | ],
737 | "language": "python",
738 | "metadata": {},
739 | "outputs": [],
740 | "prompt_number": 9
741 | },
742 | {
743 | "cell_type": "code",
744 | "collapsed": false,
745 | "input": [
746 | "capt2007_2.stderr"
747 | ],
748 | "language": "python",
749 | "metadata": {},
750 | "outputs": [
751 | {
752 | "metadata": {},
753 | "output_type": "pyout",
754 | "prompt_number": 10,
755 | "text": [
756 | "''"
757 | ]
758 | }
759 | ],
760 | "prompt_number": 10
761 | },
762 | {
763 | "cell_type": "code",
764 | "collapsed": false,
765 | "input": [
766 | "%%capture capt2008_2\n",
767 | "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess2.scala --delay \"airline/delay/2008.csv\" --weather \"airline/weather/2008.csv\" --output \"airline/fm/ord_2008_sc_2\""
768 | ],
769 | "language": "python",
770 | "metadata": {},
771 | "outputs": [],
772 | "prompt_number": 11
773 | },
774 | {
775 | "cell_type": "code",
776 | "collapsed": false,
777 | "input": [
778 | "capt2008_2.stderr"
779 | ],
780 | "language": "python",
781 | "metadata": {},
782 | "outputs": [
783 | {
784 | "metadata": {},
785 | "output_type": "pyout",
786 | "prompt_number": 12,
787 | "text": [
788 | "''"
789 | ]
790 | }
791 | ],
792 | "prompt_number": 12
793 | },
794 | {
795 | "cell_type": "heading",
796 | "level": 2,
797 | "metadata": {},
798 | "source": [
799 | "Modeling - iteration 2"
800 | ]
801 | },
802 | {
803 | "cell_type": "markdown",
804 | "metadata": {},
805 | "source": [
806 | "Now let's re-build the Random Forest and Gradient Boosted Tree models with the enahanced feature matrices.\n",
807 | "As before, we first prepare our training set and test set:"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "collapsed": false,
813 | "input": [
814 | "%%R\n",
815 | "\n",
816 | "# Read input files\n",
817 | "process_dataset <- function(filename) {\n",
818 | " cols = c('delay', 'month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday', \n",
819 | " 'tmin', 'tmax', 'prcp', 'snow', 'awnd')\n",
820 | " \n",
821 | " data = read_csv_from_hdfs(filename, cols)\n",
822 | " data$delay = as.factor(as.numeric(data$delay) >= 15)\n",
823 | " data$month = as.factor(data$month)\n",
824 | " data$day = as.factor(data$day)\n",
825 | " data$dow = as.factor(data$dow)\n",
826 | " data$hour = as.numeric(data$hour)\n",
827 | " data$distance = as.numeric(data$distance)\n",
828 | " data$days_from_holiday = as.numeric(data$days_from_holiday)\n",
829 | " data$tmin = as.numeric(data$tmin)\n",
830 | " data$tmax = as.numeric(data$tmax)\n",
831 | " data$prcp = as.numeric(data$prcp)\n",
832 | " data$snow = as.numeric(data$snow)\n",
833 | " data$awnd = as.numeric(data$awnd)\n",
834 | " data\n",
835 | "}\n",
836 | "\n",
837 | "# Prepare training set and test/validation set\n",
838 | "\n",
839 | "data_2007 = process_dataset('/user/demo/airline/fm/ord_2007_sc_2')\n",
840 | "data_2008 = process_dataset('/user/demo/airline/fm/ord_2008_sc_2')\n",
841 | "\n",
842 | "fcols = setdiff(names(data_2007), c('delay'))\n",
843 | "train_x = data_2007[,fcols]\n",
844 | "train_y = data_2007$delay\n",
845 | "\n",
846 | "test_x = data_2008[,fcols]\n",
847 | "test_y = data_2008$delay"
848 | ],
849 | "language": "python",
850 | "metadata": {},
851 | "outputs": [],
852 | "prompt_number": 7
853 | },
854 | {
855 | "cell_type": "markdown",
856 | "metadata": {},
857 | "source": [
858 | "And now to build a Random Forest model:"
859 | ]
860 | },
861 | {
862 | "cell_type": "code",
863 | "collapsed": false,
864 | "input": [
865 | "%%R\n",
866 | "\n",
867 | "rf.model = randomForest(train_x, train_y, ntree=40)\n",
868 | "rf.pr <- predict(rf.model, newdata=test_x)\n",
869 | "m.rf = get_metrics(as.logical(rf.pr), as.logical(test_y))\n",
870 | "print(sprintf(\"Random Forest: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", \n",
871 | " m.rf[1], m.rf[2], m.rf[3], m.rf[4]))"
872 | ],
873 | "language": "python",
874 | "metadata": {},
875 | "outputs": [
876 | {
877 | "metadata": {},
878 | "output_type": "display_data",
879 | "text": [
880 | "[1] \"Random Forest: precision=0.58, recall=0.38, F1=0.45, accuracy=0.74\"\n"
881 | ]
882 | }
883 | ],
884 | "prompt_number": 8
885 | },
886 | {
887 | "cell_type": "markdown",
888 | "metadata": {},
889 | "source": [
890 | "And the gradient boosted tree model:"
891 | ]
892 | },
893 | {
894 | "cell_type": "code",
895 | "collapsed": false,
896 | "input": [
897 | "%%R\n",
898 | "\n",
899 | "gbm.model <- gbm.fit(train_x, as.numeric(train_y)-1, n.trees=500, verbose=F, shrinkage=0.01, distribution=\"bernoulli\", \n",
900 | " interaction.depth=3, n.minobsinnode=30)\n",
901 | "gbm.pr <- predict(gbm.model, newdata=test_x, n.trees=500, type=\"response\")\n",
902 | "m.gbm = get_metrics(gbm.pr >= 0.5, as.logical(test_y))\n",
903 | "print(sprintf(\"Gradient Boosted Machines: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", \n",
904 | " m.gbm[1], m.gbm[2], m.gbm[3], m.gbm[4]))"
905 | ],
906 | "language": "python",
907 | "metadata": {},
908 | "outputs": [
909 | {
910 | "metadata": {},
911 | "output_type": "display_data",
912 | "text": [
913 | "[1] \"Gradient Boosted Machines: precision=0.63, recall=0.27, F1=0.38, accuracy=0.75\"\n"
914 | ]
915 | }
916 | ],
917 | "prompt_number": 9
918 | },
919 | {
920 | "cell_type": "heading",
921 | "level": 2,
922 | "metadata": {},
923 | "source": [
924 | "Summary"
925 | ]
926 | },
927 | {
928 | "cell_type": "markdown",
929 | "metadata": {},
930 | "source": [
931 | "In this 3rd part of our blog post we used an IPython notebook to demonstrate how to build a predictive model for airline delays. We have used Scalding on our HDP cluster to perform various types of data pre-processing and feature engineering tasks. We then applied a few machine learning algorithms such as random forest and gradient boosted machines to the resulting datasets and showed how through iterations we add new features resulting in better model performance."
932 | ]
933 | }
934 | ],
935 | "metadata": {}
936 | }
937 | ]
938 | }
--------------------------------------------------------------------------------
/crime-event-demo-R.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": false
7 | },
8 | "source": [
9 | "# Data Science Demo: predicting Crime resolution in San Francisco "
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "The City of San Francisco publishes historical crime events on their http://sfdata.gov website. \n",
17 | "\n",
18 | "I have loaded this dataset into HIVE. Let's use Spark to do some fun stuff with it!"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Setting Up SparkR"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "First we setup SparkR, create a SparkContext as well as a HiveContext to access data from Hive:"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 1,
38 | "metadata": {
39 | "collapsed": false
40 | },
41 | "outputs": [
42 | {
43 | "name": "stderr",
44 | "output_type": "stream",
45 | "text": [
46 | "\n",
47 | "Attaching package: ‘SparkR’\n",
48 | "\n",
49 | "The following objects are masked from ‘package:stats’:\n",
50 | "\n",
51 | " cov, filter, lag, na.omit, predict, sd, var\n",
52 | "\n",
53 | "The following objects are masked from ‘package:base’:\n",
54 | "\n",
55 | " colnames, colnames<-, intersect, rank, rbind, sample, subset,\n",
56 | " summary, table, transform\n",
57 | "\n",
58 | "Loading required package: ggplot2\n"
59 | ]
60 | }
61 | ],
62 | "source": [
63 | ".libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))\n",
64 | "\n",
65 | "library(SparkR)\n",
66 | "require(ggplot2)"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 2,
72 | "metadata": {
73 | "collapsed": false
74 | },
75 | "outputs": [
76 | {
77 | "name": "stdout",
78 | "output_type": "stream",
79 | "text": [
80 | "Launching java with spark-submit command /usr/hdp/2.3.4.1-10/spark///bin/spark-submit sparkr-shell /tmp/RtmpgS9r78/backend_port71f962c08568 \n"
81 | ]
82 | },
83 | {
84 | "data": {
85 | "text/plain": [
86 | "DataFrame[result:string]"
87 | ]
88 | },
89 | "execution_count": 2,
90 | "metadata": {},
91 | "output_type": "execute_result"
92 | }
93 | ],
94 | "source": [
95 | "# Create Spark Context\n",
96 | "sc = sparkR.init(master=\"yarn-client\", appName=\"SparkR Demo\",\n",
97 | " sparkEnvir = list(spark.executor.memory=\"4g\", spark.executor.instances=\"15\"))\n",
98 | "\n",
99 | "# Create HiveContext\n",
100 | "hc = sparkRHive.init(sc)\n",
101 | "sql(hc, \"use demo\")"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {
107 | "collapsed": false
108 | },
109 | "source": [
110 | "Let's take a look at the dataset - first 5 rows:"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 3,
116 | "metadata": {
117 | "collapsed": false
118 | },
119 | "outputs": [
120 | {
121 | "data": {
122 | "text/plain": [
123 | " incidentid category description dayofweek date_str\n",
124 | "1 150331521 LARCENY/THEFT GRAND THEFT FROM A BUILDING Wednesday 04/22/2015\n",
125 | "2 150341605 ASSAULT ATTEMPTED SIMPLE ASSAULT Sunday 04/19/2015\n",
126 | "3 150341605 ASSAULT THREATS AGAINST LIFE Sunday 04/19/2015\n",
127 | "4 150341605 OTHER OFFENSES CRUELTY TO ANIMALS Sunday 04/19/2015\n",
128 | "5 150341702 NON-CRIMINAL AIDED CASE, MENTAL DISTURBED Sunday 04/19/2015\n",
129 | " time district resolution address\n",
130 | "1 18:00 BAYVIEW NONE 2000 Block of EVANS AV\n",
131 | "2 12:15 CENTRAL NONE 800 Block of WASHINGTON ST\n",
132 | "3 12:15 CENTRAL NONE 800 Block of WASHINGTON ST\n",
133 | "4 12:15 CENTRAL NONE 800 Block of WASHINGTON ST\n",
134 | "5 12:03 MISSION EXCEPTIONAL CLEARANCE 1100 Block of SOUTH VAN NESS AV\n",
135 | " longitude latitude location\n",
136 | "1 -122.396315619126 37.7478113603031 (37.7478113603031, -122.396315619126)\n",
137 | "2 -122.40672716771 37.7950566945259 (37.7950566945259, -122.40672716771)\n",
138 | "3 -122.40672716771 37.7950566945259 (37.7950566945259, -122.40672716771)\n",
139 | "4 -122.40672716771 37.7950566945259 (37.7950566945259, -122.40672716771)\n",
140 | "5 -122.416557578218 37.7547485110398 (37.7547485110398, -122.416557578218)\n",
141 | " pdid\n",
142 | "1 15033152106304\n",
143 | "2 15034160504114\n",
144 | "3 15034160519057\n",
145 | "4 15034160528010\n",
146 | "5 15034170264020"
147 | ]
148 | },
149 | "execution_count": 3,
150 | "metadata": {},
151 | "output_type": "execute_result"
152 | }
153 | ],
154 | "source": [
155 | "crimes = table(hc, \"crimes\")\n",
156 | "df = head(crimes, 5)\n",
157 | "df"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "## Exploring the Dataset"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "What are the different types of crime resolutions?"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": 4,
177 | "metadata": {
178 | "collapsed": false
179 | },
180 | "outputs": [
181 | {
182 | "name": "stdout",
183 | "output_type": "stream",
184 | "text": [
185 | "+--------------------+\n",
186 | "| resolution|\n",
187 | "+--------------------+\n",
188 | "| LOCATED|\n",
189 | "| JUVENILE DIVERTED|\n",
190 | "| ARREST, CITED|\n",
191 | "| NOT PROSECUTED|\n",
192 | "|COMPLAINANT REFUS...|\n",
193 | "|CLEARED-CONTACT J...|\n",
194 | "| JUVENILE CITED|\n",
195 | "|PROSECUTED FOR LE...|\n",
196 | "|EXCEPTIONAL CLEAR...|\n",
197 | "| JUVENILE BOOKED|\n",
198 | "| UNFOUNDED|\n",
199 | "| PSYCHOPATHIC CASE|\n",
200 | "| JUVENILE ADMONISHED|\n",
201 | "|DISTRICT ATTORNEY...|\n",
202 | "|PROSECUTED BY OUT...|\n",
203 | "| ARREST, BOOKED|\n",
204 | "| NONE|\n",
205 | "+--------------------+\n"
206 | ]
207 | }
208 | ],
209 | "source": [
210 | "showDF(distinct(select(crimes, \"resolution\"))) "
211 | ]
212 | },
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "Let's define a crime as 'resolved' if it has any string except \"NONE\" in the resolution column.\n",
218 | "\n",
219 | "**Question**: How many crimes total in the dataset? How many resolved?"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 5,
225 | "metadata": {
226 | "collapsed": false
227 | },
228 | "outputs": [
229 | {
230 | "name": "stdout",
231 | "output_type": "stream",
232 | "text": [
233 | "[1] \"1750133 crimes total, out of which 700088 were resolved\"\n"
234 | ]
235 | }
236 | ],
237 | "source": [
238 | "total = count(crimes)\n",
239 | "num_resolved = count(filter(crimes, crimes$resolution != 'NONE'))\n",
240 | "print(paste0(total, \" crimes total, out of which \", num_resolved, \" were resolved\"))"
241 | ]
242 | },
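That works out to roughly a 40% overall resolution rate, which we can compute directly from the two counts:

```r
round(num_resolved / total, 2)   # ~0.40 -- fraction of crimes with a non-NONE resolution
```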
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "Let's look at the longitude/latitude values in more detail. Spark provides the describe() function to see this some basic statistics of these columns:"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 6,
253 | "metadata": {
254 | "collapsed": false
255 | },
256 | "outputs": [
257 | {
258 | "name": "stdout",
259 | "output_type": "stream",
260 | "text": [
261 | "+-------+-------------------+-------------------+\n",
262 | "|summary| long| lat|\n",
263 | "+-------+-------------------+-------------------+\n",
264 | "| count| 1750133| 1750133|\n",
265 | "| mean|-122.42263853403858| 37.771270995453996|\n",
266 | "| stddev|0.03069515078034428|0.47274630631731845|\n",
267 | "| min| -122.51364| 37.70788|\n",
268 | "| max| -120.5| 90.0|\n",
269 | "+-------+-------------------+-------------------+\n"
270 | ]
271 | }
272 | ],
273 | "source": [
274 | "c1 = select(crimes, alias(cast(crimes$longitude, \"float\"), \"long\"), \n",
275 | " alias(cast(crimes$latitude, \"float\"), \"lat\")) \n",
276 | "showDF(describe(c1))"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "Notice that the max values for longitude (-120.5) and latitude (90.0) seem strange. Those are not inside the SF area. Let's see how many bad values like this exist in the dataset: "
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 7,
289 | "metadata": {
290 | "collapsed": false
291 | },
292 | "outputs": [
293 | {
294 | "name": "stdout",
295 | "output_type": "stream",
296 | "text": [
297 | "[1] 143\n",
298 | "+------+----+\n",
299 | "| long| lat|\n",
300 | "+------+----+\n",
301 | "|-120.5|90.0|\n",
302 | "|-120.5|90.0|\n",
303 | "|-120.5|90.0|\n",
304 | "+------+----+\n"
305 | ]
306 | }
307 | ],
308 | "source": [
309 | "c2 = filter(c1, \"lat < 37 or lat > 38\")\n",
310 | "print(count(c2))\n",
311 | "showDF(limit(c2,3))\n"
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "Seems like this is a data quality issue where some data points just have a fixed (bad) value of -120.5, 90."
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "## Computing Neighborhoods\n",
326 | "\n",
327 | "Now I create a new dataset called crimes2:\n",
328 | "1. Without the points that have invalid longitude/latitude\n",
329 | "2. I calculate the neighborhood associated with each long/lat (for each crime), using ESRI geo-spatial UDFs\n",
330 | "3. Translate \"resolution\" to \"resolved\" (1.0 = resolved, 0.0 = unresolved)"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 8,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "DataFrame[result:int]"
344 | ]
345 | },
346 | "execution_count": 8,
347 | "metadata": {},
348 | "output_type": "execute_result"
349 | },
350 | {
351 | "data": {
352 | "text/plain": [
353 | "DataFrame[result:int]"
354 | ]
355 | },
356 | "execution_count": 8,
357 | "metadata": {},
358 | "output_type": "execute_result"
359 | },
360 | {
361 | "data": {
362 | "text/plain": [
363 | "DataFrame[result:int]"
364 | ]
365 | },
366 | "execution_count": 8,
367 | "metadata": {},
368 | "output_type": "execute_result"
369 | },
370 | {
371 | "data": {
372 | "text/plain": [
373 | "DataFrame[result:int]"
374 | ]
375 | },
376 | "execution_count": 8,
377 | "metadata": {},
378 | "output_type": "execute_result"
379 | },
380 | {
381 | "data": {
382 | "text/plain": [
383 | "DataFrame[result:string]"
384 | ]
385 | },
386 | "execution_count": 8,
387 | "metadata": {},
388 | "output_type": "execute_result"
389 | },
390 | {
391 | "data": {
392 | "text/plain": [
393 | "DataFrame[result:string]"
394 | ]
395 | },
396 | "execution_count": 8,
397 | "metadata": {},
398 | "output_type": "execute_result"
399 | }
400 | ],
401 | "source": [
402 | "sql(hc, \"add jar /home/jupyter/notebooks/jars/guava-11.0.2.jar\")\n",
403 | "sql(hc, \"add jar /home/jupyter/notebooks/jars/spatial-sdk-json.jar\")\n",
404 | "sql(hc, \"add jar /home/jupyter/notebooks/jars/esri-geometry-api.jar\")\n",
405 | "sql(hc, \"add jar /home/jupyter/notebooks/jars/spatial-sdk-hive.jar\")\n",
406 | "\n",
407 | "sql(hc, \"create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains'\")\n",
408 | "sql(hc, \"create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point'\")\n",
409 | "\n",
410 | "cf1 = sql(hc, \n",
411 | "\"SELECT date_str, time, longitude, latitude, resolution, category, district, dayofweek, description\n",
412 | " FROM crimes\n",
413 | " WHERE longitude < -121.0 and latitude < 38.0\")\n",
414 | "cf2 = repartition(cf1, 50)\n",
415 | "registerTempTable(cf2, \"cf\")\n",
416 | "\n",
417 | "crimes2 = sql(hc, \n",
418 | "\"SELECT date_str, time, dayofweek, category, district, description, longitude, latitude,\n",
419 | " if (resolution == 'NONE',0.0,1.0) as resolved,\n",
420 | " neighborho as neighborhood \n",
421 | " FROM sf_neighborhoods JOIN cf\n",
422 | " WHERE ST_Contains(sf_neighborhoods.shape, ST_Point(cf.longitude, cf.latitude))\")\n",
423 | "\n",
424 | "# cache(crimes2)\n",
425 | "registerTempTable(crimes2, \"crimes2\")"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": 9,
431 | "metadata": {
432 | "collapsed": false
433 | },
434 | "outputs": [
435 | {
436 | "name": "stdout",
437 | "output_type": "stream",
438 | "text": [
439 | "+----------+-----+---------+-------------+--------+--------------------+-----------------+----------------+--------+------------+\n",
440 | "| date_str| time|dayofweek| category|district| description| longitude| latitude|resolved|neighborhood|\n",
441 | "+----------+-----+---------+-------------+--------+--------------------+-----------------+----------------+--------+------------+\n",
442 | "|02/26/2015|16:30| Thursday| NON-CRIMINAL|RICHMOND| FOUND PROPERTY|-122.507892223803|37.7807186482924| 0.0| Seacliff|\n",
443 | "|02/24/2015|10:00| Tuesday|LARCENY/THEFT|RICHMOND|LICENSE PLATE OR ...| -122.50706596541|37.7807037558799| 0.0| Seacliff|\n",
444 | "|01/02/2015|18:40| Friday|LARCENY/THEFT|RICHMOND|GRAND THEFT FROM ...|-122.511298876781|37.7759977595559| 0.0| Seacliff|\n",
445 | "|08/18/2014|08:30| Monday|LARCENY/THEFT|RICHMOND|PETTY THEFT OF PR...|-122.508907927108|37.7807278478539| 0.0| Seacliff|\n",
446 | "|04/13/2014|15:00| Sunday|LARCENY/THEFT|RICHMOND|GRAND THEFT FROM ...|-122.513642064265|37.7784692199467| 0.0| Seacliff|\n",
447 | "+----------+-----+---------+-------------+--------+--------------------+-----------------+----------------+--------+------------+\n"
448 | ]
449 | }
450 | ],
451 | "source": [
452 | "showDF(limit(crimes2, 5))"
453 | ]
454 | },
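Since the JOIN with ST_Contains is an inner join, any crime whose point falls outside every neighborhood polygon is silently dropped. Comparing row counts before and after the join is a quick sanity check:

```r
print(count(cf2))      # crimes with valid coordinates
print(count(crimes2))  # crimes matched to a neighborhood polygon
```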
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {},
458 | "source": [
459 | "**Question:** what is the percentage of crimes resolved in each neighborhood?"
460 | ]
461 | },
462 | {
463 | "cell_type": "code",
464 | "execution_count": 10,
465 | "metadata": {
466 | "collapsed": false
467 | },
468 | "outputs": [
469 | {
470 | "data": {
471 | "text/plain": [
472 | " neighborhood pres count\n",
473 | "1 Downtown/Civic Center 0.5718149 285122\n",
474 | "2 Mission 0.5021143 190606\n",
475 | "3 Bayview 0.4725739 119412\n",
476 | "4 Haight Ashbury 0.4458685 41898\n",
477 | "5 Golden Gate Park 0.4154171 15606\n",
478 | "6 South of Market 0.3928699 241230\n",
479 | "7 Ocean View 0.3848783 30724\n",
480 | "8 Bernal Heights 0.3728814 38881\n",
481 | "9 Excelsior 0.3717961 40810\n",
482 | "10 Outer Mission 0.3685316 37123"
483 | ]
484 | },
485 | "execution_count": 10,
486 | "metadata": {},
487 | "output_type": "execute_result"
488 | }
489 | ],
490 | "source": [
491 | "ngrp = groupBy(crimes2, \"neighborhood\")\n",
492 | "df = summarize(ngrp, pres = avg(crimes2$resolved), count = n(crimes2$resolved))\n",
493 | "df_sorted = arrange(df, desc(df$pres))\n",
494 | "head(df_sorted, 10)"
495 | ]
496 | },
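head() on a SparkR DataFrame returns a local R data.frame, so the summary above can be plotted directly with ggplot2 (loaded at the top of the notebook). A minimal sketch:

```r
top10 <- head(df_sorted, 10)
ggplot(top10, aes(x = reorder(neighborhood, pres), y = pres)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = "neighborhood", y = "fraction of crimes resolved")
```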
497 | {
498 | "cell_type": "code",
499 | "execution_count": 11,
500 | "metadata": {
501 | "collapsed": false
502 | },
503 | "outputs": [
504 | {
505 | "data": {
506 | "text/plain": []
507 | },
508 | "execution_count": 11,
509 | "metadata": {},
510 | "output_type": "execute_result"
511 | },
512 | {
513 | "data": {
514 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeAAAAHgCAIAAADytinCAAAgAElEQVR4nO3deXRV5b3/8eeck3k2EyEQmjDEEGICdSAgEqESZoI41TpApRWnRW+7enXVqam363q97br21noBwcrF6vVX0Qw2yKBgilBQBkVEhmAGkgCZyMl8cqbfH3v1rBSpcpKT53k85/36g3WyydnfZ+/s/Tn7PHvvZ5vcbrcAAOjHrLoBAIBLI6ABQFMENABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTQT6ZS3t7e0lJicvlMpvNy5Yti42NNabb7faysjKr1SqEuOWWW+Li4nxSDgACgW+OoLdv356Xl7dy5cq8vLxt27Z5pu/evTshIWHlypXf/e53d+7c6ZNaABAgfBPQ1dXVWVlZQoisrKzq6mrP9GPHjuXl5QkhcnJybrjhBp/UAoAA4Zsujr6+vvDwcCFEWFhYX1+fZ/qFCxf2799/8ODB2NjYhQsXJiUlGdNfe+21+vp64/X1119/7bXXDr0NJpPJZDK5XK6hz2oQzGaz2+1WMjSgyWQSQqgaldBisTidTiWl1a5ztRubqtJqNzaFC+7Djc3hcERGRl7mL/smoI1cDg8P9yS1weVyjRkzprCw8OTJk6WlpT/72c+M6UuWLHE4HMZrp9PZ3t4+9DYEBwfHxMT4ZFaDkJCQ0NnZabfb5ZeOjIy0WCwdHR3yS5tMpuTk5NbWVvmlhRBxcXE2m623t1d+6dDQ0MjISFUbW3Jycltbm2cPkikqKkoI0dXVJb+0xWJJSEhoamqSX1oIER8f39PTY7PZhj4rt9stO6AzMjJOnTqVm5tbVVWVkZHhmZ6cnDxixAiLxTJixIiBx1nR0dGe1x0dHT75e1ssFrfbrepozu12u1wuJdVdLpfJZFJS2mw2CyEUrnNVf3GXyxWYG5txCKlwwQNtnfsmoOfOnVtaWnrkyBGn07l06VIhRHFxcXFx8eLFi0tKSoxvwcuWLfNJLQAIEL4J6NjY2OXLlw+cUlxcLIQYPXr0j3/8Y5+UAIBAw40qAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFMENABoioAGAE0R0ACgKQIaADTlm1u9MXR2u338+PGjRo3y9o2DHvqytrZ23759aWlp3r4RgBwEtC76+vocDsfvlyyRVvGOtWvPnj1LQAPaIqD1kuP9EfSgfSchQVotAINAHzQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFMENABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATQWpbgAQoPbv3//666+HhYV5+8bw8HCbzeZyubx6l91uv/rqq++66y5vy0EhLQLabPbBgbzZbDaZTD6Z1SAYpYdS3WQy+bA9l8knbVa4zlX9xX2y4Lt27Xrrrbf+dd48HzXqm8odP75379577rlnKDNR+BdXu4OLv+/jQ5+P2+2+/F/WIqAtFsvQZ2KsO5/MatANGEp1JS0fYpuN3VXVOjd2GFXrzWQyDbG0yWRaVVCwqqDAV636epkjRqz/4ouht1mo21ZVlRY+3dicTufl/7IWAW2324c+E5PJ5Ha7fTKrQXC73Q6HYyjVHQ6HD9tz+UWH0mZjn1G1zl0ul9PpVFLdYrEMfWPzto9i6HzV5m/vOh80t9utZGPjJCEAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANKXFYEkXWbVq1WeffRYbG+vVu4zRxQYx5FB1dfULL7wwZ84cb98IAMNKx4D+6KOPvp+be016upxyz5w7d+LECQIagG50DOj4+Pjrx4+XFtBTx46VUwgAvEIfNABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkdx+IApPnf//3fhoYGb98VFBQUHBzc29s7iIp33XXXd77znUG8EQGIgEZAe/zxxxfn5aXGxXn1LocQfYMqt66ycvTo0ffee++g3o2AQ0AjoEVHRz86f763AT1oZ61WOYXgH+iDBgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BSj2UExh8Oxb98+t9vt7Rujo6Ptdntfn9cDf8bExOTl5Xn7LkA+AhqKffjhh3fddVfOqFFyyvX2959ubh7EIP2AfAQ0FHM6ndekp/+/Bx6QU66zr29ycbGcWsAQ0QcNAJoioAFAUwQ0AGjKN33Q7e3tJSUlLpfLbDYvW7YsNjZ24P+eP39+w4YNTzzxhE9qAUCA8M0R9Pbt2/Py8lauXJmXl7dt27aB/9Xb21tZWWm3231SCAACh2+OoKurqxctWiSEyMrK2rFjh2e6y+XasmXLnDlzjh07NvD3bTaby+UyXjscDrP5Hz4nTCaTT1p1+Uwm00VtGNwchjIT+UsthPBJm4e+6oby9sHxtFnhxqZ2wQfHJ3/xwTGbzUPfT4fCV9W9uuTfNwHd19cXHh4uhAgLCxt440BlZWVOTs4VV1xx0e+/9tprdXV1xuuCgoJZs2b9Q5uCZF/8Fx0dnZKSMsSZxMfHD+XtxgqULD4+fugLPsQ5fHXzkMDTZvkpGRMTY1SPjIxsl1s6JCRk6H9uIURkZOTQZzI4Pmn/4ISEhPhkPt3d3Zf/y76JQiOXw8PDPUltOHHiRGVlpfH6mWeeefrpp43X9913n+d3Ojo6GhsbB85Nfn/IV9vgraSkJKvV2t/fP+g5dHZ2DqUBg9PS0jKUBTebzSkpKUNcdW1tbUN5++B42uz5JieN1Wo1qnd1dUku3d/fP8Q/VkxMjBCio6PDRy3ygsViSUpKOnfunPzSQojExMSurq5B3LZ6SZf/CeebgM7IyDh16lRubm5VVVVGRoZn+gN/v/uguLjYk84AgMvhm4CeO3duaWnpkSNHnE7n0qVLhRDFxcXF3K8FQCc2m+3AgQOD+NoUGxvb29s7iK/IycnJV155pbfv8vBNQMfGxi5fvnzglIvSmbAGoFxpaenPfvYzaQO/tHZ1nbVahzLwC2NxAAgUNpttVlbWhhUr5JRrbG9f8OKLQ5kDdxICgKYIaADQFF0cQMBpamqaMmXKRUMyXA7jsvFBPF3BarUePnw4OTnZ2zcGOAIaCDhtbW2hQUFlq1ZJqzj3v/6rra2NgPYWAQ0EooSoqLSh3fvqbTlptfwJfdAAoCkCGgA0RUADgKbog/4HJ0+e3Lx58yDeGBERYbPZnE6nt2+cMGHCbbfdNoiKAPweAf0Ptm/f/uKLL64qKPD2jYMb3etATc2LtbU6BPT9999fUVEhs+Ly5cv//d//XWZF4FuHgL7YXfn5j86fL6fWgZq
an73zjpxaX6+np6e4qGjJ5Mlyyq394IPz7ZIHQ4Yuzp0719TU5O27LBZLXFxca2vrICpmZGRER0cP4o3KEdAQQgiz2RwbHh4r66EBseHhzeoejQG1li1bVltbK+0ivzNtbT/96U9//vOfyynnWwQ0AKlGjBjx1KxZs7Ky5JT7zdat395nonIUAwCaIqABQFMENABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFMENABoioAGAE0R0ACgKQIaADRFQAOAprwI6N/85jezZ8/u6uqaMmVKdHT02rVrh69ZAAAvAvo//uM//vSnP73zzju5ubnHjh0rLi4etlYBALwJ6ODgYKfTuXnz5ttuuy00NNTpdA5fswAAQZf/q08//XRWVlZubm5hYeGVV175xBNP+KoRwcHBA380mUy+mvNlslgsRhssFovk0iaTySgdFOTF38JXgoKCjOpms+yzEWazWdU6FwM2OYUbm/x1rsnGJn+dK9zYPOvcw6tDWy/+Tg899NBDDz1kvK6urr78N34j5QfjLpfLaIPL5ZJc2u12G6WVrATPgrvdbsmlPQsuf50LpZucDutcyeI7nU7lC67DxubV4iv4IP2qi9aakr+f0Qb5pcXfF19VaVXVdVjnSqrrsOBKSnsWXGFphX/uwfHiS1Z9ff3MmTNDQkJqa2tXrlzZ0dEx6KoAgG/kRUCvXr16wYIFdrs9MTHxwoULq1evHr5mAQC8COjKysqf//znQojIyMhNmzZVVFQMW6sAAN4EtNls9pwDDQkJUdKNBQCBw4uAzs/P37JlixCis7Pz0UcfnTt37rC1CgDgTUCvXbv2d7/7XXBwcEZGRm9v75o1a4avWQAALy6zy87OPnny5IgRI4avNQAAD+/G4nj++ec7OzuHrzUAAA/v7iQUQjz33HOeKZwnBIDh40VAE8cAIBMD9gOAprwI6FOnTs2fPz8yMjIiImLevHlVVVXD1ywAgBcBfccdd0ybNq2hoeHs2bPTpk37/ve/P3zNAgB4EdCnT59+8skn4+LiYmNjn3rqqVOnTg1fswAAXgT0ww8//Nvf/razs7Ozs/M3v/nNPffcM3zNAgB4cRXHs88+K4R47LHHPFNefPFFIcT//d//0d0BAD7HZXYAoCkuswMATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFMENABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFNBPplLe3t7SUmJy+Uym83Lli2LjY01pjscjpKSks7OTpvN9r3vfS8zM9Mn5QAgEPjmCHr79u15eXkrV67My8vbtm2bZ/qxY8diYmLuu+++W265pby83Ce1ACBA+OYIurq6etGiRUKIrKysHTt2eKYnJiampqYKIUJDQ51Op2d6S0tLf3+/8dpsNgcHBw+cm8lk8kmrLp/FYjHaYLFYJJc2mUxG6aAg3/wtvBIUFGRUN5tld3Z5/u7y17kQwrPJKdzY5K9zTTY2+etc4cbmWeceLpfr8t/um79TX19feHi4ECIsLKyvr88z3UjnxsbG8vLywsJCz/Rdu3Y1NjYar6+55pr8/PyBc5O/EsPDw+Pj440XkkubzWajtJKciomJMapftA1JEBISYpSOjo6WXFoIYZQWKsIiIiLCs7H1yy0dFBRklPZ0QsoUGxtrVJf/8eDZwSMiIiSXNplMno3N0NPTc/lv982aMnI5PDzck9QGt9u9a9euurq6oqKikSNHeqbfdtttntcdHR3nz58fODeHw+GTVl2+rq4uow1dXV2SSzudTqN0Z2en5NJCiLa2NqO6zWaTXLqvr88o3d7eLrm0EMKzyXl1OOMTnZ2dRvXu7m7Jpe12u1G6tbVVcmmjqFHdbrdLLt3d3a1qL3O5XBflm/DmoMQ3X7IyMjJOnTolhKiqqsrIyPBMP378eEdHx/LlywemMwDgcvjmCHru3LmlpaVHjhxxOp1Lly4VQhQXFxcXF58+fbq6unrNmjXGrz300EM+KQcAgcA3AR0bG7t8+fKBU4qLi4UQxplDAMAgcKMKAGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFMENABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFMENABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAU0GqGyCEECEhIQN/NJlMkhtgsViMNlgsFsmlTSaTUTo4OFhyaaOoUd1slv1RbTabjdJBQQo2Qs8mp3Zjc8gtrcnGFpg7uIfT6bz8t2sR0Ha7XW0DXC6X0QaXyyW5tNvtNkorWQkOh8Oo63a7JZf2rHOHQ3JMCaF0k3M6nUZ1r3ZUn1C7sdntdlUbm8J1Lr6ytr1afC0C+qIWy//7ud1uo6j80qqKekoH7IIrbIMOCx5o1VWV9uxig0MfNABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFMENABoioAGAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCkCGgA0RUADgKYIaADQFAENAJoioAFAUwQ0AGiKgAYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFNBPplLe3t7SUmJy+Uym83Lli2LjY39+ukAgG/kmyPo7du35+XlrVy5Mi8vb9u2bd84HQDwjXxzBF1dXb1o0SIhRFZW1o4dO75xemNjY19fn/E6NDQ0LCxs4NwuXLiwp6rK5nD4pG3faP+XX46dMyc0NFQIERQUtP/LL/dUVckpfaCmprm52Shts9mEENJKCyFOnDsXHBxsVO/t7d1TVZUQFSWn9J6qqvisLKN0cHDwZ/X10ha8s69PCGGUFkJ0dnbuqapKjYuTU/1ATc0NQUFGdYvFUllVNV3Wgu+pqurs7PSs88b2dpkbW2N7u2djM9Z5SJBvwucb7amqmp2X59nBD9fVSVvwxvZ2zzr3cDqdlz8H36yjvr6+8PBwIURYWJgneb9m+pEjR5qamozXkyZNysnJGTi3efPm7Tl69KjL5VUbTCaTyWRyefkuIYQ9PDw/Pz8qKkoIMX369E
2bNm388ktvZ2I2m91ut9vt9updVqu1qKjIKB0aGpqdnT2I0iaTSQjhbWkhRHp6+qRJk4zqhYWFb775ZtugFnwQ6/y8233b975nlL7qqqtSRo2StuAOh+Paa6+N+vtHUWFhYXljY1hbm4TSQojg2NipU6ca1a+//vrS0tLBbWyDWOetnZ1z5swxSmdmZo4bN07mxjZu3LjMzEyj+ty5c3ft2lUra8E7LJZp06YZpadOnRqblCRtwfv6+goLC6P+8bhnYBJ+c91BrOuveu6551avXh0eHt7T0/OHP/zh0Ucf/frpA3V0dHR1dQ29DSEhIbGxsc3NzUOf1SAkJSVZrdb+/n75paOioiwWi9VqlV/abDanpKQ0NjbKLy2EiI+Pt9ls3d3d8kuHhYVFRUW1tLTILy2ESElJaW1ttdvt8kvHxMQIITo6OuSXtlgsSUlJ586dk19aCJGYmNjV1eVVtn6N1NTUy/xN3/RBZ2RknDp1SghRVVWVkZHxjdMBAN/INwE9d+7cw4cP/+lPfzp8+HBhYaEQori4+JLTAQCXyTd90LGxscuXLx84xQjor04HAFwmblQBAE0R0ACgKQIaADRFQAOApghoANAUAQ0AmiKgAUBTBDQAaIqABgBNEdAAoCnfjGY3FL4aza6vr+/cuXPp6elDn9Ug1NTUpKSkXDSwtRwXLlyw2+3JycnyS7tcrpMnT2ZlZckvLYRoaGiIiopS8pie7u7u1tbWMWPGyC8thDh16tSYMWMuGmVYDmMAv8TERPmlHQ5HdXX1hAkT5JcWQtTV1SUkJERGRvpkbpc/mp2kMbO/RkxMjDGG4RDV1dV9+umn06dPH/qsBqGsrGz06NGXv9596Msvv7RarZMnT5Zfuq+vb8OGDbNnz5ZfWgjxwQcfTJgwYeLEifJLnzx58tixY/n5+fJLCyE2bdqUlZU1YsQI+aWPHTsmhMjNzZVf2mq1/u1vfysoKJBfWgixdevWGTNmyN/B6eIAAE0R0ACgKfVdHL4SFhY2evRoVdVHjx6tpANaCBEXFxccHKyktNlsHjt2rJLSQoiUlBSfdI4NQkREhJLuLEN6enpISIiS0vHx8UrqCiGCgoJUnWESQqSmpkZERMivq/4kIQDgkujiAABNEdD+o7KyMgBLA37Mf/qgPXbu3JmZmZmammo2B9bHT09PT21t7ZgxY4xHxAdIaSGE2+321G1vb4+Li5NW+osvvsjMzLRYLNIqDvTOO+8cPHjQ86PxnDk5FO5lCpdaCNHR0eE582G1Wof7Mnw/DOjExMRDhw5t3bo1Pj5+woQJV111lczqO3bsiIiICA8P37Jlyw033CDzss39+/fv37/f86PMDVdhaSFERUXFwoULXS7X3/72t88+++zBBx+UVvrMmTO7du1KT0/Py8tLTU2V/Pl09uzZxx9/XMkJQ4V7maql/uSTT0pLSwdOiY+PX7169bAW9cOThC6X69y5czU1NUePHnU6nTJ3VyHESy+9dN9997366qu33nrrG2+88eMf/1hm9cB04MCBmpqatra2zMzMGTNmBAVJPexwuVynT5/esWOHy+XKy8ubMmVKVFSUnNLvv/9+bm5uYmKi/C8uCvcyhUsthNi8efOtt94qrZwfHkE/99xzCQkJ06dPv/vuu+VfGWMymbq7uy0WS2RkpP99+P0zDQ0N5eXlzc3NSUlJRUVFki9Bu+aaa0wmU1BQUEFBgeSdtrGx8fPPP6+urh4zZsykSZMaGhpefvnln/zkJ3Kq7969e/fu3Z4fZX5xUbiXKVxqIYTMdBZ+eQRdU1NTU1Nz9uxZIcTIkSNvvPFGmdUrKyt3795dVFRUX18fGhoq8zboEydOHD58+MSJE7fccovD4ZB58/f69etvuummMWPG1NbW7ty580c/+pGcupfcOWXuscXFxXfdddfYsWONbmiXy7Vnz54bbrhBWgNUUbuXKSR5L/PDI+iUlBSbzdbX19fQ0NDc3Cy5+pQpU4x+56uuuspqtcos/eGHH/7whz985plnJk2a9Morr8gM6ODg4IyMDCHE2LFj//rXv0qra2Txu+++m52dreQU5dSpU0NCQjwnysxms8x0VvjFReFepvbrmuS9zA8D+tVXXx03blx2dnZhYaHMU8xKziFcpLOz0/jXbrfLrGu322tqatLS0urq6hwOh8zSIoDPjm7ZsmXevHnGF5ctW7ZI++Ii1O1lQulSG2TuZX7YxaGW5HMIA505c2bLli3nz5+Pjo5euHBhZmamtNINDQ1lZWUtLS2JiYlFRUWjRo2SVvoiH3/88bXXXququmQbN25csWLFV1/7N7VLLXkv88MjaIVdsUKIkJCQgYdRMg+pYmNjV61aZbyW1rtSXFz85JNPrl+/3vixqalp/fr1kg8kt27dum/fPuN1cnKyzIC+aEklL7jCLy4K9zK1X9dGjx7t2cva29uHu5wfBrTCrlghxPnz55944gnJoxcp7F0xIklyMF2ktrb2scce++CDD6ZNmzaww0ECY8Hdbndzc/Phw4dllhZCLFiwYOAXF5mlFe5lCpdaSL/o3g8DWqjrihVCpKamtre3S75Ic/LkyZMnT1bYu6JWWFhYeHi40+mMi4u7cOGC/AaYTKYrrriitrZWWkUdvrjI38t0WOqUlJS33nrLuOhewl0OftgHraorVvklXwpvgVV4/6QQYu3atTfffPPu3bvnz5//+uuvy7w5aOBKvvrqqxcvXiyttFoKT3god/DgwTNnzhQVFUk4CPOrgDY+YH/9619fNFFRc2R76aWXVqxYoeTGX7X3T+7bt6+7uzs1NfWtt96aNm3a9773PZnVA03A7mVKjsD8KqB1MPAPZjabExISFi5cKGegcYW3wK5fv/72228vKyu7++67N2zYcP/998upe/Dgwe3bt1sslgULFuTk5MgpOtCZM2fKy8tbW1tTUlKWLFmSkpIis7qSLy4Dh6ZSQu3XNckX3fthH/Qnn3ySnZ2t6pET+fn5WVlZaWlpZ86cOX78eGpqakVFxcMPPyyhtMJbYDMzM1944YWioqJt27aNHz9eWt29e/c+9NBDLS0tmzdvVhLQf/nLXwoLCzMyMqqrq8vLy6V9Mhmqq6uNLy4/+clP3njjDTlR9cILL1x55ZWTJk0aNWqUkqRWstQekq9898OA7ujo2LRpU1JS0uTJk+XfXdbU1DRv3jwhRHp6emVl5Zw5c0pKSuSUVvI1c2DRt956y3gh7Qb3sLCw2NjYmJiY3t5eORUvEhoaOmHCBCHEhAkTBn46yqFk4JdVq1adOnVq7969LS0tEyZMyMnJSUlJkbmXqR3uZuAG//HHHw93OT8M6JkzZ86cObOpqWnPnj1lZWWS7+Xr7+83hs6pq6vr7+8/ffp0QkKCzAZIprbz0cgFhd+4ExISvvzyy+985zu1tbXJycmSqyv54hIaGpqTk5OTk2O3241h/Nrb22XuZaq+rhkkX3Tvh33Qra2tX3zxxcmTJ8PDw3NyciSPB33+/PnS0tLz588nJiYuXbr0xIkT48ePT0tLk1Ba7R06Sii8ckb5RTtqd
Xd3Hzt27PPPP3c6nTk5OVOnTlXdIknWrVt37733ei66nzt37rCW88Mj6IqKipycnDvvvDM8PFxmXePs9po1a4wfm5qaXnrpJZl7rMJ7B1R9NigMRKN0eXn5oUOHVLVByemyQ4cOHT16tKurKycnZ8mSJfKf8632JKHki+79MKCvvfZaJU8h0uGeOlV36Ki9e1MhJTeOeig5XdbS0jJnzhzJ/c4DqT1J2Nvbe/78+b6+vu7ubmN3G1Z+GNBqn0KkUGFh4RtvvGE2m19++eWFCxdKrq7w7k2FlNw46qHkdFlhYaGcQv+M2pOEkydPPnr06KRJk55//vlp0zB/TuQAABKCSURBVKYNdzk/7IMWSp9CpLwjWMllqgF4X5kOfdAKnw6hUEAttR8GtOcpRKmpqcZTiA4ePCjtKUQvv/yy8WX/l7/85SuvvHLfffdJKGq1Wt99993Zs2cnJye/8847Fy5cWLJkicyHWzudzubm5pSUlNra2tGjR6t6ynXgCMzb+dQutZKPZD/s4ti3b99VV101e/Zsi8Xy8ccfT58+XfKHkPwv+yUlJfn5+UlJSUKIRYsWnThxory8/N5775VTXQjx9ttvp6WlpaSk1NXVffTRR7fddpu00kKDby3yKT/hoWTgF7VLfdNNNzU1NQUHB6enp48dOzYyMlJCUT8M6IiIiNdee814bVyoKPMpREo6gm02W1ZWlvHaZDJlZWXJfO6UEKK7uzs/P18IccMNN2zcuFFmaaH0FKXaKwoUOnv27OOPP67qfl0lZsyYIYSw2+21tbV79uzp6emJioq66aabhrWoHwa0wtGBhRBpaWme8bylcbvdLpfL8+Qhp9Mp+Uyd0+msq6sbPXp0Y2OjkpOEqk5RqrqiQHkP+Lhx46xWq+Szo8qXWvx9X3M6nQP3uOHjhwGtanRghVvPuHHjysrKCgoK4uLi2tvbd+7cKfkOq/nz55eXlzc1NSUkJMgfQ13h5SuqrihQ8nV7ICUDv6hd6g8//NBTvaCgICIiQkJRPzxJqHB0YENxcbHkT3WXy7V79+5Dhw51dHRER0dPmTJl5syZMs/Utbe3e85JGo+6kFZaCNHR0RETE2O8tlqtsbGx0koPvKLAYrFIvgTN+Lr95Zdfyvm6/c9Ifg6kqqXmJKFvSL5QUQdms7mgoEBhB2hZWVlubm5eXt7evXurqqqkPcdT+ZPUCwoKZsyYUV1dbbPZqqqqJAe05K/bAyl8DqSqpVZyctKvAvrgwYPbtm1LSEiYMWNGRUWFxWKRdvlzgLvnnnsqKiree++97373uzKvHvnqs74kDDBmcLlcdXV1R48era6ubm1tXbp06fz58+WUFoq+bg+k5EyP8qWWz68Ceu/evY888khbW9vGjRtXrlwZHx//xz/+UdowLpd8mLffX5pq2Lt3b2dn57Jly/bu3XvkyBHJF7pFRUV51rO0o7nnn38+LS0tLy9v/vz5//Zv/yZ5kd977z3jhaqHnCk506N8qeXzq4AOCwuLiYmJjo4WQhgDyMkcL0n5hqLwbpHQ0NA777zTZDKlp6dLvsJPKDqay87Orq6uPn36tPwTdEKDjU3ykBQG5Ustn9R+q+GmfHRgtd5+++2amhohRF1d3dtvvy2n6IYNG4QQ11577a9+9SshhMViOX36tJzSHkqO5ubPn//AAw9ceeWVBw4cCA0NLSkpkb/gCg080zN27FjVzZHBarUKITo6OmQW9asj6Pr6+kDrWxhI7d0iCik5mhNCmM3mcePGjRs3btGiRadOnTp48OC4ceOkVVfL2NKEEE8++aTalkjz5ptv1tfXXzSRqzi8EIChPJDyu0VUUX7dTlBQ0MSJEydOnCi/tCoKb6+vrKxUcsHSj370IyHEwDPSEvhVQKul/DYntXeLKGGz2d57772FCxf+/ve///zzz2NiYhobG1U3KiAovL2+p6entrZW/uNGDdOmTVuzZk1zc3NSUlJRUVFqauqwliOgfearWSztki9jlK+XXnrJ+LG5uXnDhg1yPhvUditVVFQYT/Roa2v7xS9+8eGHHxojRkECVbfXS36u9kW2bNkyb968MWPG1NbWbtmyxTisHj4EtI8puYBf4ShfaruVmpqabr75ZuO1MTTwK6+8kpubK60BCm9iVEvh7fVqN7ng4OCMjAwhxNixYyVcsERA+5jCoZo++eST7OzsgBpgzGw2G99zjZ3WZDI5HA45pZXfxKhWampqUVGR55pO1c2Rx26319TUpKWl1dXVSdjYCGgfUzVUkxCio6Nj06ZNSUlJkydPVtVDJ193d7fnSmTjQig5vnoTY0BROAK42uG/FyxYUFZWZgw4I+FMDwHtY6ou+RJCzJw5c+bMmU1NTXv27CkrKwuEo7lrrrnmzTffXLx4cVxcXH19/ZYtWySf3w8JCbnkHaR+T+E1nWqfUDxq1KiHHnpIWjkC2scUXvLV2tr6xRdfnDx5Mjw8fNasWTJLqzJlyhS73b5p06bOzs6kpKSZM2dmZ2fLbIDap3orpPaazsB5QrEfDjeqlpJHARk2bdqUk5MzceJEmTe4B7iKiorrrrtO1VO9FWpsbBx4TafMbuiAekIxAe1jL7300ooVKySfqQvMR4iqpfyyd1XcbvfHH3+8b98+YxDwqVOnXnfddTI/n9ReOVNeXn7o0CHPj8P9Fyegfez999/Pzc0NwEMqBIj9+/efPHly7ty5CQkJLS0tW7duzczMlNObp8OVM+vXr1+xYoW0Ti0C2scu+kSVf0jldrsD7bNBYbfSwFpmszkhIWHhwoXp6enSGiDf2rVrf/jDH4aGhho/9vX1vfLKKw8++KC0Bqi9ckZypxYnCX1s1qxZ8gcKsFqt77777uzZs5OTk//yl79cuHBhyZIlnmdQ+T2FT5jOz8/PyspKS0s7c+bM8ePHU1NTKyoqHn74YfktkcZisXjSWQgRFhYmc2BbIcTRo0ePHj3q+VHa57Gn0MA7hBks6VtGyUABJSUl+fn5xl3OixYtOnHiRHl5ucwnm6il5AnThqampnnz5gkh0tPTKysr58yZU1JSIrkNkjmdTpvN5slom83mcrlkNsDIRLfb3dzcfPjwYcl1JSOgfUzJQAE2my0rK8t4bTKZsrKy5I+ar5CSJ0wb+vv7q6urx4wZU1dX19/ff/r06YSEBGnVlZg8efKf//znefPmxcfHt7a2bt26dcqUKfKbYTKZrrjiitraWsl1JZ+iJKB9TMnHrPEYTc8DNJ1Op99fHzqQwmsnFi9eXFpaev78+cTExKVLl544ccLvxxE0niH3+uuvG/GUn59/3XXXyWzAwD/31VdfLa2uklOUnCT0Bzt27Ojq6iooKIiLi2tvb9+5c2d0dPTcuXNVt8ufcWljYJJ8ipKA9jElAwW4XK7du3cfOnSoo6MjOjp6ypQpM2fOlHzqRiG1gzN4fPzxx3IGLwxYyi/BFtIv06KLw8eUDBSwZ8+ezMzMmTNnBtoFdgaFgzMoGV02YH300UcnT578/ve/77kE2+VySR5QQfIpSr96aKwm5A8UMHXq1AsXLuzatauysrKu
rk7yWXUdqBqcwRhddurUqf/yL/8SOA8kVOXw4cO33357cnKyxWIZMWLEHXfc8cknnyhpibRTlBxB+5iSgcxDQkKys7Ozs7MdDkd1dfUHH3xgNptvvPFGOdWVUzh4vMLRZQOQ8kuwhfRTlAS0zxw8eHDbtm0JCQkzZsx49913+/r6JO+xF91QFwgBfVEPoNVqff3112WeqVM4umwAUn4JtpB+HpiA9pm9e/c+8sgjbW1tGzduXLlyZXx8/B//+EfjmiQ5FN5Qp4qxt5SUlOTm5qanp9fU1Bw/flxmA5Q/UDyg6HAJdkNDQ3l5OQ+N/fYJCwuLiYmJjo4WQqSlpQkhJA/7qfCGOrUuXLhg9P+OGzeusrJSZmlj0HohxJNPPimzbmBSfgm24KGx315GLCoMR4U31KnV399fVVWVkZFRXV0t7SRhwA43qpDJZMrPz/d8KCoh+aGxXAftM+yxqjQ0NBiPiZPzrfMixcXF/JUDx/r16+fMmWM8NPb9998f7iNoAtp/7Ny5MzMzMzU11XPPt9/T4UORgA4onqMB46Gxo0aNGtZydHH4j8TExEOHDm3dujU+Pn7ChAlXXXWV6hYNO5IR0lRVVZWVla1atWrBggV//vOfe3t7e3t7h7soR9D+w+VynTt3rqam5ujRo06nU+YY6mpJPrFu0OHgHTKtW7fu5ptvTkpK2rhx44wZM+Lj4zdv3rxq1aphLcoRtP947rnnEhISpk+ffvfdd0dERKhujjyST6wbyOJAExISkpycbLPZOjs7x48fbzKZJPQlBkpnZSC48847MzMzP/vss7Kysg8++EB1c+QxTqxbLJaxY8cGBXHMgWFht9utVuunn35qXMVRW1sr4TYZtmb/kZKSYrPZ+vr6GhoampubVTdHHrvdXlNTY5xYdzgcqpsD/1RQULBu3Tqz2XzPPfd8/vnn77333s033zzcRemD9h/r168fN27c+PHjR48eHTgXcgjpJ9YBaQJoN/Z7CxYsOHHixMaNG9etW9fY2Ki6OTLYbLaKiopRo0Y5HI64uDiHw7Fz507VjQJ8hi4O/6HkXJlaFRUV8fHxQoi2trZf/OIXH374ofHkXMA/cATtPwLwXFlTU1NBQYHxOjQ0dPbs2QcOHFDbJMCHCGj/YZwrczqd1dXVAXKuzGw2G4OfGBe9mUymAFlwBIiAOM4KEAsWLBh4rkx1cyTp7u6OjIw0XlutVrWNAXyLqzj8R0dHR0xMjPHaGI9RbXskOHTo0JEjRxYvXhwXF1dfX79ly5aCgoLs7GzV7QJ8g4D2B5988klpaenAKfHx8atXr1bVHmncbvdHH320d+/ezs7OpKSkmTNnTpo0SXWjAJ8hoP3H5s2bb731VtWtAOAzBLT/uOiZhAwWAXzbcZLQfwTgMwkB/8Zldv7DeCYhX4kAv8ERtP8I2GcSAv6KPmgA0BRdHP5j586d9fX1EsaoBSAHR9D+48iRIzU1NU1NTYHzTELAv9EH7T9ycnISExONZxKeP3+egAa+7TiC9h/PPvus8UzCsWPHBtQzCQF/RUD7j5qampqamrNnzwohRo4ceeONN6puEYAhoYvDfwTsMwkBf8URtP8I2GcSAv6KgPYfDQ0N5eXlzc3NSUlJRUVFqampqlsEYEjo4vAfAfhMQsC/8UXYfwTgMwkB/0ZA+48AfCYh4N/og/YfDQ0NA59JOGrUKNUtAjAkBDQAaIqeSn/Q09Pz3nvvnTx5sqenJyIiIjMz86abbuJmQuDbjiNof/Dqq6+OHTs2Nzc3MjKyu7v7008/rampufvuu1W3C8CQcJLQH/T09Fx//fXR0dFmszk6OnrGjBnd3d2qGwVgqAhof2CxWL5xCoBvHfqg/UF9fT0PuAL8D33QAKApujgAQFMENABoioAGAE0R0Pj2MZlM3v7v179lcIVUzQqBg4DGt89jjz2mugmADFzFAX9jMl1iq77kxMHNyoetAr4eR9DQhclkWrdu3ZVXXpmYmPif//mfQoiGhobFixcnJSUlJSXdeuutDQ0Nnt8UQlit1h/84Afx8fFpaWl/+MMfBvYhPPvssxMnTkxMTHz22Wc9E3/605/GxMRMmTLl+PHjQoiWlpbbb789MTExMTHxjjvuaGlp8cx87dq1sbGxl5zVJd/1zyYuW7YsOjo6NTX117/+9fCvP/ghAhoa6enpOX78+M6dO5988kkhxIoVK+6///6mpqbGxsarr776Bz/4wcBfXr16dVxcXGNjY1VV1YkTJwb+V1hY2BdffPHXv/71l7/8pWdiSkpKe3t7UVHRI488IoR48MEHk5KS6uvrGxoakpKSHn74Yc9vVldXe2Z40awu+a5LTnzggQfi4uLOnTt3+vTpM2fODNMag3/jaxd0YTKZuru7jUH4jA6BkJAQu93u+YXg4OD+/n7P/8bExNTW1l5xxRVCiLa2toSEBGNjNplMPT094eHhYkDHgslk6ujoiI6Obm9vHzlyZG9vb1RU1JkzZzxvT09P7+joGPibl5zVJd91yYmRkZF1dXUJCQlCiJaWlqSkJPY1eIsjaGjkoiFSY2JiOjs73W632+3u6uo6derUwP91Op2ebo2LnmJuROolmUwm43lgbrfb83aTyeR0Oj2/Y6TzJWd1yXddcuLALhcu4cDgENDQ17Jly5566qn+/v7u7u577733V7/61cD/Xbhw4VNPPWWz2fr7+y/6r0tas2aNy+X67//+71mzZgkh5s+f//TTT9tsNpvN9tRTTy1cuPBymnTJd11y4oIFCx599NHu7u6+vj6jxwbwFgENff32t789d+7cyJEj09PTo6Kifve73w383zVr1pw9ezYpKSkzM3PSpEnf+JzcpqamuLi4ioqK3//+90KItWvXNjY2pqamjho1qqmpac2aNZfTpEu+65IT/+d//qetrW3kyJEZGRkTJ04c5CpAYKMPGt9WpaWlU6dOHTlypBCipqamoKCgtrZWdaMAX+IIGt9Wu3bt+td//df29vbW1tbHH398wYIFqlsE+BgBjW+rZ555xm63p6amjhkzxu12D7zkGfAPdHEAgKY4ggYATRHQAKApAhoANEVAA4CmCGgA0BQBDQCaIqABQFP/H2xFiPeojrphAAAAAElFTkSuQmCC"
515 | },
516 | "metadata": {},
517 | "output_type": "display_data"
518 | }
519 | ],
520 | "source": [
521 | "ggplot(data=take(df_sorted, 10), aes(x=neighborhood, y=pres)) + \n",
522 | " geom_bar(colour=\"black\", stat=\"identity\", fill=\"#DD8888\") + guides(fill=FALSE) +\n",
523 | " theme(axis.text.x = element_text(angle = 90, hjust = 1))"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": 12,
529 | "metadata": {
530 | "collapsed": true
531 | },
532 | "outputs": [],
533 | "source": [
534 | "dfl = collect(df)"
535 | ]
536 | },
537 | {
538 | "cell_type": "markdown",
539 | "metadata": {},
540 | "source": [
541 | "Using the R leaflet package I draw an interactive map of San Francisco, and color-code each neighborhood with the percent of resolved crimes:"
542 | ]
543 | },
544 | {
545 | "cell_type": "code",
546 | "execution_count": 13,
547 | "metadata": {
548 | "collapsed": false
549 | },
550 | "outputs": [
551 | {
552 | "name": "stderr",
553 | "output_type": "stream",
554 | "text": [
555 | "Loading required package: leaflet\n",
556 | "Loading required package: htmlwidgets\n"
557 | ]
558 | },
559 | {
560 | "data": {
561 | "text/html": [
562 | ""
563 | ]
564 | },
565 | "metadata": {},
566 | "output_type": "display_data"
567 | }
568 | ],
569 | "source": [
570 | "require(leaflet)\n",
571 | "require(htmlwidgets)\n",
572 | "library(IRdisplay)\n",
573 | "library(jsonlite)\n",
574 | "\n",
575 | "sf_lat = 37.77\n",
576 | "sf_long = -122.4\n",
577 | "\n",
578 | "geojson <- readLines(\"data/sfn.geojson\") %>% paste(collapse = \"\\n\") %>% fromJSON(simplifyVector = FALSE)\n",
579 | "\n",
580 | "pal <- colorQuantile(\"OrRd\", dfl$pres, n=7)\n",
581 | "geojson$features <- lapply(geojson$features, function(feat) {\n",
582 | " feat$properties$style <- list(fillColor = pal(dfl[dfl$neighborhood==feat$properties$neighborho, \"pres\"]))\n",
583 | " feat\n",
584 | "})\n",
585 | "\n",
586 | "m <- leaflet() %>%\n",
587 | " setView(lng = sf_long, lat = sf_lat, zoom = 12) %>%\n",
588 | " addTiles() %>%\n",
589 | " addGeoJSON(geojson, weight = 1.5, color = \"#444444\", opacity = 1, fillOpacity = 0.3) %>%\n",
590 | " addLegend(\"topright\", pal = pal, values = dfl$pres, title = \"P(resolved)\", opacity = 1)\n",
591 | "\n",
592 | "tf = 'map1.html'\n",
593 | "saveWidget(m, file = tf, selfcontained = T)\n",
594 | "display_html(paste0(\"\"))"
595 | ]
596 | },
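{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (illustrative, not part of the original run): every neighborhood name in the GeoJSON should have a matching row in `dfl`, otherwise the `pal()` lookup above leaves that polygon unstyled:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# list GeoJSON neighborhoods with no match in dfl (an empty result means all matched)\n",
"geo_names <- sapply(geojson$features, function(feat) feat$properties$neighborho)\n",
"setdiff(geo_names, dfl$neighborhood)"
]
},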
597 | {
598 | "cell_type": "code",
599 | "execution_count": 14,
600 | "metadata": {
601 | "collapsed": false,
602 | "scrolled": false
603 | },
604 | "outputs": [
605 | {
606 | "data": {
607 | "text/plain": [
608 | "DataFrame[year:int, month:int, hour:int, resolved:double, category:string, district:string, dayofweek:string, description:string, neighborhood:string]"
609 | ]
610 | },
611 | "execution_count": 14,
612 | "metadata": {},
613 | "output_type": "execute_result"
614 | },
615 | {
616 | "data": {
617 | "text/plain": [
618 | "DataFrame[year:int, month:int, hour:int, resolved:double, category:string, district:string, dayofweek:string, description:string, neighborhood:string]"
619 | ]
620 | },
621 | "execution_count": 14,
622 | "metadata": {},
623 | "output_type": "execute_result"
624 | },
625 | {
626 | "name": "stdout",
627 | "output_type": "stream",
628 | "text": [
629 | "[1] \"training set has 426306 instances\"\n",
630 | "[1] \"test set has 150155 instances\"\n"
631 | ]
632 | }
633 | ],
634 | "source": [
635 | "# initial data frame\n",
636 | "crimes = sql(hc, \"SELECT cast(SUBSTR(date_str,7,4) as int) as year, \n",
637 | " cast(SUBSTR(date_str,1,2) as int) as month, \n",
638 | " cast(SUBSTR(time,1,2) as int) as hour,\n",
639 | " resolved, category, district, dayofweek, description, neighborhood\n",
640 | "FROM crimes2\n",
641 | "WHERE latitude > 37 and latitude < 38\")\n",
642 | "\n",
643 | "trainData = filter(crimes, \"year>=2011 and year<=2013\")\n",
644 | "cache(trainData)\n",
645 | "testData = filter(crimes, \"year=2014\")\n",
646 | "cache(testData)\n",
647 | "print(paste0(\"training set has \", count(trainData), \" instances\"))\n",
648 | "print(paste0(\"test set has \", count(testData), \" instances\"))"
649 | ]
650 | },
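{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the `SUBSTR` calls above assume `date_str` is formatted `MM/DD/YYYY` (characters 7-10 hold the year, characters 1-2 the month) and that `time` begins with a two-digit hour."
]
},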
651 | {
652 | "cell_type": "code",
653 | "execution_count": 15,
654 | "metadata": {
655 | "collapsed": false
656 | },
657 | "outputs": [
658 | {
659 | "data": {
660 | "text/plain": [
661 | "$coefficients\n",
662 | " Estimate\n",
663 | "(Intercept) -0.500995597\n",
664 | "month -0.013698270\n",
665 | "hour 0.007715841\n",
666 | "category_LARCENY/THEFT -1.966637160\n",
667 | "category_OTHER OFFENSES 1.270989783\n",
668 | "category_NON-CRIMINAL -0.246223825\n",
669 | "category_ASSAULT 0.230423056\n",
670 | "category_VANDALISM -1.396336159\n",
671 | "category_DRUG/NARCOTIC 2.355341637\n",
672 | "category_WARRANTS 2.898549300\n",
673 | "category_SUSPICIOUS OCC -1.348106274\n",
674 | "category_BURGLARY -0.875788725\n",
675 | "category_VEHICLE THEFT -1.987641681\n",
676 | "category_MISSING PERSON 1.665890169\n",
677 | "category_ROBBERY -0.822972194\n",
678 | "category_FRAUD -1.114991424\n",
679 | "category_SECONDARY CODES 0.622131629\n",
680 | "category_WEAPON LAWS 1.869005867\n",
681 | "category_TRESPASS 1.186409444\n",
682 | "category_STOLEN PROPERTY 2.607626808\n",
683 | "category_FORGERY/COUNTERFEITING -0.068146486\n",
684 | "category_PROSTITUTION 2.884462812\n",
685 | "category_SEX OFFENSES, FORCIBLE 0.192205213\n",
686 | "category_DRUNKENNESS 1.658611141\n",
687 | "category_RECOVERED VEHICLE -1.708570543\n",
688 | "category_DISORDERLY CONDUCT 1.109973872\n",
689 | "category_DRIVING UNDER THE INFLUENCE 3.161483733\n",
690 | "category_KIDNAPPING 0.715620360\n",
691 | "category_RUNAWAY 1.038663448\n",
692 | "category_LIQUOR LAWS 1.997762291\n",
693 | "category_ARSON -0.780926255\n",
694 | "category_EMBEZZLEMENT -0.716695368\n",
695 | "category_LOITERING 1.968992235\n",
696 | "category_SUICIDE 0.051681181\n",
697 | "category_FAMILY OFFENSES -0.253458450\n",
698 | "category_BRIBERY 1.121022119\n",
699 | "category_BAD CHECKS -1.076153246\n",
700 | "category_EXTORTION -0.147007892\n",
701 | "category_SEX OFFENSES, NON FORCIBLE 0.460873957\n",
702 | "category_GAMBLING 0.986811111\n",
703 | "category_PORNOGRAPHY/OBSCENE MAT 0.386166318\n",
704 | "district_SOUTHERN 0.033583808\n",
705 | "district_MISSION 0.475875779\n",
706 | "district_NORTHERN -0.021969016\n",
707 | "district_BAYVIEW 0.055176562\n",
708 | "district_CENTRAL -0.287558824\n",
709 | "district_INGLESIDE 0.193898515\n",
710 | "district_TENDERLOIN 0.640502351\n",
711 | "district_TARAVAL 0.240363593\n",
712 | "district_PARK 0.059166475\n",
713 | "dayofweek_Friday -0.073344790\n",
714 | "dayofweek_Saturday -0.015888935\n",
715 | "dayofweek_Wednesday 0.025025748\n",
716 | "dayofweek_Thursday -0.026655658\n",
717 | "dayofweek_Tuesday -0.015867908\n",
718 | "dayofweek_Sunday 0.013801078\n",
719 | "neighborhood_Downtown/Civic Center 0.153010689\n",
720 | "neighborhood_South of Market 0.160210287\n",
721 | "neighborhood_Mission -0.287916023\n",
722 | "neighborhood_Bayview 0.177175038\n",
723 | "neighborhood_Western Addition -0.175962726\n",
724 | "neighborhood_Financial District 0.237691313\n",
725 | "neighborhood_Castro/Upper Market -0.254762573\n",
726 | "neighborhood_Haight Ashbury 0.178012051\n",
727 | "neighborhood_North Beach 0.109416668\n",
728 | "neighborhood_Excelsior -0.110366558\n",
729 | "neighborhood_Outer Mission -0.171500135\n",
730 | "neighborhood_Bernal Heights -0.171167999\n",
731 | "neighborhood_Potrero Hill -0.177001258\n",
732 | "neighborhood_Inner Richmond -0.231279187\n",
733 | "neighborhood_Visitacion Valley -0.132683393\n",
734 | "neighborhood_Marina -0.355850651\n",
735 | "neighborhood_Outer Sunset -0.479070275\n",
736 | "neighborhood_Nob Hill 0.078041247\n",
737 | "neighborhood_Ocean View -0.290216260\n",
738 | "neighborhood_Lakeshore -0.046727214\n",
739 | "neighborhood_Russian Hill -0.206509945\n",
740 | "neighborhood_Outer Richmond -0.115498642\n",
741 | "neighborhood_Parkside -0.519061412\n",
742 | "neighborhood_Inner Sunset -0.478406620\n",
743 | "neighborhood_Pacific Heights -0.565766022\n",
744 | "neighborhood_West of Twin Peaks -0.474278575\n",
745 | "neighborhood_Chinatown 0.029137163\n",
746 | "neighborhood_Golden Gate Park -0.081706940\n",
747 | "neighborhood_Noe Valley -0.755653895\n",
748 | "neighborhood_Presidio Heights -0.238432798\n",
749 | "neighborhood_Crocker Amazon -0.206720503\n",
750 | "neighborhood_Glen Park -0.559210545\n",
751 | "neighborhood_Twin Peaks -1.017804121\n",
752 | "neighborhood_Seacliff -0.590237769\n",
753 | "neighborhood_Diamond Heights -0.460373358\n",
754 | "neighborhood_Treasure Island/YBI -0.536487896\n"
755 | ]
756 | },
757 | "execution_count": 15,
758 | "metadata": {},
759 | "output_type": "execute_result"
760 | }
761 | ],
762 | "source": [
763 | "model <- glm(resolved ~ month + hour + category + district + dayofweek + neighborhood, \n",
764 | " family = \"binomial\", data = trainData)\n",
765 | "summary(model)"
766 | ]
767 | },
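{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the model is binomial (logistic regression), each coefficient is a shift in log-odds; exponentiating turns it into an odds ratio relative to the reference level of that feature. An illustrative example, not part of the original run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# odds ratio for category DRUG/NARCOTIC, other features held fixed:\n",
"exp(2.355341637) # ~10.5x the odds of resolution vs. the reference category"
]
},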
768 | {
769 | "cell_type": "code",
770 | "execution_count": 16,
771 | "metadata": {
772 | "collapsed": false
773 | },
774 | "outputs": [],
775 | "source": [
776 | "# Predict results for test data\n",
777 | "pmat <- predict(model, testData)"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": 62,
783 | "metadata": {
784 | "collapsed": false
785 | },
786 | "outputs": [
787 | {
788 | "name": "stdout",
789 | "output_type": "stream",
790 | "text": [
791 | "[1] \"precision = 0.74, recall = 0.67, accuracy = 0.79\"\n"
792 | ]
793 | }
794 | ],
795 | "source": [
796 | "# Compute precision/recall curve\n",
797 | "\n",
798 | "registerTempTable(pmat, \"pmat\")\n",
799 | "cm = sql(hc, \"SELECT prediction, label, COUNT(*) as cnt from pmat GROUP BY prediction, label\")\n",
800 | "\n",
801 | "cml = collect(cm)\n",
802 | "tp = cml[cml['prediction']==1.0 & cml['label']==1.0, \"cnt\"]\n",
803 | "tn = cml[cml['prediction']==0.0 & cml['label']==0.0, \"cnt\"]\n",
804 | "fp = cml[cml['prediction']==1.0 & cml['label']==0.0, \"cnt\"]\n",
805 | "fn = cml[cml['prediction']==0.0 & cml['label']==1.0, \"cnt\"]\n",
806 | "precision = tp / (tp+fp)\n",
807 | "recall = tp / (tp+fn)\n",
808 | "accuracy = (tp+tn) / (tp+tn+fp+fn)\n",
809 | "\n",
810 | "print(sprintf(\"precision = %0.2f, recall = %0.2f, accuracy = %0.2f\", precision, recall, accuracy)) "
811 | ]
812 | },
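{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a single summary figure, the F1 score combines precision and recall (illustrative, reusing the local variables computed above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# F1 is the harmonic mean of precision and recall\n",
"f1 = 2 * precision * recall / (precision + recall)\n",
"print(sprintf(\"F1 = %0.2f\", f1)) # ~0.70 for the values above"
]
},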
813 | {
814 | "cell_type": "code",
815 | "execution_count": null,
816 | "metadata": {
817 | "collapsed": true
818 | },
819 | "outputs": [],
820 | "source": []
821 | }
822 | ],
823 | "metadata": {
824 | "kernelspec": {
825 | "display_name": "R",
826 | "language": "",
827 | "name": "r"
828 | },
829 | "language_info": {
830 | "codemirror_mode": "r",
831 | "file_extension": ".r",
832 | "mimetype": "text/x-r-source",
833 | "name": "R",
834 | "pygments_lexer": "r",
835 | "version": "3.2.3"
836 | }
837 | },
838 | "nbformat": 4,
839 | "nbformat_minor": 0
840 | }
841 |
--------------------------------------------------------------------------------