├── 2015-08-31
│   └── AdvancedDataScienceSpark.pdf
├── 2015-09-30
│   ├── Building_enterprise_data_lake_solution_using_spark_and_sequoiadb.pdf
│   └── Spark_Scala_vs_The_Rest.pdf
├── 2015-11-25
│   ├── 5Bullets_Nov25.pdf
│   ├── Continuous_Integration_for_Apache_Spark.pdf
│   ├── Oliver-SparkMeetup-2015-11-23.pdf
│   ├── Remote_QA_With_Denny_Lee.pptx
│   └── a-year-of-spark-at-flipp_toronto-spark-meetup_20151125.pdf
├── 2015-12-14
│   ├── 5Bullets_Dec14.pdf
│   ├── Organizational_Updates.gslides
│   └── Toronto_SparkMeetup_Dec142015.pdf
├── 2016-01-27
│   ├── Databricks_Spark_Summit_East_2016_Promo.pptx
│   ├── IntroToSpark_by_Adastra.pdf
│   └── SparkAsService_by_Sansom_Lee.pptx
├── 2016-02-24
│   └── ScalaJVMBigData-SparkLessons.pdf
├── 2016-03-30
│   ├── sustainable_spark_development.htm
│   └── tas_spark_realtime_risk_management_2016.pdf
├── 2016-04-27
│   ├── 5BP.md
│   └── SparkKafkaMeetup2016-04-27.pdf
├── 2016-05-25
│   ├── .ignore
│   ├── Hackathon-2016.pdf
│   └── Spark Tools and Methodologies at Shopify.pdf
├── 2016-06-29
│   ├── .ignore
│   └── Collaborative_Recommendations_by_Mo_kijiji.pdf
├── 2016-07-27
│   ├── ExperiencesinDeliveringSparkasaServiceIBM.pdf
│   ├── Readme.md
│   └── SparkStoriesWattpad.pdf
├── 2016-09-28
│   └── Continuous_Applications_with_Apache_Spark.pdf
├── 2016-10-26
│   ├── .ignore
│   ├── Shoehorning_Spark.pdf
│   └── Spark_in_production_pipelines.pdf
├── 2016-11-30
│   ├── Analyzing+Flight+Data+with+GraphFrames+2.ipynb
│   ├── GraphFrame basics.ipynb
│   └── Spark GraphFrames.pdf
├── 2017-01-25
│   ├── .gitignore
│   └── README.md
├── 2017-02-22
│   ├── README.md
│   └── TAS-2017.pdf
└── README.md

--------------------------------------------------------------------------------
/2015-08-31/AdvancedDataScienceSpark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-08-31/AdvancedDataScienceSpark.pdf
--------------------------------------------------------------------------------
/2015-09-30/Building_enterprise_data_lake_solution_using_spark_and_sequoiadb.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-09-30/Building_enterprise_data_lake_solution_using_spark_and_sequoiadb.pdf
--------------------------------------------------------------------------------
/2015-09-30/Spark_Scala_vs_The_Rest.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-09-30/Spark_Scala_vs_The_Rest.pdf
--------------------------------------------------------------------------------
/2015-11-25/5Bullets_Nov25.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/5Bullets_Nov25.pdf
--------------------------------------------------------------------------------
/2015-11-25/Continuous_Integration_for_Apache_Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/Continuous_Integration_for_Apache_Spark.pdf
--------------------------------------------------------------------------------
/2015-11-25/Oliver-SparkMeetup-2015-11-23.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/Oliver-SparkMeetup-2015-11-23.pdf
--------------------------------------------------------------------------------
/2015-11-25/Remote_QA_With_Denny_Lee.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/Remote_QA_With_Denny_Lee.pptx
--------------------------------------------------------------------------------
/2015-11-25/a-year-of-spark-at-flipp_toronto-spark-meetup_20151125.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/a-year-of-spark-at-flipp_toronto-spark-meetup_20151125.pdf
--------------------------------------------------------------------------------
/2015-12-14/5Bullets_Dec14.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-12-14/5Bullets_Dec14.pdf
--------------------------------------------------------------------------------
/2015-12-14/Organizational_Updates.gslides:
--------------------------------------------------------------------------------
{"url": "https://docs.google.com/open?id=179XryZKXI965WCmaIKjA9zz7SjUTJXOuxM8kEvl4Ee8", "doc_id": "179XryZKXI965WCmaIKjA9zz7SjUTJXOuxM8kEvl4Ee8", "email": "pazookime@gmail.com", "resource_id": "presentation:179XryZKXI965WCmaIKjA9zz7SjUTJXOuxM8kEvl4Ee8"}
--------------------------------------------------------------------------------
/2015-12-14/Toronto_SparkMeetup_Dec142015.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-12-14/Toronto_SparkMeetup_Dec142015.pdf
--------------------------------------------------------------------------------
/2016-01-27/Databricks_Spark_Summit_East_2016_Promo.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-01-27/Databricks_Spark_Summit_East_2016_Promo.pptx
--------------------------------------------------------------------------------
/2016-01-27/IntroToSpark_by_Adastra.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-01-27/IntroToSpark_by_Adastra.pdf
--------------------------------------------------------------------------------
/2016-01-27/SparkAsService_by_Sansom_Lee.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-01-27/SparkAsService_by_Sansom_Lee.pptx
--------------------------------------------------------------------------------
/2016-02-24/ScalaJVMBigData-SparkLessons.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-02-24/ScalaJVMBigData-SparkLessons.pdf
--------------------------------------------------------------------------------
/2016-03-30/sustainable_spark_development.htm:
--------------------------------------------------------------------------------
[HTML slide deck; markup, styles, and scripts were stripped in this dump. Recoverable text: page title "Introducing Apache Spark"; title slide "Sustainable Spark Development", Sean McIntyre, Software Architect.]
--------------------------------------------------------------------------------
/2016-03-30/tas_spark_realtime_risk_management_2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-03-30/tas_spark_realtime_risk_management_2016.pdf
--------------------------------------------------------------------------------
/2016-04-27/5BP.md:
--------------------------------------------------------------------------------
[5BP - April 2016](https://github.com/TorontoApacheSpark/Spark-Meetup-Five-Bullet-Points/blob/master/content/2016/april.md)
--------------------------------------------------------------------------------
/2016-04-27/SparkKafkaMeetup2016-04-27.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-04-27/SparkKafkaMeetup2016-04-27.pdf
--------------------------------------------------------------------------------
/2016-05-25/.ignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2016-05-25/Hackathon-2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-05-25/Hackathon-2016.pdf
--------------------------------------------------------------------------------
/2016-05-25/Spark Tools and Methodologies at Shopify.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-05-25/Spark Tools and Methodologies at Shopify.pdf
--------------------------------------------------------------------------------
/2016-06-29/.ignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2016-06-29/Collaborative_Recommendations_by_Mo_kijiji.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-06-29/Collaborative_Recommendations_by_Mo_kijiji.pdf
--------------------------------------------------------------------------------
/2016-07-27/ExperiencesinDeliveringSparkasaServiceIBM.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-07-27/ExperiencesinDeliveringSparkasaServiceIBM.pdf
--------------------------------------------------------------------------------
/2016-07-27/Readme.md:
--------------------------------------------------------------------------------
# Slides for the July 27, 2016 TAS Meetup

http://www.meetup.com/Toronto-Apache-Spark/events/232329359/
--------------------------------------------------------------------------------
/2016-07-27/SparkStoriesWattpad.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-07-27/SparkStoriesWattpad.pdf
--------------------------------------------------------------------------------
/2016-09-28/Continuous_Applications_with_Apache_Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-09-28/Continuous_Applications_with_Apache_Spark.pdf
--------------------------------------------------------------------------------
/2016-10-26/.ignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2016-10-26/Shoehorning_Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-10-26/Shoehorning_Spark.pdf
--------------------------------------------------------------------------------
/2016-10-26/Spark_in_production_pipelines.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-10-26/Spark_in_production_pipelines.pdf
--------------------------------------------------------------------------------
/2016-11-30/Analyzing+Flight+Data+with+GraphFrames+2.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the airport and flight data from Cloudant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cloudantHost='dtaieb.cloudant.com'\n",
    "cloudantUserName='weenesserliffircedinvers'\n",
    "cloudantPassword='72a5c4f939a9e2578698029d2bb041d775d088b5'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "airports = sqlContext.read.format(\"com.cloudant.spark\").option(\"cloudant.host\",cloudantHost)\\\n",
    "    .option(\"cloudant.username\",cloudantUserName).option(\"cloudant.password\",cloudantPassword)\\\n",
    "    .option(\"schemaSampleSize\", \"-1\").load(\"flight-metadata\")\n",
    "airports.cache()\n",
    "airports.registerTempTable(\"airports\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    }
   },
   "outputs": [],
   "source": [
    "import pixiedust\n",
    "display(airports)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "flights = sqlContext.read.format(\"com.cloudant.spark\").option(\"cloudant.host\",cloudantHost)\\\n",
    "    .option(\"cloudant.username\",cloudantUserName).option(\"cloudant.password\",cloudantPassword)\\\n",
    "    .option(\"schemaSampleSize\", \"-1\").load(\"pycon_flightpredict_training_set\")\n",
    "flights.cache()\n",
    "flights.registerTempTable(\"training\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    }
   },
   "outputs": [],
   "source": [
    "display(flights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build the vertices and edges dataframe from the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "422\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql import functions as f\n",
    "from pyspark.sql.types import *\n",
    "rdd = flights.flatMap(lambda s: [s.arrivalAirportFsCode, s.departureAirportFsCode]).distinct()\\\n",
    "    .map(lambda row:[row])\n",
    "vertices = airports.join(\n",
    "      sqlContext.createDataFrame(rdd, StructType([StructField(\"fs\",StringType())])), \"fs\"\n",
    "    ).dropDuplicates([\"fs\"]).withColumnRenamed(\"fs\",\"id\")\n",
    "print(vertices.count())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "edges=flights.withColumnRenamed(\"arrivalAirportFsCode\",\"dst\")\\\n",
    "    .withColumnRenamed(\"departureAirportFsCode\",\"src\")\\\n",
    "    .drop(\"departureWeather\").drop(\"arrivalWeather\").drop(\"pt_type\").drop(\"_id\").drop(\"_rev\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Install GraphFrames package using PixieDust packageManager"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Package already installed: graphframes:graphframes:0.1.0-spark1.6\n",
      "done\n"
     ]
    }
   ],
   "source": [
    "import pixiedust\n",
    "pixiedust.installPackage(\"graphframes:graphframes:0.1.0-spark1.6\")\n",
    "print(\"done\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create the GraphFrame from the Vertices and Edges Dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "graphMap"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from graphframes import GraphFrame\n",
    "g = GraphFrame(vertices, edges)\n",
    "display(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the degree for each vertex in the graph\n",
    "The degree of a vertex is the number of edges incident to the vertex. In a directed graph, the in-degree is the number of edges where the vertex is the destination and the out-degree is the number of edges where the vertex is the source. With GraphFrames, the degrees, outDegrees and inDegrees properties each return a DataFrame containing the id of the vertex and the number of edges. We then sort them in descending order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "degrees = g.degrees.sort(desc(\"degree\"))\n",
    "display( degrees )"
   ]
  },
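  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*(Editor's note: this cell and the next were added during editing and are not part of the original notebook.)* `degrees` counts all incident edges; `inDegrees` and `outDegrees` split that count into arriving and departing edges. A minimal sketch, assuming the GraphFrames `inDegrees`/`outDegrees` properties (DataFrames keyed by `id`) behave as described above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Added sketch: compare arriving vs. departing edge counts per airport.\n",
    "inOut = g.inDegrees.join(g.outDegrees, \"id\")\n",
    "display(inOut.orderBy(desc(\"inDegree\")))"
   ]
  },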
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute a list of shortest paths for each vertex to a specified list of landmarks\n",
    "For this we use the `shortestPaths` api that returns a DataFrame containing the properties for each vertex plus an extra column called distances that contains the number of hops to each landmark.\n",
    "In the following code, we use BOS and LAX as the landmarks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "r = g.shortestPaths(landmarks=[\"BOS\", \"LAX\"]).select(\"id\", \"distances\")\n",
    "#display(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Row(id=u'CAE', distances={})]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.take(1)"
   ]
  },
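  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*(Editor's note: this cell and the next were added during editing and are not part of the original notebook.)* An empty `distances` map, as in the `CAE` row above, means that vertex cannot reach any of the landmarks. A hedged sketch that keeps only reachable vertices, assuming `pyspark.sql.functions.size` accepts map columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Added sketch: drop vertices with no path to either BOS or LAX.\n",
    "from pyspark.sql.functions import size\n",
    "r.where(size(\"distances\") > 0).show(5)"
   ]
  },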
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the pageRank for each vertex in the graph\n",
    "[PageRank](https://en.wikipedia.org/wiki/PageRank) is a famous algorithm used by Google Search to rank vertices in a graph by order of importance. To compute pageRank, we'll use the `pageRank` api that returns a new graph in which the vertices have a new `pagerank` column representing the pagerank score for the vertex and the edges have a new `weight` column representing the edge weight that contributed to the pageRank score. We'll then display the vertex ids and associated pageranks sorted in descending order: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "edges"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "\n",
    "ranks = g.pageRank(resetProbability=0.20, maxIter=5)\n",
    "\n",
    "rankedVertices = ranks.vertices.select(\"id\",\"pagerank\").orderBy(desc(\"pagerank\"))\n",
    "rankedEdges = ranks.edges.select(\"src\", \"dst\", \"weight\").orderBy(desc(\"weight\") )\n",
    "\n",
    "ranks = GraphFrame(rankedVertices, rankedEdges)\n",
    "display(ranks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Search routes between 2 airports with specific criteria\n",
    "In this section, we want to find all the routes between Boston and San Francisco operated by United Airlines with at most 2 hops. To accomplish this, we use the `bfs` ([Breadth First Search](https://en.wikipedia.org/wiki/Breadth-first_search)) api that returns a DataFrame containing the shortest paths between matching vertices. For clarity we will only keep the edges when displaying the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "paths = g.bfs(fromExpr=\"id='BOS'\",toExpr=\"id = 'SFO'\",edgeFilter=\"carrierFsCode='UA'\", maxPathLength = 2)\\\n",
    "    .drop(\"from\").drop(\"to\")\n",
    "paths.cache()\n",
    "display(paths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Find all airports that do not have direct flights between each other\n",
    "In this section, we'll use a very powerful graphFrames search feature that uses a pattern called a [motif](http://graphframes.github.io/user-guide.html#motif-finding) to find nodes. We'll use the pattern `\"(a)-[]->(b);(b)-[]->(c);!(a)-[]->(c)\"`, which searches for all nodes a, b and c that have an edge from a to b and an edge from b to c but no edge from a to c. \n",
    "Also, because the search is computationally expensive, we reduce the number of edges by grouping the flights that have the same src and dst."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "\n",
    "h = GraphFrame(g.vertices, g.edges.select(\"src\",\"dst\")\\\n",
    "    .groupBy(\"src\",\"dst\").agg(count(\"src\").alias(\"count\")))\n",
    "\n",
    "query = h.find(\"(a)-[]->(b);(b)-[]->(c);!(a)-[]->(c)\").drop(\"b\")\n",
    "query.cache()\n",
    "display(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the strongly connected components for this graph\n",
    "[Strongly Connected Components](https://en.wikipedia.org/wiki/Strongly_connected_component) are components for which each vertex is reachable from every other vertex. To compute them, we'll use the `stronglyConnectedComponents` api that returns a DataFrame containing all the vertices with the addition of a `component` column that has the id of the component to which the vertex belongs. We then group all the rows by component and count the member vertices. This gives us a good idea of the distribution of component sizes in the graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "components = g.stronglyConnectedComponents(maxIter=10).select(\"id\",\"component\")\\\n",
    "    .groupBy(\"component\").agg(count(\"id\").alias(\"count\")).orderBy(desc(\"count\"))\n",
    "display(components)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detect communities in the graph using Label Propagation algorithm\n",
    "The [Label Propagation algorithm](https://en.wikipedia.org/wiki/Label_Propagation_Algorithm) is a popular algorithm for finding communities within a graph. It has the advantage of being computationally inexpensive and thus works well with large graphs. To compute the communities, we'll use the `labelPropagation` api that returns a DataFrame containing all the vertices with the addition of a `label` column that has the id of the community to which the vertex belongs. Similar to the strongly connected components, we then group all the rows by label and count the member vertices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    }
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "communities = g.labelPropagation(maxIter=5).select(\"id\", \"label\")\\\n",
    "    .groupBy(\"label\").agg(count(\"id\").alias(\"count\")).orderBy(desc(\"count\"))\n",
    "display(communities)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use AggregateMessages to compute the average flight delays by originating airport\n",
    "\n",
    "The AggregateMessages api is not currently available in Python, so we use the PixieDust Scala bridge to call out to the Scala API.\n",
    "Note: PixieDust automatically rebinds the Python GraphFrame variable g to a Scala GraphFrame with the same name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%scala\n",
    "import org.graphframes.lib.AggregateMessages\n",
    "import org.apache.spark.sql.functions.{avg,desc,floor}\n",
    "\n",
    "// For each airport, average the delays of the departing flights\n",
    "val msgToSrc = AggregateMessages.edge(\"deltaDeparture\")\n",
    "val __agg = g.aggregateMessages\n",
    "  .sendToSrc(msgToSrc)  // send each flight delay to source\n",
    "  .agg(floor(avg(AggregateMessages.msg)).as(\"averageDelays\"))  // average up all delays\n",
    "  .orderBy(desc(\"averageDelays\"))\n",
    "  .limit(10)\n",
    "__agg.cache()\n",
    "__agg.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "aggregation": "SUM",
      "handlerId": "barChart",
      "keyFields": "id",
      "showLegend": "true",
      "stacked": "true",
      "staticFigure": "false",
      "title": "Average Flight delays by originating airport",
      "valueFields": "averageDelays"
     }
    }
   },
   "outputs": [],
   "source": [
    "display(__agg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2 with Spark 1.6",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
--------------------------------------------------------------------------------
/2016-11-30/GraphFrame basics.ipynb:
--------------------------------------------------------------------------------
{"cells": [{"cell_type": "code", "execution_count": 47, "metadata": {"collapsed": false}, "source": "import pixiedust\npixiedust.installPackage(\"graphframes:graphframes:0.1.0-spark1.6\")", "outputs": [{"output_type": "stream", "name": "stdout", "text": "Package already installed: graphframes:graphframes:0.1.0-spark1.6\n"}, {"data": {"text/plain": ""}, "execution_count": 47, "metadata": {}, "output_type": "execute_result"}]}, {"cell_type": "code", "execution_count": 48, "metadata": {"collapsed": true}, "source": "from graphframes import GraphFrame", "outputs": []}, {"cell_type": "code", "execution_count": 49, "metadata": {"collapsed": true}, "source": "# Vertex DataFrame\nv = sqlContext.createDataFrame([\n  (\"a\", \"Alice\", 34),\n  (\"b\", \"Bob\", 36),\n  (\"c\", \"Charlie\", 30),\n  (\"d\", \"David\", 29),\n  (\"e\", \"Esther\", 32),\n  (\"f\", \"Fanny\", 36),\n  (\"g\", \"Gabby\", 60)\n], [\"id\", \"name\", \"age\"])\n\n# Edge DataFrame\ne = sqlContext.createDataFrame([\n  (\"a\", \"b\", \"friend\"),\n  (\"b\", \"c\", \"follow\"),\n  (\"c\", \"b\", \"follow\"),\n  (\"f\", \"c\", \"follow\"),\n  (\"e\", \"f\", \"follow\"),\n  (\"e\", \"d\", \"friend\"),\n  (\"d\", \"a\", \"friend\"),\n  (\"a\", \"e\", \"friend\")\n], [\"src\", \"dst\", \"relationship\"])", "outputs": []}, {"cell_type": "code", "execution_count": 50, "metadata": {"collapsed": true}, "source": "# Create a GraphFrame\ng = GraphFrame(v, e)", "outputs": []}, {"cell_type": "code", "execution_count": 51, "metadata": {"collapsed": false}, "source": "# take a look at the vertices (show)\ng.vertices.show()", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+---+-------+---+\n| id|   name|age|\n+---+-------+---+\n|  a|  Alice| 34|\n|  b|    Bob| 36|\n|  c|Charlie| 30|\n|  d|  David| 29|\n|  e| Esther| 32|\n|  f|  Fanny| 36|\n|  g|  Gabby| 60|\n+---+-------+---+\n\n"}]}, {"cell_type": "code", "execution_count": null, "metadata": {"collapsed": false}, "source": "# take a look at the edges (show)\ng.edges.show()", "outputs": []}, {"cell_type": "code", "execution_count": 39, "metadata": {"collapsed": false}, "source": "# find the youngest user in the group\ng.vertices.groupBy().min(\"age\").show()", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+--------+\n|min(age)|\n+--------+\n|      29|\n+--------+\n\n"}]}, {"cell_type": "code", "execution_count": 40, "metadata": {"collapsed": false}, "source": "# how many follows are in the graph?\nnumFollows = g.edges.filter(\"relationship = 'follow'\").count()\n\nprint \"Total number of follows: \" + str(numFollows)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "Total number of follows: 4\n"}]}, {"cell_type": "code", "execution_count": 52, "metadata": {"collapsed": false}, "source": "# Motif finding (DSL)\n# (a) - [e] -> (b)\n\n# Ex. Find all the pairs of vertices with edges in both directions (find)\ng.find(\"(a)-[]->(b); (b)-[]->(a)\").show()\n\n# find (filter) only those where one of the nodes is older than 30\ng.find(\"(a)-[]->(b); (b)-[]->(a)\").filter(\"a.age > 30\")\n\n# more complex: a->b, b->c but !a->c\ng.find(\"(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)\").show()", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+--------------+--------------+\n|             a|             b|\n+--------------+--------------+\n|[c,Charlie,30]|    [b,Bob,36]|\n|    [b,Bob,36]|[c,Charlie,30]|\n+--------------+--------------+\n\n+--------------+--------------+--------------+\n|             a|             b|             c|\n+--------------+--------------+--------------+\n| [e,Esther,32]|  [f,Fanny,36]|[c,Charlie,30]|\n|[c,Charlie,30]|    [b,Bob,36]|[c,Charlie,30]|\n|  [a,Alice,34]| [e,Esther,32]|  [f,Fanny,36]|\n|  [a,Alice,34]|    [b,Bob,36]|[c,Charlie,30]|\n|    [b,Bob,36]|[c,Charlie,30]|    [b,Bob,36]|\n| [e,Esther,32]|  [d,David,29]|  [a,Alice,34]|\n|  [a,Alice,34]| [e,Esther,32]|  [d,David,29]|\n|  [d,David,29]|  [a,Alice,34]|    [b,Bob,36]|\n|  [f,Fanny,36]|[c,Charlie,30]|    [b,Bob,36]|\n|  [d,David,29]|  [a,Alice,34]| [e,Esther,32]|\n+--------------+--------------+--------------+\n\n"}]}, {"cell_type": "code", "execution_count": 46, "metadata": {"collapsed": true}, "source": "# Select subgraph of users older than 30, and edges of type \"friend\"\nv2 = g.vertices.filter(\"age > 30\")\ne2 = g.edges.filter(\"relationship = 'friend'\")\ng2 = GraphFrame(v2, e2)", "outputs": []}, {"cell_type": "code", "execution_count": null, "metadata": {"collapsed": true}, "source": "", "outputs": []}], "nbformat_minor": 0, "metadata": {"kernelspec": {"display_name": "Python 2 with Spark 1.6", "language": "python", "name": "python2"}, "language_info": {"version": "2.7.11", "mimetype": "text/x-python", "codemirror_mode": {"version": 2, "name": "ipython"}, "file_extension": ".py", "name": "python", "pygments_lexer": "ipython2", "nbconvert_exporter": "python"}}, "nbformat": 4}
--------------------------------------------------------------------------------
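A note on `degrees` as used in these notebooks: the same count can be assembled from the edge DataFrame alone, which makes it clear what GraphFrames computes under the hood. A minimal sketch, assuming the `e` DataFrame from the notebook above (`unionAll` is the Spark 1.6 spelling; it became `union` in Spark 2.x):

```python
from pyspark.sql import functions as F

# Degree = number of incident edges: count each vertex once per edge
# in which it appears, whether as source or as destination.
manual_degrees = (e.select(F.col("src").alias("id"))
                   .unionAll(e.select(F.col("dst").alias("id")))
                   .groupBy("id").count()
                   .withColumnRenamed("count", "degree"))
manual_degrees.orderBy(F.desc("degree")).show()
```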
/2016-11-30/Spark GraphFrames.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-11-30/Spark GraphFrames.pdf
--------------------------------------------------------------------------------
/2017-01-25/.gitignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2017-01-25/README.md:
--------------------------------------------------------------------------------
Event: https://www.meetup.com/Toronto-Apache-Spark/events/237100210/
Video will be uploaded!
--------------------------------------------------------------------------------
/2017-02-22/README.md:
--------------------------------------------------------------------------------
Event: https://www.meetup.com/Toronto-Apache-Spark/events/237474395/
--------------------------------------------------------------------------------
/2017-02-22/TAS-2017.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2017-02-22/TAS-2017.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Toronto Apache Spark Meetup
`Watch` the [TAS meetups repo](https://github.com/TorontoApacheSpark/meetups) to get notified as we add slides and material for each event.

## Monthly Contents:
- [Toronto Apache Spark Google+ Collection](https://plus.google.com/collection/UZdAbB) `Join` to participate via Hangouts in live sessions
- [Toronto Apache Spark Youtube Channel](https://www.youtube.com/channel/UCjES0_2fkZuNXlyC_HoHxyw) `Subscribe` to get notified!
- [5 Bullet Points](https://github.com/TorontoApacheSpark/Spark-Meetup-Five-Bullet-Points) `Watch` the repo to get notified!

## Meetup Links:
- [Toronto Apache Spark Meetup Page](http://www.meetup.com/Toronto-Apache-Spark/)
- [Giving a talk at Toronto Apache Spark](http://goo.gl/forms/ygzYg8SjXr)
- [Members Survey](http://goo.gl/forms/ykzMzlXDIQ)

## Slack:
- [Toronto Apache Spark Slack](https://torontoapachespark.slack.com) - Send us an email with "Slack" as the subject and we will invite you.

E-mail: torontoapachespark@gmail.com
--------------------------------------------------------------------------------
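Binary assets in this dump are stored as raw.githubusercontent.com URLs rather than file contents. A minimal sketch for pulling one of the decks down locally (the `requests` dependency and the output filename are assumptions, not part of the repo):

```python
import requests

URL = ("https://raw.githubusercontent.com/TorontoApacheSpark/meetups/"
       "3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-08-31/AdvancedDataScienceSpark.pdf")

# Stream the PDF to disk so large decks are not held in memory.
resp = requests.get(URL, stream=True, timeout=30)
resp.raise_for_status()
with open("AdvancedDataScienceSpark.pdf", "wb") as fh:
    for chunk in resp.iter_content(chunk_size=1 << 16):
        fh.write(chunk)
```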