├── 2015-08-31
│   └── AdvancedDataScienceSpark.pdf
├── 2015-09-30
│   ├── Building_enterprise_data_lake_solution_using_spark_and_sequoiadb.pdf
│   └── Spark_Scala_vs_The_Rest.pdf
├── 2015-11-25
│   ├── 5Bullets_Nov25.pdf
│   ├── Continuous_Integration_for_Apache_Spark.pdf
│   ├── Oliver-SparkMeetup-2015-11-23.pdf
│   ├── Remote_QA_With_Denny_Lee.pptx
│   └── a-year-of-spark-at-flipp_toronto-spark-meetup_20151125.pdf
├── 2015-12-14
│   ├── 5Bullets_Dec14.pdf
│   ├── Organizational_Updates.gslides
│   └── Toronto_SparkMeetup_Dec142015.pdf
├── 2016-01-27
│   ├── Databricks_Spark_Summit_East_2016_Promo.pptx
│   ├── IntroToSpark_by_Adastra.pdf
│   └── SparkAsService_by_Sansom_Lee.pptx
├── 2016-02-24
│   └── ScalaJVMBigData-SparkLessons.pdf
├── 2016-03-30
│   ├── sustainable_spark_development.htm
│   └── tas_spark_realtime_risk_management_2016.pdf
├── 2016-04-27
│   ├── 5BP.md
│   └── SparkKafkaMeetup2016-04-27.pdf
├── 2016-05-25
│   ├── .ignore
│   ├── Hackathon-2016.pdf
│   └── Spark Tools and Methodologies at Shopify.pdf
├── 2016-06-29
│   ├── .ignore
│   └── Collaborative_Recommendations_by_Mo_kijiji.pdf
├── 2016-07-27
│   ├── ExperiencesinDeliveringSparkasaServiceIBM.pdf
│   ├── Readme.md
│   └── SparkStoriesWattpad.pdf
├── 2016-09-28
│   └── Continuous_Applications_with_Apache_Spark.pdf
├── 2016-10-26
│   ├── .ignore
│   ├── Shoehorning_Spark.pdf
│   └── Spark_in_production_pipelines.pdf
├── 2016-11-30
│   ├── Analyzing+Flight+Data+with+GraphFrames+2.ipynb
│   ├── GraphFrame basics.ipynb
│   └── Spark GraphFrames.pdf
├── 2017-01-25
│   ├── .gitignore
│   └── README.md
├── 2017-02-22
│   ├── README.md
│   └── TAS-2017.pdf
└── README.md

--------------------------------------------------------------------------------
/2015-08-31/AdvancedDataScienceSpark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-08-31/AdvancedDataScienceSpark.pdf
--------------------------------------------------------------------------------
/2015-09-30/Building_enterprise_data_lake_solution_using_spark_and_sequoiadb.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-09-30/Building_enterprise_data_lake_solution_using_spark_and_sequoiadb.pdf
--------------------------------------------------------------------------------
/2015-09-30/Spark_Scala_vs_The_Rest.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-09-30/Spark_Scala_vs_The_Rest.pdf
--------------------------------------------------------------------------------
/2015-11-25/5Bullets_Nov25.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/5Bullets_Nov25.pdf
--------------------------------------------------------------------------------
/2015-11-25/Continuous_Integration_for_Apache_Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/Continuous_Integration_for_Apache_Spark.pdf
--------------------------------------------------------------------------------
/2015-11-25/Oliver-SparkMeetup-2015-11-23.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/Oliver-SparkMeetup-2015-11-23.pdf
--------------------------------------------------------------------------------
/2015-11-25/Remote_QA_With_Denny_Lee.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/Remote_QA_With_Denny_Lee.pptx
--------------------------------------------------------------------------------
/2015-11-25/a-year-of-spark-at-flipp_toronto-spark-meetup_20151125.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-11-25/a-year-of-spark-at-flipp_toronto-spark-meetup_20151125.pdf
--------------------------------------------------------------------------------
/2015-12-14/5Bullets_Dec14.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-12-14/5Bullets_Dec14.pdf
--------------------------------------------------------------------------------
/2015-12-14/Organizational_Updates.gslides:
--------------------------------------------------------------------------------
{"url": "https://docs.google.com/open?id=179XryZKXI965WCmaIKjA9zz7SjUTJXOuxM8kEvl4Ee8", "doc_id": "179XryZKXI965WCmaIKjA9zz7SjUTJXOuxM8kEvl4Ee8", "email": "pazookime@gmail.com", "resource_id": "presentation:179XryZKXI965WCmaIKjA9zz7SjUTJXOuxM8kEvl4Ee8"}
--------------------------------------------------------------------------------
/2015-12-14/Toronto_SparkMeetup_Dec142015.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-12-14/Toronto_SparkMeetup_Dec142015.pdf
--------------------------------------------------------------------------------
/2016-01-27/Databricks_Spark_Summit_East_2016_Promo.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-01-27/Databricks_Spark_Summit_East_2016_Promo.pptx
--------------------------------------------------------------------------------
/2016-01-27/IntroToSpark_by_Adastra.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-01-27/IntroToSpark_by_Adastra.pdf
--------------------------------------------------------------------------------
/2016-01-27/SparkAsService_by_Sansom_Lee.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-01-27/SparkAsService_by_Sansom_Lee.pptx
--------------------------------------------------------------------------------
/2016-02-24/ScalaJVMBigData-SparkLessons.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-02-24/ScalaJVMBigData-SparkLessons.pdf
--------------------------------------------------------------------------------
/2016-03-30/sustainable_spark_development.htm:
--------------------------------------------------------------------------------
[HTML slide deck; markup, styles, and scripts were stripped in this dump. Recoverable text: page title "Introducing Apache Spark"; title slide "Sustainable Spark Development", Sean McIntyre, Software Architect.]
--------------------------------------------------------------------------------
/2016-03-30/tas_spark_realtime_risk_management_2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-03-30/tas_spark_realtime_risk_management_2016.pdf
--------------------------------------------------------------------------------
/2016-04-27/5BP.md:
--------------------------------------------------------------------------------
[5BP - April 2016](https://github.com/TorontoApacheSpark/Spark-Meetup-Five-Bullet-Points/blob/master/content/2016/april.md)
--------------------------------------------------------------------------------
/2016-04-27/SparkKafkaMeetup2016-04-27.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-04-27/SparkKafkaMeetup2016-04-27.pdf
--------------------------------------------------------------------------------
/2016-05-25/.ignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2016-05-25/Hackathon-2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-05-25/Hackathon-2016.pdf
--------------------------------------------------------------------------------
/2016-05-25/Spark Tools and Methodologies at Shopify.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-05-25/Spark Tools and Methodologies at Shopify.pdf
--------------------------------------------------------------------------------
/2016-06-29/.ignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2016-06-29/Collaborative_Recommendations_by_Mo_kijiji.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-06-29/Collaborative_Recommendations_by_Mo_kijiji.pdf
--------------------------------------------------------------------------------
/2016-07-27/ExperiencesinDeliveringSparkasaServiceIBM.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-07-27/ExperiencesinDeliveringSparkasaServiceIBM.pdf
--------------------------------------------------------------------------------
/2016-07-27/Readme.md:
--------------------------------------------------------------------------------
# Slides for the July 27, 2016 TAS Meetup

http://www.meetup.com/Toronto-Apache-Spark/events/232329359/
--------------------------------------------------------------------------------
/2016-07-27/SparkStoriesWattpad.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-07-27/SparkStoriesWattpad.pdf
--------------------------------------------------------------------------------
/2016-09-28/Continuous_Applications_with_Apache_Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-09-28/Continuous_Applications_with_Apache_Spark.pdf
--------------------------------------------------------------------------------
/2016-10-26/.ignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2016-10-26/Shoehorning_Spark.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-10-26/Shoehorning_Spark.pdf
--------------------------------------------------------------------------------
/2016-10-26/Spark_in_production_pipelines.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-10-26/Spark_in_production_pipelines.pdf
--------------------------------------------------------------------------------
/2016-11-30/Analyzing+Flight+Data+with+GraphFrames+2.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the airport and flight data from Cloudant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cloudantHost='dtaieb.cloudant.com'\n",
    "cloudantUserName='weenesserliffircedinvers'\n",
    "cloudantPassword='72a5c4f939a9e2578698029d2bb041d775d088b5'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "airports = sqlContext.read.format(\"com.cloudant.spark\").option(\"cloudant.host\",cloudantHost)\\\n",
    "    .option(\"cloudant.username\",cloudantUserName).option(\"cloudant.password\",cloudantPassword)\\\n",
    "    .option(\"schemaSampleSize\", \"-1\").load(\"flight-metadata\")\n",
    "airports.cache()\n",
    "airports.registerTempTable(\"airports\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    }
   },
   "outputs": [],
   "source": [
    "import pixiedust\n",
    "display(airports)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "flights = sqlContext.read.format(\"com.cloudant.spark\").option(\"cloudant.host\",cloudantHost)\\\n",
    "    .option(\"cloudant.username\",cloudantUserName).option(\"cloudant.password\",cloudantPassword)\\\n",
    "    .option(\"schemaSampleSize\", \"-1\").load(\"pycon_flightpredict_training_set\")\n",
    "flights.cache()\n",
    "flights.registerTempTable(\"training\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    }
   },
   "outputs": [],
   "source": [
    "display(flights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build the vertices and edges dataframe from the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "422\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql import functions as f\n",
    "from pyspark.sql.types import *\n",
    "rdd = flights.flatMap(lambda s: [s.arrivalAirportFsCode, s.departureAirportFsCode]).distinct()\\\n",
    "    .map(lambda row:[row])\n",
    "vertices = airports.join(\n",
    "      sqlContext.createDataFrame(rdd, StructType([StructField(\"fs\",StringType())])), \"fs\"\n",
    "    ).dropDuplicates([\"fs\"]).withColumnRenamed(\"fs\",\"id\")\n",
    "print(vertices.count())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "edges=flights.withColumnRenamed(\"arrivalAirportFsCode\",\"dst\")\\\n",
    "    .withColumnRenamed(\"departureAirportFsCode\",\"src\")\\\n",
    "    .drop(\"departureWeather\").drop(\"arrivalWeather\").drop(\"pt_type\").drop(\"_id\").drop(\"_rev\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Install GraphFrames package using PixieDust packageManager"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Package already installed: graphframes:graphframes:0.1.0-spark1.6\n",
      "done\n"
     ]
    }
   ],
   "source": [
    "import pixiedust\n",
    "pixiedust.installPackage(\"graphframes:graphframes:0.1.0-spark1.6\")\n",
    "print(\"done\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create the GraphFrame from the Vertices and Edges Dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "graphMap"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from graphframes import GraphFrame\n",
    "g = GraphFrame(vertices, edges)\n",
    "display(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the degree for each vertex in the graph\n",
    "The degree of a vertex is the number of edges incident to the vertex. In a directed graph, the in-degree is the number of edges where the vertex is the destination and the out-degree is the number of edges where the vertex is the source. With GraphFrames, the degrees, outDegrees and inDegrees properties each return a DataFrame containing the id of the vertex and the number of edges. We then sort them in descending order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "degrees = g.degrees.sort(desc(\"degree\"))\n",
    "display( degrees )"
   ]
  },
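  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*(Editor's note: this cell and the next were added during editing and are not part of the original notebook.)* `degrees` counts all incident edges; `inDegrees` and `outDegrees` split that count into arriving and departing edges. A minimal sketch, assuming the GraphFrames `inDegrees`/`outDegrees` properties (DataFrames keyed by `id`) behave as described above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Added sketch: compare arriving vs. departing edge counts per airport.\n",
    "inOut = g.inDegrees.join(g.outDegrees, \"id\")\n",
    "display(inOut.orderBy(desc(\"inDegree\")))"
   ]
  },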
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute a list of shortest paths for each vertex to a specified list of landmarks\n",
    "For this we use the `shortestPaths` api that returns a DataFrame containing the properties for each vertex plus an extra column called distances that contains the number of hops to each landmark.\n",
    "In the following code, we use BOS and LAX as the landmarks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "r = g.shortestPaths(landmarks=[\"BOS\", \"LAX\"]).select(\"id\", \"distances\")\n",
    "#display(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Row(id=u'CAE', distances={})]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.take(1)"
   ]
  },
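  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*(Editor's note: this cell and the next were added during editing and are not part of the original notebook.)* An empty `distances` map, as in the `CAE` row above, means that vertex cannot reach any of the landmarks. A hedged sketch that keeps only reachable vertices, assuming `pyspark.sql.functions.size` accepts map columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Added sketch: drop vertices with no path to either BOS or LAX.\n",
    "from pyspark.sql.functions import size\n",
    "r.where(size(\"distances\") > 0).show(5)"
   ]
  },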
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the pageRank for each vertex in the graph\n",
    "[PageRank](https://en.wikipedia.org/wiki/PageRank) is a famous algorithm used by Google Search to rank vertices in a graph by order of importance. To compute pageRank, we'll use the `pageRank` api that returns a new graph in which the vertices have a new `pagerank` column representing the pagerank score for the vertex and the edges have a new `weight` column representing the edge weight that contributed to the pageRank score. We'll then display the vertex ids and associated pageranks sorted in descending order: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "edges"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "\n",
    "ranks = g.pageRank(resetProbability=0.20, maxIter=5)\n",
    "\n",
    "rankedVertices = ranks.vertices.select(\"id\",\"pagerank\").orderBy(desc(\"pagerank\"))\n",
    "rankedEdges = ranks.edges.select(\"src\", \"dst\", \"weight\").orderBy(desc(\"weight\") )\n",
    "\n",
    "ranks = GraphFrame(rankedVertices, rankedEdges)\n",
    "display(ranks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Search routes between 2 airports with specific criteria\n",
    "In this section, we want to find all the routes between Boston and San Francisco operated by United Airlines with at most 2 hops. To accomplish this, we use the `bfs` ([Breadth First Search](https://en.wikipedia.org/wiki/Breadth-first_search)) api that returns a DataFrame containing the shortest paths between matching vertices. For clarity we will only keep the edges when displaying the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "paths = g.bfs(fromExpr=\"id='BOS'\",toExpr=\"id = 'SFO'\",edgeFilter=\"carrierFsCode='UA'\", maxPathLength = 2)\\\n",
    "    .drop(\"from\").drop(\"to\")\n",
    "paths.cache()\n",
    "display(paths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Find all airports that do not have direct flights between each other\n",
    "In this section, we'll use a very powerful graphFrames search feature that uses a pattern called a [motif](http://graphframes.github.io/user-guide.html#motif-finding) to find nodes. We'll use the pattern `\"(a)-[]->(b);(b)-[]->(c);!(a)-[]->(c)\"`, which searches for all nodes a, b and c that have an edge from a to b and an edge from b to c but no edge from a to c. \n",
    "Also, because the search is computationally expensive, we reduce the number of edges by grouping the flights that have the same src and dst."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "\n",
    "h = GraphFrame(g.vertices, g.edges.select(\"src\",\"dst\")\\\n",
    "    .groupBy(\"src\",\"dst\").agg(count(\"src\").alias(\"count\")))\n",
    "\n",
    "query = h.find(\"(a)-[]->(b);(b)-[]->(c);!(a)-[]->(c)\").drop(\"b\")\n",
    "query.cache()\n",
    "display(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the strongly connected components for this graph\n",
    "[Strongly Connected Components](https://en.wikipedia.org/wiki/Strongly_connected_component) are components for which each vertex is reachable from every other vertex. To compute them, we'll use the `stronglyConnectedComponents` api that returns a DataFrame containing all the vertices with the addition of a `component` column that has the id of the component to which the vertex belongs. We then group all the rows by component and count the member vertices. This gives us a good idea of the distribution of component sizes in the graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "components = g.stronglyConnectedComponents(maxIter=10).select(\"id\",\"component\")\\\n",
    "    .groupBy(\"component\").agg(count(\"id\").alias(\"count\")).orderBy(desc(\"count\"))\n",
    "display(components)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detect communities in the graph using Label Propagation algorithm\n",
    "The [Label Propagation algorithm](https://en.wikipedia.org/wiki/Label_Propagation_Algorithm) is a popular algorithm for finding communities within a graph. It has the advantage of being computationally inexpensive and thus works well with large graphs. To compute the communities, we'll use the `labelPropagation` api that returns a DataFrame containing all the vertices with the addition of a `label` column that has the id of the community to which the vertex belongs. Similar to the strongly connected components, we then group all the rows by label and count the member vertices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "handlerId": "dataframe"
     }
    }
   },
   "outputs": [],
   "source": [
    "from pyspark.sql.functions import *\n",
    "communities = g.labelPropagation(maxIter=5).select(\"id\", \"label\")\\\n",
    "    .groupBy(\"label\").agg(count(\"id\").alias(\"count\")).orderBy(desc(\"count\"))\n",
    "display(communities)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use AggregateMessages to compute the average flight delays by originating airport\n",
    "\n",
    "The AggregateMessages api is not currently available in Python, so we use the PixieDust Scala bridge to call out to the Scala API.\n",
    "Note: PixieDust automatically rebinds the Python GraphFrame variable g to a Scala GraphFrame with the same name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%scala\n",
    "import org.graphframes.lib.AggregateMessages\n",
    "import org.apache.spark.sql.functions.{avg,desc,floor}\n",
    "\n",
    "// For each airport, average the delays of the departing flights\n",
    "val msgToSrc = AggregateMessages.edge(\"deltaDeparture\")\n",
    "val __agg = g.aggregateMessages\n",
    "  .sendToSrc(msgToSrc)  // send each flight delay to source\n",
    "  .agg(floor(avg(AggregateMessages.msg)).as(\"averageDelays\"))  // average up all delays\n",
    "  .orderBy(desc(\"averageDelays\"))\n",
    "  .limit(10)\n",
    "__agg.cache()\n",
    "__agg.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pixiedust": {
     "displayParams": {
      "aggregation": "SUM",
      "handlerId": "barChart",
      "keyFields": "id",
      "showLegend": "true",
      "stacked": "true",
      "staticFigure": "false",
      "title": "Average Flight delays by originating airport",
      "valueFields": "averageDelays"
     }
    }
   },
   "outputs": [],
   "source": [
    "display(__agg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2 with Spark 1.6",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
--------------------------------------------------------------------------------
/2016-11-30/GraphFrame basics.ipynb:
--------------------------------------------------------------------------------
{"cells": [{"cell_type": "code", "execution_count": 47, "metadata": {"collapsed": false}, "source": "import pixiedust\npixiedust.installPackage(\"graphframes:graphframes:0.1.0-spark1.6\")", "outputs": [{"output_type": "stream", "name": "stdout", "text": "Package already installed: graphframes:graphframes:0.1.0-spark1.6\n"}, {"data": {"text/plain": ""}, "execution_count": 47, "metadata": {}, "output_type": "execute_result"}]}, {"cell_type": "code", "execution_count": 48, "metadata": {"collapsed": true}, "source": "from graphframes import GraphFrame", "outputs": []}, {"cell_type": "code", "execution_count": 49, "metadata": {"collapsed": true}, "source": "# Vertex DataFrame\nv = sqlContext.createDataFrame([\n  (\"a\", \"Alice\", 34),\n  (\"b\", \"Bob\", 36),\n  (\"c\", \"Charlie\", 30),\n  (\"d\", \"David\", 29),\n  (\"e\", \"Esther\", 32),\n  (\"f\", \"Fanny\", 36),\n  (\"g\", \"Gabby\", 60)\n], [\"id\", \"name\", \"age\"])\n\n# Edge DataFrame\ne = sqlContext.createDataFrame([\n  (\"a\", \"b\", \"friend\"),\n  (\"b\", \"c\", \"follow\"),\n  (\"c\", \"b\", \"follow\"),\n  (\"f\", \"c\", \"follow\"),\n  (\"e\", \"f\", \"follow\"),\n  (\"e\", \"d\", \"friend\"),\n  (\"d\", \"a\", \"friend\"),\n  (\"a\", \"e\", \"friend\")\n], [\"src\", \"dst\", \"relationship\"])", "outputs": []}, {"cell_type": "code", "execution_count": 50, "metadata": {"collapsed": true}, "source": "# Create a GraphFrame\ng = GraphFrame(v, e)", "outputs": []}, {"cell_type": "code", "execution_count": 51, "metadata": {"collapsed": false}, "source": "# take a look at the vertices (show)\ng.vertices.show()", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+---+-------+---+\n| id|   name|age|\n+---+-------+---+\n|  a|  Alice| 34|\n|  b|    Bob| 36|\n|  c|Charlie| 30|\n|  d|  David| 29|\n|  e| Esther| 32|\n|  f|  Fanny| 36|\n|  g|  Gabby| 60|\n+---+-------+---+\n\n"}]}, {"cell_type": "code", "execution_count": null, "metadata": {"collapsed": false}, "source": "# take a look at the edges (show)\ng.edges.show()", "outputs": []}, {"cell_type": "code", "execution_count": 39, "metadata": {"collapsed": false}, "source": "# find the youngest user in the group\ng.vertices.groupBy().min(\"age\").show()", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+--------+\n|min(age)|\n+--------+\n|      29|\n+--------+\n\n"}]}, {"cell_type": "code", "execution_count": 40, "metadata": {"collapsed": false}, "source": "# how many follows are in the graph?\nnumFollows = g.edges.filter(\"relationship = 'follow'\").count()\n\nprint \"Total number of follows: \" + str(numFollows)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "Total number of follows: 4\n"}]}, {"cell_type": "code", "execution_count": 52, "metadata": {"collapsed": false}, "source": "# Motif finding (DSL)\n# (a) - [e] -> (b)\n\n# Ex. Find all the pairs of vertices with edges in both directions (find)\ng.find(\"(a)-[]->(b); (b)-[]->(a)\").show()\n\n# find (filter) only those where one of the nodes is older than 30\ng.find(\"(a)-[]->(b); (b)-[]->(a)\").filter(\"a.age > 30\")\n\n# more complex: a->b, b->c but !a->c\ng.find(\"(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)\").show()", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+--------------+--------------+\n|             a|             b|\n+--------------+--------------+\n|[c,Charlie,30]|    [b,Bob,36]|\n|    [b,Bob,36]|[c,Charlie,30]|\n+--------------+--------------+\n\n+--------------+--------------+--------------+\n|             a|             b|             c|\n+--------------+--------------+--------------+\n| [e,Esther,32]|  [f,Fanny,36]|[c,Charlie,30]|\n|[c,Charlie,30]|    [b,Bob,36]|[c,Charlie,30]|\n|  [a,Alice,34]| [e,Esther,32]|  [f,Fanny,36]|\n|  [a,Alice,34]|    [b,Bob,36]|[c,Charlie,30]|\n|    [b,Bob,36]|[c,Charlie,30]|    [b,Bob,36]|\n| [e,Esther,32]|  [d,David,29]|  [a,Alice,34]|\n|  [a,Alice,34]| [e,Esther,32]|  [d,David,29]|\n|  [d,David,29]|  [a,Alice,34]|    [b,Bob,36]|\n|  [f,Fanny,36]|[c,Charlie,30]|    [b,Bob,36]|\n|  [d,David,29]|  [a,Alice,34]| [e,Esther,32]|\n+--------------+--------------+--------------+\n\n"}]}, {"cell_type": "code", "execution_count": 46, "metadata": {"collapsed": true}, "source": "# Select subgraph of users older than 30, and edges of type \"friend\"\nv2 = g.vertices.filter(\"age > 30\")\ne2 = g.edges.filter(\"relationship = 'friend'\")\ng2 = GraphFrame(v2, e2)", "outputs": []}, {"cell_type": "code", "execution_count": null, "metadata": {"collapsed": true}, "source": "", "outputs": []}], "nbformat_minor": 0, "metadata": {"kernelspec": {"display_name": "Python 2 with Spark 1.6", "language": "python", "name": "python2"}, "language_info": {"version": "2.7.11", "mimetype": "text/x-python", "codemirror_mode": {"version": 2, "name": "ipython"}, "file_extension": ".py", "name": "python", "pygments_lexer": "ipython2", "nbconvert_exporter": "python"}}, "nbformat": 4}
--------------------------------------------------------------------------------
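A note on `degrees` as used in these notebooks: the same count can be assembled from the edge DataFrame alone, which makes it clear what GraphFrames computes under the hood. A minimal sketch, assuming the `e` DataFrame from the notebook above (`unionAll` is the Spark 1.6 spelling; it became `union` in Spark 2.x):

```python
from pyspark.sql import functions as F

# Degree = number of incident edges: count each vertex once per edge
# in which it appears, whether as source or as destination.
manual_degrees = (e.select(F.col("src").alias("id"))
                   .unionAll(e.select(F.col("dst").alias("id")))
                   .groupBy("id").count()
                   .withColumnRenamed("count", "degree"))
manual_degrees.orderBy(F.desc("degree")).show()
```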
/2016-11-30/Spark GraphFrames.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2016-11-30/Spark GraphFrames.pdf
--------------------------------------------------------------------------------
/2017-01-25/.gitignore:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/2017-01-25/README.md:
--------------------------------------------------------------------------------
Event: https://www.meetup.com/Toronto-Apache-Spark/events/237100210/
Video will be uploaded!
--------------------------------------------------------------------------------
/2017-02-22/README.md:
--------------------------------------------------------------------------------
Event: https://www.meetup.com/Toronto-Apache-Spark/events/237474395/
--------------------------------------------------------------------------------
/2017-02-22/TAS-2017.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TorontoApacheSpark/meetups/3660d57b800efdaef21fb8a8e6f6469b1612c004/2017-02-22/TAS-2017.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Toronto Apache Spark Meetup
`Watch` the [TAS meetups repo](https://github.com/TorontoApacheSpark/meetups) to get notified as we add slides and material for each event.

## Monthly Contents:
- [Toronto Apache Spark Google+ Collection](https://plus.google.com/collection/UZdAbB) `Join` to participate via Hangouts in live sessions
- [Toronto Apache Spark Youtube Channel](https://www.youtube.com/channel/UCjES0_2fkZuNXlyC_HoHxyw) `Subscribe` to get notified!
- [5 Bullet Points](https://github.com/TorontoApacheSpark/Spark-Meetup-Five-Bullet-Points) `Watch` the repo to get notified!

## Meetup Links:
- [Toronto Apache Spark Meetup Page](http://www.meetup.com/Toronto-Apache-Spark/)
- [Giving a talk at Toronto Apache Spark](http://goo.gl/forms/ygzYg8SjXr)
- [Members Survey](http://goo.gl/forms/ykzMzlXDIQ)

## Slack:
- [Toronto Apache Spark Slack](https://torontoapachespark.slack.com) - Send us an email with "Slack" as the subject and we will invite you.

E-mail: torontoapachespark@gmail.com
--------------------------------------------------------------------------------
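Binary assets in this dump are stored as raw.githubusercontent.com URLs rather than file contents. A minimal sketch for pulling one of the decks down locally (the `requests` dependency and the output filename are assumptions, not part of the repo):

```python
import requests

URL = ("https://raw.githubusercontent.com/TorontoApacheSpark/meetups/"
       "3660d57b800efdaef21fb8a8e6f6469b1612c004/2015-08-31/AdvancedDataScienceSpark.pdf")

# Stream the PDF to disk so large decks are not held in memory.
resp = requests.get(URL, stream=True, timeout=30)
resp.raise_for_status()
with open("AdvancedDataScienceSpark.pdf", "wb") as fh:
    for chunk in resp.iter_content(chunk_size=1 << 16):
        fh.write(chunk)
```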