├── README.md ├── .gitignore ├── LICENSE └── tutorial.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # pyspark-tutorial 2 | A short tutorial notebook on PySpark 3 | 4 | Prerequisites 5 | ------------- 6 | 7 | - Install Java 7 or newer on your OS. 8 | 9 | - Download a pre-built version of Spark from [here](https://spark.apache.org/downloads.html) and unpack it. 10 | 11 | - Set the following environment variables, for example in your `~/.bashrc`: 12 | 13 | `export SPARK_HOME=/PATH/TO/SPARK` 14 | 15 | `export PYTHONPATH=/PATH/TO/SPARK/python` 16 | 17 | - If you want Spark to work with a specific Python version/virtualenv, also set this one: 18 | 19 | `export PYSPARK_PYTHON=/PATH/TO/PYTHON/INSIDE/VIRTUALENV` 20 | 21 | - Install Py4j dependency: 22 | 23 | `pip install py4j` 24 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | 59 | .ipynb_checkpoints/ 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Nico de Vos 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Spark / PySpark tutorial\n", 8 | "========================\n", 9 | "\n", 10 | "Spark is an increasingly popular cluster computing system that integrates with the Apache Hadoop ecosystem and is valued for its speed and ease of use. We are going to have a look at it here, with special focus on the Python interface to Spark: PySpark." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "Prerequisites\n", 18 | "-------------\n", 19 | "\n", 20 | "Download one of the pre-built versions of Spark from the website (http://spark.apache.org/downloads.html), and untar it (`tar -xvf <archive>`). Alternatively, build Spark yourself by running `./sbt/sbt assembly` from the Spark directory.\n", 21 | "\n", 22 | "We will run Spark locally on our machine in this tutorial. After you set the `SPARK_HOME` environment variable and add Spark's `python` directory to the `PYTHONPATH`, you're good to go.\n", 23 | "\n", 24 | "*Note:\n", 25 | "When running locally inside a virtualenv, we need to tell Spark to use that virtualenv's Python; otherwise it falls back to the default system Python. Put this in your code: `os.environ['PYSPARK_PYTHON'] = sys.executable`.*\n", 26 | "\n", 27 | "*Note:\n", 28 | "Spark has a web UI that shows running tasks, workers and various statistics. When running locally, it can be reached at http://localhost:4040/.*" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Calling PySpark\n", 36 | "---------------\n", 37 | "\n", 38 | "To call Spark from Python, we need to use the PySpark interface. It can, for example, be started as an interactive shell from your Spark home dir:\n", 39 | "\n", 40 | " ./bin/pyspark\n", 41 | "\n", 42 | "As an IPython Spark shell:\n", 43 | "\n", 44 | " IPYTHON=1 ./bin/pyspark\n", 45 | "\n", 46 | "Or, to launch a standalone script, via `spark-submit`:\n", 47 | "\n", 48 | " ./bin/spark-submit --master local your_script.py\n", 49 | "\n", 50 | "Below we show how you would use the PySpark API inside a Python script." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 1, 56 | "metadata": { 57 | "collapsed": false 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "import os\n", 62 | "import sys\n", 63 | "\n", 64 | "# Spark's home directory (here it's: ~/spark-1.6.0) should be set as an environment variable.\n", 65 | "# (Of course setting an env. variable doesn't need to be done from Python; any method will do.)\n", 66 | "# os.environ['SPARK_HOME'] = os.path.join(os.path.expanduser('~'), 'spark-1.6.0')\n", 67 | "\n", 68 | "# Add Spark's Python interface (PySpark) to PYTHONPATH.\n", 69 | "# (Again: this doesn't need to be done from Python.)\n", 70 | "# sys.path.append(os.path.join(os.environ.get('SPARK_HOME'), 'python'))\n", 71 | "\n", 72 | "# This can be useful for running in virtualenvs:\n", 73 | "# os.environ['PYSPARK_PYTHON'] = '/home/nico/virtualenv/bin/python'\n", 74 | "\n", 75 | "# OK, now we can import PySpark\n", 76 | "from pyspark import SparkContext, SparkConf" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "Inside our *driver program*, the connection to Spark is represented by a `SparkContext` instance.
For running Spark locally, you can simply instantiate one with:\n", 84 | "\n", 85 | " sc = SparkContext('local', 'mySparkApp')\n", 86 | "\n", 87 | "Alternatively, you can use a `SparkConf` instance to control various Spark configuration properties, which is what we are going to demonstrate here." 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 2, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "conf = SparkConf()\n", 99 | "# To run locally, use:\n", 100 | "conf.setMaster('local')\n", 101 | "conf.setAppName('spark_tutorial')\n", 102 | "# SparkConf's 'set', 'setAll' and 'setIfMissing' can be used to set a variety of\n", 103 | "# configuration properties, e.g.:\n", 104 | "conf.setIfMissing(\"spark.cores.max\", \"4\")\n", 105 | "conf.set(\"spark.executor.memory\", \"1g\")\n", 106 | "# Alternatively:\n", 107 | "conf.setAll([('spark.cores.max', '4'), ('spark.executor.memory', '1g')])\n", 108 | "\n", 109 | "# Now instantiate SparkContext\n", 110 | "sc = SparkContext(conf=conf)\n", 111 | "\n", 112 | "# Note: SparkContexts can be stopped manually with:\n", 113 | "# sc.stop()" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "How Spark works, in brief\n", 121 | "-------------------------\n", 122 | "\n", 123 | "Spark uses a *cluster manager* (e.g., Spark's own standalone manager, YARN or Mesos), and a number of *worker nodes*. The manager attempts to acquire *executors* on the worker nodes, which do computations and store data based on the code and tasks that are sent to them.\n", 124 | "\n", 125 | "Spark's primary abstraction is a so-called *Resilient Distributed Dataset (RDD)*. Spark can create RDDs from any storage source supported by Hadoop. An RDD holds intermediate computational results and is stored in RAM or on disk across the worker nodes. If a node fails, an RDD can be reconstructed. Many processes can be executed in parallel thanks to the distributed nature of RDDs, and pipelining and lazy execution avoid the need to save intermediate results for the next step. Importantly, Spark supports pulling data sets into a cluster-wide *in-memory cache* for fast access.\n", 126 | "\n", 127 | "RDD operations can be divided into two groups: *transformations* and *actions*. Transformations (e.g., `map`) of RDDs always result in new RDDs, and actions (e.g., `reduce`) return the result of an operation on the RDD to the driver program.\n", 128 | "\n", 129 | "The above and more will be demonstrated in the code examples below."
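, "\n", "As a quick preview (a minimal sketch, assuming the `sc` created above; `doubled` is just an illustrative name):\n", "\n", "    doubled = sc.parallelize(range(4)).map(lambda x: x * 2)  # transformation: only recorded, nothing runs yet\n", "    doubled.collect()                                        # action: triggers the computation and returns [0, 2, 4, 6]\n"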
130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "### RDDs are distributed data sets" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 5, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "Number of partitions: 4\n", 151 | "[[0, 1, 2], [3, 4, 5, 6], [7, 8, 9], [10, 11, 12, 13]]\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "# 'parallelize' creates an RDD by distributing data over the cluster\n", 157 | "rdd = sc.parallelize(range(14), numSlices=4)\n", 158 | "print(\"Number of partitions: {}\".format(rdd.getNumPartitions()))\n", 159 | "# 'glom' lists all elements within each partition\n", 160 | "print(rdd.glom().collect())" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "### Spark is lazy\n", 168 | "No matter how many transformations are chained, Spark only executes them once an *action* is performed on the RDD. This is because it tries to do smart pipelining of operations, so that it does not have to save intermediate results." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 6, 174 | "metadata": { 175 | "collapsed": false 176 | }, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169]\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "rddSquared = rdd.map(lambda x: x ** 2)\n", 188 | "# Alternatively, you can use a normal function:\n", 189 | "# def squared(x):\n", 190 | "# return x ** 2\n", 191 | "# rddSquared = rdd.map(squared)\n", 192 | "\n", 193 | "# The 'collect' action triggers Spark: the above transformation is performed,\n", 194 | "# and results are collected.\n", 195 | "print(rddSquared.collect())" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 7, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]\n", 210 | "0\n", 211 | "[0, 1, 2, 3, 4]\n", 212 | "[13, 12, 11]\n", 213 | "[13, 12, 11, 10, 9, 8, 7]\n" 214 | ] 215 | } 216 | ], 217 | "source": [ 218 | "# Popular transformations\n", 219 | "# -----------------------\n", 220 | "\n", 221 | "func = lambda x: -x\n", 222 | "rdd.map(func)\n", 223 | "rdd.flatMap(lambda x: [x, -x]) # like map, but the function returns a list of elements, and the results are flattened\n", 224 | "rdd.filter(lambda x: x % 2 == 0) # keep only the elements for which the function returns True\n", 225 | "rdd.sortBy(func)\n", 226 | "\n", 227 | "# Popular actions\n", 228 | "# ---------------\n", 229 | "\n", 230 | "rdd.reduce(lambda x, y: x + y)\n", 231 | "rdd.count()\n", 232 | "\n", 233 | "# Actions with which to take data from an RDD:\n", 234 | "print(rdd.collect()) # get all elements\n", 235 | "print(rdd.first()) # get first element\n", 236 | "print(rdd.take(5)) # get N first elements\n", 237 | "print(rdd.top(3)) # get N highest elements in descending order\n", 238 | "print(rdd.takeOrdered(7, lambda x: -x)) # get N first elements in ascending order (or in the order given by a key function)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "### RDDs can be cached to memory (or disk)\n", 246 | "Spark allows the user to control what data is cached and how. Proper caching of RDDs can be hugely beneficial! Whenever you have an RDD that will be re-used multiple times later on, you should consider caching it."
247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 8, 252 | "metadata": { 253 | "collapsed": false 254 | }, 255 | "outputs": [ 256 | { 257 | "ename": "ImportError", 258 | "evalue": "No module named 'numpy'", 259 | "output_type": "error", 260 | "traceback": [ 261 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 262 | "\u001b[1;31mImportError\u001b[0m Traceback (most recent call last)", 263 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[0mNUM_SAMPLES\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1e6\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mrddBig\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparallelize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrandom\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrandom\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mNUM_SAMPLES\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n", 264 | "\u001b[1;31mImportError\u001b[0m: No module named 'numpy'" 265 | ] 266 | } 267 | ], 268 | "source": [ 269 | "import numpy as np\n", 270 | "\n", 271 | "NUM_SAMPLES = int(1e6)\n", 272 | "rddBig = sc.parallelize(np.random.random(NUM_SAMPLES))\n", 273 | "\n", 274 | "# no caching: will be recalculated everytime we go through the loop\n", 275 | "rddBigTrans = rddBig.map(lambda x: (x ** 2 - 0.1) ** 0.5)\n", 276 | "print(rddBigTrans.getStorageLevel())\n", 277 | "for threshold in (0.2, 0.4, 0.6, 0.8):\n", 278 | " %timeit -n 1 -r 1 rddBigTrans.filter(lambda x: x >= threshold).count()\n", 279 | "\n", 280 | "# we cache this intermediate result because it will be repeatedly called\n", 281 | "rddBigTrans_c = rddBig.map(lambda x: (x ** 2 - 0.1) ** 0.5).cache()\n", 282 | "print(rddBigTrans_c.getStorageLevel())\n", 283 | "for threshold in (0.2, 0.4, 0.6, 0.8):\n", 284 | " %timeit -n 1 -r 1 rddBigTrans_c.filter(lambda x: x >= threshold).count()\n", 285 | "\n", 286 | "# use unpersist to remove from cache\n", 287 | "print(rddBigTrans_c.unpersist().getStorageLevel())\n", 288 | "# for even finer-grained control of caching, use the 'persist' function\n", 289 | "from pyspark import storagelevel\n", 290 | "print(rddBigTrans.persist(storagelevel.StorageLevel.MEMORY_AND_DISK_SER).getStorageLevel())" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "### Spark can work with key-value pairs\n", 298 | "So-called PairRDDs are RDDs that store key-value pairs. Spark has a variety of special operations that make use of this, such as joining by key, grouping by key, etc." 
299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": { 305 | "collapsed": false 306 | }, 307 | "outputs": [], 308 | "source": [ 309 | "# PairRDDs are automatically created whenever we present a list of key-value tuples\n", 310 | "# Here we transform 'rdd' and create a key based on an even/odd flag\n", 311 | "rddP1 = rdd.map(lambda x: (x % 2 == 0, x))\n", 312 | "# A clearer shortcut for this is:\n", 313 | "rddP1 = rdd.keyBy(lambda x: x % 2 == 0)\n", 314 | "\n", 315 | "# Another way to create a PairRDD is to zip two RDDs (assumes both have the same number of partitions and elements)\n", 316 | "print(\"Zipped: {}\".format(rdd.zip(rdd).collect()))\n", 317 | "\n", 318 | "# Access to the keys and values\n", 319 | "print(\"Keys: {}\".format(rddP1.keys().collect()))\n", 320 | "print(\"Values: {}\".format(rddP1.values().collect()))\n", 321 | "\n", 322 | "# This is how you can map a function over a pairRDD; x[0] is the key, x[1] the value\n", 323 | "print(rddP1.map(lambda x: (x[0], x[1] ** 2)).collect())\n", 324 | "# Better: mapValues/flatMapValues, which operate on the values only and keep the keys in place\n", 325 | "print(rddP1.mapValues(lambda x: x ** 2).collect())\n", 326 | "# We can also go back from a PairRDD to a normal RDD by simply dropping the key\n", 327 | "print(rddP1.map(lambda x: x[1] ** 2).collect())" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "collapsed": false 335 | }, 336 | "outputs": [], 337 | "source": [ 338 | "# Various aggregations by key are possible, such as reduceByKey, combineByKey, and foldByKey\n", 339 | "# reduceByKey example:\n", 340 | "print(\"Sum per key: {}\".format(rddP1.reduceByKey(lambda x, y: x + y).collect()))\n", 341 | "\n", 342 | "# Also, some common operations are available in 'ByKey' form, e.g.:\n", 343 | "rddP1.sortByKey()\n", 344 | "rddP1.countByKey()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": { 351 | "collapsed": false 352 | }, 353 | "outputs": [], 354 | "source": [ 355 | "# Grouping and joining by key\n", 356 | "\n", 357 | "# There are various possible ways of joining two RDDs together by key:\n", 358 | "rddP2 = sc.parallelize(range(0, 28, 2)).map(lambda x: (x % 2 == 0, x))\n", 359 | "# inner join; keys that appear in both RDDs yield all combinations of their values\n", 360 | "print(\"Join: {}\".format(rddP1.join(rddP2).collect()))\n", 361 | "# left/right outer join\n", 362 | "rddP1.leftOuterJoin(rddP2)\n", 363 | "rddP1.rightOuterJoin(rddP2)\n", 364 | "\n", 365 | "# for all keys in either rddP1 or rddP2, cogroup returns iterables of the values in either\n", 366 | "print(\"Cogroup: {}\".format(rddP1.cogroup(rddP2).collect()))\n", 367 | "# cogrouping together more than two RDDs by key can be done with groupWith\n", 368 | "rddP1.groupWith(rddP2, rddP2)\n", 369 | "\n", 370 | "# with groupByKey we create a new RDD in which all values for the same key are grouped together (on the same node)\n", 371 | "print(\"After groupByKey: {}\".format(rddP1.groupByKey().glom().collect()))" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "### Spark can directly create RDDs from text files" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": { 385 | "collapsed": true 386 | }, 387 | "outputs": [], 388 | "source": [ 389 | "# TODO: addFile does not seem to work.
Is it because we are running in standalone mode?\n", 390 | "# from pyspark import SparkFiles\n", 391 | "# sc.addFile(os.path.join(os.environ.get('SPARK_HOME'), 'LICENSE'))\n", 392 | "# rddT = sc.textFile(SparkFiles.get('LICENSE'))\n", 393 | "print(sc.textFile(os.path.join(os.environ.get('SPARK_HOME'), 'LICENSE')).take(5))  # reading the local file directly works fine when running locally" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### RDDs support simple statistical actions" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "collapsed": false 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "print(rdd.stats())\n", 412 | "print(rdd.count())\n", 413 | "print(rdd.sum())\n", 414 | "print(rdd.mean())\n", 415 | "print(rdd.stdev(), rdd.sampleStdev())\n", 416 | "print(rdd.variance(), rdd.sampleVariance())\n", 417 | "print(rdd.min(), rdd.max())\n", 418 | "print(rdd.histogram(5))" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "### RDDs support set transformations" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": { 432 | "collapsed": false 433 | }, 434 | "outputs": [], 435 | "source": [ 436 | "rddB = sc.parallelize(range(0, 26, 2))\n", 437 | "print(rdd.union(rddB).collect()) # or: rdd + rddB\n", 438 | "print(rdd.union(rddB).distinct().collect())\n", 439 | "print(rdd.intersection(rddB).collect())\n", 440 | "print(rdd.subtract(rddB).collect())\n", 441 | "print(rdd.cartesian(rddB).collect())" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "### Spark supports cluster-wide shared variables" 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": null, 454 | "metadata": { 455 | "collapsed": false 456 | }, 457 | "outputs": [], 458 | "source": [ 459 | "# A broadcast variable is copied to each machine only once, in an efficient manner.\n", 460 | "# It is useful whenever every node needs the same data, and especially so if the data is\n", 461 | "# large and would otherwise be sent across the network multiple times.\n", 462 | "broadcastVar = sc.broadcast({'CA': 'California', 'NL': 'Netherlands'})\n", 463 | "print(broadcastVar.value)\n", 464 | "\n", 465 | "# An accumulator is a shared variable that lives on the driver, and to\n", 466 | "# which each task can add values.
(Basically, it's a simple reducer.)\n", 467 | "accu = sc.accumulator(0)\n", 468 | "# 'foreach' just applies a function to each RDD element without returning anything\n", 469 | "rdd.foreach(lambda x: accu.add(x))\n", 470 | "print(accu.value)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "### Spark allows for customizable partitioning and parallelization" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": null, 483 | "metadata": { 484 | "collapsed": true 485 | }, 486 | "outputs": [], 487 | "source": [ 488 | "# Defaults for parallelism and partitioning are set via the SparkConf that the SparkContext was created with,\n", 489 | "# e.g. conf.set('spark.default.parallelism', '4'); the resulting defaults can be inspected on the context:\n", 490 | "print(sc.defaultParallelism, sc.defaultMinPartitions)\n", 491 | "\n", 492 | "# Also, aggregations such as 'reduceByKey' allow specifying the level of parallelism manually\n", 493 | "rddP1.reduceByKey(lambda x, y: x + y, 10)\n", 494 | "\n", 495 | "# Also, a pair RDD can be repartitioned by its keys into a given number of partitions\n", 496 | "rddRepart = rddP1.partitionBy(100)\n", 497 | "\n", 498 | "# Finally, there are some methods for manual re-partitioning of RDDs.\n", 499 | "# (Warning: these can be expensive, but in certain cases very useful.)\n", 500 | "\n", 501 | "# Efficient downscaling of partitions is done with 'coalesce'\n", 502 | "rddRepart = rdd.coalesce(2)\n", 503 | "# And 'repartition' does a full reshuffle into the requested number of partitions\n", 504 | "rddRepart = rdd.repartition(10)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "A more complicated use of `partitionBy` is to predefine how an RDD is partitioned, and to partition other RDDs in exactly the same way. This way, we avoid shuffling the entire data set during, e.g., a `join` operation between these data sets. Depending on the application, this can translate into significant speedups."
512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": { 518 | "collapsed": false 519 | }, 520 | "outputs": [], 521 | "source": [ 522 | "rddP1 = sc.parallelize(range(14)).map(lambda x: (x % 2 == 0, x)).partitionBy(2)\n", 523 | "rddP2 = sc.parallelize(range(0, 28, 2)).map(lambda x: (x % 2 == 0, x)).partitionBy(2)\n", 524 | "\n", 525 | "print(rddP1.glom().collect())\n", 526 | "print(rddP2.glom().collect())\n", 527 | "\n", 528 | "# Now this 'join' does not require a full shuffle of the data,\n", 529 | "# since both RDDs are partitioned identically and each key's data is already co-located\n", 530 | "rddJ = rddP1.join(rddP2)\n", 531 | "# If we also want to keep 'rddJ' in the same partitioning, we have to specify it again\n", 532 | "rddJ = rddJ.partitionBy(2)\n", 533 | "print(rddJ.glom().collect())" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "Pitfalls\n", 541 | "--------" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "### Not caching intermediate results that are re-used later on" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": { 555 | "collapsed": false 556 | }, 557 | "outputs": [], 558 | "source": [ 559 | "print(\"Not so great:\")\n", 560 | "rddBigTrans = rddBig.map(lambda x: (x ** 2 - 0.1) ** 0.5)\n", 561 | "for threshold in (0.2, 0.4, 0.6, 0.8):\n", 562 | " %timeit -n 1 -r 1 rddBigTrans.filter(lambda x: x >= threshold).count()\n", 563 | "\n", 564 | "print(\"Better:\")\n", 565 | "rddBigTrans_c = rddBig.map(lambda x: (x ** 2 - 0.1) ** 0.5).cache()\n", 566 | "for threshold in (0.2, 0.4, 0.6, 0.8):\n", 567 | " %timeit -n 1 -r 1 rddBigTrans_c.filter(lambda x: x >= threshold).count()" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "### Not considering when and how data is transferred through the cluster\n", 575 | "Keep in mind that Spark is a distributed computing framework, and that transferring data over the network within a cluster should be avoided (network bandwidth is ~100 times more expensive than memory bandwidth)."
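, "\n", "A related aside (a minimal sketch using the `rddBigTrans` defined earlier): pulling a large RDD back to the driver with `collect()` is itself an expensive network transfer. Prefer `take(n)` for a quick look, or let the workers write results out themselves:\n", "\n", "    rddBigTrans.take(10)                          # ships only 10 elements to the driver\n", "    # rddBigTrans.saveAsTextFile('/tmp/output')   # each worker writes its own partitions ('/tmp/output' is just an example path)\n"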
576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": { 582 | "collapsed": true 583 | }, 584 | "outputs": [], 585 | "source": [ 586 | "# groupByKey triggers a shuffle so a lot of data is copied over the network\n", 587 | "sumPerKey = rddP1.groupByKey().mapValues(lambda x: sum(x)).collect()\n", 588 | "\n", 589 | "# Better: reduceByKey reduces locally before shuffling\n", 590 | "sumPerKey = rddP1.reduceByKey(lambda x, y: x + y).collect()" 591 | ] 592 | }, 593 | { 594 | "cell_type": "markdown", 595 | "metadata": {}, 596 | "source": [ 597 | "### Not working with an appropriate number of partitions" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": null, 603 | "metadata": { 604 | "collapsed": true 605 | }, 606 | "outputs": [], 607 | "source": [ 608 | "# Not enough partitions results in bad concurrency in the cluster.\n", 609 | "# It also puts pressure on the memory for certain operations.\n", 610 | "\n", 611 | "# On the other hand, suppose an RDD is distributed over 1000 partitions, but we work only on a small\n", 612 | "# subset of the data in the RDD, e.g.:\n", 613 | "rddF = rdd.filter(lambda x: x < 3).map(lambda x: x ** 2)\n", 614 | "\n", 615 | "# We are then effectively creating many empty tasks, and using coalesce or repartition\n", 616 | "# to create an RDD with fewer partitions would be beneficial:\n", 617 | "rddF = rdd.filter(lambda x: x < 3).coalesce(10).map(lambda x: x ** 2)" 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "### Using map with high overhead per element; better to use mapPartitions" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": { 631 | "collapsed": true 632 | }, 633 | "outputs": [], 634 | "source": [ 635 | "# For example, opening and closing a database connection takes time.\n", 636 | "def db_operation(x):\n", 637 | " # Open connection to DB\n", 638 | " # Do something with an element\n", 639 | " # Close DB connection\n", 640 | " pass\n", 641 | "\n", 642 | "# Especially so if you repeat it for every element:\n", 643 | "rdd.map(db_operation)\n", 644 | "\n", 645 | "# Better: do this at the level of a partition rather than per element.\n", 646 | "def vectorized_db_operation(x):\n", 647 | " # Open connection to DB\n", 648 | " # Do something with an array of elements\n", 649 | " # Close DB connection\n", 650 | " return []  # mapPartitions expects an iterable of results\n", 651 | "\n", 652 | "result = rdd.mapPartitions(vectorized_db_operation)" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": {}, 658 | "source": [ 659 | "### Sending a lot of data along with a function call to each element" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "metadata": { 666 | "collapsed": true 667 | }, 668 | "outputs": [], 669 | "source": [ 670 | "bigData = np.random.random(int(1e6))\n", 671 | "\n", 672 | "def myFunc(x):\n", 673 | " return x * np.random.choice(bigData)\n", 674 | "\n", 675 | "# this would ship bigData along with every task that calls myFunc\n", 676 | "rdd.map(myFunc)\n", 677 | "\n", 678 | "# Better: turn the big data into a read-only broadcast variable, which is copied to each worker only once\n", 679 | "bigDataBC = sc.broadcast(bigData)\n", 680 | "rdd.map(lambda x: x * np.random.choice(bigDataBC.value))" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "Code examples\n", 688 | "-------------" 689 | ] 690 | }, 691 | { 692 | "cell_type": "markdown", 693 | "metadata": {}, 694 |
"source": [ 695 | "### Simple scikit-learn" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "metadata": { 702 | "collapsed": true 703 | }, 704 | "outputs": [], 705 | "source": [ 706 | "from sklearn.cross_validation import train_test_split, ShuffleSplit\n", 707 | "from sklearn.datasets import make_regression\n", 708 | "from sklearn import pipeline\n", 709 | "from sklearn.linear_model import Ridge\n", 710 | "from sklearn.preprocessing import StandardScaler\n", 711 | "\n", 712 | "N = 10000 # number of data points\n", 713 | "D = 100 # number of dimensions\n", 714 | "\n", 715 | "X, y = make_regression(n_samples=N, n_features=D, n_informative=int(D*0.1),\n", 716 | " n_targets=1, bias=-6., noise=50., random_state=42)\n", 717 | "X_train, X_test, y_train, y_test = train_test_split(X, y)\n", 718 | "\n", 719 | "# partition the data into random subsamples\n", 720 | "samples = sc.parallelize(ShuffleSplit(y_train.size, n_iter=8))\n", 721 | "reg_model = pipeline.Pipeline([(\"scaler\", StandardScaler()), (\"ridge\", Ridge())])\n", 722 | "# train a model for each subsample and apply it to the test set\n", 723 | "mean_rsq = samples.map(\n", 724 | " lambda (index, _): reg_model.fit(X[index], y[index]).score(X_test, y_test)).mean()\n", 725 | "print(mean_rsq)" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "metadata": {}, 731 | "source": [ 732 | "### Stochastic gradient descent using scikit-learn (from: https://gist.github.com/MLnick/4707012)\n", 733 | "Each partition is a mini-batch for the SGD, uses average weights." 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": null, 739 | "metadata": { 740 | "collapsed": true 741 | }, 742 | "outputs": [], 743 | "source": [ 744 | "from sklearn import linear_model as lm\n", 745 | "from sklearn.base import copy\n", 746 | "\n", 747 | "N = 10000 # Number of data points\n", 748 | "D = 10 # Numer of dimensions\n", 749 | "ITERATIONS = 5\n", 750 | "np.random.seed(seed=42)\n", 751 | "\n", 752 | "def generate_data(N):\n", 753 | " return [[[1] if np.random.rand() < 0.5 else [0], np.random.randn(D)]\n", 754 | " for _ in range(N)]\n", 755 | "\n", 756 | "def train(iterator, sgd):\n", 757 | " for x in iterator:\n", 758 | " sgd.partial_fit(x[1], x[0], classes=np.array([0, 1]))\n", 759 | " yield sgd\n", 760 | "\n", 761 | "def merge(left, right):\n", 762 | " new = copy.deepcopy(left)\n", 763 | " new.coef_ += right.coef_\n", 764 | " new.intercept_ += right.intercept_\n", 765 | " return new\n", 766 | "\n", 767 | "def avg_model(sgd, slices):\n", 768 | " sgd.coef_ /= slices\n", 769 | " sgd.intercept_ /= slices\n", 770 | " return sgd\n", 771 | "\n", 772 | "slices = 4\n", 773 | "data = generate_data(N)\n", 774 | "print(len(data))\n", 775 | "\n", 776 | "# init stochastic gradient descent\n", 777 | "sgd = lm.SGDClassifier(loss='log')\n", 778 | "# training\n", 779 | "for ii in range(ITERATIONS):\n", 780 | " sgd = sc.parallelize(data, numSlices=slices) \\\n", 781 | " .mapPartitions(lambda x: train(x, sgd)) \\\n", 782 | " .reduce(lambda x, y: merge(x, y))\n", 783 | " # averaging weight vector => iterative parameter mixtures\n", 784 | " sgd = avg_model(sgd, slices)\n", 785 | " print(\"Iteration %d:\" % (ii + 1))\n", 786 | " print(\"Model: \")\n", 787 | " print(sgd.coef_)\n", 788 | " print(sgd.intercept_)" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "The Spark universe\n", 796 | "------------------\n", 797 | "\n", 798 | "Other interesting tools for Spark:\n", 
799 | "\n", 800 | "- Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html\n", 801 | "- MLlib, Spark's machine learning library: http://spark.apache.org/docs/latest/mllib-guide.html\n", 802 | "- Spark Streaming, for streaming data applications: http://spark.apache.org/docs/latest/streaming-programming-guide.html" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "More information\n", 810 | "----------------" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": {}, 816 | "source": [ 817 | "### Documentation\n", 818 | "\n", 819 | "Spark documentation: https://spark.apache.org/docs/latest/index.html\n", 820 | "\n", 821 | "Spark programming guide: http://spark.apache.org/docs/latest/programming-guide.html\n", 822 | "\n", 823 | "PySpark documentation: https://spark.apache.org/docs/latest/api/python/index.html\n", 824 | "\n", 825 | "### Books\n", 826 | "\n", 827 | "Learning Spark: http://shop.oreilly.com/product/0636920028512.do\n", 828 | "\n", 829 | "(preview: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/)\n", 830 | "\n", 831 | "### Talks (recommended to watch them in this order)\n", 832 | "\n", 833 | "Parallel programming with Spark: https://www.youtube.com/watch?v=7k4yDKBYOcw\n", 834 | "\n", 835 | "Advanced Spark features: https://www.youtube.com/watch?v=w0Tisli7zn4\n", 836 | "\n", 837 | "PySpark: Python API for Spark: https://www.youtube.com/watch?v=xc7Lc8RA8wE\n", 838 | "\n", 839 | "Understanding Spark performance: https://www.youtube.com/watch?v=NXp3oJHNM7E\n", 840 | "\n", 841 | "A deeper understanding of Spark's internals: https://www.youtube.com/watch?v=dmL0N3qfSc8" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "metadata": { 848 | "collapsed": true 849 | }, 850 | "outputs": [], 851 | "source": [] 852 | } 853 | ], 854 | "metadata": { 855 | "kernelspec": { 856 | "display_name": "Python 3", 857 | "language": "python", 858 | "name": "python3" 859 | }, 860 | "language_info": { 861 | "codemirror_mode": { 862 | "name": "ipython", 863 | "version": 3 864 | }, 865 | "file_extension": ".py", 866 | "mimetype": "text/x-python", 867 | "name": "python", 868 | "nbconvert_exporter": "python", 869 | "pygments_lexer": "ipython3", 870 | "version": "3.4.3" 871 | } 872 | }, 873 | "nbformat": 4, 874 | "nbformat_minor": 0 875 | } 876 | --------------------------------------------------------------------------------