├── sotu
│   ├── ReadMe.txt
│   └── 1790-1796-GW.txt
├── README.md
└── Scala-2-Text-Analytics.ipynb

/sotu/ReadMe.txt:
--------------------------------------------------------------------------------
Collected & curated from http://stateoftheunion.onetwothree.net/texts/index.html
September 25, 2014
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# sparknotebook

## IMPORTANT

I am in the process of removing [IScala](https://github.com/mattpap/IScala) as its development appears stalled. I'm replacing it with [jupyter-scala](https://github.com/alexarchambault/jupyter-scala). __However, jupyter-scala doesn't build for Scala 2.10 yet. Spark requires Scala 2.10.__

This project contains samples of Jupyter notebooks running [Spark](https://spark.apache.org/). One notebook, _2-Text-Analytics.ipynb_, is written in Python. The second, _Scala-2-Text-Analytics.ipynb_, is in Scala. The dataset and the most excellent _2-Text-Analytics.ipynb_ are originally from [https://github.com/xsankar/cloaked-ironman](https://github.com/xsankar/cloaked-ironman).

Just open each notebook to see how Spark is instantiated and used.
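If you want a feel for the notebooks before opening them, here is a minimal sketch of the word-count pattern the Scala notebook applies to each State of the Union file. It assumes a Spark 1.x REPL or notebook session with Spark already on the classpath; the `clean` helper, the `local[*]` context, and `sotu/1790-1796-GW.txt` come from the notebook and repo, while the restructured pipeline is my own simplification:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Run Spark locally, using all available cores.
val sc = new SparkContext("local[*]", "Intro")

// Strip a single trailing period or comma, as the notebook's clean() does.
def clean(s: String): String =
  if (s.endsWith(".") || s.endsWith(",")) s.substring(0, s.size - 1) else s

// Classic word count: tokenize, normalize, pair each word with a 1,
// then sum the 1s per distinct word.
val wordCountGW = sc.textFile("sotu/1790-1796-GW.txt")
  .flatMap(_.split(" "))
  .map(_.toLowerCase.trim)
  .map(clean)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCountGW.count() // number of distinct (cleaned) words
```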
## Python Set-up

To run the Python notebook, you will need to:

1. Install IPython and the IPython notebook. For simplicity, I am just using the free [Anaconda Python distribution from Continuum Analytics](http://continuum.io/downloads).
2. [Download and install a Spark distribution](https://spark.apache.org/downloads.html). The download includes the `pyspark` script that you need to launch Python with Spark.

For best results, `cd` into this project's root directory before starting IPython. The command to start the IPython notebook is:

```bash
IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" pyspark
```

__NOTE__: Sometimes when running Spark on Java 7 you may get a `java.net.UnknownHostException`. I have not yet seen this on Java 8. If this happens to you, you can resolve it by setting the __SPARK_LOCAL_IP__ environment variable to `127.0.0.1` before launching Spark. For example:

```bash
SPARK_LOCAL_IP=127.0.0.1 IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" pyspark
```

## Scala Set-up

To run the Scala notebook, you will need to:

1. Install [jupyter-scala](https://github.com/alexarchambault/jupyter-scala):
   ```
   git clone https://github.com/alexarchambault/jupyter-scala.git
   cd jupyter-scala
   sbt cli/packArchive
   sbt publishM2
   # unpack cli/target/jupyter-scala_2.11.6-0.2.0-SNAPSHOT.zip
   cd cli/target/jupyter-scala_2.11.6-0.2.0-SNAPSHOT/bin
   ./jupyter-scala
   ```
2. Start Jupyter (formerly IPython):
   ```
   ipython notebook
   # When the notebook starts you may need to select the "Scala 2.11" kernel
   ```

If your notebook crashes with an `OutOfMemoryError`, you can increase the amount of memory available with the `-Xmx` flag (e.g. `-Xmx2g` and `-Xmx2048m` both allocate 2GB of memory for the JVM):

```bash
SBT_OPTS=-Xmx2048m ipython notebook --profile "Scala 2.11"
```

As with the Python example, if you get a `java.net.UnknownHostException` when starting IPython, use the following command:

```bash
SPARK_LOCAL_IP=127.0.0.1 SBT_OPTS=-Xmx2048m ipython notebook --profile "Scala 2.11"
```

__NOTE:__ For the Scala notebook, you do __not__ need to download and install Spark. The Spark dependencies are managed via sbt, which runs under the hood of the Scala notebook.
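Concretely, that dependency loading is what the first cell of _Scala-2-Text-Analytics.ipynb_ does via jupyter-scala's in-notebook `load` API. The snippet below reproduces that cell with comments added; note that the second resolver points at the original author's local Maven repository, so you would substitute your own path:

```scala
// Register resolvers, then fetch the Spark assembly via Ivy.
load.resolver("Maven Central" at "https://repo1.maven.org/maven2/")
// "file:/Users/brian/.m2/repository" is the notebook author's local
// Maven repo; point this at your own ~/.m2/repository.
load.resolver("Maven local repo" at "file:/Users/brian/.m2/repository")
load.ivy("org.apache.spark" %% "spark-assembly" % "1.1.0")
```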
--------------------------------------------------------------------------------
/Scala-2-Text-Analytics.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ":: problems summary ::\n",
      ":::: WARNINGS\n",
      "\tUnable to reparse com.github.alexarchambault.jupyter#jupyter-scala-api_2.11.6;0.2.0-SNAPSHOT from sonatype-snapshots, using Fri May 22 01:44:01 PDT 2015\n",
      "\n",
      "\tproblem while downloading module descriptor: /Users/brian/.m2/repository/com/github/alexarchambault/jupyter/jupyter-scala-api_2.11.6/0.2.0-SNAPSHOT/jupyter-scala-api_2.11.6-0.2.0-SNAPSHOT.pom: no protocol: /Users/brian/.m2/repository/com/github/alexarchambault/jupyter/jupyter-scala-api_2.11.6/0.2.0-SNAPSHOT/jupyter-scala-api_2.11.6-0.2.0-SNAPSHOT.pom (2ms)\n",
      "\n",
      "\tChoosing sonatype-snapshots for com.github.alexarchambault.jupyter#jupyter-scala-api_2.11.6;0.2.0-SNAPSHOT\n",
      "\n",
      "\tUnable to reparse com.github.alexarchambault#ammonite-api_2.11.6;0.3.1-SNAPSHOT from sonatype-snapshots, using Fri May 22 01:13:14 PDT 2015\n",
      "\n",
      "\tChoosing sonatype-snapshots for com.github.alexarchambault#ammonite-api_2.11.6;0.3.1-SNAPSHOT\n",
      "\n",
      "\tUnable to reparse com.github.alexarchambault.jupyter#jupyter-api_2.11;0.2.0-SNAPSHOT from sonatype-snapshots, using Thu May 21 19:38:28 PDT 2015\n",
      "\n",
      "\tChoosing sonatype-snapshots for com.github.alexarchambault.jupyter#jupyter-api_2.11;0.2.0-SNAPSHOT\n",
      "\n",
      "\t\tmodule not found: org.apache.spark#spark-assembly_2.11;1.1.0\n",
      "\n",
      "\t==== local: tried\n",
      "\n",
      "\t  /Users/brian/.ivy2/local/org.apache.spark/spark-assembly_2.11/1.1.0/ivys/ivy.xml\n",
      "\n",
      "\t==== public: tried\n",
      "\n",
      "\t  https://repo1.maven.org/maven2/org/apache/spark/spark-assembly_2.11/1.1.0/spark-assembly_2.11-1.1.0.pom\n",
      "\n",
      "\t==== sonatype-snapshots: tried\n",
      "\n",
      "\t  https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-assembly_2.11/1.1.0/spark-assembly_2.11-1.1.0.pom\n",
      "\n",
      "\t==== Maven Central: tried\n",
      "\n",
      "\t  https://repo1.maven.org/maven2/org/apache/spark/spark-assembly_2.11/1.1.0/spark-assembly_2.11-1.1.0.pom\n",
      "\n",
      "\t==== Maven Central: tried\n",
      "\n",
      "\t  https://repo1.maven.org/maven2/org/apache/spark/spark-assembly_2.11/1.1.0/spark-assembly_2.11-1.1.0.pom\n",
      "\n",
      "\t==== Maven Central: tried\n",
      "\n",
      "\t  https://repo1.maven.org/maven2/org/apache/spark/spark-assembly_2.11/1.1.0/spark-assembly_2.11-1.1.0.pom\n",
      "\n",
      "\t==== Maven local repo: tried\n",
      "\n",
      "\t  file:/Users/brian/.m2/repository/org/apache/spark/spark-assembly_2.11/1.1.0/spark-assembly_2.11-1.1.0.pom\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "load.resolver(\"Maven Central\" at \"https://repo1.maven.org/maven2/\")\n",
    "load.resolver(\"Maven local repo\" at \"file:/Users/brian/.m2/repository\")\n",
    "load.ivy(\"org.apache.spark\" %% \"spark-assembly\" % \"1.1.0\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:41: object apache is not a member of package org",
      " import org.apache.spark.SparkContext ; import org.apache.spark.SparkContext._ ; val sc = { () =>",
      " ^\u001b[0m",
      "\u001b[31mMain.scala:41: object apache is not a member of package org",
      " import org.apache.spark.SparkContext ; import org.apache.spark.SparkContext._ ; val sc = { () =>",
      " ^\u001b[0m",
      "\u001b[31mMain.scala:42: not found: type SparkContext",
      " new SparkContext(\"local[*]\", \"Intro\") ",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "import org.apache.spark.SparkContext\n",
    "import org.apache.spark.SparkContext._\n",
    "val sc = new SparkContext(\"local[*]\", \"Intro\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "defined \u001b[32mfunction \u001b[36mclean\u001b[0m"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def clean(s: String): String = if (s.endsWith(\".\") || \n",
    "  s.endsWith(\",\")) s.substring(0, s.size - 1) else s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:44: not found: value sc",
      " sc.textFile(\"sotu/2009-2014-BO.txt\") ",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "var lines = sc.textFile(\"sotu/2009-2014-BO.txt\")\n",
    "val wordCountBO = lines\n",
    "  .flatMap(_.split(\" \")\n",
    "  .map(_.toLowerCase.trim)\n",
    "  .map(clean)\n",
    "  .map(word => (word, 1)))\n",
    "  .reduceByKey(_ + _)\n",
    "wordCountBO.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:44: not found: value lines",
      "lines = sc.textFile(\"sotu/2001-2008-GWB.txt\")",
      "^\u001b[0m",
      "\u001b[31mMain.scala:46: not found: value lines",
      " lines",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "lines = sc.textFile(\"sotu/2001-2008-GWB.txt\")\n",
    "val wordCountGWB = lines\n",
    "  .flatMap(_.split(\" \")\n",
    "  .map(_.toLowerCase.trim)\n",
    "  .map(clean)\n",
    "  .map(word => (word, 1)))\n",
    "  .reduceByKey(_ + _)\n",
    "wordCountGWB.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:44: not found: value lines",
      "lines = sc.textFile(\"sotu/1994-2000-WJC.txt\")",
      "^\u001b[0m",
      "\u001b[31mMain.scala:46: not found: value lines",
      " lines",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "lines = sc.textFile(\"sotu/1994-2000-WJC.txt\")\n",
    "val wordCountWJC = lines\n",
    "  .flatMap(_.split(\" \")\n",
    "  .map(_.toLowerCase.trim)\n",
    "  .map(clean)\n",
    "  .map(word => (word, 1)))\n",
    "  .reduceByKey(_ + _)\n",
    "wordCountWJC.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:44: not found: value lines",
      "lines = sc.textFile(\"sotu/1961-1963-JFK.txt\")",
      "^\u001b[0m",
      "\u001b[31mMain.scala:46: not found: value lines",
      " lines",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "lines = sc.textFile(\"sotu/1961-1963-JFK.txt\")\n",
    "val wordCountJFK = lines\n",
    "  .flatMap(_.split(\" \")\n",
    "  .map(_.toLowerCase.trim)\n",
    "  .map(clean)\n",
    "  .map(word => (word, 1)))\n",
    "  .reduceByKey(_ + _)\n",
    "wordCountJFK.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:44: not found: value lines",
      "lines = sc.textFile(\"sotu/1934-1945-FDR.txt\")",
      "^\u001b[0m",
      "\u001b[31mMain.scala:46: not found: value lines",
      " lines",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "lines = sc.textFile(\"sotu/1934-1945-FDR.txt\")\n",
    "val wordCountFDR = lines\n",
    "  .flatMap(_.split(\" \")\n",
    "  .map(_.toLowerCase.trim)\n",
    "  .map(clean)\n",
    "  .map(word => (word, 1)))\n",
    "  .reduceByKey(_ + _)\n",
    "wordCountFDR.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:44: not found: value lines",
      "lines = sc.textFile(\"sotu/1861-1864-AL.txt\")",
      "^\u001b[0m",
      "\u001b[31mMain.scala:46: not found: value lines",
      " lines",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "lines = sc.textFile(\"sotu/1861-1864-AL.txt\")\n",
    "val wordCountAL = lines\n",
    "  .flatMap(_.split(\" \")\n",
    "  .map(_.toLowerCase.trim)\n",
    "  .map(clean)\n",
    "  .map(word => (word, 1)))\n",
    "  .reduceByKey(_ + _)\n",
    "wordCountAL.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Compilation Failed",
      "\u001b[31mMain.scala:44: not found: value lines",
      "lines = sc.textFile(\"sotu/1790-1796-GW.txt\")",
      "^\u001b[0m",
      "\u001b[31mMain.scala:46: not found: value lines",
      " lines",
      " ^\u001b[0m"
     ]
    }
   ],
   "source": [
    "lines = sc.textFile(\"sotu/1790-1796-GW.txt\")\n",
    "val wordCountGW = lines\n",
    "  .flatMap(_.split(\" \")\n",
    "  .map(_.toLowerCase.trim)\n",
    "  .map(clean)\n",
    "  .map(word => (word, 1)))\n",
    "  .reduceByKey(_ + _)\n",
    "wordCountGW.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\u001b[36mcommonWords\u001b[0m: \u001b[32mscala.Array[java.lang.String]\u001b[0m = \u001b[33mArray\u001b[0m(\n",
       "  \u001b[32m\"us\"\u001b[0m,\n",
       "  \u001b[32m\"has\"\u001b[0m,\n",
       "  \u001b[32m\"all\"\u001b[0m,\n",
       "  \u001b[32m\"they\"\u001b[0m,\n",
       "  \u001b[32m\"from\"\u001b[0m,\n",
       "  \u001b[32m\"who\"\u001b[0m,\n",
       "  \u001b[32m\"what\"\u001b[0m,\n",
       "  \u001b[32m\"on\"\u001b[0m,\n",
       "  \u001b[32m\"by\"\u001b[0m,\n",
       "  \u001b[32m\"more\"\u001b[0m,\n",
       "  \u001b[32m\"as\"\u001b[0m,\n",
       "  \u001b[32m\"not\"\u001b[0m,\n",
       "  \u001b[32m\"their\"\u001b[0m,\n",
       "  \u001b[32m\"can\"\u001b[0m,\n",
       "\u001b[33m...\u001b[0m"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "val commonWords = Array(\"us\",\"has\",\"all\", \"they\", \"from\", \"who\",\"what\",\"on\",\"by\",\"more\",\"as\",\"not\",\"their\",\"can\",\n",
    "  \"new\",\"it\",\"but\",\"be\",\"are\",\"--\",\"i\",\"have\",\"this\",\"will\",\"for\",\"with\",\"is\",\"that\",\"in\",\n",
    "  \"our\",\"we\",\"a\",\"of\",\"to\",\"and\",\"the\",\"that's\",\"or\",\"make\",\"do\",\"you\",\"at\",\"it\\'s\",\"than\",\n",
    "  \"if\",\"know\",\"last\",\"about\",\"no\",\"just\",\"now\",\"an\",\"because\",\"we\",\"why\",\"we\\'ll\",\"how\",\n",
    "  \"two\",\"also\",\"every\",\"come\",\"we've\",\"year\",\"over\",\"get\",\"take\",\"one\",\"them\",\"we\\'re\",\"need\",\n",
    "  \"want\",\"when\",\"like\",\"most\",\"-\",\"been\",\"first\",\"where\",\"so\",\"these\",\"they\\'re\",\"good\",\"would\",\n",
    "  \"there\",\"should\",\"-->\",\"