├── Part2 ├── readme.md ├── .DS_Store ├── BiosparkPart2.jpg └── Week1_20150820 │ └── ML_lab1_review_student.ipynb ├── Part1 ├── .DS_Store ├── Week3 │ ├── .DS_Store │ ├── Week3Lec5.pdf │ ├── Week3Lec6.pdf │ └── biospark.ipynb ├── Week2 │ └── Week2_Apache Spark.pptx ├── .ipynb_checkpoints │ ├── Untitled-checkpoint.ipynb │ └── Untitled1-checkpoint.ipynb ├── Tutorials │ └── lab1_word_count_student.ipynb └── Week4 │ └── lab1_word_count_student_Answer_CS_20150730.ipynb └── README.md /Part2/readme.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Part1/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biospin/biospark/HEAD/Part1/.DS_Store -------------------------------------------------------------------------------- /Part2/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biospin/biospark/HEAD/Part2/.DS_Store -------------------------------------------------------------------------------- /Part1/Week3/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biospin/biospark/HEAD/Part1/Week3/.DS_Store -------------------------------------------------------------------------------- /Part2/BiosparkPart2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biospin/biospark/HEAD/Part2/BiosparkPart2.jpg -------------------------------------------------------------------------------- /Part1/Week3/Week3Lec5.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biospin/biospark/HEAD/Part1/Week3/Week3Lec5.pdf -------------------------------------------------------------------------------- 
/Part1/Week3/Week3Lec6.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biospin/biospark/HEAD/Part1/Week3/Week3Lec6.pdf -------------------------------------------------------------------------------- /Part1/Week2/Week2_Apache Spark.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biospin/biospark/HEAD/Part1/Week2/Week2_Apache Spark.pptx -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # biospark 2 | bio + Python + Spark ! 3 | 4 | 5 | # Study website (information & curriculum & materials) 6 | * http://biospin.github.io/biospark/ 7 | 8 | -------------------------------------------------------------------------------- /Part1/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 0 6 | } 7 | -------------------------------------------------------------------------------- /Part1/.ipynb_checkpoints/Untitled1-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 0 6 | } 7 | -------------------------------------------------------------------------------- /Part1/Week3/biospark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#Semi-structured and structured data\n", 8 | "\n" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "##Lecture index\n", 16 | " - the structure spectrum\n", 17 | " - files: formats and performance\n", 18 | " - Tabular Data: Examples, Challenges, pySpark 
DataFrames\n", 19 | " - Log files" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "##Key Data Management Concepts\n", 27 | " - Data Model: a collection of concepts for describing data\n", 28 | " - Schema: a description of a particular collection of data, using a given data model" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "##Structure Spectrum\n", 36 | "![spectrum](http://postfiles4.naver.net/20150716_179/valtin_1437027008345FOLW6_JPEG/1.jpg?type=w3)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## What is a file?\n", 44 | " - A named sequence of bytes (typically stored as a collection of pages)\n", 45 | " - A filesystem is a collection of files organized within a hierarchical namespace; it\n", 46 | " - is responsible for laying out the bytes on physical media,\n", 47 | " - stores file metadata, and\n", 48 | " - provides an API for interacting with files\n", 49 | " - standard operations\n", 50 | " * open() / close()\n", 51 | " * seek()\n", 52 | " * read() / write()\n", 53 | " - Filesystem root\n", 54 | " * Linux: / , Windows: \\\n", 55 | " - Files and directories have associated permissions, but files are not always arranged hierarchically\n", 56 | " * Content-addressable storage (CAS)\n", 57 | " * Often used for large multimedia collections\n", 58 | " \n", 59 | "### Considerations for a file format\n", 60 | " - Data model: tabular, hierarchical, array\n", 61 | " - physical layout\n", 62 | " - Field units and validation\n", 63 | " - metadata header, side file, specification, other?\n", 64 | " - Plain text or binary\n", 65 | " - Delimiters and escaping\n", 66 | " - Compression, encryption, checksums?\n", 67 | " - Schema evolution" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "##Semi-structured Tabular Data\n", 75 | " - One of the most commonly used data formats\n", 76 | " - A table is a collection of rows and columns\n", 77 | " - Each row has an index, and each column has a name\n", 78 | " - A cell is identified by an (index, name) pair\n", 79 | " - A cell may or may not have a value\n", 80 | " - The type of each cell is inferred from its value\n", 81 | " " 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | 
"metadata": {}, 87 | "source": [ 88 | "## Tabular Data Example\n", 89 | " * Fortune 500 companies\n", 90 | "![fortune](http://postfiles8.naver.net/20150716_167/valtin_1437031334286lSBKm_JPEG/2.jpg?type=w3)\n", 91 | " http://fortune.com/fortune500/\n", 92 | " \n", 93 | " The same example converted to CSV\n", 94 | " \n", 95 | " ![fortune CSV](http://postfiles15.naver.net/20150716_110/valtin_1437031334495hn49L_JPEG/3.jpg?type=w3)\n", 96 | " \n", 97 | " Another example: the Protein Data Bank\n", 98 | " \n", 99 | " ![PDB](http://postfiles14.naver.net/20150716_29/valtin_1437031334708nkApL_JPEG/4.jpg?type=w3)\n", 100 | " http://www.rcsb.org/pdb/files/3J2T.pdb" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "##Challenges that can be inferred from the examples above\n", 108 | " 1. Tabular data\n", 109 | " - The format is not well defined\n", 110 | " - Values may be missing\n", 111 | " - Types cannot always be inferred accurately (e.g., \"2\" vs. \"2.0\")\n", 112 | " - There is no support for versioning the format\n", 113 | " 2. From multiple sources\n", 114 | " * Fields may be missing (not every source provides the same data)\n", 115 | " * Data types may be inconsistent\n", 116 | " * The same entity may have inconsistent values (e.g., Wal-Mart vs. WalMart)\n", 117 | " 3. 
From sensors\n", 118 | " * Fields may be missing\n", 119 | " * Sensors may be damaged\n", 120 | " * Timestamps may be inaccurate\n", 121 | " * Other metadata may contain errors\n", 122 | " * Sensors may go offline for a while" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "##pandas: Python Data Analysis Library\n", 130 | " * Open-source library for data analysis and modeling\n", 131 | " - An alternative to using R\n", 132 | " * Pandas DataFrame: a table with named columns\n", 133 | " - The most commonly used pandas object\n", 134 | " - Represented as a Python dict (Column_name -> Series)\n", 135 | " * Series: 1-D labeled array capable of holding any data type\n", 136 | " - R has a similar data frame type\n", 137 | " \n", 138 | "##Semi-Structured Data in pySpark\n", 139 | " * DataFrames were introduced in Spark 1.3 as an extension of RDDs\n", 140 | " * A distributed collection of data organized into named columns\n", 141 | " - Equivalent to pandas and R DataFrames, but distributed\n", 142 | " * Column types can be inferred from the values\n", 143 | " \n", 144 | "##pySpark and Pandas DataFrames\n", 145 | " * It is easy to convert between pandas and pySpark DataFrames\n", 146 | " \n", 147 | " - Convert a Spark DataFrame to pandas\n", 148 | " * pandas_df = spark_df.toPandas()\n", 149 | " - Create a Spark DataFrame from pandas\n", 150 | " * spark_df = context.createDataFrame(pandas_df)\n", 151 | "\n", 152 | "## pySpark DataFrame Performance\n", 153 | " * On a single machine, pySpark performance is about 5x higher\n", 154 | " ![performance](http://postfiles11.naver.net/20150716_154/valtin_1437034010888cz8Sj_JPEG/5.jpg?type=w3)\n", 155 | " https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html\n", 156 | " \n", 157 | "##Semi-Structured Log files\n", 158 | " * Generated by printf statements on the server\n", 159 | " * Human-readable text files\n", 160 | " * Format published or “defined” by code\n", 161 | " - Can be very hard to parse\n", 162 | "\n", 163 | "#Recall: Apache Web Server Log\n", 164 | "![log](http://postfiles15.naver.net/20150716_46/valtin_1437034252811xMmJX_JPEG/6.jpg?type=w3)\n", 165 | "\n", 166 | " * Example line\n", 167 | " 
![line](http://postfiles3.naver.net/20150716_194/valtin_1437034348116900wC_JPEG/7.jpg?type=w3)\n", 168 | " \n", 169 | " \n", 170 | " * Field-by-field breakdown\n", 171 | " - 127.0.0.1: IP address of the client\n", 172 | " - \"-\": user identity from the remote machine\n", 173 | " - \"-\": user identity from local logon\n", 174 | " * In both cases \"-\" means not available\n", 175 | " - [01/Aug/1995:00:00:01 -0400]: time of the request\n", 176 | " - \"GET /images/launch-logo.gif HTTP/1.0\": the client request\n", 177 | " - 200: status code the server sends to the client\n", 178 | " * OK response (2XX), others: 3XX,4XX,5XX\n", 179 | " - 1839: size of the object returned to the client\n", 180 | " * If no content was returned, this is \"-\" or sometimes \"0\"\n", 181 | " \n", 182 | "##Some Log Analysis Questions\n", 183 | " * Overall\n", 184 | " - What are the statistics for returned content, sizes, and statuses?\n", 185 | " - What types of return codes are there?\n", 186 | " - How many page-not-found errors occurred?\n", 187 | " * Temporal\n", 188 | " - How many unique hosts visited per day?\n", 189 | " - How many requests were there per day?\n", 190 | " - On average, how many requests were there per host?\n", 191 | " - How many 404 errors occurred per day?\n", 192 | " \n", 193 | "##Splunk\n", 194 | " - Collects files from many computers\n", 195 | " * Application and system event logs\n", 196 | " - Identifies unusual events:\n", 197 | " * Disk errors\n", 198 | " * Network congestion\n", 199 | " * Security attacks\n", 200 | " - Resource monitoring\n", 201 | " * Network, memory, disk, etc.\n", 202 | " - Visualization through dashboards\n", 203 | "\n", 204 | "##File performance consideration\n", 205 | " * Read vs. write performance\n", 206 | " * Plain text vs binary format\n", 207 | " * Environment: Pandas(python) vs Scala/java\n", 208 | " * uncompressed vs compressed\n", 209 | "\n", 210 | "##File Performance\n", 211 | " >![performance](http://postfiles6.naver.net/20150716_133/valtin_1437038727881KjKzB_JPEG/8.jpg?type=w3)\n", 212 | " \n", 213 | " * Pandas does not have a binary file I/O library\n", 214 | " - Performance depends on the separate library used\n", 215 | " * Binary files are much faster than plain text\n", 216 | "\n", 217 | "## file performance - compression\n", 218 | "![comp](http://postfiles16.naver.net/20150716_159/valtin_1437039047510AD5KI_JPEG/9.jpg?type=w3)\n", 219 | "\n", 220 | " - compressed read 
is much faster than compressed write\n", 221 | " * LZ4 performs better than gzip\n", 222 | " * LZ4 compression times approach raw I/O times\n", 223 | " \n", 224 | "##Structured data\n", 225 | " - Only about 20% of data is structured\n", 226 | " - This share is decreasing because of media applications, enterprise search, and consumer applications\n", 227 | " >\n", 228 | " - Usually represented in a relational database, where each relation consists of a schema and an instance\n", 229 | " * Schema: the name of the relation, plus the name and type of each column\n", 230 | " * Instance: the actual data at a given point in time\n", 231 | " - number of rows = cardinality\n", 232 | " - number of fields = degree\n" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": { 239 | "collapsed": true 240 | }, 241 | "outputs": [], 242 | "source": [] 243 | } 244 | ], 245 | "metadata": { 246 | "kernelspec": { 247 | "display_name": "Python 3", 248 | "language": "python", 249 | "name": "python3" 250 | }, 251 | "language_info": { 252 | "codemirror_mode": { 253 | "name": "ipython", 254 | "version": 3 255 | }, 256 | "file_extension": ".py", 257 | "mimetype": "text/x-python", 258 | "name": "python", 259 | "nbconvert_exporter": "python", 260 | "pygments_lexer": "ipython3", 261 | "version": "3.4.3" 262 | } 263 | }, 264 | "nbformat": 4, 265 | "nbformat_minor": 0 266 | } 267 | -------------------------------------------------------------------------------- /Part1/Tutorials/lab1_word_count_student.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)\n", 8 | "# **Word Count Lab: Building a word count application**\n", 9 | "#### This lab will build on the techniques covered in the Spark tutorial to develop a simple word count application. 
The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). This could also be scaled to find the most common words on the Internet.\n", 10 | "#### ** During this lab we will cover: **\n", 11 | "#### *Part 1:* Creating a base RDD and pair RDDs\n", 12 | "#### *Part 2:* Counting with pair RDDs\n", 13 | "#### *Part 3:* Finding unique words and a mean value\n", 14 | "#### *Part 4:* Apply word count to a file\n", 15 | "#### Note that, for reference, you can look up the details of the relevant methods in [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### ** Part 1: Creating a base RDD and pair RDDs **" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "#### In this part of the lab, we will explore creating a base RDD with `parallelize` and using pair RDDs to count words." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "#### ** (1a) Create a base RDD **\n", 37 | "#### We'll start by generating a base RDD by using a Python list and the `sc.parallelize` method. Then we'll print out the type of the base RDD." 
38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": { 44 | "collapsed": false 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']\n", 49 | "wordsRDD = sc.parallelize(wordsList, 4)\n", 50 | "# Print out the type of wordsRDD\n", 51 | "print type(wordsRDD)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "#### ** (1b) Pluralize and test **\n", 59 | "#### Let's use a `map()` transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' at the end of the word. Please replace `<FILL IN>` with your solution. If you have trouble, the next cell has the solution. After you have defined `makePlural` you can run the third cell which contains a test. If your implementation is correct it will print `1 test passed`.\n", 60 | "#### This is the general form that exercises will take, except that no example solution will be provided. Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more `<FILL IN>` sections. The cell that needs to be modified will have `# TODO: Replace with appropriate code` on its first line. Once the `<FILL IN>` sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution. The last code cell before the next markdown section will contain the tests." 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "# TODO: Replace with appropriate code\n", 72 | "def makePlural(word):\n", 73 | " \"\"\"Adds an 's' to `word`.\n", 74 | "\n", 75 | " Note:\n", 76 | " This is a simple function that only adds an 's'. 
No attempt is made to follow proper\n", 77 | " pluralization rules.\n", 78 | "\n", 79 | " Args:\n", 80 | " word (str): A string.\n", 81 | "\n", 82 | " Returns:\n", 83 | " str: A string with 's' added to it.\n", 84 | " \"\"\"\n", 85 | " return \n", 86 | "\n", 87 | "print makePlural('cat')" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": false 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "# One way of completing the function\n", 99 | "def makePlural(word):\n", 100 | " return word + 's'\n", 101 | "\n", 102 | "print makePlural('cat')" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# Load in the testing code and check to see if your answer is correct\n", 114 | "# If incorrect it will report back '1 test failed' for each failed test\n", 115 | "# Make sure to rerun any cell you change before trying the test again\n", 116 | "from test_helper import Test\n", 117 | "# TEST Pluralize and test (1b)\n", 118 | "Test.assertEquals(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "#### ** (1c) Apply `makePlural` to the base RDD **\n", 126 | "#### Now pass each item in the base RDD into a [map()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) transformation that applies the `makePlural()` function to each element. And then call the [collect()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) action to see the transformed RDD." 
127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": false 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "# TODO: Replace with appropriate code\n", 138 | "pluralRDD = wordsRDD.map()\n", 139 | "print pluralRDD.collect()" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "# TEST Apply makePlural to the base RDD(1c)\n", 151 | "Test.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],\n", 152 | " 'incorrect values for pluralRDD')" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "#### ** (1d) Pass a `lambda` function to `map` **\n", 160 | "#### Let's create the same RDD using a `lambda` function." 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": { 167 | "collapsed": false 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "# TODO: Replace with appropriate code\n", 172 | "pluralLambdaRDD = wordsRDD.map(lambda )\n", 173 | "print pluralLambdaRDD.collect()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "# TEST Pass a lambda function to map (1d)\n", 185 | "Test.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],\n", 186 | " 'incorrect values for pluralLambdaRDD (1d)')" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "#### ** (1e) Length of each word **\n", 194 | "#### Now use `map()` and a `lambda` function to return the number of characters in each word. We'll `collect` this result directly into a variable." 
195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": { 201 | "collapsed": false 202 | }, 203 | "outputs": [], 204 | "source": [ 205 | "# TODO: Replace with appropriate code\n", 206 | "pluralLengths = (pluralRDD\n", 207 | " \n", 208 | " .collect())\n", 209 | "print pluralLengths" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "collapsed": false 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "# TEST Length of each word (1e)\n", 221 | "Test.assertEquals(pluralLengths, [4, 9, 4, 4, 4],\n", 222 | " 'incorrect values for pluralLengths')" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "#### ** (1f) Pair RDDs **\n", 230 | "#### The next step in writing our word counting program is to create a new type of RDD, called a pair RDD. A pair RDD is an RDD where each element is a pair tuple `(k, v)` where `k` is the key and `v` is the value. In this example, we will create a pair consisting of `('<word>', 1)` for each word element in the RDD.\n", 231 | "#### We can create the pair RDD using the `map()` transformation with a `lambda` function to create a new RDD." 
232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "collapsed": false 239 | }, 240 | "outputs": [], 241 | "source": [ 242 | "# TODO: Replace with appropriate code\n", 243 | "wordPairs = wordsRDD.\n", 244 | "print wordPairs.collect()" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": { 251 | "collapsed": false 252 | }, 253 | "outputs": [], 254 | "source": [ 255 | "# TEST Pair RDDs (1f)\n", 256 | "Test.assertEquals(wordPairs.collect(),\n", 257 | " [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],\n", 258 | " 'incorrect value for wordPairs')" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "### ** Part 2: Counting with pair RDDs **" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "#### Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.\n", 273 | "#### A naive approach would be to `collect()` all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations." 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "#### ** (2a) `groupByKey()` approach **\n", 281 | "#### An approach you might first consider (we'll see shortly that there are better ways) is based on using the [groupByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) transformation. 
As the name implies, the `groupByKey()` transformation groups all the elements of the RDD with the same key into a single list in one of the partitions. There are two problems with using `groupByKey()`:\n", 282 | " + #### The operation requires a lot of data movement to move all the values into the appropriate partitions.\n", 283 | " + #### The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a, etc.) would be huge and could exhaust the available memory in a worker.\n", 284 | " \n", 285 | "#### Use `groupByKey()` to generate a pair RDD of type `('word', iterator)`." 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "# TODO: Replace with appropriate code\n", 297 | "# Note that groupByKey requires no parameters\n", 298 | "wordsGrouped = wordPairs.\n", 299 | "for key, value in wordsGrouped.collect():\n", 300 | " print '{0}: {1}'.format(key, list(value))" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": { 307 | "collapsed": false 308 | }, 309 | "outputs": [], 310 | "source": [ 311 | "# TEST groupByKey() approach (2a)\n", 312 | "Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),\n", 313 | " [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],\n", 314 | " 'incorrect value for wordsGrouped')" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "#### ** (2b) Use `groupByKey()` to obtain the counts **\n", 322 | "#### Using the `groupByKey()` transformation creates an RDD containing 3 elements, each of which is a pair of a word and a Python iterator.\n", 323 | "#### Now sum the iterator using a `map()` transformation. The result should be a pair RDD consisting of (word, count) pairs." 
324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": { 330 | "collapsed": false 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "# TODO: Replace with appropriate code\n", 335 | "wordCountsGrouped = wordsGrouped.\n", 336 | "print wordCountsGrouped.collect()" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": null, 342 | "metadata": { 343 | "collapsed": false 344 | }, 345 | "outputs": [], 346 | "source": [ 347 | "# TEST Use groupByKey() to obtain the counts (2b)\n", 348 | "Test.assertEquals(sorted(wordCountsGrouped.collect()),\n", 349 | " [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 350 | " 'incorrect value for wordCountsGrouped')" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "#### ** (2c) Counting using `reduceByKey` **\n", 358 | "#### A better approach is to start from the pair RDD and then use the [reduceByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) transformation to create a new pair RDD. The `reduceByKey()` transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. `reduceByKey()` operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets." 
359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [], 368 | "source": [ 369 | "# TODO: Replace with appropriate code\n", 370 | "# Note that reduceByKey takes in a function that accepts two values and returns a single value\n", 371 | "wordCounts = wordPairs.reduceByKey()\n", 372 | "print wordCounts.collect()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": { 379 | "collapsed": false 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "# TEST Counting using reduceByKey (2c)\n", 384 | "Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 385 | " 'incorrect value for wordCounts')" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "#### ** (2d) All together **\n", 393 | "#### The expert version of the code performs the `map()` to pair RDD, `reduceByKey()` transformation, and `collect` in one statement." 
394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": { 400 | "collapsed": false 401 | }, 402 | "outputs": [], 403 | "source": [ 404 | "# TODO: Replace with appropriate code\n", 405 | "wordCountsCollected = (wordsRDD\n", 406 | " \n", 407 | " .collect())\n", 408 | "print wordCountsCollected" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": { 415 | "collapsed": false 416 | }, 417 | "outputs": [], 418 | "source": [ 419 | "# TEST All together (2d)\n", 420 | "Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 421 | " 'incorrect value for wordCountsCollected')" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "### ** Part 3: Finding unique words and a mean value **" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "#### ** (3a) Unique words **\n", 436 | "#### Calculate the number of unique words in `wordsRDD`. You can use other RDDs that you have already created to make this easier." 
437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": { 443 | "collapsed": false 444 | }, 445 | "outputs": [], 446 | "source": [ 447 | "# TODO: Replace with appropriate code\n", 448 | "uniqueWords = \n", 449 | "print uniqueWords" 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "execution_count": null, 455 | "metadata": { 456 | "collapsed": false 457 | }, 458 | "outputs": [], 459 | "source": [ 460 | "# TEST Unique words (3a)\n", 461 | "Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "#### ** (3b) Mean using `reduce` **\n", 469 | "#### Find the mean number of words per unique word in `wordCounts`.\n", 470 | "#### Use a `reduce()` action to sum the counts in `wordCounts` and then divide by the number of unique words. First `map()` the pair RDD `wordCounts`, which consists of (key, value) pairs, to an RDD of values." 
471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": { 477 | "collapsed": false 478 | }, 479 | "outputs": [], 480 | "source": [ 481 | "# TODO: Replace with appropriate code\n", 482 | "from operator import add\n", 483 | "totalCount = (wordCounts\n", 484 | " .map()\n", 485 | " .reduce())\n", 486 | "average = totalCount / float()\n", 487 | "print totalCount\n", 488 | "print round(average, 2)" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": { 495 | "collapsed": false 496 | }, 497 | "outputs": [], 498 | "source": [ 499 | "# TEST Mean using reduce (3b)\n", 500 | "Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "### ** Part 4: Apply word count to a file **" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "#### In this section we will finish developing our word count application. We'll have to build the `wordCount` function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data." 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "#### ** (4a) `wordCount` function **\n", 522 | "#### First, define a function for word counting. You should reuse the techniques that have been covered in earlier parts of this lab. This function should take in an RDD that is a list of words like `wordsRDD` and return a pair RDD that has all of the words and their associated counts." 
523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "metadata": { 529 | "collapsed": false 530 | }, 531 | "outputs": [], 532 | "source": [ 533 | "# TODO: Replace with appropriate code\n", 534 | "def wordCount(wordListRDD):\n", 535 | " \"\"\"Creates a pair RDD with word counts from an RDD of words.\n", 536 | "\n", 537 | " Args:\n", 538 | " wordListRDD (RDD of str): An RDD consisting of words.\n", 539 | "\n", 540 | " Returns:\n", 541 | " RDD of (str, int): An RDD consisting of (word, count) tuples.\n", 542 | " \"\"\"\n", 543 | " \n", 544 | "print wordCount(wordsRDD).collect()" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": { 551 | "collapsed": false 552 | }, 553 | "outputs": [], 554 | "source": [ 555 | "# TEST wordCount function (4a)\n", 556 | "Test.assertEquals(sorted(wordCount(wordsRDD).collect()),\n", 557 | " [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 558 | " 'incorrect definition for wordCount function')" 559 | ] 560 | }, 561 | { 562 | "cell_type": "markdown", 563 | "metadata": {}, 564 | "source": [ 565 | "#### ** (4b) Capitalization and punctuation **\n", 566 | "#### Real world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are:\n", 567 | " + #### Words should be counted independent of their capitalization (e.g., Spark and spark should be counted as the same word).\n", 568 | " + #### All punctuation should be removed.\n", 569 | " + #### Any leading or trailing spaces on a line should be removed.\n", 570 | " \n", 571 | "#### Define the function `removePunctuation` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space. Reading `help(re.sub)` might be useful." 
572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": { 578 | "collapsed": false 579 | }, 580 | "outputs": [], 581 | "source": [ 582 | "# TODO: Replace with appropriate code\n", 583 | "import re\n", 584 | "def removePunctuation(text):\n", 585 | " \"\"\"Removes punctuation, changes to lower case, and strips leading and trailing spaces.\n", 586 | "\n", 587 | " Note:\n", 588 | " Only spaces, letters, and numbers should be retained. Other characters should be\n", 589 | " eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after\n", 590 | " punctuation is removed.\n", 591 | "\n", 592 | " Args:\n", 593 | " text (str): A string.\n", 594 | "\n", 595 | " Returns:\n", 596 | " str: The cleaned up string.\n", 597 | " \"\"\"\n", 598 | " \n", 599 | "print removePunctuation('Hi, you!')\n", 600 | "print removePunctuation(' No under_score!')" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": { 607 | "collapsed": false 608 | }, 609 | "outputs": [], 610 | "source": [ 611 | "# TEST Capitalization and punctuation (4b)\n", 612 | "Test.assertEquals(removePunctuation(\" The Elephant's 4 cats. \"),\n", 613 | " 'the elephants 4 cats',\n", 614 | " 'incorrect definition for removePunctuation function')" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "#### ** (4c) Load a text file **\n", 622 | "#### For the next part of this lab, we will use the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). To convert a text file into an RDD, we use the `SparkContext.textFile()` method. We also apply the recently defined `removePunctuation()` function using a `map()` transformation to strip out the punctuation and change all text to lowercase. Since the file is large we use `take(15)`, so that we only print 15 lines."
623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": null, 628 | "metadata": { 629 | "collapsed": false 630 | }, 631 | "outputs": [], 632 | "source": [ 633 | "# Just run this code\n", 634 | "import os.path\n", 635 | "baseDir = os.path.join('data')\n", 636 | "inputPath = os.path.join('cs100', 'lab1', 'shakespeare.txt')\n", 637 | "fileName = os.path.join(baseDir, inputPath)\n", 638 | "\n", 639 | "shakespeareRDD = (sc\n", 640 | " .textFile(fileName, 8)\n", 641 | " .map(removePunctuation))\n", 642 | "print '\\n'.join(shakespeareRDD\n", 643 | " .zipWithIndex() # to (line, lineNum)\n", 644 | " .map(lambda (l, num): '{0}: {1}'.format(num, l)) # to 'lineNum: line'\n", 645 | " .take(15))" 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": {}, 651 | "source": [ 652 | "#### ** (4d) Words from lines **\n", 653 | "#### Before we can use the `wordCount()` function, we have to address two issues with the format of the RDD:\n", 654 | " + #### The first issue is that we need to split each line by its spaces.\n", 655 | " + #### The second issue is that we need to filter out empty lines.\n", 656 | " \n", 657 | "#### Apply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string [split()](https://docs.python.org/2/library/string.html#string.split) function. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be."
658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": { 664 | "collapsed": false 665 | }, 666 | "outputs": [], 667 | "source": [ 668 | "# TODO: Replace with appropriate code\n", 669 | "shakespeareWordsRDD = shakespeareRDD.\n", 670 | "shakespeareWordCount = shakespeareWordsRDD.count()\n", 671 | "print shakespeareWordsRDD.top(5)\n", 672 | "print shakespeareWordCount" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "metadata": { 679 | "collapsed": false 680 | }, 681 | "outputs": [], 682 | "source": [ 683 | "# TEST Words from lines (4d)\n", 684 | "# This test allows for leading spaces to be removed either before or after\n", 685 | "# punctuation is removed.\n", 686 | "Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,\n", 687 | " 'incorrect value for shakespeareWordCount')\n", 688 | "Test.assertEquals(shakespeareWordsRDD.top(5),\n", 689 | " [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],\n", 690 | " 'incorrect value for shakespeareWordsRDD')" 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": {}, 696 | "source": [ 697 | "#### ** (4e) Remove empty elements **\n", 698 | "#### The next step is to filter out the empty elements. Remove all entries where the word is `''`." 
699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": { 705 | "collapsed": false 706 | }, 707 | "outputs": [], 708 | "source": [ 709 | "# TODO: Replace with appropriate code\n", 710 | "shakeWordsRDD = shakespeareWordsRDD.\n", 711 | "shakeWordCount = shakeWordsRDD.count()\n", 712 | "print shakeWordCount" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": null, 718 | "metadata": { 719 | "collapsed": false 720 | }, 721 | "outputs": [], 722 | "source": [ 723 | "# TEST Remove empty elements (4e)\n", 724 | "Test.assertEquals(shakeWordCount, 882996, 'incorrect value for shakeWordCount')" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": {}, 730 | "source": [ 731 | "#### ** (4f) Count the words **\n", 732 | "#### We now have an RDD that is only words. Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.\n", 733 | "#### You'll notice that many of the words are common English words. These are called stopwords. In a later lab, we will see how to eliminate them from the results.\n", 734 | "#### Use the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts." 
735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": null, 740 | "metadata": { 741 | "collapsed": false 742 | }, 743 | "outputs": [], 744 | "source": [ 745 | "# TODO: Replace with appropriate code\n", 746 | "top15WordsAndCounts = \n", 747 | "print '\\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": null, 753 | "metadata": { 754 | "collapsed": false 755 | }, 756 | "outputs": [], 757 | "source": [ 758 | "# TEST Count the words (4f)\n", 759 | "Test.assertEquals(top15WordsAndCounts,\n", 760 | " [(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),\n", 761 | " (u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),\n", 762 | " (u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],\n", 763 | " 'incorrect value for top15WordsAndCounts')" 764 | ] 765 | } 766 | ], 767 | "metadata": {}, 768 | "nbformat": 4, 769 | "nbformat_minor": 0 770 | } 771 | -------------------------------------------------------------------------------- /Part1/Week4/lab1_word_count_student_Answer_CS_20150730.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)\n", 8 | "# **Word Count Lab: Building a word count application**\n", 9 | "#### This lab will build on the techniques covered in the Spark tutorial to develop a simple word count application. The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. 
In this lab, we will write code that calculates the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) retrieved from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). This could also be scaled to find the most common words on the Internet.\n", 10 | "#### ** During this lab we will cover: **\n", 11 | "#### *Part 1:* Creating a base RDD and pair RDDs\n", 12 | "#### *Part 2:* Counting with pair RDDs\n", 13 | "#### *Part 3:* Finding unique words and a mean value\n", 14 | "#### *Part 4:* Apply word count to a file\n", 15 | "#### Note that, for reference, you can look up the details of the relevant methods in [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### **Command reference**\n", 23 | "\n", 24 | "http://spark.apache.org/docs/latest/api/python/pyspark.html" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "### ** Part 1: Creating a base RDD and pair RDDs **" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "#### In this part of the lab, we will explore creating a base RDD with `parallelize` and using pair RDDs to count words." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "#### ** (1a) Create a base RDD **\n", 46 | "#### We'll start by generating a base RDD by using a Python list and the `sc.parallelize` method. Then we'll print out the type of the base RDD."
47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 84, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [ 56 | { 57 | "name": "stdout", 58 | "output_type": "stream", 59 | "text": [ 60 | "\n" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']\n", 66 | "wordsRDD = sc.parallelize(wordsList, 4)\n", 67 | "# Print out the type of wordsRDD\n", 68 | "print type(wordsRDD)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "#### ** (1b) Pluralize and test **\n", 76 | "#### Let's use a `map()` transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' at the end of the word. Please replace `<FILL IN>` with your solution. If you have trouble, the next cell has the solution. After you have defined `makePlural` you can run the third cell which contains a test. If your implementation is correct it will print `1 test passed`.\n", 77 | "#### This is the general form that exercises will take, except that no example solution will be provided. Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more `<FILL IN>` sections. The cell that needs to be modified will have `# TODO: Replace with appropriate code` on its first line. Once the `<FILL IN>` sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution. The last code cell before the next markdown section will contain the tests."
78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## **Turning singular into plural!**" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 85, 90 | "metadata": { 91 | "collapsed": false 92 | }, 93 | "outputs": [ 94 | { 95 | "name": "stdout", 96 | "output_type": "stream", 97 | "text": [ 98 | "cats\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "'''\n", 104 | "# TODO: Replace with appropriate code\n", 105 | "def makePlural(word):\n", 106 | " \"\"\"Adds an 's' to `word`.\n", 107 | "\n", 108 | " Note:\n", 109 | " This is a simple function that only adds an 's'. No attempt is made to follow proper\n", 110 | " pluralization rules.\n", 111 | "\n", 112 | " Args:\n", 113 | " word (str): A string.\n", 114 | "\n", 115 | " Returns:\n", 116 | " str: A string with 's' added to it.\n", 117 | " \"\"\"\n", 118 | " return \n", 119 | "\n", 120 | "print makePlural('cat')\n", 121 | "'''\n", 122 | "\n", 123 | "# TODO: Replace with appropriate code\n", 124 | "def makePlural(word):\n", 125 | " \"\"\"Adds an 's' to `word`.\n", 126 | "\n", 127 | " Note:\n", 128 | " This is a simple function that only adds an 's'. No attempt is made to follow proper\n", 129 | " pluralization rules.\n", 130 | "\n", 131 | " Args:\n", 132 | " word (str): A string.\n", 133 | "\n", 134 | " Returns:\n", 135 | " str: A string with 's' added to it.\n", 136 | " \"\"\"\n", 137 | " return word + \"s\"\n", 138 | "\n", 139 | "print makePlural('cat')" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 86, 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "cats\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "# One way of completing the function \n", 159 | "\n", 160 | "# Just showing how I did it. 
It's this easy... so why....\n", 161 | "\n", 162 | "def makePlural(word):\n", 163 | " return word + 's'\n", 164 | "\n", 165 | "print makePlural('cat')" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### **Let's check whether we got it right!**" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 87, 178 | "metadata": { 179 | "collapsed": false 180 | }, 181 | "outputs": [ 182 | { 183 | "name": "stdout", 184 | "output_type": "stream", 185 | "text": [ 186 | "1 test passed.\n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "# Load in the testing code and check to see if your answer is correct\n", 192 | "# If incorrect it will report back '1 test failed' for each failed test\n", 193 | "# Make sure to rerun any cell you change before trying the test again\n", 194 | "from test_helper import Test\n", 195 | "# TEST Pluralize and test (1b)\n", 196 | "Test.assertEquals(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "#### ** (1c) Apply `makePlural` to the base RDD **\n", 204 | "#### Now pass each item in the base RDD into a [map()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) transformation that applies the `makePlural()` function to each element. And then call the [collect()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) action to see the transformed RDD."
205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "### **Let's apply the makePlural function we made above to the RDD!**" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 88, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [ 221 | { 222 | "name": "stdout", 223 | "output_type": "stream", 224 | "text": [ 225 | "['cats', 'elephants', 'rats', 'rats', 'cats']\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "'''\n", 231 | "# TODO: Replace with appropriate code\n", 232 | "pluralRDD = wordsRDD.map()\n", 233 | "print pluralRDD.collect()\n", 234 | "\n", 235 | "'''\n", 236 | "# TODO: Replace with appropriate code\n", 237 | "pluralRDD = wordsRDD.map(makePlural)\n", 238 | "print pluralRDD.collect()" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 89, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [ 248 | { 249 | "name": "stdout", 250 | "output_type": "stream", 251 | "text": [ 252 | "1 test passed.\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "# TEST Apply makePlural to the base RDD(1c)\n", 258 | "Test.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],\n", 259 | " 'incorrect values for pluralRDD')" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "#### ** (1d) Pass a `lambda` function to `map` **\n", 267 | "#### Let's create the same RDD using a `lambda` function."
268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "### **This time, shall we try it with a lambda function?**" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 91, 280 | "metadata": { 281 | "collapsed": false 282 | }, 283 | "outputs": [ 284 | { 285 | "name": "stdout", 286 | "output_type": "stream", 287 | "text": [ 288 | "['cats', 'elephants', 'rats', 'rats', 'cats']\n" 289 | ] 290 | } 291 | ], 292 | "source": [ 293 | "'''\n", 294 | "# TODO: Replace with appropriate code\n", 295 | "pluralLambdaRDD = wordsRDD.map(lambda )\n", 296 | "print pluralLambdaRDD.collect()\n", 297 | "'''\n", 298 | "\n", 299 | "# TODO: Replace with appropriate code\n", 300 | "pluralLambdaRDD = wordsRDD.map(lambda word : word + 's')\n", 301 | "print pluralLambdaRDD.collect()" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 92, 307 | "metadata": { 308 | "collapsed": false 309 | }, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "1 test passed.\n" 316 | ] 317 | } 318 | ], 319 | "source": [ 320 | "# TEST Pass a lambda function to map (1d)\n", 321 | "Test.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],\n", 322 | " 'incorrect values for pluralLambdaRDD (1d)')" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "#### ** (1e) Length of each word **\n", 330 | "#### Now use `map()` and a `lambda` function to return the number of characters in each word. We'll `collect` this result directly into a variable."
331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "### **This time, let's count the number of characters!**" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 93, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "[4, 9, 4, 4, 4]\n" 352 | ] 353 | } 354 | ], 355 | "source": [ 356 | "'''\n", 357 | "# TODO: Replace with appropriate code\n", 358 | "pluralLengths = (pluralRDD\n", 359 | " \n", 360 | " .collect())\n", 361 | "print pluralLengths\n", 362 | "'''\n", 363 | "\n", 364 | "# TODO: Replace with appropriate code\n", 365 | "pluralLengths = (pluralRDD\n", 366 | " .map(lambda word : len(word))\n", 367 | " .collect())\n", 368 | "print pluralLengths" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": 94, 374 | "metadata": { 375 | "collapsed": false 376 | }, 377 | "outputs": [ 378 | { 379 | "name": "stdout", 380 | "output_type": "stream", 381 | "text": [ 382 | "1 test passed.\n" 383 | ] 384 | } 385 | ], 386 | "source": [ 387 | "# TEST Length of each word (1e)\n", 388 | "Test.assertEquals(pluralLengths, [4, 9, 4, 4, 4],\n", 389 | " 'incorrect values for pluralLengths')" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "#### ** (1f) Pair RDDs **\n", 397 | "#### The next step in writing our word counting program is to create a new type of RDD, called a pair RDD. A pair RDD is an RDD where each element is a pair tuple `(k, v)` where `k` is the key and `v` is the value. In this example, we will create a pair consisting of `('<word>', 1)` for each word element in the RDD.\n", 398 | "#### We can create the pair RDD using the `map()` transformation with a `lambda` function to create a new RDD." 
399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "#### **With only numbers, it's hard to tell which word each result belongs to. 
Let's output (word, word count) pairs instead!**\n", 406 | "#### Note that this is the word count, not the character count!" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 95, 412 | "metadata": { 413 | "collapsed": false 414 | }, 415 | "outputs": [ 416 | { 417 | "name": "stdout", 418 | "output_type": "stream", 419 | "text": [ 420 | "[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]\n" 421 | ] 422 | } 423 | ], 424 | "source": [ 425 | "'''\n", 426 | "# TODO: Replace with appropriate code\n", 427 | "wordPairs = wordsRDD.\n", 428 | "print wordPairs.collect()\n", 429 | "\n", 430 | "'''\n", 431 | "\n", 432 | "# TODO: Replace with appropriate code\n", 433 | "wordPairs = wordsRDD.map(lambda word : (word, 1))\n", 434 | "print wordPairs.collect()" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 96, 440 | "metadata": { 441 | "collapsed": false 442 | }, 443 | "outputs": [ 444 | { 445 | "name": "stdout", 446 | "output_type": "stream", 447 | "text": [ 448 | "1 test passed.\n" 449 | ] 450 | } 451 | ], 452 | "source": [ 453 | "# TEST Pair RDDs (1f)\n", 454 | "Test.assertEquals(wordPairs.collect(),\n", 455 | " [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],\n", 456 | " 'incorrect value for wordPairs')" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "### ** Part 2: Counting with pair RDDs **" 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "#### Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.\n", 471 | "#### A naive approach would be to `collect()` all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. 
In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations." 472 | ] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "metadata": {}, 477 | "source": [ 478 | "#### ** (2a) `groupByKey()` approach **\n", 479 | "#### An approach you might first consider (we'll see shortly that there are better ways) is based on using the [groupByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) transformation. As the name implies, the `groupByKey()` transformation groups all the elements of the RDD with the same key into a single list in one of the partitions. There are two problems with using `groupByKey()`:\n", 480 | " + #### The operation requires a lot of data movement to move all the values into the appropriate partitions.\n", 481 | " + #### The lists can be very large. Consider a word count of English Wikipedia: the lists for common words (e.g., the, a, etc.) would be huge and could exhaust the available memory in a worker.\n", 482 | " \n", 483 | "#### Use `groupByKey()` to generate a pair RDD of type `('word', iterator)`." 
484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "#### **Grouping by key (1): groupByKey()**" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 97, 496 | "metadata": { 497 | "collapsed": false 498 | }, 499 | "outputs": [ 500 | { 501 | "name": "stdout", 502 | "output_type": "stream", 503 | "text": [ 504 | "rat: [1, 1]\n", 505 | "elephant: [1]\n", 506 | "cat: [1, 1]\n" 507 | ] 508 | } 509 | ], 510 | "source": [ 511 | "'''\n", 512 | "# In case you don't remember... :)\n", 513 | "# wordPairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]\n", 514 | "\n", 515 | "\n", 516 | "# TODO: Replace with appropriate code\n", 517 | "# Note that groupByKey requires no parameters\n", 518 | "wordsGrouped = wordPairs.\n", 519 | "for key, value in wordsGrouped.collect():\n", 520 | " print '{0}: {1}'.format(key, list(value))\n", 521 | " \n", 522 | "'''\n", 523 | "\n", 524 | "# TODO: Replace with appropriate code\n", 525 | "# Note that groupByKey requires no parameters\n", 526 | "wordsGrouped = wordPairs.groupByKey()\n", 527 | "for key, value in wordsGrouped.collect():\n", 528 | " print '{0}: {1}'.format(key, list(value))" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 98, 534 | "metadata": { 535 | "collapsed": false 536 | }, 537 | "outputs": [ 538 | { 539 | "name": "stdout", 540 | "output_type": "stream", 541 | "text": [ 542 | "1 test passed.\n" 543 | ] 544 | } 545 | ], 546 | "source": [ 547 | "# TEST groupByKey() approach (2a)\n", 548 | "Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),\n", 549 | " [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],\n", 550 | " 'incorrect value for wordsGrouped')" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "#### ** (2b) Use `groupByKey()` to obtain the counts **\n", 558 | "#### Using the `groupByKey()` transformation creates an RDD containing 3 
elements, each of which is a pair of a word and a Python iterator.\n", 559 | "#### Now sum the iterator using a `map()` transformation. The result should be a pair RDD consisting of (word, count) pairs." 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "#### **Grouping by key (2): turn the values that were output as lists into sums!**" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 99, 572 | "metadata": { 573 | "collapsed": false 574 | }, 575 | "outputs": [ 576 | { 577 | "name": "stdout", 578 | "output_type": "stream", 579 | "text": [ 580 | "[('rat', 2), ('elephant', 1), ('cat', 2)]\n" 581 | ] 582 | } 583 | ], 584 | "source": [ 585 | "'''\n", 586 | "# TODO: Replace with appropriate code\n", 587 | "wordCountsGrouped = wordsGrouped.\n", 588 | "print wordCountsGrouped.collect()\n", 589 | "\n", 590 | "# wordsGrouped = wordPairs.groupByKey()\n", 591 | "\n", 592 | "rat: [1, 1]\n", 593 | "elephant: [1]\n", 594 | "cat: [1, 1]\n", 595 | "\n", 596 | "'''\n", 597 | "\n", 598 | "# TODO: Replace with appropriate code\n", 599 | "wordCountsGrouped = wordsGrouped.map(lambda (word, num): (word, sum(num)))\n", 600 | "print wordCountsGrouped.collect()" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 100, 606 | "metadata": { 607 | "collapsed": false 608 | }, 609 | "outputs": [ 610 | { 611 | "name": "stdout", 612 | "output_type": "stream", 613 | "text": [ 614 | "1 test passed.\n" 615 | ] 616 | } 617 | ], 618 | "source": [ 619 | "# TEST Use groupByKey() to obtain the counts (2b)\n", 620 | "Test.assertEquals(sorted(wordCountsGrouped.collect()),\n", 621 | " [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 622 | " 'incorrect value for wordCountsGrouped')" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": {}, 628 | "source": [ 629 | "#### ** (2c) Counting using `reduceByKey` **\n", 630 | "#### A better approach is to start from the pair RDD and then use the 
[reduceByKey()](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) transformation to create a new pair RDD. The `reduceByKey()` transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. `reduceByKey()` operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets." 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "#### **Grouping by key (3): reduceByKey**\n", 638 | "#### Subtitle: why bother doing it the way we did in (2)!" 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": 101, 644 | "metadata": { 645 | "collapsed": false 646 | }, 647 | "outputs": [ 648 | { 649 | "name": "stdout", 650 | "output_type": "stream", 651 | "text": [ 652 | "[('rat', 2), ('elephant', 1), ('cat', 2)]\n" 653 | ] 654 | } 655 | ], 656 | "source": [ 657 | "'''\n", 658 | "\n", 659 | "# TODO: Replace with appropriate code\n", 660 | "# Note that reduceByKey takes in a function that accepts two values and returns a single value\n", 661 | "wordCounts = wordPairs.reduceByKey()\n", 662 | "print wordCounts.collect()\n", 663 | "\n", 664 | "# In case you've forgotten again... :)\n", 665 | "# wordPairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]\n", 666 | "\n", 667 | "'''\n", 668 | "\n", 669 | "# TODO: Replace with appropriate code\n", 670 | "# Note that reduceByKey takes in a function that accepts two values and returns a single value\n", 671 | "wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)\n", 672 | "print wordCounts.collect()" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": 102, 678 | "metadata": { 679 | "collapsed": false 680 | }, 681 | "outputs": [ 682 | { 683 | "name": "stdout", 684 | "output_type": "stream", 685 | "text": [ 686 | "1 test 
passed.\n" 687 | ] 688 | } 689 | ], 690 | "source": [ 691 | "# TEST Counting using reduceByKey (2c)\n", 692 | "Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 693 | " 'incorrect value for wordCounts')" 694 | ] 695 | }, 696 | { 697 | "cell_type": "markdown", 698 | "metadata": {}, 699 | "source": [ 700 | "#### ** (2d) All together **\n", 701 | "#### The expert version of the code performs the `map()` to pair RDD, `reduceByKey()` transformation, and `collect` in one statement." 702 | ] 703 | }, 704 | { 705 | "cell_type": "markdown", 706 | "metadata": {}, 707 | "source": [ 708 | "#### **Now let's put together everything we've learned!**" 709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": 103, 714 | "metadata": { 715 | "collapsed": false 716 | }, 717 | "outputs": [ 718 | { 719 | "name": "stdout", 720 | "output_type": "stream", 721 | "text": [ 722 | "[('rat', 2), ('elephant', 1), ('cat', 2)]\n" 723 | ] 724 | } 725 | ], 726 | "source": [ 727 | "'''\n", 728 | "\n", 729 | "# TODO: Replace with appropriate code\n", 730 | "wordCountsCollected = (wordsRDD\n", 731 | " \n", 732 | " .collect())\n", 733 | "print wordCountsCollected\n", 734 | "\n", 735 | "'''\n", 736 | "\n", 737 | "\n", 738 | "# TODO: Replace with appropriate code\n", 739 | "wordCountsCollected = (wordsRDD\n", 740 | " .map(lambda word : (word, 1))\n", 741 | " .reduceByKey(lambda a, b: a + b)\n", 742 | " .collect())\n", 743 | "print wordCountsCollected" 744 | ] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": 104, 749 | "metadata": { 750 | "collapsed": false 751 | }, 752 | "outputs": [ 753 | { 754 | "name": "stdout", 755 | "output_type": "stream", 756 | "text": [ 757 | "1 test passed.\n" 758 | ] 759 | } 760 | ], 761 | "source": [ 762 | "# TEST All together (2d)\n", 763 | "Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 764 | " 'incorrect value for wordCountsCollected')" 765 | ] 766 | }, 767 | { 768 | 
"cell_type": "markdown", 769 | "metadata": {}, 770 | "source": [ 771 | "### ** Part 3: Finding unique words and a mean value **" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "#### ** (3a) Unique words **\n", 779 | "#### Calculate the number of unique words in `wordsRDD`. You can use other RDDs that you have already created to make this easier." 780 | ] 781 | }, 782 | { 783 | "cell_type": "markdown", 784 | "metadata": {}, 785 | "source": [ 786 | "#### **Let's find the unique... words!**" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": 105, 792 | "metadata": { 793 | "collapsed": false 794 | }, 795 | "outputs": [ 796 | { 797 | "name": "stdout", 798 | "output_type": "stream", 799 | "text": [ 800 | "3\n" 801 | ] 802 | } 803 | ], 804 | "source": [ 805 | "'''\n", 806 | "\n", 807 | "# TODO: Replace with appropriate code\n", 808 | "uniqueWords = \n", 809 | "print uniqueWords\n", 810 | "\n", 811 | "'''\n", 812 | "\n", 813 | "# TODO: Replace with appropriate code\n", 814 | "# distinct() keeps one copy of each word; count() returns how many remain\n", 815 | "# (wordCounts.count() would also work, since it has one entry per word)\n", 816 | "uniqueWords = wordsRDD.distinct().count()\n", 817 | "print uniqueWords\n" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": 106, 823 | "metadata": { 824 | "collapsed": false 825 | }, 826 | "outputs": [ 827 | { 828 | "data": { 829 | "text/plain": [ 830 | "3" 831 | ] 832 | }, 833 | "execution_count": 106, 834 | "metadata": {}, 835 | "output_type": "execute_result" 836 | } 837 | ], 838 | "source": [ 839 | "# .count()\n", 840 | "# Return the number of elements in this RDD.\n", 841 | "sc.parallelize([2, 3, 4]).count()" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": 107, 847 | "metadata": { 848 | "collapsed": false 849 | }, 850 | "outputs": [ 851 | { 852 | "name": "stdout", 853 | "output_type": "stream", 854 | "text": [ 855 | "1 test passed.\n" 856 | ] 
857 | } 858 | ], 859 | "source": [ 860 | "# TEST Unique words (3a)\n", 861 | "Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": {}, 867 | "source": [ 868 | "#### ** (3b) Mean using `reduce` **\n", 869 | "#### Find the mean number of words per unique word in `wordCounts`.\n", 870 | "#### Use a `reduce()` action to sum the counts in `wordCounts` and then divide by the number of unique words. First `map()` the pair RDD `wordCounts`, which consists of (key, value) pairs, to an RDD of values." 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 108, 876 | "metadata": { 877 | "collapsed": false 878 | }, 879 | "outputs": [ 880 | { 881 | "name": "stdout", 882 | "output_type": "stream", 883 | "text": [ 884 | "5\n", 885 | "1.67\n" 886 | ] 887 | } 888 | ], 889 | "source": [ 890 | "'''\n", 891 | "\n", 892 | "# TODO: Replace with appropriate code\n", 893 | "from operator import add\n", 894 | "totalCount = (wordCounts\n", 895 | " .map()\n", 896 | " .reduce())\n", 897 | "average = totalCount / float()\n", 898 | "print totalCount\n", 899 | "print round(average, 2)\n", 900 | "\n", 901 | "\n", 902 | "# In case you don't remember... (actually, I'm the one who didn't)\n", 903 | "wordCounts = [('rat', 2), ('elephant', 1), ('cat', 2)]\n", 904 | "\n", 905 | "'''\n", 906 | "# TODO: Replace with appropriate code\n", 907 | "from operator import add\n", 908 | "totalCount = (wordCounts\n", 909 | " .map(lambda x: x[1])\n", 910 | " .reduce(add))\n", 911 | "average = totalCount / float(uniqueWords)\n", 912 | "print totalCount\n", 913 | "print round(average, 2)" 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": 109, 919 | "metadata": { 920 | "collapsed": false 921 | }, 922 | "outputs": [ 923 | { 924 | "name": "stdout", 925 | "output_type": "stream", 926 | "text": [ 927 | "1 test passed.\n" 928 | ] 929 | } 930 | ], 931 | "source": [ 932 | "# TEST Mean using 
reduce (3b)\n", 933 | "Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')" 934 | ] 935 | }, 936 | { 937 | "cell_type": "markdown", 938 | "metadata": {}, 939 | "source": [ 940 | "### ** Part 4: Apply word count to a file **" 941 | ] 942 | }, 943 | { 944 | "cell_type": "markdown", 945 | "metadata": {}, 946 | "source": [ 947 | "#### In this section we will finish developing our word count application. We'll have to build the `wordCount` function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data." 948 | ] 949 | }, 950 | { 951 | "cell_type": "markdown", 952 | "metadata": {}, 953 | "source": [ 954 | "#### ** (4a) `wordCount` function **\n", 955 | "#### First, define a function for word counting. You should reuse the techniques that have been covered in earlier parts of this lab. This function should take in an RDD that is a list of words like `wordsRDD` and return a pair RDD that has all of the words and their associated counts." 
956 | ] 957 | }, 958 | { 959 | "cell_type": "code", 960 | "execution_count": 110, 961 | "metadata": { 962 | "collapsed": false 963 | }, 964 | "outputs": [ 965 | { 966 | "name": "stdout", 967 | "output_type": "stream", 968 | "text": [ 969 | "[('rat', 2), ('elephant', 1), ('cat', 2)]\n" 970 | ] 971 | } 972 | ], 973 | "source": [ 974 | "'''\n", 975 | "# TODO: Replace with appropriate code\n", 976 | "def wordCount(wordListRDD):\n", 977 | " \"\"\"Creates a pair RDD with word counts from an RDD of words.\n", 978 | "\n", 979 | " Args:\n", 980 | " wordListRDD (RDD of str): An RDD consisting of words.\n", 981 | "\n", 982 | " Returns:\n", 983 | " RDD of (str, int): An RDD consisting of (word, count) tuples.\n", 984 | " \"\"\"\n", 985 | " \n", 986 | "print wordCount(wordsRDD).collect()\n", 987 | "'''\n", 988 | "\n", 989 | "# TODO: Replace with appropriate code\n", 990 | "def wordCount(wordListRDD):\n", 991 | " \"\"\"Creates a pair RDD with word counts from an RDD of words.\n", 992 | "\n", 993 | " Args:\n", 994 | " wordListRDD (RDD of str): An RDD consisting of words.\n", 995 | "\n", 996 | " Returns:\n", 997 | " RDD of (str, int): An RDD consisting of (word, count) tuples.\n", 998 | " \"\"\"\n", 999 | " wordCountsCollected = wordListRDD.map(lambda word:(word, [1]))\\\n", 1000 | " .map(lambda (word, num): (word, sum(num)))\\\n", 1001 | " .reduceByKey(lambda word, num : word + num)\n", 1002 | " \n", 1003 | " return wordCountsCollected\n", 1004 | "\n", 1005 | "print wordCount(wordsRDD).collect()" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": 111, 1011 | "metadata": { 1012 | "collapsed": false 1013 | }, 1014 | "outputs": [ 1015 | { 1016 | "name": "stdout", 1017 | "output_type": "stream", 1018 | "text": [ 1019 | "1 test passed.\n" 1020 | ] 1021 | } 1022 | ], 1023 | "source": [ 1024 | "# TEST wordCount function (4a)\n", 1025 | "Test.assertEquals(sorted(wordCount(wordsRDD).collect()),\n", 1026 | " [('cat', 2), ('elephant', 1), ('rat', 2)],\n", 1027 
| " 'incorrect definition for wordCount function')" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "markdown", 1032 | "metadata": {}, 1033 | "source": [ 1034 | "#### ** (4b) Capitalization and punctuation **\n", 1035 | "#### Real world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are:\n", 1036 | " + #### Words should be counted independent of their capitalization (e.g., Spark and spark should be counted as the same word).\n", 1037 | " + #### All punctuation should be removed.\n", 1038 | " + #### Any leading or trailing spaces on a line should be removed.\n", 1039 | " \n", 1040 | "#### Define the function `removePunctuation` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python [re](https://docs.python.org/2/library/re.html) module to remove any text that is not a letter, number, or space. Reading `help(re.sub)` might be useful." 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "code", 1045 | "execution_count": 112, 1046 | "metadata": { 1047 | "collapsed": false 1048 | }, 1049 | "outputs": [ 1050 | { 1051 | "name": "stdout", 1052 | "output_type": "stream", 1053 | "text": [ 1054 | "Help on function sub in module re:\n", 1055 | "\n", 1056 | "sub(pattern, repl, string, count=0, flags=0)\n", 1057 | " Return the string obtained by replacing the leftmost\n", 1058 | " non-overlapping occurrences of the pattern in string by the\n", 1059 | " replacement repl. repl can be either a string or a callable;\n", 1060 | " if a string, backslash escapes in it are processed. 
If it is\n", 1061 | " a callable, it's passed the match object and must return\n", 1062 | " a replacement string to be used.\n", 1063 | "\n" 1064 | ] 1065 | } 1066 | ], 1067 | "source": [ 1068 | "help(re.sub)" 1069 | ] 1070 | }, 1071 | { 1072 | "cell_type": "code", 1073 | "execution_count": 117, 1074 | "metadata": { 1075 | "collapsed": false 1076 | }, 1077 | "outputs": [ 1078 | { 1079 | "name": "stdout", 1080 | "output_type": "stream", 1081 | "text": [ 1082 | "hi you\n", 1083 | "no underscore\n" 1084 | ] 1085 | } 1086 | ], 1087 | "source": [ 1088 | "'''\n", 1089 | "# TODO: Replace with appropriate code\n", 1090 | "import re\n", 1091 | "def removePunctuation(text):\n", 1092 | " \"\"\"Removes punctuation, changes to lower case, and strips leading and trailing spaces.\n", 1093 | "\n", 1094 | " Note:\n", 1095 | " Only spaces, letters, and numbers should be retained. Other characters should be\n", 1096 | " eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after\n", 1097 | " punctuation is removed.\n", 1098 | "\n", 1099 | " Args:\n", 1100 | " text (str): A string.\n", 1101 | "\n", 1102 | " Returns:\n", 1103 | " str: The cleaned up string.\n", 1104 | " \"\"\"\n", 1105 | " \n", 1106 | "print removePunctuation('Hi, you!')\n", 1107 | "print removePunctuation(' No under_score!')\n", 1108 | "\n", 1109 | "'''\n", 1110 | "# TODO: Replace with appropriate code\n", 1111 | "import re\n", 1112 | "def removePunctuation(text):\n", 1113 | " \"\"\"Removes punctuation, changes to lower case, and strips leading and trailing spaces.\n", 1114 | "\n", 1115 | " Note:\n", 1116 | " Only spaces, letters, and numbers should be retained. Other characters should be\n", 1117 | " eliminated (e.g. it's becomes its). 
Leading and trailing spaces should be removed after\n", 1118 | " punctuation is removed.\n", 1119 | "\n", 1120 | " Args:\n", 1121 | " text (str): A string.\n", 1122 | "\n", 1123 | " Returns:\n", 1124 | " str: The cleaned up string.\n", 1125 | " \"\"\"\n", 1126 | " return re.sub(r'[^a-z0-9\\s]','',text.lower().strip())\n", 1127 | "print removePunctuation('Hi, you!')\n", 1128 | "print removePunctuation(' No under_score!')" 1129 | ] 1130 | }, 1131 | { 1132 | "cell_type": "code", 1133 | "execution_count": 118, 1134 | "metadata": { 1135 | "collapsed": false 1136 | }, 1137 | "outputs": [ 1138 | { 1139 | "name": "stdout", 1140 | "output_type": "stream", 1141 | "text": [ 1142 | "1 test passed.\n" 1143 | ] 1144 | } 1145 | ], 1146 | "source": [ 1147 | "# TEST Capitalization and punctuation (4b)\n", 1148 | "Test.assertEquals(removePunctuation(\" The Elephant's 4 cats. \"),\n", 1149 | " 'the elephants 4 cats',\n", 1150 | " 'incorrect definition for removePunctuation function')" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "markdown", 1155 | "metadata": {}, 1156 | "source": [ 1157 | "#### ** (4c) Load a text file **\n", 1158 | "#### For the next part of this lab, we will use the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100) from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). To convert a text file into an RDD, we use the `SparkContext.textFile()` method. We also apply the recently defined `removePunctuation()` function using a `map()` transformation to strip out the punctuation and change all text to lowercase. Since the file is large we use `take(15)`, so that we only print 15 lines." 
1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "code", 1163 | "execution_count": 132, 1164 | "metadata": { 1165 | "collapsed": false 1166 | }, 1167 | "outputs": [ 1168 | { 1169 | "name": "stdout", 1170 | "output_type": "stream", 1171 | "text": [ 1172 | "../data\n", 1173 | "cs100/lab1/shakespeare.txt\n", 1174 | "../data/cs100/lab1/shakespeare.txt\n", 1175 | "0: 1609\n", 1176 | "1: \n", 1177 | "2: the sonnets\n", 1178 | "3: \n", 1179 | "4: by william shakespeare\n", 1180 | "5: \n", 1181 | "6: \n", 1182 | "7: \n", 1183 | "8: 1\n", 1184 | "9: from fairest creatures we desire increase\n", 1185 | "10: that thereby beautys rose might never die\n", 1186 | "11: but as the riper should by time decease\n", 1187 | "12: his tender heir might bear his memory\n", 1188 | "13: but thou contracted to thine own bright eyes\n", 1189 | "14: feedst thy lights flame with selfsubstantial fuel\n" 1190 | ] 1191 | } 1192 | ], 1193 | "source": [ 1194 | "# Just run this code\n", 1195 | "import os.path\n", 1196 | "baseDir = os.path.join('../data')\n", 1197 | "print baseDir\n", 1198 | "inputPath = os.path.join('cs100', 'lab1', 'shakespeare.txt')\n", 1199 | "print inputPath\n", 1200 | "fileName = os.path.join(baseDir, inputPath)\n", 1201 | "print fileName\n", 1202 | "\n", 1203 | "shakespeareRDD = (sc\n", 1204 | " .textFile(fileName, 8)\n", 1205 | " .map(removePunctuation))\n", 1206 | "\n", 1207 | "print '\\n'.join(shakespeareRDD\n", 1208 | " .zipWithIndex() # to (line, lineNum)\n", 1209 | " .map(lambda (l, num): '{0}: {1}'.format(num, l)) # to 'lineNum: line'\n", 1210 | " .take(15))" 1211 | ] 1212 | }, 1213 | { 1214 | "cell_type": "markdown", 1215 | "metadata": {}, 1216 | "source": [ 1217 | "# **Watch out for os.path!!**" 1218 | ] 1219 | }, 1220 | { 1221 | "cell_type": "markdown", 1222 | "metadata": {}, 1223 | "source": [ 1224 | "#### ** (4d) Words from lines **\n", 1225 | "#### Before we can use the `wordCount()` function, we have to address two issues with the format of the RDD:\n", 1226 | " + #### 
The first issue is that we need to split each line by its spaces.\n", 1227 | " + #### The second issue is that we need to filter out empty lines.\n", 1228 | " \n", 1229 | "#### Apply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string [split()](https://docs.python.org/2/library/string.html#string.split) function. You might think that a `map()` transformation is the way to do this, but think about what the result of the `split()` function will be." 1230 | ] 1231 | }, 1232 | { 1233 | "cell_type": "code", 1234 | "execution_count": 125, 1235 | "metadata": { 1236 | "collapsed": false 1237 | }, 1238 | "outputs": [ 1239 | { 1240 | "name": "stdout", 1241 | "output_type": "stream", 1242 | "text": [ 1243 | "[u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds']\n", 1244 | "928908\n" 1245 | ] 1246 | } 1247 | ], 1248 | "source": [ 1249 | "'''\n", 1250 | "# TODO: Replace with appropriate code\n", 1251 | "shakespeareWordsRDD = shakespeareRDD.\n", 1252 | "shakespeareWordCount = shakespeareWordsRDD.count()\n", 1253 | "print shakespeareWordsRDD.top(5)\n", 1254 | "print shakespeareWordCount\n", 1255 | "'''\n", 1256 | "\n", 1257 | "# TODO: Replace with appropriate code\n", 1258 | "shakespeareWordsRDD = shakespeareRDD.flatMap(lambda line: line.split(' '))\n", 1259 | "shakespeareWordCount = shakespeareWordsRDD.count()\n", 1260 | "print shakespeareWordsRDD.top(5)\n", 1261 | "print shakespeareWordCount" 1262 | ] 1263 | }, 1264 | { 1265 | "cell_type": "markdown", 1266 | "metadata": {}, 1267 | "source": [ 1268 | "#### ** flatMap(f, preservesPartitioning=False)**\n", 1269 | "- Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results." 
1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": 134, 1275 | "metadata": { 1276 | "collapsed": false 1277 | }, 1278 | "outputs": [ 1279 | { 1280 | "data": { 1281 | "text/plain": [ 1282 | "[1, 1, 1, 2, 2, 3]" 1283 | ] 1284 | }, 1285 | "execution_count": 134, 1286 | "metadata": {}, 1287 | "output_type": "execute_result" 1288 | } 1289 | ], 1290 | "source": [ 1291 | "rdd = sc.parallelize([2, 3, 4])\n", 1292 | "sorted(rdd.flatMap(lambda x: range(1, x)).collect())" 1293 | ] 1294 | }, 1295 | { 1296 | "cell_type": "code", 1297 | "execution_count": 135, 1298 | "metadata": { 1299 | "collapsed": false 1300 | }, 1301 | "outputs": [ 1302 | { 1303 | "data": { 1304 | "text/plain": [ 1305 | "[(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]" 1306 | ] 1307 | }, 1308 | "execution_count": 135, 1309 | "metadata": {}, 1310 | "output_type": "execute_result" 1311 | } 1312 | ], 1313 | "source": [ 1314 | "sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())" 1315 | ] 1316 | }, 1317 | { 1318 | "cell_type": "code", 1319 | "execution_count": 126, 1320 | "metadata": { 1321 | "collapsed": false 1322 | }, 1323 | "outputs": [ 1324 | { 1325 | "name": "stdout", 1326 | "output_type": "stream", 1327 | "text": [ 1328 | "1 test passed.\n", 1329 | "1 test passed.\n" 1330 | ] 1331 | } 1332 | ], 1333 | "source": [ 1334 | "# TEST Words from lines (4d)\n", 1335 | "# This test allows for leading spaces to be removed either before or after\n", 1336 | "# punctuation is removed.\n", 1337 | "Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,\n", 1338 | " 'incorrect value for shakespeareWordCount')\n", 1339 | "Test.assertEquals(shakespeareWordsRDD.top(5),\n", 1340 | " [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],\n", 1341 | " 'incorrect value for shakespeareWordsRDD')" 1342 | ] 1343 | }, 1344 | { 1345 | "cell_type": "markdown", 1346 | "metadata": {}, 1347 | "source": [ 1348 | "#### ** (4e) Remove empty elements **\n", 1349 | "#### 
The next step is to filter out the empty elements. Remove all entries where the word is `''`." 1350 | ] 1351 | }, 1352 | { 1353 | "cell_type": "code", 1354 | "execution_count": 127, 1355 | "metadata": { 1356 | "collapsed": false 1357 | }, 1358 | "outputs": [ 1359 | { 1360 | "name": "stdout", 1361 | "output_type": "stream", 1362 | "text": [ 1363 | "882996\n" 1364 | ] 1365 | } 1366 | ], 1367 | "source": [ 1368 | "'''\n", 1369 | "# TODO: Replace with appropriate code\n", 1370 | "shakeWordsRDD = shakespeareWordsRDD.\n", 1371 | "shakeWordCount = shakeWordsRDD.count()\n", 1372 | "print shakeWordCount\n", 1373 | "'''\n", 1374 | "\n", 1375 | "# TODO: Replace with appropriate code\n", 1376 | "shakeWordsRDD = shakespeareWordsRDD.filter(lambda x: x != '')\n", 1377 | "shakeWordCount = shakeWordsRDD.count()\n", 1378 | "print shakeWordCount" 1379 | ] 1380 | }, 1381 | { 1382 | "cell_type": "code", 1383 | "execution_count": 128, 1384 | "metadata": { 1385 | "collapsed": false 1386 | }, 1387 | "outputs": [ 1388 | { 1389 | "name": "stdout", 1390 | "output_type": "stream", 1391 | "text": [ 1392 | "1 test passed.\n" 1393 | ] 1394 | } 1395 | ], 1396 | "source": [ 1397 | "# TEST Remove empty elements (4e)\n", 1398 | "Test.assertEquals(shakeWordCount, 882996, 'incorrect value for shakeWordCount')" 1399 | ] 1400 | }, 1401 | { 1402 | "cell_type": "markdown", 1403 | "metadata": {}, 1404 | "source": [ 1405 | "#### ** (4f) Count the words **\n", 1406 | "#### We now have an RDD that is only words. Next, let's apply the `wordCount()` function to produce a list of word counts. We can view the top 15 words by using the `takeOrdered()` action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.\n", 1407 | "#### You'll notice that many of the words are common English words. These are called stopwords. 
In a later lab, we will see how to eliminate them from the results.\n", 1408 | "#### Use the `wordCount()` function and `takeOrdered()` to obtain the fifteen most common words and their counts." 1409 | ] 1410 | }, 1411 | { 1412 | "cell_type": "code", 1413 | "execution_count": 129, 1414 | "metadata": { 1415 | "collapsed": false 1416 | }, 1417 | "outputs": [ 1418 | { 1419 | "name": "stdout", 1420 | "output_type": "stream", 1421 | "text": [ 1422 | "the: 27361\n", 1423 | "and: 26028\n", 1424 | "i: 20681\n", 1425 | "to: 19150\n", 1426 | "of: 17463\n", 1427 | "a: 14593\n", 1428 | "you: 13615\n", 1429 | "my: 12481\n", 1430 | "in: 10956\n", 1431 | "that: 10890\n", 1432 | "is: 9134\n", 1433 | "not: 8497\n", 1434 | "with: 7771\n", 1435 | "me: 7769\n", 1436 | "it: 7678\n" 1437 | ] 1438 | } 1439 | ], 1440 | "source": [ 1441 | "# TODO: Replace with appropriate code\n", 1442 | "top15WordsAndCounts = sorted(shakeWordsRDD.map(lambda x: (x,1))\n", 1443 | " .reduceByKey(add).collect(), key=lambda word:word[1], \n", 1444 | " reverse=True)[:15]\n", 1445 | "print '\\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))" 1446 | ] 1447 | }, 1448 | { 1449 | "cell_type": "code", 1450 | "execution_count": 130, 1451 | "metadata": { 1452 | "collapsed": false 1453 | }, 1454 | "outputs": [ 1455 | { 1456 | "name": "stdout", 1457 | "output_type": "stream", 1458 | "text": [ 1459 | "1 test passed.\n" 1460 | ] 1461 | } 1462 | ], 1463 | "source": [ 1464 | "# TEST Count the words (4f)\n", 1465 | "Test.assertEquals(top15WordsAndCounts,\n", 1466 | " [(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),\n", 1467 | " (u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),\n", 1468 | " (u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],\n", 1469 | " 'incorrect value for top15WordsAndCounts')" 1470 | ] 1471 | } 1472 | ], 1473 | "metadata": { 1474 | "kernelspec": { 1475 | "display_name": "Python 2", 1476 | 
"language": "python", 1477 | "name": "python2" 1478 | }, 1479 | "language_info": { 1480 | "codemirror_mode": { 1481 | "name": "ipython", 1482 | "version": 2 1483 | }, 1484 | "file_extension": ".py", 1485 | "mimetype": "text/x-python", 1486 | "name": "python", 1487 | "nbconvert_exporter": "python", 1488 | "pygments_lexer": "ipython2", 1489 | "version": "2.7.6" 1490 | } 1491 | }, 1492 | "nbformat": 4, 1493 | "nbformat_minor": 0 1494 | } 1495 | -------------------------------------------------------------------------------- /Part2/Week1_20150820/ML_lab1_review_student.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "![ML Logo](http://spark-mooc.github.io/web-assets/images/CS190.1x_Banner_300.png)\n", 8 | "# **Math and Python review and CTR data download**\n", 9 | "#### This notebook reviews vector and matrix math, the [NumPy](http://www.numpy.org/) Python package, and Python lambda expressions. It also covers downloading the data required for Lab 4, where you will analyze website click-through rates. Part 1 covers vector and matrix math, and you'll do a few exercises by hand. In Part 2, you'll learn about NumPy and use `ndarray` objects to solve the math exercises. Part 3 provides additional information about NumPy and how it relates to array usage in Spark's [MLlib](https://spark.apache.org/mllib/). Part 4 provides an overview of lambda expressions, and you'll wrap up by downloading the dataset for Lab 4.\n", 10 | "#### To move through the notebook just run each of the cells. You can run a cell by pressing \"shift-enter\", which will compute the current cell and advance to the next cell, or by clicking in a cell and pressing \"control-enter\", which will compute the current cell and remain in that cell. You should move through the notebook from top to bottom and run all of the cells. 
If you skip some cells, later cells might not work as expected.\n", 11 | "#### Note that there are several exercises within this notebook. You will need to provide solutions for cells that start with: `# TODO: Replace with appropriate code`.\n", 12 | " \n", 13 | "#### ** This notebook covers: **\n", 14 | "#### *Part 1:* Math review\n", 15 | "#### *Part 2:* NumPy\n", 16 | "#### *Part 3:* Additional NumPy and Spark linear algebra\n", 17 | "#### *Part 4:* Python lambda expressions\n", 18 | "#### *Part 5:* CTR data download" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "labVersion = 'cs190_week1_v_1_2'" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### ** Part 1: Math review **" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "#### ** (1a) Scalar multiplication: vectors **\n", 44 | "#### In this exercise, you will calculate the product of a scalar and a vector by hand and enter the result in the code cell below. Scalar multiplication is straightforward. The resulting vector equals the product of the scalar, which is a single value, and each item in the original vector. In the example below, $ a $ is the scalar (constant) and $ \\mathbf{v} $ is the vector. 
$$ a \\mathbf{v} = \\begin{bmatrix} a v_1 \\\\\\ a v_2 \\\\\\ \\vdots \\\\\\ a v_n \\end{bmatrix} $$\n", 45 | "#### Calculate the value of $ \\mathbf{x} $: $$ \\mathbf{x} = 3 \\begin{bmatrix} 1 \\\\\\ -2 \\\\\\ 0 \\end{bmatrix} $$\n", 46 | "#### Calculate the value of $ \\mathbf{y} $: $$ \\mathbf{y} = 2 \\begin{bmatrix} 2 \\\\\\ 4 \\\\\\ 8 \\end{bmatrix} $$" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 6, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "# TODO: Replace with appropriate code\n", 58 | "# Manually calculate your answer and represent the vector as a list of integer values.\n", 59 | "# For example, [2, 4, 8].\n", 60 | "x = [3, -6, 0]\n", 61 | "y = [4, 8, 16]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 7, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "1 test passed.\n", 76 | "1 test passed.\n" 77 | ] 78 | } 79 | ], 80 | "source": [ 81 | "# TEST Scalar multiplication: vectors (1a)\n", 82 | "# Import test library\n", 83 | "from test_helper import Test\n", 84 | "Test.assertEqualsHashed(x, 'e460f5b87531a2b60e0f55c31b2e49914f779981',\n", 85 | " 'incorrect value for vector x')\n", 86 | "Test.assertEqualsHashed(y, 'e2d37ff11427dbac7f833a5a7039c0de5a740b1e',\n", 87 | " 'incorrect value for vector y')" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "#### ** (1b) Element-wise multiplication: vectors **\n", 95 | "#### In this exercise, you will calculate the element-wise multiplication of two vectors by hand and enter the result in the code cell below. You'll later see that element-wise multiplication is the default method when two NumPy arrays are multiplied together. 
Note that we won't be performing element-wise multiplication in future labs, but we are introducing it here to distinguish it from other vector operators, and because it is a common operation in NumPy, as we will discuss in Part (2b).\n", 96 | "#### The element-wise calculation is as follows: $$ \\mathbf{x} \\odot \\mathbf{y} = \\begin{bmatrix} x_1 y_1 \\\\\\ x_2 y_2 \\\\\\ \\vdots \\\\\\ x_n y_n \\end{bmatrix} $$\n", 97 | "#### Calculate the value of $ \\mathbf{z} $: $$ \\mathbf{z} = \\begin{bmatrix} 1 \\\\\\ 2 \\\\\\ 3 \\end{bmatrix} \\odot \\begin{bmatrix} 4 \\\\\\ 5 \\\\\\ 6 \\end{bmatrix} $$" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 8, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "# TODO: Replace with appropriate code\n", 109 | "# Manually calculate your answer and represent the vector as a list of integer values.\n", 110 | "# z = \n", 111 | "\n", 112 | "z = [4, 10, 18]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 9, 118 | "metadata": { 119 | "collapsed": false 120 | }, 121 | "outputs": [ 122 | { 123 | "name": "stdout", 124 | "output_type": "stream", 125 | "text": [ 126 | "1 test passed.\n" 127 | ] 128 | } 129 | ], 130 | "source": [ 131 | "# TEST Element-wise multiplication: vectors (1b)\n", 132 | "Test.assertEqualsHashed(z, '4b5fe28ee2d274d7e0378bf993e28400f66205c2',\n", 133 | " 'incorrect value for vector z')" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "#### ** (1c) Dot product **\n", 141 | "#### In this exercise, you will calculate the dot product of two vectors by hand and enter the result in the code cell below. 
Note that the dot product is equivalent to performing element-wise multiplication and then summing the result.\n", 142 | "#### Below, you'll find the calculation for the dot product of two vectors, where each vector has length $ n $: $$ \\mathbf{w} \\cdot \\mathbf{x} = \\sum_{i=1}^n w_i x_i $$\n", 143 | "#### Note that you may also see $ \\mathbf{w} \\cdot \\mathbf{x} $ represented as $ \\mathbf{w}^\\top \\mathbf{x} $\n", 144 | "#### Calculate the value for $ c_1 $ based on the dot product of the following two vectors:\n", 145 | "#### $$ c_1 = \\begin{bmatrix} 1 \\\\\\ -3 \\end{bmatrix} \\cdot \\begin{bmatrix} 4 \\\\\\ 5 \\end{bmatrix}$$\n", 146 | "#### Calculate the value for $ c_2 $ based on the dot product of the following two vectors:\n", 147 | "#### $$ c_2 = \\begin{bmatrix} 3 \\\\\\ 4 \\\\\\ 5 \\end{bmatrix} \\cdot \\begin{bmatrix} 1 \\\\\\ 2 \\\\\\ 3 \\end{bmatrix}$$" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 10, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "# TODO: Replace with appropriate code\n", 159 | "# Manually calculate your answer and set the variables to their appropriate integer values.\n", 160 | "# c1 = \n", 161 | "# c2 = \n", 162 | "\n", 163 | "c1 = -11\n", 164 | "c2 = 26" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 11, 170 | "metadata": { 171 | "collapsed": false 172 | }, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "1 test passed.\n", 179 | "1 test passed.\n" 180 | ] 181 | } 182 | ], 183 | "source": [ 184 | "# TEST Dot product (1c)\n", 185 | "Test.assertEqualsHashed(c1, '8d7a9046b6a6e21d66409ad0849d6ab8aa51007c', 'incorrect value for c1')\n", 186 | "Test.assertEqualsHashed(c2, '887309d048beef83ad3eabf2a79a64a389ab1c9f', 'incorrect value for c2')" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "#### ** (1d) Matrix 
multiplication **\n", 194 | "#### In this exercise, you will calculate the result of multiplying two matrices together by hand and enter the result in the code cell below.\n", 195 | "#### Below, you'll find the calculation for multiplying two matrices together. Note that the number of columns for the first matrix and the number of rows for the second matrix have to be equal and are represented by $ n $:\n", 196 | "#### $$ [\\mathbf{X} \\mathbf{Y}]_{i,j} = \\sum_{r=1}^n \\mathbf{X}_{i,r} \\mathbf{Y}_{r,j} $$\n", 197 | "#### First, you'll calculate the value for $ \\mathbf{X} $.\n", 198 | "#### $$ \\mathbf{X} = \\begin{bmatrix} 1 & 2 & 3 \\\\\\ 4 & 5 & 6 \\end{bmatrix} \\begin{bmatrix} 1 & 2 \\\\\\ 3 & 4 \\\\\\ 5 & 6 \\end{bmatrix} $$\n", 199 | "#### Next, you'll perform an outer product and calculate the value for $ \\mathbf{Y} $. Note that the outer product is just a special case of general matrix multiplication and follows the same rules as normal matrix multiplication.\n", 200 | "#### $$ \\mathbf{Y} = \\begin{bmatrix} 1 \\\\\\ 2 \\\\\\ 3 \\end{bmatrix} \\begin{bmatrix} 1 & 2 & 3 \\end{bmatrix} $$" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 12, 206 | "metadata": { 207 | "collapsed": false 208 | }, 209 | "outputs": [], 210 | "source": [ 211 | "# TODO: Replace with appropriate code\n", 212 | "# Represent matrices as lists within lists. For example, [[1,2,3], [4,5,6]] represents a matrix with\n", 213 | "# two rows and three columns. 
Use integer values.\n", 214 | "# X = \n", 215 | "# Y = \n", 216 | "\n", 217 | "X = [[22, 28], [49, 64]]\n", 218 | "Y = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 13, 224 | "metadata": { 225 | "collapsed": false 226 | }, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "1 test passed.\n", 233 | "1 test passed.\n" 234 | ] 235 | } 236 | ], 237 | "source": [ 238 | "# TEST Matrix multiplication (1d)\n", 239 | "Test.assertEqualsHashed(X, 'c2ada2598d8a499e5dfb66f27a24f444483cba13',\n", 240 | " 'incorrect value for matrix X')\n", 241 | "Test.assertEqualsHashed(Y, 'f985daf651531b7d776523836f3068d4c12e4519',\n", 242 | " 'incorrect value for matrix Y')" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "### ** Part 2: NumPy **" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "#### ** (2a) Scalar multiplication **\n", 257 | "#### [NumPy](http://docs.scipy.org/doc/numpy/reference/) is a Python library for working with arrays. NumPy provides abstractions that make it easy to treat these underlying arrays as vectors and matrices. The library is optimized to be fast and memory efficient, and we'll be using it throughout the course. The building block for NumPy is the [ndarray](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html), which is a multidimensional array of fixed size that contains elements of one type (e.g. an array of floats).\n", 258 | "#### For this exercise, you'll create an `ndarray` consisting of the elements \\[1, 2, 3\\] and multiply this array by 5. Use [np.array()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) to create the array. Note that you can pass a Python list into `np.array()`. 
To perform scalar multiplication with an `ndarray` just use `*`.\n", 259 | "#### Note that if you create an array from a Python list of integers you will obtain a one-dimensional array, *which is equivalent to a vector for our purposes*." 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 14, 265 | "metadata": { 266 | "collapsed": false 267 | }, 268 | "outputs": [], 269 | "source": [ 270 | "# It is convention to import NumPy with the alias np\n", 271 | "import numpy as np" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 15, 277 | "metadata": { 278 | "collapsed": false 279 | }, 280 | "outputs": [ 281 | { 282 | "name": "stdout", 283 | "output_type": "stream", 284 | "text": [ 285 | "[1 2 3]\n", 286 | "[ 5 10 15]\n" 287 | ] 288 | } 289 | ], 290 | "source": [ 291 | "'''\n", 292 | "# TODO: Replace with appropriate code\n", 293 | "# Create a numpy array with the values 1, 2, 3\n", 294 | "simpleArray = \n", 295 | "# Perform the scalar product of 5 and the numpy array\n", 296 | "timesFive = \n", 297 | "print simpleArray\n", 298 | "print timesFive\n", 299 | "'''\n", 300 | "\n", 301 | "# TODO: Replace with appropriate code\n", 302 | "# Create a numpy array with the values 1, 2, 3\n", 303 | "simpleArray = np.array([1, 2, 3])\n", 304 | "# Perform the scalar product of 5 and the numpy array\n", 305 | "timesFive = simpleArray * 5\n", 306 | "print simpleArray\n", 307 | "print timesFive" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 16, 313 | "metadata": { 314 | "collapsed": false 315 | }, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | "1 test passed.\n" 322 | ] 323 | } 324 | ], 325 | "source": [ 326 | "# TEST Scalar multiplication (2a)\n", 327 | "Test.assertTrue(np.all(timesFive == [5, 10, 15]), 'incorrect value for timesFive')" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "#### ** 
(2b) Element-wise multiplication and dot product **\n", 335 | "#### NumPy arrays support both element-wise multiplication and dot product. Element-wise multiplication occurs automatically when you use the `*` operator to multiply two `ndarray` objects of the same length.\n", 336 | "#### To perform the dot product you can use either [np.dot()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html#numpy.dot) or [np.ndarray.dot()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.dot.html). For example, if you had NumPy arrays `x` and `y`, you could compute their dot product four ways: `np.dot(x, y)`, `np.dot(y, x)`, `x.dot(y)`, or `y.dot(x)`.\n", 337 | "#### For this exercise, multiply the arrays `u` and `v` element-wise and compute their dot product." 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 17, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "u: [ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]\n", 352 | "v: [ 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5]\n", 353 | "\n", 354 | "elementWise\n", 355 | "[ 0. 2.75 6. 9.75 14. 18.75 24. 29.75 36. 
42.75]\n", 356 | "\n", 357 | "dotProduct\n", 358 | "183.75\n" 359 | ] 360 | } 361 | ], 362 | "source": [ 363 | "'''\n", 364 | "# TODO: Replace with appropriate code\n", 365 | "# Create a ndarray based on a range and step size.\n", 366 | "u = np.arange(0, 5, .5)\n", 367 | "v = np.arange(5, 10, .5)\n", 368 | "\n", 369 | "elementWise = \n", 370 | "dotProduct = \n", 371 | "print 'u: {0}'.format(u)\n", 372 | "print 'v: {0}'.format(v)\n", 373 | "print '\\nelementWise\\n{0}'.format(elementWise)\n", 374 | "print '\\ndotProduct\\n{0}'.format(dotProduct)\n", 375 | "'''\n", 376 | "\n", 377 | "# TODO: Replace with appropriate code\n", 378 | "# Create a ndarray based on a range and step size.\n", 379 | "u = np.arange(0, 5, .5)\n", 380 | "v = np.arange(5, 10, .5)\n", 381 | "\n", 382 | "elementWise = u * v\n", 383 | "dotProduct = u.dot(v)\n", 384 | "print 'u: {0}'.format(u)\n", 385 | "print 'v: {0}'.format(v)\n", 386 | "print '\\nelementWise\\n{0}'.format(elementWise)\n", 387 | "print '\\ndotProduct\\n{0}'.format(dotProduct)\n" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 18, 393 | "metadata": { 394 | "collapsed": false 395 | }, 396 | "outputs": [ 397 | { 398 | "name": "stdout", 399 | "output_type": "stream", 400 | "text": [ 401 | "1 test passed.\n", 402 | "1 test passed.\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "# TEST Element-wise multiplication and dot product (2b)\n", 408 | "Test.assertTrue(np.all(elementWise == [ 0., 2.75, 6., 9.75, 14., 18.75, 24., 29.75, 36., 42.75]),\n", 409 | " 'incorrect value for elementWise')\n", 410 | "Test.assertEquals(dotProduct, 183.75, 'incorrect value for dotProduct')" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "#### ** (2c) Matrix math **\n", 418 | "#### With NumPy it is very easy to perform matrix math. You can use [np.matrix()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html) to generate a NumPy matrix. 
Just pass a two-dimensional `ndarray` or a list of lists to the function. You can perform matrix math on NumPy matrices using `*`.\n", 419 | "#### You can transpose a matrix by calling [numpy.matrix.transpose()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.transpose.html) or by using `.T` on the matrix object (e.g. `myMatrix.T`). Transposing a matrix produces a matrix where the new rows are the columns from the old matrix. For example: $$ \\begin{bmatrix} 1 & 2 & 3 \\\\\\ 4 & 5 & 6 \\end{bmatrix}^\\mathbf{\\top} = \\begin{bmatrix} 1 & 4 \\\\\\ 2 & 5 \\\\\\ 3 & 6 \\end{bmatrix} $$\n", 420 | " \n", 421 | "#### Inverting a matrix can be done using [numpy.linalg.inv()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html). Note that only square matrices can be inverted, and square matrices are not guaranteed to have an inverse. If the inverse exists, then multiplying a matrix by its inverse will produce the identity matrix. $ \\scriptsize ( \\mathbf{A}^{-1} \\mathbf{A} = \\mathbf{I_n} ) $ The identity matrix $ \\scriptsize \\mathbf{I_n} $ has ones along its diagonal and zero elsewhere. $$ \\mathbf{I_n} = \\begin{bmatrix} 1 & 0 & 0 & \\dots & 0 \\\\\\ 0 & 1 & 0 & \\dots & 0 \\\\\\ 0 & 0 & 1 & \\dots & 0 \\\\\\ \\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\\\ 0 & 0 & 0 & \\dots & 1 \\end{bmatrix} $$\n", 422 | "#### For this exercise, multiply $ \\mathbf{A} $ times its transpose $ ( \\mathbf{A}^\\top ) $ and then calculate the inverse of the result $ ( [ \\mathbf{A} \\mathbf{A}^\\top ]^{-1} ) $." 
423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 19, 428 | "metadata": { 429 | "collapsed": false 430 | }, 431 | "outputs": [ 432 | { 433 | "name": "stdout", 434 | "output_type": "stream", 435 | "text": [ 436 | "A:\n", 437 | "[[1 2 3 4]\n", 438 | " [5 6 7 8]]\n", 439 | "\n", 440 | "A transpose:\n", 441 | "[[1 5]\n", 442 | " [2 6]\n", 443 | " [3 7]\n", 444 | " [4 8]]\n", 445 | "\n", 446 | "AAt:\n", 447 | "[[ 30 70]\n", 448 | " [ 70 174]]\n", 449 | "\n", 450 | "AAtInv:\n", 451 | "[[ 0.54375 -0.21875]\n", 452 | " [-0.21875 0.09375]]\n", 453 | "\n", 454 | "AAtInv * AAt:\n", 455 | "[[ 1. 0.]\n", 456 | " [-0. 1.]]\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "'''\n", 462 | "# TODO: Replace with appropriate code\n", 463 | "from numpy.linalg import inv\n", 464 | "\n", 465 | "A = np.matrix([[1,2,3,4],[5,6,7,8]])\n", 466 | "print 'A:\\n{0}'.format(A)\n", 467 | "# Print A transpose\n", 468 | "print '\\nA transpose:\\n{0}'.format(A.T)\n", 469 | "\n", 470 | "# Multiply A by A transpose\n", 471 | "AAt = \n", 472 | "print '\\nAAt:\\n{0}'.format(AAt)\n", 473 | "\n", 474 | "# Invert AAt with np.linalg.inv()\n", 475 | "AAtInv = \n", 476 | "print '\\nAAtInv:\\n{0}'.format(AAtInv)\n", 477 | "\n", 478 | "# Show inverse times matrix equals identity\n", 479 | "# We round due to numerical precision\n", 480 | "print '\\nAAtInv * AAt:\\n{0}'.format((AAtInv * AAt).round(4))\n", 481 | "'''\n", 482 | "\n", 483 | "# TODO: Replace with appropriate code\n", 484 | "from numpy.linalg import inv\n", 485 | "\n", 486 | "A = np.matrix([[1,2,3,4],[5,6,7,8]])\n", 487 | "print 'A:\\n{0}'.format(A)\n", 488 | "# Print A transpose\n", 489 | "print '\\nA transpose:\\n{0}'.format(A.T)\n", 490 | "\n", 491 | "# Multiply A by A transpose\n", 492 | "AAt = A * A.T\n", 493 | "print '\\nAAt:\\n{0}'.format(AAt)\n", 494 | "\n", 495 | "# Invert AAt with np.linalg.inv()\n", 496 | "AAtInv = np.linalg.inv(AAt)\n", 497 | "print '\\nAAtInv:\\n{0}'.format(AAtInv)\n", 498 | "\n", 499 | "# 
Show inverse times matrix equals identity\n", 500 | "# We round due to numerical precision\n", 501 | "print '\\nAAtInv * AAt:\\n{0}'.format((AAtInv * AAt).round(4))\n" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 20, 507 | "metadata": { 508 | "collapsed": false 509 | }, 510 | "outputs": [ 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "1 test passed.\n", 516 | "1 test passed.\n" 517 | ] 518 | } 519 | ], 520 | "source": [ 521 | "# TEST Matrix math (2c)\n", 522 | "Test.assertTrue(np.all(AAt == np.matrix([[30, 70], [70, 174]])), 'incorrect value for AAt')\n", 523 | "Test.assertTrue(np.allclose(AAtInv, np.matrix([[0.54375, -0.21875], [-0.21875, 0.09375]])),\n", 524 | " 'incorrect value for AAtInv')" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "### ** Part 3: Additional NumPy and Spark linear algebra **" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "#### ** (3a) Slices **\n", 539 | "#### You can select a subset of a one-dimensional NumPy `ndarray`'s elements by using slices. These slices operate the same way as slices for Python lists. For example, `[0, 1, 2, 3][:2]` returns the first two elements `[0, 1]`. NumPy, additionally, has more sophisticated slicing that allows slicing across multiple dimensions; however, you'll only need to use basic slices in future labs for this course.\n", 540 | "#### Note that if no index is placed to the left of a `:`, it is equivalent to starting at 0, and hence `[0, 1, 2, 3][:2]` and `[0, 1, 2, 3][0:2]` yield the same result. Similarly, if no index is placed to the right of a `:`, it is equivalent to slicing to the end of the object. 
Also, you can use negative indices to index relative to the end of the object, so `[-2:]` would return the last two elements of the object.\n", 541 | "#### For this exercise, return the last 3 elements of the array `features`." 542 | ] 543 | }, 544 | { 545 | "cell_type": "code", 546 | "execution_count": 21, 547 | "metadata": { 548 | "collapsed": false 549 | }, 550 | "outputs": [ 551 | { 552 | "name": "stdout", 553 | "output_type": "stream", 554 | "text": [ 555 | "features:\n", 556 | "[1 2 3 4]\n", 557 | "\n", 558 | "lastThree:\n", 559 | "[2 3 4]\n" 560 | ] 561 | } 562 | ], 563 | "source": [ 564 | "'''\n", 565 | "# TODO: Replace with appropriate code\n", 566 | "features = np.array([1, 2, 3, 4])\n", 567 | "print 'features:\\n{0}'.format(features)\n", 568 | "\n", 569 | "# The last three elements of features\n", 570 | "lastThree = \n", 571 | "\n", 572 | "print '\\nlastThree:\\n{0}'.format(lastThree)\n", 573 | "'''\n", 574 | "# TODO: Replace with appropriate code\n", 575 | "features = np.array([1, 2, 3, 4])\n", 576 | "print 'features:\\n{0}'.format(features)\n", 577 | "\n", 578 | "# The last three elements of features\n", 579 | "lastThree = features[-3:]\n", 580 | "\n", 581 | "print '\\nlastThree:\\n{0}'.format(lastThree)" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 22, 587 | "metadata": { 588 | "collapsed": false 589 | }, 590 | "outputs": [ 591 | { 592 | "name": "stdout", 593 | "output_type": "stream", 594 | "text": [ 595 | "1 test passed.\n" 596 | ] 597 | } 598 | ], 599 | "source": [ 600 | "# TEST Slices (3a)\n", 601 | "Test.assertTrue(np.all(lastThree == [2, 3, 4]), 'incorrect value for lastThree')" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "#### ** (3b) Combining `ndarray` objects **\n", 609 | "#### NumPy provides many functions for creating new arrays from existing arrays. 
We'll explore two functions: [np.hstack()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html), which allows you to combine arrays column-wise, and [np.vstack()](http://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html), which allows you to combine arrays row-wise. Note that both `np.hstack()` and `np.vstack()` take in a tuple of arrays as their first argument. To horizontally combine three arrays `a`, `b`, and `c`, you would run `np.hstack((a, b, c))`.\n", 610 | "#### If we had two arrays: `a = [1, 2, 3, 4]` and `b = [5, 6, 7, 8]`, we could use `np.vstack((a, b))` to produce the two-dimensional array: $$ \\begin{bmatrix} 1 & 2 & 3 & 4 \\\\\\ 5 & 6 & 7 & 8 \\end{bmatrix} $$\n", 611 | "#### For this exercise, you'll combine the `zeros` and `ones` arrays both horizontally (column-wise) and vertically (row-wise).\n", 612 | "#### Note that the result of stacking two arrays is an `ndarray`. If you need the result to be a matrix, you can call `np.matrix()` on the result, which will return a NumPy matrix." 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 23, 618 | "metadata": { 619 | "collapsed": false 620 | }, 621 | "outputs": [ 622 | { 623 | "name": "stdout", 624 | "output_type": "stream", 625 | "text": [ 626 | "zeros:\n", 627 | "[ 0. 0. 0. 0. 0. 0. 0. 0.]\n", 628 | "\n", 629 | "ones:\n", 630 | "[ 1. 1. 1. 1. 1. 1. 1. 1.]\n", 631 | "\n", 632 | "zerosThenOnes:\n", 633 | "[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]\n", 634 | "\n", 635 | "zerosAboveOnes:\n", 636 | "[[ 0. 0. 0. 0. 0. 0. 0. 0.]\n", 637 | " [ 1. 1. 1. 1. 1. 1. 1. 
1.]]\n" 638 | ] 639 | } 640 | ], 641 | "source": [ 642 | "'''\n", 643 | "# TODO: Replace with appropriate code\n", 644 | "zeros = np.zeros(8)\n", 645 | "ones = np.ones(8)\n", 646 | "print 'zeros:\\n{0}'.format(zeros)\n", 647 | "print '\\nones:\\n{0}'.format(ones)\n", 648 | "\n", 649 | "zerosThenOnes = # A 1 by 16 array\n", 650 | "zerosAboveOnes = # A 2 by 8 array\n", 651 | "\n", 652 | "print '\\nzerosThenOnes:\\n{0}'.format(zerosThenOnes)\n", 653 | "print '\\nzerosAboveOnes:\\n{0}'.format(zerosAboveOnes)\n", 654 | "'''\n", 655 | "# TODO: Replace with appropriate code\n", 656 | "zeros = np.zeros(8)\n", 657 | "ones = np.ones(8)\n", 658 | "print 'zeros:\\n{0}'.format(zeros)\n", 659 | "print '\\nones:\\n{0}'.format(ones)\n", 660 | "\n", 661 | "zerosThenOnes = np.hstack((zeros, ones)) # A 1 by 16 array\n", 662 | "zerosAboveOnes = np.vstack((zeros, ones)) # A 2 by 8 array\n", 663 | "\n", 664 | "print '\\nzerosThenOnes:\\n{0}'.format(zerosThenOnes)\n", 665 | "print '\\nzerosAboveOnes:\\n{0}'.format(zerosAboveOnes)" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": 24, 671 | "metadata": { 672 | "collapsed": false 673 | }, 674 | "outputs": [ 675 | { 676 | "name": "stdout", 677 | "output_type": "stream", 678 | "text": [ 679 | "1 test passed.\n", 680 | "1 test passed.\n" 681 | ] 682 | } 683 | ], 684 | "source": [ 685 | "# TEST Combining ndarray objects (3b)\n", 686 | "Test.assertTrue(np.all(zerosThenOnes == [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]),\n", 687 | " 'incorrect value for zerosThenOnes')\n", 688 | "Test.assertTrue(np.all(zerosAboveOnes == [[0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1]]),\n", 689 | " 'incorrect value for zerosAboveOnes')" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": {}, 695 | "source": [ 696 | "#### ** (3c) PySpark's DenseVector **\n", 697 | "#### PySpark provides a [DenseVector](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector) class within the module 
[pyspark.mllib.linalg](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.linalg). `DenseVector` is used to store arrays of values for use in PySpark. `DenseVector` actually stores values in a NumPy array and delegates calculations to that object. You can create a new `DenseVector` using `DenseVector()` and passing in a NumPy array or a Python list.\n", 698 | "#### `DenseVector` implements several functions. The only function needed for this course is `DenseVector.dot()`, which operates just like `np.ndarray.dot()`.\n", 699 | "#### Note that `DenseVector` stores all values as `np.float64`, so even if you pass in a NumPy array of integers, the resulting `DenseVector` will contain floating-point numbers. Also, `DenseVector` objects exist locally and are not inherently distributed. `DenseVector` objects can be used in the distributed setting by either passing functions that contain them to resilient distributed dataset (RDD) transformations or by distributing them directly as RDDs. You'll learn more about RDDs in the Spark tutorial.\n", 700 | "#### For this exercise, create a `DenseVector` consisting of the values `[3.0, 4.0, 5.0]` and compute the dot product of this vector with `numpyVector`." 
701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": 25, 706 | "metadata": { 707 | "collapsed": false 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "from pyspark.mllib.linalg import DenseVector" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 26, 717 | "metadata": { 718 | "collapsed": false 719 | }, 720 | "outputs": [ 721 | { 722 | "name": "stdout", 723 | "output_type": "stream", 724 | "text": [ 725 | "\n", 726 | "numpyVector:\n", 727 | "[-3 -4 5]\n", 728 | "myDenseVector:\n", 729 | "[3.0,4.0,5.0]\n", 730 | "\n", 731 | "denseDotProduct:\n", 732 | "0.0\n" 733 | ] 734 | } 735 | ], 736 | "source": [ 737 | "'''\n", 738 | "# TODO: Replace with appropriate code\n", 739 | "numpyVector = np.array([-3, -4, 5])\n", 740 | "print '\\nnumpyVector:\\n{0}'.format(numpyVector)\n", 741 | "\n", 742 | "# Create a DenseVector consisting of the values [3.0, 4.0, 5.0]\n", 743 | "myDenseVector = \n", 744 | "# Calculate the dot product between the two vectors.\n", 745 | "denseDotProduct = \n", 746 | "\n", 747 | "print 'myDenseVector:\\n{0}'.format(myDenseVector)\n", 748 | "print '\\ndenseDotProduct:\\n{0}'.format(denseDotProduct)\n", 749 | "'''\n", 750 | "# TODO: Replace with appropriate code\n", 751 | "numpyVector = np.array([-3, -4, 5])\n", 752 | "print '\\nnumpyVector:\\n{0}'.format(numpyVector)\n", 753 | "\n", 754 | "# Create a DenseVector consisting of the values [3.0, 4.0, 5.0]\n", 755 | "myDenseVector = DenseVector([3.0, 4.0, 5.0])\n", 756 | "# Calculate the dot product between the two vectors.\n", 757 | "denseDotProduct = myDenseVector.dot(numpyVector)\n", 758 | "\n", 759 | "print 'myDenseVector:\\n{0}'.format(myDenseVector)\n", 760 | "print '\\ndenseDotProduct:\\n{0}'.format(denseDotProduct)\n" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": 27, 766 | "metadata": { 767 | "collapsed": false 768 | }, 769 | "outputs": [ 770 | { 771 | "name": "stdout", 772 | "output_type": "stream", 773 | 
"text": [ 774 | "1 test passed.\n", 775 | "1 test passed.\n", 776 | "1 test passed.\n" 777 | ] 778 | } 779 | ], 780 | "source": [ 781 | "# TEST PySpark's DenseVector (3c)\n", 782 | "Test.assertTrue(isinstance(myDenseVector, DenseVector), 'myDenseVector is not a DenseVector')\n", 783 | "Test.assertTrue(np.allclose(myDenseVector, np.array([3., 4., 5.])),\n", 784 | " 'incorrect value for myDenseVector')\n", 785 | "Test.assertTrue(np.allclose(denseDotProduct, 0.0), 'incorrect value for denseDotProduct')" 786 | ] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": {}, 791 | "source": [ 792 | "### ** Part 4: Python lambda expressions **" 793 | ] 794 | }, 795 | { 796 | "cell_type": "markdown", 797 | "metadata": {}, 798 | "source": [ 799 | "#### ** (4a) Lambda is an anonymous function **\n", 800 | "#### We can use a lambda expression to create a function. To do this, you type `lambda` followed by the names of the function's parameters separated by commas, followed by a `:`, and then the expression statement that the function will evaluate. For example, `lambda x, y: x + y` is an anonymous function that computes the sum of its two inputs.\n", 801 | "#### Lambda expressions return a function when evaluated. The function is not bound to any variable, which is why lambdas are associated with anonymous functions. However, it is possible to assign the function to a variable. Lambda expressions are particularly useful when you need to pass a simple function into another function. In that case, the lambda expression generates a function that is bound to the parameter being passed into the function.\n", 802 | "#### Below, we'll see an example of how we can bind the function returned by a lambda expression to a variable named `addSLambda`. From this example, we can see that `lambda` provides a shortcut for creating a simple function. Note that the behavior of the function created using `def` and the function created using `lambda` is equivalent. 
Both functions have the same type and return the same results. The only differences are the names and the way they were created.\n", 803 | "#### For this exercise, first run the two cells below to compare a function created using `def` with a corresponding anonymous function. Next, write your own lambda expression that creates a function that multiplies its input (a single parameter) by 10.\n", 804 | "#### Here are some additional references that explain lambdas: [Lambda Functions](http://www.secnetix.de/olli/Python/lambda_functions.hawk), [Lambda Tutorial](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/), and [Python Functions](http://www.bogotobogo.com/python/python_functions_lambda.php)." 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": 28, 810 | "metadata": { 811 | "collapsed": false 812 | }, 813 | "outputs": [ 814 | { 815 | "name": "stdout", 816 | "output_type": "stream", 817 | "text": [ 818 | "\n", 819 | "\n", 820 | "cats\n" 821 | ] 822 | } 823 | ], 824 | "source": [ 825 | "# Example function\n", 826 | "def addS(x):\n", 827 | " return x + 's'\n", 828 | "print type(addS)\n", 829 | "print addS\n", 830 | "print addS('cat')" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 29, 836 | "metadata": { 837 | "collapsed": false 838 | }, 839 | "outputs": [ 840 | { 841 | "name": "stdout", 842 | "output_type": "stream", 843 | "text": [ 844 | "\n", 845 | " at 0xb1f1dca4>\n", 846 | "cats\n" 847 | ] 848 | } 849 | ], 850 | "source": [ 851 | "# As a lambda\n", 852 | "addSLambda = lambda x: x + 's'\n", 853 | "print type(addSLambda)\n", 854 | "print addSLambda\n", 855 | "print addSLambda('cat')" 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": 32, 861 | "metadata": { 862 | "collapsed": false 863 | }, 864 | "outputs": [ 865 | { 866 | "name": "stdout", 867 | "output_type": "stream", 868 | "text": [ 869 | "50\n", 870 | "\n", 871 | " at 0xb0e8409c>\n" 872 | ] 873 | } 874 | 
], 875 | "source": [ 876 | "'''\n", 877 | "# TODO: Replace with appropriate code\n", 878 | "# Recall that: \"lambda x, y: x + y\" creates a function that adds together two numbers\n", 879 | "multiplyByTen = lambda x: \n", 880 | "print multiplyByTen(5)\n", 881 | "\n", 882 | "# Note that the function still shows its name as <lambda>\n", 883 | "print '\\n', multiplyByTen\n", 884 | "'''\n", 885 | "# TODO: Replace with appropriate code\n", 886 | "# Recall that: \"lambda x, y: x + y\" creates a function that adds together two numbers\n", 887 | "multiplyByTen = lambda x: x * 10\n", 888 | "print multiplyByTen(5)\n", 889 | "\n", 890 | "# Note that the function still shows its name as <lambda>\n", 891 | "print '\\n', multiplyByTen\n", 892 | " " 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 33, 898 | "metadata": { 899 | "collapsed": false 900 | }, 901 | "outputs": [ 902 | { 903 | "name": "stdout", 904 | "output_type": "stream", 905 | "text": [ 906 | "1 test passed.\n" 907 | ] 908 | } 909 | ], 910 | "source": [ 911 | "# TEST Python lambda expressions (4a)\n", 912 | "Test.assertEquals(multiplyByTen(10), 100, 'incorrect definition for multiplyByTen')" 913 | ] 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "metadata": {}, 918 | "source": [ 919 | "#### ** (4b) `lambda` fewer steps than `def` **\n", 920 | "#### `lambda` generates a function and returns it, while `def` generates a function and assigns it to a name. The function returned by `lambda` also automatically returns the value of its expression statement, which reduces the amount of code that needs to be written.\n", 921 | "#### For this exercise, recreate the `def` behavior using `lambda`. Note that since a lambda expression returns a function, it can be used anywhere an object is expected. For example, you can create a list of functions where each function in the list was generated by a lambda expression." 
922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": 34, 927 | "metadata": { 928 | "collapsed": false 929 | }, 930 | "outputs": [ 931 | { 932 | "name": "stdout", 933 | "output_type": "stream", 934 | "text": [ 935 | "9\n", 936 | "-1\n" 937 | ] 938 | } 939 | ], 940 | "source": [ 941 | "# Code using def that we will recreate with lambdas\n", 942 | "def plus(x, y):\n", 943 | " return x + y\n", 944 | "\n", 945 | "def minus(x, y):\n", 946 | " return x - y\n", 947 | "\n", 948 | "functions = [plus, minus]\n", 949 | "print functions[0](4, 5)\n", 950 | "print functions[1](4, 5)" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": 36, 956 | "metadata": { 957 | "collapsed": false 958 | }, 959 | "outputs": [ 960 | { 961 | "name": "stdout", 962 | "output_type": "stream", 963 | "text": [ 964 | "9\n", 965 | "-1\n" 966 | ] 967 | } 968 | ], 969 | "source": [ 970 | "'''\n", 971 | "# TODO: Replace with appropriate code\n", 972 | "# The first function should add two values, while the second function should subtract the second\n", 973 | "# value from the first value.\n", 974 | "lambdaFunctions = [lambda , lambda ]\n", 975 | "print lambdaFunctions[0](4, 5)\n", 976 | "print lambdaFunctions[1](4, 5)\n", 977 | "'''\n", 978 | "lambdaFunctions = [lambda x, y: x + y , lambda x, y: x- y]\n", 979 | "print lambdaFunctions[0](4, 5)\n", 980 | "print lambdaFunctions[1](4, 5)\n", 981 | " " 982 | ] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "execution_count": 37, 987 | "metadata": { 988 | "collapsed": false 989 | }, 990 | "outputs": [ 991 | { 992 | "name": "stdout", 993 | "output_type": "stream", 994 | "text": [ 995 | "1 test passed.\n", 996 | "1 test passed.\n" 997 | ] 998 | } 999 | ], 1000 | "source": [ 1001 | "# TEST lambda fewer steps than def (4b)\n", 1002 | "Test.assertEquals(lambdaFunctions[0](10, 10), 20, 'incorrect first lambdaFunction')\n", 1003 | "Test.assertEquals(lambdaFunctions[1](10, 10), 0, 'incorrect second lambdaFunction')" 1004 
| ] 1005 | }, 1006 | { 1007 | "cell_type": "markdown", 1008 | "metadata": {}, 1009 | "source": [ 1010 | "#### ** (4c) Lambda expression arguments **\n", 1011 | "#### Lambda expressions can be used to generate functions that take in zero or more parameters. The syntax for `lambda` allows for multiple ways to define the same function. For example, we might want to create a function that takes in a single parameter, where the parameter is a tuple consisting of two values, and the function adds the two values together. The syntax could be either: `lambda x: x[0] + x[1]` or `lambda (x0, x1): x0 + x1`. If we called either function on the tuple `(3, 4)` it would return `7`. Note that the second `lambda` relies on the tuple `(3, 4)` being unpacked automatically, which means that `x0` is assigned the value `3` and `x1` is assigned the value `4`. This tuple-unpacking parameter syntax is specific to Python 2; it was removed in Python 3.\n", 1012 | "#### As another example, consider the following two-parameter lambda expressions: `lambda x, y: (x[0] + y[0], x[1] + y[1])` and `lambda (x0, x1), (y0, y1): (x0 + y0, x1 + y1)`. The result of applying either of these functions to tuples `(1, 2)` and `(3, 4)` would be the tuple `(4, 6)`.\n", 1013 | "#### For this exercise: you'll create one-parameter functions `swap1` and `swap2` that swap the order of a tuple; a one-parameter function `swapOrder` that takes in a tuple with three values and changes the order to: second element, third element, first element; and finally, a three-parameter function `sumThree` that takes in three tuples, each with two values, and returns a tuple containing two values: the sum of the first element of each tuple and the sum of the second element of each tuple." 
1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "code", 1018 | "execution_count": 38, 1019 | "metadata": { 1020 | "collapsed": false 1021 | }, 1022 | "outputs": [ 1023 | { 1024 | "name": "stdout", 1025 | "output_type": "stream", 1026 | "text": [ 1027 | "a1( (3,4) ) = 7\n", 1028 | "a2( (3,4) ) = 7\n", 1029 | "\n", 1030 | "b1( (1,2), (3,4) ) = (4, 6)\n", 1031 | "b2( (1,2), (3,4) ) = (4, 6)\n" 1032 | ] 1033 | } 1034 | ], 1035 | "source": [ 1036 | "# Examples. Note that the spacing has been modified to distinguish parameters from tuples.\n", 1037 | "\n", 1038 | "# One-parameter function\n", 1039 | "a1 = lambda x: x[0] + x[1]\n", 1040 | "a2 = lambda (x0, x1): x0 + x1\n", 1041 | "print 'a1( (3,4) ) = {0}'.format( a1( (3,4) ) )\n", 1042 | "print 'a2( (3,4) ) = {0}'.format( a2( (3,4) ) )\n", 1043 | "\n", 1044 | "# Two-parameter function\n", 1045 | "b1 = lambda x, y: (x[0] + y[0], x[1] + y[1])\n", 1046 | "b2 = lambda (x0, x1), (y0, y1): (x0 + y0, x1 + y1)\n", 1047 | "print '\\nb1( (1,2), (3,4) ) = {0}'.format( b1( (1,2), (3,4) ) )\n", 1048 | "print 'b2( (1,2), (3,4) ) = {0}'.format( b2( (1,2), (3,4) ) )" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": 39, 1054 | "metadata": { 1055 | "collapsed": false 1056 | }, 1057 | "outputs": [ 1058 | { 1059 | "name": "stdout", 1060 | "output_type": "stream", 1061 | "text": [ 1062 | "swap1((1, 2)) = (2, 1)\n", 1063 | "swap2((1, 2)) = (2, 1)\n", 1064 | "swapOrder((1, 2, 3)) = (2, 3, 1)\n", 1065 | "sumThree((1, 2), (3, 4), (5, 6)) = (9, 12)\n" 1066 | ] 1067 | } 1068 | ], 1069 | "source": [ 1070 | "'''\n", 1071 | "# TODO: Replace with appropriate code\n", 1072 | "# Use both syntaxes to create a function that takes in a tuple of two values and swaps their order\n", 1073 | "# E.g. 
(1, 2) => (2, 1)\n", 1074 | "swap1 = lambda x: \n", 1075 | "swap2 = lambda (x0, x1): \n", 1076 | "print 'swap1((1, 2)) = {0}'.format(swap1((1, 2)))\n", 1077 | "print 'swap2((1, 2)) = {0}'.format(swap2((1, 2)))\n", 1078 | "\n", 1079 | "# Using either syntax, create a function that takes in a tuple with three values and returns a tuple\n", 1080 | "# of (2nd value, 3rd value, 1st value). E.g. (1, 2, 3) => (2, 3, 1)\n", 1081 | "swapOrder = \n", 1082 | "print 'swapOrder((1, 2, 3)) = {0}'.format(swapOrder((1, 2, 3)))\n", 1083 | "\n", 1084 | "# Using either syntax, create a function that takes in three tuples each with two values. The\n", 1085 | "# function should return a tuple with the values in the first position summed and the values in the\n", 1086 | "# second position summed. E.g. (1, 2), (3, 4), (5, 6) => (1 + 3 + 5, 2 + 4 + 6) => (9, 12)\n", 1087 | "sumThree = \n", 1088 | "print 'sumThree((1, 2), (3, 4), (5, 6)) = {0}'.format(sumThree((1, 2), (3, 4), (5, 6)))\n", 1089 | "'''\n", 1090 | "\n", 1091 | "# TODO: Replace with appropriate code\n", 1092 | "# Use both syntaxes to create a function that takes in a tuple of two values and swaps their order\n", 1093 | "# E.g. (1, 2) => (2, 1)\n", 1094 | "swap1 = lambda x: (x[1], x[0])\n", 1095 | "swap2 = lambda (x0, x1): (x1, x0)\n", 1096 | "print 'swap1((1, 2)) = {0}'.format(swap1((1, 2)))\n", 1097 | "print 'swap2((1, 2)) = {0}'.format(swap2((1, 2)))\n", 1098 | "\n", 1099 | "# Using either syntax, create a function that takes in a tuple with three values and returns a tuple\n", 1100 | "# of (2nd value, 3rd value, 1st value). E.g. (1, 2, 3) => (2, 3, 1)\n", 1101 | "swapOrder = lambda x: (x[1], x[2], x[0])\n", 1102 | "print 'swapOrder((1, 2, 3)) = {0}'.format(swapOrder((1, 2, 3)))\n", 1103 | "\n", 1104 | "# Using either syntax, create a function that takes in three tuples each with two values. 
The\n", 1105 | "# function should return a tuple with the values in the first position summed and the values in the\n", 1106 | "# second position summed. E.g. (1, 2), (3, 4), (5, 6) => (1 + 3 + 5, 2 + 4 + 6) => (9, 12)\n", 1107 | "sumThree = lambda x, y, z: (x[0] + y[0] + z[0], x[1] + y[1] + z[1])\n", 1108 | "print 'sumThree((1, 2), (3, 4), (5, 6)) = {0}'.format(sumThree((1, 2), (3, 4), (5, 6)))" 1109 | ] 1110 | }, 1111 | { 1112 | "cell_type": "code", 1113 | "execution_count": 40, 1114 | "metadata": { 1115 | "collapsed": false 1116 | }, 1117 | "outputs": [ 1118 | { 1119 | "name": "stdout", 1120 | "output_type": "stream", 1121 | "text": [ 1122 | "1 test passed.\n", 1123 | "1 test passed.\n", 1124 | "1 test passed.\n", 1125 | "1 test passed.\n" 1126 | ] 1127 | } 1128 | ], 1129 | "source": [ 1130 | "# TEST Lambda expression arguments (4c)\n", 1131 | "Test.assertEquals(swap1((1, 2)), (2, 1), 'incorrect definition for swap1')\n", 1132 | "Test.assertEquals(swap2((1, 2)), (2, 1), 'incorrect definition for swap2')\n", 1133 | "Test.assertEquals(swapOrder((1, 2, 3)), (2, 3, 1), 'incorrect definition for swapOrder')\n", 1134 | "Test.assertEquals(sumThree((1, 2), (3, 4), (5, 6)), (9, 12), 'incorrect definition for sumThree')" 1135 | ] 1136 | }, 1137 | { 1138 | "cell_type": "markdown", 1139 | "metadata": {}, 1140 | "source": [ 1141 | "#### ** (4d) Restrictions on lambda expressions **\n", 1142 | "#### [Lambda expressions](https://docs.python.org/2/reference/expressions.html#lambda) consist of a single [expression statement](https://docs.python.org/2/reference/simple_stmts.html#expression-statements) and cannot contain other [simple statements](https://docs.python.org/2/reference/simple_stmts.html). In short, this means that the lambda expression needs to evaluate to a value and exist on a single logical line. If more complex logic is necessary, use `def` in place of `lambda`.\n", 1143 | "#### Expression statements evaluate to a value (sometimes that value is None). 
Lambda expressions automatically return the value of their expression statement. In fact, a `return` statement in a `lambda` would raise a `SyntaxError`.\n", 1144 | "#### The following Python keywords refer to simple statements that cannot be used in a lambda expression: `assert`, `pass`, `del`, `print`, `return`, `yield`, `raise`, `break`, `continue`, `import`, `global`, and `exec`. Also, note that assignment statements (`=`) and augmented assignment statements (e.g. `+=`) cannot be used either." 1145 | ] 1146 | }, 1147 | { 1148 | "cell_type": "code", 1149 | "execution_count": 41, 1150 | "metadata": { 1151 | "collapsed": false 1152 | }, 1153 | "outputs": [ 1154 | { 1155 | "name": "stderr", 1156 | "output_type": "stream", 1157 | "text": [ 1158 | "Traceback (most recent call last):\n", 1159 | " File \"\", line 5, in \n", 1160 | " exec \"lambda x: print x\"\n", 1161 | " File \"\", line 1\n", 1162 | " lambda x: print x\n", 1163 | " ^\n", 1164 | "SyntaxError: invalid syntax\n" 1165 | ] 1166 | } 1167 | ], 1168 | "source": [ 1169 | "# Just run this code\n", 1170 | "# This code will fail with a syntax error, as we can't use print in a lambda expression\n", 1171 | "import traceback\n", 1172 | "try:\n", 1173 | " exec \"lambda x: print x\"\n", 1174 | "except:\n", 1175 | " traceback.print_exc()" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "markdown", 1180 | "metadata": {}, 1181 | "source": [ 1182 | "#### ** (4e) Functional programming **\n", 1183 | "#### The `lambda` examples we have shown so far have been somewhat contrived. This is because they were created to demonstrate the differences and similarities between `lambda` and `def`. An excellent use case for lambda expressions is functional programming. 
In functional programming, you will often pass functions to other functions as parameters, and `lambda` can be used to reduce the amount of code necessary and to make the code more readable.\n", 1184 | "#### Some commonly used functions in functional programming are map, filter, and reduce. Map transforms a series of elements by applying a function individually to each element in the series. It then returns the series of transformed elements. Filter also applies a function individually to each element in a series; however, with filter, this function evaluates to `True` or `False` and only elements that evaluate to `True` are retained. Finally, reduce operates on pairs of elements in a series. It applies a function that takes in two values and returns a single value. Using this function, reduce is able to, iteratively, \"reduce\" a series to a single value.\n", 1185 | "#### For this exercise, you'll create three simple `lambda` functions, one each for use in map, filter, and reduce. The map `lambda` will multiply its input by 5, the filter `lambda` will evaluate to `True` for even numbers, and the reduce `lambda` will add two numbers. Note that we have created a class called `FunctionalWrapper` so that the syntax for this exercise matches the syntax you'll see in PySpark.\n", 1186 | "#### Note that map requires a one parameter function that returns a new value, filter requires a one parameter function that returns `True` or `False`, and reduce requires a two parameter function that combines the two parameters and returns a new value." 
1187 | ] 1188 | }, 1189 | { 1190 | "cell_type": "code", 1191 | "execution_count": 42, 1192 | "metadata": { 1193 | "collapsed": false 1194 | }, 1195 | "outputs": [], 1196 | "source": [ 1197 | "# Create a class to give our examples the same syntax as PySpark\n", 1198 | "class FunctionalWrapper(object):\n", 1199 | " def __init__(self, data):\n", 1200 | " self.data = data\n", 1201 | " def map(self, function):\n", 1202 | " \"\"\"Call `map` on the items in `data` using the provided `function`\"\"\"\n", 1203 | " return FunctionalWrapper(map(function, self.data))\n", 1204 | " def reduce(self, function):\n", 1205 | " \"\"\"Call `reduce` on the items in `data` using the provided `function`\"\"\"\n", 1206 | " return reduce(function, self.data)\n", 1207 | " def filter(self, function):\n", 1208 | " \"\"\"Call `filter` on the items in `data` using the provided `function`\"\"\"\n", 1209 | " return FunctionalWrapper(filter(function, self.data))\n", 1210 | " def __eq__(self, other):\n", 1211 | " return (isinstance(other, self.__class__)\n", 1212 | " and self.__dict__ == other.__dict__)\n", 1213 | " def __getattr__(self, name): return getattr(self.data, name)\n", 1214 | " def __getitem__(self, k): return self.data.__getitem__(k)\n", 1215 | " def __repr__(self): return 'FunctionalWrapper({0})'.format(repr(self.data))\n", 1216 | " def __str__(self): return 'FunctionalWrapper({0})'.format(str(self.data))" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": 43, 1222 | "metadata": { 1223 | "collapsed": false 1224 | }, 1225 | "outputs": [ 1226 | { 1227 | "name": "stdout", 1228 | "output_type": "stream", 1229 | "text": [ 1230 | "Result from for loop: FunctionalWrapper([3, 4, 5, 6, 7])\n", 1231 | "Result from map call: FunctionalWrapper([3, 4, 5, 6, 7])\n" 1232 | ] 1233 | } 1234 | ], 1235 | "source": [ 1236 | "# Map example\n", 1237 | "\n", 1238 | "# Create some data\n", 1239 | "mapData = FunctionalWrapper(range(5))\n", 1240 | "\n", 1241 | "# Define a function 
to be applied to each element\n", 1242 | "f = lambda x: x + 3\n", 1243 | "\n", 1244 | "# Imperative programming: loop through and create a new object by applying f\n", 1245 | "mapResult = FunctionalWrapper([]) # Initialize the result\n", 1246 | "for element in mapData:\n", 1247 | " mapResult.append(f(element)) # Apply f and save the new value\n", 1248 | "print 'Result from for loop: {0}'.format(mapResult)\n", 1249 | "\n", 1250 | "# Functional programming: use map rather than a for loop\n", 1251 | "print 'Result from map call: {0}'.format(mapData.map(f))\n", 1252 | "\n", 1253 | "# Note that the results are the same but that the map function abstracts away the implementation\n", 1254 | "# and requires less code" 1255 | ] 1256 | }, 1257 | { 1258 | "cell_type": "code", 1259 | "execution_count": 45, 1260 | "metadata": { 1261 | "collapsed": false 1262 | }, 1263 | "outputs": [ 1264 | { 1265 | "name": "stdout", 1266 | "output_type": "stream", 1267 | "text": [ 1268 | "mapResult: FunctionalWrapper([0, 5, 10, 15, 20, 25, 30, 35, 40, 45])\n", 1269 | "\n", 1270 | "filterResult: FunctionalWrapper([0, 2, 4, 6, 8])\n", 1271 | "\n", 1272 | "reduceResult: 45\n" 1273 | ] 1274 | } 1275 | ], 1276 | "source": [ 1277 | "'''\n", 1278 | "# TODO: Replace with appropriate code\n", 1279 | "dataset = FunctionalWrapper(range(10))\n", 1280 | "\n", 1281 | "# Multiply each element by 5\n", 1282 | "mapResult = dataset.map()\n", 1283 | "# Keep the even elements\n", 1284 | "# Note that \"x % 2\" evaluates to the remainder of x divided by 2\n", 1285 | "filterResult = dataset.filter()\n", 1286 | "# Sum the elements\n", 1287 | "reduceResult = dataset.reduce()\n", 1288 | "\n", 1289 | "print 'mapResult: {0}'.format(mapResult)\n", 1290 | "print '\\nfilterResult: {0}'.format(filterResult)\n", 1291 | "print '\\nreduceResult: {0}'.format(reduceResult)\n", 1292 | "'''\n", 1293 | "\n", 1294 | "# TODO: Replace with appropriate code\n", 1295 | "dataset = FunctionalWrapper(range(10))\n", 1296 | "\n", 1297 | "# 
Multiply each element by 5\n", 1298 | "mapResult = dataset.map(lambda x: x * 5)\n", 1299 | "# Keep the even elements\n", 1300 | "# Note that \"x % 2\" evaluates to the remainder of x divided by 2\n", 1301 | "filterResult = dataset.filter(lambda x: x % 2 == 0)\n", 1302 | "# Sum the elements\n", 1303 | "reduceResult = dataset.reduce(lambda x, y: x + y)\n", 1304 | "\n", 1305 | "print 'mapResult: {0}'.format(mapResult)\n", 1306 | "print '\\nfilterResult: {0}'.format(filterResult)\n", 1307 | "print '\\nreduceResult: {0}'.format(reduceResult)" 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "code", 1312 | "execution_count": 46, 1313 | "metadata": { 1314 | "collapsed": false 1315 | }, 1316 | "outputs": [ 1317 | { 1318 | "name": "stdout", 1319 | "output_type": "stream", 1320 | "text": [ 1321 | "1 test passed.\n", 1322 | "1 test passed.\n", 1323 | "1 test passed.\n" 1324 | ] 1325 | } 1326 | ], 1327 | "source": [ 1328 | "# TEST Functional programming (4e)\n", 1329 | "Test.assertEquals(mapResult, FunctionalWrapper([0, 5, 10, 15, 20, 25, 30, 35, 40, 45]),\n", 1330 | " 'incorrect value for mapResult')\n", 1331 | "Test.assertEquals(filterResult, FunctionalWrapper([0, 2, 4, 6, 8]),\n", 1332 | " 'incorrect value for filterResult')\n", 1333 | "Test.assertEquals(reduceResult, 45, 'incorrect value for reduceResult')" 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "markdown", 1338 | "metadata": {}, 1339 | "source": [ 1340 | "#### ** (4f) Composability **\n", 1341 | "#### Since our methods for map and filter in the `FunctionalWrapper` class return `FunctionalWrapper` objects, we can compose (or chain) together our function calls. 
For example, `dataset.map(f1).filter(f2).reduce(f3)`, where `f1`, `f2`, and `f3` are functions or lambda expressions, first applies a map operation to `dataset`, then filters the result from map, and finally reduces the result from the first two operations.\n", 1342 | "#### Note that when we compose (chain) an operation, the output of one operation becomes the input for the next operation, and operations are applied from left to right. It's likely you've seen chaining used with Python strings. For example, `'Split this'.lower().split(' ')` first returns a new string object `'split this'` and then `split(' ')` is called on that string to produce `['split', 'this']`.\n", 1343 | "#### For this exercise, reuse your lambda expressions from (4e) but apply them to `dataset` in the sequence: map, filter, reduce. Note that since we are composing the operations our result will be different than in (4e). Also, we can write our operations on separate lines to improve readability." 1344 | ] 1345 | }, 1346 | { 1347 | "cell_type": "code", 1348 | "execution_count": 47, 1349 | "metadata": { 1350 | "collapsed": false 1351 | }, 1352 | "outputs": [ 1353 | { 1354 | "data": { 1355 | "text/plain": [ 1356 | "39916800" 1357 | ] 1358 | }, 1359 | "execution_count": 47, 1360 | "metadata": {}, 1361 | "output_type": "execute_result" 1362 | } 1363 | ], 1364 | "source": [ 1365 | "# Example of a multi-line expression statement\n", 1366 | "# Note that placing parentheses around the expression allows it to exist on multiple lines without\n", 1367 | "# causing a syntax error.\n", 1368 | "(dataset\n", 1369 | " .map(lambda x: x + 2)\n", 1370 | " .reduce(lambda x, y: x * y))" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "code", 1375 | "execution_count": null, 1376 | "metadata": { 1377 | "collapsed": false 1378 | }, 1379 | "outputs": [], 1380 | "source": [ 1381 | "'''\n", 1382 | "# TODO: Replace with appropriate code\n", 1383 | "# Multiply the elements in dataset by five, keep just the even values, and 
sum those values\n", 1384 | "finalSum = \n", 1385 | "print finalSum\n", 1386 | "'''" 1387 | ] 1388 | }, 1389 | { 1390 | "cell_type": "code", 1391 | "execution_count": null, 1392 | "metadata": { 1393 | "collapsed": false 1394 | }, 1395 | "outputs": [], 1396 | "source": [ 1397 | "# TEST Composability (4f)\n", 1398 | "Test.assertEquals(finalSum, 100, 'incorrect value for finalSum')" 1399 | ] 1400 | }, 1401 | { 1402 | "cell_type": "markdown", 1403 | "metadata": {}, 1404 | "source": [ 1405 | "### ** Part 5: CTR data download **" 1406 | ] 1407 | }, 1408 | { 1409 | "cell_type": "markdown", 1410 | "metadata": {}, 1411 | "source": [ 1412 | "#### Lab four will explore website click-through data provided by Criteo. To obtain the data, you must first accept Criteo's data sharing agreement. Below is the agreement from Criteo. After you accept the agreement, you can obtain the download URL by right-clicking on the \"Download Sample\" button and clicking \"Copy link address\" or \"Copy Link Location\", depending on your browser. Paste the URL into the `# TODO` cell below. The file is 8.4 MB compressed. The script below will download the file to the virtual machine (VM) and then extract the data.\n", 1413 | "#### If running the cell below does not render a webpage, open the [Criteo agreement](http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/) in a separate browser tab. After you accept the agreement, you can obtain the download URL by right-clicking on the \"Download Sample\" button and clicking \"Copy link address\" or \"Copy Link Location\", depending on your browser. Paste the URL into the `# TODO` cell below.\n", 1414 | "#### Note that the download could take a few minutes, depending upon your connection speed." 
1415 | ] 1416 | }, 1417 | { 1418 | "cell_type": "code", 1419 | "execution_count": 2, 1420 | "metadata": { 1421 | "collapsed": false 1422 | }, 1423 | "outputs": [ 1424 | { 1425 | "data": { 1426 | "text/html": [ 1427 | "\n", 1428 | " \n", 1435 | " " 1436 | ], 1437 | "text/plain": [ 1438 | "" 1439 | ] 1440 | }, 1441 | "execution_count": 2, 1442 | "metadata": {}, 1443 | "output_type": "execute_result" 1444 | } 1445 | ], 1446 | "source": [ 1447 | "# Run this code to view Criteo's agreement\n", 1448 | "# Note that some ad blocker software will prevent this IFrame from loading.\n", 1449 | "# If this happens, open the webpage in a separate tab and follow the instructions from above.\n", 1450 | "from IPython.lib.display import IFrame\n", 1451 | "\n", 1452 | "IFrame(\"http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/\",\n", 1453 | " 600, 350)" 1454 | ] 1455 | }, 1456 | { 1457 | "cell_type": "code", 1458 | "execution_count": 3, 1459 | "metadata": { 1460 | "collapsed": false 1461 | }, 1462 | "outputs": [ 1463 | { 1464 | "name": "stdout", 1465 | "output_type": "stream", 1466 | "text": [ 1467 | "Successfully extracted: dac_sample.txt\n" 1468 | ] 1469 | } 1470 | ], 1471 | "source": [ 1472 | "# TODO: Replace with appropriate code\n", 1473 | "# Just replace with the url for dac_sample.tar.gz\n", 1474 | "import glob\n", 1475 | "import os.path\n", 1476 | "import tarfile\n", 1477 | "import urllib\n", 1478 | "import urlparse\n", 1479 | "\n", 1480 | "# Paste url, url should end with: dac_sample.tar.gz\n", 1481 | "# url = ''\n", 1482 | "url = \"http://labs.criteo.com/wp-content/uploads/2015/04/dac_sample.tar.gz\"\n", 1483 | "\n", 1484 | "url = url.strip()\n", 1485 | "baseDir = os.path.join('../data')\n", 1486 | "inputPath = os.path.join('cs190', 'dac_sample.txt')\n", 1487 | "fileName = os.path.join(baseDir, inputPath)\n", 1488 | "inputDir = os.path.split(fileName)[0]\n", 1489 | "\n", 1490 | "def extractTar(check = False):\n", 1491 | " # Find the zipped 
archive and extract the dataset\n", 1492 | " tars = glob.glob('dac_sample*.tar.gz*')\n", 1493 | " if check and len(tars) == 0:\n", 1494 | " return False\n", 1495 | "\n", 1496 | " if len(tars) > 0:\n", 1497 | " try:\n", 1498 | " tarFile = tarfile.open(tars[0])\n", 1499 | " except tarfile.ReadError:\n", 1500 | " if not check:\n", 1501 | " print 'Unable to open tar.gz file. Check your URL.'\n", 1502 | " return False\n", 1503 | "\n", 1504 | " tarFile.extract('dac_sample.txt', path=inputDir)\n", 1505 | " print 'Successfully extracted: dac_sample.txt'\n", 1506 | " return True\n", 1507 | " else:\n", 1508 | " print 'You need to retry the download with the correct url.'\n", 1509 | " print ('Alternatively, you can upload the dac_sample.tar.gz file to your Jupyter root ' +\n", 1510 | " 'directory')\n", 1511 | " return False\n", 1512 | "\n", 1513 | "\n", 1514 | "if os.path.isfile(fileName):\n", 1515 | " print 'File is already available. Nothing to do.'\n", 1516 | "elif extractTar(check = True):\n", 1517 | " print 'tar.gz file was already available.'\n", 1518 | "elif not url.endswith('dac_sample.tar.gz'):\n", 1519 | " print 'Check your download url. 
Are you downloading the Sample dataset?'\n", 1520 | "else:\n", 1521 | " # Download the file and store it in the same directory as this notebook\n", 1522 | " try:\n", 1523 | " urllib.urlretrieve(url, os.path.basename(urlparse.urlsplit(url).path))\n", 1524 | " except IOError:\n", 1525 | " print 'Unable to download and store: {0}'.format(url)\n", 1526 | "\n", 1527 | " extractTar()" 1528 | ] 1529 | }, 1530 | { 1531 | "cell_type": "code", 1532 | "execution_count": 4, 1533 | "metadata": { 1534 | "collapsed": false 1535 | }, 1536 | "outputs": [ 1537 | { 1538 | "name": "stdout", 1539 | "output_type": "stream", 1540 | "text": [ 1541 | "[u'0,1,1,5,0,1382,4,15,2,181,1,2,,2,68fd1e64,80e26c9b,fb936136,7b4723c4,25c83c98,7e0ccccf,de7995b8,1f89b562,a73ee510,a8cd5504,b2cb9c98,37c9c164,2824a5f6,1adce6ef,8ba8b39a,891b62e7,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16']\n", 1542 | "100000\n", 1543 | "Criteo data loaded successfully!\n" 1544 | ] 1545 | } 1546 | ], 1547 | "source": [ 1548 | "import os.path\n", 1549 | "baseDir = os.path.join('../data')\n", 1550 | "inputPath = os.path.join('cs190', 'dac_sample.txt')\n", 1551 | "fileName = os.path.join(baseDir, inputPath)\n", 1552 | "\n", 1553 | "if os.path.isfile(fileName):\n", 1554 | " rawData = (sc\n", 1555 | " .textFile(fileName, 2)\n", 1556 | " .map(lambda x: x.replace('\\t', ','))) # work with either ',' or '\\t' separated data\n", 1557 | "\n", 1558 | "print rawData.take(1)\n", 1559 | "rawDataCount = rawData.count()\n", 1560 | "print rawDataCount\n", 1561 | "# This line tests that the correct number of observations have been loaded\n", 1562 | "assert rawDataCount == 100000, 'incorrect count for rawData'\n", 1563 | "if rawDataCount == 100000:\n", 1564 | " print 'Criteo data loaded successfully!'" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "code", 1569 | "execution_count": null, 1570 | "metadata": { 1571 | "collapsed": true 1572 | }, 1573 | "outputs": [], 1574 | "source": [] 1575 | } 1576 | ], 
1577 | "metadata": { 1578 | "kernelspec": { 1579 | "display_name": "Python 2", 1580 | "language": "python", 1581 | "name": "python2" 1582 | }, 1583 | "language_info": { 1584 | "codemirror_mode": { 1585 | "name": "ipython", 1586 | "version": 2 1587 | }, 1588 | "file_extension": ".py", 1589 | "mimetype": "text/x-python", 1590 | "name": "python", 1591 | "nbconvert_exporter": "python", 1592 | "pygments_lexer": "ipython2", 1593 | "version": "2.7.6" 1594 | } 1595 | }, 1596 | "nbformat": 4, 1597 | "nbformat_minor": 0 1598 | } 1599 | --------------------------------------------------------------------------------
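The notebook above targets Python 2 (its kernel metadata reports 2.7.6), and several of its idioms do not carry over to Python 3: tuple parameters in a lambda such as `lambda (x0, x1): ...` are a syntax error there, `reduce` is no longer a builtin, and `map` and `filter` return lazy iterators rather than lists. As a rough sketch only, assuming a Python 3 interpreter, the `FunctionalWrapper` idea and the chaining example from (4f) might be adapted as follows. The trimmed class and the `swap` and `product` names are illustrative assumptions, not part of the course material.

```python
from functools import reduce  # in Python 3, reduce lives in functools


class FunctionalWrapper(object):
    """Illustrative Python 3 adaptation of the notebook's wrapper (an assumption, trimmed)."""

    def __init__(self, data):
        # Materialize the input, because map() and filter() are lazy in Python 3
        self.data = list(data)

    def map(self, function):
        """Apply `function` to every item and wrap the result so calls can be chained."""
        return FunctionalWrapper(map(function, self.data))

    def filter(self, function):
        """Keep only the items for which `function` returns a truthy value."""
        return FunctionalWrapper(filter(function, self.data))

    def reduce(self, function):
        """Combine the items pairwise into a single value."""
        return reduce(function, self.data)

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.data == other.data

    def __repr__(self):
        return 'FunctionalWrapper({0})'.format(repr(self.data))


# Index into the tuple instead of unpacking it in the parameter list
swap = lambda x: (x[1], x[0])

dataset = FunctionalWrapper(range(10))

# Same chain as the notebook's multi-line example: add 2 to each element, then multiply
product = (dataset
           .map(lambda x: x + 2)
           .reduce(lambda x, y: x * y))

print(swap((1, 2)))  # (2, 1)
print(product)       # 39916800, matching the output recorded in the notebook
```

Materializing the data in `__init__` is one possible design choice: it keeps `repr` and equality checks working the way they do in the Python 2 version, at the cost of losing laziness.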