├── Part 1
│   ├── svm.png
│   ├── pca_pic.png
│   ├── pca_vector.png
│   └── curse_of_dimensionality.png
├── Part 3
│   ├── svm.png
│   ├── pca_pic.png
│   ├── pca_vector.png
│   └── curse_of_dimensionality.png
├── Part 5
│   └── test.db
├── Part 4
│   ├── assets
│   │   ├── spark_ecosystem.png
│   │   └── spark_execution.png
│   ├── spark-install
│   │   ├── images
│   │   │   └── spark-install-21.png
│   │   ├── setup_aws.md
│   │   ├── spark-install.sh
│   │   └── README.md
│   ├── data
│   │   └── AAPL.csv
│   ├── Section 3.3_3.4.ipynb
│   └── Section_3.1_3.2.ipynb
├── .gitignore
├── README.md
└── Part 6
    └── Section-5 .ipynb
/Part 1/svm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 1/svm.png
--------------------------------------------------------------------------------
/Part 3/svm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 3/svm.png
--------------------------------------------------------------------------------
/Part 5/test.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 5/test.db
--------------------------------------------------------------------------------
/Part 1/pca_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 1/pca_pic.png
--------------------------------------------------------------------------------
/Part 3/pca_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 3/pca_pic.png
--------------------------------------------------------------------------------
/Part 1/pca_vector.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 1/pca_vector.png
--------------------------------------------------------------------------------
/Part 3/pca_vector.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 3/pca_vector.png
--------------------------------------------------------------------------------
/Part 4/assets/spark_ecosystem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 4/assets/spark_ecosystem.png
--------------------------------------------------------------------------------
/Part 4/assets/spark_execution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 4/assets/spark_execution.png
--------------------------------------------------------------------------------
/Part 1/curse_of_dimensionality.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 1/curse_of_dimensionality.png
--------------------------------------------------------------------------------
/Part 3/curse_of_dimensionality.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 3/curse_of_dimensionality.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # OS generated files
2 | .DS_Store
3 | .DS_Store?
4 | ._*
5 | .Spotlight-V100
6 | .Trashes
7 | Icon?
8 | ehthumbs.db
9 | Thumbs.db
10 |
--------------------------------------------------------------------------------
/Part 4/spark-install/images/spark-install-21.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pirple/Data-Mining-With-Python/HEAD/Part 4/spark-install/images/spark-install-21.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Mining With Python
2 | > Code snippets for the "Data Mining With Python" course, available at Pirple.com
3 |
4 |
5 | ## This course is now available
6 | This course, like all others, is included with your pirple.com monthly membership. To join Pirple, visit our membership page here:
7 |
8 | [https://pirple.thinkific.com/pages/membership](https://pirple.thinkific.com/pages/membership)
9 |
--------------------------------------------------------------------------------
/Part 4/data/AAPL.csv:
--------------------------------------------------------------------------------
1 | Date,Open,High,Low,Close,Adj Close,Volume
2 | 2018-05-09,186.550003,187.399994,185.220001,187.360001,186.640305,23211200
3 | 2018-05-10,187.740005,190.369995,187.649994,190.039993,189.309998,27989300
4 | 2018-05-11,189.490005,190.059998,187.449997,188.589996,188.589996,26212200
5 | 2018-05-14,189.009995,189.529999,187.860001,188.149994,188.149994,20778800
6 | 2018-05-15,186.779999,187.070007,185.100006,186.440002,186.440002,23695200
7 | 2018-05-16,186.070007,188.460007,186.000000,188.179993,188.179993,19183100
8 | 2018-05-17,188.000000,188.910004,186.360001,186.990005,186.990005,17294000
9 | 2018-05-18,187.190002,187.809998,186.130005,186.309998,186.309998,18297700
10 | 2018-05-21,188.000000,189.270004,186.910004,187.630005,187.630005,18400800
11 | 2018-05-22,188.380005,188.880005,186.779999,187.160004,187.160004,15240700
12 | 2018-05-23,186.350006,188.500000,185.759995,188.360001,188.360001,19467900
13 | 2018-05-24,188.770004,188.839996,186.210007,188.149994,188.149994,20401000
14 | 2018-05-25,188.229996,189.649994,187.649994,188.580002,188.580002,17461000
15 | 2018-05-29,187.600006,188.750000,186.869995,187.899994,187.899994,22369000
16 | 2018-05-30,187.720001,188.000000,186.779999,187.500000,187.500000,18690500
17 | 2018-05-31,187.220001,188.229996,186.139999,186.869995,186.869995,27482800
18 | 2018-06-01,187.990005,190.259995,187.750000,190.240005,190.240005,23250400
19 | 2018-06-04,191.639999,193.419998,191.350006,191.830002,191.830002,26132000
20 | 2018-06-05,193.070007,193.940002,192.360001,193.309998,193.309998,21566000
21 | 2018-06-06,193.630005,194.080002,191.919998,193.979996,193.979996,20933600
22 | 2018-06-07,194.139999,194.199997,192.339996,193.460007,193.460007,21347200
23 | 2018-06-08,191.169998,192.000000,189.770004,191.699997,191.699997,26522000
24 |
--------------------------------------------------------------------------------
/Part 4/spark-install/setup_aws.md:
--------------------------------------------------------------------------------
1 | ## Setting Up Amazon Web Services (AWS)
2 |
3 | To use Spark on data that is too big to fit on our personal computers, we can use Amazon Web Services (AWS).
4 |
5 | **If you are setting up AWS for a Galvanize workshop:**
6 | It is essential to set up your AWS account *before* the workshop begins. Amazon may take several hours (up to 24 hours in extreme cases) to approve your new AWS account, so you may not be able to participate fully if you wait until the day of the event.
7 |
8 |
9 |
10 | ## Part 1: Create an AWS account
11 |
12 | Go to [http://aws.amazon.com/](http://aws.amazon.com/) and sign up:
13 |
14 | - You may sign in using your existing Amazon account or you can create a new account by selecting
15 | **Create a free account** using the button at the right, then selecting **I am a new user**.
16 |
17 | - Enter your contact information and confirm your acceptance of the AWS Customer Agreement.
18 |
19 | - Once you have created an Amazon Web Services account, you may need to accept a telephone call to verify your identity. If you don't have a mobile number, or don't want to give one, some people have used Google Voice successfully.
20 |
21 | - Once you have an account, go to [http://aws.amazon.com/](http://aws.amazon.com/) and sign in. You will work primarily from the AWS Management Console.
22 |
23 | ## Part 2: Verify your AWS account
24 |
25 | - Log into the [Amazon EC2 Console](https://console.aws.amazon.com/ec2) and click on the "Launch an Instance" button. If you are taken to **Step 1: Choose an Amazon Machine Image (AMI)** then your account is verified and ready to go. If not, please follow any additional verification steps indicated.
26 |
27 | ## Part 3: Notify Instructors
28 |
29 | - If you are setting up Spark to prepare for a Galvanize workshop, please notify your instructors in the workshop Slack channel (or by e-mail) when you have successfully created your AWS account. If you have any questions about completing the process, please reach out to us and we will help.
30 |
--------------------------------------------------------------------------------
/Part 4/spark-install/spark-install.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Run this by typing `bash spark-install.sh`
4 |
5 | echo "DOWNLOADING SPARK"
6 |
7 | # Specify your shell config file
8 | # Aliases will be appended to this file
9 | SHELL_PROFILE="$HOME/.bashrc"
10 |
11 | # Set the install location, $HOME is set by default
12 | SPARK_INSTALL_LOCATION=$HOME
13 |
14 | # Specify the URL to download Spark from
15 | SPARK_URL=http://apache.mirrors.tds.net/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
16 |
17 | # The Spark folder name should be the same as the name of the file being downloaded as specified in the SPARK_URL
18 | SPARK_FOLDER_NAME=spark-2.1.0-bin-hadoop2.7.tgz
19 |
20 | # Find the proper md5 hash from the Apache site
21 | SPARK_MD5=50e73f255f9bde50789ad5bd657c7a71
22 |
23 | # Print Disclaimer prior to running script
24 | echo "DISCLAIMER: This is an automated script for installing Spark but you should feel responsible for what you're doing!"
25 | echo "This script will install Spark to your home directory, modify your PATH, and add environment variables to your SHELL config file"
26 | read -r -p "Proceed? [y/N] " response
27 | if [[ ! $response =~ ^([yY][eE][sS]|[yY])$ ]]
28 | then
29 | echo "Aborting..."
30 | exit 1
31 | fi
32 |
33 | # Verify that $SHELL_PROFILE is pointing to the proper file
34 | read -r -p "Is $SHELL_PROFILE your shell profile? [y/N] " response
35 | if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]
36 | then
37 |     echo "All relevant aliases will be added to this file; THIS IS IMPORTANT!"
38 | read -r -p "To verify, please type in the name of your shell profile (just the file name): " response
39 | if [[ ! $response == $(basename $SHELL_PROFILE) ]]
40 | then
41 | echo "What you typed doesn't match $(basename $SHELL_PROFILE)!"
42 | echo "Please double check what shell profile you are using and alter spark-install accordingly!"
43 | exit 1
44 | fi
45 | else
46 | echo "Please alter the spark-install.sh script to specify the correct file"
47 | exit 1
48 | fi
49 |
50 | # Create scripts folder for storing jupyspark.sh and localsparksubmit.sh
51 | read -r -p "Would you like to create a scripts folder in your Home directory? [y/N] " response
52 | if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
53 | if [[ ! -d $HOME/scripts ]]
54 | then
55 | mkdir $HOME/scripts
56 | echo "export PATH=\$PATH:$HOME/scripts" >> $SHELL_PROFILE
57 | else
58 | echo "scripts folder already exists! Verify this folder has been added to your PATH"
59 | fi
60 | else
61 | if [[ ! -d $HOME/scripts ]]
62 | then
63 |         echo "Proceeding without installing jupyspark.sh and localsparksubmit.sh"
64 | fi
65 | fi
66 |
67 | if [[ -d $HOME/scripts ]];
68 | then
69 | # Create jupyspark.sh script in your scripts folder
70 | read -r -p "Would you like to create the jupyspark.sh script for launching a local jupyter spark server? [y/N] " response
71 | if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
72 | echo "#!/bin/bash
73 | export PYSPARK_DRIVER_PYTHON=jupyter
74 | export PYSPARK_DRIVER_PYTHON_OPTS=\"notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888\"
75 |
76 | \${SPARK_HOME}/bin/pyspark \
77 | --master local[4] \
78 | --executor-memory 1G \
79 | --driver-memory 1G \
80 | --conf spark.sql.warehouse.dir=\"file:///tmp/spark-warehouse\" \
81 | --packages com.databricks:spark-csv_2.11:1.5.0 \
82 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
83 | --packages org.apache.hadoop:hadoop-aws:2.7.3" > $HOME/scripts/jupyspark.sh
84 |
85 | chmod +x $HOME/scripts/jupyspark.sh
86 | fi
87 |
88 | # Create localsparksubmit.sh script in your scripts folder
89 |     read -r -p "Would you like to create the localsparksubmit.sh script for submitting local python scripts through spark-submit? [y/N] " response
90 | if [[ $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
91 | echo "#!/bin/bash
92 | \${SPARK_HOME}/bin/spark-submit \
93 | --master local[4] \
94 | --executor-memory 1G \
95 | --driver-memory 1G \
96 | --conf spark.sql.warehouse.dir=\"file:///tmp/spark-warehouse\" \
97 | --packages com.databricks:spark-csv_2.11:1.5.0 \
98 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
99 | --packages org.apache.hadoop:hadoop-aws:2.7.3 \
100 | \$1" > $HOME/scripts/localsparksubmit.sh
101 |
102 | chmod +x $HOME/scripts/localsparksubmit.sh
103 | fi
104 | fi
105 |
106 | # Check to see if JDK is installed
107 | javac -version 2> /dev/null
108 | if [ ! $? -eq 0 ]
109 | then
110 | # Install JDK
111 | if [[ $(uname -s) = "Darwin" ]]
112 | then
113 | echo "Downloading JDK..."
114 | brew install Caskroom/cask/java
115 | elif [[ $(uname -s) = "Linux" ]]
116 | then
117 | echo "Downloading JDK..."
118 | sudo add-apt-repository ppa:webupd8team/java
119 | sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
120 | sudo apt-get update
121 | sudo apt-get install oracle-java8-installer
122 | fi
123 | fi
124 |
125 | SUCCESSFUL_SPARK_INSTALL=0
126 | SPARK_INSTALL_TRY=0
127 |
128 | if [[ $(uname -s) = "Darwin" ]]
129 | then
130 | echo -e "\n\tDetected Mac OS X as the Operating System\n"
131 |
132 | while [ $SUCCESSFUL_SPARK_INSTALL -eq 0 ]
133 | do
134 | curl $SPARK_URL > $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
135 | # Check MD5 Hash
136 | if [[ $(openssl md5 $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/^.* //") == "$SPARK_MD5" ]]
137 | then
138 | # Unzip
139 | tar -xzf $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME -C $SPARK_INSTALL_LOCATION
140 | # Remove the compressed file
141 | rm $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
142 | # Install py4j
143 | pip install py4j
144 | SUCCESSFUL_SPARK_INSTALL=1
145 | else
146 | echo 'ERROR: Spark MD5 Hash does not match'
147 | echo "$(openssl md5 $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/^.* //") != $SPARK_MD5"
148 | if [ $SPARK_INSTALL_TRY -lt 3 ]
149 | then
150 | echo -e '\nTrying Spark Install Again...\n'
151 | SPARK_INSTALL_TRY=$[$SPARK_INSTALL_TRY+1]
152 | echo $SPARK_INSTALL_TRY
153 | else
154 | echo -e '\nSPARK INSTALL FAILED\n'
155 | echo -e 'Check the MD5 Hash and run again'
156 | exit 1
157 | fi
158 | fi
159 | done
160 | elif [[ $(uname -s) = "Linux" ]]
161 | then
162 | echo -e "\n\tDetected Linux as the Operating System\n"
163 |
164 | while [ $SUCCESSFUL_SPARK_INSTALL -eq 0 ]
165 | do
166 | curl $SPARK_URL > $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
167 | # Check MD5 Hash
168 | if [[ $(md5sum $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/ .*$//") == "$SPARK_MD5" ]]
169 | then
170 | # Unzip
171 | tar -xzf $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME -C $SPARK_INSTALL_LOCATION
172 | # Remove the compressed file
173 | rm $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME
174 | # Install py4j
175 | pip install py4j
176 | SUCCESSFUL_SPARK_INSTALL=1
177 | else
178 | echo 'ERROR: Spark MD5 Hash does not match'
179 | echo "$(md5sum $SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME | sed -e "s/ .*$//") != $SPARK_MD5"
180 | if [ $SPARK_INSTALL_TRY -lt 3 ]
181 | then
182 | echo -e '\nTrying Spark Install Again...\n'
183 | SPARK_INSTALL_TRY=$[$SPARK_INSTALL_TRY+1]
184 | echo $SPARK_INSTALL_TRY
185 | else
186 | echo -e '\nSPARK INSTALL FAILED\n'
187 | echo -e 'Check the MD5 Hash and run again'
188 | exit 1
189 | fi
190 | fi
191 | done
192 | else
193 | echo "Unable to detect Operating System"
194 | exit 1
195 | fi
196 |
197 | # Remove extension from spark folder name
198 | SPARK_FOLDER_NAME=$(echo $SPARK_FOLDER_NAME | sed -e "s/.tgz$//")
199 |
200 | echo "
201 | # Spark variables
202 | export SPARK_HOME=\"$SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME\"
203 | export PYTHONPATH=\"$SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME/python/:$PYTHONPATH\"
204 |
205 | # Spark 2
206 | export PYSPARK_DRIVER_PYTHON=ipython
207 | export PATH=\$SPARK_HOME/bin:\$PATH
208 | alias pyspark=\"$SPARK_INSTALL_LOCATION/$SPARK_FOLDER_NAME/bin/pyspark \
209 | --conf spark.sql.warehouse.dir='file:///tmp/spark-warehouse' \
210 | --packages com.databricks:spark-csv_2.11:1.5.0 \
211 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
212 | --packages org.apache.hadoop:hadoop-aws:2.7.3\"" >> $SHELL_PROFILE
213 |
214 | source $SHELL_PROFILE
215 |
216 | echo "INSTALL COMPLETE"
217 | echo "Please refer to Step 4 at https://github.com/zipfian/spark-install for testing your installation"
218 |
--------------------------------------------------------------------------------
/Part 4/spark-install/README.md:
--------------------------------------------------------------------------------
1 | ## Before doing anything [Requirements]
2 |
3 | **Step 1: AWS Account Setup** Before installing Spark on your computer, be sure to [set up an Amazon Web Services account](setup_aws.md). If you already have an AWS account, make sure that you can log into the [AWS Console](https://console.aws.amazon.com) with your username and password.
4 |
5 | * [AWS Setup Instructions](setup_aws.md)
6 |
7 | **Step 2: Software Installation** Before you dive into these installation instructions, you need to have some software installed. Here's a table of all the software you need to install, plus the online tutorials to do so.
8 |
9 | ### Requirements for Mac
10 |
11 | | Name | Description | Installation Guide |
12 | | :-- | :-- | :-- |
13 | | Brew | The package manager for macOS. Very helpful for this installation and in life in general. | [brew install](http://brew.sh/) |
14 | | Anaconda | A distribution of python, with packaged modules and libraries. **Note: we recommend installing Anaconda 2 (for python 2.7)** | [anaconda install](https://docs.continuum.io/anaconda/install) |
15 | | JDK 8 | Java Development Kit, used in both Hadoop and Spark. | just use `brew cask install java` |
16 |
17 | ### Requirements for Linux
18 |
19 | | Name | Description | Installation Guide |
20 | | :-- | :-- | :-- |
21 | | Anaconda | A distribution of python, with packaged modules and libraries. **Note: we recommend installing Anaconda 2 (for python 2.7)** | [anaconda install](https://docs.continuum.io/anaconda/install) |
22 | | JDK 8 | Java Development Kit, used in both Hadoop and Spark. | [install for linux](https://docs.oracle.com/javase/8/docs/technotes/guides/install/linux_jdk.html) |
23 |
24 |
25 |
26 | # 1. Spark / PySpark Installation
27 |
28 | We are going to install Spark+Hadoop. Use the Part that corresponds to your configuration:
29 | - 1.1. Installing Spark+Hadoop on Mac with no prior installation
30 | - 1.2. Installing Spark+Hadoop on Linux with no prior installation
31 | - 1.3. Use Spark+Hadoop from a prior installation
32 |
33 | We'll do most of these steps from the command line. So, open a terminal and jump in!
34 |
35 | **NOTE**: If you would prefer to jump right into using Spark, you can use the `spark-install.sh` script provided in this repo, which will automatically perform the installation and set the necessary environment variables for you. This script installs `spark-2.1.0-bin-hadoop2.7`.
36 |
37 | ## 1.1. Installing Spark+Hadoop on MAC with no prior installation (using brew)
38 |
39 | Be sure you have brew updated before starting: use `brew update` to update brew and brew packages to their latest versions.
40 |
41 | 1\. Use `brew install hadoop` to install Hadoop (version 2.7.3 as of Jan 2017)
42 |
43 | 2\. Check the hadoop installation directory by using the command:
44 |
45 | ```bash
46 | brew info hadoop
47 | ```
48 |
49 | 3\. Use `brew install apache-spark` to install Spark (version 2.1.0 as of Jan 2017)
50 |
51 | 4\. Check the installation directory by using the command:
52 |
53 | ```
54 | brew info apache-spark
55 | ```
56 |
57 | 5\. You're done! You can now go to Section 2 to set up your environment and run your Spark scripts.
58 |
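As a quick sanity check (this assumes brew linked the Spark binaries onto your `PATH`), you can ask Spark to print its version:

```bash
# Should report the Spark version that brew installed (2.1.0 as of Jan 2017)
spark-submit --version
```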
59 |
60 | ## 1.2. Installing Spark+Hadoop on Linux with no prior installation
61 |
62 | 1\. Go to the [Apache Spark Download page](http://spark.apache.org/downloads.html). Choose the latest Spark release (2.1.0) and the package type "Pre-built for Hadoop 2.7 and later". Click on the link "Download Spark" to get the `tgz` package of the latest Spark release. As of Jan 2017 this file was `spark-2.1.0-bin-hadoop2.7.tgz`, so we will use that name in the rest of this guide, but feel free to adapt it to your version.
63 |
64 | > 
65 |
66 | 2\. Uncompress that file into `/usr/local` by typing:
67 |
68 | ```
69 | sudo tar xvzf spark-2.1.0-bin-hadoop2.7.tgz -C /usr/local/
70 | ```
71 |
72 | 3\. Create a shorter symlink of the directory that was just created using:
73 |
74 | ```
75 | sudo ln -s /usr/local/spark-2.1.0-bin-hadoop2.7 /usr/local/spark
76 | ```
77 |
78 | 4\. Go to the [Apache Hadoop Download page](http://hadoop.apache.org/releases.html#Download). In the table on that page, click on the latest 2.x version (2.7.3 as of Nov 2016). Click to download the *binary* `tar.gz` archive, choose a mirror, and download the file onto your computer.
79 |
80 | 5\. Uncompress that file into `/usr/local` by typing:
81 |
82 | ```
83 | sudo tar xvzf /path_to_file/hadoop-2.7.3.tar.gz -C /usr/local/
84 | ```
85 |
86 | 6\. Create a shorter symlink of this directory using:
87 |
88 | ```
89 | sudo ln -s /usr/local/hadoop-2.7.3 /usr/local/hadoop
90 | ```
91 |
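As a quick sanity check (paths assume the versions used above), both symlinks should now resolve:

```bash
# Each symlink should point at its matching versioned directory
ls -l /usr/local/spark /usr/local/hadoop
```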
92 |
93 | ## 1.3. Using a prior installation of Spark+Hadoop
94 |
95 | We strongly recommend that you update your installation to the most recent version of Spark. As of Jan 2017 we used Spark 2.1.0 and Hadoop 2.7.3.
96 |
97 | If you want to use another version, all you have to do is locate your Spark and Hadoop installation directories and use them in Section 2.1 below when setting up your environment.
98 |
99 |
100 | # 2. Setting up your environment
101 |
102 | ## 2.1. Environment variables
103 |
104 | To run Spark scripts you have to set up your shell environment properly: setting environment variables, verifying your AWS credentials, etc.
105 |
106 | 1\. Edit your `~/.bash_profile` to add/edit the following lines depending on your configuration. This will set up the environment variables `SPARK_HOME` and `HADOOP_HOME` to point to the directories where Spark and Hadoop are installed.
107 |
108 | **For a Mac/Brew installation**, copy/paste the following lines into your `~/.bash_profile`:
109 | ```bash
110 | export SPARK_HOME=`brew info apache-spark | grep /usr | tail -n 1 | cut -f 1 -d " "`/libexec
111 | export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
112 |
113 | export HADOOP_HOME=`brew info hadoop | grep /usr | head -n 1 | cut -f 1 -d " "`/libexec
114 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
115 | ```
116 |
117 | **For the Linux installation described above**, copy/paste the following lines into your `~/.bash_profile`:
118 | ```bash
119 | export SPARK_HOME=/usr/local/spark
120 | export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
121 |
122 | export HADOOP_HOME=/usr/local/hadoop
123 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
124 | ```
125 |
126 | **For any other installation**, find the directories where your Spark and Hadoop installations live, adapt the following lines to your configuration, and put them into your `~/.bash_profile`:
127 | ```bash
128 | export SPARK_HOME=###COMPLETE HERE###
129 | export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
130 |
131 | export HADOOP_HOME=###COMPLETE HERE###
132 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
133 | ```
134 |
135 | While you're in `~/.bash_profile`, also set two environment variables for your AWS keys; we'll use them in the assignments. Be sure you have the following lines set up (with the actual values of your AWS credentials):
136 |
137 | ```bash
138 | export AWS_ACCESS_KEY_ID='put your access key here'
139 | export AWS_SECRET_ACCESS_KEY='put your secret access key here'
140 | ```
141 |
142 | **Note**: After any modification to your `.bash_profile`, run `source ~/.bash_profile` for your current terminal to take the changes into account. They will be picked up automatically the next time you open a new terminal.
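For example, a quick way to reload your profile and sanity-check the variables defined above:

```bash
# Reload the profile in the current terminal, then confirm the variables are set
source ~/.bash_profile
echo "$SPARK_HOME"
echo "$HADOOP_HOME"
```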
143 |
144 | ## 2.2. Python environment
145 |
146 | 1\. Back to the command line, install py4j using `pip install py4j`.
147 |
148 | 2\. To check that everything is OK, start an `ipython` console and type `import pyspark`. The import does nothing visible, and that's fine: **if it did not throw any error, then you are good to go.**
149 |
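Here is a minimal version of that check (assuming the `PYTHONPATH` export from Section 2.1 is active in your shell):

```python
# Run inside ipython: a clean import means pyspark was found on your PYTHONPATH
import pyspark
print(pyspark.__version__)  # e.g. '2.1.0'
```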
150 |
151 | # 3. How to run Spark python scripts
152 |
153 | ## 3.1. How to run Spark/Python from a Jupyter Notebook
154 |
155 | Running Spark from a jupyter notebook can require you to launch jupyter with a specific setup so that it connects seamlessly with the Spark Driver. We recommend you create a shell script `jupyspark.sh` designed specifically for doing that.
156 |
157 | 1\. Create a file called `jupyspark.sh` somewhere under your `$PATH`, or in a directory of your liking (I usually use a `scripts/` directory under my home directory). In this file, you'll copy/paste the following lines:
158 |
159 | ```bash
160 | #!/bin/bash
161 | export PYSPARK_DRIVER_PYTHON=jupyter
162 | export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888"
163 |
164 | ${SPARK_HOME}/bin/pyspark \
165 | --master local[4] \
166 | --executor-memory 1G \
167 | --driver-memory 1G \
168 | --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
169 | --packages com.databricks:spark-csv_2.11:1.5.0 \
170 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
171 | --packages org.apache.hadoop:hadoop-aws:2.7.3
172 | ```
173 |
174 |
175 | Save the file. Make it executable by doing `chmod 711 jupyspark.sh`. Now, whenever you want to launch a Spark-enabled jupyter notebook, run this script by typing `jupyspark.sh` in your terminal.
176 |
177 | Here's how to read that script... Basically, we are going to use `pyspark` (an executable from your Spark installation) to run `jupyter` with a proper Spark context.
178 |
179 | The first two lines:
180 | ```bash
181 | export PYSPARK_DRIVER_PYTHON=jupyter
182 | export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=True --NotebookApp.ip='localhost' --NotebookApp.port=8888"
183 | ```
184 | will set up two environment variables for `pyspark` to execute `jupyter`.
185 |
186 | Note: If you are installing Spark on a Virtual Machine and would like to access jupyter from your host browser, you should set the NotebookApp.ip flag to `--NotebookApp.ip='0.0.0.0'` so that your VM's jupyter server will accept external connections. You can then access jupyter notebook from the host machine on port 8888.
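For example, the corresponding export line in `jupyspark.sh` would then read (only the ip flag changes; this assumes port 8888 is reachable from the host):

```bash
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=True --NotebookApp.ip='0.0.0.0' --NotebookApp.port=8888"
```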
187 |
188 | The next line:
189 | ```bash
190 | ${SPARK_HOME}/bin/pyspark \
191 | ```
192 | is the start of a long multiline command that runs `pyspark` with all the necessary packages and options.
193 |
194 | The next 3 lines:
195 | ```bash
196 | --master local[4] \
197 | --executor-memory 1G \
198 | --driver-memory 1G \
199 | ```
200 | set the options for `pyspark` to execute locally, using 4 cores of your computer, and configure the memory available to the Spark driver and executor.
201 |
202 | The next line:
203 | ```bash
204 | --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
205 | ```
206 | sets the directory that Spark SQL uses as its warehouse. This is not strictly necessary, but it resolves a common error when loading and processing S3 data into Spark DataFrames.
207 |
208 | The final 3 lines:
209 | ```bash
210 | --packages com.databricks:spark-csv_2.11:1.5.0 \
211 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
212 | --packages org.apache.hadoop:hadoop-aws:2.7.3
213 | ```
214 | add specific packages for `pyspark` to load. These packages are necessary to access AWS S3 buckets from Spark/Python and to read CSV files.
215 |
216 | **Note**: You can adapt these parameters to your own liking. See Spark page on [Submitting applications](http://spark.apache.org/docs/latest/submitting-applications.html) to tune these parameters.
217 |
218 | 2\. Now run this script. It will open a notebook home page in your browser. From there, create a new notebook and copy/paste the following commands into your notebook:
219 |
220 | ```python
221 | import pyspark as ps
222 |
223 | spark = ps.sql.SparkSession.builder.getOrCreate()
224 | ```
225 |
226 | These lines connect to the Spark driver by creating a new [`SparkSession` instance](http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/SparkSession.html).
227 |
228 | After this point, you can use `spark` as a single entry point for [reading files and doing Spark things](https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html).
229 |
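As a small illustration (assuming you launched the notebook from the `Part 4` directory of this repo, so the relative path below resolves), you can use `spark` to read the course's AAPL data:

```python
# Read a CSV into a Spark DataFrame using the SparkSession created above
df = spark.read.csv("data/AAPL.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
```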
230 |
231 | ## 3.2. How to run Spark/Python from the command line via `spark-submit`
232 |
233 | If you want to run your Python script (using Spark) from the command line instead of from a jupyter notebook, you will need an executable from the Spark suite called `spark-submit`. Again, this executable requires some options, which we propose putting into a script you can reuse whenever you need to launch a Spark-based Python script.
234 |
235 |
236 | 1\. Create a script called `localsparksubmit.sh` and put it somewhere handy. Copy/paste the following content into the file:
237 |
238 | ```bash
239 | #!/bin/bash
240 | ${SPARK_HOME}/bin/spark-submit \
241 | --master local[4] \
242 | --executor-memory 1G \
243 | --driver-memory 1G \
244 | --conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
245 | --packages com.databricks:spark-csv_2.11:1.5.0 \
246 | --packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
247 | --packages org.apache.hadoop:hadoop-aws:2.7.3 \
248 | $@
249 | ```
250 |
251 | See section 3.1 above for an explanation of these values. The final line, `$@`, means that whatever you pass as arguments to the `localsparksubmit.sh` script will be appended as the last arguments of this command.
252 |
253 | 2\. Whenever you want to run your script (called for instance `script.py`), you would do it by typing `localsparksubmit.sh script.py` from the command line. Make sure you put `localsparksubmit.sh` somewhere under your `$PATH`, or in a directory of your liking.
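If you are wondering what such a script might contain, here is a minimal sketch (the file name `script.py` is just a placeholder for your own PySpark script):

```python
# script.py -- a tiny PySpark job suitable for spark-submit
import pyspark as ps

spark = ps.sql.SparkSession.builder.appName("demo").getOrCreate()
print(spark.range(100).count())  # should print 100
spark.stop()
```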
254 |
255 | **Note**: You can adapt these parameters to your own setup. See Spark page on [Submitting applications](http://spark.apache.org/docs/latest/submitting-applications.html) to tune these parameters.
256 |
257 |
258 | # 4. Testing your installation
259 |
260 | 1\. Open a new jupyter notebook (from the `jupyspark.sh` script provided above) and paste the following code:
261 |
262 | ```python
263 | import pyspark as ps
264 | import random
265 |
266 | spark = ps.sql.SparkSession.builder \
267 | .appName("rdd test") \
268 | .getOrCreate()
269 |
270 | random.seed(1)
271 |
272 | def sample(p):
273 | x, y = random.random(), random.random()
274 | return 1 if x*x + y*y < 1 else 0
275 |
276 | count = spark.sparkContext.parallelize(range(0, 10000000)).map(sample) \
277 | .reduce(lambda a, b: a + b)
278 |
279 | print("Pi is (very) roughly {}".format(4.0 * count / 10000000))
280 | ```
281 |
282 | It should output the following result:
283 |
284 | ```
285 | Pi is (very) roughly 3.141317
286 | ```
287 |
288 | 2\. Create a Python script called `testspark.py` and paste the same lines above into it. Run this script from the command line using `localsparksubmit.sh testspark.py`. It should output the same result as above.
289 |
--------------------------------------------------------------------------------
/Part 4/Section 3.3_3.4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Section 3.3"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Spark-ML Objectives\n",
15 | "\n",
16 | "At the end of this lecture you should be able to:\n",
17 | "\n",
18 | "1. Chain spark dataframe methods together to do data munging.\n",
19 | "2. Be able to describe the Spark-ML API, and recognize differences to sk-learn.\n",
20 | "3. Chain Spark-ML Transformers and Estimators together to compose ML pipelines."
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "# Let's design chains of transformations together!"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 31,
33 | "metadata": {
34 | "collapsed": true
35 | },
36 | "outputs": [],
37 | "source": [
38 | "import pyspark.sql.functions as F\n",
39 | "import pyspark as ps\n",
40 | "from pyspark import SQLContext \n",
41 | "\n",
42 | "spark = ps.sql.SparkSession.builder \\\n",
43 | " .master('local[2]') \\\n",
44 | " .appName('spark-ml') \\\n",
45 | " .getOrCreate()\n",
46 | "\n",
47 | "sc = spark.sparkContext"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 32,
53 | "metadata": {
54 | "collapsed": true
55 | },
56 | "outputs": [],
57 | "source": [
58 | "sqlContext = SQLContext(sc)"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "## Find the date on which AAPL's closing stock price was the highest\n",
66 | "\n",
67 | "### Input DataFrame"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 33,
73 | "metadata": {},
74 | "outputs": [
75 | {
76 | "name": "stdout",
77 | "output_type": "stream",
78 | "text": [
79 | "+-------------------+----------+----------+----------+----------+----------+--------+\n",
80 | "| Date| Open| High| Low| Close| Adj Close| Volume|\n",
81 | "+-------------------+----------+----------+----------+----------+----------+--------+\n",
82 | "|2018-05-09 00:00:00|186.550003|187.399994|185.220001|187.360001|186.640305|23211200|\n",
83 | "|2018-05-10 00:00:00|187.740005|190.369995|187.649994|190.039993|189.309998|27989300|\n",
84 | "|2018-05-11 00:00:00|189.490005|190.059998|187.449997|188.589996|188.589996|26212200|\n",
85 | "|2018-05-14 00:00:00|189.009995|189.529999|187.860001|188.149994|188.149994|20778800|\n",
86 | "|2018-05-15 00:00:00|186.779999|187.070007|185.100006|186.440002|186.440002|23695200|\n",
87 | "+-------------------+----------+----------+----------+----------+----------+--------+\n",
88 | "only showing top 5 rows\n",
89 | "\n"
90 | ]
91 | }
92 | ],
93 | "source": [
94 | "# read CSV\n",
95 |     "df_aapl = sqlContext.read.csv('data/AAPL.csv',\n",
96 | " header=True, # use headers or not\n",
97 | " quote='\"', # char for quotes\n",
98 | " sep=\",\", # char for separation\n",
99 | " inferSchema=True) # do we infer schema or not ?\n",
100 | "\n",
101 | "df_aapl.show(5) #df.head(2)"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 34,
107 | "metadata": {},
108 | "outputs": [
109 | {
110 | "data": {
111 | "text/plain": [
112 | "StructType(List(StructField(Date,TimestampType,true),StructField(Open,DoubleType,true),StructField(High,DoubleType,true),StructField(Low,DoubleType,true),StructField(Close,DoubleType,true),StructField(Adj Close,DoubleType,true),StructField(Volume,IntegerType,true)))"
113 | ]
114 | },
115 | "execution_count": 34,
116 | "metadata": {},
117 | "output_type": "execute_result"
118 | }
119 | ],
120 | "source": [
121 | "df_aapl.schema #df.info()"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "### Task\n",
129 | "\n",
130 | "Now, design a pipeline that will:\n",
131 | "\n",
132 | "1. Keep only fields for Date and Close\n",
133 | "2. Order by Close in descending order\n",
134 | "\n",
135 | "### Code"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 35,
141 | "metadata": {
142 | "scrolled": true
143 | },
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "+-------------------+----------+\n",
150 | "| Date| Close|\n",
151 | "+-------------------+----------+\n",
152 | "|2018-06-06 00:00:00|193.979996|\n",
153 | "|2018-06-07 00:00:00|193.460007|\n",
154 | "|2018-06-05 00:00:00|193.309998|\n",
155 | "|2018-06-04 00:00:00|191.830002|\n",
156 | "|2018-06-08 00:00:00|191.699997|\n",
157 | "+-------------------+----------+\n",
158 | "only showing top 5 rows\n",
159 | "\n"
160 | ]
161 | }
162 | ],
163 | "source": [
164 | "df_out = df_aapl.select('Date', 'Close').orderBy('Close', ascending=False)\n",
165 | "\n",
166 | "df_out.show(5)"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "### Solution\n",
174 | "\n",
175 | "\n",
176 |     "df_out.select(\"Close\", \"Date\").orderBy(df_aapl.Close, ascending=False).show(5)\n",
177 | ""
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "# Supervised Machine Learning on DataFrames\n",
185 | "\n",
186 | "http://spark.apache.org/docs/latest/ml-features.html\n",
187 | "\n",
188 | "### What is the difference between df_aapl and df_vector after running the code below?"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 18,
194 | "metadata": {},
195 | "outputs": [
196 | {
197 | "name": "stdout",
198 | "output_type": "stream",
199 | "text": [
200 | "+-------------------+----------+----------+----------+----------+----------+--------+\n",
201 | "| Date| Open| High| Low| Close| Adj Close| Volume|\n",
202 | "+-------------------+----------+----------+----------+----------+----------+--------+\n",
203 | "|2018-05-09 00:00:00|186.550003|187.399994|185.220001|187.360001|186.640305|23211200|\n",
204 | "|2018-05-10 00:00:00|187.740005|190.369995|187.649994|190.039993|189.309998|27989300|\n",
205 | "|2018-05-11 00:00:00|189.490005|190.059998|187.449997|188.589996|188.589996|26212200|\n",
206 | "|2018-05-14 00:00:00|189.009995|189.529999|187.860001|188.149994|188.149994|20778800|\n",
207 | "|2018-05-15 00:00:00|186.779999|187.070007|185.100006|186.440002|186.440002|23695200|\n",
208 | "+-------------------+----------+----------+----------+----------+----------+--------+\n",
209 | "only showing top 5 rows\n",
210 | "\n",
211 | "+-------------------+----------+----------+----------+----------+----------+--------+------------+\n",
212 | "| Date| Open| High| Low| Close| Adj Close| Volume| Features|\n",
213 | "+-------------------+----------+----------+----------+----------+----------+--------+------------+\n",
214 | "|2018-05-09 00:00:00|186.550003|187.399994|185.220001|187.360001|186.640305|23211200|[187.360001]|\n",
215 | "|2018-05-10 00:00:00|187.740005|190.369995|187.649994|190.039993|189.309998|27989300|[190.039993]|\n",
216 | "|2018-05-11 00:00:00|189.490005|190.059998|187.449997|188.589996|188.589996|26212200|[188.589996]|\n",
217 | "|2018-05-14 00:00:00|189.009995|189.529999|187.860001|188.149994|188.149994|20778800|[188.149994]|\n",
218 | "|2018-05-15 00:00:00|186.779999|187.070007|185.100006|186.440002|186.440002|23695200|[186.440002]|\n",
219 | "+-------------------+----------+----------+----------+----------+----------+--------+------------+\n",
220 | "only showing top 5 rows\n",
221 | "\n"
222 | ]
223 | }
224 | ],
225 | "source": [
226 | "from pyspark.ml.feature import MinMaxScaler, VectorAssembler\n",
227 | "\n",
228 | "# assemble values in a vector\n",
229 | "vectorAssembler = VectorAssembler(inputCols=[\"Close\"], outputCol=\"Features\")\n",
230 | "\n",
231 | "\n",
232 | "df_vector = vectorAssembler.transform(df_aapl)\n",
233 | "df_aapl.show(5)\n",
234 | "\n",
235 | "df_vector.show(5)"
236 | ]
237 | },
238 | {
239 | "cell_type": "markdown",
240 | "metadata": {},
241 | "source": [
242 | "Gotta have the column be a vector."
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 36,
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "name": "stdout",
252 | "output_type": "stream",
253 | "text": [
254 | "+------------+--------------------+\n",
255 | "| Features| Scaled Features|\n",
256 | "+------------+--------------------+\n",
257 | "|[187.360001]|[0.13689742813492...|\n",
258 | "|[190.039993]|[0.48630977478742...|\n",
259 | "|[188.589996]|[0.29726187673060...|\n",
260 | "|[188.149994]|[0.23989523856459...|\n",
261 | "|[186.440002]|[0.01694967847449...|\n",
262 | "|[188.179993]|[0.24380645210076...|\n",
263 | "|[186.990005]|[0.08865804137106...|\n",
264 | "|[186.309998]| [0.0]|\n",
265 | "|[187.630005]|[0.17210004487615...|\n",
266 | "|[187.160004]|[0.11082219317397...|\n",
267 | "+------------+--------------------+\n",
268 | "only showing top 10 rows\n",
269 | "\n"
270 | ]
271 | }
272 | ],
273 | "source": [
274 | "scaler = MinMaxScaler(inputCol=\"Features\", outputCol=\"Scaled Features\")\n",
275 | "\n",
276 | "# Compute summary statistics and generate MinMaxScalerModel\n",
277 | "scaler_model = scaler.fit(df_vector)\n",
278 | "\n",
279 | "# rescale each feature to range [min, max].\n",
280 | "scaled_data = scaler_model.transform(df_vector)\n",
281 | "scaled_data.select(\"Features\", \"Scaled Features\").show(10)"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": null,
287 | "metadata": {
288 | "collapsed": true
289 | },
290 | "outputs": [],
291 | "source": []
292 | },
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "## Section 3.4 "
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "# Transformers\n",
305 | "\n",
306 | "The `VectorAssembler` class above is an example of a generic type in Spark, called a [Transformer](http://spark.apache.org/docs/latest/ml-pipeline.html#transformers). Important things to know about this type:\n",
307 | "\n",
308 | "* They implement a `transform` method.\n",
309 | "* They convert one `DataFrame` into another, usually by adding columns.\n",
310 | "\n",
311 | "Examples of Transformers: [`VectorAssembler`](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler), [`Tokenizer`](http://spark.apache.org/docs/latest/ml-features.html#tokenizer), [`StopWordsRemover`](http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover), and [many more](http://spark.apache.org/docs/latest/ml-features.html).\n",
312 | "\n"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "# Estimators\n",
320 | "\n",
321 | "According to the docs: \"An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data\". Important things to know about this type:\n",
322 | "\n",
323 | "* They implement a `fit` method whose argument is a `DataFrame`.\n",
324 | "* The output of `fit` is another type called `Model`, which is a `Transformer`.\n",
325 | "\n",
326 | "Examples of Estimators: [`LogisticRegression`](http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression), [`DecisionTreeRegressor`](http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression), and [many more](http://spark.apache.org/docs/latest/ml-classification-regression.html).\n"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "# Pipelines\n",
334 | "\n",
335 |     "Many Data Science workflows can be described as the sequential application of various `Transformers` and `Estimators`.\n",
336 | "\n",
337 | ""
338 | ]
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "Let's see two ways to implement the above flow!"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": 21,
350 | "metadata": {
351 | "collapsed": true
352 | },
353 | "outputs": [],
354 | "source": [
355 | "from pyspark.ml import Pipeline\n",
356 | "from pyspark.ml.classification import LogisticRegression\n",
357 | "from pyspark.ml.feature import RegexTokenizer, HashingTF\n",
358 | "\n",
359 | "# Prepare training documents from a list of (id, text, label) tuples.\n",
360 | "training = spark.createDataFrame([\n",
361 | " (0, \"spark is like hadoop mapreduce\", 1.0),\n",
362 | " (1, \"sparks light fire!!!\", 0.0),\n",
363 | " (2, \"elephants like simba\", 0.0),\n",
364 | " (3, \"hadoop is an elephant\", 1.0),\n",
365 | " (4, \"hadoop mapreduce\", 1.0)\n",
366 | "], [\"id\", \"text\", \"label\"])"
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": 22,
372 | "metadata": {
373 | "collapsed": true
374 | },
375 | "outputs": [],
376 | "source": [
377 | "regexTokenizer = RegexTokenizer(inputCol=\"text\", outputCol=\"tokens\", pattern=\"\\\\W\")\n",
378 | "hashingTF = HashingTF(inputCol=\"tokens\", outputCol=\"features\")\n",
379 | "lr = LogisticRegression(maxIter=10, regParam=0.001)\n",
380 | "\n",
381 | "tokens = regexTokenizer.transform(training)\n",
382 | "hashes = hashingTF.transform(tokens)\n",
383 | "logistic_model = lr.fit(hashes) # Uses columns named features/label by default"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": 23,
389 | "metadata": {},
390 | "outputs": [
391 | {
392 | "name": "stdout",
393 | "output_type": "stream",
394 | "text": [
395 | "+------------------+----------+--------------------+\n",
396 | "| text|prediction| probability|\n",
397 | "+------------------+----------+--------------------+\n",
398 | "| simba has a spark| 0.0|[0.78819302361551...|\n",
399 | "| hadoop| 1.0|[0.02995590605364...|\n",
400 | "|mapreduce in spark| 1.0|[0.02401898451752...|\n",
401 | "| apache hadoop| 1.0|[0.02995590605364...|\n",
402 | "+------------------+----------+--------------------+\n",
403 | "\n"
404 | ]
405 | }
406 | ],
407 | "source": [
408 | "# Prepare test documents, which are unlabeled (id, text) tuples.\n",
409 | "test = spark.createDataFrame([\n",
410 | " (5, \"simba has a spark\"),\n",
411 | " (6, \"hadoop\"),\n",
412 | " (7, \"mapreduce in spark\"),\n",
413 | " (8, \"apache hadoop\")\n",
414 | "], [\"id\", \"text\"])\n",
415 | "\n",
416 | "# What do we need to do to this to get a prediction?\n",
417 | "preds = logistic_model.transform(hashingTF.transform(regexTokenizer.transform(test)))\n",
418 | "preds.select('text', 'prediction', 'probability').show()"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {},
424 | "source": [
425 | "## Alternatively"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": 24,
431 | "metadata": {
432 | "collapsed": true
433 | },
434 | "outputs": [],
435 | "source": [
436 | "# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.\n",
437 | "regexTokenizer = RegexTokenizer(inputCol=\"text\", outputCol=\"tokens\", pattern=\"\\\\W\")\n",
438 | "hashingTF = HashingTF(inputCol=\"tokens\", outputCol=\"features\")\n",
439 | "lr = LogisticRegression(maxIter=10, regParam=0.001)\n",
440 | "pipeline = Pipeline(stages=[regexTokenizer, hashingTF, lr])\n",
441 | "\n",
442 | "# Fit the pipeline to training documents.\n",
443 | "model = pipeline.fit(training)"
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": 25,
449 | "metadata": {},
450 | "outputs": [
451 | {
452 | "name": "stdout",
453 | "output_type": "stream",
454 | "text": [
455 | "+------------------+----------+--------------------+\n",
456 | "| text|prediction| probability|\n",
457 | "+------------------+----------+--------------------+\n",
458 | "| simba has a spark| 0.0|[0.78819302361551...|\n",
459 | "| hadoop| 1.0|[0.02995590605364...|\n",
460 | "|mapreduce in spark| 1.0|[0.02401898451752...|\n",
461 | "| apache hadoop| 1.0|[0.02995590605364...|\n",
462 | "+------------------+----------+--------------------+\n",
463 | "\n"
464 | ]
465 | }
466 | ],
467 | "source": [
468 | "#How can we test this against our training data?\n",
469 | "prediction = model.transform(test)\n",
470 | "prediction.select(['text', 'prediction', 'probability']).show()"
471 | ]
472 | },
473 | {
474 | "cell_type": "code",
475 | "execution_count": null,
476 | "metadata": {
477 | "collapsed": true
478 | },
479 | "outputs": [],
480 | "source": []
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": null,
485 | "metadata": {
486 | "collapsed": true
487 | },
488 | "outputs": [],
489 | "source": []
490 | }
491 | ],
492 | "metadata": {
493 | "kernelspec": {
494 | "display_name": "Python 2",
495 | "language": "python",
496 | "name": "python2"
497 | },
498 | "language_info": {
499 | "codemirror_mode": {
500 | "name": "ipython",
501 | "version": 2
502 | },
503 | "file_extension": ".py",
504 | "mimetype": "text/x-python",
505 | "name": "python",
506 | "nbconvert_exporter": "python",
507 | "pygments_lexer": "ipython2",
508 | "version": "2.7.13"
509 | }
510 | },
511 | "nbformat": 4,
512 | "nbformat_minor": 1
513 | }
514 |
--------------------------------------------------------------------------------
/Part 6/Section-5 .ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": true,
7 | "slideshow": {
8 | "slide_type": "skip"
9 | }
10 | },
11 | "source": [
12 |     "# Section 5\n",
13 | "# NLP (Natural language processing)"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "## Section 5.1: Key Concepts, text data cleaning\n",
21 | "## Section 5.2: Count Vectorizer, TFIDF \n",
22 | "## Section 5.3: Example with Spam data \n",
23 | "## Section 5.4: Tweak model with Spam data\n",
24 | "## Section 5.5: Pipeline with Spam data "
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "## Section 5.1: Key Concepts, text data cleaning"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 127,
37 | "metadata": {
38 | "collapsed": true
39 | },
40 | "outputs": [],
41 | "source": [
42 | "import numpy as np\n",
43 | "from collections import Counter\n",
44 | "import pandas as pd\n",
45 | "import nltk\n",
46 | "from nltk.tokenize import word_tokenize\n",
47 | "from nltk.stem.wordnet import WordNetLemmatizer \n",
48 | "from nltk.stem import SnowballStemmer\n",
49 | "import string\n",
50 | "from scipy.spatial.distance import pdist, squareform\n",
51 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer \n",
52 | "from sklearn.metrics.pairwise import cosine_similarity\n",
53 | "from sklearn.linear_model import LogisticRegression\n",
54 | "from sklearn.cross_validation import train_test_split\n",
55 | "from sklearn.metrics import confusion_matrix\n",
56 | "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier\n",
57 | "from sklearn.svm import SVC \n",
58 | "from sklearn.tree import DecisionTreeClassifier\n",
59 | "from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 66,
65 | "metadata": {
66 | "collapsed": true,
67 | "slideshow": {
68 | "slide_type": "skip"
69 | }
70 | },
71 | "outputs": [],
72 | "source": [
73 | "stops = set(nltk.corpus.stopwords.words('english'))"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 68,
79 | "metadata": {
80 | "collapsed": true
81 | },
82 | "outputs": [],
83 | "source": [
84 | "#stops"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 3,
90 | "metadata": {
91 | "collapsed": true,
92 | "slideshow": {
93 | "slide_type": "skip"
94 | }
95 | },
96 | "outputs": [],
97 | "source": [
98 | "corpus = [\"Jeff stole my octopus sandwich.\", \n",
99 | " \"'Help!' I sobbed, sandwichlessly.\", \n",
100 | " \"'Drop the sandwiches!' said the sandwich police.\"]"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {
106 | "slideshow": {
107 | "slide_type": "slide"
108 | }
109 | },
110 | "source": [
111 | "# How do I turn a corpus of documents into a feature matrix? \n",
112 | "# Words --> numbers?????"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {
118 | "slideshow": {
119 | "slide_type": "slide"
120 | }
121 | },
122 | "source": [
123 | "## Corpus: list of documents\n",
124 | "\n",
125 | "```\n",
126 | " [\n",
127 | " \"Jeff stole my octopus sandwich.\", \n",
128 | " \"'Help!' I sobbed, sandwichlessly.\", \n",
129 | " \"'Drop the sandwiches!' said the sandwich police.\"\n",
130 | " ]```"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 60,
136 | "metadata": {
137 | "collapsed": true,
138 | "slideshow": {
139 | "slide_type": "skip"
140 | }
141 | },
142 | "outputs": [],
143 | "source": [
144 | "def our_tokenizer(doc, stops=None, stemmer=None):\n",
145 | " doc = word_tokenize(doc.lower())\n",
146 | " tokens = [''.join([char for char in tok if char not in string.punctuation]) for tok in doc]\n",
147 | " tokens = [tok for tok in tokens if tok]\n",
148 | " if stops:\n",
149 | " tokens = [tok for tok in tokens if (tok not in stops)]\n",
150 | " if stemmer:\n",
151 | " tokens = [stemmer.stem(tok) for tok in tokens]\n",
152 | " return tokens"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 5,
158 | "metadata": {
159 | "slideshow": {
160 | "slide_type": "skip"
161 | }
162 | },
163 | "outputs": [
164 | {
165 | "data": {
166 | "text/plain": [
167 | "[['jeff', 'stole', 'my', 'octopus', 'sandwich'],\n",
168 | " ['help', 'i', 'sobbed', 'sandwichlessly'],\n",
169 | " ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']]"
170 | ]
171 | },
172 | "execution_count": 5,
173 | "metadata": {},
174 | "output_type": "execute_result"
175 | }
176 | ],
177 | "source": [
178 | "tokenized_docs = [our_tokenizer(doc) for doc in corpus]\n",
179 | "tokenized_docs"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {
185 | "slideshow": {
186 | "slide_type": "slide"
187 | }
188 | },
189 | "source": [
190 |     "## Step 1: lowercase, lose punctuation, split into tokens\n",
191 | "```\n",
192 | "[\n",
193 | " ['jeff', 'stole', 'my', 'octopus', 'sandwich'],\n",
194 | " ['help', 'i', 'sobbed', 'sandwichlessly'],\n",
195 | " ['drop', 'the', 'sandwiches', 'said', 'the', 'sandwich', 'police']\n",
196 | "]\n",
197 | "```"
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": 6,
203 | "metadata": {
204 | "collapsed": true,
205 | "slideshow": {
206 | "slide_type": "skip"
207 | }
208 | },
209 | "outputs": [],
210 | "source": [
211 | "stopwords = set(nltk.corpus.stopwords.words('english')) "
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 57,
217 | "metadata": {},
218 | "outputs": [
219 | {
220 | "data": {
221 | "text/plain": [
222 | "True"
223 | ]
224 | },
225 | "execution_count": 57,
226 | "metadata": {},
227 | "output_type": "execute_result"
228 | }
229 | ],
230 | "source": [
231 | "'i' in stopwords "
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 7,
237 | "metadata": {
238 | "slideshow": {
239 | "slide_type": "skip"
240 | }
241 | },
242 | "outputs": [
243 | {
244 | "data": {
245 | "text/plain": [
246 | "[['jeff', 'stole', 'octopus', 'sandwich'],\n",
247 | " ['help', 'sobbed', 'sandwichlessly'],\n",
248 | " ['drop', 'sandwiches', 'said', 'sandwich', 'police']]"
249 | ]
250 | },
251 | "execution_count": 7,
252 | "metadata": {},
253 | "output_type": "execute_result"
254 | }
255 | ],
256 | "source": [
257 | "tokenized_docs = [our_tokenizer(doc, stops=stopwords) for doc in corpus]\n",
258 | "tokenized_docs"
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {
264 | "slideshow": {
265 | "slide_type": "slide"
266 | }
267 | },
268 | "source": [
269 | "## Step 2: remove stop words\n",
270 | "```\n",
271 | "[\n",
272 | " ['jeff', 'stole', 'octopus', 'sandwich'],\n",
273 | " ['help', 'sobbed', 'sandwichlessly'],\n",
274 | " ['drop', 'sandwiches', 'said', 'sandwich', 'police']\n",
275 | "]\n",
276 | "```\n"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": 64,
282 | "metadata": {
283 | "slideshow": {
284 | "slide_type": "skip"
285 | }
286 | },
287 | "outputs": [
288 | {
289 | "data": {
290 | "text/plain": [
291 | "[[u'jeff', u'stole', u'octopus', u'sandwich'],\n",
292 | " [u'help', u'sob', u'sandwichless'],\n",
293 | " [u'drop', u'sandwich', u'said', u'sandwich', u'polic']]"
294 | ]
295 | },
296 | "execution_count": 64,
297 | "metadata": {},
298 | "output_type": "execute_result"
299 | }
300 | ],
301 | "source": [
302 | "tokenized_docs = [our_tokenizer(doc, stops=stopwords, stemmer=SnowballStemmer('english')) for doc in corpus]\n",
303 | "tokenized_docs"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {
309 | "slideshow": {
310 | "slide_type": "slide"
311 | }
312 | },
313 | "source": [
314 | "## Step 3: Stemming/Lemmatization\n",
315 | "```\n",
316 | "[\n",
317 | " ['jeff', 'stole', 'octopus', 'sandwich'],\n",
318 | " ['help', 'sobbed', 'sandwichlessly'],\n",
319 | " ['drop', u'sandwich', 'said', 'sandwich', 'police']\n",
320 | "]\n",
321 | "```\n",
322 | "### OK now what?"
323 | ]
324 | },
325 | {
326 | "cell_type": "markdown",
327 | "metadata": {
328 | "slideshow": {
329 | "slide_type": "fragment"
330 | }
331 | },
332 | "source": [
333 | "Vocabulary:\n",
334 | "```\n",
335 | "['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']\n",
336 | "\n",
337 | "```"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": 65,
343 | "metadata": {
344 | "collapsed": true,
345 | "slideshow": {
346 | "slide_type": "skip"
347 | }
348 | },
349 | "outputs": [],
350 | "source": [
351 | "vocab_set = set()"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": 10,
357 | "metadata": {
358 | "collapsed": true,
359 | "slideshow": {
360 | "slide_type": "skip"
361 | }
362 | },
363 | "outputs": [],
364 | "source": [
365 | "for doc in tokenized_docs:\n",
366 | " vocab_set.update(doc)"
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": 11,
372 | "metadata": {
373 | "slideshow": {
374 | "slide_type": "skip"
375 | }
376 | },
377 | "outputs": [
378 | {
379 | "name": "stdout",
380 | "output_type": "stream",
381 | "text": [
382 | "['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']\n"
383 | ]
384 | }
385 | ],
386 | "source": [
387 | "vocab = sorted(list(vocab_set))\n",
388 | "print vocab"
389 | ]
390 | },
391 | {
392 | "cell_type": "markdown",
393 | "metadata": {},
394 | "source": [
395 | "## Section 5.2: Count Vectorizer, TFIDF "
396 | ]
397 | },
398 | {
399 | "cell_type": "markdown",
400 | "metadata": {
401 | "slideshow": {
402 | "slide_type": "slide"
403 | }
404 | },
405 | "source": [
406 | "# Count vectorization"
407 | ]
408 | },
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {
412 | "slideshow": {
413 | "slide_type": "fragment"
414 | }
415 | },
416 | "source": [
417 | "Vocabulary:\n",
418 | "```\n",
419 | "['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']\n",
420 | "\n",
421 | "```"
422 | ]
423 | },
424 | {
425 | "cell_type": "markdown",
426 | "metadata": {
427 | "slideshow": {
428 | "slide_type": "fragment"
429 | }
430 | },
431 | "source": [
432 | "```\n",
433 | "['jeff', 'stole', 'octopus', 'sandwich']\n",
434 | "[0, 0, 1, 1, 0, 0, 1, 0, 0, 1]\n",
435 | "\n",
436 | "['help', 'sobbed', 'sandwichlessly']\n",
437 | "[0, 1, 0, 0, 0, 0, 0, 1, 1, 0]\n",
438 | "\n",
439 | "['drop', u'sandwich', 'said', 'sandwich', 'police']\n",
440 | "[1, 0, 0, 0, 1, 1, 2, 0, 0, 0]\n",
441 | "```"
442 | ]
443 | },
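{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"A minimal sketch of how these count vectors can be built by hand, assuming `tokenized_docs` holds the token lists from Step 3 and `vocab` is the sorted vocabulary above:\n",
"```python\n",
"# one count vector per document: counts[i] = how often vocab[i] appears in the document\n",
"def count_vectorize(tokens, vocab):\n",
"    return [tokens.count(word) for word in vocab]\n",
"\n",
"count_vectors = [count_vectorize(doc, vocab) for doc in tokenized_docs]\n",
"# e.g. ['drop', 'sandwich', 'said', 'sandwich', 'police'] -> [1, 0, 0, 0, 1, 1, 2, 0, 0, 0]\n",
"```"
]
},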
444 | {
445 | "cell_type": "markdown",
446 | "metadata": {
447 | "slideshow": {
448 | "slide_type": "slide"
449 | }
450 | },
451 | "source": [
452 | "## Term frequency\n",
453 | "$$TF_{word,document} = \\frac{\\#\\_of\\_times\\_word\\_appears\\_in\\_document}{total\\_\\#\\_of\\_words\\_in\\_document}$$"
454 | ]
455 | },
456 | {
457 | "cell_type": "markdown",
458 | "metadata": {
459 | "slideshow": {
460 | "slide_type": "fragment"
461 | }
462 | },
463 | "source": [
464 | "```\n",
465 | "['jeff', 'stole', 'octopus', 'sandwich']\n",
466 | "[0, 0, 1/4, 1/4, 0, 0, 1/4, 0, 0, 1/4]\n",
467 | "\n",
468 | "['help', 'sobbed', 'sandwichlessly']\n",
469 | "[0, 1/3, 0, 0, 0, 0, 0, 1/3, 1/3, 0]\n",
470 | "\n",
471 | "['drop', u'sandwich', 'said', 'sandwich', 'police']\n",
472 | "[1/5, 0, 0, 0, 1/5, 1/5, 2/5, 0, 0, 0]\n",
473 | "```"
474 | ]
475 | },
476 | {
477 | "cell_type": "markdown",
478 | "metadata": {
479 | "slideshow": {
480 | "slide_type": "slide"
481 | }
482 | },
483 | "source": [
484 | "## Document frequency\n",
485 | "$$ DF_{word} = \\frac{\\#\\_of\\_documents\\_containing\\_word}{total\\_\\#\\_of\\_documents} $$"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {
491 | "slideshow": {
492 | "slide_type": "fragment"
493 | }
494 | },
495 | "source": [
496 | "Vocabulary:\n",
497 | "```\n",
498 | "['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']\n",
499 | "```\n",
500 | "\n",
501 | "Document frequency for each word:\n",
502 | "```\n",
503 | "[1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 2/3, 1/3, 1/3, 1/3]\n",
504 | "```"
505 | ]
506 | },
507 | {
508 | "cell_type": "markdown",
509 | "metadata": {
510 | "slideshow": {
511 | "slide_type": "slide"
512 | }
513 | },
514 | "source": [
515 | "## Inverse document frequency\n",
516 | "$$ IDF_{word} = \\log\\left(\\frac{total\\_\\#\\_of\\_documents}{\\#\\_of\\_documents\\_containing\\_word}\\right) $$"
517 | ]
518 | },
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {
522 | "slideshow": {
523 | "slide_type": "fragment"
524 | }
525 | },
526 | "source": [
527 | "Vocabulary:\n",
528 | "```\n",
529 | "['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']\n",
530 | "```\n",
531 | "\n",
532 | "IDF for each word:\n",
533 | "```\n",
534 | "[1.099, 1.099, 1.099, 1.099, 1.099, 1.099, 0.405, 1.099, 1.099, 1.099]\n",
535 | "```"
536 | ]
537 | },
538 | {
539 | "cell_type": "markdown",
540 | "metadata": {
541 | "slideshow": {
542 | "slide_type": "slide"
543 | }
544 | },
545 | "source": [
546 | "# TFIDF\n",
547 | "\n",
548 | "Vocabulary:\n",
549 | "```\n",
550 | "['drop', 'help', 'jeff', 'octopus', 'police', 'said', 'sandwich', 'sandwichlessly', 'sobbed', 'stole']\n",
551 | "```\n",
552 | "TF * IDF:"
553 | ]
554 | },
555 | {
556 | "cell_type": "markdown",
557 | "metadata": {
558 | "slideshow": {
559 | "slide_type": "fragment"
560 | }
561 | },
562 | "source": [
563 | "```\n",
564 | "['jeff', 'stole', 'octopus', 'sandwich']\n",
565 | "[0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275]\n",
566 | "\n",
567 | "['help', 'sobbed', 'sandwichlessly']\n",
568 | "[0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0]\n",
569 | "\n",
570 | "['drop', u'sandwich', 'said', 'sandwich', 'police']\n",
571 | "[0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]\n",
572 | "```"
573 | ]
574 | },
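{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"A minimal sketch that reproduces the TF, IDF and TF*IDF numbers above, assuming `count_vectors` is the hypothetical list of count vectors from the earlier sketch (natural log, matching the IDF values shown):\n",
"```python\n",
"import math\n",
"\n",
"def tf(count_vector):\n",
"    total = float(sum(count_vector))\n",
"    return [c / total for c in count_vector]\n",
"\n",
"def idf(count_vectors):\n",
"    n_docs = float(len(count_vectors))\n",
"    n_terms = len(count_vectors[0])\n",
"    doc_freq = [sum(1 for vec in count_vectors if vec[i] > 0) for i in range(n_terms)]\n",
"    return [math.log(n_docs / d) for d in doc_freq]\n",
"\n",
"idf_vals = idf(count_vectors)\n",
"tfidf_vectors = [[t * i for t, i in zip(tf(vec), idf_vals)] for vec in count_vectors]\n",
"# first document: [0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275] (rounded)\n",
"```"
]
},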
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {
578 | "slideshow": {
579 | "slide_type": "slide"
580 | }
581 | },
582 | "source": [
583 | "Now that we have turned our DOCUMENTS into VECTORS, we can put them into whatever machine learning algorithm we want! We can use whatever kind of similarity measure we please!"
584 | ]
585 | },
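{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"For example, the cosine similarity used in the next two cells (scikit-learn's `cosine_similarity`) is just the dot product of two vectors divided by the product of their norms; a minimal numpy sketch:\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cos_sim(u, v):\n",
"    u, v = np.array(u), np.array(v)\n",
"    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
"\n",
"# documents 1 and 3 only share 'sandwich', so their similarity is small (about 0.08)\n",
"```"
]
},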
586 | {
587 | "cell_type": "markdown",
588 | "metadata": {
589 | "slideshow": {
590 | "slide_type": "slide"
591 | }
592 | },
593 | "source": [
594 | "Wow!"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 15,
600 | "metadata": {},
601 | "outputs": [
602 | {
603 | "data": {
604 | "text/plain": [
605 | "array([[ 1. , 0.08115802],\n",
606 | " [ 0.08115802, 1. ]])"
607 | ]
608 | },
609 | "execution_count": 15,
610 | "metadata": {},
611 | "output_type": "execute_result"
612 | }
613 | ],
614 | "source": [
615 | "cosine_similarity([[0, 0, 0.275, 0.275, 0, 0, 0.101, 0, 0, 0.275], [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]])"
616 | ]
617 | },
618 | {
619 | "cell_type": "code",
620 | "execution_count": 70,
621 | "metadata": {},
622 | "outputs": [
623 | {
624 | "data": {
625 | "text/plain": [
626 | "array([[ 1., 0.],\n",
627 | " [ 0., 1.]])"
628 | ]
629 | },
630 | "execution_count": 70,
631 | "metadata": {},
632 | "output_type": "execute_result"
633 | }
634 | ],
635 | "source": [
636 | "cosine_similarity([[0, 0.366, 0, 0, 0, 0, 0, 0.366, 0.366, 0], [0.22, 0, 0, 0, 0.22, 0.22, 0.162, 0, 0, 0]])"
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "## Section 5.3: Example with Spam data "
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": null,
649 | "metadata": {
650 | "collapsed": true
651 | },
652 | "outputs": [],
653 | "source": [
654 | "#revisit spam ham example "
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": 71,
660 | "metadata": {
661 | "collapsed": true
662 | },
663 | "outputs": [],
664 | "source": [
665 | "df= pd.read_table('data/SMSSpamCollection', header=None)"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 72,
671 | "metadata": {},
672 | "outputs": [
673 | {
674 | "data": {
675 | "text/html": [
676 | "
\n",
677 | "\n",
690 | "
\n",
691 | " \n",
692 | " \n",
693 | " | \n",
694 | " 0 | \n",
695 | " 1 | \n",
696 | "
\n",
697 | " \n",
698 | " \n",
699 | " \n",
700 | " | 0 | \n",
701 | " ham | \n",
702 | " Go until jurong point, crazy.. Available only ... | \n",
703 | "
\n",
704 | " \n",
705 | " | 1 | \n",
706 | " ham | \n",
707 | " Ok lar... Joking wif u oni... | \n",
708 | "
\n",
709 | " \n",
710 | " | 2 | \n",
711 | " spam | \n",
712 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
713 | "
\n",
714 | " \n",
715 | "
\n",
716 | "
"
717 | ],
718 | "text/plain": [
719 | " 0 1\n",
720 | "0 ham Go until jurong point, crazy.. Available only ...\n",
721 | "1 ham Ok lar... Joking wif u oni...\n",
722 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina..."
723 | ]
724 | },
725 | "execution_count": 72,
726 | "metadata": {},
727 | "output_type": "execute_result"
728 | }
729 | ],
730 | "source": [
731 | "df.head(3)"
732 | ]
733 | },
734 | {
735 | "cell_type": "code",
736 | "execution_count": 73,
737 | "metadata": {
738 | "collapsed": true
739 | },
740 | "outputs": [],
741 | "source": [
742 | "df.columns=['spam', 'msg']"
743 | ]
744 | },
745 | {
746 | "cell_type": "code",
747 | "execution_count": 74,
748 | "metadata": {},
749 | "outputs": [
750 | {
751 | "data": {
752 | "text/html": [
753 | "\n",
754 | "\n",
767 | "
\n",
768 | " \n",
769 | " \n",
770 | " | \n",
771 | " spam | \n",
772 | " msg | \n",
773 | "
\n",
774 | " \n",
775 | " \n",
776 | " \n",
777 | " | 0 | \n",
778 | " ham | \n",
779 | " Go until jurong point, crazy.. Available only ... | \n",
780 | "
\n",
781 | " \n",
782 | " | 1 | \n",
783 | " ham | \n",
784 | " Ok lar... Joking wif u oni... | \n",
785 | "
\n",
786 | " \n",
787 | "
\n",
788 | "
"
789 | ],
790 | "text/plain": [
791 | " spam msg\n",
792 | "0 ham Go until jurong point, crazy.. Available only ...\n",
793 | "1 ham Ok lar... Joking wif u oni..."
794 | ]
795 | },
796 | "execution_count": 74,
797 | "metadata": {},
798 | "output_type": "execute_result"
799 | }
800 | ],
801 | "source": [
802 | "df.head(2)"
803 | ]
804 | },
805 | {
806 | "cell_type": "code",
807 | "execution_count": 75,
808 | "metadata": {
809 | "collapsed": true
810 | },
811 | "outputs": [],
812 | "source": [
813 | "stopwords_set=set(stopwords)\n",
814 | "\n",
815 | "punctuation_set=set(string.punctuation)"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": 77,
821 | "metadata": {},
822 | "outputs": [
823 | {
824 | "data": {
825 | "text/plain": [
826 | "179"
827 | ]
828 | },
829 | "execution_count": 77,
830 | "metadata": {},
831 | "output_type": "execute_result"
832 | }
833 | ],
834 | "source": [
835 | "len(stopwords_set)"
836 | ]
837 | },
838 | {
839 | "cell_type": "code",
840 | "execution_count": 79,
841 | "metadata": {},
842 | "outputs": [
843 | {
844 | "data": {
845 | "text/plain": [
846 | "32"
847 | ]
848 | },
849 | "execution_count": 79,
850 | "metadata": {},
851 | "output_type": "execute_result"
852 | }
853 | ],
854 | "source": [
855 | "len(punctuation_set)"
856 | ]
857 | },
858 | {
859 | "cell_type": "code",
860 | "execution_count": 84,
861 | "metadata": {
862 | "collapsed": true
863 | },
864 | "outputs": [],
865 | "source": [
866 | "df['msg_cleaned']= df.msg.apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords_set \\\n",
867 | " and word not in punctuation_set]))"
868 | ]
869 | },
870 | {
871 | "cell_type": "code",
872 | "execution_count": 83,
873 | "metadata": {},
874 | "outputs": [
875 | {
876 | "data": {
877 | "text/plain": [
878 | "'Go until jurong point, crazy'"
879 | ]
880 | },
881 | "execution_count": 83,
882 | "metadata": {},
883 | "output_type": "execute_result"
884 | }
885 | ],
886 | "source": [
887 | "str1='Go until jurong point, crazy'.split()\n",
888 | "' '.join(str1)"
889 | ]
890 | },
891 | {
892 | "cell_type": "code",
893 | "execution_count": 85,
894 | "metadata": {},
895 | "outputs": [
896 | {
897 | "data": {
898 | "text/html": [
899 | "\n",
900 | "\n",
913 | "
\n",
914 | " \n",
915 | " \n",
916 | " | \n",
917 | " spam | \n",
918 | " msg | \n",
919 | " msg_cleaned | \n",
920 | "
\n",
921 | " \n",
922 | " \n",
923 | " \n",
924 | " | 0 | \n",
925 | " ham | \n",
926 | " Go until jurong point, crazy.. Available only ... | \n",
927 | " Go jurong point, crazy.. Available bugis n gre... | \n",
928 | "
\n",
929 | " \n",
930 | " | 1 | \n",
931 | " ham | \n",
932 | " Ok lar... Joking wif u oni... | \n",
933 | " Ok lar... Joking wif u oni... | \n",
934 | "
\n",
935 | " \n",
936 | "
\n",
937 | "
"
938 | ],
939 | "text/plain": [
940 | " spam msg \\\n",
941 | "0 ham Go until jurong point, crazy.. Available only ... \n",
942 | "1 ham Ok lar... Joking wif u oni... \n",
943 | "\n",
944 | " msg_cleaned \n",
945 | "0 Go jurong point, crazy.. Available bugis n gre... \n",
946 | "1 Ok lar... Joking wif u oni... "
947 | ]
948 | },
949 | "execution_count": 85,
950 | "metadata": {},
951 | "output_type": "execute_result"
952 | }
953 | ],
954 | "source": [
955 | "df.head(2)"
956 | ]
957 | },
958 | {
959 | "cell_type": "code",
960 | "execution_count": 86,
961 | "metadata": {
962 | "collapsed": true
963 | },
964 | "outputs": [],
965 | "source": [
966 | "df['msg_cleaned']= df.msg_cleaned.str.lower() "
967 | ]
968 | },
969 | {
970 | "cell_type": "code",
971 | "execution_count": 87,
972 | "metadata": {},
973 | "outputs": [
974 | {
975 | "data": {
976 | "text/html": [
977 | "\n",
978 | "\n",
991 | "
\n",
992 | " \n",
993 | " \n",
994 | " | \n",
995 | " spam | \n",
996 | " msg | \n",
997 | " msg_cleaned | \n",
998 | "
\n",
999 | " \n",
1000 | " \n",
1001 | " \n",
1002 | " | 0 | \n",
1003 | " ham | \n",
1004 | " Go until jurong point, crazy.. Available only ... | \n",
1005 | " go jurong point, crazy.. available bugis n gre... | \n",
1006 | "
\n",
1007 | " \n",
1008 | " | 1 | \n",
1009 | " ham | \n",
1010 | " Ok lar... Joking wif u oni... | \n",
1011 | " ok lar... joking wif u oni... | \n",
1012 | "
\n",
1013 | " \n",
1014 | "
\n",
1015 | "
"
1016 | ],
1017 | "text/plain": [
1018 | " spam msg \\\n",
1019 | "0 ham Go until jurong point, crazy.. Available only ... \n",
1020 | "1 ham Ok lar... Joking wif u oni... \n",
1021 | "\n",
1022 | " msg_cleaned \n",
1023 | "0 go jurong point, crazy.. available bugis n gre... \n",
1024 | "1 ok lar... joking wif u oni... "
1025 | ]
1026 | },
1027 | "execution_count": 87,
1028 | "metadata": {},
1029 | "output_type": "execute_result"
1030 | }
1031 | ],
1032 | "source": [
1033 | "df.head(2)"
1034 | ]
1035 | },
1036 | {
1037 | "cell_type": "code",
1038 | "execution_count": 143,
1039 | "metadata": {
1040 | "collapsed": true
1041 | },
1042 | "outputs": [],
1043 | "source": [
1044 | "count_vect= CountVectorizer()"
1045 | ]
1046 | },
1047 | {
1048 | "cell_type": "code",
1049 | "execution_count": 144,
1050 | "metadata": {
1051 | "collapsed": true
1052 | },
1053 | "outputs": [],
1054 | "source": [
1055 | "X= count_vect.fit_transform(df.msg_cleaned) "
1056 | ]
1057 | },
1058 | {
1059 | "cell_type": "code",
1060 | "execution_count": 145,
1061 | "metadata": {},
1062 | "outputs": [
1063 | {
1064 | "data": {
1065 | "text/plain": [
1066 | "(5572, 8703)"
1067 | ]
1068 | },
1069 | "execution_count": 145,
1070 | "metadata": {},
1071 | "output_type": "execute_result"
1072 | }
1073 | ],
1074 | "source": [
1075 | "X.shape"
1076 | ]
1077 | },
1078 | {
1079 | "cell_type": "code",
1080 | "execution_count": 146,
1081 | "metadata": {
1082 | "collapsed": true
1083 | },
1084 | "outputs": [],
1085 | "source": [
1086 | "y=df.spam"
1087 | ]
1088 | },
1089 | {
1090 | "cell_type": "code",
1091 | "execution_count": 147,
1092 | "metadata": {
1093 | "collapsed": true
1094 | },
1095 | "outputs": [],
1096 | "source": [
1097 | "X_train, X_test, y_train, y_test= train_test_split(X,y)"
1098 | ]
1099 | },
1100 | {
1101 | "cell_type": "code",
1102 | "execution_count": 148,
1103 | "metadata": {},
1104 | "outputs": [
1105 | {
1106 | "data": {
1107 | "text/plain": [
1108 | "0.98277099784637478"
1109 | ]
1110 | },
1111 | "execution_count": 148,
1112 | "metadata": {},
1113 | "output_type": "execute_result"
1114 | }
1115 | ],
1116 | "source": [
1117 | "lg= LogisticRegression()\n",
1118 | "\n",
1119 | "lg.fit(X_train,y_train)\n",
1120 | "y_pred=lg.predict(X_test)\n",
1121 | "lg.score(X_test,y_test)"
1122 | ]
1123 | },
1124 | {
1125 | "cell_type": "code",
1126 | "execution_count": 149,
1127 | "metadata": {},
1128 | "outputs": [
1129 | {
1130 | "data": {
1131 | "text/plain": [
1132 | "array([[1207, 1],\n",
1133 | " [ 23, 162]])"
1134 | ]
1135 | },
1136 | "execution_count": 149,
1137 | "metadata": {},
1138 | "output_type": "execute_result"
1139 | }
1140 | ],
1141 | "source": [
1142 | "confusion_matrix(y_test, y_pred)"
1143 | ]
1144 | },
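{
"cell_type": "markdown",
"metadata": {},
"source": [
"In scikit-learn's `confusion_matrix` the rows are the true labels and the columns are the predictions, with labels sorted alphabetically, so row/column 0 is `ham` and row/column 1 is `spam`: 1207 ham messages were classified correctly, 1 ham was flagged as spam, 23 spam messages slipped through as ham, and 162 spam were caught. For per-class precision and recall, a quick sketch (assuming the usual scikit-learn import) is:\n",
"```python\n",
"from sklearn.metrics import classification_report\n",
"\n",
"print classification_report(y_test, y_pred)\n",
"```"
]
},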
1145 | {
1146 | "cell_type": "code",
1147 | "execution_count": 94,
1148 | "metadata": {},
1149 | "outputs": [
1150 | {
1151 | "data": {
1152 | "text/plain": [
1153 | "array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)"
1154 | ]
1155 | },
1156 | "execution_count": 94,
1157 | "metadata": {},
1158 | "output_type": "execute_result"
1159 | }
1160 | ],
1161 | "source": [
1162 | "y_pred"
1163 | ]
1164 | },
1165 | {
1166 | "cell_type": "markdown",
1167 | "metadata": {
1168 | "collapsed": true
1169 | },
1170 | "source": [
1171 | "## Section 5.4: Tweak model with Spam data "
1172 | ]
1173 | },
1174 | {
1175 | "cell_type": "code",
1176 | "execution_count": 164,
1177 | "metadata": {
1178 | "collapsed": true
1179 | },
1180 | "outputs": [],
1181 | "source": [
1182 | "## try tfidf \n",
1183 | "\n",
1184 | "tfidf= TfidfVectorizer() "
1185 | ]
1186 | },
1187 | {
1188 | "cell_type": "code",
1189 | "execution_count": 165,
1190 | "metadata": {},
1191 | "outputs": [
1192 | {
1193 | "data": {
1194 | "text/html": [
1195 | "\n",
1196 | "\n",
1209 | "
\n",
1210 | " \n",
1211 | " \n",
1212 | " | \n",
1213 | " spam | \n",
1214 | " msg | \n",
1215 | " msg_cleaned | \n",
1216 | " spam_num | \n",
1217 | "
\n",
1218 | " \n",
1219 | " \n",
1220 | " \n",
1221 | " | 0 | \n",
1222 | " ham | \n",
1223 | " Go until jurong point, crazy.. Available only ... | \n",
1224 | " go jurong point, crazy.. available bugis n gre... | \n",
1225 | " 0 | \n",
1226 | "
\n",
1227 | " \n",
1228 | " | 1 | \n",
1229 | " ham | \n",
1230 | " Ok lar... Joking wif u oni... | \n",
1231 | " ok lar... joking wif u oni... | \n",
1232 | " 0 | \n",
1233 | "
\n",
1234 | " \n",
1235 | "
\n",
1236 | "
"
1237 | ],
1238 | "text/plain": [
1239 | " spam msg \\\n",
1240 | "0 ham Go until jurong point, crazy.. Available only ... \n",
1241 | "1 ham Ok lar... Joking wif u oni... \n",
1242 | "\n",
1243 | " msg_cleaned spam_num \n",
1244 | "0 go jurong point, crazy.. available bugis n gre... 0 \n",
1245 | "1 ok lar... joking wif u oni... 0 "
1246 | ]
1247 | },
1248 | "execution_count": 165,
1249 | "metadata": {},
1250 | "output_type": "execute_result"
1251 | }
1252 | ],
1253 | "source": [
1254 | "df.head(2)"
1255 | ]
1256 | },
1257 | {
1258 | "cell_type": "code",
1259 | "execution_count": 152,
1260 | "metadata": {
1261 | "collapsed": true
1262 | },
1263 | "outputs": [],
1264 | "source": [
1265 | "X= tfidf.fit_transform(df.msg_cleaned)\n",
1266 | "y=df.spam \n",
1267 | "X_train, X_test, y_train, y_test= train_test_split(X,y) "
1268 | ]
1269 | },
1270 | {
1271 | "cell_type": "code",
1272 | "execution_count": 153,
1273 | "metadata": {},
1274 | "outputs": [
1275 | {
1276 | "data": {
1277 | "text/plain": [
1278 | "0.97415649676956206"
1279 | ]
1280 | },
1281 | "execution_count": 153,
1282 | "metadata": {},
1283 | "output_type": "execute_result"
1284 | }
1285 | ],
1286 | "source": [
1287 | "## try random forest \n",
1288 | "rf= RandomForestClassifier()\n",
1289 | "rf.fit(X_train,y_train)\n",
1290 | "y_pred=rf.predict(X_test)\n",
1291 | "rf.score(X_test,y_test)"
1292 | ]
1293 | },
1294 | {
1295 | "cell_type": "code",
1296 | "execution_count": 154,
1297 | "metadata": {},
1298 | "outputs": [
1299 | {
1300 | "data": {
1301 | "text/plain": [
1302 | "array([[1222, 2],\n",
1303 | " [ 34, 135]])"
1304 | ]
1305 | },
1306 | "execution_count": 154,
1307 | "metadata": {},
1308 | "output_type": "execute_result"
1309 | }
1310 | ],
1311 | "source": [
1312 | "confusion_matrix(y_test, y_pred) "
1313 | ]
1314 | },
1315 | {
1316 | "cell_type": "code",
1317 | "execution_count": 155,
1318 | "metadata": {},
1319 | "outputs": [
1320 | {
1321 | "data": {
1322 | "text/plain": [
1323 | "0.97343862167982775"
1324 | ]
1325 | },
1326 | "execution_count": 155,
1327 | "metadata": {},
1328 | "output_type": "execute_result"
1329 | }
1330 | ],
1331 | "source": [
1332 | "#try gradient boost \n",
1333 | "gb= GradientBoostingClassifier()\n",
1334 | "gb.fit(X_train,y_train)\n",
1335 | "y_pred=gb.predict(X_test)\n",
1336 | "gb.score(X_test,y_test)"
1337 | ]
1338 | },
1339 | {
1340 | "cell_type": "code",
1341 | "execution_count": 156,
1342 | "metadata": {},
1343 | "outputs": [
1344 | {
1345 | "data": {
1346 | "text/plain": [
1347 | "array([[1216, 8],\n",
1348 | " [ 29, 140]])"
1349 | ]
1350 | },
1351 | "execution_count": 156,
1352 | "metadata": {},
1353 | "output_type": "execute_result"
1354 | }
1355 | ],
1356 | "source": [
1357 | "confusion_matrix(y_test, y_pred)"
1358 | ]
1359 | },
1360 | {
1361 | "cell_type": "code",
1362 | "execution_count": 157,
1363 | "metadata": {
1364 | "collapsed": true
1365 | },
1366 | "outputs": [],
1367 | "source": [
1368 | "# Try tfidf with bigrams & trigrams \n",
1369 | "tfidf=TfidfVectorizer(ngram_range=(1,3)) "
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 158,
1375 | "metadata": {
1376 | "collapsed": true
1377 | },
1378 | "outputs": [],
1379 | "source": [
1380 | "X= tfidf.fit_transform(df.msg_cleaned)\n",
1381 | "y=df.spam\n",
1382 | "X_train, X_test, y_train, y_test= train_test_split(X,y)"
1383 | ]
1384 | },
1385 | {
1386 | "cell_type": "code",
1387 | "execution_count": 159,
1388 | "metadata": {},
1389 | "outputs": [
1390 | {
1391 | "data": {
1392 | "text/plain": [
1393 | "0.96841349605168703"
1394 | ]
1395 | },
1396 | "execution_count": 159,
1397 | "metadata": {},
1398 | "output_type": "execute_result"
1399 | }
1400 | ],
1401 | "source": [
1402 | "#try gradient boost \n",
1403 | "gb= GradientBoostingClassifier()\n",
1404 | "gb.fit(X_train,y_train)\n",
1405 | "y_pred=gb.predict(X_test)\n",
1406 | "gb.score(X_test,y_test)"
1407 | ]
1408 | },
1409 | {
1410 | "cell_type": "code",
1411 | "execution_count": 166,
1412 | "metadata": {},
1413 | "outputs": [
1414 | {
1415 | "data": {
1416 | "text/plain": [
1417 | "array([[1209, 0],\n",
1418 | " [ 41, 143]])"
1419 | ]
1420 | },
1421 | "execution_count": 166,
1422 | "metadata": {},
1423 | "output_type": "execute_result"
1424 | }
1425 | ],
1426 | "source": [
1427 | "confusion_matrix(y_test, y_pred) "
1428 | ]
1429 | },
1430 | {
1431 | "cell_type": "code",
1432 | "execution_count": 171,
1433 | "metadata": {
1434 | "collapsed": true
1435 | },
1436 | "outputs": [],
1437 | "source": [
1438 | "tfidf=TfidfVectorizer()"
1439 | ]
1440 | },
1441 | {
1442 | "cell_type": "code",
1443 | "execution_count": 172,
1444 | "metadata": {
1445 | "collapsed": true
1446 | },
1447 | "outputs": [],
1448 | "source": [
1449 | "X=tfidf.fit_transform(df.msg_cleaned)\n",
1450 | "y=df.spam\n",
1451 | "X_train, X_test, y_train, y_test= train_test_split(X,y)"
1452 | ]
1453 | },
1454 | {
1455 | "cell_type": "code",
1456 | "execution_count": 173,
1457 | "metadata": {},
1458 | "outputs": [
1459 | {
1460 | "data": {
1461 | "text/plain": [
1462 | "0.95764536970567127"
1463 | ]
1464 | },
1465 | "execution_count": 173,
1466 | "metadata": {},
1467 | "output_type": "execute_result"
1468 | }
1469 | ],
1470 | "source": [
1471 | "lg= LogisticRegression()\n",
1472 | "lg.fit(X_train,y_train)\n",
1473 | "y_pred=lg.predict(X_test)\n",
1474 | "lg.score(X_test,y_test)"
1475 | ]
1476 | },
1477 | {
1478 | "cell_type": "code",
1479 | "execution_count": 174,
1480 | "metadata": {},
1481 | "outputs": [
1482 | {
1483 | "data": {
1484 | "text/plain": [
1485 | "array([[1204, 1],\n",
1486 | " [ 58, 130]])"
1487 | ]
1488 | },
1489 | "execution_count": 174,
1490 | "metadata": {},
1491 | "output_type": "execute_result"
1492 | }
1493 | ],
1494 | "source": [
1495 | "confusion_matrix(y_test, y_pred) "
1496 | ]
1497 | },
1498 | {
1499 | "cell_type": "markdown",
1500 | "metadata": {},
1501 | "source": [
1502 | "## Section 5.5: Pipeline with Spam data "
1503 | ]
1504 | },
1505 | {
1506 | "cell_type": "code",
1507 | "execution_count": 177,
1508 | "metadata": {
1509 | "collapsed": true
1510 | },
1511 | "outputs": [],
1512 | "source": [
1513 | "pipeline= Pipeline([('countvect', CountVectorizer(stop_words=stopwords_set)),\\\n",
1514 | " #('tfidf', TfidfVectorizer(stop_words=stopwords_set)),\\\n",
1515 | " ('lg', LogisticRegression())])"
1516 | ]
1517 | },
1518 | {
1519 | "cell_type": "code",
1520 | "execution_count": 178,
1521 | "metadata": {},
1522 | "outputs": [
1523 | {
1524 | "name": "stdout",
1525 | "output_type": "stream",
1526 | "text": [
1527 | "0.97415649677\n",
1528 | "[[1208 1]\n",
1529 | " [ 35 149]]\n"
1530 | ]
1531 | }
1532 | ],
1533 | "source": [
1534 | "X=df.msg_cleaned #note we are passing the cleaned msg to the pipeline \n",
1535 | "y=df.spam\n",
1536 | "X_train, X_test, y_train, y_test= train_test_split(X,y) \n",
1537 | "\n",
1538 | "\n",
1539 | "pipeline.fit(X_train, y_train) \n",
1540 | "y_pred= pipeline.predict(X_test)\n",
1541 | "print pipeline.score(X_test, y_test)\n",
1542 | "print confusion_matrix(y_test, y_pred) "
1543 | ]
1544 | },
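{
"cell_type": "markdown",
"metadata": {},
"source": [
"The commented-out `TfidfVectorizer` line in the pipeline above hints at the TF-IDF variant; a minimal sketch of that swap, reusing the same train/test split (`pipeline_tfidf` is just an illustrative name):\n",
"```python\n",
"pipeline_tfidf = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords_set)),\n",
"                           ('lg', LogisticRegression())])\n",
"\n",
"pipeline_tfidf.fit(X_train, y_train)\n",
"print pipeline_tfidf.score(X_test, y_test)\n",
"```"
]
},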
1545 | {
1546 | "cell_type": "code",
1547 | "execution_count": 169,
1548 | "metadata": {
1549 | "collapsed": true
1550 | },
1551 | "outputs": [],
1552 | "source": [
1553 | "pipeline= Pipeline([#('countvect', CountVectorizer(stop_words=stopwords_set)),\\\n",
1554 | " ('countvect', CountVectorizer(stop_words=stopwords_set)),\\\n",
1555 | " ('rf', RandomForestClassifier())])"
1556 | ]
1557 | },
1558 | {
1559 | "cell_type": "code",
1560 | "execution_count": 139,
1561 | "metadata": {},
1562 | "outputs": [
1563 | {
1564 | "name": "stdout",
1565 | "output_type": "stream",
1566 | "text": [
1567 | "0.97272074659\n",
1568 | "[[1188 7]\n",
1569 | " [ 31 167]]\n"
1570 | ]
1571 | }
1572 | ],
1573 | "source": [
1574 | "X=df.msg_cleaned #note we are passing the cleaned msg to the pipeline \n",
1575 | "y=df.spam\n",
1576 | "X_train, X_test, y_train, y_test= train_test_split(X,y) \n",
1577 | "\n",
1578 | "\n",
1579 | "pipeline.fit(X_train, y_train) \n",
1580 | "y_pred= pipeline.predict(X_test)\n",
1581 | "print pipeline.score(X_test, y_test)\n",
1582 | "print confusion_matrix(y_test, y_pred) \n",
1583 | "\n",
1584 | "# the best one so far!"
1585 | ]
1586 | },
1587 | {
1588 | "cell_type": "code",
1589 | "execution_count": 140,
1590 | "metadata": {
1591 | "collapsed": true
1592 | },
1593 | "outputs": [],
1594 | "source": [
1595 | "pipeline= Pipeline([#('countvect', CountVectorizer(stop_words=stopwords_set)),\\\n",
1596 | " ('countvect', CountVectorizer(stop_words=stopwords_set, ngram_range=(1,3))),\\\n",
1597 | " ('rf', RandomForestClassifier())])"
1598 | ]
1599 | },
1600 | {
1601 | "cell_type": "code",
1602 | "execution_count": 141,
1603 | "metadata": {},
1604 | "outputs": [
1605 | {
1606 | "name": "stdout",
1607 | "output_type": "stream",
1608 | "text": [
1609 | "0.964824120603\n",
1610 | "[[1221 0]\n",
1611 | " [ 49 123]]\n"
1612 | ]
1613 | }
1614 | ],
1615 | "source": [
1616 | "X=df.msg_cleaned #note we are passing the cleaned msg to the pipeline \n",
1617 | "y=df.spam\n",
1618 | "X_train, X_test, y_train, y_test= train_test_split(X,y) \n",
1619 | "\n",
1620 | "\n",
1621 | "pipeline.fit(X_train, y_train) \n",
1622 | "y_pred= pipeline.predict(X_test)\n",
1623 | "print pipeline.score(X_test, y_test)\n",
1624 | "print confusion_matrix(y_test, y_pred) "
1625 | ]
1626 | },
1627 | {
1628 | "cell_type": "code",
1629 | "execution_count": null,
1630 | "metadata": {
1631 | "collapsed": true
1632 | },
1633 | "outputs": [],
1634 | "source": []
1635 | }
1636 | ],
1637 | "metadata": {
1638 | "celltoolbar": "Slideshow",
1639 | "kernelspec": {
1640 | "display_name": "Python 2",
1641 | "language": "python",
1642 | "name": "python2"
1643 | },
1644 | "language_info": {
1645 | "codemirror_mode": {
1646 | "name": "ipython",
1647 | "version": 2
1648 | },
1649 | "file_extension": ".py",
1650 | "mimetype": "text/x-python",
1651 | "name": "python",
1652 | "nbconvert_exporter": "python",
1653 | "pygments_lexer": "ipython2",
1654 | "version": "2.7.13"
1655 | },
1656 | "livereveal": {
1657 | "transition": "none"
1658 | }
1659 | },
1660 | "nbformat": 4,
1661 | "nbformat_minor": 1
1662 | }
1663 |
--------------------------------------------------------------------------------
/Part 4/Section_3.1_3.2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 3.1 Introduction to frameworks"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Apache Spark"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## What is Spark?\n",
22 | "\n",
23 | "- Spark is a framework for distributed processing.\n",
24 | "- It is a streamlined alternative to Map-Reduce.\n",
25 | "- Spark applications can be written in Scala, Java, or Python."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "## Why Spark?\n",
33 | "\n",
34 | "Why learn Spark?\n",
35 | "\n",
36 | "- Spark enables you to analyze petabytes of data.\n",
37 | "- Spark is significantly faster than Map-Reduce.\n",
38 | "- Paradoxically, Spark's API is simpler than the Map-Reduce API."
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "## Origins\n",
46 | "\n",
47 | "- Spark was initially started at UC Berkeley's AMPLab (AMP = Algorithms Machines People) in 2009.\n",
48 | "- After being open sourced in 2010 under a BSD license, the project was donated in 2013 to the Apache Software Foundation and switched its license to Apache 2.0.\n",
49 | "- Spark is one of the most active projects in the Apache Software Foundation and one of the most active open source big data projects."
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "## Essense of Spark\n",
57 | "\n",
58 | "What is the basic idea of Spark?\n",
59 | "- Spark takes the Map-Reduce paradigm and changes it in some critical ways:\n",
60 | " - Instead of writing single Map-Reduce jobs, a Spark job consists of a series of map and reduce functions.\n",
61 | " - Moreover, the intermediate data is kept in memory instead of being written to disk."
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## Spark Ecosystem\n",
69 | "\n",
70 | "
"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "### Pop Quiz\n",
78 | "\n",
79 | "\n",
80 | "Q: Since Spark keeps intermediate data in memory to get speed, what does it make us give up? Where's the catch?
\n",
81 | "1. Spark does a trade-off between memory and performance.\n",
82 | "
\n",
83 | "2. While Spark jobs are faster, they also consume more memory.\n",
84 | "
\n",
85 | "3. Spark outshines Map-Reduce in iterative algorithms where the overhead of saving the results of each step to HDFS slows down Map-Reduce.\n",
86 | "
\n",
87 | "4. For non-iterative algorithms Spark is comparable to Map-Reduce.\n",
88 | " "
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "## Spark Logging\n",
96 | "\n",
97 | "Q: How can I make Spark logging less verbose?\n",
98 | "- By default Spark logs messages at the `INFO` level.\n",
99 | "- Here are the steps to make it only print out warnings and errors.\n",
100 | "\n",
101 | "```sh\n",
102 | "cd $SPARK_HOME/conf\n",
103 | "cp log4j.properties.template log4j.properties\n",
104 | "```\n",
105 | "\n",
106 | "- Edit `log4j.properties` and replace `rootCategory=INFO` with `rootCategory=ERROR`"
107 | ]
108 | },
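{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you only want to quiet down a session that is already running, the `SparkContext` created below (`sc`) can also set the level directly, e.g.:\n",
"```python\n",
"sc.setLogLevel('ERROR')\n",
"```"
]
},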
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "## Spark Execution\n",
114 | "\n",
115 | "
\n",
116 | "\n",
117 | "## Spark Terminology\n",
118 | "-----------------\n",
119 | "\n",
120 | "Term | Meaning\n",
121 | "--- |---\n",
122 | "Driver | Process that contains the Spark Context\n",
123 | "Executor | Process that executes one or more Spark tasks\n",
124 | "Master | Process which manages applications across the cluster, e.g., Spark Master\n",
125 | "Worker | Process which manages executors on a particular worker node, e.g. Spark Worker"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "## Spark Job\n",
133 | "\n",
134 | "Q: Flip a coin 100 times using Python's `random()` function. What\n",
135 | "fraction of the time do you get heads?\n",
136 | "\n",
137 | "- Initialize Spark."
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 3,
143 | "metadata": {
144 | "collapsed": true
145 | },
146 | "outputs": [],
147 | "source": [
148 | "import pyspark as ps\n",
149 | "\n",
150 | "spark = ps.sql.SparkSession.builder \\\n",
151 | " .master('local[4]') \\\n",
152 | " .appName('spark-lecture') \\\n",
153 | " .getOrCreate()\n",
154 | "\n",
155 | "sc = spark.sparkContext"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "- Define and run the Spark job."
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 27,
168 | "metadata": {},
169 | "outputs": [
170 | {
171 | "name": "stdout",
172 | "output_type": "stream",
173 | "text": [
174 | "heads = 400\n",
175 | "tails = 600\n",
176 | "ratio = 0.4\n"
177 | ]
178 | }
179 | ],
180 | "source": [
181 | "import random \n",
182 | "\n",
183 | "n = 1000\n",
184 | "\n",
185 | "\n",
186 | "heads = (sc.parallelize(xrange(n))\n",
187 | " .map(lambda _: random.random())\n",
188 | " .filter(lambda r: r <= 0.5)\n",
189 | " .count())\n",
190 | "\n",
191 | "tails = n - heads\n",
192 | "ratio = 1. * heads / n\n",
193 | "\n",
194 | "print 'heads =', heads\n",
195 | "print 'tails =', tails\n",
196 | "print 'ratio =', ratio\n"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 12,
202 | "metadata": {
203 | "collapsed": true
204 | },
205 | "outputs": [],
206 | "source": [
207 | "rdd= sc.parallelize(xrange(n))"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 15,
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "rdd2 = rdd.map(lambda x: random.random())"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": 17,
222 | "metadata": {},
223 | "outputs": [
224 | {
225 | "data": {
226 | "text/plain": [
227 | "1000"
228 | ]
229 | },
230 | "execution_count": 17,
231 | "metadata": {},
232 | "output_type": "execute_result"
233 | }
234 | ],
235 | "source": [
236 | "len(rdd2.collect())"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 21,
242 | "metadata": {
243 | "collapsed": true
244 | },
245 | "outputs": [],
246 | "source": [
247 | "rdd3=rdd2.filter(lambda r: r <= 0.5) "
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 26,
253 | "metadata": {},
254 | "outputs": [
255 | {
256 | "data": {
257 | "text/plain": [
258 | "400"
259 | ]
260 | },
261 | "execution_count": 26,
262 | "metadata": {},
263 | "output_type": "execute_result"
264 | }
265 | ],
266 | "source": [
267 | "rdd3.count() #action : collect, count, mean #tranformation: "
268 | ]
269 | },
270 | {
271 | "cell_type": "markdown",
272 | "metadata": {},
273 | "source": [
274 | "### Notes\n",
275 | "\n",
276 | "- `sc.parallelize` creates an RDD.\n",
277 | "- `map` and `filter` are *transformations*.\n",
278 | " - They create new RDDs from existing RDDs.\n",
279 | "- `count` is an *action* and brings the data from the RDDs back to the driver.\n",
280 | "\n",
281 | "## Spark Terminology\n",
282 | "\n",
283 | "Term | Meaning\n",
284 | "--- | ---\n",
285 | "RDD | *Resilient Distributed Dataset* or a distributed sequence of records\n",
286 | "Spark Job | Sequence of transformations on data with a final action\n",
287 | "Transformation | Spark operation that produces an RDD\n",
288 | "Action | Spark operation that produces a local object\n",
289 | "Spark Application | Sequence of Spark jobs and other code\n",
290 | "\n",
291 | "- A Spark job pushes the data to the cluster, all computation happens on the *executors*, then the result is sent back to the driver."
292 | ]
293 | },
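{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the laziness this implies: transformations only build up a plan, and no work happens on the executors until an action runs (the names here are illustrative):\n",
"```python\n",
"rdd = sc.parallelize(xrange(1000000))\n",
"doubled = rdd.map(lambda x: x * 2)            # transformation: returns immediately, nothing computed yet\n",
"evens = doubled.filter(lambda x: x % 4 == 0)  # still just building up the plan\n",
"evens.count()                                 # action: the job actually runs now\n",
"```"
]
},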
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "### Pop Quiz\n",
299 | "\n",
300 | "\n",
301 | "\n",
302 | "In this Spark job what is the transformation is what is the action?\n",
303 | "
\n",
304 | "`sc.parallelize(xrange(10)).filter(lambda x: x % 2 == 0).collect()`\n",
305 | "
\n",
306 | "1. `filter` is the transformation.\n",
307 | "
\n",
308 | "2. `collect` is the action.\n",
309 | " "
310 | ]
311 | },
312 | {
313 | "cell_type": "markdown",
314 | "metadata": {},
315 | "source": [
316 | "## Lambda vs. Functions\n",
317 | "\n",
318 | "- Instead of `lambda` you can pass in fully defined functions into `map`, `filter`, and other RDD transformations.\n",
319 | "- Use `lambda` for short functions.\n",
320 | "- Use `def` for more substantial functions."
321 | ]
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {},
326 | "source": [
327 | "## Finding Primes\n",
328 | "\n",
329 | "Q: Find all the primes less than 100.\n",
330 | "\n",
331 | "- Define function to determine if a number is prime."
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 34,
337 | "metadata": {
338 | "collapsed": true
339 | },
340 | "outputs": [],
341 | "source": [
342 | "def is_prime(number):\n",
343 | " factor_min = 2\n",
344 | " factor_max = int(number ** 0.5) + 1\n",
345 | " for factor in xrange(factor_min, factor_max):\n",
346 | " if number % factor == 0:\n",
347 | " return False\n",
348 | " return True"
349 | ]
350 | },
351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "- Use this to filter out non-primes."
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": 41,
361 | "metadata": {},
362 | "outputs": [
363 | {
364 | "name": "stdout",
365 | "output_type": "stream",
366 | "text": [
367 | "[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]\n"
368 | ]
369 | }
370 | ],
371 | "source": [
372 | "numbers = xrange(2, 100)\n",
373 | "\n",
374 | "primes = (sc.parallelize(numbers)\n",
375 | " .filter(is_prime)\n",
376 | " .collect())\n",
377 | "\n",
378 | "print primes"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 29,
384 | "metadata": {
385 | "collapsed": true
386 | },
387 | "outputs": [],
388 | "source": [
389 | "numbers=xrange(2,10)"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 30,
395 | "metadata": {},
396 | "outputs": [],
397 | "source": [
398 | "rdd=sc.parallelize(numbers)"
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": 35,
404 | "metadata": {},
405 | "outputs": [],
406 | "source": [
407 | "rdd2=rdd.filter(is_prime)"
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": 39,
413 | "metadata": {},
414 | "outputs": [],
415 | "source": [
416 | "rdd3=rdd2.filter(lambda x: x>5)"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": 40,
422 | "metadata": {},
423 | "outputs": [
424 | {
425 | "data": {
426 | "text/plain": [
427 | "[7]"
428 | ]
429 | },
430 | "execution_count": 40,
431 | "metadata": {},
432 | "output_type": "execute_result"
433 | }
434 | ],
435 | "source": [
436 | "rdd3.collect()"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "### Pop Quiz\n",
444 | "\n",
445 | "
\n",
446 | "\n",
447 | "\n",
448 | "Q: Where does `is_prime` execute?
\n",
449 | "A: On the executors.\n",
450 | " \n",
451 | "\n",
452 | "\n",
453 | "Q: Where does is the RDD object collected?
\n",
454 | "A: On the driver.\n",
455 | " "
456 | ]
457 | },
458 | {
459 | "cell_type": "markdown",
460 | "metadata": {},
461 | "source": [
462 | "### Transformations and Actions\n",
463 | "\n",
464 | "- Common RDD Constructors\n",
465 | "\n",
466 | "Expression | Meaning\n",
467 | "--- | ---\n",
468 | "`sc.parallelize(list)` | Create RDD of elements of list\n",
469 | "`sc.textFile(path)` | Create RDD of lines from file\n",
470 | "\n",
471 | "- Common Transformations\n",
472 | "\n",
473 | "Expression | Meaning\n",
474 | "--- | ---\n",
475 | "`filter(lambda x: x % 2 == 0)` | Discard non-even elements\n",
476 | "`map(lambda x: x * 2)` | Multiply each RDD element by `2`\n",
477 | "`map(lambda x: x.split())` | Split each string into words\n",
478 | "`flatMap(lambda x: x.split())` | Split each string into words and flatten sequence\n",
479 | "`sample(withReplacement = True, 0.25)` | Create sample of 25% of elements with replacement\n",
480 | "`union(rdd)` | Append `rdd` to existing RDD\n",
481 | "`distinct()` | Remove duplicates in RDD\n",
482 | "`sortBy(lambda x: x, ascending = False)` | Sort elements in descending order\n",
483 | "\n",
484 | "- Common Actions\n",
485 | "\n",
486 | "Expression | Meaning\n",
487 | "--- | ---\n",
488 | "`collect()` | Convert RDD to in-memory list \n",
489 | "`take(3)` | First 3 elements of RDD \n",
490 | "`top(3)` | Top 3 elements of RDD\n",
491 | "`takeSample(withReplacement = True, 3)` | Create sample of 3 elements with replacement\n",
492 | "`sum()` | Find element sum (assumes numeric elements)\n",
493 | "`mean()` | Find element mean (assumes numeric elements)\n",
494 | "`stdev()` | Find element deviation (assumes numeric elements)"
495 | ]
496 | },
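{
"cell_type": "markdown",
"metadata": {},
"source": [
"A few of the table entries above in action, as a minimal sketch on a small list so the results are easy to check by eye:\n",
"```python\n",
"nums = sc.parallelize([1, 2, 2, 3])\n",
"\n",
"nums.distinct().collect()                      # [1, 2, 3] (in some order)\n",
"nums.union(sc.parallelize([4, 5])).collect()   # [1, 2, 2, 3, 4, 5]\n",
"nums.top(2)                                    # [3, 2]\n",
"nums.sum(), nums.mean()                        # (8, 2.0)\n",
"nums.takeSample(withReplacement=False, num=2)  # 2 random elements, without replacement\n",
"```"
]
},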
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "### Pop Quiz\n",
502 | "\n",
503 | "Q: What will this output?"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": 43,
509 | "metadata": {},
510 | "outputs": [
511 | {
512 | "data": {
513 | "text/plain": [
514 | "[4, 1, 5, 2, 3]"
515 | ]
516 | },
517 | "execution_count": 43,
518 | "metadata": {},
519 | "output_type": "execute_result"
520 | }
521 | ],
522 | "source": [
523 | "sc.parallelize([1, 3, 2, 2, 1, 4, 5]).distinct().collect()"
524 | ]
525 | },
526 | {
527 | "cell_type": "markdown",
528 | "metadata": {},
529 | "source": [
530 | "Q: What will this output?"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": 48,
536 | "metadata": {},
537 | "outputs": [
538 | {
539 | "data": {
540 | "text/plain": [
541 | "[9, 8, 7, 6, 5, 4, 3, 2]"
542 | ]
543 | },
544 | "execution_count": 48,
545 | "metadata": {},
546 | "output_type": "execute_result"
547 | }
548 | ],
549 | "source": [
550 | "sc.parallelize(xrange(2, 10)).sortBy(lambda x: x, ascending=False).collect()"
551 | ]
552 | },
553 | {
554 | "cell_type": "markdown",
555 | "metadata": {},
556 | "source": [
557 | "Q: What will this output?"
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "execution_count": 49,
563 | "metadata": {},
564 | "outputs": [
565 | {
566 | "name": "stdout",
567 | "output_type": "stream",
568 | "text": [
569 | "Writing input.txt\n"
570 | ]
571 | }
572 | ],
573 | "source": [
574 | "%%writefile input.txt\n",
575 | "hello world\n",
576 | "another line\n",
577 | "yet another line\n",
578 | "yet another another line"
579 | ]
580 | },
581 | {
582 | "cell_type": "code",
583 | "execution_count": 50,
584 | "metadata": {},
585 | "outputs": [
586 | {
587 | "data": {
588 | "text/plain": [
589 | "4"
590 | ]
591 | },
592 | "execution_count": 50,
593 | "metadata": {},
594 | "output_type": "execute_result"
595 | }
596 | ],
597 | "source": [
598 | "sc.textFile('input.txt').map(lambda x: x.split()).count()"
599 | ]
600 | },
601 | {
602 | "cell_type": "markdown",
603 | "metadata": {},
604 | "source": [
605 | "Q: What about this?"
606 | ]
607 | },
608 | {
609 | "cell_type": "code",
610 | "execution_count": 51,
611 | "metadata": {},
612 | "outputs": [
613 | {
614 | "data": {
615 | "text/plain": [
616 | "11"
617 | ]
618 | },
619 | "execution_count": 51,
620 | "metadata": {},
621 | "output_type": "execute_result"
622 | }
623 | ],
624 | "source": [
625 | "sc.textFile('input.txt').flatMap(lambda x: x.split()).count()"
626 | ]
627 | },
628 | {
629 | "cell_type": "markdown",
630 | "metadata": {},
631 | "source": [
632 | "## Section 3.2 "
633 | ]
634 | },
635 | {
636 | "cell_type": "markdown",
637 | "metadata": {},
638 | "source": [
639 | "## Map vs. FlatMap\n",
640 | "\n",
641 | "Here's the difference between `map` and `flatMap`:\n",
642 | "\n",
643 | "- Map:"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": 91,
649 | "metadata": {},
650 | "outputs": [
651 | {
652 | "data": {
653 | "text/plain": [
654 | "[[u'hello', u'world'],\n",
655 | " [u'another', u'line'],\n",
656 | " [u'yet', u'another', u'line'],\n",
657 | " [u'yet', u'another', u'another', u'line']]"
658 | ]
659 | },
660 | "execution_count": 91,
661 | "metadata": {},
662 | "output_type": "execute_result"
663 | }
664 | ],
665 | "source": [
666 | "sc.textFile('input.txt') \\\n",
667 | " .map(lambda x: x.split()) \\\n",
668 | " .collect()"
669 | ]
670 | },
671 | {
672 | "cell_type": "code",
673 | "execution_count": 93,
674 | "metadata": {},
675 | "outputs": [],
676 | "source": [
677 | "rdd= sc.textFile('input.txt')"
678 | ]
679 | },
680 | {
681 | "cell_type": "code",
682 | "execution_count": 97,
683 | "metadata": {},
684 | "outputs": [],
685 | "source": [
686 | "rdd2=rdd.map(lambda x: x.split())"
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": 98,
692 | "metadata": {},
693 | "outputs": [
694 | {
695 | "data": {
696 | "text/plain": [
697 | "[[u'hello', u'world'],\n",
698 | " [u'another', u'line'],\n",
699 | " [u'yet', u'another', u'line'],\n",
700 | " [u'yet', u'another', u'another', u'line']]"
701 | ]
702 | },
703 | "execution_count": 98,
704 | "metadata": {},
705 | "output_type": "execute_result"
706 | }
707 | ],
708 | "source": [
709 | "rdd2.collect()"
710 | ]
711 | },
712 | {
713 | "cell_type": "markdown",
714 | "metadata": {},
715 | "source": [
716 | "- FlatMap:"
717 | ]
718 | },
719 | {
720 | "cell_type": "code",
721 | "execution_count": 99,
722 | "metadata": {},
723 | "outputs": [
724 | {
725 | "data": {
726 | "text/plain": [
727 | "[u'hello',\n",
728 | " u'world',\n",
729 | " u'another',\n",
730 | " u'line',\n",
731 | " u'yet',\n",
732 | " u'another',\n",
733 | " u'line',\n",
734 | " u'yet',\n",
735 | " u'another',\n",
736 | " u'another',\n",
737 | " u'line']"
738 | ]
739 | },
740 | "execution_count": 99,
741 | "metadata": {},
742 | "output_type": "execute_result"
743 | }
744 | ],
745 | "source": [
746 | "sc.textFile('input.txt') \\\n",
747 | " .flatMap(lambda x: x.split()) \\\n",
748 | " .collect()"
749 | ]
750 | },
751 | {
752 | "cell_type": "code",
753 | "execution_count": null,
754 | "metadata": {
755 | "collapsed": true
756 | },
757 | "outputs": [],
758 | "source": []
759 | },
760 | {
761 | "cell_type": "markdown",
762 | "metadata": {},
763 | "source": [
764 | "## PairRDD\n",
765 | "\n",
766 | "At this point we know how to aggregate values across an RDD. If we have an RDD containing sales transactions we can find the total revenue across all transactions.\n",
767 | "\n",
768 | "Q: Using the following sales data find the total revenue across all transactions."
769 | ]
770 | },
771 | {
772 | "cell_type": "code",
773 | "execution_count": 100,
774 | "metadata": {},
775 | "outputs": [
776 | {
777 | "name": "stdout",
778 | "output_type": "stream",
779 | "text": [
780 | "Overwriting sales.txt\n"
781 | ]
782 | }
783 | ],
784 | "source": [
785 | "%%writefile sales.txt\n",
786 | "#ID Date Store State Product Amount\n",
787 | "101 11/13/2014 100 WA 331 300.00\n",
788 | "104 11/18/2014 700 OR 329 450.00\n",
789 | "102 11/15/2014 203 CA 321 200.00\n",
790 | "106 11/19/2014 202 CA 331 330.00\n",
791 | "103 11/17/2014 101 WA 373 750.00\n",
792 | "105 11/19/2014 202 CA 321 200.00"
793 | ]
794 | },
795 | {
796 | "cell_type": "code",
797 | "execution_count": 101,
798 | "metadata": {},
799 | "outputs": [
800 | {
801 | "data": {
802 | "text/plain": [
803 | "[u'106 11/19/2014 202 CA 331 330.00',\n",
804 | " u'105 11/19/2014 202 CA 321 200.00']"
805 | ]
806 | },
807 | "execution_count": 101,
808 | "metadata": {},
809 | "output_type": "execute_result"
810 | }
811 | ],
812 | "source": [
813 | "sc.textFile('sales.txt').top(2)"
814 | ]
815 | },
816 | {
817 | "cell_type": "markdown",
818 | "metadata": {},
819 | "source": [
820 | "- Read the file."
821 | ]
822 | },
823 | {
824 | "cell_type": "code",
825 | "execution_count": 102,
826 | "metadata": {},
827 | "outputs": [
828 | {
829 | "data": {
830 | "text/plain": [
831 | "[u'#ID Date Store State Product Amount',\n",
832 | " u'101 11/13/2014 100 WA 331 300.00']"
833 | ]
834 | },
835 | "execution_count": 102,
836 | "metadata": {},
837 | "output_type": "execute_result"
838 | }
839 | ],
840 | "source": [
841 | "sc.textFile('sales.txt')\\\n",
842 | " .take(2)"
843 | ]
844 | },
845 | {
846 | "cell_type": "markdown",
847 | "metadata": {},
848 | "source": [
849 | "- Split the lines."
850 | ]
851 | },
852 | {
853 | "cell_type": "code",
854 | "execution_count": 104,
855 | "metadata": {},
856 | "outputs": [
857 | {
858 | "data": {
859 | "text/plain": [
860 | "[[u'106', u'11/19/2014', u'202', u'CA', u'331', u'330.00'],\n",
861 | " [u'105', u'11/19/2014', u'202', u'CA', u'321', u'200.00']]"
862 | ]
863 | },
864 | "execution_count": 104,
865 | "metadata": {},
866 | "output_type": "execute_result"
867 | }
868 | ],
869 | "source": [
870 | "sc.textFile('sales.txt')\\\n",
871 | " .map(lambda x: x.split())\\\n",
872 | " .top(2)"
873 | ]
874 | },
875 | {
876 | "cell_type": "markdown",
877 | "metadata": {},
878 | "source": [
879 | "- Remove `#`."
880 | ]
881 | },
882 | {
883 | "cell_type": "code",
884 | "execution_count": 105,
885 | "metadata": {},
886 | "outputs": [
887 | {
888 | "data": {
889 | "text/plain": [
890 | "[[u'#ID', u'Date', u'Store', u'State', u'Product', u'Amount']]"
891 | ]
892 | },
893 | "execution_count": 105,
894 | "metadata": {},
895 | "output_type": "execute_result"
896 | }
897 | ],
898 | "source": [
899 | "sc.textFile('sales.txt')\\\n",
900 | " .map(lambda x: x.split())\\\n",
901 | " .filter(lambda x: x[0].startswith('#'))\\\n",
902 | " .take(3)"
903 | ]
904 | },
905 | {
906 | "cell_type": "markdown",
907 | "metadata": {},
908 | "source": [
909 | "- Try again."
910 | ]
911 | },
912 | {
913 | "cell_type": "code",
914 | "execution_count": 107,
915 | "metadata": {},
916 | "outputs": [
917 | {
918 | "data": {
919 | "text/plain": [
920 | "[[u'101', u'11/13/2014', u'100', u'WA', u'331', u'300.00'],\n",
921 | " [u'104', u'11/18/2014', u'700', u'OR', u'329', u'450.00'],\n",
922 | " [u'102', u'11/15/2014', u'203', u'CA', u'321', u'200.00']]"
923 | ]
924 | },
925 | "execution_count": 107,
926 | "metadata": {},
927 | "output_type": "execute_result"
928 | }
929 | ],
930 | "source": [
931 | "sc.textFile('sales.txt')\\\n",
932 | " .map(lambda x: x.split())\\\n",
933 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
934 | " .take(3)"
935 | ]
936 | },
937 | {
938 | "cell_type": "markdown",
939 | "metadata": {},
940 | "source": [
941 | "- Pick last field."
942 | ]
943 | },
944 | {
945 | "cell_type": "code",
946 | "execution_count": 109,
947 | "metadata": {},
948 | "outputs": [
949 | {
950 | "data": {
951 | "text/plain": [
952 | "[u'300.00', u'450.00', u'200.00']"
953 | ]
954 | },
955 | "execution_count": 109,
956 | "metadata": {},
957 | "output_type": "execute_result"
958 | }
959 | ],
960 | "source": [
961 | "sc.textFile('sales.txt')\\\n",
962 | " .map(lambda x: x.split())\\\n",
963 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
964 | " .map(lambda x: x[-1])\\\n",
965 | " .take(3)"
966 | ]
967 | },
968 | {
969 | "cell_type": "markdown",
970 | "metadata": {},
971 | "source": [
972 | "- Convert to float and then sum."
973 | ]
974 | },
975 | {
976 | "cell_type": "code",
977 | "execution_count": 112,
978 | "metadata": {},
979 | "outputs": [
980 | {
981 | "data": {
982 | "text/plain": [
983 | "2230.0"
984 | ]
985 | },
986 | "execution_count": 112,
987 | "metadata": {},
988 | "output_type": "execute_result"
989 | }
990 | ],
991 | "source": [
992 | "sc.textFile('sales.txt')\\\n",
993 | " .map(lambda x: x.split())\\\n",
994 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
995 | " .map(lambda x: float(x[-1]))\\\n",
996 | " .sum()"
997 | ]
998 | },
999 | {
1000 | "cell_type": "markdown",
1001 | "metadata": {},
1002 | "source": [
1003 | "## ReduceByKey\n",
1004 | "\n",
1005 | "Q: Calculate revenue per state?\n",
1006 | "\n",
1007 | "- Instead of creating a sequence of revenue numbers we can create tuples of states and revenues."
1008 | ]
1009 | },
1010 | {
1011 | "cell_type": "code",
1012 | "execution_count": 61,
1013 | "metadata": {},
1014 | "outputs": [
1015 | {
1016 | "data": {
1017 | "text/plain": [
1018 | "[(u'WA', 300.0),\n",
1019 | " (u'OR', 450.0),\n",
1020 | " (u'CA', 200.0),\n",
1021 | " (u'CA', 330.0),\n",
1022 | " (u'WA', 750.0),\n",
1023 | " (u'CA', 200.0)]"
1024 | ]
1025 | },
1026 | "execution_count": 61,
1027 | "metadata": {},
1028 | "output_type": "execute_result"
1029 | }
1030 | ],
1031 | "source": [
1032 | "sc.textFile('sales.txt')\\\n",
1033 | " .map(lambda x: x.split())\\\n",
1034 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1035 | " .map(lambda x: (x[-3], float(x[-1])))\\\n",
1036 | " .collect()"
1037 | ]
1038 | },
1039 | {
1040 | "cell_type": "markdown",
1041 | "metadata": {},
1042 | "source": [
1043 | "- Now use `reduceByKey` to add them up."
1044 | ]
1045 | },
1046 | {
1047 | "cell_type": "code",
1048 | "execution_count": 116,
1049 | "metadata": {},
1050 | "outputs": [
1051 | {
1052 | "data": {
1053 | "text/plain": [
1054 | "[(u'CA', 730.0), (u'WA', 1050.0), (u'OR', 450.0)]"
1055 | ]
1056 | },
1057 | "execution_count": 116,
1058 | "metadata": {},
1059 | "output_type": "execute_result"
1060 | }
1061 | ],
1062 | "source": [
1063 | "sc.textFile('sales.txt')\\\n",
1064 | " .map(lambda x: x.split())\\\n",
1065 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1066 | " .map(lambda x: (x[-3], float(x[-1])))\\\n",
1067 | " .reduceByKey(lambda amount1, amount2: amount1 + amount2)\\\n",
1068 | " .collect()"
1069 | ]
1070 | },
1071 | {
1072 | "cell_type": "markdown",
1073 | "metadata": {},
1074 | "source": [
1075 | "Q: Find the state with the highest total revenue.\n",
1076 | "\n",
1077 | "- You can either use the action `top` or the transformation `sortBy`."
1078 | ]
1079 | },
1080 | {
1081 | "cell_type": "code",
1082 | "execution_count": 123,
1083 | "metadata": {},
1084 | "outputs": [
1085 | {
1086 | "data": {
1087 | "text/plain": [
1088 | "2230.0"
1089 | ]
1090 | },
1091 | "execution_count": 123,
1092 | "metadata": {},
1093 | "output_type": "execute_result"
1094 | }
1095 | ],
1096 | "source": [
1097 | "sc.textFile('sales.txt')\\\n",
1098 | " .map(lambda x: x.split())\\\n",
1099 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1100 | " .map(lambda x: (x[-3], float(x[-1])))\\\n",
1101 | " .reduceByKey(lambda amount1, amount2: amount1 + amount2)\\\n",
1102 | " .sortBy(lambda state_amount: state_amount[1])\\\n",
1103 | " .map(lambda x: x[1])\\\n",
1104 | " .sum()"
1105 | ]
1106 | },
1107 | {
1108 | "cell_type": "markdown",
1109 | "metadata": {},
1110 | "source": [
1111 | "### Pop Quiz\n",
1112 | "\n",
1113 | "\n",
1114 | "Q: What does `reduceByKey` do?
\n",
1115 | "1. It is like a reducer.\n",
1116 | "
\n",
1117 | "2. If the RDD is made up of key-value pairs, it combines the values across all tuples with the same key by using the function we pass to it.\n",
1118 | "
\n",
1119 | "3. It only works on RDDs made up of key-value pairs or 2-tuples.\n",
1120 | " \n",
1121 | "\n",
1122 | "## Notes\n",
1123 | "\n",
1124 | "- `reduceByKey` only works on RDDs made up of 2-tuples.\n",
1125 | "- `reduceByKey` works as both a reducer and a combiner.\n",
1126 | "- It requires that the operation is associative."
1127 | ]
1128 | },
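{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the operation must be associative: `reduceByKey` combines values within each partition first (like a combiner) and then merges the partial results, so the grouping of the calls must not change the answer. A minimal sketch with a small hand-built pair RDD, assuming the existing `sc`:\n",
"\n",
"    pairs = sc.parallelize([('CA', 200.0), ('WA', 300.0), ('CA', 330.0)])\n",
"    pairs.reduceByKey(lambda a, b: a + b).collect()   # addition is associative\n",
"    pairs.reduceByKey(max).collect()                  # so is max\n",
"\n",
"An operation like subtraction, whose result depends on how the values are grouped, would give partition-dependent answers and should not be used here."
]
},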
1129 | {
1130 | "cell_type": "markdown",
1131 | "metadata": {},
1132 | "source": [
1133 | "## Word Count\n",
1134 | "\n",
1135 | "Q: Implement word count in Spark.\n",
1136 | "\n",
1137 | "- Create some input."
1138 | ]
1139 | },
1140 | {
1141 | "cell_type": "code",
1142 | "execution_count": 64,
1143 | "metadata": {},
1144 | "outputs": [
1145 | {
1146 | "name": "stdout",
1147 | "output_type": "stream",
1148 | "text": [
1149 | "Overwriting input.txt\n"
1150 | ]
1151 | }
1152 | ],
1153 | "source": [
1154 | "%%writefile input.txt\n",
1155 | "hello world\n",
1156 | "another line\n",
1157 | "yet another line\n",
1158 | "yet another another line"
1159 | ]
1160 | },
1161 | {
1162 | "cell_type": "markdown",
1163 | "metadata": {},
1164 | "source": [
1165 | "- Count the words."
1166 | ]
1167 | },
1168 | {
1169 | "cell_type": "code",
1170 | "execution_count": 124,
1171 | "metadata": {},
1172 | "outputs": [
1173 | {
1174 | "data": {
1175 | "text/plain": [
1176 | "[(u'line', 3), (u'another', 4), (u'world', 1), (u'hello', 1), (u'yet', 2)]"
1177 | ]
1178 | },
1179 | "execution_count": 124,
1180 | "metadata": {},
1181 | "output_type": "execute_result"
1182 | }
1183 | ],
1184 | "source": [
1185 | "sc.textFile('input.txt')\\\n",
1186 | " .flatMap(lambda line: line.split())\\\n",
1187 | " .map(lambda word: (word, 1))\\\n",
1188 | " .reduceByKey(lambda count1, count2: count1 + count2)\\\n",
1189 | " .collect()"
1190 | ]
1191 | },
1192 | {
1193 | "cell_type": "code",
1194 | "execution_count": 125,
1195 | "metadata": {
1196 | "collapsed": true
1197 | },
1198 | "outputs": [],
1199 | "source": [
1200 | "rdd=sc.textFile('input.txt')\\\n",
1201 | " .flatMap(lambda line: line.split())\\"
1202 | ]
1203 | },
1204 | {
1205 | "cell_type": "code",
1206 | "execution_count": 130,
1207 | "metadata": {},
1208 | "outputs": [],
1209 | "source": [
1210 | "rdd2= rdd.map(lambda word: (word, 1))"
1211 | ]
1212 | },
1213 | {
1214 | "cell_type": "code",
1215 | "execution_count": 132,
1216 | "metadata": {},
1217 | "outputs": [
1218 | {
1219 | "data": {
1220 | "text/plain": [
1221 | "[(u'hello', 1),\n",
1222 | " (u'world', 1),\n",
1223 | " (u'another', 1),\n",
1224 | " (u'line', 1),\n",
1225 | " (u'yet', 1),\n",
1226 | " (u'another', 1),\n",
1227 | " (u'line', 1),\n",
1228 | " (u'yet', 1),\n",
1229 | " (u'another', 1),\n",
1230 | " (u'another', 1),\n",
1231 | " (u'line', 1)]"
1232 | ]
1233 | },
1234 | "execution_count": 132,
1235 | "metadata": {},
1236 | "output_type": "execute_result"
1237 | }
1238 | ],
1239 | "source": [
1240 | "rdd2.collect()"
1241 | ]
1242 | },
1243 | {
1244 | "cell_type": "code",
1245 | "execution_count": 131,
1246 | "metadata": {},
1247 | "outputs": [
1248 | {
1249 | "data": {
1250 | "text/plain": [
1251 | "[(u'line', 3), (u'another', 4), (u'world', 1), (u'hello', 1), (u'yet', 2)]"
1252 | ]
1253 | },
1254 | "execution_count": 131,
1255 | "metadata": {},
1256 | "output_type": "execute_result"
1257 | }
1258 | ],
1259 | "source": [
1260 | "rdd2.reduceByKey(lambda a, b:a+b).collect()"
1261 | ]
1262 | },
1263 | {
1264 | "cell_type": "markdown",
1265 | "metadata": {},
1266 | "source": [
1267 | "## Making List Indexing Readable\n",
1268 | "\n",
1269 | "- While this code looks reasonable, the list indexes are cryptic and hard to read."
1270 | ]
1271 | },
1272 | {
1273 | "cell_type": "code",
1274 | "execution_count": 66,
1275 | "metadata": {},
1276 | "outputs": [
1277 | {
1278 | "data": {
1279 | "text/plain": [
1280 | "[(u'WA', 1050.0), (u'CA', 730.0), (u'OR', 450.0)]"
1281 | ]
1282 | },
1283 | "execution_count": 66,
1284 | "metadata": {},
1285 | "output_type": "execute_result"
1286 | }
1287 | ],
1288 | "source": [
1289 | "sc.textFile('sales.txt')\\\n",
1290 | " .map(lambda x: x.split())\\\n",
1291 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1292 | " .map(lambda x: (x[-3], float(x[-1])))\\\n",
1293 | " .reduceByKey(lambda amount1, amount2: amount1 + amount2)\\\n",
1294 | " .sortBy(lambda state_amount: state_amount[1], ascending = False) \\\n",
1295 | " .collect()"
1296 | ]
1297 | },
1298 | {
1299 | "cell_type": "markdown",
1300 | "metadata": {},
1301 | "source": [
1302 | "- We can make this more readable using Python's argument unpacking feature."
1303 | ]
1304 | },
1305 | {
1306 | "cell_type": "markdown",
1307 | "metadata": {},
1308 | "source": [
1309 | "## Argument Unpacking\n",
1310 | "\n",
1311 | "Q: Which version of `getCity` is more readable and why?\n",
1312 | "\n",
1313 | "- Consider this code."
1314 | ]
1315 | },
1316 | {
1317 | "cell_type": "code",
1318 | "execution_count": 67,
1319 | "metadata": {},
1320 | "outputs": [
1321 | {
1322 | "name": "stdout",
1323 | "output_type": "stream",
1324 | "text": [
1325 | "getCity1(client) = SF\n",
1326 | "getCity1(client) = SF\n"
1327 | ]
1328 | }
1329 | ],
1330 | "source": [
1331 | "client = ('Dmitri', 'Smith', 'SF')\n",
1332 | "\n",
1333 | "def getCity1(client):\n",
1334 | " return client[2]\n",
1335 | "\n",
1336 | "def getCity2((first, last, city)):\n",
1337 | " return city\n",
1338 | "\n",
1339 | "print 'getCity1(client) =', getCity1(client)\n",
1340 | "print 'getCity1(client) =', getCity2(client)"
1341 | ]
1342 | },
1343 | {
1344 | "cell_type": "markdown",
1345 | "metadata": {},
1346 | "source": [
1347 | "- What is the difference between `getCity1` and `getCity2`?\n",
1348 | "- Which is more readable?\n",
1349 | "- What is the essence of argument unpacking?"
1350 | ]
1351 | },
1352 | {
1353 | "cell_type": "markdown",
1354 | "metadata": {},
1355 | "source": [
1356 | "### Pop Quiz\n",
1357 | "\n",
1358 | "\n",
1359 | "Q: Can argument unpacking work for deeper nested structures?
\n",
1360 | "A: Yes. It can work for arbitrarily nested tuples and lists.\n",
1361 | " \n",
1362 | "\n",
1363 | "\n",
1364 | "Q: How would you write `getCity` given\n",
1365 | "
\n",
1366 | "`client = ('Dmitri','Smith', ('123 Eddy','SF','CA'))`\n",
1367 | "
\n",
1368 | "`def getCity((first, last, (street, city, state))): return city`\n",
1369 | " \n",
1370 | "\n"
1371 | ]
1372 | },
1373 | {
1374 | "cell_type": "markdown",
1375 | "metadata": {},
1376 | "source": [
1377 | "- Let's test this out."
1378 | ]
1379 | },
1380 | {
1381 | "cell_type": "code",
1382 | "execution_count": 68,
1383 | "metadata": {},
1384 | "outputs": [
1385 | {
1386 | "data": {
1387 | "text/plain": [
1388 | "'SF'"
1389 | ]
1390 | },
1391 | "execution_count": 68,
1392 | "metadata": {},
1393 | "output_type": "execute_result"
1394 | }
1395 | ],
1396 | "source": [
1397 | "client = ('Dmitri', 'Smith',('123 Eddy', 'SF', 'CA'))\n",
1398 | "\n",
1399 | "def getCity((first, last, (street, city, state))):\n",
1400 | " return city\n",
1401 | "\n",
1402 | "getCity(client)"
1403 | ]
1404 | },
1405 | {
1406 | "cell_type": "markdown",
1407 | "metadata": {},
1408 | "source": [
1409 | "- Whenever you find yourself indexing into a tuple consider usingargument unpacking to make it more readable.\n",
1410 | "- Here is what `getCity` looks like without tuple indexing."
1411 | ]
1412 | },
1413 | {
1414 | "cell_type": "code",
1415 | "execution_count": 69,
1416 | "metadata": {},
1417 | "outputs": [
1418 | {
1419 | "data": {
1420 | "text/plain": [
1421 | "'SF'"
1422 | ]
1423 | },
1424 | "execution_count": 69,
1425 | "metadata": {},
1426 | "output_type": "execute_result"
1427 | }
1428 | ],
1429 | "source": [
1430 | "def badGetCity(client):\n",
1431 | " return client[2][1]\n",
1432 | "\n",
1433 | "getCity(client)"
1434 | ]
1435 | },
1436 | {
1437 | "cell_type": "markdown",
1438 | "metadata": {},
1439 | "source": [
1440 | "## Argument Unpacking In Spark\n",
1441 | "\n",
1442 | "Q: Rewrite the last Spark job using argument unpacking.\n",
1443 | "\n",
1444 | "- Here is the original version of the code:"
1445 | ]
1446 | },
1447 | {
1448 | "cell_type": "code",
1449 | "execution_count": 70,
1450 | "metadata": {},
1451 | "outputs": [
1452 | {
1453 | "data": {
1454 | "text/plain": [
1455 | "[(u'WA', 1050.0), (u'CA', 730.0), (u'OR', 450.0)]"
1456 | ]
1457 | },
1458 | "execution_count": 70,
1459 | "metadata": {},
1460 | "output_type": "execute_result"
1461 | }
1462 | ],
1463 | "source": [
1464 | "sc.textFile('sales.txt')\\\n",
1465 | " .map(lambda x: x.split())\\\n",
1466 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1467 | " .map(lambda x: (x[-3], float(x[-1])))\\\n",
1468 | " .reduceByKey(lambda amount1,amount2: amount1+amount2)\\\n",
1469 | " .sortBy(lambda state_amount: state_amount[1],ascending=False) \\\n",
1470 | " .collect()"
1471 | ]
1472 | },
1473 | {
1474 | "cell_type": "markdown",
1475 | "metadata": {},
1476 | "source": [
1477 | "- Here is the code with argument unpacking:"
1478 | ]
1479 | },
1480 | {
1481 | "cell_type": "code",
1482 | "execution_count": 71,
1483 | "metadata": {},
1484 | "outputs": [
1485 | {
1486 | "data": {
1487 | "text/plain": [
1488 | "[(u'WA', 1050.0), (u'CA', 730.0), (u'OR', 450.0)]"
1489 | ]
1490 | },
1491 | "execution_count": 71,
1492 | "metadata": {},
1493 | "output_type": "execute_result"
1494 | }
1495 | ],
1496 | "source": [
1497 | "sc.textFile('sales.txt')\\\n",
1498 | " .map(lambda x: x.split())\\\n",
1499 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1500 | " .map(lambda (id, date, store, state, product, amount): (state, float(amount)))\\\n",
1501 | " .reduceByKey(lambda amount1, amount2: amount1 + amount2)\\\n",
1502 | " .sortBy(lambda (state, amount): amount, ascending = False) \\\n",
1503 | " .collect()"
1504 | ]
1505 | },
1506 | {
1507 | "cell_type": "markdown",
1508 | "metadata": {},
1509 | "source": [
1510 | "- In this case because we have a long list or tuple argument unpacking is a judgement call."
1511 | ]
1512 | },
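{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat: unpacking a tuple in a `lambda` or `def` parameter list is a Python 2-only feature (it was removed in Python 3 by PEP 3113). A Python 3-compatible sketch of the same job, assuming the six-column layout named above, falls back to indexing:\n",
"\n",
"    sc.textFile('sales.txt')\\\n",
"        .map(lambda x: x.split())\\\n",
"        .filter(lambda x: not x[0].startswith('#'))\\\n",
"        .map(lambda fields: (fields[3], float(fields[5])))\\\n",
"        .reduceByKey(lambda amount1, amount2: amount1 + amount2)\\\n",
"        .sortBy(lambda state_amount: state_amount[1], ascending = False)\\\n",
"        .collect()"
]
},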
1513 | {
1514 | "cell_type": "markdown",
1515 | "metadata": {},
1516 | "source": [
1517 | "## GroupByKey\n",
1518 | "\n",
1519 | "`reduceByKey` lets us aggregate values using sum, max, min, and other associative operations. But what about non-associative operations like average? How can we calculate them?\n",
1520 | "\n",
1521 | "There are several ways to do this.\n",
1522 | "- The first approach is to change the RDD tuples so that the operation becomes associative. \n",
1523 | " - Instead of `(state, amount)` use `(state, (amount, count))`.\n",
1524 | "- The second approach is to use `groupByKey`, which is like `reduceByKey` except it gathers together all the values in an iterator. \n",
1525 | " - The iterator can then be reduced in a `map` step immediately after the `groupByKey`."
1526 | ]
1527 | },
1528 | {
1529 | "cell_type": "markdown",
1530 | "metadata": {},
1531 | "source": [
1532 | "Q: Calculate the average sales per state.\n",
1533 | "\n",
1534 | "- Approach 1: Restructure the tuples:"
1535 | ]
1536 | },
1537 | {
1538 | "cell_type": "code",
1539 | "execution_count": 72,
1540 | "metadata": {},
1541 | "outputs": [
1542 | {
1543 | "data": {
1544 | "text/plain": [
1545 | "[(u'CA', 243.33333333333334), (u'WA', 525.0), (u'OR', 450.0)]"
1546 | ]
1547 | },
1548 | "execution_count": 72,
1549 | "metadata": {},
1550 | "output_type": "execute_result"
1551 | }
1552 | ],
1553 | "source": [
1554 | "sc.textFile('sales.txt')\\\n",
1555 | " .map(lambda x: x.split())\\\n",
1556 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1557 | " .map(lambda x: (x[-3], (float(x[-1]), 1)))\\\n",
1558 | " .reduceByKey(lambda (amount1, count1), (amount2, count2):\\\n",
1559 | " (amount1 + amount2, count1 + count2))\\\n",
1560 | " .map(lambda (state, (amount, count)): (state, amount / count))\\\n",
1561 | " .collect()"
1562 | ]
1563 | },
1564 | {
1565 | "cell_type": "markdown",
1566 | "metadata": {},
1567 | "source": [
1568 | "- Note the argument unpacking we are doing in `reduceByKey` to name the elements of the tuples."
1569 | ]
1570 | },
1571 | {
1572 | "cell_type": "markdown",
1573 | "metadata": {},
1574 | "source": [
1575 | "- Approach 2: Use `groupByKey`:"
1576 | ]
1577 | },
1578 | {
1579 | "cell_type": "code",
1580 | "execution_count": 73,
1581 | "metadata": {},
1582 | "outputs": [
1583 | {
1584 | "data": {
1585 | "text/plain": [
1586 | "[(u'CA', 243.33333333333334), (u'WA', 525.0), (u'OR', 450.0)]"
1587 | ]
1588 | },
1589 | "execution_count": 73,
1590 | "metadata": {},
1591 | "output_type": "execute_result"
1592 | }
1593 | ],
1594 | "source": [
1595 | "def mean(iterator):\n",
1596 | " total = 0.0; count = 0\n",
1597 | " for x in iterator:\n",
1598 | " total += x; count += 1\n",
1599 | " return total / count\n",
1600 | "\n",
1601 | "sc.textFile('sales.txt')\\\n",
1602 | " .map(lambda x: x.split())\\\n",
1603 | " .filter(lambda x: not x[0].startswith('#'))\\\n",
1604 | " .map(lambda x: (x[-3], float(x[-1])))\\\n",
1605 | " .groupByKey()\\\n",
1606 | " .map(lambda (state, iterator): (state, mean(iterator)))\\\n",
1607 | " .collect()"
1608 | ]
1609 | },
1610 | {
1611 | "cell_type": "markdown",
1612 | "metadata": {},
1613 | "source": [
1614 | "- Note that we are using unpacking again."
1615 | ]
1616 | },
1617 | {
1618 | "cell_type": "markdown",
1619 | "metadata": {},
1620 | "source": [
1621 | "### Pop Quiz\n",
1622 | "\n",
1623 | "\n",
1624 | "Q: What would be the disadvantage of not using unpacking?
\n",
1625 | "1. We will need to drill down into the elements.\n",
1626 | "
\n",
1627 | "2. The code will be harder to read.\n",
1628 | " \n",
1629 | "\n",
1630 | "\n",
1631 | "Q: What are the pros and cons of `reduceByKey` vs. `groupByKey`?
\n",
1632 | "1. `groupByKey` stores the values for particular key as an iterable.\n",
1633 | "
\n",
1634 | "2. This will take up space in memory or on disk.\n",
1635 | "
\n",
1636 | "3. `reduceByKey` therefore is more scalable.\n",
1637 | "
\n",
1638 | "4. However, `groupByKey` does not require associative reducer operation.\n",
1639 | "
\n",
1640 | "5. For this reason `groupByKey` can be easier to program with.\n",
1641 | " "
1642 | ]
1643 | },
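{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a per-key average without materializing all the values the way `groupByKey` does, `aggregateByKey` is another option: it carries a running `(total, count)` per key, which keeps the merge step associative. A minimal sketch, assuming the same `sales.txt` layout as above:\n",
"\n",
"    sc.textFile('sales.txt')\\\n",
"        .map(lambda x: x.split())\\\n",
"        .filter(lambda x: not x[0].startswith('#'))\\\n",
"        .map(lambda x: (x[-3], float(x[-1])))\\\n",
"        .aggregateByKey((0.0, 0),\n",
"            lambda acc, amount: (acc[0] + amount, acc[1] + 1),\n",
"            lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))\\\n",
"        .map(lambda state_acc: (state_acc[0], state_acc[1][0] / state_acc[1][1]))\\\n",
"        .collect()\n",
"\n",
"This should produce the same per-state averages as the two approaches above."
]
},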
1644 | {
1645 | "cell_type": "markdown",
1646 | "metadata": {},
1647 | "source": [
1648 | "## Joins\n",
1649 | "\n",
1650 | "Q: Given a table of employees and locations find the cities that the employees live in.\n",
1651 | "\n",
1652 | "- The easiest way to do this is with a `join`."
1653 | ]
1654 | },
1655 | {
1656 | "cell_type": "code",
1657 | "execution_count": 74,
1658 | "metadata": {},
1659 | "outputs": [
1660 | {
1661 | "data": {
1662 | "text/plain": [
1663 | "[(13, ('Dee', None)),\n",
1664 | " (14, ('Alice', 'SF')),\n",
1665 | " (14, ('Chad', 'SF')),\n",
1666 | " (15, ('Bob', 'Seattle')),\n",
1667 | " (15, ('Jen', 'Seattle'))]"
1668 | ]
1669 | },
1670 | "execution_count": 74,
1671 | "metadata": {},
1672 | "output_type": "execute_result"
1673 | }
1674 | ],
1675 | "source": [
1676 | "# Employees: emp_id, loc_id, name\n",
1677 | "employee_data = [\n",
1678 | " (101, 14, 'Alice'),\n",
1679 | " (102, 15, 'Bob'),\n",
1680 | " (103, 14, 'Chad'),\n",
1681 | " (104, 15, 'Jen'),\n",
1682 | " (105, 13, 'Dee') ]\n",
1683 | "\n",
1684 | "# Locations: loc_id, location\n",
1685 | "location_data = [\n",
1686 | " (14, 'SF'),\n",
1687 | " (15, 'Seattle'),\n",
1688 | " (16, 'Portland')]\n",
1689 | "\n",
1690 | "employees = sc.parallelize(employee_data)\n",
1691 | "locations = sc.parallelize(location_data)\n",
1692 | "\n",
1693 | "# Re-key employee records with loc_id\n",
1694 | "employees2 = employees.map(lambda (emp_id, loc_id, name): (loc_id, name))\n",
1695 | "\n",
1696 | "# Now join.\n",
1697 | "employees2.leftOuterJoin(locations).collect()"
1698 | ]
1699 | },
1700 | {
1701 | "cell_type": "markdown",
1702 | "metadata": {},
1703 | "source": [
1704 | "## Pop Quiz\n",
1705 | "\n",
1706 | "\n",
1707 | "Q: How can we keep employees that don't have a valid location ID in the final result?
\n",
1708 | "1. Use `leftOuterJoin` to keep employees without location IDs.\n",
1709 | "
\n",
1710 | "2. Use `rightOuterJoin` to keep locations without employees. \n",
1711 | "
\n",
1712 | "3. Use `fullOuterJoin` to keep both.\n",
1713 | "
\n",
1714 | " "
1715 | ]
1716 | },
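{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the other join variants, reusing `employees2` and `locations` from above. A plain (inner) `join` drops employees whose `loc_id` has no match, while `fullOuterJoin` keeps unmatched rows from both sides:\n",
"\n",
"    employees2.join(locations).collect()           # Dee (loc_id 13) is dropped\n",
"    employees2.fullOuterJoin(locations).collect()  # keeps Dee and the unused Portland row"
]
},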
1717 | {
1718 | "cell_type": "markdown",
1719 | "metadata": {},
1720 | "source": [
1721 | "## RDD Statistics\n",
1722 | "\n",
1723 | "Q: How would you calculate the mean, variance, and standard deviation of a sample produced by Python's `random` function?\n",
1724 | "\n",
1725 | "- Create an RDD and apply the statistical actions to it."
1726 | ]
1727 | },
1728 | {
1729 | "cell_type": "code",
1730 | "execution_count": 75,
1731 | "metadata": {},
1732 | "outputs": [
1733 | {
1734 | "name": "stdout",
1735 | "output_type": "stream",
1736 | "text": [
1737 | "mean = 0.506333143283\n",
1738 | "variance = 0.0813565893047\n",
1739 | "stdev = 0.285230765004\n"
1740 | ]
1741 | }
1742 | ],
1743 | "source": [
1744 | "n = 1000\n",
1745 | "\n",
1746 | "list = [random.random() for _ in xrange(n)]\n",
1747 | "\n",
1748 | "rdd = sc.parallelize(list)\n",
1749 | "\n",
1750 | "print 'mean =', rdd.mean()\n",
1751 | "print 'variance =', rdd.variance()\n",
1752 | "print 'stdev =', rdd.stdev()"
1753 | ]
1754 | },
1755 | {
1756 | "cell_type": "markdown",
1757 | "metadata": {},
1758 | "source": [
1759 | "## Pop Quiz\n",
1760 | "\n",
1761 | "\n",
1762 | "Q: What requirement does an RDD have to satisfy before you can apply these statistical actions to it? \n",
1763 | "
\n",
1764 | "The RDD must consist of numeric elements.\n",
1765 | " \n",
1766 | "\n",
1767 | "\n",
1768 | "Q: What is the advantage of using Spark vs Numpy to calculate mean or standard deviation?
\n",
1769 | "The calculation is distributed across different machines and will be more scalable.\n",
1770 | " "
1771 | ]
1772 | },
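{
"cell_type": "markdown",
"metadata": {},
"source": [
"Related note: `stats()` computes these summary statistics in a single pass and returns a `StatCounter`. A minimal sketch, reusing the `rdd` from above:\n",
"\n",
"    s = rdd.stats()\n",
"    print s.count(), s.mean(), s.variance(), s.stdev()"
]
},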
1773 | {
1774 | "cell_type": "markdown",
1775 | "metadata": {},
1776 | "source": [
1777 | "## RDD Laziness\n",
1778 | "\n",
1779 | "- Q: What is this Spark job doing?"
1780 | ]
1781 | },
1782 | {
1783 | "cell_type": "code",
1784 | "execution_count": 76,
1785 | "metadata": {},
1786 | "outputs": [
1787 | {
1788 | "name": "stdout",
1789 | "output_type": "stream",
1790 | "text": [
1791 | "CPU times: user 12.7 ms, sys: 5.38 ms, total: 18.1 ms\n",
1792 | "Wall time: 1.26 s\n"
1793 | ]
1794 | },
1795 | {
1796 | "data": {
1797 | "text/plain": [
1798 | "10000000"
1799 | ]
1800 | },
1801 | "execution_count": 76,
1802 | "metadata": {},
1803 | "output_type": "execute_result"
1804 | }
1805 | ],
1806 | "source": [
1807 | "n = 10000000\n",
1808 | "\n",
1809 | "%time sc.parallelize(xrange(n)).map(lambda x: x + 1).count()"
1810 | ]
1811 | },
1812 | {
1813 | "cell_type": "markdown",
1814 | "metadata": {},
1815 | "source": [
1816 | "- Q: How is the following job different from the previous one? How\n",
1817 | " long do you expect it to take?"
1818 | ]
1819 | },
1820 | {
1821 | "cell_type": "code",
1822 | "execution_count": 77,
1823 | "metadata": {},
1824 | "outputs": [
1825 | {
1826 | "name": "stdout",
1827 | "output_type": "stream",
1828 | "text": [
1829 | "CPU times: user 2.52 ms, sys: 1.84 ms, total: 4.36 ms\n",
1830 | "Wall time: 7.46 ms\n"
1831 | ]
1832 | },
1833 | {
1834 | "data": {
1835 | "text/plain": [
1836 | "PythonRDD[210] at RDD at PythonRDD.scala:48"
1837 | ]
1838 | },
1839 | "execution_count": 77,
1840 | "metadata": {},
1841 | "output_type": "execute_result"
1842 | }
1843 | ],
1844 | "source": [
1845 | "%time sc.parallelize(xrange(n)).map(lambda x: x + 1)"
1846 | ]
1847 | },
1848 | {
1849 | "cell_type": "markdown",
1850 | "metadata": {},
1851 | "source": [
1852 | "### Pop Quiz\n",
1853 | "\n",
1854 | "\n",
1855 | "Q: Why did the second job complete so much faster?
\n",
1856 | "1. Because Spark is lazy. \n",
1857 | "
\n",
1858 | "2. Transformations produce new RDDs and do no operations on the data.\n",
1859 | "
\n",
1860 | "3. Nothing happens until an action is applied to an RDD.\n",
1861 | "
\n",
1862 | "4. An RDD is the *recipe* for a transformation, rather than the *result* of the transformation.\n",
1863 | " \n",
1864 | "\n",
1865 | "\n",
1866 | "Q: What is the benefit of keeping the recipe instead of the result of the action?
\n",
1867 | "1. It saves memory.\n",
1868 | "
\n",
1869 | "2. It produces *resilience*. \n",
1870 | "
\n",
1871 | "3. If an RDD loses data on a machine, it always knows how to recompute it.\n",
1872 | " "
1873 | ]
1874 | },
1875 | {
1876 | "cell_type": "markdown",
1877 | "metadata": {},
1878 | "source": [
1879 | "## Writing Data\n",
1880 | "\n",
1881 | "Besides reading data Spark and also write data out to a file system.\n",
1882 | "\n",
1883 | "Q: Calculate the squares of integers from 1 to 10 and write them out to `squares.txt`.\n",
1884 | "\n",
1885 | "- Make sure `squares.txt` does not exist."
1886 | ]
1887 | },
1888 | {
1889 | "cell_type": "code",
1890 | "execution_count": 78,
1891 | "metadata": {
1892 | "collapsed": true
1893 | },
1894 | "outputs": [],
1895 | "source": [
1896 | "!if [ -e squares.txt ] ; then rm -rf squares.txt ; fi"
1897 | ]
1898 | },
1899 | {
1900 | "cell_type": "markdown",
1901 | "metadata": {},
1902 | "source": [
1903 | "- Create the RDD and then save it to `squares.txt`."
1904 | ]
1905 | },
1906 | {
1907 | "cell_type": "code",
1908 | "execution_count": 79,
1909 | "metadata": {
1910 | "collapsed": true
1911 | },
1912 | "outputs": [],
1913 | "source": [
1914 | "rdd1 = sc.parallelize(xrange(10))\n",
1915 | "rdd2 = rdd1.map(lambda x: x ** 2)\n",
1916 | "rdd2.saveAsTextFile('squares.txt')"
1917 | ]
1918 | },
1919 | {
1920 | "cell_type": "markdown",
1921 | "metadata": {},
1922 | "source": [
1923 | "- Now look at the output."
1924 | ]
1925 | },
1926 | {
1927 | "cell_type": "code",
1928 | "execution_count": 80,
1929 | "metadata": {},
1930 | "outputs": [
1931 | {
1932 | "name": "stdout",
1933 | "output_type": "stream",
1934 | "text": [
1935 | "cat: squares.txt: Is a directory\r\n"
1936 | ]
1937 | }
1938 | ],
1939 | "source": [
1940 | "!cat squares.txt"
1941 | ]
1942 | },
1943 | {
1944 | "cell_type": "markdown",
1945 | "metadata": {},
1946 | "source": [
1947 | "- Looks like the output is a directory."
1948 | ]
1949 | },
1950 | {
1951 | "cell_type": "code",
1952 | "execution_count": 81,
1953 | "metadata": {},
1954 | "outputs": [
1955 | {
1956 | "name": "stdout",
1957 | "output_type": "stream",
1958 | "text": [
1959 | "total 32\r\n",
1960 | "-rw-r--r-- 1 koyuki.nakamori staff 0 Jun 8 22:13 _SUCCESS\r\n",
1961 | "-rw-r--r-- 1 koyuki.nakamori staff 4 Jun 8 22:13 part-00000\r\n",
1962 | "-rw-r--r-- 1 koyuki.nakamori staff 7 Jun 8 22:13 part-00001\r\n",
1963 | "-rw-r--r-- 1 koyuki.nakamori staff 6 Jun 8 22:13 part-00002\r\n",
1964 | "-rw-r--r-- 1 koyuki.nakamori staff 9 Jun 8 22:13 part-00003\r\n"
1965 | ]
1966 | }
1967 | ],
1968 | "source": [
1969 | "!ls -l squares.txt"
1970 | ]
1971 | },
1972 | {
1973 | "cell_type": "markdown",
1974 | "metadata": {},
1975 | "source": [
1976 | "- Lets take a look at the files."
1977 | ]
1978 | },
1979 | {
1980 | "cell_type": "code",
1981 | "execution_count": 82,
1982 | "metadata": {},
1983 | "outputs": [
1984 | {
1985 | "name": "stdout",
1986 | "output_type": "stream",
1987 | "text": [
1988 | "squares.txt/part-00000\r\n",
1989 | "0\r\n",
1990 | "1\r\n",
1991 | "squares.txt/part-00001\r\n",
1992 | "4\r\n",
1993 | "9\r\n",
1994 | "16\r\n",
1995 | "squares.txt/part-00002\r\n",
1996 | "25\r\n",
1997 | "36\r\n",
1998 | "squares.txt/part-00003\r\n",
1999 | "49\r\n",
2000 | "64\r\n",
2001 | "81\r\n"
2002 | ]
2003 | }
2004 | ],
2005 | "source": [
2006 | "!for i in squares.txt/part-*; do echo $i; cat $i; done"
2007 | ]
2008 | },
2009 | {
2010 | "cell_type": "markdown",
2011 | "metadata": {},
2012 | "source": [
2013 | "### Pop Quiz\n",
2014 | "\n",
2015 | "\n",
2016 | "Q: What's going on? Why are there four files (excluding `_SUCCESS`) in the output directory?
\n",
2017 | "1. There were four threads that were processing the RDD.\n",
2018 | "
\n",
2019 | "2. The RDD was split up in four partitions (default with local mode: number of cores on the local machine).\n",
2020 | "
\n",
2021 | "3. Each partition was processed in a different task.\n",
2022 | " "
2023 | ]
2024 | },
2025 | {
2026 | "cell_type": "markdown",
2027 | "metadata": {},
2028 | "source": [
2029 | "## Partitions\n",
2030 | "\n",
2031 | "Q: Can we control the number of partitions/tasks that Spark uses for processing data? Solve the same problem as above but this time with 5 tasks.\n",
2032 | "\n",
2033 | "- Make sure `squares.txt` does not exist."
2034 | ]
2035 | },
2036 | {
2037 | "cell_type": "code",
2038 | "execution_count": 83,
2039 | "metadata": {
2040 | "collapsed": true
2041 | },
2042 | "outputs": [],
2043 | "source": [
2044 | "!if [ -e squares.txt ] ; then rm -rf squares.txt ; fi"
2045 | ]
2046 | },
2047 | {
2048 | "cell_type": "markdown",
2049 | "metadata": {},
2050 | "source": [
2051 | "- Create the RDD and then save it to `squares.txt`."
2052 | ]
2053 | },
2054 | {
2055 | "cell_type": "code",
2056 | "execution_count": 84,
2057 | "metadata": {
2058 | "collapsed": true
2059 | },
2060 | "outputs": [],
2061 | "source": [
2062 | "partitions = 5\n",
2063 | "rdd1 = sc.parallelize(xrange(10), partitions)\n",
2064 | "rdd2 = rdd1.map(lambda x: x ** 2)\n",
2065 | "rdd2.saveAsTextFile('squares.txt')"
2066 | ]
2067 | },
2068 | {
2069 | "cell_type": "markdown",
2070 | "metadata": {},
2071 | "source": [
2072 | "- Now look at the output."
2073 | ]
2074 | },
2075 | {
2076 | "cell_type": "code",
2077 | "execution_count": 85,
2078 | "metadata": {},
2079 | "outputs": [
2080 | {
2081 | "name": "stdout",
2082 | "output_type": "stream",
2083 | "text": [
2084 | "total 40\n",
2085 | "-rw-r--r-- 1 koyuki.nakamori staff 0 Jun 8 22:13 _SUCCESS\n",
2086 | "-rw-r--r-- 1 koyuki.nakamori staff 4 Jun 8 22:13 part-00000\n",
2087 | "-rw-r--r-- 1 koyuki.nakamori staff 4 Jun 8 22:13 part-00001\n",
2088 | "-rw-r--r-- 1 koyuki.nakamori staff 6 Jun 8 22:13 part-00002\n",
2089 | "-rw-r--r-- 1 koyuki.nakamori staff 6 Jun 8 22:13 part-00003\n",
2090 | "-rw-r--r-- 1 koyuki.nakamori staff 6 Jun 8 22:13 part-00004\n",
2091 | "squares.txt/part-00000\n",
2092 | "0\n",
2093 | "1\n",
2094 | "squares.txt/part-00001\n",
2095 | "4\n",
2096 | "9\n",
2097 | "squares.txt/part-00002\n",
2098 | "16\n",
2099 | "25\n",
2100 | "squares.txt/part-00003\n",
2101 | "36\n",
2102 | "49\n",
2103 | "squares.txt/part-00004\n",
2104 | "64\n",
2105 | "81\n"
2106 | ]
2107 | }
2108 | ],
2109 | "source": [
2110 | "!ls -l squares.txt\n",
2111 | "\n",
2112 | "!for i in squares.txt/part-*; do echo $i; cat $i; done"
2113 | ]
2114 | },
2115 | {
2116 | "cell_type": "markdown",
2117 | "metadata": {},
2118 | "source": [
2119 | "### Pop Quiz\n",
2120 | "\n",
2121 | "\n",
2122 | "Q: How many partitions does Spark use by default?
\n",
2123 | "1. For operations like parallelize, it depends on the cluster manager:\n",
2124 | "
\n",
2125 | " - Local mode: number of cores on the local machine\n",
2126 | "
\n",
2127 | " - Others: total number of cores on all executor nodes or 2, whichever is larger\n",
2128 | "
\n",
2129 | "2. If you read an HDFS file into an RDD, Spark uses one partition per block.\n",
2130 | "
\n",
2131 | "3. If you read a file into an RDD from S3 or some other source, Spark uses 1 partition per 32 MB of data.\n",
2132 | " \n",
2133 | "\n",
2134 | "\n",
2135 | "Q: If I read a file that is 200 MB into an RDD, how many partitions will that have?
\n",
2136 | "1. If the file is on HDFS that will produce 2 partitions (each is 128 MB).\n",
2137 | "
\n",
2138 | "2. If the file is on S3 or some other file system it will produce 7 partitions.\n",
2139 | "
\n",
2140 | "3. You can also control the number of partitions by passing in an additional argument into `textFile`.\n",
2141 | " "
2142 | ]
2143 | },
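{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can check the partition count directly with `getNumPartitions()`. A minimal sketch; the exact numbers depend on your machine and file sizes:\n",
"\n",
"    sc.parallelize(xrange(10)).getNumPartitions()     # local mode: number of cores\n",
"    sc.parallelize(xrange(10), 5).getNumPartitions()  # 5\n",
"    sc.textFile('sales.txt', 8).getNumPartitions()    # at least 8; the second argument is a minimum"
]
},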
2144 | {
2145 | "cell_type": "markdown",
2146 | "metadata": {},
2147 | "source": [
2148 | "## Spark Terminology\n",
2149 | "\n",
2150 | "
\n",
2151 | "\n",
2152 | "Term | Meaning\n",
2153 | "--- |---\n",
2154 | "Task | Single thread in an executor\n",
2155 | "Partition | Data processed by a single task\n",
2156 | "Record | Records make up a partition that is processed by a single task\n",
2157 | "\n",
2158 | "### Notes\n",
2159 | "\n",
2160 | "- Every Spark application gets executors when you create a new `SparkContext`.\n",
2161 | "- You can specify how many cores to assign to each executor.\n",
2162 | "- A core is equivalent to a thread.\n",
2163 | "- The number of cores determine how many tasks can run concurrently on an executor.\n",
2164 | "- Each task corresponds to one partition."
2165 | ]
2166 | },
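{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the executor and core counts might be specified when submitting to a cluster manager such as YARN (`spark.executor.instances` is YARN-specific; the app name here is just a placeholder). In local mode the thread count comes from the master URL instead, e.g. `local[4]`:\n",
"\n",
"    from pyspark import SparkConf, SparkContext\n",
"\n",
"    conf = SparkConf()\\\n",
"        .setAppName('terminology-demo')\\\n",
"        .set('spark.executor.cores', '2')\\\n",
"        .set('spark.executor.instances', '2')\n",
"    # sc = SparkContext(conf=conf)   # only one SparkContext per application"
]
},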
2167 | {
2168 | "cell_type": "markdown",
2169 | "metadata": {},
2170 | "source": [
2171 | "## Pop Quiz\n",
2172 | "\n",
2173 | "\n",
2174 | "\n",
2175 | "Q: Suppose you have 2 executors, each with 2 cores--so a total of 4\n",
2176 | "cores. And you start a Spark job with 8 partitions. How many tasks\n",
2177 | "will run concurrently?\n",
2178 | "
\n",
2179 | "4 tasks will execute concurrently.\n",
2180 | " \n",
2181 | "\n",
2182 | "\n",
2183 | "Q: What happens to the other partitions?
\n",
2184 | "1. The other partitions wait in queue until a task thread becomes available.\n",
2185 | "
\n",
2186 | "2. Think of cores as turnstile gates at a train station, and partitions as people.\n",
2187 | "
\n",
2188 | "3. The number of turnstiles determine how many people can get through at once.\n",
2189 | " \n",
2190 | "\n",
2191 | "\n",
2192 | "Q: How many Spark jobs can you have in a Spark application?
\n",
2193 | "As many as you want.\n",
2194 | " \n",
2195 | "\n",
2196 | "\n",
2197 | "Q: How many Spark applications and Spark jobs are in this IPython Notebook?
\n",
2198 | "1. There is one Spark application because there is one `SparkContext`.\n",
2199 | "
\n",
2200 | "2. There are as many Spark jobs as we have invoked actions on RDDs.\n",
2201 | " "
2202 | ]
2203 | },
2204 | {
2205 | "cell_type": "markdown",
2206 | "metadata": {},
2207 | "source": [
2208 | "## Stock Quotes\n",
2209 | "\n",
2210 | "Q: Find the date on which AAPL's stock price was the highest.\n",
2211 | "\n",
2212 | "Suppose you have stock market data from Yahoo! for AAPL from . The data is in CSV format and has these values.\n",
2213 | "\n",
2214 | "Date | Open | High | Low | Close | Volume | Adj Close\n",
2215 | "---- | --- | --- | --- | ----- |------ | ---\n",
2216 | "11-18-2014 | 113.94 | 115.69 | 113.89 | 115.47 | 44,200,30 | 115.47\n",
2217 | "11-17-2014 | 114.27 | 117.28 | 113.30 | 113.99 | 46,746,700 | 113.99\n",
2218 | "\n",
2219 | "Here is what the CSV looks like:\n",
2220 | " \n",
2221 | " csv = [\n",
2222 | " \"#Date,Open,High,Low,Close,Volume,Adj Close\\n\",\n",
2223 | " \"2014-11-18,113.94,115.69,113.89,115.47,44200300,115.47\\n\",\n",
2224 | " \"2014-11-17,114.27,117.28,113.30,113.99,46746700,113.99\\n\",\n",
2225 | " ]\n",
2226 | "\n",
2227 | "Lets find the date on which the price was the highest. \n",
2228 | "\n",
2229 | "\n",
2230 | "Q: What two fields do we need to extract?
\n",
2231 | "1. *Date* and *Adj Close*.\n",
2232 | "
\n",
2233 | "2. We want to use *Adj Close* instead of *High* so our calculation is not affected by stock splits.\n",
2234 | " \n",
2235 | "\n",
2236 | "Q: What field should we sort on?
\n",
2237 | "*Adj Close*\n",
2238 | " \n",
2239 | "\n",
2240 | "\n",
2241 | "Q: What sequence of operations would we need to perform?
\n",
2242 | "1. Use `filter` to remove the header line.\n",
2243 | "
\n",
2244 | "2. Use `map` to split each row into fields.\n",
2245 | "
\n",
2246 | "3. Use `map` to extract *Adj Close* and *Date*.\n",
2247 | "
\n",
2248 | "4. Use `sortBy` to sort descending on *Adj Close*.\n",
2249 | "
\n",
2250 | "5. Use `take(1)` to get the highest value.\n",
2251 | " \n",
2252 | "\n",
2253 | "- Here is full source."
2254 | ]
2255 | },
2256 | {
2257 | "cell_type": "code",
2258 | "execution_count": 86,
2259 | "metadata": {},
2260 | "outputs": [
2261 | {
2262 | "data": {
2263 | "text/plain": [
2264 | "[(115.47, '2014-11-18')]"
2265 | ]
2266 | },
2267 | "execution_count": 86,
2268 | "metadata": {},
2269 | "output_type": "execute_result"
2270 | }
2271 | ],
2272 | "source": [
2273 | "csv = [\n",
2274 | " \"#Date,Open,High,Low,Close,Volume,Adj Close\\n\",\n",
2275 | " \"2014-11-18,113.94,115.69,113.89,115.47,44200300,115.47\\n\",\n",
2276 | " \"2014-11-17,114.27,117.28,113.30,113.99,46746700,113.99\\n\",\n",
2277 | "]\n",
2278 | "\n",
2279 | "sc.parallelize(csv) \\\n",
2280 | " .filter(lambda line: not line.startswith(\"#\")) \\\n",
2281 | " .map(lambda line: line.split(\",\")) \\\n",
2282 | " .map(lambda fields: (float(fields[-1]), fields[0])) \\\n",
2283 | " .sortBy(lambda (close, date): close, ascending = False) \\\n",
2284 | " .take(1)"
2285 | ]
2286 | },
2287 | {
2288 | "cell_type": "markdown",
2289 | "metadata": {},
2290 | "source": [
2291 | "- Here is the program for finding the high of any stock that stores\n",
2292 | " the data in memory."
2293 | ]
2294 | },
2295 | {
2296 | "cell_type": "code",
2297 | "execution_count": 89,
2298 | "metadata": {},
2299 | "outputs": [],
2300 | "source": [
2301 | "# import urllib2\n",
2302 | "# import re\n",
2303 | "\n",
2304 | "# def get_stock_high(symbol):\n",
2305 | " \n",
2306 | "# url = 'http://real-chart.finance.yahoo.com/table.csv?s=' + symbol + '&g=d&ignore=.csv'\n",
2307 | "# csv = urllib2.urlopen(url).read()\n",
2308 | "# csv_lines = csv.split('\\n')\n",
2309 | "\n",
2310 | "# #print csv_lines\n",
2311 | "# stock_rdd = sc.parallelize(csv_lines)\\\n",
2312 | "# .filter(lambda line: re.match(r'\\d', line))\\\n",
2313 | "# .map(lambda line: line.split(\",\"))\\\n",
2314 | "# .map(lambda fields: (float(fields[-1]), fields[0]))\\\n",
2315 | "# .sortBy(lambda (close, date): close, ascending = False)\n",
2316 | "\n",
2317 | "# return stock_rdd.take(1)\n",
2318 | "\n",
2319 | "# get_stock_high('AAPL')"
2320 | ]
2321 | },
2322 | {
2323 | "cell_type": "markdown",
2324 | "metadata": {},
2325 | "source": [
2326 | "### Notes\n",
2327 | "\n",
2328 | "- Spark is high-level like Hive and Pig.\n",
2329 | "- At the same time it does not invent a new language.\n",
2330 | "- This allows it to leverage the ecosystem of tools that Python, Scala, and Java provide."
2331 | ]
2332 | },
2333 | {
2334 | "cell_type": "markdown",
2335 | "metadata": {},
2336 | "source": [
2337 | "## RDD Caching\n",
2338 | "\n",
2339 | "- Consider this Spark job."
2340 | ]
2341 | },
2342 | {
2343 | "cell_type": "code",
2344 | "execution_count": null,
2345 | "metadata": {
2346 | "collapsed": true
2347 | },
2348 | "outputs": [],
2349 | "source": [
2350 | "n = 500000\n",
2351 | "numbers = [random.random() for _ in xrange(n)]\n",
2352 | "rdd1 = sc.parallelize(numbers)\n",
2353 | "rdd2 = rdd1.sortBy(lambda number: number)"
2354 | ]
2355 | },
2356 | {
2357 | "cell_type": "markdown",
2358 | "metadata": {},
2359 | "source": [
2360 | "- Lets time running `count()` on `rdd2`."
2361 | ]
2362 | },
2363 | {
2364 | "cell_type": "code",
2365 | "execution_count": null,
2366 | "metadata": {},
2367 | "outputs": [],
2368 | "source": [
2369 | "%time rdd2.count()\n",
2370 | "%time rdd2.count()\n",
2371 | "%time rdd2.count()"
2372 | ]
2373 | },
2374 | {
2375 | "cell_type": "markdown",
2376 | "metadata": {},
2377 | "source": [
2378 | "- The RDD does no work until an action is called. And then when an action is called it figures out the answer and then throws away all the data.\n",
2379 | "- If you have an RDD that you are going to reuse in your computation you can use `cache()` to make Spark cache the RDD.\n",
2380 | "\n",
2381 | "- Let's cache it and try again."
2382 | ]
2383 | },
2384 | {
2385 | "cell_type": "code",
2386 | "execution_count": 88,
2387 | "metadata": {},
2388 | "outputs": [
2389 | {
2390 | "name": "stdout",
2391 | "output_type": "stream",
2392 | "text": [
2393 | "CPU times: user 6.19 ms, sys: 2.38 ms, total: 8.58 ms\n",
2394 | "Wall time: 147 ms\n",
2395 | "CPU times: user 4.99 ms, sys: 1.8 ms, total: 6.79 ms\n",
2396 | "Wall time: 45.5 ms\n",
2397 | "CPU times: user 6.23 ms, sys: 1.96 ms, total: 8.18 ms\n",
2398 | "Wall time: 53.8 ms\n"
2399 | ]
2400 | },
2401 | {
2402 | "data": {
2403 | "text/plain": [
2404 | "10"
2405 | ]
2406 | },
2407 | "execution_count": 88,
2408 | "metadata": {},
2409 | "output_type": "execute_result"
2410 | }
2411 | ],
2412 | "source": [
2413 | "rdd2.cache()\n",
2414 | "\n",
2415 | "%time rdd2.count()\n",
2416 | "%time rdd2.count()\n",
2417 | "%time rdd2.count()"
2418 | ]
2419 | },
2420 | {
2421 | "cell_type": "markdown",
2422 | "metadata": {},
2423 | "source": [
2424 | "- Caching the RDD speeds up the job because the RDD does not have to be computed from scratch again."
2425 | ]
2426 | },
2427 | {
2428 | "cell_type": "markdown",
2429 | "metadata": {},
2430 | "source": [
2431 | "### Notes\n",
2432 | "\n",
2433 | "- Calling `cache()` flips a flag on the RDD. \n",
2434 | "- The data is not cached until an action is called.\n",
2435 | "- You can uncache an RDD using `unpersist()`."
2436 | ]
2437 | },
2438 | {
2439 | "cell_type": "markdown",
2440 | "metadata": {},
2441 | "source": [
2442 | "### Pop Quiz\n",
2443 | "\n",
2444 | "\n",
2445 | "Q: Will `unpersist` uncache the RDD immediately or does it wait for an action?
\n",
2446 | "A: It unpersists immediately.\n",
2447 | " "
2448 | ]
2449 | },
2450 | {
2451 | "cell_type": "markdown",
2452 | "metadata": {},
2453 | "source": [
2454 | "## Caching and Persistence\n",
2455 | "\n",
2456 | "Q: Persist RDD to disk instead of caching it in memory.\n",
2457 | "- You can cache RDDs at different levels.\n",
2458 | "- Here is an example."
2459 | ]
2460 | },
2461 | {
2462 | "cell_type": "code",
2463 | "execution_count": 49,
2464 | "metadata": {},
2465 | "outputs": [
2466 | {
2467 | "data": {
2468 | "text/plain": [
2469 | "PythonRDD[190] at RDD at PythonRDD.scala:48"
2470 | ]
2471 | },
2472 | "execution_count": 49,
2473 | "metadata": {},
2474 | "output_type": "execute_result"
2475 | }
2476 | ],
2477 | "source": [
2478 | "rdd = sc.parallelize(xrange(100))\n",
2479 | "rdd.persist(pyspark.StorageLevel.DISK_ONLY)"
2480 | ]
2481 | },
2482 | {
2483 | "cell_type": "markdown",
2484 | "metadata": {},
2485 | "source": [
2486 | "### Pop Quiz\n",
2487 | "\n",
2488 | "\n",
2489 | "Q: Will the RDD be stored on disk at this point?
\n",
2490 | "A: No. It will get stored after we call an action.\n",
2491 | " "
2492 | ]
2493 | },
2494 | {
2495 | "cell_type": "markdown",
2496 | "metadata": {},
2497 | "source": [
2498 | "## Persistence Levels\n",
2499 | "\n",
2500 | "Level | Meaning\n",
2501 | "--- | ---\n",
2502 | "`MEMORY_ONLY` | Same as `cache()`\n",
2503 | "`MEMORY_AND_DISK` | Cache in memory then overflow to disk\n",
2504 | "`MEMORY_AND_DISK_SER` | Like above; in cache keep objects serialized instead of live \n",
2505 | "`DISK_ONLY` | Cache to disk not to memory\n",
2506 | "\n",
2507 | "### Notes\n",
2508 | "\n",
2509 | "- `MEMORY_AND_DISK_SER` is a good compromise between the levels. \n",
2510 | "- Fast, but not too expensive.\n",
2511 | "- Make sure you unpersist when you don't need the RDD any more."
2512 | ]
2513 | },
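{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the full persist/unpersist lifecycle, assuming `pyspark` has been imported as in the `DISK_ONLY` example above:\n",
"\n",
"    rdd = sc.parallelize(xrange(1000))\n",
"    rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK)\n",
"    rdd.count()       # the data is materialized and stored on the first action\n",
"    rdd.unpersist()   # release the storage once the RDD is no longer needed"
]
},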
2514 | {
2515 | "cell_type": "markdown",
2516 | "metadata": {},
2517 | "source": [
2518 | "## Spark MLlib\n",
2519 | "\n",
2520 | "- MLlib is Spark’s machine learning (ML) library.\n",
2521 | "- Its goal is to make practical machine learning scalable and easy.\n",
2522 | "- It consists of common learning algorithms, including:\n",
2523 | " - Classification/Regression\n",
2524 | " - Logistic Regression, Support vector machine (SVM), Naive Bayes, Gradient Boosted Trees, Random Forests, Multilayer Perceptron (e.g., a neural network), Generalized linear regression (GLM)\n",
2525 | " - Recommenders/Collaborative Filtering\n",
2526 | " - Non-negative matrix factorization (NMF)\n",
2527 | " - Decomposition\n",
2528 | " - Singular value decomposition (SVD), (Principal component analysis)\n",
2529 | " - Clustering\n",
2530 | " - K-Means, Latent Dirichlet allocation (LDA)"
2531 | ]
2532 | },
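{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of calling one of these algorithms from PySpark (K-Means from the RDD-based `pyspark.mllib` API, on a tiny made-up dataset):\n",
"\n",
"    from pyspark.mllib.clustering import KMeans\n",
"\n",
"    points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])\n",
"    model = KMeans.train(points, k=2, maxIterations=10)\n",
"    model.predict([0.5, 0.5])   # cluster id for a new point"
]
},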
2533 | {
2534 | "cell_type": "markdown",
2535 | "metadata": {},
2536 | "source": [
2537 | "### Misc\n",
2538 | "\n",
2539 | "- *\"s3:\" URLs break when Secret Key contains a slash, even if encoded* "
2540 | ]
2541 | }
2542 | ],
2543 | "metadata": {
2544 | "anaconda-cloud": {},
2545 | "kernelspec": {
2546 | "display_name": "Python 2",
2547 | "language": "python",
2548 | "name": "python2"
2549 | },
2550 | "language_info": {
2551 | "codemirror_mode": {
2552 | "name": "ipython",
2553 | "version": 2
2554 | },
2555 | "file_extension": ".py",
2556 | "mimetype": "text/x-python",
2557 | "name": "python",
2558 | "nbconvert_exporter": "python",
2559 | "pygments_lexer": "ipython2",
2560 | "version": "2.7.13"
2561 | }
2562 | },
2563 | "nbformat": 4,
2564 | "nbformat_minor": 1
2565 | }
2566 |
--------------------------------------------------------------------------------