├── .gitignore
├── README.md
└── Scikit-Learn-To-Spark-ML.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 |
5 | # C extensions
6 | *.so
7 |
8 | # Distribution / packaging
9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | *.egg-info/
23 | .installed.cfg
24 | *.egg
25 |
26 | # PyInstaller
27 | # Usually these files are written by a python script from a template
28 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
29 | *.manifest
30 | *.spec
31 |
32 | # Installer logs
33 | pip-log.txt
34 | pip-delete-this-directory.txt
35 |
36 | # Unit test / coverage reports
37 | htmlcov/
38 | .tox/
39 | .coverage
40 | .coverage.*
41 | .cache
42 | nosetests.xml
43 | coverage.xml
44 | *,cover
45 |
46 | # Translations
47 | *.mo
48 | *.pot
49 |
50 | # Django stuff:
51 | *.log
52 |
53 | # Sphinx documentation
54 | docs/_build/
55 |
56 | # PyBuilder
57 | target/
58 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # scikit-learn-to-spark-ml
2 | Notebook comparing scikit-learn and Spark ML for building Machine Learning Pipelines
3 |
4 | Tested with Spark 1.5.0
5 |
6 | More explanations (in French) here: http://blog.xebia.fr/2015/10/08/from-scikit-learn-to-spark-ml/
7 |
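 8 | To launch the notebook with PySpark in local mode, run the command suggested in the notebook itself (adjust to your Spark installation if needed):
 9 | 
10 | ```
11 | IPYTHON_OPTS="notebook" pyspark --master local[*]
12 | ```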
--------------------------------------------------------------------------------
/Scikit-Learn-To-Spark-ML.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat_minor": 0, "cells": [{"source": "#From scikit-learn to Spark ML\n----\n*****", "cell_type": "markdown", "metadata": {}}, {"source": "To launch a notebook with pyspark (in local mode), run the following command in your shell: \n\nIPYTHON_OPTS=\"notebook\" pyspark --master local[*]", "cell_type": "markdown", "metadata": {}}, {"source": "##### Goals\nThis notebook aims at demonstrating the similarities and main differencies between two powerful Machine Learning libraries: scikit-learn and Sparl ML. \n\nThe main objective behind this is to show the simplicity of moving from scikit-learn to Spark ML when working on a bigger range of data to train and use Machine Learning workflows.\n\nAs we will see, Spark ML is mainly inspired from scikit-learn's structure, so the scikit-learn user will easily be able to use Spark ML API when working on Big Data workflows is needed.\n\n##### Structure of the notebook\nIn order to explain and present the main concepts behind both libraries, we will go through a complete example to build an entire Machine Learning workflow, and present the code for both scikit-learn and Spark ML at every step.\n\n##### Dataset\nWe will work on the dataset 20 NewsGroup, which gathers comments about news documents, grouped in several topics (politics, sports, science, etc.). This example is drawn from one of the scikit-learn's tutorial on text data: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset.", "cell_type": "markdown", "metadata": {}}, {"source": "##Initial Configurations", "cell_type": "markdown", "metadata": {}}, {"execution_count": 1, "cell_type": "code", "source": "from pyspark.sql import SQLContext\nsqlContext = SQLContext(sc)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "##Initial example\n----", "cell_type": "markdown", "metadata": {}}, {"source": "Let's start with a very simple example to compare the use of scikit-learn and Spark ML. There is the same notion of Estimator/Transformer, and the way to use them is also the same. Two main differences though:\n- In scikit-learn, even a Transformer has the structure of an Estimator, with a fit() method that does nothing.\n- The result of the transformation is generally a vector in scikit-learn, whereas it is another DataFrame in Spark ML.", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"execution_count": 2, "cell_type": "code", "source": "import pandas as pd\nfrom sklearn.datasets import load_iris\ndata = pd.DataFrame(data=load_iris().data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 3, "cell_type": "code", "source": "data.head()", "outputs": [{"execution_count": 3, "output_type": "execute_result", "data": {"text/plain": " sepal_length sepal_width petal_length petal_width\n0 5.1 3.5 1.4 0.2\n1 4.9 3.0 1.4 0.2\n2 4.7 3.2 1.3 0.2\n3 4.6 3.1 1.5 0.2\n4 5.0 3.6 1.4 0.2", "text/html": "
"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 4, "cell_type": "code", "source": "from sklearn.preprocessing import Binarizer\nbinarizer = Binarizer(threshold=5)\nbinarizer.fit_transform(data.sepal_length)", "outputs": [{"execution_count": 4, "output_type": "execute_result", "data": {"text/plain": "array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,\n 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 0.,\n 0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0., 0.,\n 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1.,\n 1., 1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1.,\n 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n 1., 1., 1., 1., 1., 1., 1.]])"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {}}, {"execution_count": 5, "cell_type": "code", "source": "df = sqlContext.createDataFrame(data)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 6, "cell_type": "code", "source": "from pyspark.ml.feature import Binarizer\nbinarizer = Binarizer(threshold=5.0, inputCol='sepal_length', outputCol='sepal_length_bin')\nbinarizer.transform(df).show(5)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+------------+-----------+------------+-----------+----------------+\n|sepal_length|sepal_width|petal_length|petal_width|sepal_length_bin|\n+------------+-----------+------------+-----------+----------------+\n| 5.1| 3.5| 1.4| 0.2| 1.0|\n| 4.9| 3.0| 1.4| 0.2| 0.0|\n| 4.7| 3.2| 1.3| 0.2| 0.0|\n| 4.6| 3.1| 1.5| 0.2| 0.0|\n| 5.0| 3.6| 1.4| 0.2| 0.0|\n+------------+-----------+------------+-----------+----------------+\nonly showing top 5 rows\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "##Load Newsgroup Data\n----", "cell_type": "markdown", "metadata": {}}, {"source": "Let's now work on the 20 NewsGroup dataset and prepare the data in both libraries.", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"source": "There is a scikit-learn loader for this dataset. 
We will convert the data and the target to pandas DataFrames.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 7, "cell_type": "code", "source": "# Import data\nfrom sklearn.datasets import fetch_20newsgroups\ncategories = ['rec.autos', 'rec.sport.baseball', 'comp.graphics', 'comp.sys.mac.hardware', \n 'sci.space', 'sci.crypt', 'talk.politics.guns', 'talk.religion.misc']\nnewsgroup = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)\nprint newsgroup.data[0]\n\n# Create pandas DataFrames for values and targets\nimport pandas as pd\npdf_newsgroup = pd.DataFrame(data=newsgroup.data, columns=['news']) # Texts\npdf_newsgroup_target = pd.DataFrame(data=newsgroup.target, columns=['target']) # Targets", "outputs": [{"output_type": "stream", "name": "stdout", "text": "From: alizard@tweekco.uucp (A.Lizard)\nSubject: Re: OTO, the Ancient Order of Oriental Templars\nOrganization: Tweek-Com Systems BBS, Moraga, CA (510) 631-0615\nLines: 18\n\nThyagi@cup.portal.com (Thyagi Morgoth NagaSiva) writes:\n\n> \"This organization is known at the present time as the Ancient\n> Order of Oriental Templars. Ordo Templi Orientis. Otherwise:\n> The Hermetic Brotherhood of Light.\n> \nDoes this organization have an official e-mail address these\ndays? (an address for any of the SF Bay Area Lodges, e.g. Thelema\nwould do.)\n 93...\n A.Lizard\n\n-------------------------------------------------------------------\nA.Lizard Internet Addresses:\nalizard%tweekco%boo@PacBell.COM (preferred)\nPacBell.COM!boo!tweekco!alizard (bang path for above)\nalizard@gentoo.com (backup)\nPGP2.2 public key available on request\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {}}, {"source": "In Spark ML, one often gathers all the information (data and targets) into the same DataFrame. We will therefore create a single Spark DataFrame by concatenating the two previous pandas DataFrames.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 8, "cell_type": "code", "source": "from pyspark.sql import SQLContext\nsqlContext = SQLContext(sc)\ndf_newsgroup = sqlContext.createDataFrame(pd.concat([pdf_newsgroup, pdf_newsgroup_target], axis=1))\ndf_newsgroup.printSchema()\ndf_newsgroup.show(3)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "root\n |-- news: string (nullable = true)\n |-- target: long (nullable = true)\n\n+--------------------+------+\n| news|target|\n+--------------------+------+\n|From: alizard@twe...| 7|\n|From: djk@ccwf.cc...| 1|\n|From: rgonzal@gan...| 1|\n+--------------------+------+\nonly showing top 3 rows\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "##Train-Test Split\n----", "cell_type": "markdown", "metadata": {}}, {"source": "Train-Test split is a common operation in Machine Learning. It means that we hold out some of the available data in a test set and treat it as if it were new data. 
The Machine Learning algorithm will be trained on the remaining training set, and the test set will be used to compare the predictions made on it to the ground truth, in order to measure the generalization capacity of the algorithm (the ability to adapt to new data and not only to the data used for training).", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"source": "In scikit-learn, a Train-Test split is simply done with the function train_test_split from the cross_validation package.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 9, "cell_type": "code", "source": "from sklearn.cross_validation import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(newsgroup.data, newsgroup.target, train_size=0.8, random_state=42)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {}}, {"source": "In Spark SQL, a more general method named randomSplit allows any DataFrame to be split randomly according to the desired proportions. There is no need to separate the data from the target: both are kept in the same DataFrame.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 10, "cell_type": "code", "source": "(df_train, df_test) = df_newsgroup.randomSplit([0.8, 0.2])", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, 
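{"source": "A quick sanity check on the split sizes (cell added for illustration): scikit-learn's train_test_split produces an exact 80/20 split, while Spark's randomSplit samples each row independently and only gives approximately the requested proportions.", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "# Added for illustration: compare the sizes of the two splits\n# train_test_split gives exact proportions, randomSplit approximate ones\nprint len(X_train), len(X_test)\nprint df_train.count(), df_test.count()", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, 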
{"source": "##Feature engineering\n----", "cell_type": "markdown", "metadata": {}}, {"source": "Feature Engineering covers all the operations performed on the data to transform, extract and select features, in order to capture as much information as possible and optimize the performance of Machine Learning algorithms.\n\nSince most algorithms take numerical data as input, we need in our case to extract knowledge from the text data and convert it into numerical features. Here are the transformations we are going to perform:\n- Tokenizing: Transform a text into a list of words\n- Term Frequency: The more frequent a term is, the more likely it is to carry useful information about the text (unless it is a stop-word).\n- Inverse Document Frequency: If a term appears in most of the documents, it is unlikely to be helpful to distinguish and classify them.\n\nIn both scikit-learn and Spark ML, there are objects to perform these transformations:\n- CountVectorizer and TfidfTransformer in scikit-learn\n- Tokenizer, HashingTF and IDF in Spark ML\n\nIn both cases, these objects are used in much the same way: they all have fit() and transform() methods.\n\nNB: The objects used are not exactly the same, and do not have the same default parameters, so the results will be different. The purpose here is to show how to use Spark ML and how similar it is to scikit-learn.", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"execution_count": 11, "cell_type": "code", "source": "# Tokenizing and Occurrence Counts\nfrom sklearn.feature_extraction.text import CountVectorizer\ncount_vect = CountVectorizer()\nX_train_counts = count_vect.fit_transform(X_train)\n\n# TF-IDF\nfrom sklearn.feature_extraction.text import TfidfTransformer\ntfidf_transformer = TfidfTransformer()\nX_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {"collapsed": true}}, {"execution_count": 12, "cell_type": "code", "source": "# Tokenizing\nfrom pyspark.ml.feature import Tokenizer\ntokenizer = Tokenizer(inputCol='news', outputCol='news_words')\ndf_train_words = tokenizer.transform(df_train)\n\n# Hashing Term-Frequency\nfrom pyspark.ml.feature import HashingTF\nhashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol='news_tf', numFeatures=10000)\ndf_train_tf = hashing_tf.transform(df_train_words)\n\n# Inverse Document Frequency\nfrom pyspark.ml.feature import IDF\nidf = IDF(inputCol=hashing_tf.getOutputCol(), outputCol=\"news_tfidf\")\nidf_model = idf.fit(df_train_tf) # fit builds the IDF model on all the training data; transform then applies it row by row\ndf_train_tfidf = idf_model.transform(df_train_tf)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 13, "cell_type": "code", "source": "df_train_tfidf.show(5)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+--------------------+------+--------------------+--------------------+--------------------+\n| news|target| news_words| news_tf| news_tfidf|\n+--------------------+------+--------------------+--------------------+--------------------+\n|From: alizard@twe...| 7|[from:, alizard@t...|(10000,[0,62,141,...|(10000,[0,62,141,...|\n|From: rgonzal@gan...| 1|[from:, rgonzal@g...|(10000,[0,97,284,...|(10000,[0,97,284,...|\n|From: cka52397@ux...| 2|[from:, cka52397@...|(10000,[0,6,45,97...|(10000,[0,6,45,97...|\n|From: Thomas Keph...| 1|[from:, thomas, k...|(10000,[0,54,62,7...|(10000,[0,54,62,7...|\n|From: lovall@bohr...| 7|[from:, lovall@bo...|(10000,[0,55,58,6...|(10000,[0,55,58,6...|\n+--------------------+------+--------------------+--------------------+--------------------+\nonly showing top 5 rows\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, 
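{"source": "To make the note above concrete (cell added for illustration): CountVectorizer builds an exact vocabulary from the training texts, while HashingTF simply hashes words into a fixed number of buckets. The attribute and getter used below assume scikit-learn's vocabulary_ and Spark ML's getNumFeatures().", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "# Added for illustration: dimensionality of the two feature spaces\nprint len(count_vect.vocabulary_) # exact vocabulary built by CountVectorizer\nprint hashing_tf.getNumFeatures() # fixed number of hashing buckets used by HashingTF", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, 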
{"source": "##Modelling & Prediction\n---", "cell_type": "markdown", "metadata": {}}, {"source": "Now that the data is ready to be used, we can start the modelling step. For this example, we will use a simple algorithm: a Decision Tree. Both scikit-learn and Spark ML have a DecisionTreeClassifier object for this.\n\nThe parameters to specify for this classifier are the same in both libraries, but with slightly different names. The way to use them is exactly the same.\n\nOne slight difference, though: in Spark ML, we need to specify that the target column is categorical, even if we use a Classifier. This is because the classifier in Spark ML needs to know the number of classes. One way to do this is to use a StringIndexer that will convert the column into a double column with the number of classes in its metadata. \n\nIf you don't do this, you will get an error like: \"DecisionTreeClassifier was given input with invalid label column target, without the number of classes specified. See StringIndexer.\"\n\nOne last important note: Always perform the learning task on the training set, and the predictions on the test set. The test set needs to be transformed in the same way as the training set before it can be used by the model to make predictions.", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"execution_count": 14, "cell_type": "code", "source": "# Training a Decision Tree on the training set\nfrom sklearn.tree import DecisionTreeClassifier\nclf = DecisionTreeClassifier(max_depth=10).fit(X_train_tfidf, y_train)\n\n# Transform the test set\nX_test_counts = count_vect.transform(X_test)\nX_test_tfidf = tfidf_transformer.transform(X_test_counts)\n\n# Predictions on the test set\ny_test_pred = clf.predict(X_test_tfidf)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {"collapsed": true}}, {"execution_count": 15, "cell_type": "code", "source": "# Indexing the target\nfrom pyspark.ml.feature import StringIndexer\nstring_indexer = StringIndexer(inputCol='target', outputCol='target_indexed')\nstring_indexer_model = string_indexer.fit(df_train_tfidf)\ndf_train_final = string_indexer_model.transform(df_train_tfidf)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 16, "cell_type": "code", "source": "# Training a Decision Tree on the training set\nfrom pyspark.ml.classification import DecisionTreeClassifier\ndt = DecisionTreeClassifier(featuresCol=idf.getOutputCol(), labelCol=string_indexer.getOutputCol())\ndt_model = dt.fit(df_train_final)\n\n# Transform the test set\ndf_test_words = tokenizer.transform(df_test)\ndf_test_tf = hashing_tf.transform(df_test_words)\ndf_test_tfidf = idf_model.transform(df_test_tf)\ndf_test_final = string_indexer_model.transform(df_test_tfidf)\n\n# Predictions on the test set\ndf_test_pred = dt_model.transform(df_test_final)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 17, "cell_type": "code", "source": "df_test_pred.select('news', 'target', 'prediction', 'probability').show(5)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+--------------------+------+----------+--------------------+\n| news|target|prediction| probability|\n+--------------------+------+----------+--------------------+\n|From: djk@ccwf.cc...| 1| 3.0|[0.03717472118959...|\n|From: pjtier01@ul...| 3| 0.0|[0.18422122814152...|\n|From: pmetzger@sn...| 4| 4.0|[0.0,0.0,0.013043...|\n|From: mrr@scss3.c...| 4| 4.0|[0.0,0.0,0.013043...|\n|From: Earl D. Fif...| 1| 0.0|[0.18422122814152...|\n+--------------------+------+----------+--------------------+\nonly showing top 5 rows\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "##Pipeline\n----", "cell_type": "markdown", "metadata": {}}, {"source": "As we can see, the number of steps to perform can be quite large, especially for the Feature Engineering part. Chaining all the required steps on the training set to train a model, and then performing them all again on the test set to make predictions, can be quite tedious. \n\nThe Pipeline object is here to make our lives easier on this point. 
It gathers all the transformation and learning steps into a single estimator, which can then be applied directly to the raw data of the training and test sets.\n\nThe steps to perform are the following:\n- Create an instance of each Transformer / Estimator to use\n- Group them into a Pipeline object\n- Call the fit() method of the pipeline to run the transformations and the learning on the training set\n- Call the transform() method to perform the predictions on the test set\n\nWhen the fit() method is called, the Pipeline object will call, in the specified order, the fit() method of each stage if it has one, and then its transform() method.", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"execution_count": 18, "cell_type": "code", "source": "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.pipeline import Pipeline\n\n# Instantiate a Pipeline\ntext_clf = Pipeline([('vect', CountVectorizer()),\n ('tfidf', TfidfTransformer()),\n ('clf', DecisionTreeClassifier(max_depth=10)),\n ])\n\n# Transform the data and train the classifier on the training set\ntext_clf = text_clf.fit(X_train, y_train)\n\n# Transform the data and perform predictions on the test set\ny_test_pred = text_clf.predict(X_test)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {}}, {"execution_count": 19, "cell_type": "code", "source": "from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer\nfrom pyspark.ml.classification import DecisionTreeClassifier\nfrom pyspark.ml import Pipeline\n\n# Instantiate all the necessary Estimators and Transformers\ntokenizer = Tokenizer(inputCol='news', outputCol='news_words')\nhashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol='news_tf', numFeatures=10000)\nidf = IDF(inputCol=hashing_tf.getOutputCol(), outputCol=\"news_tfidf\")\nstring_indexer = StringIndexer(inputCol='target', outputCol='target_indexed')\ndt = DecisionTreeClassifier(featuresCol=idf.getOutputCol(), labelCol=string_indexer.getOutputCol(), maxDepth=10)\n\n# Instantiate a Pipeline\npipeline = Pipeline(stages=[tokenizer, \n hashing_tf, \n idf, \n string_indexer, \n dt])\n\n# Transform the data and train the classifier on the training set\npipeline_model = pipeline.fit(df_train)\n\n# Transform the data and perform predictions on the test set\ndf_test_pred = pipeline_model.transform(df_test)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 20, "cell_type": "code", "source": "df_test_pred.show(5)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+--------------------+------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+----------+\n| news|target| news_words| news_tf| news_tfidf|target_indexed| rawPrediction| probability|prediction|\n+--------------------+------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+----------+\n|From: djk@ccwf.cc...| 1|[from:, djk@ccwf....|(10000,[0,38,45,4...|(10000,[0,38,45,4...| 5.0|[0.0,7.0,7.0,79.0...|[0.0,0.0642201834...| 3.0|\n|From: pjtier01@ul...| 3|[from:, pjtier01@...|(10000,[0,54,62,9...|(10000,[0,54,62,9...| 0.0|[333.0,231.0,310....|[0.18078175895765...| 0.0|\n|From: pmetzger@sn...| 4|[from:, 
pmetzger@...|(10000,[0,40,41,4...|(10000,[0,40,41,4...| 4.0|[0.0,0.0,2.0,0.0,...|[0.0,0.0,0.009009...| 4.0|\n|From: mrr@scss3.c...| 4|[from:, mrr@scss3...|(10000,[0,25,26,3...|(10000,[0,25,26,3...| 4.0|[0.0,0.0,2.0,0.0,...|[0.0,0.0,0.009009...| 4.0|\n|From: Earl D. Fif...| 1|[from:, earl, d.,...|(10000,[0,38,54,6...|(10000,[0,38,54,6...| 5.0|[1.0,0.0,107.0,5....|[0.00793650793650...| 2.0|\n+--------------------+------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+----------+\nonly showing top 5 rows\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "##Model Evaluation\n----", "cell_type": "markdown", "metadata": {}}, {"source": "Once we have built our pipeline, it is time to evaluate it. This is where the test set is crucial. We perform predictions on the test set, as if we didn't know the actual classes, and then compare the predictions with the ground truth. If we did this on the training set, the evaluation would be biased, because we would be making predictions on the very data used to build the model. Keeping a test set whose data is not used to build the model helps in observing the generalization capacity of the model.\n\nBoth scikit-learn and Spark ML have built-in metrics to score all kinds of predictions. In our case, we will measure the precision of the predictions: the percentage of correctly classified data. This metric is available through the precision_score function in scikit-learn, and through the MulticlassClassificationEvaluator object in Spark ML.", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"execution_count": 21, "cell_type": "code", "source": "from sklearn.metrics import precision_score\n\n# Evaluate the predictions done on the test set\nprecision_score(y_test_pred, y_test, average='micro')", "outputs": [{"execution_count": 21, "output_type": "execute_result", "data": {"text/plain": "0.53751399776035835"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {}}, {"execution_count": 22, "cell_type": "code", "source": "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n\n# Instantiate a MulticlassClassificationEvaluator with the precision metric\nevaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='target_indexed', \n metricName='precision')\n\n# Evaluate the predictions done on the test set\nevaluator.evaluate(df_test_pred)", "outputs": [{"execution_count": 22, "output_type": "execute_result", "data": {"text/plain": "0.4908045977011494"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "> Scores are different mainly because the default parameters are not the same in scikit-learn and Spark ML", "cell_type": "markdown", "metadata": {}}, {"source": "##Parameter tuning", "cell_type": "markdown", "metadata": {}}, {"source": "We would now like to improve the score of our model. One way to do that is to tune the parameters in order to find the best combination of parameters. \n\nTuning is generally done using the following tools:\n- Grid Search: Specify in a grid all the values of each parameter we want to try\n- Cross Validation: Test every combination of parameters several times, on different splits of the training set\n\nIn scikit-learn, one can use the GridSearchCV object. In Spark ML, it is a CrossValidator object. 
In both cases, there are three things that we need to specify:\n- The parameters grid (using a ParamGridBuilder object in Spark ML)\n- The estimator (or pipeline)\n- The scoring function to decide which combination gives the best score", "cell_type": "markdown", "metadata": {}}, {"source": "#####scikit-learn", "cell_type": "markdown", "metadata": {}}, {"execution_count": 23, "cell_type": "code", "source": "from sklearn.grid_search import GridSearchCV\n\n# Create the parameters grid\nparameters = {'tfidf__use_idf': (True, False),\n 'clf__max_depth': (10, 20)\n }\n\n# Instantiate a GridSearchCV object with the pipeline, the parameters grid and the scoring function\ngs_clf = GridSearchCV(text_clf, parameters, score_func=precision_score, n_jobs=-1)\n\n# Transform the data and train the classifier on the training set\ngs_clf = gs_clf.fit(X_train, y_train)\n\n# Transform the data and perform predictions on the test set\ny_test_pred = gs_clf.predict(X_test)\n\n# Evaluate the predictions done on the test set\nprecision_score(y_test_pred, y_test, average='micro')", "outputs": [{"execution_count": 23, "output_type": "execute_result", "data": {"text/plain": "0.64613661814109746"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "#####Spark ML", "cell_type": "markdown", "metadata": {}}, {"execution_count": 24, "cell_type": "code", "source": "from pyspark.ml.tuning import ParamGridBuilder\nfrom pyspark.ml.tuning import CrossValidator\n\n# Instantiation of a ParamGridBuilder\n\ngrid = (ParamGridBuilder()\n .baseOn([evaluator.metricName, 'precision'])\n .addGrid(dt.maxDepth, [10, 20])\n .build())\n\n# Instantiation of a CrossValidator\ncv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)\n\n# Transform the data and train the classifier on the training set\ncv_model = cv.fit(df_train)\n\n# Transform the data and perform predictions on the test set\ndf_test_pred = cv_model.transform(df_test)\n\n# Evaluate the predictions done on the test set\nevaluator.evaluate(df_test_pred)", "outputs": [{"execution_count": 24, "output_type": "execute_result", "data": {"text/plain": "0.6149425287356322"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "> Again, the results are different since not all parameters are used, and the default ones may not be the same. Moreover, we did not use exactly the same objects in the Feature Engineering phase (CountVectorizer / Tokenizer for example).", "cell_type": "markdown", "metadata": {}}, 
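{"source": "To inspect the best parameter combination found during tuning (cell added for illustration; it assumes GridSearchCV's best_params_ and best_score_ attributes and CrossValidatorModel's bestModel attribute):", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "# Added for illustration: best parameters found by scikit-learn's grid search\nprint gs_clf.best_params_\nprint gs_clf.best_score_\n\n# Best pipeline found by Spark ML's cross-validation (the whole fitted PipelineModel)\nprint cv_model.bestModel", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, 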
{"source": "##Conclusion\n----", "cell_type": "markdown", "metadata": {}}, {"source": "As we saw, scikit-learn and Spark ML have a lot in common. There are some slight differences between the two libraries, in terms of implementation and how the data is handled, but they are minimal. Spark ML was designed to be close to scikit-learn in the way it is used, and this helps a lot when scaling up with Spark to build complex Machine Learning pipelines.\n\nSpark ML is still under active development, and has a limited number of algorithms implemented for now compared to scikit-learn. The list of possibilities offered by Spark ML will expand over time, and it will become easier and easier to go from scikit-learn to Spark ML.", "cell_type": "markdown", "metadata": {}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 2", "name": "python2", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "2.7.10", "name": "python", "file_extension": ".py", "pygments_lexer": "ipython2", "codemirror_mode": {"version": 2, "name": "ipython"}}}}
--------------------------------------------------------------------------------