├── README.md
├── chapters
│   ├── k_means.md
│   ├── random_forest.md
│   ├── references.md
│   └── regression.md
└── figures
    └── k-means.gif

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Spark Machine Learning Introduction

**NOTE: the methods introduced here are all based on the RDD-based API. As of Spark 2.0, the RDD-based APIs in the `spark.mllib` package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the `spark.ml` package. I would strongly suggest NOT using this repo for your learning anymore (please refer to https://spark.apache.org/docs/2.1.0/ml-guide.html).**

In this repo, I try to introduce some basic machine learning usages of *PySpark*. The contents covered here are quite simple, but they should be helpful for people who, like me, are used to more "conventional" ML settings (such as the R language), since I cover some of the questions I ran into myself.

For the basic PySpark operations (Transformations and Actions), you may refer to another GitHub repo of mine, [Spark Practice](https://github.com/XD-DENG/Spark-practice).

Some of the examples are taken from the official examples given by Spark, but I will give more details.

- [Random Forest](https://github.com/XD-DENG/Spark-ML-Intro/tree/master/chapters/random_forest.md)
- [Regression](https://github.com/XD-DENG/Spark-ML-Intro/tree/master/chapters/regression.md)
- [K-means](https://github.com/XD-DENG/Spark-ML-Intro/tree/master/chapters/k_means.md)
- [References](https://github.com/XD-DENG/Spark-ML-Intro/tree/master/chapters/references.md)

## License
Please note this repository is under the [Creative Commons Attribution-ShareAlike License](https://creativecommons.org/licenses/by-sa/3.0/).

--------------------------------------------------------------------------------
/chapters/k_means.md:
--------------------------------------------------------------------------------
## K-means

The next machine learning method I'd like to introduce is K-means, a clustering method. It is an unsupervised learning method in which we group the observations into *K* groups (or subsets). We call it "unsupervised" because the observations come with no associated response measurements that could be used to check and evaluate the model we build (of course, other measures can be used to evaluate clustering models).

K-means may be the simplest approach to clustering, yet it is also an elegant and efficient method. To produce the clusters, the K-means method only requires the number of clusters *K* as its input.

The idea behind K-means clustering is that a good clustering is one with the smallest possible within-cluster variation (a measure of how different the observations within a cluster are from each other). To achieve this, the K-means algorithm is designed in a "greedy" fashion.

**K-means Algorithm**

1. Assign each observation a random number generated from 1 to *K* as its initial cluster label.

2. For each of the *K* clusters, compute the cluster center. The *k*th cluster's center is the vector of feature means over all the observations belonging to the *k*th cluster.

3. Re-assign each observation to the cluster whose center is closest to it.

4. Check whether the new assignments are the same as in the last iteration. If not, go back to step 2; if yes, stop.
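To make these four steps concrete before we turn to Spark, here is a minimal NumPy sketch of the same loop. This is plain Python rather than PySpark, and the function name `simple_kmeans` and the toy array `X` are made up for illustration only.

```python
import numpy as np

def simple_kmeans(X, K, max_iter=100, seed=0):
    """Bare-bones K-means following the four steps above (illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 1: assign each observation a random cluster label in {0, ..., K-1}
    labels = rng.integers(0, K, size=len(X))
    centers = None
    for _ in range(max_iter):
        # Step 2: each cluster center is the mean of its member observations
        centers = []
        for k in range(K):
            members = X[labels == k]
            # guard against an empty cluster (possible with random initialization)
            centers.append(members.mean(axis=0) if len(members) else X[rng.integers(len(X))])
        centers = np.array(centers)
        # Step 3: re-assign each observation to its closest cluster center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop when the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers

# Tiny usage example with two obvious groups
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
labels, centers = simple_kmeans(X, K=2)
print(labels)
print(centers)
```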
An example of the K-means algorithm iterating is shown below.

![alt text](../figures/k-means.gif)


Now it's time to implement K-means with PySpark. I generate a dataset myself: it contains 30 observations, which I purposely generated so that they fall into 3 groups.

```python
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")

from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from random import randrange
from math import sqrt


# Generate the observations -----------------------------------------------------
n_in_each_group = 10   # how many observations in each group
n_of_feature = 5       # how many features we have for each observation

observation_group_1 = []
for i in range(n_in_each_group * n_of_feature):
    observation_group_1.append(randrange(5, 8))

observation_group_2 = []
for i in range(n_in_each_group * n_of_feature):
    observation_group_2.append(randrange(55, 58))

observation_group_3 = []
for i in range(n_in_each_group * n_of_feature):
    observation_group_3.append(randrange(105, 108))

data = array([observation_group_1, observation_group_2, observation_group_3]).reshape(n_in_each_group * 3, 5)
data = sc.parallelize(data)


# Run the K-Means algorithm -----------------------------------------------------


# Build the K-Means model
# the initializationMode can also be "k-means||" or set by users
clusters = KMeans.train(data, 3, maxIterations=10, initializationMode="random")

# Collect the clustering result
result = data.map(lambda point: clusters.predict(point)).collect()
print(result)

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = data.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
```

--------------------------------------------------------------------------------
/chapters/random_forest.md:
--------------------------------------------------------------------------------
## Random Forest

What is the idea of a **Random Forest**?

To put it simply, averaging a set of observations reduces variance. A natural way to reduce the variance, and thereby increase the prediction accuracy, of a decision tree model is therefore to take training sets repeatedly from the population, build a separate tree model on each training set, and average the resulting predictions [1]. This is the idea of **bagging** (**B**ootstrap **agg**regat**ing**).

On top of bagging, we also subset the predictors. That is, at each split of a tree model, we do not use all the features we have. You may ask WHY, since this seems like a 'waste' of the resources we have. But suppose there is a very strong predictor in the data: most of the trees we produce will then use that strong predictor in the top split, and all of these decision trees will look similar, i.e., they will be highly correlated. This undermines the reduction in variance and worsens the result [1]. This is why we only use a randomly selected subset of features at each split of the tree models.

That is the whole idea of a **random forest**: simple, straightforward, and elegant at the same time.
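Before looking at the Spark code, here is a quick numerical illustration of the claim that averaging reduces variance. This is not bagging proper (bagging resamples with replacement from a single training set), just a sketch of the underlying variance-reduction principle; the data and the helper names `single_estimate` and `averaged_estimate` are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# A noisy "population" whose true mean is 0 and whose noise standard deviation is 1
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def single_estimate(sample_size=50):
    """Estimate the mean from one random sample."""
    return rng.choice(population, size=sample_size).mean()

def averaged_estimate(sample_size=50, n_samples=100):
    """Average the estimates obtained from many independent samples."""
    return np.mean([single_estimate(sample_size) for _ in range(n_samples)])

# Compare the spread of the two estimators over repeated runs:
# the averaged estimator varies far less from run to run.
singles = [single_estimate() for _ in range(200)]
averaged = [averaged_estimate() for _ in range(200)]
print("variance of single-sample estimates:", np.var(singles))
print("variance of averaged estimates     :", np.var(averaged))
```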
Now let's have a look at the example code given by Spark. I have commented the points we need to pay attention to (details are given below). You can also run this example by using `./bin/spark-submit` in the top-level Spark directory.

```python
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext

# --- Point 1, 2 ---
# Load and parse the data file into an RDD of LabeledPoint.
sc = SparkContext()
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])


# --- Point 3, 4, 5 ---
# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())
```

##### Point 1: LIBSVM data format
When I looked into a LIBSVM data file for the first time, I was a little confused. But then I realized its design is a brilliant idea.

LIBSVM data files look like this:
```
-1 1:-766 2:128 3:0.140625 4:0.304688 5:0.234375 6:0.140625 7:0.304688 8:0.234375
-1 1:-726 2:131 3:0.129771 4:0.328244 5:0.229008 6:0.129771 7:0.328244 8:0.229008
......
```
The first element of each row is the *label*, or we can say the *response value*. The labels can be either discrete or continuous: normally, the labels are discrete if we're working on classification, and continuous if we're doing regression. Following the label are the *feature indices* and the *feature values* in the format `index:value` (please note that the index starts from `1` instead of `0` in LIBSVM data files, i.e., the indices are one-based and in ascending order; after loading, the feature indices are converted to zero-based [4]).

Sometimes we may find 'weird' LIBSVM data like the following:
```
-1 3:1 11:1 14:1 19:1 39:1 42:1 55:1 64:1 67:1 73:1 75:1 76:1 80:1 83:1
-1 3:1 6:1 17:1 27:1 35:1 40:1 57:1 63:1 69:1 73:1 74:1 76:1 81:1 103:1
-1 1:1 7:1 16:1 22:1 36:1 42:1 56:1 62:1 67:1 73:1 74:1 76:1 79:1 83:1
```
The indices here are not continuous. What's wrong? Nothing is wrong: the missing features are simply all 0. For example, in the first row, features 1, 2, 4-10, 12-13, ... are all zero-valued. This design is partially for the sake of memory usage: it helps improve the efficiency of our programs when the data are sparse (i.e., contain many zero values).


##### Point 2: Data Type "Labeled Point"

The data loaded by the method `loadLibSVMFile` are stored as `LabeledPoint`s. What is that?

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. A training example used in supervised learning is called a "labeled point" in MLlib [4]. A labeled point simply bundles a label (a double) with a local feature vector, which can be either dense or sparse.
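As a quick illustration, a labeled point can also be constructed by hand with either a dense or a sparse feature vector. The numbers below are made up; the point is only to show the shape of the data structure.

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# A labeled point with label 1.0 and a dense feature vector
dense_lp = LabeledPoint(1.0, Vectors.dense([0.5, 2.0, 0.0, 3.1]))

# The same idea with a sparse vector: 8 features in total, of which only
# indices 2 and 5 are non-zero. This mirrors how a sparse LIBSVM row is
# represented after loading (note the zero-based indices here).
sparse_lp = LabeledPoint(-1.0, Vectors.sparse(8, [2, 5], [1.0, 4.5]))

print(dense_lp)
print(sparse_lp)
```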
##### Point 3: How many trees we should have (`numTrees`)

This argument determines how many trees we build in the random forest. Increasing the number of trees will decrease the variance in predictions and improve the model's test accuracy. At the same time, training time will increase roughly linearly in the number of trees.

Personally, I would recommend 400-500 as a 'safe' choice.


##### Point 4: How many features to use (`featureSubsetStrategy`)

As mentioned above, the defining characteristic of a *random forest* is that at each split of the tree model we use a subset of the features (predictors) instead of all of them. Then how many features should we use at each split? We can of course set `featureSubsetStrategy="auto"` so that the function we call configures this automatically, but in some situations we may want to tune it ourselves. Decreasing this number will speed up training, but can sometimes impact performance if it is too low [2].

For the function `RandomForest.trainClassifier` in PySpark, the argument `featureSubsetStrategy` supports "auto" (default), "all", "sqrt", "log2", and "onethird". If "auto" is set, this parameter is chosen based on `numTrees`: if `numTrees == 1`, it is set to "all"; if `numTrees > 1` (a forest), it is set to "sqrt" [3].

Usually, given `p` features, we use `p/3` features at each split when building a random forest for regression, and `sqrt(p)` features at each split when building a random forest for classification [1].



##### Point 5: What is 'gini' --- the measures used to grow the trees (`impurity`)

The `impurity` argument determines the criterion used for the information gain calculation, and in PySpark the supported values are "gini" (recommended) and "entropy" [3]. Since the tree-growing procedure is a kind of *greedy algorithm*, we can say that `impurity` determines the objective function the algorithm uses when it makes each decision.

The most commonly used measures for this are the **Gini index** and **cross-entropy**, corresponding to the two supported values of the `impurity` argument. A small sketch of both measures is given below.
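For reference, here is a minimal sketch of how the two impurity measures are computed from the class proportions within a node. The helper names and the example proportions are made up, and the log base (base 2 here) is a convention that can differ between textbooks and implementations.

```python
from math import log2

def gini(proportions):
    """Gini index: sum over classes of p_k * (1 - p_k)."""
    return sum(p * (1 - p) for p in proportions)

def cross_entropy(proportions):
    """Cross-entropy: -sum over classes of p_k * log(p_k), with 0 * log(0) treated as 0."""
    return -sum(p * log2(p) for p in proportions if p > 0)

# A node where 90% of the observations belong to one class is much "purer"
# (lower impurity) than a 50/50 node, under both measures.
for proportions in ([0.5, 0.5], [0.9, 0.1]):
    print(proportions,
          "gini =", round(gini(proportions), 4),
          "entropy =", round(cross_entropy(proportions), 4))
```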
--------------------------------------------------------------------------------
/chapters/references.md:
--------------------------------------------------------------------------------
## References
[1] An Introduction to Statistical Learning with Applications in R

[2] MLlib - Ensembles, http://spark.apache.org/docs/latest/mllib-ensembles.html

[3] pyspark.mllib package, http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html

[4] MLlib - Data Types, http://spark.apache.org/docs/latest/mllib-data-types.html

[5] pyspark.mllib.clustering module, http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.KMeansModel

[6] Clustering - spark.mllib, http://spark.apache.org/docs/latest/mllib-clustering.html

--------------------------------------------------------------------------------
/chapters/regression.md:
--------------------------------------------------------------------------------
## Regression

From my own understanding, regression is about building a model that fits the training data as closely as possible. We normally use the *mean squared error* (MSE) both as the measure of fit quality and as the objective function when we estimate the parameters.

The method we usually use for regression in Spark MLlib is `LinearRegressionWithSGD.train`, and we use `predict` to make predictions with the regression model we obtain. Note that 'SGD' here refers to Stochastic Gradient Descent.

```python
# the two lines below are added so that this code can be run as a self-contained application
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")

# load the necessary modules
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.evaluation import RegressionMetrics

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)


# Split the data into two sets for training and testing
# The seed is set so that the result can be reproduced
(trainingData, testData) = parsedData.randomSplit([0.7, 0.3], seed=100)


# Build the model
model = LinearRegressionWithSGD.train(trainingData)


# Evaluate the model on the test data
# --- Point 1 ---
Preds = testData.map(lambda p: (float(model.predict(p.features)), p.label))
MSE = Preds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / Preds.count()
print("Mean Squared Error = " + str(MSE))
print("\n")

# --- Point 2 ---
# More about model evaluation and regression analysis
# Instantiate the metrics object
metrics = RegressionMetrics(Preds)

# Squared Error
print("MSE = %s" % metrics.meanSquaredError)
print("RMSE = %s" % metrics.rootMeanSquaredError)

# R-squared
print("R-squared = %s" % metrics.r2)

# Mean absolute error
print("MAE = %s" % metrics.meanAbsoluteError)

# Explained variance
print("Explained variance = %s" % metrics.explainedVariance)
```
We can run this script as an application with the `spark-submit` command and get the output below.
```
Mean Squared Error = 7.35754024842

MSE = 7.35754024842
RMSE = 2.71247861714
R-squared = -4.74791121611
MAE = 2.52897021533
Explained variance = 7.89672343551
```

##### Point 1: A Small Trap
Note that we need to explicitly convert the predicted values to float, otherwise you'll encounter an error like the one below
```
TypeError: DoubleType can not accept object in type
```
when you call `metrics = RegressionMetrics(Preds)` (everything would be fine if you did not do the regression analysis with the `metrics` object).

You'll also need to include `p.label` in the mapped pairs if you want to do the regression analysis with the `metrics` object.


##### Point 2: Regression Analysis
`MLlib` provides the most commonly used metrics for regression analysis. You may refer to https://en.wikipedia.org/wiki/Regression_analysis for the relevant information. A small sketch of how the main metrics are defined is given below.
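The sketch below uses the textbook definitions, computed from plain Python (prediction, label) pairs analogous to the `Preds` RDD above. The pairs are made up, and the formulas may differ in small details from what `RegressionMetrics` does internally.

```python
from math import sqrt

# A few made-up (prediction, label) pairs
pairs = [(2.5, 3.0), (0.5, 0.0), (2.0, 2.0), (7.0, 8.0)]

labels = [y for _, y in pairs]
n = len(pairs)
mean_label = sum(labels) / n

mse = sum((p - y) ** 2 for p, y in pairs) / n        # mean squared error
rmse = sqrt(mse)                                      # root mean squared error
mae = sum(abs(p - y) for p, y in pairs) / n           # mean absolute error
ss_res = sum((y - p) ** 2 for p, y in pairs)          # residual sum of squares
ss_tot = sum((y - mean_label) ** 2 for y in labels)   # total sum of squares
r2 = 1 - ss_res / ss_tot                              # coefficient of determination

print("MSE  =", mse)
print("RMSE =", rmse)
print("MAE  =", mae)
print("R^2  =", r2)
```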
--------------------------------------------------------------------------------
/figures/k-means.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XD-DENG/Spark-ML-Intro/582cbeee56259b1957e1ebf4c027e90307b319eb/figures/k-means.gif
--------------------------------------------------------------------------------