├── README.md
├── customer_segments.ipynb
├── customers.csv
└── visuals.py
/README.md:
--------------------------------------------------------------------------------
1 | # Study-09-MachineLearning-E
2 | UnsupervisedLearning
3 |
4 | - **A. Basic Clustering**
5 |   - K-means, Hierarchical, DBSCAN
6 | - **B. Model-Based Clustering**
7 | - Gaussian Mixture
8 | - **C. Cluster Validation**
9 | - **D. Dimensionality Reduction**
10 | - PCA, ICA
11 |
12 | ---
13 | ## 00. Min-Max Scaler: Feature Scaling in the data pre-processing stage
14 | - Unbalanced features: height vs. weight..the units differ! How can this combination of features describe someone?
15 | - Transform each feature to the range [0,1]. But what if the data has outliers, such as a ridiculous max or min? Those extremes define the range, so the remaining scaled values get squeezed together.
16 |
17 |
18 | ```
19 | def featureScaling(array):
20 |     answer = []
21 |     for i in array:
22 |         value = float(i - min(array)) / float(max(array) - min(array))
23 |         answer.append(value)
24 |     return answer
25 | data = [115, 140, 175]
26 | print(featureScaling(data))  # [0.0, 0.4166666666666667, 1.0]
27 | ```
28 | **`scikit-learn` loves `numpy` input!!!!!**
29 | ```
30 | import numpy as np
31 | from sklearn.preprocessing import MinMaxScaler
32 |
33 | X = np.array([ [115.0],[140.0],[175.0] ])  # values must be floats; each inner "[]" is one row (one sample)
34 |
35 | scaler = MinMaxScaler()
36 | rescaled_X = scaler.fit_transform(X)
37 | ```
38 | > [Note]: Which algorithms are affected by **feature scaling**?
39 | - SVM Classification =>(YES): we trade one dimension off against the other when computing the `distances`, so the **"diagonal"** margin-maximizing decision surface changes with the scale of each feature.
40 | - K-means Clustering =>(YES): we compute the `distances` from each cluster center to all data pts, and those distances are **"diagonal"** across feature dimensions.
41 | - Linear Regression =>(NO): each feature gets its own coefficient. Rescaling feature_A only rescales its own coefficient and does not affect the coefficient of feature_B, so the features stay separated.
42 | - Decision Tree Classification =>(NO): splits are made on one feature at a time, so there is no diagonal decision surface and no trade-off between features.
43 |
44 | ---
45 | ## A. Basic Clustering
46 |
47 |
48 |
49 | ### 1. K-means Clustering
50 | - Find groups of similar observations
51 | - Step_01: randomly generate the centroids (MU1, MU2,...).
52 | - Step_02: Allocation
53 |   - Holding `MU_k` fixed, label each data point (which MU_k is closest?) and find the membership `Z_ik` that **minimizes SS** (creating a cluster around each MU_k).
54 | - Step_03: Updating
55 |   - Holding `Z_ik` fixed, pick the new `MU_k` for each cluster that **minimizes SS**.
56 | - Step_04: Iterate until convergence (no movement of points b/w clusters)
57 | > **SS** of each data pt (**find the membership** `Z_ik`(0/1) that minimizes the SS; see the objective sketched below)
58 | - i: each datapoint
59 | - k: each cluster
60 |
61 |
62 | > **SS** for each cluster (**find the center** `MU_k` that minimizes the SS)
63 | - i: each datapoint
64 | - k: each cluster
65 |
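A minimal sketch of the objective both steps are minimizing, in the notation used above (`Z_ik` is the 0/1 membership of point `x_i` in cluster `k`, and `MU_k` is the centroid of cluster `k`):
```
SS(Z, \mu) = \sum_{i=1}^{n} \sum_{k=1}^{K} Z_{ik} \, \lVert x_i - \mu_k \rVert^2

% Allocation: hold \mu fixed, set Z_{ik} = 1 for the closest \mu_k (0 otherwise)
% Updating:   hold Z fixed,  set \mu_k = \frac{\sum_i Z_{ik} x_i}{\sum_i Z_{ik}}
```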
66 |
67 | > Advantages:
68 | - it is simple, easy to implement, and easy to interpret the results.
69 | - it works well in practice even when some assumptions are broken.
70 | > Disadvantages:
71 | - **Local Minima**: it is a local hill-climbing algorithm, so it can return a sub-optimal solution, and the output for the same training set can differ from run to run. The result depends heavily on where we put the **initial cluster centers**, and the more cluster centers we have, the more bad local minima there are, so run the algorithm multiple times with different initializations and keep the best result.
72 |
73 |
74 | - **Hyper-spherical nature**:
75 |   - it relies only on the distance to the centroid as the definition of a cluster, thus it works poorly with clusters of different densities and cannot carve out decent clusters when their shapes are not spherical.
76 |   - it assumes the joint distribution of features within each cluster is spherical: features within a cluster have equal variance and are independent of each other.
77 |   - it assumes balanced cluster sizes within the dataset, thus it often produces clusters of relatively uniform size even if the input data has clusters of different sizes.
78 |   - it is sensitive to outliers.
79 | ```
80 | def kmeans(dataSet, k):
81 |
82 | # Initialize centroids randomly
83 | numFeatures = dataSet.getNumFeatures()
84 | centroids = getRandomCentroids(numFeatures, k)
85 |
86 | # Initialize book keeping vars.
87 | iterations = 0
88 | oldCentroids = None
89 |
90 | # Run the main k-means algorithm
91 | while not shouldStop(oldCentroids, centroids, iterations):
92 | # Save old centroids for convergence test. Book keeping.
93 | oldCentroids = centroids
94 | iterations += 1
95 |
96 | # Assign labels to each datapoint based on centroids
97 | labels = getLabels(dataSet, centroids)
98 |
99 | # Assign centroids based on datapoint labels
100 | centroids = getCentroids(dataSet, labels, k)
101 |
102 | # We can get the labels too by calling getLabels(dataSet, centroids)
103 | return centroids
104 |
105 | # Function: Should Stop
106 | # -------------
107 | # Returns True or False if k-means is done. K-means terminates either
108 | # because it has run a maximum number of iterations OR the centroids
109 | # stop changing.
110 | def shouldStop(oldCentroids, centroids, iterations):
111 | if iterations > MAX_ITERATIONS: return True
112 | return oldCentroids == centroids
113 |
114 | # Function: Get Labels
115 | # -------------
116 | # Returns a label for each piece of data in the dataset.
117 | def getLabels(dataSet, centroids):
118 | # For each element in the dataset, chose the closest centroid.
119 | # Make that centroid the element's label.
120 |
121 | # Function: Get Centroids
122 | # -------------
123 | # Returns k centroids (one per label), each of dimension n.
124 | def getCentroids(dataSet, labels, k):
125 | # Each centroid is the geometric mean of the points that
126 | # have that centroid's label. Important: If a centroid is empty (no points have
127 | # that centroid's label) you should randomly re-initialize it.
128 | ```
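The skeleton above shows the mechanics; for day-to-day use, scikit-learn's `KMeans` wraps the whole loop. A minimal sketch, assuming `X` is a numeric numpy array or DataFrame and that 3 clusters is a reasonable guess:
```
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # n_init re-runs guard against bad local minima
labels = kmeans.fit_predict(X)        # cluster index for each row of X
centers = kmeans.cluster_centers_     # final centroids
```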
129 | ### 2. Hierarchical & Density-Based Clustering
130 | - In SKLEARN, hierarchical clustering is available through the `AgglomerativeClustering` estimator, and density-based clustering through `DBSCAN`.
131 |
132 |
133 | > Hierarchical Clustering Example: a pizza company wants to cluster the locations of its customers in order to determine where it should open its new branches.
134 |
135 | 1. Hierarchical Single-link clustering:
136 | - Hierarchical Clustering results in a **structure of clusters** that gives us a visual indication of how clusters relate to each other.
137 |   - Step01: assume each pt is already a cluster and give each pt a label.
138 |   - Step02: calculate the distance b/w each pt and every other pt, then merge the pair with the smallest distance into a cluster. On the side, we grow the structure tree step by step (the dendrogram gives us additional insight into structure that the flat clustering result might miss).
139 |
140 |
141 |   - Single linkage looks only at the closest point to the cluster, which can result in clusters of various shapes; it is therefore more prone to producing elongated shapes that are not necessarily compact or circular.
142 | - Single and complete linkage follow merging heuristics that involve mainly one point. They do not pay much attention to in-cluster variance.
143 | - Ward's method does try to minimize the variance resulting in each merging step by merging clusters that lead to the least increase in variance in the clusters after merging.
144 |
145 | 2. Hierarchical Complete-link clustering:....
146 | 3. Hierarchical Average-link clustering:....
147 | 4. Ward's Method:....
148 | ```
149 | from sklearn.cluster import AgglomerativeClustering
150 |
151 | # Ward is the default linkage algorithm...
152 | ward = AgglomerativeClustering(n_clusters=3)
153 | ward_pred = ward.fit_predict(df)
154 |
155 | # using complete linkage
156 | complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
157 | # Fit & predict
158 | complete_pred = complete.fit_predict(df)
159 |
160 | # using average linkage
161 | avg = AgglomerativeClustering(n_clusters=3, linkage="average")
162 | # Fit & predict
163 | avg_pred = avg.fit_predict(df)
164 | ```
165 | To determine which clustering result better matches the original labels of the samples, we can use `adjusted_rand_score`, an external cluster validation index that yields a score between -1 and 1, where 1 means the two clusterings are identical in how they grouped the samples (regardless of which label is assigned to each cluster). Which linkage results in the highest Adjusted Rand Score?
166 | ```
167 | from sklearn.metrics import adjusted_rand_score
168 |
169 | ward_ar_score = adjusted_rand_score(df.label, ward_pred)
170 | complete_ar_score = adjusted_rand_score(df.label, complete_pred)
171 | avg_ar_score = adjusted_rand_score(df.label, avg_pred)
172 |
173 | print( "Scores: \nWard:", ward_ar_score,"\nComplete: ", complete_ar_score, "\nAverage: ", avg_ar_score)
174 | ```
175 | Sometimes a column has much smaller values than the rest of the columns, so its variance counts for less in the clustering process (since clustering is based on distance). We rescale the dataset so that each dimension lies between 0 and 1, giving every feature equal weight in the clustering process. **This is done by subtracting the column minimum from each value and then dividing by the column range (max - min).** Would clustering the dataset after this transformation lead to a better clustering?
176 | ```
177 | from sklearn import preprocessing
178 | normalized_X = preprocessing.minmax_scale(df)  # scale each column to [0, 1] as described above
179 | ```
180 | To visualize the highest-scoring clustering result, we'll need to use Scipy's linkage function to perform the clustering again, so we can obtain the linkage matrix Scipy will later use to visualize the hierarchy.
181 | ```
182 | # Import scipy's linkage function to conduct the clustering
183 | from scipy.cluster.hierarchy import linkage
184 |
185 | # Pick the one that resulted in the highest Adjusted Rand Score
186 | linkage_type = 'ward'
187 |
188 | linkage_matrix = linkage(normalized_X, linkage_type)
189 |
190 | from scipy.cluster.hierarchy import dendrogram
191 | import matplotlib.pyplot as plt
192 | plt.figure(figsize=(22,18))
193 | dendrogram(linkage_matrix)
194 |
195 | plt.show()
196 | ```
197 |
198 |
199 | 5. Density-Based Clustering:
200 | - DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups the points that are densely packed together and labels the remaining points as noise.
201 |   - Step01: it selects a point arbitrarily and looks at the neighborhood around it (within radius epsilon). If there are no other points, it is noise; if there are not enough points to make a cluster (min_points), it is also labeled noise for now.
202 |   - Step02: if we find enough points, we identify the 'core points' and 'border points' of a new cluster.
203 |   - Step03: continue examining the remaining points and create clusters.
204 |
205 |
206 | ```
207 | DBSCAN(df, epsilon, min_points):
208 | C = 0
209 | for each unvisited point P in df
210 | mark P as visited
211 | sphere_points = regionQuery(P, epsilon)
212 | if sizeof(sphere_points) < min_points
213 | ignore P
214 | else
215 | C = next cluster
216 | expandCluster(P, sphere_points, C, epsilon, min_points)
217 |
218 | expandCluster(P, sphere_points, C, epsilon, min_points):
219 | add P to cluster C
220 | for each point P’ in sphere_points
221 | if P’ is not visited
222 | mark P’ as visited
223 | sphere_points’ = regionQuery(P’, epsilon)
224 | if sizeof(sphere_points’) >= min_points
225 | sphere_points = sphere_points joined with sphere_points’
226 | if P’ is not yet member of any cluster
227 | add P’ to cluster C
228 |
229 | regionQuery(P, epsilon):
230 | return all points within the n-dimensional sphere centered at P with radius epsilon (including P)
231 | ```
232 | #### Python implementation
233 | ```
234 | import numpy
235 | import scipy.spatial.distance
237 | import matplotlib.pyplot as plt
238 |
239 |
240 |
241 | def set2List(NumpyArray):
242 | list = []
243 | for item in NumpyArray:
244 | list.append(item.tolist())
245 | return list
246 |
247 |
248 | def GenerateData():
249 | x1=numpy.random.randn(50,2)
250 | x2x=numpy.random.randn(80,1)+12
251 | x2y=numpy.random.randn(80,1)
252 | x2=numpy.column_stack((x2x,x2y))
253 | x3=numpy.random.randn(100,2)+8
254 | x4=numpy.random.randn(120,2)+15
255 | z=numpy.concatenate((x1,x2,x3,x4))
256 | return z
257 |
258 |
259 | def DBSCAN(Dataset, Epsilon,MinumumPoints,DistanceMethod = 'euclidean'):
260 | # Dataset is a mxn matrix, m is number of item and n is the dimension of data
261 | m,n=Dataset.shape
262 | Visited=numpy.zeros(m,'int')
263 | Type=numpy.zeros(m)
264 | # -1 noise, outlier
265 | # 0 border
266 | # 1 core
267 | ClustersList=[]
268 | Cluster=[]
269 | PointClusterNumber=numpy.zeros(m)
270 | PointClusterNumberIndex=1
271 | PointNeighbors=[]
272 | DistanceMatrix = scipy.spatial.distance.squareform(scipy.spatial.distance.pdist(Dataset, DistanceMethod))
273 |     for i in range(m):
274 |         if Visited[i]==0:
275 |             Visited[i]=1
276 |             PointNeighbors=numpy.where(DistanceMatrix[i]<Epsilon)[0]
277 |             if len(PointNeighbors)<MinumumPoints:
278 |                 Type[i]=-1   # not enough neighbors: mark as noise/outlier for now
279 |             else:
280 |                 Cluster=[i]
281 |                 PointClusterNumber[i]=PointClusterNumberIndex
282 |                 PointNeighbors=set2List(PointNeighbors)
283 |                 ExpandCluster(PointNeighbors, Cluster, MinumumPoints, Epsilon, Visited,
284 |                               DistanceMatrix, PointClusterNumber, PointClusterNumberIndex)
285 |                 ClustersList.append(Cluster[:])
286 |                 PointClusterNumberIndex=PointClusterNumberIndex+1
287 |     return PointClusterNumber
288 | 
289 | 
290 | def ExpandCluster(PointNeighbors, Cluster, MinumumPoints, Epsilon, Visited,
291 |                   DistanceMatrix, PointClusterNumber, PointClusterNumberIndex):
292 |     # Grow the current cluster outward from the seed point's neighborhood
293 |     for i in PointNeighbors:
294 |         if Visited[i]==0:
295 |             Visited[i]=1
296 |             Neighbors=numpy.where(DistanceMatrix[i]<Epsilon)[0]
297 |             if len(Neighbors)>=MinumumPoints:
298 |                 # Neighbors merge with PointNeighbors
299 |                 for j in Neighbors:
300 |                     try:
301 |                         PointNeighbors.index(j)
302 |                     except ValueError:
303 |                         PointNeighbors.append(j)
304 | 
305 |         if PointClusterNumber[i]==0:
306 |             Cluster.append(i)
307 |             PointClusterNumber[i]=PointClusterNumberIndex
308 |     return
317 | #Generating some data with normal distribution at
318 | #(0,0)
319 | #(8,8)
320 | #(12,0)
321 | #(15,15)
322 | Data=GenerateData()
323 |
324 | #Adding some noise with uniform distribution
325 | #X between [-3,17],
326 | #Y between [-3,17]
327 | noise=numpy.random.rand(50,2)*20 -3
328 |
329 | Noisy_Data=numpy.concatenate((Data,noise))
330 | size=20
331 |
332 |
333 | fig = plt.figure()
334 | ax1=fig.add_subplot(2,1,1) #row, column, figure number
335 | ax2 = fig.add_subplot(212)
336 |
337 | ax1.scatter(Data[:,0],Data[:,1], alpha = 0.5 )
338 | ax1.scatter(noise[:,0],noise[:,1],color='red' ,alpha = 0.5)
339 | ax2.scatter(noise[:,0],noise[:,1],color='red' ,alpha = 0.5)
340 |
341 |
342 | Epsilon=1
343 | MinumumPoints=20
344 | result =DBSCAN(Data,Epsilon,MinumumPoints)
345 |
346 | #printed numbers are cluster numbers
347 | print(result)
348 | #print("Noisy_Data")
349 | #print(Noisy_Data.shape)
350 | #print(Noisy_Data)
351 |
352 | for i in range(len(result)):
353 | ax2.scatter(Noisy_Data[i][0],Noisy_Data[i][1],color='yellow' ,alpha = 0.5)
354 |
355 | plt.show()
356 |
357 | ```
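For everyday use, sklearn also ships a `DBSCAN` estimator that replaces the hand-rolled implementation above. A minimal sketch; the `eps` and `min_samples` values simply mirror the Epsilon/MinumumPoints used above and would normally be tuned:
```
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=1.0, min_samples=20)
labels = db.fit_predict(Noisy_Data)   # label -1 marks noise points
```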
358 |
359 |
360 | ---
361 | ## B. Model-Based Clustering(Gaussian Mixture)
362 | ### Wow, several datasets were hacked and mixed up..How to retrieve the originals?
363 |
364 | [Assumption]: **Each cluster follows a certain statistical distribution**.
365 | - In one dimension
366 |
367 |
368 | - In two dimensions
369 |
370 |
371 | ### EM(Expectation Maximization) Algorithm for Gaussian Mixture
372 |
373 |
374 | - Step_01. Initialization of the distributions
375 | - > give them the initial values(`mean`, `var`) for each of the two suspected clusters.
376 |   - A common choice: run 'k-means' on the dataset to get rough clusters, or simply initialize randomly.
377 | - It is indeed important that we are careful in **choosing the parameters of the initial Gaussians**. That has a significant effect on the quality of EM's result.
378 |
379 |
380 | - Step_02. **Expectation**: soft-clustering of the data-pt with probabilities
381 |   - > let's say we have 'n' points, each with a value for every feature. Now we need to calculate the membership (probability) of each pt in each cluster.
382 |   - How to determine the membership? Evaluate each cluster's Gaussian density at the pt (using that cluster's mean and var) and normalize across clusters.
383 |
384 |
385 | - Step_03. **Maximization**: estimate the new **parameters** of the Gaussians, using the `weighted mean & variance`
386 |   - > the `new mean` for cluster_A, given the result of Step_02 (transient memberships), comes from calculating the **weighted mean** of all of the points, weighted by those memberships.
387 |   - the weighted mean does not only account for the value of each pt, but also for how much the pt belongs to the cluster.
388 |   - > the `new var` for cluster_A, given the result of Step_02 (transient memberships), comes from calculating the **weighted VAR** of all of the points, weighted by those memberships.
389 |
390 |
391 | - Step_04. **Evaluation**: compare (overlay) the new Gaussians with the old ones, and iterate Step_02 and Step_03 until convergence (the parameters barely move).
392 |   - > Evaluate the `log-likelihood`, which sums over all clusters and all points.
393 |   - the higher the value, the more sure we are that the mixture model fits our dataset.
394 |   - the purpose is to **maximize** this value by re-choosing the parameters (the mixing coefficient, mean, var) of each Gaussian again and again until the value converges, reaching a (local) maximum.
395 |   - What's the mixing coefficient? = the mixing proportions.. they scale the **height** (weight) of each component distribution.
396 |
397 |
398 | ```
399 | from sklearn import mixture
400 | gmm = mixture.GaussianMixture(n_components=3)
401 | gmm.fit(X)
402 | clustering = gmm.predict(X)
403 | ```
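After fitting, the quantities discussed above can be inspected directly (a minimal sketch; `X` is assumed to be the same array fitted above):
```
print(gmm.weights_)       # mixing coefficients (one per component, they sum to 1)
print(gmm.means_)         # estimated means
print(gmm.covariances_)   # estimated covariances
print(gmm.score(X))       # average log-likelihood of the data under the fitted mixture
print(gmm.converged_)     # True once the EM iterations have converged
```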
404 | https://www.youtube.com/watch?v=lLt9H6RFO6A
405 |
406 | http://www.ai.mit.edu/projects/vsam/Publications/stauffer_cvpr98_track.pdf
407 |
408 |
409 |
410 | ---
411 | ## C. Cluster Validation
412 |
413 |
414 | ### 1. External Indices
415 |
416 |
417 | - When we have the ground truth(answer-sheet or the labeled reference).
418 | - **ARI**(Adjusted Rand_Index) [-1 to 1]:
419 | - > Note: ARI does not care what label we assign a cluster, as long as the point assignment matches that of the ground truth.
420 |
421 |
422 | ### 2. Internal Indices
423 |
424 |
425 | - When we don't have the ground truth.
426 | - **Silhouette Coefficient** [-1 to 1]:
427 |   - There is a Silhouette Coefficient for each data-pt; we can average them per cluster or over the entire dataset to get a Silhouette score for the whole clustering.
428 |   - Silhouette is affected by `K` (No. of clusters).
429 |   - Silhouette is affected by the compactness and circularity of the clusters.
430 |   - > Note: for DBSCAN, we never use the Silhouette score (DBSCAN allows noise points and arbitrarily shaped clusters, while Silhouette rewards **compact, circular clusters**). Instead, we use **DBCV** for DBSCAN. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=83C3BD5E078B1444CB26E243975507E1?doi=10.1.1.707.9034&rep=rep1&type=pdf
431 |   - > Note: Hierarchical Clustering can carve out elongated or irregular clusters well, but Silhouette cannot recognize such shapes as good clusters.
432 |
433 | By 'K'
434 |
435 |
436 | By the 'shape' of the cluster
437 |
438 |
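A minimal sketch of computing the internal index with sklearn, assuming `X` is the feature matrix; the range of `K` values to compare is just an example:
```
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # pick the K with the highest average score
```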
444 | ---
445 | ## D. Dimensionality Reduction
446 | ### 1. Principal Component Analysis
447 |
448 |
449 | By hand, we can't easily come up with a coordinate system, shifted and rotated from the original, that captures the data in **one dimension**. PCA specializes in exactly these **'shifts'** and **'rotations'** of the coordinate system.
450 |
451 | If our given data is of any shape whatsoever, PCA finds a **new coordinate system** obtained from the original by translation or rotation.
452 | - It moves the **center** of the coordinate system with the center of the dataset.
453 | - It moves the X-axis into the principal axis of the variation where we see the **most variation** relative to all the data-pt.
454 | - It places the remaining axes along the orthogonal (less important) directions of variation.
455 |
456 | > **What defines the two principal directions (the two orthogonal vectors)?** This is how PCA kills dimensionality and multicollinearity...
457 | > - 1.Find the center of the dataset (the mean of each feature).
458 | > - 2.Find the two principal axes of variation (Eigenvectors).
459 | >   - The measure of the orthogonality: the 'dot-product' of these two vectors should be 'zero'.
460 | > - 3.Find the spread values (giving importance to our vectors) along the two axes (Eigenvalues).
461 |
462 |
463 | ## Compression while preserving as much information as possible!!!! Get rid of multicollinearity!!!!
464 | Let's say we have a large number of measurable features, but we know there is a small number of underlying **latent features** that contain most of the information. What's the best way to condense those features?
465 | # Each new variable is a linear combination of the original features! And the game changer is the Cov-matrix!!
466 |
467 |
468 | - How to find the principal component, i.e. the direction capturing the maximal variance (the corresponding Eigenvector of the Cov_matrix)?
469 |   - the amount of **information loss** is the sum of squared distances b/w each pt and its projection onto the component line (the new transformed values), so we look for the component line that minimizes this loss; it turns out to be the top Eigenvector of the pxp **Cov-Matrix**.
470 |
471 |
472 | - How to get insight into **which features** drive the most impact (the major pattern is captured by the direction with the largest Eigenvalue of the Cov-matrix)?
473 |
474 |
475 |
476 |
477 |
478 |
479 | - [Usage]
480 | - When we want to examine **latent features** driving the patterns in our complex data
481 | - Dimensionality Reduction
482 |   - Visualizing high-dimensional data (e.g. projecting the data down to the first one or two PC-lines and plotting the results as scatters, then using K-means)
483 |   - Reducing **noise** by discarding unimportant PCs.
484 |   - Pre-processing before using other algorithms by reducing the dimensionality of the inputs.
485 |
486 | Ex> Facial recognition. Why is PCA a good fit here?
487 | - **Mega pixels:** pictures of human faces in general have high input dimensionality
488 | - **Eyes, nose, mouth:** Human faces have general patterns that could be captured in smaller number of dimensions.
489 | - In this example, the original dimensionality of the pic is: "1288 rows x 1850 features" plus "7 classes".
490 | ```
491 | from time import time
492 | import logging
493 | import pylab as pl
494 | import numpy as np
495 | from sklearn.model_selection import train_test_split
496 | from sklearn.datasets import fetch_lfw_people
497 | from sklearn.model_selection import GridSearchCV
498 | from sklearn.metrics import classification_report
499 | from sklearn.metrics import confusion_matrix
500 | from sklearn.decomposition import PCA  # RandomizedPCA is deprecated; use PCA(svd_solver='randomized') instead
502 | from sklearn.svm import SVC
503 |
504 | # Download the data, if not already on disk and load it as numpy arrays
505 | lfw_people = fetch_lfw_people('data', min_faces_per_person=70, resize=0.4)
506 |
507 | # introspect the images arrays to find the shapes (for plotting)
508 | n_samples, h, w = lfw_people.images.shape
509 | np.random.seed(42)
510 |
511 | # for machine learning we use the data directly (as relative pixel
512 | # position info is ignored by this model)
513 | X = lfw_people.data
514 | n_features = X.shape[1]
515 |
516 | # the label to predict is the id of the person
517 | y = lfw_people.target
518 | target_names = lfw_people.target_names
519 | n_classes = target_names.shape[0]
520 |
521 | print("n_samples: %d" % n_samples)
522 | print("n_features: %d" % n_features)
523 | print( "n_classes: %d" % n_classes)
524 | ```
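A minimal sketch of how the example usually continues: PCA compresses the 1850 pixel features into a handful of "eigenfaces" before a classifier is trained (the number of components and the SVC parameters here are assumptions, not tuned values):
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# project the faces onto a small number of "eigenfaces"
n_components = 150
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# train a classifier on the compressed representation
clf = SVC(kernel='rbf', class_weight='balanced', C=1000.0, gamma=0.005)
clf.fit(X_train_pca, y_train)
print(classification_report(y_test, clf.predict(X_test_pca), target_names=target_names))
```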
525 |
526 |
527 | ### Eigenvalue and Eigenvector
528 | A matrix is a linear transformation: it **maps a vector** to a new vector, changing its **magnitude** and **direction**. An Eigenvector is a special vector whose direction the matrix does not change; it only gets scaled by its Eigenvalue: `transformation matrix * Eigenvector = Eigenvalue * Eigenvector`.
529 |
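A tiny numpy sketch of this relationship (the symmetric 2x2 matrix is a made-up example):
```
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # a symmetric, covariance-like matrix
eigvals, eigvecs = np.linalg.eigh(A)    # eigenvalues in ascending order, eigenvectors in columns
v = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
print(A @ v)                            # equals eigvals[-1] * v : the direction is unchanged, only scaled
print(eigvals[-1] * v)
```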
530 |
531 | ### 2. RandomProjection
532 | - Computationally more efficient than PCA.
533 | - handle even more features than PCA (with a decrease in quality of projection, however.)
534 | - Premise
535 | - Simply reduce the size of dimensions in our dataset by **multiplying it by a random matrix**.
536 | - Where does the **'k'**(reduced dimensions) come from?
537 | - This algorithm cares especially about the distances b/w points.
538 |   - We have a certain level of guarantee (from the Johnson-Lindenstrauss lemma) that the distances will be somewhat distorted, but largely preserved (see the bound sketched below).
539 |   - the squared distance b/w two pts in the projection is squeezed to within a factor of (1 - eps) to (1 + eps) of the original squared distance.
540 | - The algorithm works either by setting the number of components we want (**'k'**) or by specifying a value for 'epsilon', from which a conservative value for **'k'** is calculated, and it returns a new, smaller dataset.
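A sketch of the Johnson-Lindenstrauss guarantee referred to above, for any two points `u`, `v` and their projection `p(.)` (this is the standard statement of the lemma, not something specific to this repo):
```
(1 - \varepsilon)\,\lVert u - v \rVert^2 \;\le\; \lVert p(u) - p(v) \rVert^2 \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2
```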
541 |
542 |
543 |
544 | ```
545 | from sklearn import random_projection
546 | rp = random_projection.SparseRandomProjection(n_components='auto', eps=0.1)
547 |
548 | new_X = rp.fit_transform(X)
549 | ```
550 | ### 3. Independent Component Analysis
551 | While PCA works to maximize 'var', ICA tries to isolate the independent sources that are mixed in the dataset.
552 | - EX> blind source separation: Restoring the original signals..
553 |
554 |
555 | - To recover the original sources `S`, ICA estimates the best unmixing matrix `W` such that `S = W * X`, where `X` is the observed mixture..
556 | - ICA assumes
557 | - the features are mixtures of independent sources
558 | - the components must have **non-Gaussian** distributions.
559 | - the Central_Limit_Theorem says the distribution of a sum of independent variables(or sample means) tends towards the Gaussian.
560 |
561 |
562 | ```
563 | from sklearn.decomposition import FastICA
564 | X = list(zip(signal_1, signal_2, signal_3))
565 | ica = FastICA(n_components=3)
566 |
567 | components = ica.fit_transform(X) ## here, these objects contain the independent components retrieved via ICA
568 | ```
569 |
570 | [Note]
571 | - 1.Let’s mix two random sources A and B. At each time, in the following plot(1), the value of A is the abscissa(x-axis) of the data point and the value of B is their ordinates(Y-axis).
572 | - 2.Let's take two linear mixtures of A and B and plot(2) these two new variables.
573 | - 3.Then if we whiten the two linear mixtures, we get plot(3)
574 |   - the variance on both axes is now equal
575 |   - the correlation of the projections of the data on the two axes is 0 (meaning that the covariance matrix is diagonal and that all the diagonal elements are equal).
576 |   - Then applying ICA only means to “rotate” this representation back to the original A and B axis space.
577 |   - The **whitening process** is simply a `linear change of coordinates` of the mixed data. Once the ICA solution is found in this “whitened” coordinate frame, we can easily reproject the ICA solution back into the original coordinate frame.
578 |   - **whitening** is basically a de-correlation transform that converts the covariance matrix into an identity matrix (see the numpy sketch below).
579 |
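A minimal numpy sketch of that whitening step (the Laplace sources and the 2x2 mixing matrix are made-up example values):
```
import numpy as np

rng = np.random.RandomState(0)
S = rng.laplace(size=(1000, 2))             # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])      # mixing matrix
X = S @ A.T                                 # observed (correlated) mixtures

# whitening: rotate and rescale so the covariance matrix becomes the identity
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_white = Xc @ eigvecs / np.sqrt(eigvals)

print(np.round(np.cov(X_white, rowvar=False), 2))   # ~ identity; ICA then only has to find the right rotation
```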
580 |
581 | We can imagine that ICA rotates the **whitened matrix** back to the original (A,B) space (first scatter plot above). It performs the rotation by **minimizing the Gaussianity of the data** projected on both axes (fixed-point ICA). For instance, in the example above, the projection on both axes of the whitened data is quite Gaussian (i.e., it looks like a bell-shaped curve). By contrast, the projection in the original A, B space is far from Gaussian.
582 | - By rotating the axes and minimizing the Gaussianity of the projection in the first scatter plot, ICA is able to recover the original sources, which are statistically independent (this property comes from the central limit theorem, which states that any linear mixture of 2 independent random variables is more Gaussian than the original variables).
583 | - the kurtosis function gives an indication of the Gaussianity of a distribution (but the fixed-point ICA algorithm uses a slightly different measure called negentropy).
584 |
585 |
586 | We dealt with only 2 dimensions here, but ICA can deal with an arbitrarily high number of dimensions. Consider 128 EEG electrodes, for instance. The signal recorded across all electrodes at each time point then constitutes a data point in a 128-dimensional space. After whitening the data, ICA will “rotate the 128 axes” in order to minimize the Gaussianity of the projection on every axis (note that, unlike PCA, the axes do not have to remain orthogonal). What we call an ICA component is the vector that projects the data from the initial space onto one of the axes found by ICA; the weight matrix is the full transformation from the original space.
587 |
588 |
--------------------------------------------------------------------------------
/customers.csv:
--------------------------------------------------------------------------------
1 | Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicatessen
2 | 2,3,12669,9656,7561,214,2674,1338
3 | 2,3,7057,9810,9568,1762,3293,1776
4 | 2,3,6353,8808,7684,2405,3516,7844
5 | 1,3,13265,1196,4221,6404,507,1788
6 | 2,3,22615,5410,7198,3915,1777,5185
7 | 2,3,9413,8259,5126,666,1795,1451
8 | 2,3,12126,3199,6975,480,3140,545
9 | 2,3,7579,4956,9426,1669,3321,2566
10 | 1,3,5963,3648,6192,425,1716,750
11 | 2,3,6006,11093,18881,1159,7425,2098
12 | 2,3,3366,5403,12974,4400,5977,1744
13 | 2,3,13146,1124,4523,1420,549,497
14 | 2,3,31714,12319,11757,287,3881,2931
15 | 2,3,21217,6208,14982,3095,6707,602
16 | 2,3,24653,9465,12091,294,5058,2168
17 | 1,3,10253,1114,3821,397,964,412
18 | 2,3,1020,8816,12121,134,4508,1080
19 | 1,3,5876,6157,2933,839,370,4478
20 | 2,3,18601,6327,10099,2205,2767,3181
21 | 1,3,7780,2495,9464,669,2518,501
22 | 2,3,17546,4519,4602,1066,2259,2124
23 | 1,3,5567,871,2010,3383,375,569
24 | 1,3,31276,1917,4469,9408,2381,4334
25 | 2,3,26373,36423,22019,5154,4337,16523
26 | 2,3,22647,9776,13792,2915,4482,5778
27 | 2,3,16165,4230,7595,201,4003,57
28 | 1,3,9898,961,2861,3151,242,833
29 | 1,3,14276,803,3045,485,100,518
30 | 2,3,4113,20484,25957,1158,8604,5206
31 | 1,3,43088,2100,2609,1200,1107,823
32 | 1,3,18815,3610,11107,1148,2134,2963
33 | 1,3,2612,4339,3133,2088,820,985
34 | 1,3,21632,1318,2886,266,918,405
35 | 1,3,29729,4786,7326,6130,361,1083
36 | 1,3,1502,1979,2262,425,483,395
37 | 2,3,688,5491,11091,833,4239,436
38 | 1,3,29955,4362,5428,1729,862,4626
39 | 2,3,15168,10556,12477,1920,6506,714
40 | 2,3,4591,15729,16709,33,6956,433
41 | 1,3,56159,555,902,10002,212,2916
42 | 1,3,24025,4332,4757,9510,1145,5864
43 | 1,3,19176,3065,5956,2033,2575,2802
44 | 2,3,10850,7555,14961,188,6899,46
45 | 2,3,630,11095,23998,787,9529,72
46 | 2,3,9670,7027,10471,541,4618,65
47 | 2,3,5181,22044,21531,1740,7353,4985
48 | 2,3,3103,14069,21955,1668,6792,1452
49 | 2,3,44466,54259,55571,7782,24171,6465
50 | 2,3,11519,6152,10868,584,5121,1476
51 | 2,3,4967,21412,28921,1798,13583,1163
52 | 1,3,6269,1095,1980,3860,609,2162
53 | 1,3,3347,4051,6996,239,1538,301
54 | 2,3,40721,3916,5876,532,2587,1278
55 | 2,3,491,10473,11532,744,5611,224
56 | 1,3,27329,1449,1947,2436,204,1333
57 | 1,3,5264,3683,5005,1057,2024,1130
58 | 2,3,4098,29892,26866,2616,17740,1340
59 | 2,3,5417,9933,10487,38,7572,1282
60 | 1,3,13779,1970,1648,596,227,436
61 | 1,3,6137,5360,8040,129,3084,1603
62 | 2,3,8590,3045,7854,96,4095,225
63 | 2,3,35942,38369,59598,3254,26701,2017
64 | 2,3,7823,6245,6544,4154,4074,964
65 | 2,3,9396,11601,15775,2896,7677,1295
66 | 1,3,4760,1227,3250,3724,1247,1145
67 | 2,3,85,20959,45828,36,24231,1423
68 | 1,3,9,1534,7417,175,3468,27
69 | 2,3,19913,6759,13462,1256,5141,834
70 | 1,3,2446,7260,3993,5870,788,3095
71 | 1,3,8352,2820,1293,779,656,144
72 | 1,3,16705,2037,3202,10643,116,1365
73 | 1,3,18291,1266,21042,5373,4173,14472
74 | 1,3,4420,5139,2661,8872,1321,181
75 | 2,3,19899,5332,8713,8132,764,648
76 | 2,3,8190,6343,9794,1285,1901,1780
77 | 1,3,20398,1137,3,4407,3,975
78 | 1,3,717,3587,6532,7530,529,894
79 | 2,3,12205,12697,28540,869,12034,1009
80 | 1,3,10766,1175,2067,2096,301,167
81 | 1,3,1640,3259,3655,868,1202,1653
82 | 1,3,7005,829,3009,430,610,529
83 | 2,3,219,9540,14403,283,7818,156
84 | 2,3,10362,9232,11009,737,3537,2342
85 | 1,3,20874,1563,1783,2320,550,772
86 | 2,3,11867,3327,4814,1178,3837,120
87 | 2,3,16117,46197,92780,1026,40827,2944
88 | 2,3,22925,73498,32114,987,20070,903
89 | 1,3,43265,5025,8117,6312,1579,14351
90 | 1,3,7864,542,4042,9735,165,46
91 | 1,3,24904,3836,5330,3443,454,3178
92 | 1,3,11405,596,1638,3347,69,360
93 | 1,3,12754,2762,2530,8693,627,1117
94 | 2,3,9198,27472,32034,3232,18906,5130
95 | 1,3,11314,3090,2062,35009,71,2698
96 | 2,3,5626,12220,11323,206,5038,244
97 | 1,3,3,2920,6252,440,223,709
98 | 2,3,23,2616,8118,145,3874,217
99 | 1,3,403,254,610,774,54,63
100 | 1,3,503,112,778,895,56,132
101 | 1,3,9658,2182,1909,5639,215,323
102 | 2,3,11594,7779,12144,3252,8035,3029
103 | 2,3,1420,10810,16267,1593,6766,1838
104 | 2,3,2932,6459,7677,2561,4573,1386
105 | 1,3,56082,3504,8906,18028,1480,2498
106 | 1,3,14100,2132,3445,1336,1491,548
107 | 1,3,15587,1014,3970,910,139,1378
108 | 2,3,1454,6337,10704,133,6830,1831
109 | 2,3,8797,10646,14886,2471,8969,1438
110 | 2,3,1531,8397,6981,247,2505,1236
111 | 2,3,1406,16729,28986,673,836,3
112 | 1,3,11818,1648,1694,2276,169,1647
113 | 2,3,12579,11114,17569,805,6457,1519
114 | 1,3,19046,2770,2469,8853,483,2708
115 | 1,3,14438,2295,1733,3220,585,1561
116 | 1,3,18044,1080,2000,2555,118,1266
117 | 1,3,11134,793,2988,2715,276,610
118 | 1,3,11173,2521,3355,1517,310,222
119 | 1,3,6990,3880,5380,1647,319,1160
120 | 1,3,20049,1891,2362,5343,411,933
121 | 1,3,8258,2344,2147,3896,266,635
122 | 1,3,17160,1200,3412,2417,174,1136
123 | 1,3,4020,3234,1498,2395,264,255
124 | 1,3,12212,201,245,1991,25,860
125 | 2,3,11170,10769,8814,2194,1976,143
126 | 1,3,36050,1642,2961,4787,500,1621
127 | 1,3,76237,3473,7102,16538,778,918
128 | 1,3,19219,1840,1658,8195,349,483
129 | 2,3,21465,7243,10685,880,2386,2749
130 | 1,3,140,8847,3823,142,1062,3
131 | 1,3,42312,926,1510,1718,410,1819
132 | 1,3,7149,2428,699,6316,395,911
133 | 1,3,2101,589,314,346,70,310
134 | 1,3,14903,2032,2479,576,955,328
135 | 1,3,9434,1042,1235,436,256,396
136 | 1,3,7388,1882,2174,720,47,537
137 | 1,3,6300,1289,2591,1170,199,326
138 | 1,3,4625,8579,7030,4575,2447,1542
139 | 1,3,3087,8080,8282,661,721,36
140 | 1,3,13537,4257,5034,155,249,3271
141 | 1,3,5387,4979,3343,825,637,929
142 | 1,3,17623,4280,7305,2279,960,2616
143 | 1,3,30379,13252,5189,321,51,1450
144 | 1,3,37036,7152,8253,2995,20,3
145 | 1,3,10405,1596,1096,8425,399,318
146 | 1,3,18827,3677,1988,118,516,201
147 | 2,3,22039,8384,34792,42,12591,4430
148 | 1,3,7769,1936,2177,926,73,520
149 | 1,3,9203,3373,2707,1286,1082,526
150 | 1,3,5924,584,542,4052,283,434
151 | 1,3,31812,1433,1651,800,113,1440
152 | 1,3,16225,1825,1765,853,170,1067
153 | 1,3,1289,3328,2022,531,255,1774
154 | 1,3,18840,1371,3135,3001,352,184
155 | 1,3,3463,9250,2368,779,302,1627
156 | 1,3,622,55,137,75,7,8
157 | 2,3,1989,10690,19460,233,11577,2153
158 | 2,3,3830,5291,14855,317,6694,3182
159 | 1,3,17773,1366,2474,3378,811,418
160 | 2,3,2861,6570,9618,930,4004,1682
161 | 2,3,355,7704,14682,398,8077,303
162 | 2,3,1725,3651,12822,824,4424,2157
163 | 1,3,12434,540,283,1092,3,2233
164 | 1,3,15177,2024,3810,2665,232,610
165 | 2,3,5531,15726,26870,2367,13726,446
166 | 2,3,5224,7603,8584,2540,3674,238
167 | 2,3,15615,12653,19858,4425,7108,2379
168 | 2,3,4822,6721,9170,993,4973,3637
169 | 1,3,2926,3195,3268,405,1680,693
170 | 1,3,5809,735,803,1393,79,429
171 | 1,3,5414,717,2155,2399,69,750
172 | 2,3,260,8675,13430,1116,7015,323
173 | 2,3,200,25862,19816,651,8773,6250
174 | 1,3,955,5479,6536,333,2840,707
175 | 2,3,514,7677,19805,937,9836,716
176 | 1,3,286,1208,5241,2515,153,1442
177 | 2,3,2343,7845,11874,52,4196,1697
178 | 1,3,45640,6958,6536,7368,1532,230
179 | 1,3,12759,7330,4533,1752,20,2631
180 | 1,3,11002,7075,4945,1152,120,395
181 | 1,3,3157,4888,2500,4477,273,2165
182 | 1,3,12356,6036,8887,402,1382,2794
183 | 1,3,112151,29627,18148,16745,4948,8550
184 | 1,3,694,8533,10518,443,6907,156
185 | 1,3,36847,43950,20170,36534,239,47943
186 | 1,3,327,918,4710,74,334,11
187 | 1,3,8170,6448,1139,2181,58,247
188 | 1,3,3009,521,854,3470,949,727
189 | 1,3,2438,8002,9819,6269,3459,3
190 | 2,3,8040,7639,11687,2758,6839,404
191 | 2,3,834,11577,11522,275,4027,1856
192 | 1,3,16936,6250,1981,7332,118,64
193 | 1,3,13624,295,1381,890,43,84
194 | 1,3,5509,1461,2251,547,187,409
195 | 2,3,180,3485,20292,959,5618,666
196 | 1,3,7107,1012,2974,806,355,1142
197 | 1,3,17023,5139,5230,7888,330,1755
198 | 1,1,30624,7209,4897,18711,763,2876
199 | 2,1,2427,7097,10391,1127,4314,1468
200 | 1,1,11686,2154,6824,3527,592,697
201 | 1,1,9670,2280,2112,520,402,347
202 | 2,1,3067,13240,23127,3941,9959,731
203 | 2,1,4484,14399,24708,3549,14235,1681
204 | 1,1,25203,11487,9490,5065,284,6854
205 | 1,1,583,685,2216,469,954,18
206 | 1,1,1956,891,5226,1383,5,1328
207 | 2,1,1107,11711,23596,955,9265,710
208 | 1,1,6373,780,950,878,288,285
209 | 2,1,2541,4737,6089,2946,5316,120
210 | 1,1,1537,3748,5838,1859,3381,806
211 | 2,1,5550,12729,16767,864,12420,797
212 | 1,1,18567,1895,1393,1801,244,2100
213 | 2,1,12119,28326,39694,4736,19410,2870
214 | 1,1,7291,1012,2062,1291,240,1775
215 | 1,1,3317,6602,6861,1329,3961,1215
216 | 2,1,2362,6551,11364,913,5957,791
217 | 1,1,2806,10765,15538,1374,5828,2388
218 | 2,1,2532,16599,36486,179,13308,674
219 | 1,1,18044,1475,2046,2532,130,1158
220 | 2,1,18,7504,15205,1285,4797,6372
221 | 1,1,4155,367,1390,2306,86,130
222 | 1,1,14755,899,1382,1765,56,749
223 | 1,1,5396,7503,10646,91,4167,239
224 | 1,1,5041,1115,2856,7496,256,375
225 | 2,1,2790,2527,5265,5612,788,1360
226 | 1,1,7274,659,1499,784,70,659
227 | 1,1,12680,3243,4157,660,761,786
228 | 2,1,20782,5921,9212,1759,2568,1553
229 | 1,1,4042,2204,1563,2286,263,689
230 | 1,1,1869,577,572,950,4762,203
231 | 1,1,8656,2746,2501,6845,694,980
232 | 2,1,11072,5989,5615,8321,955,2137
233 | 1,1,2344,10678,3828,1439,1566,490
234 | 1,1,25962,1780,3838,638,284,834
235 | 1,1,964,4984,3316,937,409,7
236 | 1,1,15603,2703,3833,4260,325,2563
237 | 1,1,1838,6380,2824,1218,1216,295
238 | 1,1,8635,820,3047,2312,415,225
239 | 1,1,18692,3838,593,4634,28,1215
240 | 1,1,7363,475,585,1112,72,216
241 | 1,1,47493,2567,3779,5243,828,2253
242 | 1,1,22096,3575,7041,11422,343,2564
243 | 1,1,24929,1801,2475,2216,412,1047
244 | 1,1,18226,659,2914,3752,586,578
245 | 1,1,11210,3576,5119,561,1682,2398
246 | 1,1,6202,7775,10817,1183,3143,1970
247 | 2,1,3062,6154,13916,230,8933,2784
248 | 1,1,8885,2428,1777,1777,430,610
249 | 1,1,13569,346,489,2077,44,659
250 | 1,1,15671,5279,2406,559,562,572
251 | 1,1,8040,3795,2070,6340,918,291
252 | 1,1,3191,1993,1799,1730,234,710
253 | 2,1,6134,23133,33586,6746,18594,5121
254 | 1,1,6623,1860,4740,7683,205,1693
255 | 1,1,29526,7961,16966,432,363,1391
256 | 1,1,10379,17972,4748,4686,1547,3265
257 | 1,1,31614,489,1495,3242,111,615
258 | 1,1,11092,5008,5249,453,392,373
259 | 1,1,8475,1931,1883,5004,3593,987
260 | 1,1,56083,4563,2124,6422,730,3321
261 | 1,1,53205,4959,7336,3012,967,818
262 | 1,1,9193,4885,2157,327,780,548
263 | 1,1,7858,1110,1094,6818,49,287
264 | 1,1,23257,1372,1677,982,429,655
265 | 1,1,2153,1115,6684,4324,2894,411
266 | 2,1,1073,9679,15445,61,5980,1265
267 | 1,1,5909,23527,13699,10155,830,3636
268 | 2,1,572,9763,22182,2221,4882,2563
269 | 1,1,20893,1222,2576,3975,737,3628
270 | 2,1,11908,8053,19847,1069,6374,698
271 | 1,1,15218,258,1138,2516,333,204
272 | 1,1,4720,1032,975,5500,197,56
273 | 1,1,2083,5007,1563,1120,147,1550
274 | 1,1,514,8323,6869,529,93,1040
275 | 1,3,36817,3045,1493,4802,210,1824
276 | 1,3,894,1703,1841,744,759,1153
277 | 1,3,680,1610,223,862,96,379
278 | 1,3,27901,3749,6964,4479,603,2503
279 | 1,3,9061,829,683,16919,621,139
280 | 1,3,11693,2317,2543,5845,274,1409
281 | 2,3,17360,6200,9694,1293,3620,1721
282 | 1,3,3366,2884,2431,977,167,1104
283 | 2,3,12238,7108,6235,1093,2328,2079
284 | 1,3,49063,3965,4252,5970,1041,1404
285 | 1,3,25767,3613,2013,10303,314,1384
286 | 1,3,68951,4411,12609,8692,751,2406
287 | 1,3,40254,640,3600,1042,436,18
288 | 1,3,7149,2247,1242,1619,1226,128
289 | 1,3,15354,2102,2828,8366,386,1027
290 | 1,3,16260,594,1296,848,445,258
291 | 1,3,42786,286,471,1388,32,22
292 | 1,3,2708,2160,2642,502,965,1522
293 | 1,3,6022,3354,3261,2507,212,686
294 | 1,3,2838,3086,4329,3838,825,1060
295 | 2,2,3996,11103,12469,902,5952,741
296 | 1,2,21273,2013,6550,909,811,1854
297 | 2,2,7588,1897,5234,417,2208,254
298 | 1,2,19087,1304,3643,3045,710,898
299 | 2,2,8090,3199,6986,1455,3712,531
300 | 2,2,6758,4560,9965,934,4538,1037
301 | 1,2,444,879,2060,264,290,259
302 | 2,2,16448,6243,6360,824,2662,2005
303 | 2,2,5283,13316,20399,1809,8752,172
304 | 2,2,2886,5302,9785,364,6236,555
305 | 2,2,2599,3688,13829,492,10069,59
306 | 2,2,161,7460,24773,617,11783,2410
307 | 2,2,243,12939,8852,799,3909,211
308 | 2,2,6468,12867,21570,1840,7558,1543
309 | 1,2,17327,2374,2842,1149,351,925
310 | 1,2,6987,1020,3007,416,257,656
311 | 2,2,918,20655,13567,1465,6846,806
312 | 1,2,7034,1492,2405,12569,299,1117
313 | 1,2,29635,2335,8280,3046,371,117
314 | 2,2,2137,3737,19172,1274,17120,142
315 | 1,2,9784,925,2405,4447,183,297
316 | 1,2,10617,1795,7647,1483,857,1233
317 | 2,2,1479,14982,11924,662,3891,3508
318 | 1,2,7127,1375,2201,2679,83,1059
319 | 1,2,1182,3088,6114,978,821,1637
320 | 1,2,11800,2713,3558,2121,706,51
321 | 2,2,9759,25071,17645,1128,12408,1625
322 | 1,2,1774,3696,2280,514,275,834
323 | 1,2,9155,1897,5167,2714,228,1113
324 | 1,2,15881,713,3315,3703,1470,229
325 | 1,2,13360,944,11593,915,1679,573
326 | 1,2,25977,3587,2464,2369,140,1092
327 | 1,2,32717,16784,13626,60869,1272,5609
328 | 1,2,4414,1610,1431,3498,387,834
329 | 1,2,542,899,1664,414,88,522
330 | 1,2,16933,2209,3389,7849,210,1534
331 | 1,2,5113,1486,4583,5127,492,739
332 | 1,2,9790,1786,5109,3570,182,1043
333 | 2,2,11223,14881,26839,1234,9606,1102
334 | 1,2,22321,3216,1447,2208,178,2602
335 | 2,2,8565,4980,67298,131,38102,1215
336 | 2,2,16823,928,2743,11559,332,3486
337 | 2,2,27082,6817,10790,1365,4111,2139
338 | 1,2,13970,1511,1330,650,146,778
339 | 1,2,9351,1347,2611,8170,442,868
340 | 1,2,3,333,7021,15601,15,550
341 | 1,2,2617,1188,5332,9584,573,1942
342 | 2,3,381,4025,9670,388,7271,1371
343 | 2,3,2320,5763,11238,767,5162,2158
344 | 1,3,255,5758,5923,349,4595,1328
345 | 2,3,1689,6964,26316,1456,15469,37
346 | 1,3,3043,1172,1763,2234,217,379
347 | 1,3,1198,2602,8335,402,3843,303
348 | 2,3,2771,6939,15541,2693,6600,1115
349 | 2,3,27380,7184,12311,2809,4621,1022
350 | 1,3,3428,2380,2028,1341,1184,665
351 | 2,3,5981,14641,20521,2005,12218,445
352 | 1,3,3521,1099,1997,1796,173,995
353 | 2,3,1210,10044,22294,1741,12638,3137
354 | 1,3,608,1106,1533,830,90,195
355 | 2,3,117,6264,21203,228,8682,1111
356 | 1,3,14039,7393,2548,6386,1333,2341
357 | 1,3,190,727,2012,245,184,127
358 | 1,3,22686,134,218,3157,9,548
359 | 2,3,37,1275,22272,137,6747,110
360 | 1,3,759,18664,1660,6114,536,4100
361 | 1,3,796,5878,2109,340,232,776
362 | 1,3,19746,2872,2006,2601,468,503
363 | 1,3,4734,607,864,1206,159,405
364 | 1,3,2121,1601,2453,560,179,712
365 | 1,3,4627,997,4438,191,1335,314
366 | 1,3,2615,873,1524,1103,514,468
367 | 2,3,4692,6128,8025,1619,4515,3105
368 | 1,3,9561,2217,1664,1173,222,447
369 | 1,3,3477,894,534,1457,252,342
370 | 1,3,22335,1196,2406,2046,101,558
371 | 1,3,6211,337,683,1089,41,296
372 | 2,3,39679,3944,4955,1364,523,2235
373 | 1,3,20105,1887,1939,8164,716,790
374 | 1,3,3884,3801,1641,876,397,4829
375 | 2,3,15076,6257,7398,1504,1916,3113
376 | 1,3,6338,2256,1668,1492,311,686
377 | 1,3,5841,1450,1162,597,476,70
378 | 2,3,3136,8630,13586,5641,4666,1426
379 | 1,3,38793,3154,2648,1034,96,1242
380 | 1,3,3225,3294,1902,282,68,1114
381 | 2,3,4048,5164,10391,130,813,179
382 | 1,3,28257,944,2146,3881,600,270
383 | 1,3,17770,4591,1617,9927,246,532
384 | 1,3,34454,7435,8469,2540,1711,2893
385 | 1,3,1821,1364,3450,4006,397,361
386 | 1,3,10683,21858,15400,3635,282,5120
387 | 1,3,11635,922,1614,2583,192,1068
388 | 1,3,1206,3620,2857,1945,353,967
389 | 1,3,20918,1916,1573,1960,231,961
390 | 1,3,9785,848,1172,1677,200,406
391 | 1,3,9385,1530,1422,3019,227,684
392 | 1,3,3352,1181,1328,5502,311,1000
393 | 1,3,2647,2761,2313,907,95,1827
394 | 1,3,518,4180,3600,659,122,654
395 | 1,3,23632,6730,3842,8620,385,819
396 | 1,3,12377,865,3204,1398,149,452
397 | 1,3,9602,1316,1263,2921,841,290
398 | 2,3,4515,11991,9345,2644,3378,2213
399 | 1,3,11535,1666,1428,6838,64,743
400 | 1,3,11442,1032,582,5390,74,247
401 | 1,3,9612,577,935,1601,469,375
402 | 1,3,4446,906,1238,3576,153,1014
403 | 1,3,27167,2801,2128,13223,92,1902
404 | 1,3,26539,4753,5091,220,10,340
405 | 1,3,25606,11006,4604,127,632,288
406 | 1,3,18073,4613,3444,4324,914,715
407 | 1,3,6884,1046,1167,2069,593,378
408 | 1,3,25066,5010,5026,9806,1092,960
409 | 2,3,7362,12844,18683,2854,7883,553
410 | 2,3,8257,3880,6407,1646,2730,344
411 | 1,3,8708,3634,6100,2349,2123,5137
412 | 1,3,6633,2096,4563,1389,1860,1892
413 | 1,3,2126,3289,3281,1535,235,4365
414 | 1,3,97,3605,12400,98,2970,62
415 | 1,3,4983,4859,6633,17866,912,2435
416 | 1,3,5969,1990,3417,5679,1135,290
417 | 2,3,7842,6046,8552,1691,3540,1874
418 | 2,3,4389,10940,10908,848,6728,993
419 | 1,3,5065,5499,11055,364,3485,1063
420 | 2,3,660,8494,18622,133,6740,776
421 | 1,3,8861,3783,2223,633,1580,1521
422 | 1,3,4456,5266,13227,25,6818,1393
423 | 2,3,17063,4847,9053,1031,3415,1784
424 | 1,3,26400,1377,4172,830,948,1218
425 | 2,3,17565,3686,4657,1059,1803,668
426 | 2,3,16980,2884,12232,874,3213,249
427 | 1,3,11243,2408,2593,15348,108,1886
428 | 1,3,13134,9347,14316,3141,5079,1894
429 | 1,3,31012,16687,5429,15082,439,1163
430 | 1,3,3047,5970,4910,2198,850,317
431 | 1,3,8607,1750,3580,47,84,2501
432 | 1,3,3097,4230,16483,575,241,2080
433 | 1,3,8533,5506,5160,13486,1377,1498
434 | 1,3,21117,1162,4754,269,1328,395
435 | 1,3,1982,3218,1493,1541,356,1449
436 | 1,3,16731,3922,7994,688,2371,838
437 | 1,3,29703,12051,16027,13135,182,2204
438 | 1,3,39228,1431,764,4510,93,2346
439 | 2,3,14531,15488,30243,437,14841,1867
440 | 1,3,10290,1981,2232,1038,168,2125
441 | 1,3,2787,1698,2510,65,477,52
442 |
--------------------------------------------------------------------------------
/visuals.py:
--------------------------------------------------------------------------------
1 | ###########################################
2 | # Suppress matplotlib user warnings
3 | # Necessary for newer version of matplotlib
4 | import warnings
5 | warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")
6 | #
7 | # Display inline matplotlib plots with IPython
8 | from IPython import get_ipython
9 | get_ipython().run_line_magic('matplotlib', 'inline')
10 | ###########################################
11 |
12 | import matplotlib.pyplot as plt
13 | import matplotlib.cm as cm
14 | import pandas as pd
15 | import numpy as np
16 |
17 | def pca_results(good_data, pca):
18 | '''
19 | Create a DataFrame of the PCA results
20 | Includes dimension feature weights and explained variance
21 | Visualizes the PCA results
22 | '''
23 |
24 | # Dimension indexing
25 |     dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]
26 |
27 | # PCA components
28 | components = pd.DataFrame(np.round(pca.components_, 4), columns = list(good_data.keys()))
29 | components.index = dimensions
30 |
31 | # PCA explained variance
32 | ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
33 | variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
34 | variance_ratios.index = dimensions
35 |
36 | # Create a bar plot visualization
37 | fig, ax = plt.subplots(figsize = (14,8))
38 |
39 | # Plot the feature weights as a function of the components
40 | components.plot(ax = ax, kind = 'bar');
41 | ax.set_ylabel("Feature Weights")
42 | ax.set_xticklabels(dimensions, rotation=0)
43 |
44 |
45 | # Display the explained variance ratios
46 | for i, ev in enumerate(pca.explained_variance_ratio_):
47 | ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n %.4f"%(ev))
48 |
49 | # Return a concatenated DataFrame
50 | return pd.concat([variance_ratios, components], axis = 1)
51 |
52 | def cluster_results(reduced_data, preds, centers, pca_samples):
53 | '''
54 | Visualizes the PCA-reduced cluster data in two dimensions
55 | Adds cues for cluster centers and student-selected sample data
56 | '''
57 |
58 | predictions = pd.DataFrame(preds, columns = ['Cluster'])
59 | plot_data = pd.concat([predictions, reduced_data], axis = 1)
60 |
61 | # Generate the cluster plot
62 | fig, ax = plt.subplots(figsize = (14,8))
63 |
64 | # Color map
65 | cmap = cm.get_cmap('gist_rainbow')
66 |
67 | # Color the points based on assigned cluster
68 | for i, cluster in plot_data.groupby('Cluster'):
69 | cluster.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \
70 | color = cmap((i)*1.0/(len(centers)-1)), label = 'Cluster %i'%(i), s=30);
71 |
72 | # Plot centers with indicators
73 | for i, c in enumerate(centers):
74 | ax.scatter(x = c[0], y = c[1], color = 'white', edgecolors = 'black', \
75 | alpha = 1, linewidth = 2, marker = 'o', s=200);
76 | ax.scatter(x = c[0], y = c[1], marker='$%d$'%(i), alpha = 1, s=100);
77 |
78 | # Plot transformed sample points
79 | ax.scatter(x = pca_samples[:,0], y = pca_samples[:,1], \
80 | s = 150, linewidth = 4, color = 'black', marker = 'x');
81 |
82 | # Set plot title
83 | ax.set_title("Cluster Learning on PCA-Reduced Data - Centroids Marked by Number\nTransformed Sample Data Marked by Black Cross");
84 |
85 |
86 | def biplot(good_data, reduced_data, pca):
87 | '''
88 | Produce a biplot that shows a scatterplot of the reduced
89 | data and the projections of the original features.
90 |
91 | good_data: original data, before transformation.
92 | Needs to be a pandas dataframe with valid column names
93 | reduced_data: the reduced data (the first two dimensions are plotted)
94 | pca: pca object that contains the components_ attribute
95 |
96 | return: a matplotlib AxesSubplot object (for any additional customization)
97 |
98 | This procedure is inspired by the script:
99 | https://github.com/teddyroland/python-biplot
100 | '''
101 |
102 | fig, ax = plt.subplots(figsize = (14,8))
103 | # scatterplot of the reduced data
104 | ax.scatter(x=reduced_data.loc[:, 'Dimension 1'], y=reduced_data.loc[:, 'Dimension 2'],
105 | facecolors='b', edgecolors='b', s=70, alpha=0.5)
106 |
107 | feature_vectors = pca.components_.T
108 |
109 | # we use scaling factors to make the arrows easier to see
110 | arrow_size, text_pos = 7.0, 8.0,
111 |
112 | # projections of the original features
113 | for i, v in enumerate(feature_vectors):
114 | ax.arrow(0, 0, arrow_size*v[0], arrow_size*v[1],
115 | head_width=0.2, head_length=0.2, linewidth=2, color='red')
116 | ax.text(v[0]*text_pos, v[1]*text_pos, good_data.columns[i], color='black',
117 | ha='center', va='center', fontsize=18)
118 |
119 | ax.set_xlabel("Dimension 1", fontsize=14)
120 | ax.set_ylabel("Dimension 2", fontsize=14)
121 | ax.set_title("PC plane with original feature projections.", fontsize=16);
122 | return ax
123 |
124 |
125 | def channel_results(reduced_data, outliers, pca_samples):
126 | '''
127 | Visualizes the PCA-reduced cluster data in two dimensions using the full dataset
128 | Data is labeled by "Channel" and cues added for student-selected sample data
129 | '''
130 |
131 | # Check that the dataset is loadable
132 | try:
133 | full_data = pd.read_csv("customers.csv")
134 | except:
135 | print("Dataset could not be loaded. Is the file missing?")
136 | return False
137 |
138 | # Create the Channel DataFrame
139 | channel = pd.DataFrame(full_data['Channel'], columns = ['Channel'])
140 | channel = channel.drop(channel.index[outliers]).reset_index(drop = True)
141 | labeled = pd.concat([reduced_data, channel], axis = 1)
142 |
143 | # Generate the cluster plot
144 | fig, ax = plt.subplots(figsize = (14,8))
145 |
146 | # Color map
147 | cmap = cm.get_cmap('gist_rainbow')
148 |
149 | # Color the points based on assigned Channel
150 | labels = ['Hotel/Restaurant/Cafe', 'Retailer']
151 | grouped = labeled.groupby('Channel')
152 | for i, channel in grouped:
153 | channel.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \
154 | color = cmap((i-1)*1.0/2), label = labels[i-1], s=30);
155 |
156 | # Plot transformed sample points
157 | for i, sample in enumerate(pca_samples):
158 | ax.scatter(x = sample[0], y = sample[1], \
159 | s = 200, linewidth = 3, color = 'black', marker = 'o', facecolors = 'none');
160 | ax.scatter(x = sample[0]+0.25, y = sample[1]+0.3, marker='$%d$'%(i), alpha = 1, s=125);
161 |
162 | # Set plot title
163 | ax.set_title("PCA-Reduced Data Labeled by 'Channel'\nTransformed Sample Data Circled");
--------------------------------------------------------------------------------