├── README.md
├── customer_segments.ipynb
├── customers.csv
└── visuals.py

/README.md:
--------------------------------------------------------------------------------
# Study-09-MachineLearning-E
UnsupervisedLearning

- **A. Basic Clustering**
  - K-means, Hierarchical, DBSCAN
- **B. Model-Based Clustering**
  - Gaussian Mixture
- **C. Cluster Validation**
- **D. Dimensionality Reduction**
  - PCA, ICA

---
## 00. Min-Max Scaler: Feature Scaling in the data pre-processing stage
- Unbalanced features: height / weight.. the units differ, you dummy! How can this combination of features describe someone?
- Transform features to have a range of [0, 1]. But what if the data has outliers, such as a ridiculous max or min??

```
def featureScaling(array):
    answer = []
    for i in array:
        value = float(i - min(array)) / float(max(array) - min(array))
        answer.append(value)
    return answer

data = [115, 140, 175]
print(featureScaling(data))
```
**`ScikitLearn` loves `numpy input`!!!!!**
```
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([ [115.0], [140.0], [175.0] ])  # values need to be float!! each "[]" is a row

scaler = MinMaxScaler()
rescaled_X = scaler.fit_transform(X)
```
> [Note]: Which algorithms are affected by **feature scaling**??
- SVM Classification => (YES): We trade one dimension off against the other when calculating the `distances` (the **"diagonal"** decision surface that maximizes the distances).
- K-means Clustering => (YES): Given a cluster center, we calculate the `distances` from it to all data points, and those distances are **"diagonal"**.
- Linear Regression => (NO): Each feature always goes with its own coefficient. What happens to feature_A does not affect the coefficient of feature_B, so they are separated.
- DecisionTree Classification => (NO): There is no need for a diagonal decision surface, so there is no trade-off.

---
## A. Basic Clustering

### 1. K-means Clustering
- Find the groups of similar observations.
- Step_01: randomly generate the centroids (MU1, MU2, ...).
- Step_02: Allocation
  - Holding `MU_k` fixed, label each data point (which MU_k is closest?) and find the membership `Z_ik` that **minimizes SS** (create clusters around each MU_k).
- Step_03: Updating
  - Holding `Z_ik` fixed, elect a 'new MU_k' for each cluster that **minimizes SS**.
- Step_04: Iterate until convergence (no movement of points between clusters).
> **SS** of each data point (**find the membership** `Z_ik` (0/1) that minimizes the SS)
- i: each data point
- k: each cluster

> **SS** for each cluster (**find the center** `MU_k` that minimizes the SS)
- i: each data point
- k: each cluster

> Advantages:
- It is simple, easy to implement, and easy to interpret the results.
- It works well in practice even when some of its assumptions are broken.
> Disadvantages:
- **Local Minima**: It's a local hill-climbing algorithm, so it can give a sub-optimal solution, and the output for a fixed training set can be inconsistent. The output depends heavily on where we put our **initial cluster centers**, and the more cluster centers we have, the more bad local minima we can fall into, so run the algorithm multiple times (see the sketch below).
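A minimal sklearn sketch of the "run it multiple times" advice, assuming a small toy feature matrix `X` (sklearn's `KMeans` restarts from `n_init` different random centroid sets and keeps the run with the lowest SS, i.e. `inertia_`):
```
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[115.0], [140.0], [175.0], [120.0], [170.0]])  # hypothetical toy data

# n_init restarts with different random initial centroids; the best run (lowest SS) is kept
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                    # the memberships Z_ik of each point
print(kmeans.cluster_centers_)   # the final MU_k
print(kmeans.inertia_)           # the minimized SS of the best run
```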
- **Hyper-spherical nature**:
  - It relies only on the distance to the centroid as the definition of a cluster, so it works poorly with clusters of different densities and cannot carve out decent clusters when their shapes are not spherical.
  - It assumes the joint distribution of features within each cluster is spherical: features within a cluster have equal variance and are independent of each other.
  - It assumes balanced cluster sizes within the dataset, and thus often produces clusters of relatively uniform size even if the input data have clusters of different sizes.
  - It is sensitive to outliers.
```
import numpy as np

MAX_ITERATIONS = 100

# Returns k random centroids, each of dimension numFeatures.
def getRandomCentroids(numFeatures, k):
    return np.random.rand(k, numFeatures)

def kmeans(dataSet, k):

    # Initialize centroids randomly
    numFeatures = dataSet.shape[1]
    centroids = getRandomCentroids(numFeatures, k)

    # Initialize book-keeping vars.
    iterations = 0
    oldCentroids = None

    # Run the main k-means algorithm
    while not shouldStop(oldCentroids, centroids, iterations):
        # Save old centroids for convergence test. Book keeping.
        oldCentroids = centroids
        iterations += 1

        # Assign labels to each datapoint based on centroids
        labels = getLabels(dataSet, centroids)

        # Assign centroids based on datapoint labels
        centroids = getCentroids(dataSet, labels, k)

    # We can get the labels too by calling getLabels(dataSet, centroids)
    return centroids

# Function: Should Stop
# -------------
# Returns True or False if k-means is done. K-means terminates either
# because it has run a maximum number of iterations OR the centroids
# stop changing.
def shouldStop(oldCentroids, centroids, iterations):
    if iterations > MAX_ITERATIONS: return True
    return np.array_equal(oldCentroids, centroids)

# Function: Get Labels
# -------------
# Returns a label for each piece of data in the dataset:
# for each element in the dataset, choose the closest centroid
# and make that centroid the element's label.
def getLabels(dataSet, centroids):
    distances = np.linalg.norm(dataSet[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Function: Get Centroids
# -------------
# Returns the k updated centroids: each centroid is the mean of the points
# that have that centroid's label. Important: if a centroid is empty (no points
# have that centroid's label) it is randomly re-initialized.
def getCentroids(dataSet, labels, k):
    numFeatures = dataSet.shape[1]
    centroids = np.empty((k, numFeatures))
    for j in range(k):
        members = dataSet[labels == j]
        centroids[j] = members.mean(axis=0) if len(members) else getRandomCentroids(numFeatures, 1)[0]
    return centroids
```
### 2. Hierarchical & Density-Based Clustering
- In SKLEARN, hierarchical clustering lives in the `AgglomerativeClustering` component, and density-based clustering in `DBSCAN`.

> Hierarchical Clustering Example: A pizza company wants to cluster the locations of its customers in order to determine where it should open up its new branches.

1. Hierarchical Single-link clustering:
- Hierarchical clustering results in a **structure of clusters** that gives us a visual indication of how clusters relate to each other.
- Step01: assume each point is already a cluster, and give each point a label.
- Step02: calculate the distance between each point and every other point, then choose the smallest distance to group two of them into a cluster.
On the side, we draw the structure tree step by step (the dendrogram gives us additional insight that the flat clustering result alone might miss).

- Single linkage looks at the closest point to the cluster; that can result in clusters of various shapes, so it is more prone to producing elongated shapes that are not necessarily compact or circular.
- Single and complete linkage follow merging heuristics that involve mainly one point; they do not pay much attention to in-cluster variance.
- Ward's method does try to minimize the variance resulting from each merging step, by merging the clusters that lead to the least increase in variance after merging.

2. Hierarchical Complete-link clustering: ....
3. Hierarchical Average-link clustering: ....
4. Ward's Method: ....
```
from sklearn.cluster import AgglomerativeClustering

# Ward is the default linkage algorithm...
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(df)

# using complete linkage
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
# Fit & predict
complete_pred = complete.fit_predict(df)

# using average linkage
avg = AgglomerativeClustering(n_clusters=3, linkage="average")
# Fit & predict
avg_pred = avg.fit_predict(df)
```
To determine which clustering result better matches the original labels of the samples, we can use `adjusted_rand_score`, an external cluster validation index that gives a score between -1 and 1, where 1 means the two clusterings are identical in how they grouped the samples in the dataset (regardless of what label is assigned to each cluster). Which algorithm results in the higher Adjusted Rand Score?
```
from sklearn.metrics import adjusted_rand_score

ward_ar_score = adjusted_rand_score(df.label, ward_pred)
complete_ar_score = adjusted_rand_score(df.label, complete_pred)
avg_ar_score = adjusted_rand_score(df.label, avg_pred)

print("Scores: \nWard:", ward_ar_score, "\nComplete: ", complete_ar_score, "\nAverage: ", avg_ar_score)
```
Sometimes one column has much smaller values than the rest of the columns, and so its variance counts for less in the clustering process (since clustering is based on distance). We normalize the dataset so that each dimension lies between 0 and 1, so the features have equal weight in the clustering process. **This is done by subtracting the minimum of each column and then dividing by the range (max - min).** Would clustering the dataset after this transformation lead to a better clustering?
```
from sklearn import preprocessing
# note: preprocessing.normalize rescales each *row* to unit norm;
# the column-wise min-max transform described above is preprocessing.minmax_scale(df)
normalized_X = preprocessing.normalize(df)
```
To visualize the highest scoring clustering result, we'll need to use SciPy's `linkage` function to perform the clustering again, so we can obtain the linkage matrix that we will later use to visualize the hierarchy.
```
# Import scipy's linkage function to conduct the clustering
from scipy.cluster.hierarchy import linkage

# Pick the one that resulted in the highest Adjusted Rand Score
linkage_type = 'ward'

linkage_matrix = linkage(normalized_X, linkage_type)

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

plt.figure(figsize=(22,18))
dendrogram(linkage_matrix)

plt.show()
```
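The dendrogram shows the whole merge hierarchy; to get flat cluster labels out of it we can cut the tree at a chosen number of clusters. A minimal sketch with SciPy's `fcluster`, reusing the `linkage_matrix` computed above:
```
from scipy.cluster.hierarchy import fcluster

# cut the hierarchy so that we end up with 3 flat clusters (labels run from 1 to 3)
flat_labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print(flat_labels[:10])
```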
5. Density-Based Clustering:
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups the points that are densely packed together and labels the remaining points as noise.
- Step01: it selects a point arbitrarily and looks at the neighbors around it, asking "are there any other points nearby, and are there enough of them to make a cluster?". If not, the point is labeled as noise.
- Step02: if we do find enough points, we identify the 'core points' and 'border points'.
- Step03: continue examining points and create the clusters.

```
DBSCAN(df, epsilon, min_points):
    C = 0
    for each unvisited point P in df
        mark P as visited
        sphere_points = regionQuery(P, epsilon)
        if sizeof(sphere_points) < min_points
            ignore P
        else
            C = next cluster
            expandCluster(P, sphere_points, C, epsilon, min_points)

expandCluster(P, sphere_points, C, epsilon, min_points):
    add P to cluster C
    for each point P' in sphere_points
        if P' is not visited
            mark P' as visited
            sphere_points' = regionQuery(P', epsilon)
            if sizeof(sphere_points') >= min_points
                sphere_points = sphere_points joined with sphere_points'
        if P' is not yet a member of any cluster
            add P' to cluster C

regionQuery(P, epsilon):
    return all points within the n-dimensional sphere centered at P with radius epsilon (including P)


#### Python #########################################################################################################
import numpy
import scipy
import scipy.spatial.distance
from sklearn import cluster
import matplotlib.pyplot as plt


def set2List(NumpyArray):
    list = []
    for item in NumpyArray:
        list.append(item.tolist())
    return list


def GenerateData():
    x1 = numpy.random.randn(50,2)
    x2x = numpy.random.randn(80,1)+12
    x2y = numpy.random.randn(80,1)
    x2 = numpy.column_stack((x2x,x2y))
    x3 = numpy.random.randn(100,2)+8
    x4 = numpy.random.randn(120,2)+15
    z = numpy.concatenate((x1,x2,x3,x4))
    return z


def DBSCAN(Dataset, Epsilon, MinumumPoints, DistanceMethod='euclidean'):
    # Dataset is an mxn matrix: m is the number of items and n is the dimension of the data
    m, n = Dataset.shape
    Visited = numpy.zeros(m,'int')
    Type = numpy.zeros(m)
    # -1 noise, outlier
    #  0 border
    #  1 core
    ClustersList = []
    PointClusterNumber = numpy.zeros(m)
    PointClusterNumberIndex = 1
    DistanceMatrix = scipy.spatial.distance.squareform(scipy.spatial.distance.pdist(Dataset, DistanceMethod))
    for i in range(m):
        if Visited[i] == 0:
            Visited[i] = 1
            PointNeighbors = numpy.where(DistanceMatrix[i] < Epsilon)[0]
            if len(PointNeighbors) < MinumumPoints:
                # not enough neighbors: mark as noise (it may later become a border point)
                Type[i] = -1
            else:
                # enough neighbors: i is a core point, start a new cluster around it
                Cluster = [i]
                PointClusterNumber[i] = PointClusterNumberIndex
                PointNeighbors = set2List(PointNeighbors)
                ExpandCluster(PointNeighbors, Cluster, MinumumPoints, Epsilon, Visited,
                              DistanceMatrix, PointClusterNumber, PointClusterNumberIndex)
                ClustersList.append(Cluster[:])
                PointClusterNumberIndex = PointClusterNumberIndex + 1
    return PointClusterNumber


def ExpandCluster(PointNeighbors, Cluster, MinumumPoints, Epsilon, Visited,
                  DistanceMatrix, PointClusterNumber, PointClusterNumberIndex):
    # grow the current cluster by visiting every neighbor; core neighbors contribute their own neighbors
    for i in PointNeighbors:
        if Visited[i] == 0:
            Visited[i] = 1
            Neighbors = numpy.where(DistanceMatrix[i] < Epsilon)[0]
            if len(Neighbors) >= MinumumPoints:
                # Neighbors merge with PointNeighbors
                for j in Neighbors:
                    try:
                        PointNeighbors.index(j)
                    except ValueError:
                        PointNeighbors.append(j)

        if PointClusterNumber[i] == 0:
            Cluster.append(i)
            PointClusterNumber[i] = PointClusterNumberIndex
    return


# Generating some data with normal distribution at
# (0,0)
# (8,8)
# (12,0)
# (15,15)
Data = GenerateData()

# Adding some noise with uniform distribution
# X between [-3,17],
# Y between [-3,17]
noise = numpy.random.rand(50,2)*20 - 3

Noisy_Data = numpy.concatenate((Data,noise))
size = 20


fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)  # row, column, figure number
ax2 = fig.add_subplot(212)

ax1.scatter(Data[:,0], Data[:,1], alpha=0.5)
ax1.scatter(noise[:,0], noise[:,1], color='red', alpha=0.5)
ax2.scatter(noise[:,0], noise[:,1], color='red', alpha=0.5)


Epsilon = 1
MinumumPoints = 20
result = DBSCAN(Data, Epsilon, MinumumPoints)

# the printed numbers are cluster numbers (0 = noise / unassigned)
print(result)
#print("Noisy_Data")
#print(Noisy_Data.shape)
#print(Noisy_Data)

for i in range(len(result)):
    ax2.scatter(Noisy_Data[i][0], Noisy_Data[i][1], color='yellow', alpha=0.5)

plt.show()

```

---
## B. Model-Based Clustering (Gaussian Mixture)
### Wow, several datasets were hacked and mixed up... How do we retrieve the originals?

[Assumption]: **Each cluster follows a certain statistical distribution**.
- In one dimension

- In two dimensions

### EM (Expectation Maximization) Algorithm for Gaussian Mixture

- Step_01. Initialization of the distributions
  - > Give them initial values (`mean`, `var`) for each of the two suspected clusters.
  - Run 'k-means' on the dataset and use its rough clusters... or choose randomly?
  - It is indeed important that we are careful in **choosing the parameters of the initial Gaussians**; that has a significant effect on the quality of EM's result.

- Step_02. **Expectation**: soft-clustering of the data points with probabilities
  - > Let's say we have 'n' points, each with a value for every feature. Now we need to calculate the membership (probability) of each point in each cluster.
  - How do we determine the membership? Just pass in your x_value and the two parameters (mean, var)...

- Step_03. **Maximization**: estimate the new **parameters** of the Gaussians, using the `weighted means & variances`
  - > The `new mean` for cluster_A, given the result of the Expectation step (the transient memberships), comes from calculating the **weighted mean** of all of the points, weighted by those memberships.
  - The weighted mean does not only account for the value of each point, but also for how much the point belongs to the cluster.
  - > The `new var` for cluster_A likewise comes from calculating the **weighted VAR** of all of the points, weighted by the same memberships.

- Step_04. Compare (overlay) the new result with the old Gaussians, and iterate these steps until convergence (no more movement).
  - > Evaluate the `log-likelihood`, which sums over all clusters.
  - The higher the value, the more sure we are that the mixture model fits our dataset.
  - The purpose is to **maximize** this value by choosing the parameters (the mixing coefficient, mean, var) of each Gaussian again and again until the value converges, reaching a maximum.
  - What's the mixing coefficient? = the mixing proportions... they affect the **height** of each distribution.

```
from sklearn import mixture
gmm = mixture.GaussianMixture(n_components=3)
gmm.fit(X)
clustering = gmm.predict(X)
```
https://www.youtube.com/watch?v=lLt9H6RFO6A

http://www.ai.mit.edu/projects/vsam/Publications/stauffer_cvpr98_track.pdf
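Since the Expectation step assigns soft memberships, `GaussianMixture` can expose them directly. A minimal sketch, reusing the fitted `gmm` and the data `X` from above (`predict_proba` gives the per-cluster membership probabilities, `score` gives the average log-likelihood that EM is maximizing, and `weights_` holds the mixing coefficients):
```
probs = gmm.predict_proba(X)   # soft memberships: one probability per point and per Gaussian
print(probs[:5])

print(gmm.score(X))            # average log-likelihood of the data under the fitted mixture
print(gmm.weights_)            # the mixing coefficients (mixing proportions) of each Gaussian
```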
---
## C. Cluster Validation

### 1. External Indices

- Used when we have the ground truth (an answer sheet, i.e. a labeled reference).
- **ARI** (Adjusted Rand Index) [-1 to 1]:
  - > Note: ARI does not care what label we assign to a cluster, as long as the point assignments match those of the ground truth.

### 2. Internal Indices

- Used when we don't have the ground truth.
- **Silhouette Coefficient** [-1 to 1]:
  - There is a silhouette coefficient for each data point; we average them to get a silhouette score for the entire clustering. We can calculate the silhouette coefficient for each point and each cluster, as well as for an entire dataset.
  - The silhouette score is affected by `K` (the number of clusters).
  - The silhouette score is affected by the compactness and circularity of the clusters.
  - > Note: for DBSCAN, we never use the silhouette score (because of the idea of 'noise', DBSCAN's clusters are not the **compact, circular clusters** the silhouette score rewards). Instead, we use **DBCV** for DBSCAN. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=83C3BD5E078B1444CB26E243975507E1?doi=10.1.1.707.9034&rep=rep1&type=pdf
  - > Note: hierarchical clustering can carve out the clusters well, but that is not something the silhouette score can conceive of either.

By 'K'

By the 'shape' of the cluster

```
# e.g., a minimal sketch: average silhouette score of a K-means labeling of a feature matrix X
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(silhouette_score(X, labels))
```
---
## D. Dimensionality Reduction
### 1. Principal Component Analysis

We can't manually come up with the shifted, rotated coordinate system that would give us **one dimensionality**; PCA specializes in exactly those **'shifts'** and **'rotations'** of the coordinate system.

Whatever shape our given data has, PCA finds a **new coordinate system**, obtained from the original by translation and rotation.
- It moves the **center** of the coordinate system to the center of the dataset.
- It moves the x-axis onto the principal axis of variation, where we see the **most variation** relative to all the data points.
- It moves the y-axis (and the axes further down the road) onto the orthogonal, less important directions of variation.

> **What defines the two principal directions (the two orthogonal vectors)?** We want to kill dimensionality and multicollinearity...
> - 1. Find the center of the dataset (the mean of each feature).
> - 2. Find the two principal axes of variation (eigenvectors).
>   - The measure of orthogonality: take the 'dot product' of these two vectors; we should get zero.
> - 3. Find the spread values (giving importance to our vectors) for the two axes (eigenvalues).

## Compression while preserving most of the information!!!! Get rid of multicollinearity!!!!
Let's say we have a large number of measurable features, but we know there is a small number of underlying **latent features** that contain most of the information. What's the best way to condense those features?
# The new variables are linear combinations of those features! But the game changer is the Cov-matrix!!

- How do we find the principal component, the direction capturing the maximal variance (the corresponding eigenvector of the Cov-matrix)?
  - The amount of **information loss** equals the distance between a given point and the component line (i.e., its new transformed value), so we look for the component line that minimizes this total information loss. That component line is the direction of the top eigenvector of the p x p **Cov-matrix** (a short numpy sketch follows below).
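A minimal numpy sketch of that claim, on hypothetical correlated toy data: the eigenvector of the covariance matrix with the largest eigenvalue matches (up to sign) the first component reported by sklearn's `PCA`.
```
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 3) @ np.array([[2.0, 0.0, 0.0],
                                  [0.5, 1.0, 0.0],
                                  [0.0, 0.3, 0.2]])   # hypothetical correlated data

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)      # the p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order

print(eigvecs[:, -1])                             # direction of maximal variance
print(PCA(n_components=1).fit(X).components_[0])  # the same direction, up to sign
```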
- How do we get an insight into **which features** drive the most impact (capturing the major pattern, i.e. the largest eigenvalue of the Cov-matrix)?

- [Usage]
  - When we want to examine the **latent features** driving the patterns in our complex data
  - Dimensionality reduction
  - Visualizing high-dimensional data (projecting the two features down to the first PC-line and leaving them as scatters, then using K-means)
  - Reducing **noise** by discarding the unimportant PCs
  - Pre-processing before using any other algorithm, by reducing the dimensionality of the inputs

Ex> Why facial recognition?
- **Mega pixels:** pictures of human faces in general have high input dimensionality.
- **Eyes, nose, mouth:** human faces have general patterns that could be captured in a smaller number of dimensions.
- In this example, the original dimensionality of the pictures is "1288 rows x 1850 features" plus "7 classes".
```
from time import time
import logging
import pylab as pl
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# RandomizedPCA was removed from scikit-learn; use PCA(svd_solver='randomized') instead
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Download the data, if not already on disk, and load it as numpy arrays
lfw_people = fetch_lfw_people(data_home='data', min_faces_per_person=70, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
np.random.seed(42)

# for machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)
```

### Eigenvalue and Eigenvector
A matrix is a linear transformation tool; it focuses on the **mapping of a vector**. It can transform the **magnitude** and the **direction** of a vector, and map it into a **lower dimension**! An eigenvector is a vector the matrix only rescales: `transformation matrix * eigenvector = scaled vector!!`

### 2. Random Projection
- Computationally more efficient than PCA.
- Handles even more features than PCA (with a decrease in the quality of the projection, however).
- Premise
  - Simply reduce the number of dimensions in our dataset by **multiplying it by a random matrix**.
  - Where does the **'k'** (the number of reduced dimensions) come from?
  - This algorithm takes extra care about the distances between points.
  - We have a certain level of guarantee that the distances will be a bit distorted, but still preserved: for a chosen distortion level `eps`, the squared distance between two points in the projection stays within a factor of (1 - eps) to (1 + eps) of the original (the Johnson-Lindenstrauss lemma).
  - The algorithm works either by setting the number of components we want (**'k'**), or by specifying a value for 'epsilon' and letting it calculate a conservative value for **'k'**; it then gives back the new, lower-dimensional dataset.
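For the "conservative k from epsilon" route, scikit-learn exposes the Johnson-Lindenstrauss bound directly. A minimal sketch (the 1288 here just reuses the LFW sample count from above for illustration):
```
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# smallest 'k' that preserves pairwise squared distances within a factor of (1 +/- eps)
print(johnson_lindenstrauss_min_dim(n_samples=1288, eps=0.1))
```
The `SparseRandomProjection` below performs this same calculation internally when `n_components='auto'`.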
```
from sklearn import random_projection
rp = random_projection.SparseRandomProjection(n_components='auto', eps=0.1)

new_X = rp.fit_transform(X)
```
### 3. Independent Component Analysis
While PCA works to maximize variance, ICA tries to isolate the independent sources that are mixed in the dataset.
- EX> blind source separation: restoring the original signals.

- To produce the original sources `S`, ICA estimates the best unmixing matrix `W` that we can multiply by `X` (S = WX).
- ICA assumes:
  - the features are mixtures of independent sources;
  - the components must have **non-Gaussian** distributions;
  - (this is because the Central Limit Theorem says the distribution of a sum of independent variables, or of sample means, tends towards a Gaussian, so a mixture is always "more Gaussian" than its sources).

```
from sklearn.decomposition import FastICA
X = list(zip(signal_1, signal_2, signal_3))
ica = FastICA(n_components=3)

components = ica.fit_transform(X)  # these objects contain the independent components retrieved via ICA
```

[Note]
- 1. Let's mix two random sources A and B. At each time point, in the following plot (1), the value of A is the abscissa (x-axis) of the data point and the value of B is its ordinate (y-axis).
- 2. Let's take two linear mixtures of A and B and plot (2) these two new variables.
- 3. Then, if we whiten the two linear mixtures, we get plot (3):
  - the variance on both axes is now equal;
  - the correlation of the projections of the data on both axes is 0 (meaning that the covariance matrix is diagonal and all its diagonal elements are equal).
- Applying ICA then only means "rotating" this representation back to the original A and B axis space.
- The **whitening process** is simply a `linear change of coordinates` of the mixed data. Once the ICA solution is found in this "whitened" coordinate frame, we can easily re-project the ICA solution back into the original coordinate frame.
- **Whitening** is basically a de-correlation transform that converts the covariance matrix into an identity matrix.

We can imagine that ICA rotates the **whitened matrix** back to the original (A, B) space (the first scatter plot above). It performs the rotation by **minimizing the Gaussianity of the data** projected on both axes (fixed-point ICA). For instance, in the example above, the projection of the whitened data on both axes is quite Gaussian (i.e., it looks like a bell-shaped curve). By contrast, the projections in the original A, B space are far from Gaussian.
- By rotating the axes and minimizing the Gaussianity of the projections in the first scatter plot, ICA is able to recover the original sources, which are statistically independent (this property comes from the central limit theorem, which states that any linear mixture of 2 independent random variables is more Gaussian than the original variables).
- The kurtosis function gives an indication of the Gaussianity of a distribution (but the fixed-point ICA algorithm uses a slightly different measure called negentropy).

We dealt with only 2 dimensions here, but ICA can deal with an arbitrarily high number of dimensions. Let's consider 128 EEG electrodes, for instance. The signal recorded in all electrodes at each time point then constitutes a data point in a 128-dimensional space.
After whitening the data, ICA will “rotate the 128 axis” in order to minimize the Gaussianity of the projection on all axis (note that unlike PCA the axis do not have to remain orthogonal). What we call ICA components is the matrix that allows projecting the data in the initial space to one of the axis found by ICA. The weight matrix is the full transformation from the original space. 587 | 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | 611 | 612 | 613 | 614 | 615 | 616 | 617 | 618 | -------------------------------------------------------------------------------- /customers.csv: -------------------------------------------------------------------------------- 1 | Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicatessen 2 | 2,3,12669,9656,7561,214,2674,1338 3 | 2,3,7057,9810,9568,1762,3293,1776 4 | 2,3,6353,8808,7684,2405,3516,7844 5 | 1,3,13265,1196,4221,6404,507,1788 6 | 2,3,22615,5410,7198,3915,1777,5185 7 | 2,3,9413,8259,5126,666,1795,1451 8 | 2,3,12126,3199,6975,480,3140,545 9 | 2,3,7579,4956,9426,1669,3321,2566 10 | 1,3,5963,3648,6192,425,1716,750 11 | 2,3,6006,11093,18881,1159,7425,2098 12 | 2,3,3366,5403,12974,4400,5977,1744 13 | 2,3,13146,1124,4523,1420,549,497 14 | 2,3,31714,12319,11757,287,3881,2931 15 | 2,3,21217,6208,14982,3095,6707,602 16 | 2,3,24653,9465,12091,294,5058,2168 17 | 1,3,10253,1114,3821,397,964,412 18 | 2,3,1020,8816,12121,134,4508,1080 19 | 1,3,5876,6157,2933,839,370,4478 20 | 2,3,18601,6327,10099,2205,2767,3181 21 | 1,3,7780,2495,9464,669,2518,501 22 | 2,3,17546,4519,4602,1066,2259,2124 23 | 1,3,5567,871,2010,3383,375,569 24 | 1,3,31276,1917,4469,9408,2381,4334 25 | 2,3,26373,36423,22019,5154,4337,16523 26 | 2,3,22647,9776,13792,2915,4482,5778 27 | 2,3,16165,4230,7595,201,4003,57 28 | 1,3,9898,961,2861,3151,242,833 29 | 1,3,14276,803,3045,485,100,518 30 | 2,3,4113,20484,25957,1158,8604,5206 31 | 1,3,43088,2100,2609,1200,1107,823 32 | 1,3,18815,3610,11107,1148,2134,2963 33 | 1,3,2612,4339,3133,2088,820,985 34 | 1,3,21632,1318,2886,266,918,405 35 | 1,3,29729,4786,7326,6130,361,1083 36 | 1,3,1502,1979,2262,425,483,395 37 | 2,3,688,5491,11091,833,4239,436 38 | 1,3,29955,4362,5428,1729,862,4626 39 | 2,3,15168,10556,12477,1920,6506,714 40 | 2,3,4591,15729,16709,33,6956,433 41 | 1,3,56159,555,902,10002,212,2916 42 | 1,3,24025,4332,4757,9510,1145,5864 43 | 1,3,19176,3065,5956,2033,2575,2802 44 | 2,3,10850,7555,14961,188,6899,46 45 | 2,3,630,11095,23998,787,9529,72 46 | 2,3,9670,7027,10471,541,4618,65 47 | 2,3,5181,22044,21531,1740,7353,4985 48 | 2,3,3103,14069,21955,1668,6792,1452 49 | 2,3,44466,54259,55571,7782,24171,6465 50 | 2,3,11519,6152,10868,584,5121,1476 51 | 2,3,4967,21412,28921,1798,13583,1163 52 | 1,3,6269,1095,1980,3860,609,2162 53 | 1,3,3347,4051,6996,239,1538,301 54 | 2,3,40721,3916,5876,532,2587,1278 55 | 2,3,491,10473,11532,744,5611,224 56 | 1,3,27329,1449,1947,2436,204,1333 57 | 1,3,5264,3683,5005,1057,2024,1130 58 | 2,3,4098,29892,26866,2616,17740,1340 59 | 2,3,5417,9933,10487,38,7572,1282 60 | 1,3,13779,1970,1648,596,227,436 61 | 1,3,6137,5360,8040,129,3084,1603 62 | 2,3,8590,3045,7854,96,4095,225 63 | 2,3,35942,38369,59598,3254,26701,2017 64 | 2,3,7823,6245,6544,4154,4074,964 65 | 2,3,9396,11601,15775,2896,7677,1295 66 | 1,3,4760,1227,3250,3724,1247,1145 67 | 2,3,85,20959,45828,36,24231,1423 68 | 1,3,9,1534,7417,175,3468,27 69 | 2,3,19913,6759,13462,1256,5141,834 70 | 1,3,2446,7260,3993,5870,788,3095 71 | 1,3,8352,2820,1293,779,656,144 72 | 
1,3,16705,2037,3202,10643,116,1365 73 | 1,3,18291,1266,21042,5373,4173,14472 74 | 1,3,4420,5139,2661,8872,1321,181 75 | 2,3,19899,5332,8713,8132,764,648 76 | 2,3,8190,6343,9794,1285,1901,1780 77 | 1,3,20398,1137,3,4407,3,975 78 | 1,3,717,3587,6532,7530,529,894 79 | 2,3,12205,12697,28540,869,12034,1009 80 | 1,3,10766,1175,2067,2096,301,167 81 | 1,3,1640,3259,3655,868,1202,1653 82 | 1,3,7005,829,3009,430,610,529 83 | 2,3,219,9540,14403,283,7818,156 84 | 2,3,10362,9232,11009,737,3537,2342 85 | 1,3,20874,1563,1783,2320,550,772 86 | 2,3,11867,3327,4814,1178,3837,120 87 | 2,3,16117,46197,92780,1026,40827,2944 88 | 2,3,22925,73498,32114,987,20070,903 89 | 1,3,43265,5025,8117,6312,1579,14351 90 | 1,3,7864,542,4042,9735,165,46 91 | 1,3,24904,3836,5330,3443,454,3178 92 | 1,3,11405,596,1638,3347,69,360 93 | 1,3,12754,2762,2530,8693,627,1117 94 | 2,3,9198,27472,32034,3232,18906,5130 95 | 1,3,11314,3090,2062,35009,71,2698 96 | 2,3,5626,12220,11323,206,5038,244 97 | 1,3,3,2920,6252,440,223,709 98 | 2,3,23,2616,8118,145,3874,217 99 | 1,3,403,254,610,774,54,63 100 | 1,3,503,112,778,895,56,132 101 | 1,3,9658,2182,1909,5639,215,323 102 | 2,3,11594,7779,12144,3252,8035,3029 103 | 2,3,1420,10810,16267,1593,6766,1838 104 | 2,3,2932,6459,7677,2561,4573,1386 105 | 1,3,56082,3504,8906,18028,1480,2498 106 | 1,3,14100,2132,3445,1336,1491,548 107 | 1,3,15587,1014,3970,910,139,1378 108 | 2,3,1454,6337,10704,133,6830,1831 109 | 2,3,8797,10646,14886,2471,8969,1438 110 | 2,3,1531,8397,6981,247,2505,1236 111 | 2,3,1406,16729,28986,673,836,3 112 | 1,3,11818,1648,1694,2276,169,1647 113 | 2,3,12579,11114,17569,805,6457,1519 114 | 1,3,19046,2770,2469,8853,483,2708 115 | 1,3,14438,2295,1733,3220,585,1561 116 | 1,3,18044,1080,2000,2555,118,1266 117 | 1,3,11134,793,2988,2715,276,610 118 | 1,3,11173,2521,3355,1517,310,222 119 | 1,3,6990,3880,5380,1647,319,1160 120 | 1,3,20049,1891,2362,5343,411,933 121 | 1,3,8258,2344,2147,3896,266,635 122 | 1,3,17160,1200,3412,2417,174,1136 123 | 1,3,4020,3234,1498,2395,264,255 124 | 1,3,12212,201,245,1991,25,860 125 | 2,3,11170,10769,8814,2194,1976,143 126 | 1,3,36050,1642,2961,4787,500,1621 127 | 1,3,76237,3473,7102,16538,778,918 128 | 1,3,19219,1840,1658,8195,349,483 129 | 2,3,21465,7243,10685,880,2386,2749 130 | 1,3,140,8847,3823,142,1062,3 131 | 1,3,42312,926,1510,1718,410,1819 132 | 1,3,7149,2428,699,6316,395,911 133 | 1,3,2101,589,314,346,70,310 134 | 1,3,14903,2032,2479,576,955,328 135 | 1,3,9434,1042,1235,436,256,396 136 | 1,3,7388,1882,2174,720,47,537 137 | 1,3,6300,1289,2591,1170,199,326 138 | 1,3,4625,8579,7030,4575,2447,1542 139 | 1,3,3087,8080,8282,661,721,36 140 | 1,3,13537,4257,5034,155,249,3271 141 | 1,3,5387,4979,3343,825,637,929 142 | 1,3,17623,4280,7305,2279,960,2616 143 | 1,3,30379,13252,5189,321,51,1450 144 | 1,3,37036,7152,8253,2995,20,3 145 | 1,3,10405,1596,1096,8425,399,318 146 | 1,3,18827,3677,1988,118,516,201 147 | 2,3,22039,8384,34792,42,12591,4430 148 | 1,3,7769,1936,2177,926,73,520 149 | 1,3,9203,3373,2707,1286,1082,526 150 | 1,3,5924,584,542,4052,283,434 151 | 1,3,31812,1433,1651,800,113,1440 152 | 1,3,16225,1825,1765,853,170,1067 153 | 1,3,1289,3328,2022,531,255,1774 154 | 1,3,18840,1371,3135,3001,352,184 155 | 1,3,3463,9250,2368,779,302,1627 156 | 1,3,622,55,137,75,7,8 157 | 2,3,1989,10690,19460,233,11577,2153 158 | 2,3,3830,5291,14855,317,6694,3182 159 | 1,3,17773,1366,2474,3378,811,418 160 | 2,3,2861,6570,9618,930,4004,1682 161 | 2,3,355,7704,14682,398,8077,303 162 | 2,3,1725,3651,12822,824,4424,2157 163 | 1,3,12434,540,283,1092,3,2233 164 | 
1,3,15177,2024,3810,2665,232,610 165 | 2,3,5531,15726,26870,2367,13726,446 166 | 2,3,5224,7603,8584,2540,3674,238 167 | 2,3,15615,12653,19858,4425,7108,2379 168 | 2,3,4822,6721,9170,993,4973,3637 169 | 1,3,2926,3195,3268,405,1680,693 170 | 1,3,5809,735,803,1393,79,429 171 | 1,3,5414,717,2155,2399,69,750 172 | 2,3,260,8675,13430,1116,7015,323 173 | 2,3,200,25862,19816,651,8773,6250 174 | 1,3,955,5479,6536,333,2840,707 175 | 2,3,514,7677,19805,937,9836,716 176 | 1,3,286,1208,5241,2515,153,1442 177 | 2,3,2343,7845,11874,52,4196,1697 178 | 1,3,45640,6958,6536,7368,1532,230 179 | 1,3,12759,7330,4533,1752,20,2631 180 | 1,3,11002,7075,4945,1152,120,395 181 | 1,3,3157,4888,2500,4477,273,2165 182 | 1,3,12356,6036,8887,402,1382,2794 183 | 1,3,112151,29627,18148,16745,4948,8550 184 | 1,3,694,8533,10518,443,6907,156 185 | 1,3,36847,43950,20170,36534,239,47943 186 | 1,3,327,918,4710,74,334,11 187 | 1,3,8170,6448,1139,2181,58,247 188 | 1,3,3009,521,854,3470,949,727 189 | 1,3,2438,8002,9819,6269,3459,3 190 | 2,3,8040,7639,11687,2758,6839,404 191 | 2,3,834,11577,11522,275,4027,1856 192 | 1,3,16936,6250,1981,7332,118,64 193 | 1,3,13624,295,1381,890,43,84 194 | 1,3,5509,1461,2251,547,187,409 195 | 2,3,180,3485,20292,959,5618,666 196 | 1,3,7107,1012,2974,806,355,1142 197 | 1,3,17023,5139,5230,7888,330,1755 198 | 1,1,30624,7209,4897,18711,763,2876 199 | 2,1,2427,7097,10391,1127,4314,1468 200 | 1,1,11686,2154,6824,3527,592,697 201 | 1,1,9670,2280,2112,520,402,347 202 | 2,1,3067,13240,23127,3941,9959,731 203 | 2,1,4484,14399,24708,3549,14235,1681 204 | 1,1,25203,11487,9490,5065,284,6854 205 | 1,1,583,685,2216,469,954,18 206 | 1,1,1956,891,5226,1383,5,1328 207 | 2,1,1107,11711,23596,955,9265,710 208 | 1,1,6373,780,950,878,288,285 209 | 2,1,2541,4737,6089,2946,5316,120 210 | 1,1,1537,3748,5838,1859,3381,806 211 | 2,1,5550,12729,16767,864,12420,797 212 | 1,1,18567,1895,1393,1801,244,2100 213 | 2,1,12119,28326,39694,4736,19410,2870 214 | 1,1,7291,1012,2062,1291,240,1775 215 | 1,1,3317,6602,6861,1329,3961,1215 216 | 2,1,2362,6551,11364,913,5957,791 217 | 1,1,2806,10765,15538,1374,5828,2388 218 | 2,1,2532,16599,36486,179,13308,674 219 | 1,1,18044,1475,2046,2532,130,1158 220 | 2,1,18,7504,15205,1285,4797,6372 221 | 1,1,4155,367,1390,2306,86,130 222 | 1,1,14755,899,1382,1765,56,749 223 | 1,1,5396,7503,10646,91,4167,239 224 | 1,1,5041,1115,2856,7496,256,375 225 | 2,1,2790,2527,5265,5612,788,1360 226 | 1,1,7274,659,1499,784,70,659 227 | 1,1,12680,3243,4157,660,761,786 228 | 2,1,20782,5921,9212,1759,2568,1553 229 | 1,1,4042,2204,1563,2286,263,689 230 | 1,1,1869,577,572,950,4762,203 231 | 1,1,8656,2746,2501,6845,694,980 232 | 2,1,11072,5989,5615,8321,955,2137 233 | 1,1,2344,10678,3828,1439,1566,490 234 | 1,1,25962,1780,3838,638,284,834 235 | 1,1,964,4984,3316,937,409,7 236 | 1,1,15603,2703,3833,4260,325,2563 237 | 1,1,1838,6380,2824,1218,1216,295 238 | 1,1,8635,820,3047,2312,415,225 239 | 1,1,18692,3838,593,4634,28,1215 240 | 1,1,7363,475,585,1112,72,216 241 | 1,1,47493,2567,3779,5243,828,2253 242 | 1,1,22096,3575,7041,11422,343,2564 243 | 1,1,24929,1801,2475,2216,412,1047 244 | 1,1,18226,659,2914,3752,586,578 245 | 1,1,11210,3576,5119,561,1682,2398 246 | 1,1,6202,7775,10817,1183,3143,1970 247 | 2,1,3062,6154,13916,230,8933,2784 248 | 1,1,8885,2428,1777,1777,430,610 249 | 1,1,13569,346,489,2077,44,659 250 | 1,1,15671,5279,2406,559,562,572 251 | 1,1,8040,3795,2070,6340,918,291 252 | 1,1,3191,1993,1799,1730,234,710 253 | 2,1,6134,23133,33586,6746,18594,5121 254 | 1,1,6623,1860,4740,7683,205,1693 255 | 
1,1,29526,7961,16966,432,363,1391 256 | 1,1,10379,17972,4748,4686,1547,3265 257 | 1,1,31614,489,1495,3242,111,615 258 | 1,1,11092,5008,5249,453,392,373 259 | 1,1,8475,1931,1883,5004,3593,987 260 | 1,1,56083,4563,2124,6422,730,3321 261 | 1,1,53205,4959,7336,3012,967,818 262 | 1,1,9193,4885,2157,327,780,548 263 | 1,1,7858,1110,1094,6818,49,287 264 | 1,1,23257,1372,1677,982,429,655 265 | 1,1,2153,1115,6684,4324,2894,411 266 | 2,1,1073,9679,15445,61,5980,1265 267 | 1,1,5909,23527,13699,10155,830,3636 268 | 2,1,572,9763,22182,2221,4882,2563 269 | 1,1,20893,1222,2576,3975,737,3628 270 | 2,1,11908,8053,19847,1069,6374,698 271 | 1,1,15218,258,1138,2516,333,204 272 | 1,1,4720,1032,975,5500,197,56 273 | 1,1,2083,5007,1563,1120,147,1550 274 | 1,1,514,8323,6869,529,93,1040 275 | 1,3,36817,3045,1493,4802,210,1824 276 | 1,3,894,1703,1841,744,759,1153 277 | 1,3,680,1610,223,862,96,379 278 | 1,3,27901,3749,6964,4479,603,2503 279 | 1,3,9061,829,683,16919,621,139 280 | 1,3,11693,2317,2543,5845,274,1409 281 | 2,3,17360,6200,9694,1293,3620,1721 282 | 1,3,3366,2884,2431,977,167,1104 283 | 2,3,12238,7108,6235,1093,2328,2079 284 | 1,3,49063,3965,4252,5970,1041,1404 285 | 1,3,25767,3613,2013,10303,314,1384 286 | 1,3,68951,4411,12609,8692,751,2406 287 | 1,3,40254,640,3600,1042,436,18 288 | 1,3,7149,2247,1242,1619,1226,128 289 | 1,3,15354,2102,2828,8366,386,1027 290 | 1,3,16260,594,1296,848,445,258 291 | 1,3,42786,286,471,1388,32,22 292 | 1,3,2708,2160,2642,502,965,1522 293 | 1,3,6022,3354,3261,2507,212,686 294 | 1,3,2838,3086,4329,3838,825,1060 295 | 2,2,3996,11103,12469,902,5952,741 296 | 1,2,21273,2013,6550,909,811,1854 297 | 2,2,7588,1897,5234,417,2208,254 298 | 1,2,19087,1304,3643,3045,710,898 299 | 2,2,8090,3199,6986,1455,3712,531 300 | 2,2,6758,4560,9965,934,4538,1037 301 | 1,2,444,879,2060,264,290,259 302 | 2,2,16448,6243,6360,824,2662,2005 303 | 2,2,5283,13316,20399,1809,8752,172 304 | 2,2,2886,5302,9785,364,6236,555 305 | 2,2,2599,3688,13829,492,10069,59 306 | 2,2,161,7460,24773,617,11783,2410 307 | 2,2,243,12939,8852,799,3909,211 308 | 2,2,6468,12867,21570,1840,7558,1543 309 | 1,2,17327,2374,2842,1149,351,925 310 | 1,2,6987,1020,3007,416,257,656 311 | 2,2,918,20655,13567,1465,6846,806 312 | 1,2,7034,1492,2405,12569,299,1117 313 | 1,2,29635,2335,8280,3046,371,117 314 | 2,2,2137,3737,19172,1274,17120,142 315 | 1,2,9784,925,2405,4447,183,297 316 | 1,2,10617,1795,7647,1483,857,1233 317 | 2,2,1479,14982,11924,662,3891,3508 318 | 1,2,7127,1375,2201,2679,83,1059 319 | 1,2,1182,3088,6114,978,821,1637 320 | 1,2,11800,2713,3558,2121,706,51 321 | 2,2,9759,25071,17645,1128,12408,1625 322 | 1,2,1774,3696,2280,514,275,834 323 | 1,2,9155,1897,5167,2714,228,1113 324 | 1,2,15881,713,3315,3703,1470,229 325 | 1,2,13360,944,11593,915,1679,573 326 | 1,2,25977,3587,2464,2369,140,1092 327 | 1,2,32717,16784,13626,60869,1272,5609 328 | 1,2,4414,1610,1431,3498,387,834 329 | 1,2,542,899,1664,414,88,522 330 | 1,2,16933,2209,3389,7849,210,1534 331 | 1,2,5113,1486,4583,5127,492,739 332 | 1,2,9790,1786,5109,3570,182,1043 333 | 2,2,11223,14881,26839,1234,9606,1102 334 | 1,2,22321,3216,1447,2208,178,2602 335 | 2,2,8565,4980,67298,131,38102,1215 336 | 2,2,16823,928,2743,11559,332,3486 337 | 2,2,27082,6817,10790,1365,4111,2139 338 | 1,2,13970,1511,1330,650,146,778 339 | 1,2,9351,1347,2611,8170,442,868 340 | 1,2,3,333,7021,15601,15,550 341 | 1,2,2617,1188,5332,9584,573,1942 342 | 2,3,381,4025,9670,388,7271,1371 343 | 2,3,2320,5763,11238,767,5162,2158 344 | 1,3,255,5758,5923,349,4595,1328 345 | 2,3,1689,6964,26316,1456,15469,37 346 | 
1,3,3043,1172,1763,2234,217,379 347 | 1,3,1198,2602,8335,402,3843,303 348 | 2,3,2771,6939,15541,2693,6600,1115 349 | 2,3,27380,7184,12311,2809,4621,1022 350 | 1,3,3428,2380,2028,1341,1184,665 351 | 2,3,5981,14641,20521,2005,12218,445 352 | 1,3,3521,1099,1997,1796,173,995 353 | 2,3,1210,10044,22294,1741,12638,3137 354 | 1,3,608,1106,1533,830,90,195 355 | 2,3,117,6264,21203,228,8682,1111 356 | 1,3,14039,7393,2548,6386,1333,2341 357 | 1,3,190,727,2012,245,184,127 358 | 1,3,22686,134,218,3157,9,548 359 | 2,3,37,1275,22272,137,6747,110 360 | 1,3,759,18664,1660,6114,536,4100 361 | 1,3,796,5878,2109,340,232,776 362 | 1,3,19746,2872,2006,2601,468,503 363 | 1,3,4734,607,864,1206,159,405 364 | 1,3,2121,1601,2453,560,179,712 365 | 1,3,4627,997,4438,191,1335,314 366 | 1,3,2615,873,1524,1103,514,468 367 | 2,3,4692,6128,8025,1619,4515,3105 368 | 1,3,9561,2217,1664,1173,222,447 369 | 1,3,3477,894,534,1457,252,342 370 | 1,3,22335,1196,2406,2046,101,558 371 | 1,3,6211,337,683,1089,41,296 372 | 2,3,39679,3944,4955,1364,523,2235 373 | 1,3,20105,1887,1939,8164,716,790 374 | 1,3,3884,3801,1641,876,397,4829 375 | 2,3,15076,6257,7398,1504,1916,3113 376 | 1,3,6338,2256,1668,1492,311,686 377 | 1,3,5841,1450,1162,597,476,70 378 | 2,3,3136,8630,13586,5641,4666,1426 379 | 1,3,38793,3154,2648,1034,96,1242 380 | 1,3,3225,3294,1902,282,68,1114 381 | 2,3,4048,5164,10391,130,813,179 382 | 1,3,28257,944,2146,3881,600,270 383 | 1,3,17770,4591,1617,9927,246,532 384 | 1,3,34454,7435,8469,2540,1711,2893 385 | 1,3,1821,1364,3450,4006,397,361 386 | 1,3,10683,21858,15400,3635,282,5120 387 | 1,3,11635,922,1614,2583,192,1068 388 | 1,3,1206,3620,2857,1945,353,967 389 | 1,3,20918,1916,1573,1960,231,961 390 | 1,3,9785,848,1172,1677,200,406 391 | 1,3,9385,1530,1422,3019,227,684 392 | 1,3,3352,1181,1328,5502,311,1000 393 | 1,3,2647,2761,2313,907,95,1827 394 | 1,3,518,4180,3600,659,122,654 395 | 1,3,23632,6730,3842,8620,385,819 396 | 1,3,12377,865,3204,1398,149,452 397 | 1,3,9602,1316,1263,2921,841,290 398 | 2,3,4515,11991,9345,2644,3378,2213 399 | 1,3,11535,1666,1428,6838,64,743 400 | 1,3,11442,1032,582,5390,74,247 401 | 1,3,9612,577,935,1601,469,375 402 | 1,3,4446,906,1238,3576,153,1014 403 | 1,3,27167,2801,2128,13223,92,1902 404 | 1,3,26539,4753,5091,220,10,340 405 | 1,3,25606,11006,4604,127,632,288 406 | 1,3,18073,4613,3444,4324,914,715 407 | 1,3,6884,1046,1167,2069,593,378 408 | 1,3,25066,5010,5026,9806,1092,960 409 | 2,3,7362,12844,18683,2854,7883,553 410 | 2,3,8257,3880,6407,1646,2730,344 411 | 1,3,8708,3634,6100,2349,2123,5137 412 | 1,3,6633,2096,4563,1389,1860,1892 413 | 1,3,2126,3289,3281,1535,235,4365 414 | 1,3,97,3605,12400,98,2970,62 415 | 1,3,4983,4859,6633,17866,912,2435 416 | 1,3,5969,1990,3417,5679,1135,290 417 | 2,3,7842,6046,8552,1691,3540,1874 418 | 2,3,4389,10940,10908,848,6728,993 419 | 1,3,5065,5499,11055,364,3485,1063 420 | 2,3,660,8494,18622,133,6740,776 421 | 1,3,8861,3783,2223,633,1580,1521 422 | 1,3,4456,5266,13227,25,6818,1393 423 | 2,3,17063,4847,9053,1031,3415,1784 424 | 1,3,26400,1377,4172,830,948,1218 425 | 2,3,17565,3686,4657,1059,1803,668 426 | 2,3,16980,2884,12232,874,3213,249 427 | 1,3,11243,2408,2593,15348,108,1886 428 | 1,3,13134,9347,14316,3141,5079,1894 429 | 1,3,31012,16687,5429,15082,439,1163 430 | 1,3,3047,5970,4910,2198,850,317 431 | 1,3,8607,1750,3580,47,84,2501 432 | 1,3,3097,4230,16483,575,241,2080 433 | 1,3,8533,5506,5160,13486,1377,1498 434 | 1,3,21117,1162,4754,269,1328,395 435 | 1,3,1982,3218,1493,1541,356,1449 436 | 1,3,16731,3922,7994,688,2371,838 437 | 
1,3,29703,12051,16027,13135,182,2204 438 | 1,3,39228,1431,764,4510,93,2346 439 | 2,3,14531,15488,30243,437,14841,1867 440 | 1,3,10290,1981,2232,1038,168,2125 441 | 1,3,2787,1698,2510,65,477,52 442 | -------------------------------------------------------------------------------- /visuals.py: -------------------------------------------------------------------------------- 1 | ########################################### 2 | # Suppress matplotlib user warnings 3 | # Necessary for newer version of matplotlib 4 | import warnings 5 | warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib") 6 | # 7 | # Display inline matplotlib plots with IPython 8 | from IPython import get_ipython 9 | get_ipython().run_line_magic('matplotlib', 'inline') 10 | ########################################### 11 | 12 | import matplotlib.pyplot as plt 13 | import matplotlib.cm as cm 14 | import pandas as pd 15 | import numpy as np 16 | 17 | def pca_results(good_data, pca): 18 | ''' 19 | Create a DataFrame of the PCA results 20 | Includes dimension feature weights and explained variance 21 | Visualizes the PCA results 22 | ''' 23 | 24 | # Dimension indexing 25 | dimensions = dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)] 26 | 27 | # PCA components 28 | components = pd.DataFrame(np.round(pca.components_, 4), columns = list(good_data.keys())) 29 | components.index = dimensions 30 | 31 | # PCA explained variance 32 | ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1) 33 | variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance']) 34 | variance_ratios.index = dimensions 35 | 36 | # Create a bar plot visualization 37 | fig, ax = plt.subplots(figsize = (14,8)) 38 | 39 | # Plot the feature weights as a function of the components 40 | components.plot(ax = ax, kind = 'bar'); 41 | ax.set_ylabel("Feature Weights") 42 | ax.set_xticklabels(dimensions, rotation=0) 43 | 44 | 45 | # Display the explained variance ratios 46 | for i, ev in enumerate(pca.explained_variance_ratio_): 47 | ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n %.4f"%(ev)) 48 | 49 | # Return a concatenated DataFrame 50 | return pd.concat([variance_ratios, components], axis = 1) 51 | 52 | def cluster_results(reduced_data, preds, centers, pca_samples): 53 | ''' 54 | Visualizes the PCA-reduced cluster data in two dimensions 55 | Adds cues for cluster centers and student-selected sample data 56 | ''' 57 | 58 | predictions = pd.DataFrame(preds, columns = ['Cluster']) 59 | plot_data = pd.concat([predictions, reduced_data], axis = 1) 60 | 61 | # Generate the cluster plot 62 | fig, ax = plt.subplots(figsize = (14,8)) 63 | 64 | # Color map 65 | cmap = cm.get_cmap('gist_rainbow') 66 | 67 | # Color the points based on assigned cluster 68 | for i, cluster in plot_data.groupby('Cluster'): 69 | cluster.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \ 70 | color = cmap((i)*1.0/(len(centers)-1)), label = 'Cluster %i'%(i), s=30); 71 | 72 | # Plot centers with indicators 73 | for i, c in enumerate(centers): 74 | ax.scatter(x = c[0], y = c[1], color = 'white', edgecolors = 'black', \ 75 | alpha = 1, linewidth = 2, marker = 'o', s=200); 76 | ax.scatter(x = c[0], y = c[1], marker='$%d$'%(i), alpha = 1, s=100); 77 | 78 | # Plot transformed sample points 79 | ax.scatter(x = pca_samples[:,0], y = pca_samples[:,1], \ 80 | s = 150, linewidth = 4, color = 'black', marker = 'x'); 81 | 82 | # Set plot title 83 | ax.set_title("Cluster Learning on 
PCA-Reduced Data - Centroids Marked by Number\nTransformed Sample Data Marked by Black Cross"); 84 | 85 | 86 | def biplot(good_data, reduced_data, pca): 87 | ''' 88 | Produce a biplot that shows a scatterplot of the reduced 89 | data and the projections of the original features. 90 | 91 | good_data: original data, before transformation. 92 | Needs to be a pandas dataframe with valid column names 93 | reduced_data: the reduced data (the first two dimensions are plotted) 94 | pca: pca object that contains the components_ attribute 95 | 96 | return: a matplotlib AxesSubplot object (for any additional customization) 97 | 98 | This procedure is inspired by the script: 99 | https://github.com/teddyroland/python-biplot 100 | ''' 101 | 102 | fig, ax = plt.subplots(figsize = (14,8)) 103 | # scatterplot of the reduced data 104 | ax.scatter(x=reduced_data.loc[:, 'Dimension 1'], y=reduced_data.loc[:, 'Dimension 2'], 105 | facecolors='b', edgecolors='b', s=70, alpha=0.5) 106 | 107 | feature_vectors = pca.components_.T 108 | 109 | # we use scaling factors to make the arrows easier to see 110 | arrow_size, text_pos = 7.0, 8.0, 111 | 112 | # projections of the original features 113 | for i, v in enumerate(feature_vectors): 114 | ax.arrow(0, 0, arrow_size*v[0], arrow_size*v[1], 115 | head_width=0.2, head_length=0.2, linewidth=2, color='red') 116 | ax.text(v[0]*text_pos, v[1]*text_pos, good_data.columns[i], color='black', 117 | ha='center', va='center', fontsize=18) 118 | 119 | ax.set_xlabel("Dimension 1", fontsize=14) 120 | ax.set_ylabel("Dimension 2", fontsize=14) 121 | ax.set_title("PC plane with original feature projections.", fontsize=16); 122 | return ax 123 | 124 | 125 | def channel_results(reduced_data, outliers, pca_samples): 126 | ''' 127 | Visualizes the PCA-reduced cluster data in two dimensions using the full dataset 128 | Data is labeled by "Channel" and cues added for student-selected sample data 129 | ''' 130 | 131 | # Check that the dataset is loadable 132 | try: 133 | full_data = pd.read_csv("customers.csv") 134 | except: 135 | print("Dataset could not be loaded. Is the file missing?") 136 | return False 137 | 138 | # Create the Channel DataFrame 139 | channel = pd.DataFrame(full_data['Channel'], columns = ['Channel']) 140 | channel = channel.drop(channel.index[outliers]).reset_index(drop = True) 141 | labeled = pd.concat([reduced_data, channel], axis = 1) 142 | 143 | # Generate the cluster plot 144 | fig, ax = plt.subplots(figsize = (14,8)) 145 | 146 | # Color map 147 | cmap = cm.get_cmap('gist_rainbow') 148 | 149 | # Color the points based on assigned Channel 150 | labels = ['Hotel/Restaurant/Cafe', 'Retailer'] 151 | grouped = labeled.groupby('Channel') 152 | for i, channel in grouped: 153 | channel.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \ 154 | color = cmap((i-1)*1.0/2), label = labels[i-1], s=30); 155 | 156 | # Plot transformed sample points 157 | for i, sample in enumerate(pca_samples): 158 | ax.scatter(x = sample[0], y = sample[1], \ 159 | s = 200, linewidth = 3, color = 'black', marker = 'o', facecolors = 'none'); 160 | ax.scatter(x = sample[0]+0.25, y = sample[1]+0.3, marker='$%d$'%(i), alpha = 1, s=125); 161 | 162 | # Set plot title 163 | ax.set_title("PCA-Reduced Data Labeled by 'Channel'\nTransformed Sample Data Circled"); --------------------------------------------------------------------------------