├── Images
│   ├── explore_wine_beeswarm.png
│   ├── explore_wine_cdf.png
│   ├── explore_wine_histogram.png
│   └── explore_wine_scattermatrix.png
├── README.md
├── covariance_boston.py
├── explore_wine_data.py
├── install.txt
├── ml_helpers.py
├── plt_helpers.py
├── statistics_helpers.py
└── statistics_iris.py
/Images/explore_wine_beeswarm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_beeswarm.png
--------------------------------------------------------------------------------
/Images/explore_wine_cdf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_cdf.png
--------------------------------------------------------------------------------
/Images/explore_wine_histogram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_histogram.png
--------------------------------------------------------------------------------
/Images/explore_wine_scattermatrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_scattermatrix.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Python Data Science
2 | 
3 | ## Description
4 | A collection of data science scripts for data analysis in Python. Please also see my related repository [Python Machine Learning](https://github.com/GeorgeSeif/Python-Machine-Learning) which contains many implementations of Machine Learning algorithms including _regression_, _classification_, and _clustering_. The algorithms are implemented in two ways: from scratch in Python and using Scikit Learn functions.
5 | 
6 | Python libraries used:
7 | - Numpy
8 | - Scipy
9 | - Scikit Learn
10 | - Pandas
11 | - Seaborn
12 | - Matplotlib
13 | 
14 | ## Installation
15 | To install all of the libraries, run the commands in the "install.txt" file. These are:
16 | 
17 | - `sudo apt-get install python-pip`
18 | - `sudo pip install numpy scipy`
19 | - `sudo pip install pandas`
20 | - `sudo apt-get install python-matplotlib`
21 | - `sudo pip install -U scikit-learn`
22 | - `sudo pip install tabulate`
23 | 
24 | ## Files
25 | - **ml_helpers.py:** Machine Learning helper functions. Adapted from my [Python Machine Learning](https://github.com/GeorgeSeif/Python-Machine-Learning) repository
26 | - **plt_helpers.py:** Helper functions to make plotting easy in Matplotlib.
27 | - **statistics_helpers.py:** Helper functions for computing dataset statistics
28 | - **explore_wine_data.py:** Exploratory data analysis of the wine dataset from sklearn using visualisations. Includes data analysis using histogram, scatterplot, bee swarm plot, and cumulative distribution function.
29 | - **statistics_iris.py:** Compute various statistics of the iris dataset features such as histogram, min, max, median, mean, and variance.
30 | - **covariance_boston.py:** Compute the covariance matrix of the Boston Housing dataset. These matrices can sometimes give faster insight into which variables are related than creating scatter plots does.
31 | 
32 | ## Information
33 | 
34 | ### Visualisations
35 | - **Histogram:** A histogram is a graphical method of displaying quantitative data. It displays a single quantitative variable along the x axis and the frequency of that variable on the y axis. The distinguishing feature of a histogram is that the data is grouped into "bins", which are intervals on the x axis.
36 | - **Scatterplot:** A scatter plot is a graphical method of displaying the relationship between data points. Each feature variable is assigned an axis, and each data point in the dataset is then plotted based on its feature values.
37 | - **Beeswarm Plot:** A beeswarm plot is a two-dimensional visualisation technique where data points are plotted relative to a fixed reference axis so that no two data points overlap. The beeswarm plot is useful when we wish to see not only the measured values of interest for each data point, but also the distribution of these values.
38 | - **Cumulative Distribution Function:** The cumulative distribution function (CDF) is the probability that a variable takes a value less than or equal to x. For example, we may wish to see what percentage of the data has a certain feature variable that is less than or equal to x.
39 | - **Bar Plots:** Classical bar plots that are good for visualisation and comparison of different data statistics, especially comparing statistics of feature variables. A short illustrative sketch of these plot types follows this list.
40 | 
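The following minimal sketch is not one of the repository scripts; it only illustrates how a few of the plot types above could be produced with matplotlib and seaborn on the sklearn wine data (the feature `color_intensity` is one of the real wine dataset columns, but any other feature would do).

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['category'] = wine.target

# Histogram: group one feature into bins and count the values per bin
plt.hist(df['color_intensity'], bins=10)
plt.xlabel('color_intensity')
plt.ylabel('count')
plt.show()

# Bee swarm plot: one point per sample, grouped by class, with no overlap
sns.swarmplot(x='category', y='color_intensity', data=df)
plt.show()

# Empirical CDF: fraction of samples whose value is <= x
x = np.sort(df['color_intensity'].values)
y = np.arange(1, len(x) + 1) / float(len(x))
plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('color_intensity')
plt.ylabel('ECDF')
plt.show()
```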
41 | ### Statistics
42 | - **Mean and Median:** Both of these show a type of "average" or "center" value for a particular feature variable. The mean is the more literal and precise center; however, the median is much more robust to outliers, which can pull the mean far away from the majority of the values.
43 | - **Variance and Standard Deviation:** Useful for seeing to what degree a feature variable varies across all examples, i.e. are most of the values for this particular feature variable similar across the dataset, or are they all very different?
44 | - **Kurtosis:** Measures the "sharpness" of a distribution. If a distribution has a high kurtosis value (> 3), its values are sharply peaked around a central value. If K = 3, the kurtosis matches that of a normal distribution. If K < 3, the values of the distribution are more spread out (a flatter peak).
45 | - **Skewness:** Measures the asymmetry of a distribution. Positive skewness means the values are concentrated on the left (lower values) with a longer tail to the right; negative skewness means the values are concentrated on the right (higher values) with a longer tail to the left.
46 | - **Covariance:** The covariance of two variables measures how "correlated" they are. If the two variables have a positive covariance, then when one variable increases so does the other; with a negative covariance the values of the feature variables change in opposite directions.
47 | - **Correlation:** Correlation is simply the normalized (scaled) covariance, where we divide by the product of the standard deviations of the two variables being analyzed. While covariance changes with the units of the variables and can lie anywhere between -∞ and +∞, correlation is unit-free and always lies between -1 and 1. As stated above, covariance is expressed in the units of the variables being measured. Using covariance, you can tell whether two variables tend to increase or decrease together, but it is hard to measure the degree to which they move together, because covariance has no standard unit of measurement. To measure the degree to which variables move together, you must use correlation. The magnitude of the correlation tells us how strongly the features are correlated. If the _correlation coefficient_ is one, the variables have a perfect positive correlation. This means that if one variable moves a given amount, the second moves proportionally in the same direction. A positive correlation coefficient less than one indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches one. If the _correlation coefficient is zero_, no linear relationship exists between the variables; if one variable moves, you can make no predictions about the movement of the other variable. They are uncorrelated.
48 | If the _correlation coefficient is -1_, the variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other. If one variable increases, the other variable decreases proportionally. A negative correlation coefficient greater than -1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches -1.
49 | - **PCA Dimensionality Reduction:** Principal Component Analysis (PCA) is a technique commonly used for dimensionality reduction. PCA computes the feature vectors (principal components) along which the data has the highest variance. Since these feature vectors capture the highest variance, they also hold most of the information that the data represents. Therefore we can project the data onto these feature vectors, reducing the dimensionality of the data, which makes analysis easier and clearer. A short NumPy sketch of these statistics, including this eigenvector-based PCA, follows this list.
50 | - **Data Shuffling:** Shuffling the data before applying a machine learning algorithm often improves performance, because it removes any ordering in the samples (for example, all examples of one class appearing together) that could bias training or the train/test split.
51 | 
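The sketch below is separate from the repository's own `ml_helpers.py` and `statistics_helpers.py` implementations; it only shows how the same quantities could be computed with NumPy/SciPy on the sklearn iris data, including a simple eigendecomposition-based PCA that keeps enough components to explain roughly 95% of the variance.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

X = load_iris().data                  # shape (150, 4)
f0, f1 = X[:, 0], X[:, 1]             # two feature columns

# Centre and spread of a single feature
print(np.mean(f0), np.median(f0))
print(np.var(f0, ddof=1), np.std(f0, ddof=1))

# Shape of the distribution (kurtosis reported so that 3 = normal)
print(stats.skew(f0), stats.kurtosis(f0, fisher=False))

# Covariance and correlation between two features
print(np.cov(f0, f1)[0, 1])           # unit-dependent
print(np.corrcoef(f0, f1)[0, 1])      # always between -1 and 1

# PCA: project onto the eigenvectors of the covariance matrix that
# explain ~95% of the total variance
Xc = X - X.mean(axis=0)
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eig_vals)[::-1]    # sort components by variance, descending
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
explained = np.cumsum(eig_vals) / eig_vals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1
X_reduced = Xc.dot(eig_vecs[:, :k])   # reduced data, shape (150, k)
print(X_reduced.shape)
```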
52 | 
53 | ### Pandas Data Science
54 | 
55 | #### Basic Dataset Information
56 | - **Read in a CSV dataset:** `pd.read_csv("csv_file")` (the older `pd.DataFrame.from_csv("csv_file")` is deprecated)
57 | - **Read in an Excel dataset:** `pd.read_excel("excel_file")`
58 | - **Basic dataset feature info:** `df.info()`
59 | - **Basic dataset statistics:** `print(df.describe())`
60 | - **Print dataframe in a table:** `print(tabulate(print_table, headers=headers))` where "print_table" is a list of lists and "headers" is a list of the string headers
61 | 
62 | #### Basic Data Handling
63 | - **Drop missing data:** `df.dropna(axis=0, how='any')` Drops rows (axis=0) or columns (axis=1) that contain missing values; `how='any'` drops when any value is missing, `how='all'` only when all values are missing
64 | - **Replace missing data:** `df.replace(to_replace=None, value=None)` Replace values given in "to_replace" with "value"
65 | - **Check for NaNs:** `pd.isnull(object)` Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
66 | - **Drop a feature:** `df.drop('feature_variable_name', axis=1)` axis is either 0 for rows, 1 for columns
67 | - **Convert object type to float:** `pd.to_numeric(df["feature_name"], errors='coerce')` Convert object types to numeric to be able to perform computations
68 | - **Convert DF to numpy array:** `df.values` or `df.to_numpy()` (the older `df.as_matrix()` is deprecated)
69 | - **Get first "n" rows:** `df.head([n])`
70 | - **Get a column by feature name:** `df['feature_name']` or `df.loc[:, 'feature_name']`
71 | 
72 | #### Basic Plotting
73 | - **Area plot:** `df.plot.area([x, y])`
74 | - **Vertical bar plot:** `df.plot.bar([x, y])`
75 | - **Horizontal bar plot:** `df.plot.barh([x, y])`
76 | - **Boxplot:** `df.plot.box([by])`
77 | - **Histogram:** `df.plot.hist([by, bins])`
78 | - **Line plot:** `df.plot.line([x, y])`
79 | - **Pie chart:** `df.plot.pie([y])`
80 | 
81 | 
82 | ### Matplotlib Plotting
83 | - **Scatter plot:** `scatter(x_data, y_data, s = 30, color = '#539caf', alpha = 0.75)`
84 | - **Line plot:** `plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1)`
85 | - **Histogram:** `hist(data, n_bins, color = '#539caf')`
86 | - **Probability Density Function:** `plot(x_data, density_est(x_data), color = '#539caf', lw = 2)` where `density_est(x_data)` computes the probability density of each data point
87 | - **Bar plot:** `bar(x_data, y_data, color = '#539caf', align = 'center')`
88 | - **Box plot:** `boxplot(y_data)` We set the x_data as the x-axis tick labels on the plot using `set_xticklabels(x_data)`. A short end-to-end example using several of the pandas and matplotlib calls above is given after the Examples section below.
89 | 
90 | 
91 | 
92 | 
93 | ### Examples
94 | ![alt text](https://github.com/GeorgeSeif/Data-Science-Python/blob/master/Images/explore_wine_scattermatrix.png)
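As a complementary example, here is a minimal end-to-end sketch of the pandas calls listed above. It is not part of the repository scripts: the file name `data.csv` and the column names `price` and `city` are placeholders chosen only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# "data.csv", "price" and "city" are placeholder names, not files from this repository
df = pd.read_csv("data.csv")

df.info()                          # column names, dtypes, non-null counts
print(df.describe())               # basic statistics for the numeric columns

df = df.dropna(axis=0, how='any')                            # drop rows containing missing values
df['price'] = pd.to_numeric(df['price'], errors='coerce')    # force a column to a numeric dtype
df = df.drop('city', axis=1)                                 # drop a feature (column)

print(df.head(5))                  # first 5 rows
data_array = df.values             # DataFrame -> numpy array

df['price'].plot.hist(bins=20)     # quick histogram straight from pandas
plt.show()
```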
--------------------------------------------------------------------------------
/covariance_boston.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from scipy import stats
4 | import seaborn.apionly as sns
5 | from tabulate import tabulate
6 | import matplotlib.pyplot as plt
7 | from sklearn.datasets import load_boston
8 | 
9 | import ml_helpers as helpers
10 | 
11 | # NOTE that this loads as a dictionary
12 | boston_data = load_boston()
13 | 
14 | train_data = np.array(boston_data.data)
15 | train_labels = np.array(boston_data.target)
16 | 
17 | num_features = boston_data.data.shape[1]
18 | unique_labels = np.unique(train_labels)
19 | num_classes = len(unique_labels)
20 | 
21 | 
22 | print("The boston dataset has " + str(num_features) + " features")
23 | print(boston_data.feature_names)
24 | 
25 | 
26 | 
27 | # Put everything into a Pandas DataFrame
28 | data = pd.DataFrame(data=np.c_[train_data], columns=boston_data.feature_names)
29 | # print(tabulate(data, headers='keys', tablefmt='psql'))
30 | 
31 | 
32 | 
33 | # Compute the covariance matrix
34 | cov_mat_boston = np.cov(train_data.T)
35 | print("Covariance matrix")
36 | print(cov_mat_boston)
37 | 
38 | 
39 | 
40 | # Normalize the data and then recompute the covariance matrix
41 | normalized_train_data = helpers.normalize_data(train_data)
42 | normalized_cov_mat_boston = np.cov(normalized_train_data.T)
43 | print("Normalized data covariance matrix")
44 | print(normalized_cov_mat_boston)
45 | 
46 | 
47 | 
48 | # create scatterplot matrix
49 | fig = sns.pairplot(data=data, hue='CRIM')
50 | 
51 | plt.show()
--------------------------------------------------------------------------------
/explore_wine_data.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from tabulate import tabulate
4 | import seaborn.apionly as sns
5 | import matplotlib.pyplot as plt
6 | from sklearn.datasets import load_wine
7 | 
8 | 
9 | # ------------------------------------------------------------------------------------------------
10 | 
11 | # Read in the data
12 | # NOTE that this loads as a dictionary
13 | wine_data = load_wine()
14 | 
15 | train_data = np.array(wine_data.data)
16 | train_labels = np.array(wine_data.target)
17 | 
18 | num_features = wine_data.data.shape[1]
19 | unique_labels = np.unique(train_labels)
20 | num_classes = len(unique_labels)
21 | 
22 | 
23 | print("The wine dataset has " + str(num_features) + " features")
24 | print(wine_data.feature_names)
25 | print("The wine dataset has " + str(num_classes) + " categories")
26 | print(wine_data.target_names)
27 | 
28 | 
29 | # Put everything into a Pandas DataFrame
30 | data = pd.DataFrame(data=np.c_[train_data, train_labels], columns=wine_data.feature_names + ['category'])
31 | # print(tabulate(data, headers='keys', tablefmt='psql'))
32 | 
33 | # ------------------------------------------------------------------------------------------------
34 | 
35 | 
36 | 
37 | 
38 | 
39 | # ------------------------------------------------------------------------------------------------
40 | 
41 | # Create histogram
42 | hist_feature_name='color_intensity'
43 | bin_edges = np.arange(0, data[hist_feature_name].max() + 1, 1)
44 | fig = plt.hist(data[hist_feature_name], bins=bin_edges)
45 | 
46 | plt.ylabel('count')
47 | plt.xlabel(hist_feature_name)
48 | plt.show()
49 | 
50 | # ------------------------------------------------------------------------------------------------
51 | 
52 | 
53 | 
54 | 
55 | 
56 | 
57 | 
58 | # ------------------------------------------------------------------------------------------------
59 | 
60 | # Create grouped bar plot
61 | 
62 | 
63 | var_name_1 = 'alcohol'
64 | var_name_2 = 'color_intensity'
65 | 
66 | 
67 | # Setting the positions and width for the bars
68 | pos = list(range(num_classes))
69 | width = 0.1
70 | 
71 | # Plotting the bars
72 | fig, ax = plt.subplots(figsize=(10,5))
73 | 
74 | # Set the position of the x ticks
75 | ax.set_xticks([p + 1.5 * width for p in pos])
76 | ax.set_xticklabels(list(range(num_classes)))
77 | 
78 | class_0_data = data[data.category==0]
79 | alcohol_values_0 = class_0_data[var_name_1].values
80 | mean_alcohol_0 = np.mean(alcohol_values_0)
81 | color_values_0 = class_0_data[var_name_2].values
82 | mean_color_0 = np.mean(color_values_0)
83 | 
84 | class_1_data = data[data.category==1]
85 | alcohol_values_1 = class_1_data[var_name_1].values
86 | mean_alcohol_1 = np.mean(alcohol_values_1)
87 | color_values_1 = class_1_data[var_name_2].values
88 | mean_color_1 = np.mean(color_values_1)
89 | 
90 | class_2_data = data[data.category==2]
91 | alcohol_values_2 = class_2_data[var_name_1].values
92 | mean_alcohol_2 = np.mean(alcohol_values_2)
93 | color_values_2 = class_2_data[var_name_2].values
94 | mean_color_2 = np.mean(color_values_2)
95 | 
96 | plt.bar(pos, [mean_alcohol_0, mean_alcohol_1, mean_alcohol_2], width, alpha=1.0, color='#EE3224', label='alcohol')
97 | plt.bar([p + width for p in pos], [mean_color_0, mean_color_1, mean_color_2], width, alpha=1.0, color='#F78F1E', label='color_intensity')
98 | 
99 | 
100 | plt.legend([var_name_1, var_name_2], loc='upper left')
101 | 
102 | plt.show()
103 | 
104 | 
105 | # 
------------------------------------------------------------------------------------------------ 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | # ------------------------------------------------------------------------------------------------ 116 | 117 | # Create scatterplot 118 | scatter_feature_name_1='color_intensity' 119 | scatter_feature_name_2='alcohol' 120 | fig = plt.scatter(data[scatter_feature_name_1], data[scatter_feature_name_2]) 121 | 122 | plt.xlabel(scatter_feature_name_1) 123 | plt.ylabel(scatter_feature_name_2) 124 | plt.show() 125 | 126 | 127 | 128 | # Create scatterplot matrix 129 | fig = sns.pairplot(data=data[['alcohol', 'color_intensity', 'malic_acid', 'magnesium', 'category']], hue='category') 130 | 131 | plt.show() 132 | 133 | # ------------------------------------------------------------------------------------------------ 134 | 135 | 136 | 137 | # ------------------------------------------------------------------------------------------------ 138 | 139 | # Create bee swarm plot 140 | sns.swarmplot(x='category', y='total_phenols', data=data) 141 | plt.show() 142 | 143 | # ------------------------------------------------------------------------------------------------ 144 | 145 | 146 | 147 | 148 | 149 | # ------------------------------------------------------------------------------------------------ 150 | 151 | # Cumulative Distribution Function Plots 152 | 153 | 154 | # Sort and normalize data 155 | x = np.sort(data['hue']) 156 | y = np.arange(1, x.shape[0] + 1, dtype='float32') / x.shape[0] 157 | 158 | plt.plot(x, y, marker='o', linestyle='') 159 | 160 | plt.ylabel('ECDF') 161 | plt.xlabel('hue') 162 | 163 | eightieth_percentile = x[y <= 0.75].max() 164 | 165 | plt.axhline(0.75, color='black', linestyle='--') 166 | plt.axvline(eightieth_percentile, color='black', label='75th percentile') 167 | plt.legend() 168 | plt.show() -------------------------------------------------------------------------------- /install.txt: -------------------------------------------------------------------------------- 1 | sudo apt-get install python-pip 2 | sudo pip install numpy scipy 3 | sudo pip install pandas 4 | sudo apt-get install python-matplotlib 5 | sudo pip install -U scikit-learn 6 | sudo pip install tabulate 7 | -------------------------------------------------------------------------------- /ml_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | # Split the data into train and test sets 5 | def train_test_split(X, y, test_size=0.2): 6 | # First, shuffle the data 7 | train_data, train_labels = shuffle_data(X, y) 8 | 9 | # Split the training data from test data in the ratio specified in test_size 10 | split_i = len(y) - int(len(y) // (1 / test_size)) 11 | x_train, x_test = train_data[:split_i], train_data[split_i:] 12 | y_train, y_test = train_labels[:split_i], train_labels[split_i:] 13 | 14 | return x_train, x_test, y_train, y_test 15 | 16 | # Randomly shuffle the data 17 | def shuffle_data(data, labels): 18 | if(len(data) != len(labels)): 19 | raise Exception("The given data and labels do NOT have the same length") 20 | 21 | combined = list(zip(data, labels)) 22 | random.shuffle(combined) 23 | data[:], labels[:] = zip(*combined) 24 | return data, labels 25 | 26 | # Calculate the distance between two vectors 27 | def euclidean_distance(vec_1, vec_2): 28 | if(len(vec_1) != len(vec_2)): 29 | raise Exception("The two vectors do NOT have equal length") 30 | 31 | distance = 0 
32 | for i in range(len(vec_1)): 33 | distance += pow((vec_1[i] - vec_2[i]), 2) 34 | 35 | return np.sqrt(distance) 36 | 37 | # Compute the mean and variance of each feature of a data set 38 | def compute_mean_and_var(data): 39 | num_elements = len(data) 40 | total = [0] * data.shape[1] 41 | for sample in data: 42 | total = total + sample 43 | mean_features = np.divide(total, num_elements) 44 | 45 | total = [0] * data.shape[1] 46 | for sample in data: 47 | total = total + np.square(sample - mean_features) 48 | 49 | std_features = np.divide(total, num_elements) 50 | 51 | var_features = std_features ** 2 52 | 53 | return mean_features, var_features 54 | 55 | # Normalize data by subtracting mean and dividing by standard deviation 56 | def normalize_data(data): 57 | mean_features, var_features = compute_mean_and_var(data) 58 | std_features = np.sqrt(var_features) 59 | 60 | for index, sample in enumerate(data): 61 | data[index] = np.divide((sample - mean_features), std_features) 62 | 63 | return data 64 | 65 | # Divide dataset based on if sample value on feature index is larger than 66 | # the given threshold 67 | def divide_on_feature(X, feature_i, threshold): 68 | split_func = None 69 | if isinstance(threshold, int) or isinstance(threshold, float): 70 | split_func = lambda sample: sample[feature_i] >= threshold 71 | else: 72 | split_func = lambda sample: sample[feature_i] == threshold 73 | 74 | X_1 = np.array([sample for sample in X if split_func(sample)]) 75 | X_2 = np.array([sample for sample in X if not split_func(sample)]) 76 | 77 | return np.array([X_1, X_2]) 78 | 79 | # Return random subsets (with replacements) of the data 80 | def get_random_subsets(X, y, n_subsets, replacements=True): 81 | n_samples = np.shape(X)[0] 82 | # Concatenate x and y and do a random shuffle 83 | X_y = np.concatenate((X, y.reshape((1, len(y))).T), axis=1) 84 | np.random.shuffle(X_y) 85 | subsets = [] 86 | 87 | # Uses 50% of training samples without replacements 88 | subsample_size = n_samples // 2 89 | if replacements: 90 | subsample_size = n_samples # 100% with replacements 91 | 92 | for _ in range(n_subsets): 93 | idx = np.random.choice(range(n_samples), size=np.shape(range(subsample_size)), replace=replacements) 94 | X = X_y[idx][:, :-1] 95 | y = X_y[idx][:, -1] 96 | subsets.append([X, y]) 97 | return subsets 98 | 99 | # Calculate the entropy of label array y 100 | def calculate_entropy(y): 101 | log2 = lambda x: np.log(x) / np.log(2) 102 | unique_labels = np.unique(y) 103 | entropy = 0 104 | for label in unique_labels: 105 | count = len(y[y == label]) 106 | p = count / len(y) 107 | entropy += -p * log2(p) 108 | return entropy 109 | 110 | # Returns the mean squared error between y_true and y_pred 111 | def mean_squared_error(y_true, y_pred): 112 | mse = np.mean(np.power(y_true - y_pred, 2)) 113 | return mse 114 | 115 | # The sigmoid function 116 | def sigmoid(val): 117 | return np.divide(1, (1 + np.exp(-1*val))) 118 | 119 | # The derivative of the sigmoid function 120 | def sigmoid_gradient(val): 121 | return sigmoid(val) * (1 - sigmoid(val)) 122 | 123 | # Compute the covariance matrix of an array 124 | def compute_cov_mat(data): 125 | # Compute the mean of the data 126 | mean_vec = np.mean(data, axis=0) 127 | 128 | # Compute the covariance matrix 129 | cov_mat = (data - mean_vec).T.dot((data - mean_vec)) / (data.shape[0]-1) 130 | 131 | return cov_mat 132 | 133 | 134 | # Perform PCA dimensionality reduction 135 | def pca(data, exp_var_percentage=95): 136 | 137 | # Compute the covariance matrix 138 | cov_mat 
= compute_cov_mat(data) 139 | 140 | # Compute the eigen values and vectors of the covariance matrix 141 | eig_vals, eig_vecs = np.linalg.eig(cov_mat) 142 | 143 | # Make a list of (eigenvalue, eigenvector) tuples 144 | eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))] 145 | 146 | # Sort the (eigenvalue, eigenvector) tuples from high to low 147 | eig_pairs.sort(key=lambda x: x[0], reverse=True) 148 | 149 | # Only keep a certain number of eigen vectors based on the "explained variance percentage" 150 | # which tells us how much information (variance) can be attributed to each of the principal components 151 | tot = sum(eig_vals) 152 | var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)] 153 | cum_var_exp = np.cumsum(var_exp) 154 | 155 | num_vec_to_keep = 0 156 | 157 | for index, percentage in enumerate(cum_var_exp): 158 | if percentage > exp_var_percentage: 159 | num_vec_to_keep = index + 1 160 | break 161 | 162 | # Compute the projection matrix based on the top eigen vectors 163 | proj_mat = eig_pairs[0][1].reshape(4,1) 164 | for eig_vec_idx in range(1, num_vec_to_keep): 165 | proj_mat = np.hstack((proj_mat, eig_pairs[eig_vec_idx][1].reshape(4,1))) 166 | 167 | # Project the data 168 | pca_data = data.dot(proj_mat) 169 | 170 | return pca_data 171 | 172 | # 1D Gaussian Function 173 | def gaussian_1d(val, mean, standard_dev): 174 | coeff = 1 / (standard_dev * np.sqrt(2 * np.pi)) 175 | exponent = (-1 * (val - mean) ** 2) / (2 * (standard_dev ** 2)) 176 | gauss = coeff * np.exp(exponent) 177 | return gauss 178 | 179 | # 2D Gaussian Function 180 | def gaussian_2d(x_val, y_val, x_mean, y_mean, x_standard_dev, y_standard_dev): 181 | x_gauss = gaussian_1d(x_val, x_mean, x_standard_dev) 182 | y_gauss = gaussian_1d(y_val, y_mean, y_standard_dev) 183 | gauss = x_gauss * y_gauss 184 | return gauss 185 | -------------------------------------------------------------------------------- /plt_helpers.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | 3 | def scatterplot(x_data, y_data, x_label="", y_label="", title=""): 4 | 5 | # Create the plot object 6 | _, ax = plt.subplots() 7 | 8 | # Plot the data, set the size (s), color and transparency (alpha) 9 | # of the points 10 | ax.scatter(x_data, y_data, s = 30, color = '#539caf', alpha = 0.75) 11 | 12 | # Label the axes and provide a title 13 | ax.set_title(title) 14 | ax.set_xlabel(x_label) 15 | ax.set_ylabel(y_label) 16 | 17 | 18 | def lineplot(x_data, y_data, x_label="", y_label="", title=""): 19 | # Create the plot object 20 | _, ax = plt.subplots() 21 | 22 | # Plot the best fit line, set the linewidth (lw), color and 23 | # transparency (alpha) of the line 24 | ax.plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1) 25 | 26 | # Label the axes and provide a title 27 | ax.set_title(title) 28 | ax.set_xlabel(x_label) 29 | ax.set_ylabel(y_label) 30 | 31 | 32 | # Line plot with 2 different y values 33 | def lineplot2y(x_data, y1_data, y2_data, x_label="", y1_color="#539caf", y1_label="", y2_color="#7663b0", y2_label="", title=""): 34 | # Each variable will actually have its own plot object but they 35 | # will be displayed in just one plot 36 | # Create the first plot object and draw the line 37 | _, ax1 = plt.subplots() 38 | ax1.plot(x_data, y1_data, color = y1_color) 39 | # Label axes 40 | ax1.set_ylabel(y1_label, color = y1_color) 41 | ax1.set_xlabel(x_label) 42 | ax1.set_title(title) 43 | 44 | # Create the second plot object, telling matplotlib 
that the two
45 |     # objects have the same x-axis
46 |     ax2 = ax1.twinx()
47 |     ax2.plot(x_data, y2_data, color = y2_color)
48 |     ax2.set_ylabel(y2_label, color = y2_color)
49 |     # Show right frame line
50 |     ax2.spines['right'].set_visible(True)
51 | 
52 | 
53 | def histogram(data, n_bins, cumulative=False, x_label = "", y_label = "", title = ""):
54 |     _, ax = plt.subplots()
55 |     ax.hist(data, bins = n_bins, cumulative = cumulative, color = '#539caf')
56 |     ax.set_ylabel(y_label)
57 |     ax.set_xlabel(x_label)
58 |     ax.set_title(title)
59 | 
60 | 
61 | 
62 | # Overlay 2 histograms to compare them
63 | def overlaid_histogram(data1, data2, n_bins = 0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
64 |     # Set the bounds for the bins so that the two distributions are fairly compared
65 |     max_nbins = 10
66 |     data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
67 |     binwidth = (data_range[1] - data_range[0]) / max_nbins
68 | 
69 | 
70 |     if n_bins == 0:
71 |         bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth)
72 |     else:
73 |         bins = n_bins
74 | 
75 |     # Create the plot
76 |     _, ax = plt.subplots()
77 |     ax.hist(data1, bins = bins, color = data1_color, alpha = 1, label = data1_name)
78 |     ax.hist(data2, bins = bins, color = data2_color, alpha = 0.75, label = data2_name)
79 |     ax.set_ylabel(y_label)
80 |     ax.set_xlabel(x_label)
81 |     ax.set_title(title)
82 |     ax.legend(loc = 'best')
83 | 
84 | 
85 | # Probability Density Function
86 | def densityplot(x_data, density_est, x_label="", y_label="", title=""):
87 |     _, ax = plt.subplots()
88 |     ax.plot(x_data, density_est(x_data), color = '#539caf', lw = 2)
89 |     ax.set_ylabel(y_label)
90 |     ax.set_xlabel(x_label)
91 |     ax.set_title(title)
92 | 
93 | 
94 | 
95 | def barplot(x_data, y_data, error_data, x_label="", y_label="", title=""):
96 |     _, ax = plt.subplots()
97 |     # Draw bars, position them in the center of the tick mark on the x-axis
98 |     ax.bar(x_data, y_data, color = '#539caf', align = 'center')
99 |     # Draw error bars to show standard deviation, set ls to 'none'
100 |     # to remove line between points
101 |     ax.errorbar(x_data, y_data, yerr = error_data, color = '#297083', ls = 'none', lw = 2, capthick = 2)
102 |     ax.set_ylabel(y_label)
103 |     ax.set_xlabel(x_label)
104 |     ax.set_title(title)
105 | 
106 | 
107 | 
108 | def stackedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
109 |     _, ax = plt.subplots()
110 |     # Draw bars, one category at a time
111 |     for i in range(0, len(y_data_list)):
112 |         if i == 0:
113 |             ax.bar(x_data, y_data_list[i], color = colors[i], align = 'center', label = y_data_names[i])
114 |         else:
115 |             # For each category after the first, the bottom of the
116 |             # bar is the cumulative top of all the categories drawn so far
117 |             ax.bar(x_data, y_data_list[i], color = colors[i], bottom = np.sum(y_data_list[:i], axis = 0), align = 'center', label = y_data_names[i])
118 |     ax.set_ylabel(y_label)
119 |     ax.set_xlabel(x_label)
120 |     ax.set_title(title)
121 |     ax.legend(loc = 'upper right')
122 | 
123 | 
124 | 
125 | def groupedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
126 |     _, ax = plt.subplots()
127 |     # Total width for all bars at one x location
128 |     total_width = 0.8
129 |     # Width of each individual bar
130 |     ind_width = total_width / len(y_data_list)
131 |     # This centers each cluster of bars about the x tick mark
132 |     alteration = np.arange(-(total_width/2), total_width/2, ind_width)
133 | 
134 |     # Draw bars, one category at a time
135 |     for i in range(0, len(y_data_list)):
136 |         # Move the bar to the right on the x-axis so it doesn't
137 |         # overlap with previously drawn ones
138 |         ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
139 |     ax.set_ylabel(y_label)
140 |     ax.set_xlabel(x_label)
141 |     ax.set_title(title)
142 |     ax.legend(loc = 'upper right')
143 | 
144 | 
145 | 
146 | 
147 | def boxplot(x_data, y_data, base_color="#539caf", median_color="#297083", x_label="", y_label="", title=""):
148 |     _, ax = plt.subplots()
149 | 
150 |     # Draw boxplots, specifying desired style
151 |     ax.boxplot(y_data
152 |                # patch_artist must be True to control box fill
153 |                , patch_artist = True
154 |                # Properties of median line
155 |                , medianprops = {'color': median_color}
156 |                # Properties of box
157 |                , boxprops = {'color': base_color, 'facecolor': base_color}
158 |                # Properties of whiskers
159 |                , whiskerprops = {'color': base_color}
160 |                # Properties of whisker caps
161 |                , capprops = {'color': base_color})
162 | 
163 |     # By default, the tick label starts at 1 and increments by 1 for
164 |     # each box drawn. This sets the labels to the ones we want
165 |     ax.set_xticklabels(x_data)
166 |     ax.set_ylabel(y_label)
167 |     ax.set_xlabel(x_label)
168 |     ax.set_title(title)
--------------------------------------------------------------------------------
/statistics_helpers.py:
--------------------------------------------------------------------------------
1 | import math
2 | from collections import Counter
3 | 
4 | def mean(x):
5 |     return sum(x) / len(x)
6 | 
7 | def de_mean(x):
8 |     """translate x by subtracting its mean (so the result has mean 0)"""
9 |     x_bar = mean(x)
10 |     return [x_i - x_bar for x_i in x]
11 | 
12 | def dot(v, w):
13 |     """the sum of the component-wise products v_1 * w_1 + ... + v_n * w_n"""
14 |     return sum(v_i * w_i for v_i, w_i in zip(v, w))
15 | 
16 | def sum_of_squares(v):
17 |     """v_1 * v_1 + ... + v_n * v_n"""
18 |     return dot(v, v)
19 | 
20 | def median(v):
21 |     """finds the 'middle-most' value of v"""
22 |     n = len(v)
23 |     sorted_v = sorted(v)
24 |     midpoint = n // 2
25 | 
26 |     if n % 2 == 1:
27 |         # if odd, return the middle value
28 |         return sorted_v[midpoint]
29 |     else:
30 |         # if even, return the average of the middle values
31 |         lo = midpoint - 1
32 |         hi = midpoint
33 |         return (sorted_v[lo] + sorted_v[hi]) / 2
34 | 
35 | def quantile(x, p):
36 |     """returns the pth-percentile value in x"""
37 |     p_index = int(p * len(x))
38 |     return sorted(x)[p_index]
39 | 
40 | def mode(x):
41 |     """returns a list, might be more than one mode"""
42 |     counts = Counter(x)
43 |     max_count = max(counts.values())
44 |     return [x_i for x_i, count in counts.items()
45 |             if count == max_count]
46 | 
47 | 
48 | def data_range(x):
49 |     return max(x) - min(x)
50 | 
51 | def variance(x):
52 |     """assumes x has at least two elements"""
53 |     n = len(x)
54 |     deviations = de_mean(x)
55 |     return sum_of_squares(deviations) / (n - 1)
56 | 
57 | def standard_deviation(x):
58 |     return math.sqrt(variance(x))
59 | 
60 | def interquartile_range(x):
61 |     return quantile(x, 0.75) - quantile(x, 0.25)
62 | 
63 | 
64 | def covariance(x, y):
65 |     n = len(x)
66 |     return dot(de_mean(x), de_mean(y)) / (n - 1)
67 | 
68 | def correlation(x, y):
69 |     stdev_x = standard_deviation(x)
70 |     stdev_y = standard_deviation(y)
71 |     if stdev_x > 0 and stdev_y > 0:
72 |         return covariance(x, y) / stdev_x / stdev_y
73 |     else:
74 |         return 0  # if no variation, correlation is zero
75 | 
--------------------------------------------------------------------------------
/statistics_iris.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from scipy import stats
4 | from tabulate import tabulate
5 | import matplotlib.pyplot as plt
6 | from sklearn.datasets import load_iris
7 | 
8 | def compute_list_median(x):
9 |     x = np.sort(x)
10 | 
11 |     tmp = x.shape[0] // 2
12 | 
13 |     if x.shape[0] % 2:
14 |         median = x[tmp]
15 |     else:
16 |         median = x[tmp - 1] + (x[tmp] - x[tmp - 1]) / 2.
17 | 
18 |     return median
19 | 
20 | # NOTE that this loads as a dictionary
21 | iris_data = load_iris()
22 | 
23 | train_data = np.array(iris_data.data)
24 | train_labels = np.array(iris_data.target)
25 | 
26 | num_features = iris_data.data.shape[1]
27 | unique_labels = np.unique(train_labels)
28 | num_classes = len(unique_labels)
29 | 
30 | 
31 | print("The iris dataset has " + str(num_features) + " features")
32 | print(iris_data.feature_names)
33 | print("The iris dataset has " + str(num_classes) + " classes")
34 | print(iris_data.target_names)
35 | 
36 | 
37 | # Strip for easier indexing
38 | for i in range(len(iris_data.feature_names)):
39 |     iris_data.feature_names[i] = iris_data.feature_names[i].replace(' (cm)','')
40 | 
41 | # Put everything into a Pandas DataFrame
42 | data = pd.DataFrame(data=np.c_[train_data, train_labels], columns=iris_data.feature_names + ['class'])
43 | # print(tabulate(data, headers='keys', tablefmt='psql'))
44 | 
45 | 
46 | 
47 | # Create histogram
48 | hist_feature_name='sepal length'
49 | bin_edges = np.arange(0, data[hist_feature_name].max() + 1, 1)
50 | fig = plt.hist(data[hist_feature_name], bins=bin_edges)
51 | 
52 | plt.ylabel('count')
53 | plt.xlabel(hist_feature_name)
54 | # plt.show()
55 | 
56 | 
57 | 
58 | # Compute the mean sepal length and draw it on the same histogram
59 | sepal_length_values = data['sepal length'].values
60 | mean_sepal_length = sum(i for i in sepal_length_values) / len(sepal_length_values)
61 | mean_sepal_length = np.mean(sepal_length_values)
62 | print("Mean sepal length (cm) = " + str(mean_sepal_length))
63 | 
64 | plt.axvline(mean_sepal_length, color='green', linewidth=2)
65 | 
66 | 
67 | 
68 | # Compute the variance of the sepal length feature and draw it on the same histogram
69 | variance_sepal_length = sum([(i - mean_sepal_length)**2 for i in sepal_length_values]) / (len(sepal_length_values) - 1)
70 | variance_sepal_length = np.var(sepal_length_values, ddof=1)
71 | 
72 | print("Variance of sepal length (cm) = " + str(variance_sepal_length))
73 | 
74 | plt.axvline(mean_sepal_length + variance_sepal_length, color='red', linewidth=2)
75 | plt.axvline(mean_sepal_length - variance_sepal_length, color='red', linewidth=2)
76 | 
77 | 
78 | # Other values
79 | min_sepal_length = np.min(sepal_length_values)
80 | print("Minimum sepal length (cm) = " + str(min_sepal_length))
81 | 
82 | max_sepal_length = np.max(sepal_length_values)
83 | print("Maximum sepal length (cm) = " + str(max_sepal_length))
84 | 
85 | sorted_sepal_length_values = np.sort(sepal_length_values)
86 | percentile_25th = sorted_sepal_length_values[int(round(0.25 * (sorted_sepal_length_values.shape[0] - 1)))]
87 | percentile_75th = sorted_sepal_length_values[int(round(0.75 * (sorted_sepal_length_values.shape[0] - 1)))]
88 | print("25th Percentile sepal length (cm) = " + str(percentile_25th))
89 | print("75th Percentile sepal length (cm) = " + str(percentile_75th))
90 | 
91 | median_sepal_length = compute_list_median(sepal_length_values)
92 | median_sepal_length = np.median(sepal_length_values)
93 | print("Median sepal length (cm) = " + str(median_sepal_length))
94 | 
95 | 
96 | plt.show()
--------------------------------------------------------------------------------