├── Images
│   ├── explore_wine_beeswarm.png
│   ├── explore_wine_cdf.png
│   ├── explore_wine_histogram.png
│   └── explore_wine_scattermatrix.png
├── README.md
├── covariance_boston.py
├── explore_wine_data.py
├── install.txt
├── ml_helpers.py
├── plt_helpers.py
├── statistics_helpers.py
└── statistics_iris.py
/Images/explore_wine_beeswarm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_beeswarm.png
--------------------------------------------------------------------------------
/Images/explore_wine_cdf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_cdf.png
--------------------------------------------------------------------------------
/Images/explore_wine_histogram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_histogram.png
--------------------------------------------------------------------------------
/Images/explore_wine_scattermatrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GeorgeSeif/Data-Science-Python/a6dc955960d06be5d673393c3114bf592ed900ca/Images/explore_wine_scattermatrix.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Python Data Science
2 | 
3 | ## Description
4 | A collection of data science scripts for data analysis in Python. Please also see my related repository [Python Machine Learning](https://github.com/GeorgeSeif/Python-Machine-Learning) which contains many implementations of Machine Learning algorithms including _regression_, _classification_, and _clustering_. The algorithms are implemented in two ways: from scratch in Python and using Scikit Learn functions.
5 | 
6 | Python libraries used:
7 | - Numpy
8 | - Scipy
9 | - Scikit Learn
10 | - Pandas
11 | - Seaborn
12 | - Matplotlib
13 | 
14 | ## Installation
15 | To install all of the libraries, run the commands in the "install.txt" file. These are:
16 | 
17 | - `sudo apt-get install python-pip`
18 | - `sudo pip install numpy scipy`
19 | - `sudo pip install pandas`
20 | - `sudo apt-get install python-matplotlib`
21 | - `sudo pip install -U scikit-learn`
22 | - `sudo pip install tabulate`
23 | 
24 | ## Files
25 | - **ml_helpers.py:** Machine Learning helper functions. Adapted from my [Python Machine Learning](https://github.com/GeorgeSeif/Python-Machine-Learning) repository
26 | - **plt_helpers.py:** Helper functions to make plotting easy in Matplotlib.
27 | - **statistics_helpers.py:** Helper functions for computing dataset statistics
28 | - **explore_wine_data.py:** Exploratory data analysis of the wine dataset from sklearn using visualisations. Includes data analysis using histogram, scatterplot, bee swarm plot, and cumulative distribution function.
29 | - **statistics_iris.py:** Compute various statistics of the iris dataset features such as histogram, min, max, median, mean, and variance.
30 | - **covariance_boston.py:** Compute the covariance matrix of the Boston Housing dataset. These matrices can sometimes give faster insight into which variables are related than creating scatter plots does.
31 | 
32 | ## Information
33 | 
34 | ### Visualisations
35 | - **Histogram:** A histogram is a graphical method of displaying quantitative data. It displays a single quantitative variable along the x axis and the frequency of that variable on the y axis. The distinguishing feature of a histogram is that the data is grouped into "bins", which are intervals on the x axis.
36 | - **Scatterplot:** A scatter plot is a graphical method of displaying the relationship between data points. Each feature variable is assigned an axis, and each data point in the dataset is then plotted based on its feature values.
37 | - **Beeswarm Plot:** A beeswarm plot is a two-dimensional visualisation technique where data points are plotted relative to a fixed reference axis so that no two data points overlap. The beeswarm plot is useful when we wish to see not only the measured values of interest for each data point, but also the distribution of these values.
38 | - **Cumulative Distribution Function:** The cumulative distribution function (CDF) is the probability that a variable takes a value less than or equal to x. For example, we may wish to see what percentage of the data has a certain feature variable that is less than or equal to x.
39 | - **Bar Plots:** Classical bar plots that are good for visualisation and comparison of different data statistics, especially comparing statistics of feature variables. A short illustrative sketch of these plot types follows this list.
40 | 
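The following minimal sketch is not one of the repository scripts; it only illustrates how a few of the plot types above could be produced with matplotlib and seaborn on the sklearn wine data (the feature `color_intensity` is one of the real wine dataset columns, but any other feature would do).

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['category'] = wine.target

# Histogram: group one feature into bins and count the values per bin
plt.hist(df['color_intensity'], bins=10)
plt.xlabel('color_intensity')
plt.ylabel('count')
plt.show()

# Bee swarm plot: one point per sample, grouped by class, with no overlap
sns.swarmplot(x='category', y='color_intensity', data=df)
plt.show()

# Empirical CDF: fraction of samples whose value is <= x
x = np.sort(df['color_intensity'].values)
y = np.arange(1, len(x) + 1) / float(len(x))
plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('color_intensity')
plt.ylabel('ECDF')
plt.show()
```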
41 | ### Statistics
42 | - **Mean and Median:** Both of these show a type of "average" or "center" value for a particular feature variable. The mean is the more literal and precise center; however, the median is much more robust to outliers, which can pull the mean far away from the majority of the values.
43 | - **Variance and Standard Deviation:** Useful for seeing to what degree a feature variable varies across all examples, i.e. are most of the values for this particular feature variable similar across the dataset, or are they all very different?
44 | - **Kurtosis:** Measures the "sharpness" of a distribution. If a distribution has a high kurtosis value (> 3), its values are sharply peaked around a central value. If K = 3, the kurtosis matches that of a normal distribution. If K < 3, the values of the distribution are more spread out (a flatter peak).
45 | - **Skewness:** Measures the asymmetry of a distribution. Positive skewness means the values are concentrated on the left (lower values) with a longer tail to the right; negative skewness means the values are concentrated on the right (higher values) with a longer tail to the left.
46 | - **Covariance:** The covariance of two variables measures how "correlated" they are. If the two variables have a positive covariance, then when one variable increases so does the other; with a negative covariance the values of the feature variables change in opposite directions.
47 | - **Correlation:** Correlation is simply the normalized (scaled) covariance, where we divide by the product of the standard deviations of the two variables being analyzed. While covariance changes with the units of the variables and can lie anywhere between -∞ and +∞, correlation is unit-free and always lies between -1 and 1. As stated above, covariance is expressed in the units of the variables being measured. Using covariance, you can tell whether two variables tend to increase or decrease together, but it is hard to measure the degree to which they move together, because covariance has no standard unit of measurement. To measure the degree to which variables move together, you must use correlation. The magnitude of the correlation tells us how strongly the features are correlated. If the _correlation coefficient_ is one, the variables have a perfect positive correlation. This means that if one variable moves a given amount, the second moves proportionally in the same direction. A positive correlation coefficient less than one indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches one. If the _correlation coefficient is zero_, no linear relationship exists between the variables; if one variable moves, you can make no predictions about the movement of the other variable. They are uncorrelated.
48 | If the _correlation coefficient is -1_, the variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other. If one variable increases, the other variable decreases proportionally. A negative correlation coefficient greater than -1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches -1.
49 | - **PCA Dimensionality Reduction:** Principal Component Analysis (PCA) is a technique commonly used for dimensionality reduction. PCA computes the feature vectors (principal components) along which the data has the highest variance. Since these feature vectors capture the highest variance, they also hold most of the information that the data represents. Therefore we can project the data onto these feature vectors, reducing the dimensionality of the data, which makes analysis easier and clearer. A short NumPy sketch of these statistics, including this eigenvector-based PCA, follows this list.
50 | - **Data Shuffling:** Shuffling the data before applying a machine learning algorithm often improves performance, because it removes any ordering in the samples (for example, all examples of one class appearing together) that could bias training or the train/test split.
51 | 
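The sketch below is separate from the repository's own `ml_helpers.py` and `statistics_helpers.py` implementations; it only shows how the same quantities could be computed with NumPy/SciPy on the sklearn iris data, including a simple eigendecomposition-based PCA that keeps enough components to explain roughly 95% of the variance.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

X = load_iris().data                  # shape (150, 4)
f0, f1 = X[:, 0], X[:, 1]             # two feature columns

# Centre and spread of a single feature
print(np.mean(f0), np.median(f0))
print(np.var(f0, ddof=1), np.std(f0, ddof=1))

# Shape of the distribution (kurtosis reported so that 3 = normal)
print(stats.skew(f0), stats.kurtosis(f0, fisher=False))

# Covariance and correlation between two features
print(np.cov(f0, f1)[0, 1])           # unit-dependent
print(np.corrcoef(f0, f1)[0, 1])      # always between -1 and 1

# PCA: project onto the eigenvectors of the covariance matrix that
# explain ~95% of the total variance
Xc = X - X.mean(axis=0)
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eig_vals)[::-1]    # sort components by variance, descending
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
explained = np.cumsum(eig_vals) / eig_vals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1
X_reduced = Xc.dot(eig_vecs[:, :k])   # reduced data, shape (150, k)
print(X_reduced.shape)
```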
52 | 
53 | ### Pandas Data Science
54 | 
55 | #### Basic Dataset Information
56 | - **Read in a CSV dataset:** `pd.read_csv("csv_file")` (the older `pd.DataFrame.from_csv("csv_file")` is deprecated)
57 | - **Read in an Excel dataset:** `pd.read_excel("excel_file")`
58 | - **Basic dataset feature info:** `df.info()`
59 | - **Basic dataset statistics:** `print(df.describe())`
60 | - **Print dataframe in a table:** `print(tabulate(print_table, headers=headers))` where "print_table" is a list of lists and "headers" is a list of the string headers
61 | 
62 | #### Basic Data Handling
63 | - **Drop missing data:** `df.dropna(axis=0, how='any')` Drops rows (axis=0) or columns (axis=1) that contain missing values; `how='any'` drops when any value is missing, `how='all'` only when all values are missing
64 | - **Replace missing data:** `df.replace(to_replace=None, value=None)` Replace values given in "to_replace" with "value"
65 | - **Check for NaNs:** `pd.isnull(object)` Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
66 | - **Drop a feature:** `df.drop('feature_variable_name', axis=1)` axis is either 0 for rows, 1 for columns
67 | - **Convert object type to float:** `pd.to_numeric(df["feature_name"], errors='coerce')` Convert object types to numeric to be able to perform computations
68 | - **Convert DF to numpy array:** `df.values` or `df.to_numpy()` (the older `df.as_matrix()` is deprecated)
69 | - **Get first "n" rows:** `df.head([n])`
70 | - **Get a column by feature name:** `df['feature_name']` or `df.loc[:, 'feature_name']`
71 | 
72 | #### Basic Plotting
73 | - **Area plot:** `df.plot.area([x, y])`
74 | - **Vertical bar plot:** `df.plot.bar([x, y])`
75 | - **Horizontal bar plot:** `df.plot.barh([x, y])`
76 | - **Boxplot:** `df.plot.box([by])`
77 | - **Histogram:** `df.plot.hist([by, bins])`
78 | - **Line plot:** `df.plot.line([x, y])`
79 | - **Pie chart:** `df.plot.pie([y])`
80 | 
81 | 
82 | ### Matplotlib Plotting
83 | - **Scatter plot:** `scatter(x_data, y_data, s = 30, color = '#539caf', alpha = 0.75)`
84 | - **Line plot:** `plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1)`
85 | - **Histogram:** `hist(data, n_bins, color = '#539caf')`
86 | - **Probability Density Function:** `plot(x_data, density_est(x_data), color = '#539caf', lw = 2)` where `density_est(x_data)` computes the probability density of each data point
87 | - **Bar plot:** `bar(x_data, y_data, color = '#539caf', align = 'center')`
88 | - **Box plot:** `boxplot(y_data)` We set the x_data as the x-axis tick labels on the plot using `set_xticklabels(x_data)`. A short end-to-end example using several of the pandas and matplotlib calls above is given after the Examples section below.
89 | 
90 | 
91 | 
92 | 
93 | ### Examples
94 | ![alt text](https://github.com/GeorgeSeif/Data-Science-Python/blob/master/Images/explore_wine_scattermatrix.png)
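As a complementary example, here is a minimal end-to-end sketch of the pandas calls listed above. It is not part of the repository scripts: the file name `data.csv` and the column names `price` and `city` are placeholders chosen only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# "data.csv", "price" and "city" are placeholder names, not files from this repository
df = pd.read_csv("data.csv")

df.info()                          # column names, dtypes, non-null counts
print(df.describe())               # basic statistics for the numeric columns

df = df.dropna(axis=0, how='any')                            # drop rows containing missing values
df['price'] = pd.to_numeric(df['price'], errors='coerce')    # force a column to a numeric dtype
df = df.drop('city', axis=1)                                 # drop a feature (column)

print(df.head(5))                  # first 5 rows
data_array = df.values             # DataFrame -> numpy array

df['price'].plot.hist(bins=20)     # quick histogram straight from pandas
plt.show()
```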
--------------------------------------------------------------------------------
/covariance_boston.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from scipy import stats
4 | import seaborn.apionly as sns
5 | from tabulate import tabulate
6 | import matplotlib.pyplot as plt
7 | from sklearn.datasets import load_boston
8 | 
9 | import ml_helpers as helpers
10 | 
11 | # NOTE that this loads as a dictionary
12 | boston_data = load_boston()
13 | 
14 | train_data = np.array(boston_data.data)
15 | train_labels = np.array(boston_data.target)
16 | 
17 | num_features = boston_data.data.shape[1]
18 | unique_labels = np.unique(train_labels)
19 | num_classes = len(unique_labels)
20 | 
21 | 
22 | print("The boston dataset has " + str(num_features) + " features")
23 | print(boston_data.feature_names)
24 | 
25 | 
26 | 
27 | # Put everything into a Pandas DataFrame
28 | data = pd.DataFrame(data=np.c_[train_data], columns=boston_data.feature_names)
29 | # print(tabulate(data, headers='keys', tablefmt='psql'))
30 | 
31 | 
32 | 
33 | # Compute the covariance matrix
34 | cov_mat_boston = np.cov(train_data.T)
35 | print("Covariance matrix")
36 | print(cov_mat_boston)
37 | 
38 | 
39 | 
40 | # Normalize the data and then recompute the covariance matrix
41 | normalized_train_data = helpers.normalize_data(train_data)
42 | normalized_cov_mat_boston = np.cov(normalized_train_data.T)
43 | print("Normalized data covariance matrix")
44 | print(normalized_cov_mat_boston)
45 | 
46 | 
47 | 
48 | # create scatterplot matrix
49 | fig = sns.pairplot(data=data, hue='CRIM')
50 | 
51 | plt.show()
--------------------------------------------------------------------------------
/explore_wine_data.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from tabulate import tabulate
4 | import seaborn.apionly as sns
5 | import matplotlib.pyplot as plt
6 | from sklearn.datasets import load_wine
7 | 
8 | 
9 | # ------------------------------------------------------------------------------------------------
10 | 
11 | # Read in the data
12 | # NOTE that this loads as a dictionary
13 | wine_data = load_wine()
14 | 
15 | train_data = np.array(wine_data.data)
16 | train_labels = np.array(wine_data.target)
17 | 
18 | num_features = wine_data.data.shape[1]
19 | unique_labels = np.unique(train_labels)
20 | num_classes = len(unique_labels)
21 | 
22 | 
23 | print("The wine dataset has " + str(num_features) + " features")
24 | print(wine_data.feature_names)
25 | print("The wine dataset has " + str(num_classes) + " categories")
26 | print(wine_data.target_names)
27 | 
28 | 
29 | # Put everything into a Pandas DataFrame
30 | data = pd.DataFrame(data=np.c_[train_data, train_labels], columns=wine_data.feature_names + ['category'])
31 | # print(tabulate(data, headers='keys', tablefmt='psql'))
32 | 
33 | # ------------------------------------------------------------------------------------------------
34 | 
35 | 
36 | 
37 | 
38 | 
39 | # ------------------------------------------------------------------------------------------------
40 | 
41 | # Create histogram
42 | hist_feature_name='color_intensity'
43 | bin_edges = np.arange(0, data[hist_feature_name].max() + 1, 1)
44 | fig = plt.hist(data[hist_feature_name], bins=bin_edges)
45 | 
46 | plt.ylabel('count')
47 | plt.xlabel(hist_feature_name)
48 | plt.show()
49 | 
50 | # ------------------------------------------------------------------------------------------------
51 | 
52 | 
53 | 
54 | 
55 | 
56 | 
57 | 
58 | # ------------------------------------------------------------------------------------------------
59 | 
60 | # Create grouped bar plot
61 | 
62 | 
63 | var_name_1 = 'alcohol'
64 | var_name_2 = 'color_intensity'
65 | 
66 | 
67 | # Setting the positions and width for the bars
68 | pos = list(range(num_classes))
69 | width = 0.1
70 | 
71 | # Plotting the bars
72 | fig, ax = plt.subplots(figsize=(10,5))
73 | 
74 | # Set the position of the x ticks
75 | ax.set_xticks([p + 1.5 * width for p in pos])
76 | ax.set_xticklabels(list(range(num_classes)))
77 | 
78 | class_0_data = data[data.category==0]
79 | alcohol_values_0 = class_0_data[var_name_1].values
80 | mean_alcohol_0 = np.mean(alcohol_values_0)
81 | color_values_0 = class_0_data[var_name_2].values
82 | mean_color_0 = np.mean(color_values_0)
83 | 
84 | class_1_data = data[data.category==1]
85 | alcohol_values_1 = class_1_data[var_name_1].values
86 | mean_alcohol_1 = np.mean(alcohol_values_1)
87 | color_values_1 = class_1_data[var_name_2].values
88 | mean_color_1 = np.mean(color_values_1)
89 | 
90 | class_2_data = data[data.category==2]
91 | alcohol_values_2 = class_2_data[var_name_1].values
92 | mean_alcohol_2 = np.mean(alcohol_values_2)
93 | color_values_2 = class_2_data[var_name_2].values
94 | mean_color_2 = np.mean(color_values_2)
95 | 
96 | plt.bar(pos, [mean_alcohol_0, mean_alcohol_1, mean_alcohol_2], width, alpha=1.0, color='#EE3224', label='alcohol')
97 | plt.bar([p + width for p in pos], [mean_color_0, mean_color_1, mean_color_2], width, alpha=1.0, color='#F78F1E', label='color_intensity')
98 | 
99 | 
100 | plt.legend([var_name_1, var_name_2], loc='upper left')
101 | 
102 | plt.show()
103 | 
104 | 
105 | # 
------------------------------------------------------------------------------------------------ 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | # ------------------------------------------------------------------------------------------------ 116 | 117 | # Create scatterplot 118 | scatter_feature_name_1='color_intensity' 119 | scatter_feature_name_2='alcohol' 120 | fig = plt.scatter(data[scatter_feature_name_1], data[scatter_feature_name_2]) 121 | 122 | plt.xlabel(scatter_feature_name_1) 123 | plt.ylabel(scatter_feature_name_2) 124 | plt.show() 125 | 126 | 127 | 128 | # Create scatterplot matrix 129 | fig = sns.pairplot(data=data[['alcohol', 'color_intensity', 'malic_acid', 'magnesium', 'category']], hue='category') 130 | 131 | plt.show() 132 | 133 | # ------------------------------------------------------------------------------------------------ 134 | 135 | 136 | 137 | # ------------------------------------------------------------------------------------------------ 138 | 139 | # Create bee swarm plot 140 | sns.swarmplot(x='category', y='total_phenols', data=data) 141 | plt.show() 142 | 143 | # ------------------------------------------------------------------------------------------------ 144 | 145 | 146 | 147 | 148 | 149 | # ------------------------------------------------------------------------------------------------ 150 | 151 | # Cumulative Distribution Function Plots 152 | 153 | 154 | # Sort and normalize data 155 | x = np.sort(data['hue']) 156 | y = np.arange(1, x.shape[0] + 1, dtype='float32') / x.shape[0] 157 | 158 | plt.plot(x, y, marker='o', linestyle='') 159 | 160 | plt.ylabel('ECDF') 161 | plt.xlabel('hue') 162 | 163 | eightieth_percentile = x[y <= 0.75].max() 164 | 165 | plt.axhline(0.75, color='black', linestyle='--') 166 | plt.axvline(eightieth_percentile, color='black', label='75th percentile') 167 | plt.legend() 168 | plt.show() -------------------------------------------------------------------------------- /install.txt: -------------------------------------------------------------------------------- 1 | sudo apt-get install python-pip 2 | sudo pip install numpy scipy 3 | sudo pip install pandas 4 | sudo apt-get install python-matplotlib 5 | sudo pip install -U scikit-learn 6 | sudo pip install tabulate 7 | -------------------------------------------------------------------------------- /ml_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | # Split the data into train and test sets 5 | def train_test_split(X, y, test_size=0.2): 6 | # First, shuffle the data 7 | train_data, train_labels = shuffle_data(X, y) 8 | 9 | # Split the training data from test data in the ratio specified in test_size 10 | split_i = len(y) - int(len(y) // (1 / test_size)) 11 | x_train, x_test = train_data[:split_i], train_data[split_i:] 12 | y_train, y_test = train_labels[:split_i], train_labels[split_i:] 13 | 14 | return x_train, x_test, y_train, y_test 15 | 16 | # Randomly shuffle the data 17 | def shuffle_data(data, labels): 18 | if(len(data) != len(labels)): 19 | raise Exception("The given data and labels do NOT have the same length") 20 | 21 | combined = list(zip(data, labels)) 22 | random.shuffle(combined) 23 | data[:], labels[:] = zip(*combined) 24 | return data, labels 25 | 26 | # Calculate the distance between two vectors 27 | def euclidean_distance(vec_1, vec_2): 28 | if(len(vec_1) != len(vec_2)): 29 | raise Exception("The two vectors do NOT have equal length") 30 | 31 | distance = 0 
32 | for i in range(len(vec_1)): 33 | distance += pow((vec_1[i] - vec_2[i]), 2) 34 | 35 | return np.sqrt(distance) 36 | 37 | # Compute the mean and variance of each feature of a data set 38 | def compute_mean_and_var(data): 39 | num_elements = len(data) 40 | total = [0] * data.shape[1] 41 | for sample in data: 42 | total = total + sample 43 | mean_features = np.divide(total, num_elements) 44 | 45 | total = [0] * data.shape[1] 46 | for sample in data: 47 | total = total + np.square(sample - mean_features) 48 | 49 | std_features = np.divide(total, num_elements) 50 | 51 | var_features = std_features ** 2 52 | 53 | return mean_features, var_features 54 | 55 | # Normalize data by subtracting mean and dividing by standard deviation 56 | def normalize_data(data): 57 | mean_features, var_features = compute_mean_and_var(data) 58 | std_features = np.sqrt(var_features) 59 | 60 | for index, sample in enumerate(data): 61 | data[index] = np.divide((sample - mean_features), std_features) 62 | 63 | return data 64 | 65 | # Divide dataset based on if sample value on feature index is larger than 66 | # the given threshold 67 | def divide_on_feature(X, feature_i, threshold): 68 | split_func = None 69 | if isinstance(threshold, int) or isinstance(threshold, float): 70 | split_func = lambda sample: sample[feature_i] >= threshold 71 | else: 72 | split_func = lambda sample: sample[feature_i] == threshold 73 | 74 | X_1 = np.array([sample for sample in X if split_func(sample)]) 75 | X_2 = np.array([sample for sample in X if not split_func(sample)]) 76 | 77 | return np.array([X_1, X_2]) 78 | 79 | # Return random subsets (with replacements) of the data 80 | def get_random_subsets(X, y, n_subsets, replacements=True): 81 | n_samples = np.shape(X)[0] 82 | # Concatenate x and y and do a random shuffle 83 | X_y = np.concatenate((X, y.reshape((1, len(y))).T), axis=1) 84 | np.random.shuffle(X_y) 85 | subsets = [] 86 | 87 | # Uses 50% of training samples without replacements 88 | subsample_size = n_samples // 2 89 | if replacements: 90 | subsample_size = n_samples # 100% with replacements 91 | 92 | for _ in range(n_subsets): 93 | idx = np.random.choice(range(n_samples), size=np.shape(range(subsample_size)), replace=replacements) 94 | X = X_y[idx][:, :-1] 95 | y = X_y[idx][:, -1] 96 | subsets.append([X, y]) 97 | return subsets 98 | 99 | # Calculate the entropy of label array y 100 | def calculate_entropy(y): 101 | log2 = lambda x: np.log(x) / np.log(2) 102 | unique_labels = np.unique(y) 103 | entropy = 0 104 | for label in unique_labels: 105 | count = len(y[y == label]) 106 | p = count / len(y) 107 | entropy += -p * log2(p) 108 | return entropy 109 | 110 | # Returns the mean squared error between y_true and y_pred 111 | def mean_squared_error(y_true, y_pred): 112 | mse = np.mean(np.power(y_true - y_pred, 2)) 113 | return mse 114 | 115 | # The sigmoid function 116 | def sigmoid(val): 117 | return np.divide(1, (1 + np.exp(-1*val))) 118 | 119 | # The derivative of the sigmoid function 120 | def sigmoid_gradient(val): 121 | return sigmoid(val) * (1 - sigmoid(val)) 122 | 123 | # Compute the covariance matrix of an array 124 | def compute_cov_mat(data): 125 | # Compute the mean of the data 126 | mean_vec = np.mean(data, axis=0) 127 | 128 | # Compute the covariance matrix 129 | cov_mat = (data - mean_vec).T.dot((data - mean_vec)) / (data.shape[0]-1) 130 | 131 | return cov_mat 132 | 133 | 134 | # Perform PCA dimensionality reduction 135 | def pca(data, exp_var_percentage=95): 136 | 137 | # Compute the covariance matrix 138 | cov_mat 
= compute_cov_mat(data) 139 | 140 | # Compute the eigen values and vectors of the covariance matrix 141 | eig_vals, eig_vecs = np.linalg.eig(cov_mat) 142 | 143 | # Make a list of (eigenvalue, eigenvector) tuples 144 | eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))] 145 | 146 | # Sort the (eigenvalue, eigenvector) tuples from high to low 147 | eig_pairs.sort(key=lambda x: x[0], reverse=True) 148 | 149 | # Only keep a certain number of eigen vectors based on the "explained variance percentage" 150 | # which tells us how much information (variance) can be attributed to each of the principal components 151 | tot = sum(eig_vals) 152 | var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)] 153 | cum_var_exp = np.cumsum(var_exp) 154 | 155 | num_vec_to_keep = 0 156 | 157 | for index, percentage in enumerate(cum_var_exp): 158 | if percentage > exp_var_percentage: 159 | num_vec_to_keep = index + 1 160 | break 161 | 162 | # Compute the projection matrix based on the top eigen vectors 163 | proj_mat = eig_pairs[0][1].reshape(4,1) 164 | for eig_vec_idx in range(1, num_vec_to_keep): 165 | proj_mat = np.hstack((proj_mat, eig_pairs[eig_vec_idx][1].reshape(4,1))) 166 | 167 | # Project the data 168 | pca_data = data.dot(proj_mat) 169 | 170 | return pca_data 171 | 172 | # 1D Gaussian Function 173 | def gaussian_1d(val, mean, standard_dev): 174 | coeff = 1 / (standard_dev * np.sqrt(2 * np.pi)) 175 | exponent = (-1 * (val - mean) ** 2) / (2 * (standard_dev ** 2)) 176 | gauss = coeff * np.exp(exponent) 177 | return gauss 178 | 179 | # 2D Gaussian Function 180 | def gaussian_2d(x_val, y_val, x_mean, y_mean, x_standard_dev, y_standard_dev): 181 | x_gauss = gaussian_1d(x_val, x_mean, x_standard_dev) 182 | y_gauss = gaussian_1d(y_val, y_mean, y_standard_dev) 183 | gauss = x_gauss * y_gauss 184 | return gauss 185 | -------------------------------------------------------------------------------- /plt_helpers.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | 3 | def scatterplot(x_data, y_data, x_label="", y_label="", title=""): 4 | 5 | # Create the plot object 6 | _, ax = plt.subplots() 7 | 8 | # Plot the data, set the size (s), color and transparency (alpha) 9 | # of the points 10 | ax.scatter(x_data, y_data, s = 30, color = '#539caf', alpha = 0.75) 11 | 12 | # Label the axes and provide a title 13 | ax.set_title(title) 14 | ax.set_xlabel(x_label) 15 | ax.set_ylabel(y_label) 16 | 17 | 18 | def lineplot(x_data, y_data, x_label="", y_label="", title=""): 19 | # Create the plot object 20 | _, ax = plt.subplots() 21 | 22 | # Plot the best fit line, set the linewidth (lw), color and 23 | # transparency (alpha) of the line 24 | ax.plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1) 25 | 26 | # Label the axes and provide a title 27 | ax.set_title(title) 28 | ax.set_xlabel(x_label) 29 | ax.set_ylabel(y_label) 30 | 31 | 32 | # Line plot with 2 different y values 33 | def lineplot2y(x_data, y1_data, y2_data, x_label="", y1_color="#539caf", y1_label="", y2_color="#7663b0", y2_label="", title=""): 34 | # Each variable will actually have its own plot object but they 35 | # will be displayed in just one plot 36 | # Create the first plot object and draw the line 37 | _, ax1 = plt.subplots() 38 | ax1.plot(x_data, y1_data, color = y1_color) 39 | # Label axes 40 | ax1.set_ylabel(y1_label, color = y1_color) 41 | ax1.set_xlabel(x_label) 42 | ax1.set_title(title) 43 | 44 | # Create the second plot object, telling matplotlib 
that the two
45 |     # objects have the same x-axis
46 |     ax2 = ax1.twinx()
47 |     ax2.plot(x_data, y2_data, color = y2_color)
48 |     ax2.set_ylabel(y2_label, color = y2_color)
49 |     # Show right frame line
50 |     ax2.spines['right'].set_visible(True)
51 | 
52 | 
53 | def histogram(data, n_bins, cumulative=False, x_label = "", y_label = "", title = ""):
54 |     _, ax = plt.subplots()
55 |     ax.hist(data, bins = n_bins, cumulative = cumulative, color = '#539caf')
56 |     ax.set_ylabel(y_label)
57 |     ax.set_xlabel(x_label)
58 |     ax.set_title(title)
59 | 
60 | 
61 | 
62 | # Overlay 2 histograms to compare them
63 | def overlaid_histogram(data1, data2, n_bins = 0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
64 |     # Set the bounds for the bins so that the two distributions are fairly compared
65 |     max_nbins = 10
66 |     data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
67 |     binwidth = (data_range[1] - data_range[0]) / max_nbins
68 | 
69 | 
70 |     if n_bins == 0:
71 |         bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth)
72 |     else:
73 |         bins = n_bins
74 | 
75 |     # Create the plot
76 |     _, ax = plt.subplots()
77 |     ax.hist(data1, bins = bins, color = data1_color, alpha = 1, label = data1_name)
78 |     ax.hist(data2, bins = bins, color = data2_color, alpha = 0.75, label = data2_name)
79 |     ax.set_ylabel(y_label)
80 |     ax.set_xlabel(x_label)
81 |     ax.set_title(title)
82 |     ax.legend(loc = 'best')
83 | 
84 | 
85 | # Probability Density Function
86 | def densityplot(x_data, density_est, x_label="", y_label="", title=""):
87 |     _, ax = plt.subplots()
88 |     ax.plot(x_data, density_est(x_data), color = '#539caf', lw = 2)
89 |     ax.set_ylabel(y_label)
90 |     ax.set_xlabel(x_label)
91 |     ax.set_title(title)
92 | 
93 | 
94 | 
95 | def barplot(x_data, y_data, error_data, x_label="", y_label="", title=""):
96 |     _, ax = plt.subplots()
97 |     # Draw bars, position them in the center of the tick mark on the x-axis
98 |     ax.bar(x_data, y_data, color = '#539caf', align = 'center')
99 |     # Draw error bars to show standard deviation, set ls to 'none'
100 |     # to remove line between points
101 |     ax.errorbar(x_data, y_data, yerr = error_data, color = '#297083', ls = 'none', lw = 2, capthick = 2)
102 |     ax.set_ylabel(y_label)
103 |     ax.set_xlabel(x_label)
104 |     ax.set_title(title)
105 | 
106 | 
107 | 
108 | def stackedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
109 |     _, ax = plt.subplots()
110 |     # Draw bars, one category at a time
111 |     for i in range(0, len(y_data_list)):
112 |         if i == 0:
113 |             ax.bar(x_data, y_data_list[i], color = colors[i], align = 'center', label = y_data_names[i])
114 |         else:
115 |             # For each category after the first, the bottom of the
116 |             # bar is the cumulative top of all the categories drawn so far
117 |             ax.bar(x_data, y_data_list[i], color = colors[i], bottom = np.sum(y_data_list[:i], axis = 0), align = 'center', label = y_data_names[i])
118 |     ax.set_ylabel(y_label)
119 |     ax.set_xlabel(x_label)
120 |     ax.set_title(title)
121 |     ax.legend(loc = 'upper right')
122 | 
123 | 
124 | 
125 | def groupedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
126 |     _, ax = plt.subplots()
127 |     # Total width for all bars at one x location
128 |     total_width = 0.8
129 |     # Width of each individual bar
130 |     ind_width = total_width / len(y_data_list)
131 |     # This centers each cluster of bars about the x tick mark
132 |     alteration = np.arange(-(total_width/2), total_width/2, ind_width)
133 | 
134 |     # Draw bars, one category at a time
135 |     for i in range(0, len(y_data_list)):
136 |         # Move the bar to the right on the x-axis so it doesn't
137 |         # overlap with previously drawn ones
138 |         ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
139 |     ax.set_ylabel(y_label)
140 |     ax.set_xlabel(x_label)
141 |     ax.set_title(title)
142 |     ax.legend(loc = 'upper right')
143 | 
144 | 
145 | 
146 | 
147 | def boxplot(x_data, y_data, base_color="#539caf", median_color="#297083", x_label="", y_label="", title=""):
148 |     _, ax = plt.subplots()
149 | 
150 |     # Draw boxplots, specifying desired style
151 |     ax.boxplot(y_data
152 |                # patch_artist must be True to control box fill
153 |                , patch_artist = True
154 |                # Properties of median line
155 |                , medianprops = {'color': median_color}
156 |                # Properties of box
157 |                , boxprops = {'color': base_color, 'facecolor': base_color}
158 |                # Properties of whiskers
159 |                , whiskerprops = {'color': base_color}
160 |                # Properties of whisker caps
161 |                , capprops = {'color': base_color})
162 | 
163 |     # By default, the tick label starts at 1 and increments by 1 for
164 |     # each box drawn. This sets the labels to the ones we want
165 |     ax.set_xticklabels(x_data)
166 |     ax.set_ylabel(y_label)
167 |     ax.set_xlabel(x_label)
168 |     ax.set_title(title)
--------------------------------------------------------------------------------
/statistics_helpers.py:
--------------------------------------------------------------------------------
1 | import math
2 | from collections import Counter
3 | 
4 | def mean(x):
5 |     return sum(x) / len(x)
6 | 
7 | def de_mean(x):
8 |     """translate x by subtracting its mean (so the result has mean 0)"""
9 |     x_bar = mean(x)
10 |     return [x_i - x_bar for x_i in x]
11 | 
12 | def dot(v, w):
13 |     """the sum of the component-wise products v_1 * w_1 + ... + v_n * w_n"""
14 |     return sum(v_i * w_i for v_i, w_i in zip(v, w))
15 | 
16 | def sum_of_squares(v):
17 |     """v_1 * v_1 + ... + v_n * v_n"""
18 |     return dot(v, v)
19 | 
20 | def median(v):
21 |     """finds the 'middle-most' value of v"""
22 |     n = len(v)
23 |     sorted_v = sorted(v)
24 |     midpoint = n // 2
25 | 
26 |     if n % 2 == 1:
27 |         # if odd, return the middle value
28 |         return sorted_v[midpoint]
29 |     else:
30 |         # if even, return the average of the middle values
31 |         lo = midpoint - 1
32 |         hi = midpoint
33 |         return (sorted_v[lo] + sorted_v[hi]) / 2
34 | 
35 | def quantile(x, p):
36 |     """returns the pth-percentile value in x"""
37 |     p_index = int(p * len(x))
38 |     return sorted(x)[p_index]
39 | 
40 | def mode(x):
41 |     """returns a list, might be more than one mode"""
42 |     counts = Counter(x)
43 |     max_count = max(counts.values())
44 |     return [x_i for x_i, count in counts.items()
45 |             if count == max_count]
46 | 
47 | 
48 | def data_range(x):
49 |     return max(x) - min(x)
50 | 
51 | def variance(x):
52 |     """assumes x has at least two elements"""
53 |     n = len(x)
54 |     deviations = de_mean(x)
55 |     return sum_of_squares(deviations) / (n - 1)
56 | 
57 | def standard_deviation(x):
58 |     return math.sqrt(variance(x))
59 | 
60 | def interquartile_range(x):
61 |     return quantile(x, 0.75) - quantile(x, 0.25)
62 | 
63 | 
64 | def covariance(x, y):
65 |     n = len(x)
66 |     return dot(de_mean(x), de_mean(y)) / (n - 1)
67 | 
68 | def correlation(x, y):
69 |     stdev_x = standard_deviation(x)
70 |     stdev_y = standard_deviation(y)
71 |     if stdev_x > 0 and stdev_y > 0:
72 |         return covariance(x, y) / stdev_x / stdev_y
73 |     else:
74 |         return 0  # if no variation, correlation is zero
75 | 
--------------------------------------------------------------------------------
/statistics_iris.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from scipy import stats
4 | from tabulate import tabulate
5 | import matplotlib.pyplot as plt
6 | from sklearn.datasets import load_iris
7 | 
8 | def compute_list_median(x):
9 |     x = np.sort(x)
10 | 
11 |     tmp = x.shape[0] // 2
12 | 
13 |     if x.shape[0] % 2:
14 |         median = x[tmp]
15 |     else:
16 |         median = x[tmp - 1] + (x[tmp] - x[tmp - 1]) / 2.
17 | 
18 |     return median
19 | 
20 | # NOTE that this loads as a dictionary
21 | iris_data = load_iris()
22 | 
23 | train_data = np.array(iris_data.data)
24 | train_labels = np.array(iris_data.target)
25 | 
26 | num_features = iris_data.data.shape[1]
27 | unique_labels = np.unique(train_labels)
28 | num_classes = len(unique_labels)
29 | 
30 | 
31 | print("The iris dataset has " + str(num_features) + " features")
32 | print(iris_data.feature_names)
33 | print("The iris dataset has " + str(num_classes) + " classes")
34 | print(iris_data.target_names)
35 | 
36 | 
37 | # Strip for easier indexing
38 | for i in range(len(iris_data.feature_names)):
39 |     iris_data.feature_names[i] = iris_data.feature_names[i].replace(' (cm)','')
40 | 
41 | # Put everything into a Pandas DataFrame
42 | data = pd.DataFrame(data=np.c_[train_data, train_labels], columns=iris_data.feature_names + ['class'])
43 | # print(tabulate(data, headers='keys', tablefmt='psql'))
44 | 
45 | 
46 | 
47 | # Create histogram
48 | hist_feature_name='sepal length'
49 | bin_edges = np.arange(0, data[hist_feature_name].max() + 1, 1)
50 | fig = plt.hist(data[hist_feature_name], bins=bin_edges)
51 | 
52 | plt.ylabel('count')
53 | plt.xlabel(hist_feature_name)
54 | # plt.show()
55 | 
56 | 
57 | 
58 | # Compute the mean sepal length and draw it on the same histogram
59 | sepal_length_values = data['sepal length'].values
60 | mean_sepal_length = sum(i for i in sepal_length_values) / len(sepal_length_values)
61 | mean_sepal_length = np.mean(sepal_length_values)
62 | print("Mean sepal length (cm) = " + str(mean_sepal_length))
63 | 
64 | plt.axvline(mean_sepal_length, color='green', linewidth=2)
65 | 
66 | 
67 | 
68 | # Compute the variance of the sepal length feature and draw it on the same histogram
69 | variance_sepal_length = sum([(i - mean_sepal_length)**2 for i in sepal_length_values]) / (len(sepal_length_values) - 1)
70 | variance_sepal_length = np.var(sepal_length_values, ddof=1)
71 | 
72 | print("Variance of sepal length (cm) = " + str(variance_sepal_length))
73 | 
74 | plt.axvline(mean_sepal_length + variance_sepal_length, color='red', linewidth=2)
75 | plt.axvline(mean_sepal_length - variance_sepal_length, color='red', linewidth=2)
76 | 
77 | 
78 | # Other values
79 | min_sepal_length = np.min(sepal_length_values)
80 | print("Minimum sepal length (cm) = " + str(min_sepal_length))
81 | 
82 | max_sepal_length = np.max(sepal_length_values)
83 | print("Maximum sepal length (cm) = " + str(max_sepal_length))
84 | 
85 | sorted_sepal_length_values = np.sort(sepal_length_values)
86 | percentile_25th = sorted_sepal_length_values[int(round(0.25 * (sorted_sepal_length_values.shape[0] - 1)))]
87 | percentile_75th = sorted_sepal_length_values[int(round(0.75 * (sorted_sepal_length_values.shape[0] - 1)))]
88 | print("25th Percentile sepal length (cm) = " + str(percentile_25th))
89 | print("75th Percentile sepal length (cm) = " + str(percentile_75th))
90 | 
91 | median_sepal_length = compute_list_median(sepal_length_values)
92 | median_sepal_length = np.median(sepal_length_values)
93 | print("Median sepal length (cm) = " + str(median_sepal_length))
94 | 
95 | 
96 | plt.show()
--------------------------------------------------------------------------------